

## VECTOR DSP, FPU EXTEND XTENSA

Tensilica's Configurable Core Extensions Launch Processing Power Into Stratosphere

By Steve Leibson {6/19/00-02}

The development team at Tensilica believes in choice. Oh brother, do they believe in choice. Not content with a *simple* 32-bit RISC core, or even a configurable core comprising optional bits and configurable pieces, Xtensa III (the core's third incarnation) now sports a 32-bit IEEE-754 floating-point unit and a 32 x 32-bit hardware multiplier (supplementing the existing 16 x 16-bit multiplier), keeping the pressure on ARC Cores, one of Tensilica's closest competitors in the configurable-core arena. As Figure 1 shows, Tensilica also threw in the kitchen sink in the form of a high-performance vector DSP unit called Vectra that triples silicon usage and doubles power consumption. Tensilica's hardware and software generators deliver the Xtensa III core design as a synthesizable HDL file along with the additional puzzle pieces needed to plug the core into a system. The extra pieces include synthesis scripts, test suites, bus-functional models, a configured GNU C/C++ compiler, an assembler, a simulator, a debugger, and an OS kit for leading commercial RTOSs.

An ASIC designer configures Xtensa via a set of secure, password-protected Web pages on Tensilica's site. HTML check boxes add or delete predefined function units, while drop-down boxes size memories and set parameters on configurable function units. An extension language based on Verilog called TIE (Tensilica instruction extension language) allows customers to extend the basic core design by defining new operations and register files, and the corresponding instruction mnemonics and binary opcodes for these functions. At any time, an Xtensa user can analyze gate count and projected power usage from yet another Web page. In theory, users can therefore iterate core designs to achieve a balance between cost and performance. Practically speaking, the sub-mm<sup>2</sup> real estate needs and low power consumption of the basic Xtensa core may obviate the need for much fine-tuning at today's commercially economical lithography levels. An ASIC designer looking to eke out maximum performance, however, may well take advantage of Xtensa's ability to accept opcode grafts. Once the core configuration gels, one click starts the machinery in motion, and within an hour or two, Tensilica's server dishes up the resulting HDL and software files. (See MPR 3/8/99-02, "Tensilica CPU Bends to Designers' Will," for a more complete description of the original Xtensa architecture.)

Figure 1. Additions to Tensilica's configurable Xtensa processor core include an FPU, a 32 $\times$ 32-bit multiplier, and a vector DSP unit.

Rollout of Xtensa III, including the Vectra DSP enhancements, lags archrival ARC Cores' announcement of its V3 dual-MAC DSP core enhancements by more than a year. The delay seems to have given Tensilica's architects ample time to plan a counterstrike against ARC's venture into DSP territory. Table 1 compares Vectra's performance favorably with that of TI's 'C62xx, one of the fastest standalone DSPs in the market. (Tensilica prepared this table, so take the figures with a grain of salt.) Vectra attains this performance through four MAC/ALU units linked by multiple 160-bit buses to DSP-specific register files, data RAM, caches, and ROM.

As Figure 2 shows, each of Vectra's four MAC/ALU units contains a multiplier, an add/subtract module, a separate ALU, and a shift/select module. The three 160-bit buses can dump a maximum 60 bytes/cycle into the four MAC/ALU units, although the practical limit is probably closer to 32 bytes/cycle. A fourth 160-bit bus extracts computed results and funnels them to Vectra's 128-bit local memory through a saturation unit at the rate of 16 bytes/cycle. Fortunately, the original design of the Xtensa architecture allows external memory interfaces as wide as 128 bits. Otherwise Vectra would likely starve to death or die of constipation. However, there are substantial additional system costs associated with wide external buses and memory arrays that must be paid to realize Vectra's full performance potential.
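The peak bandwidth figures quoted above are plain bus-width arithmetic. As a minimal sketch in C (the function and constant names are illustrative only and are not part of any Tensilica tool), three 160-bit operand buses work out to 60 bytes/cycle, and a 128-bit memory port to 16 bytes/cycle:

```c
/* Illustrative names; only the widths and counts come from the article. */
enum {
    VECTRA_BUS_BITS      = 160, /* width of each Vectra bus */
    VECTRA_OPERAND_BUSES = 3,   /* buses feeding the four MAC/ALU units */
    VECTRA_LOCAL_MEM_BITS = 128 /* width of Vectra's local memory */
};

/* Peak bytes moved per cycle by `bus_count` buses of `bus_bits` bits each. */
unsigned bytes_per_cycle(unsigned bus_bits, unsigned bus_count)
{
    return bus_bits * bus_count / 8u;
}
```

Calling `bytes_per_cycle(VECTRA_BUS_BITS, VECTRA_OPERAND_BUSES)` gives the peak operand rate, and `bytes_per_cycle(VECTRA_LOCAL_MEM_BITS, 1)` gives the result rate into local memory. These are peak numbers; the sustained rate depends on the memory system, as the paragraph above notes.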

| Processor/Core         | FIR Filter<br>32 x 128<br>(cycles) | FFT 256-Pt<br>complex<br>(cycles) |
|------------------------|------------------------------------|-----------------------------------|
| Xtensa Vectra          | 1,240                              | 2,653                             |
| TI C6203               | 2,061                              | 2,707                             |
| "Typical dual-MAC" DSP | >2,050                             | >5,000                            |
| TI C549                | >4,100                             | >8,000                            |

**Table 1.** Xtensa's Vectra DSP delivers performance comparable to top-of-the-line standalone DSPs. (Source: Tensilica)

Like the basic RISC integer core, Vectra will also be configurable. The ASIC designer can configure the Vectra DSP unit for 8-, 16-, or 24-bit operation, although at introduction in September only the 16-bit version of Vectra will be available. DSP configuration options include vector length, register and memory data precision, and multiplier format. The 8-bit option configures Vectra's ALU for 10- and 20-bit operations with an 8 x 8-bit multiplier. The 16-bit option sets the ALU configuration for 20- and 40-bit operations with a 16 x 16-bit multiplier. The 24-bit option sets the ALU for 32- and 64-bit operations with a 24-bit multiplier. Consequently, Tensilica claims that Vectra is optimized for 16-bit communications applications, 24-bit audio applications, and 8- and 16-bit imaging applications.
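To make the 16-bit configuration concrete, here is a rough sketch in C of one multiply-accumulate step: a 16 x 16-bit product added into a 40-bit accumulator, with saturation applied on the way out. Tensilica has not published Vectra's exact accumulate or saturation semantics in the material described here, so the behavior below is generic fixed-point DSP practice, assumed for illustration; all names are hypothetical.

```c
#include <stdint.h>

/* Limits of a signed 40-bit accumulator (assumed two's complement). */
#define ACC40_MAX INT64_C(0x7FFFFFFFFF)     /*  2^39 - 1 */
#define ACC40_MIN (-INT64_C(0x8000000000))  /* -2^39     */

/* Clamp a value to the signed 40-bit range. */
int64_t sat40(int64_t x)
{
    if (x > ACC40_MAX) return ACC40_MAX;
    if (x < ACC40_MIN) return ACC40_MIN;
    return x;
}

/* Clamp a value to the signed 16-bit range, as a saturation
 * unit might do before a result is written back to memory. */
int16_t sat16(int64_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* One MAC step: acc + a*b, held in a 40-bit accumulator.
 * The 16 x 16-bit product always fits in 32 bits, so the
 * 64-bit intermediate sum cannot overflow. */
int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return sat40(acc + product);
}
```

The eight guard bits above a 32-bit product are what let a 40-bit accumulator absorb a few hundred worst-case products before saturation comes into play, which is why DSP accumulators are typically wider than the multiplier output.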

Responding to requests to push Xtensa beyond integer applications, Tensilica has added a floating-point unit to the list of selectable core options. Xtensa III's FPU adds sixteen 32-bit floating-point registers to the processor core's base register set. The FPU is pipelined and operates at a sustained rate of two floating-point operations/cycle. Floating-point add, subtract, multiply, MAC, and multiply-subtract operations have a four-cycle latency; floating-point loads and conversions have a two-cycle latency; moves and compares have single-cycle latency.

## Performance Always Costs Something

The easy ability to plop a fully supported FPU or a fire-breathing DSP engine like Vectra into an ASIC with one click on a Web page may lull a system architect into forgetting the costs associated with that choice. Relative real-estate consumption jumps substantially with the addition of Xtensa's newest enhancements. Because Xtensa is a synthesizable core, the exact amount of silicon it requires depends on the synthesis tool. In a 0.18-micron process, the base processor core consumes 0.7–1.0mm<sup>2</sup>. Xtensa III's FPU adds approximately 20K–25K gates and something under 1.5mm<sup>2</sup> in a 0.18-micron process, thus more than doubling the size of the basic core. Adding Vectra to the base core more than triples silicon usage to 2.9–3.5mm<sup>2</sup>. With both the FPU and DSP unit added, the core still consumes only 4mm<sup>2</sup>.

Power consumption also jumps with the Xtensa III enhancements. In a 0.18-micron process, adding the Vectra DSP to the base Xtensa core doubles power dissipation from 0.4mW/MHz to 0.8mW/MHz. At the core's maximum projected clock rate of 320MHz (in the 0.18-micron process), core power dissipation doubles from 128mW to 256mW. Memory power dissipation will also increase, because the bandwidth requirements of the DSP functions are much higher than those of the base RISC core, although Vectra's large register file ameliorates the need to access memory for intermediate processing steps.
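The power figures above scale linearly with clock rate, so they are easy to rework for a different target frequency. A minimal sketch in C, using integer microwatts to avoid floating-point rounding; the names are illustrative, and only the mW/MHz and MHz values come from the article:

```c
/* Dissipation figures from the article, in microwatts per MHz. */
enum {
    BASE_CORE_UW_PER_MHZ   = 400, /* 0.4 mW/MHz, base Xtensa core */
    WITH_VECTRA_UW_PER_MHZ = 800  /* 0.8 mW/MHz, core plus Vectra */
};

/* Core power in microwatts at a given clock rate in MHz. */
unsigned long core_power_uw(unsigned long uw_per_mhz, unsigned long clock_mhz)
{
    return uw_per_mhz * clock_mhz;
}
```

At 320MHz, `core_power_uw(BASE_CORE_UW_PER_MHZ, 320)` returns 128,000 µW (128mW) and `core_power_uw(WITH_VECTRA_UW_PER_MHZ, 320)` returns 256,000 µW (256mW), matching the figures quoted. Memory power is excluded here, as it is in the article's numbers.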

Realistically, the extra power and real estate may make no substantive difference in the overall system design. These days, the processor core consumes a decreasing portion of the overall silicon and power budgets. Besides, the processing power boost provided by the Vectra DSP is truly impressive for the silicon and power expended. By comparison, the power dissipation for TI's 'C6201B running at 1.8V and 200MHz is 1.8W, versus the 100mW or so that Vectra adds to Xtensa (excluding memory).

## A Fly in the Configurable Ointment

Raw computational performance is not worth much if the associated software cannot harness the power of the hardware. Here is where Tensilica's advantage and Achilles' heel both reside. Tensilica's hardware and software generators automatically create the core HDL code, synthesis scripts, test-bench files, and development tools needed to support the customized Xtensa core. The Xtensa hardware design and software are very closely coupled, which is the unavoidable price for extensive configurability. Tensilica's advantage is that system designers receive all the necessary support tools along with the core design through one contract with one vendor. The hidden disadvantage of this arrangement is that there is little or no hope of independent third-party tool support for configurable cores like Xtensa, precisely because of the extremely close linkage between the configurable hardware and the associated software. The entangled nature of this arrangement and the resulting lack of competition in development tools mean that Xtensa may not have best-of-class software support unless Tensilica provides it.

Take for example the GNU C/C++ compiler produced by Tensilica's software generator. Recently released scores from EEMBC's telecom benchmarks indicate that the GNU compiler may be somewhat off the pace with respect to the speed of the resulting compiled code (see MDR 5/01/00-2, "EEMBC Releases First Benchmarks"). The first release of the EEMBC benchmarks shows NEC's VR5000 outperforming IDT's RC64575 by 30% at the same clock frequency. Both processors are implementations of the MIPS R5000 core. The biggest difference between these two processors in the EEMBC tests seems to be the compiler used to generate the test code. NEC used the Green Hills Multi2000 compiler, and IDT used a GNU compiler. Vectra certainly exacerbates this problem by dropping a large and complex DSP into the equation, but all configurable cores have the same problem: single-vendor tool support has potential disadvantages.

The only way for Tensilica (or any other configurable-core vendor for that matter) to address this problem is to partner with a leading commercial compiler vendor. That's precisely what ARC Cores did last year when it purchased compiler vendor Metaware (see *MDR* 4/10/00-3, "ARC Cores Builds IP Library"). Considering the licensing fees associated with configurable cores and the competitive postures of ARC and Tensilica, such a licensing deal is certainly not out of the question, if only for the competitive advantage the arrangement would provide Tensilica.

## Price & Availability

Full release of the Xtensa III components is scheduled for September. As with all IP processor cores, pricing is negotiable. More information about Xtensa is available on Tensilica's Web site at *www.tensilica.com*.

On the plus side of the equation, no company is as motivated to exploit the full potential of the Xtensa core (including the Vectra DSP) as Tensilica is. The core vendor has the most to gain by boosting the core performance beyond the reach of the competition. Furthermore, some core enhancements, such as Xtensa's FPU, are easier to support than others. Tensilica added C-level support for the FPU by simply mapping the hardware floating-point unit's features into C's "float" data type, a natural extension. As for the Vectra DSP, Tensilica has implemented automatic vectorization so that the compiler recognizes simple nested loops written in C that operate on arrays and produces efficient code that exploits the vector-oriented features of the DSP. Tensilica will also supply verified source libraries for common DSP algorithms, including FFTs, FIR filters, and Viterbi decoding.
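For a sense of what "simple nested loops written in C that operate on arrays" looks like, consider the FIR filter sketched below. It is ordinary portable C with no Xtensa-specific intrinsics; the function name is hypothetical, and whether Tensilica's compiler would vectorize this exact form is an assumption, not something the company has confirmed. The loop shape (fixed trip counts, unit-stride array accesses, a single accumulator) is the kind vectorizing compilers generally handle well.

```c
#include <stddef.h>
#include <stdint.h>

/* FIR filter over 16-bit samples.
 *   x: input, must hold n_out + n_taps - 1 samples
 *   h: n_taps filter coefficients
 *   y: n_out outputs, y[n] = sum over k of h[k] * x[n + n_taps - 1 - k]
 * A 64-bit accumulator is used so the sum cannot overflow for any
 * practical tap count. */
void fir16(const int16_t *x, const int16_t *h, int64_t *y,
           size_t n_out, size_t n_taps)
{
    for (size_t n = 0; n < n_out; n++) {
        int64_t acc = 0;
        for (size_t k = 0; k < n_taps; k++) {
            acc += (int64_t)h[k] * (int64_t)x[n + n_taps - 1 - k];
        }
        y[n] = acc;
    }
}
```

On a four-MAC unit, a vectorizer could in principle compute several outputs or several taps per cycle from this source; on a compiler without vectorization, the same code simply runs one multiply-accumulate at a time.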



Figure 2. The configurable Vectra DSP unit, shown here in its 16-bit configuration, has four MAC/ALU function units capable of delivering four MACs or eight additions per clock cycle. The MAC/ALU function units connect to registers and memory via four 160-bit buses.

Tensilica's Xtensa III is another shot in the war among processor-core vendors, albeit a big one. Expect a response from the other core vendors, especially ARC Cores. (In fact, see the article on ARC's latest announcement in this issue of MPR.) Each announcement spurs the creation of ever more powerful and complex offerings by the other competitors. Although not yet moving at Internet speed, the healthy competition among core vendors ensures that ASIC and system designers will continue to have new and interesting processor choices for years to come.


© MICRODESIGN RESOURCES 🔷 JUNE 19, 2000 🔷 MICROPROCESSOR REPORT