# Lexra Adds DSP Extensions: LX5280 Core With Radiax Instructions Offers High Performance

#### by Krishna Yarlagadda

Lexra is expanding its portfolio of embedded options by adding digital-signal-processing (DSP) capabilities to its latest MIPS-like core. Lexra's DSP instruction set, dubbed Radiax, will appear first in the LX5280, which is designed for low-cost, low-power system-on-a-chip (SOC) devices.

Potential applications include third-generation cell phones, DSL modems, wireless base stations, and voice-over-IP (VoIP) gateways. Lexra has signed up 15 licensees for its cores but is not ready to disclose their names. (Lexra is not a MIPS licensee itself and has skirmished with Mips Technologies over various legal issues. Lexra's cores execute virtually all MIPS instructions except unaligned loads and stores.)

Radiax is a set of 36 new instructions that enable realtime DSP. The standard MIPS instruction set is well suited for control code, and the Radiax instructions are worthwhile additions. Lexra plans to license Radiax to other MIPS licensees royalty-free.

Adding extensions to existing processor architectures has become a popular trend with desktop CPUs, and it's now moving into the embedded-processor arena. Other embedded processors, such as Hitachi's SH-DSP (see MPR 12/4/95, p. 10), ARC Cores' v3.0 (see MPR 5/31/99, p. 16), and the ARM9E (see MPR 6/21/99, p. 11), have also adopted multimedia extensions.

Those processors, however, dispatch only one instruction per cycle; the LX5280 dispatches two. Lexra's superscalar design is similar to the approach taken by ZSP with its 16401, but it's distinctly different from the VLIW approach used by Texas Instruments in the C6000 family (see MPR 2/17/97, p. 14) and by Lucent and Motorola in their StarCore architecture (see MPR 5/10/99, p. 13). Superscalar designs are more common in desktop CPUs, because the cost and power-consumption penalties for larger chips are not as severe.

Figure 1. The LX5280 adds Radiax instructions and a new pipe to Lexra's earlier uniscalar LX4180.

### Superscalar Architecture Aids Parallelism

As Figure 1 shows, the LX5280 adds a second pipe to the earlier LX4180 to increase instruction-level parallelism. This new pipe allows the processor to execute memory operations and DSP instructions simultaneously, thus improving throughput in DSP inner loops. The LX5280's superscalar RALU (register file and ALU) includes an eight-port (four read/four write) general-purpose register file, approximately doubling the resources of the scalar LX4180.

The LX5280 fetches two instructions every cycle and tries to dispatch both after checking for resource conflicts and data dependencies. Implementing these checks in hardware adds complexity, which can increase cycle time and power dissipation. But unlike VLIW machines, which place this burden on the programmer or compiler, the superscalar LX5280 is easier to program in assembly language, which is often necessary to meet DSP performance goals.

The load/store pipe (pipe A) executes data-access and all other standard instructions except multiplies and divides. Pipe B executes MACs, divides, and other standard instructions. The datapath is 32 bits wide, but since few DSP algorithms need more than 16 bits of precision, the LX5280 implements two 16-bit MAC and ALU operations in a single-instruction, multiple-data (SIMD) fashion.
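The pairing check described above can be illustrated with a small model. This is only a sketch under stated assumptions: the instruction classes, the `PIPES_FOR_CLASS` table, and the `can_dual_issue` function are hypothetical names, and the real LX5280 dispatch rules are likely more detailed than the resource and register checks shown here.

```python
# Hypothetical instruction classes and the pipes that may execute them,
# following the article's description of pipe A and pipe B.
PIPES_FOR_CLASS = {
    "load": {"A"},
    "store": {"A"},
    "alu": {"A", "B"},
    "mul": {"B"},
    "div": {"B"},
    "mac": {"B"},
}


def can_dual_issue(first, second):
    """Return True if two adjacent instructions could issue together.

    Each instruction is a dict with keys 'cls' (instruction class),
    'dst' (set of registers written) and 'src' (set of registers read).
    """
    # Resource check: the pair needs one instruction in each pipe.
    pipes_first = PIPES_FOR_CLASS[first["cls"]]
    pipes_second = PIPES_FOR_CLASS[second["cls"]]
    pipes_ok = any(a != b for a in pipes_first for b in pipes_second)

    # Dependency check: the second instruction must not read or
    # overwrite a register that the first one writes.
    read_after_write = first["dst"] & second["src"]
    write_after_write = first["dst"] & second["dst"]
    return pipes_ok and not read_after_write and not write_after_write
```

In this model a load paired with an independent MAC can issue together, while two MACs, or a MAC that consumes the register being loaded, cannot.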

MAC-intensive DSP algorithms, like convolution, will use pipe A to load a pair of operands into a register while executing two MACs in pipe B on earlier data. Each pipe has an ALU and a nearly independent control section. Pipe A supports pointer postmodification and circular buffer addressing, including start (cbs0–cbs2) and end (cbe0–cbe2) registers. Pipe A also handles coprocessor operations and sequencing instructions (branches and jumps). Both pipes can execute all ALU operations, so there's plenty of horsepower for computationally intensive programs. The custom-engine interface is available for special operations in pipe A.
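Pointer postmodification with circular buffering can be sketched as follows. The function name `post_modify` and its parameters are hypothetical, and the wrap rule (an inclusive start and end address with modulo wrapping) is an assumption; the article names the cbs and cbe registers but does not spell out the exact hardware behavior.

```python
def post_modify(pointer, stride, cbs=None, cbe=None):
    """Return (address_used, updated_pointer) for one post-modified access.

    With cbs and cbe given, the updated pointer wraps within [cbs, cbe],
    both inclusive (an assumption of this sketch).
    """
    address = pointer              # the access uses the old pointer value
    updated = pointer + stride     # the pointer is modified afterward
    if cbs is not None and cbe is not None:
        size = cbe - cbs + 1
        # Python's % always returns a non-negative result for a positive
        # modulus, so negative strides wrap correctly as well.
        updated = cbs + (updated - cbs) % size
    return address, updated
```

Stepping a pointer through a 16-byte buffer with a stride of 4 would return it to the start address after the fourth access, which is the behavior a FIR delay line needs.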

Decoupling register loads from the MAC allows loop unrolling and takes advantage of the 32 general-purpose registers for temporary storage. An undesirable consequence of loop unrolling, however, is code expansion. Lexra's design is not as efficient as specialized 32-bit or conventional 16-bit DSP architectures, such as Analog Devices' Sharc or TI's C54X, which perform arithmetic on memory-based operands. But the LX5280 implements MIPS16-compatible code compression (see MPR 10/28/96, p. 40), which Lexra believes can offset the expansion of loop code. The CPU issues compressed instructions only to pipe A.

### Radiax Instructions Boost Efficiency

The LX5280 has MAC instructions, dual 16-bit versions of its MIPS-I and specialized ALU operations, post-modified pointers with circular-buffer support, zero-overhead loop counters, conditional moves, and prioritized low-overhead interrupts, and it supports load/store of two registers with one instruction. For high-fidelity DSP arithmetic, it also supports guard bits, saturation arithmetic, and rounding.

These features are encoded in a new single I-format opcode called the Lexop, which is identified by the six most significant bits of the opcode (0x1F). Lexops use the MIPS special-opcode R-format. Using the six-bit suboperation field of the R-format, there is room for 64 new instructions. The Lexop identifier can be changed using six pins on the LX5280 core, avoiding possible incompatibility with future extensions to the MIPS ISA by Mips itself. Table 1 shows all of the Radiax instructions.
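The encoding can be sketched with the conventional MIPS R-format field layout (6-bit major opcode, three 5-bit register fields, a 5-bit shift field, and a 6-bit function field). The function names and the `subop` value used below are hypothetical; the article does not list the suboperation numbers or say how Radiax uses the register fields to name accumulators.

```python
LEXOP_MAJOR = 0x1F  # default Lexop identifier given in the article


def encode_lexop(rs, rt, rd, shamt, subop, major=LEXOP_MAJOR):
    """Pack R-format fields into a 32-bit instruction word."""
    fields = (("major", major, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("subop", subop, 6))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return ((major << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | subop)


def decode_lexop(word):
    """Unpack a 32-bit word into its R-format fields."""
    return {
        "major": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "subop": word & 0x3F,
    }
```

The `major` parameter mirrors the six configuration pins: changing it moves all 64 suboperation slots to a different major opcode without touching the other fields.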

As Figure 2 shows, the datapath consists of two 16-bit MAC units, a 40-bit add/subtract/dual-round unit with optional saturate, and a divide unit. Each MAC unit has a $16 \times 16$-bit multiplier, a 32-bit product register (not visible to programmers), and four 40-bit accumulators with optional saturation. MAC 1 targets one of the four accumulators (mNH), while MAC 0 targets mNL.
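A dual 16-bit multiply-accumulate on packed 32-bit registers can be modeled roughly as below. The helper names are hypothetical, and the mapping of upper halfwords to MAC 1 (mNH) and lower halfwords to MAC 0 (mNL) is an assumption; the sketch also ignores the 40-bit accumulator width and saturation.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def dual_mac(acc_high, acc_low, rs, rt):
    """Model one dual 16-bit multiply-accumulate on 32-bit registers.

    Assumption: MAC 1 multiplies the upper halfwords into the mNH
    accumulator and MAC 0 multiplies the lower halfwords into mNL.
    """
    acc_high += to_signed16(rs >> 16) * to_signed16(rt >> 16)
    acc_low += to_signed16(rs) * to_signed16(rt)
    return acc_high, acc_low
```

Packing two 16-bit samples per register is what lets a single load in pipe A feed both multipliers in pipe B.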

If both MACs operate in parallel, the accumulator-register pair represented by mN is targeted. Each MAC takes three cycles and is fully pipelined to deliver a new product every cycle. Thus, there are two delay slots for multiply or multiply-accumulate. For example:

| Cycle | Instruction       | Comment             |
|-------|-------------------|---------------------|
| 1     | MADD2 m1H, r2, r3 |                     |
| 2     | delay slot 1      | new m1H unavailable |
| 3     | delay slot 2      | new m1H unavailable |
| 4     | MFA r3, m1H       | new m1H available   |

The accumulator (m1H) can be referenced by an MFA instruction (move from accumulator) immediately following the MADD2 instruction, but it will incur a two-cycle stall until the m1H is ready in cycle 4. In DSP algorithms such as FIR filters, these delay slots do not impact performance, because many products are accumulated before being stored.
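The stall arithmetic in this example can be written as a one-line rule. The constant and function names below are hypothetical; the three-cycle latency is the figure given in the article.

```python
MAC_LATENCY = 3  # cycles from MAC issue until the accumulator is readable


def mfa_stall_cycles(mac_issue_cycle, mfa_issue_cycle):
    """Return how many cycles an MFA must wait for a prior MAC result."""
    ready_cycle = mac_issue_cycle + MAC_LATENCY
    return max(0, ready_cycle - mfa_issue_cycle)
```

With the MADD2 issued in cycle 1, an MFA presented in cycle 2 waits two cycles, while one scheduled in cycle 4 does not stall at all.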

The CPU can execute one $32 \times 32$-bit multiply in a single MAC unit with a five-cycle latency. By using both datapaths, two $32 \times 32$-bit multiplies can be initiated every four cycles. Complex multiplies (16-bit real, 16-bit imaginary) use both MAC units with a three-cycle latency. A new complex multiply can be initiated every two cycles.
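The two-cycle initiation interval for complex multiplies is consistent with a simple count of the products involved. For operands $a + jb$ and $c + jd$:

$$
(a + jb)(c + jd) = (ac - bd) + j(ad + bc)
$$

The result needs four $16 \times 16$-bit products. If each of the two MAC units starts one product per cycle, the four products occupy two issue cycles. This accounting is an inference from the figures above, not a description of how Lexra actually schedules the operation.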

Compared with typical MAC units, the LX5280's MAC includes several useful features, such as guard bits, fractional arithmetic, saturation, rounding, and output scaling. These features are selected by opcodes and/or mode bits in the MMD (MAC mode) control register.

| Instruction | Operands | Description |
|---|---|---|
| **Dual 16-bit MAC-Related Operations** | | |
| MULTA2 | {mD, mDh, mDl}, rS, rT | 16-bit dual multiply |
| MULNA2 | {mD, mDh, mDl}, rS, rT | 16-bit dual multiply and negate |
| CMULTA | mD, rS, rT | Complex mult (16b real, 16b imag) |
| MADDA2[.S] | {mD, mDh, mDl}, rS, rT | 16-bit dual multiply-add w/saturate |
| MSUBA2[.S] | {mD, mDh, mDl}, rS, rT | 16-bit dual multiply-sub w/saturate |
| RNDA2 | {mT, mTh, mTl} [,n] | Dual rnd to 16-bits w/opt right shift |
| MFA2 | rD, mT [,n] | 16-bit dual move from accumulator with optional right shift |
| **32-bit MAC-Related Operations** | | |
| MULTA(U) | mD, rS, rT | 32-bit multiply (signed/unsigned) |
| MADDA[.S] | mD, rS, rT | 32-bit multiply-add w/saturate |
| MSUBA[.S] | mD, rS, rT | 32-bit multiply-sub w/saturate |
| DIVA(U) | mD, rS, rT | 32-bit divide (signed/unsigned) |
| ADDMA[.S] | mD{h,l}, mS{h,l}, mT{h,l} | Add accumulators, optional saturate |
| SUBMA[.S] | mD{h,l}, mS{h,l}, mT{h,l} | Sub accumulators, optional saturate |
| MFA | rD, {mTh, mTl} [,n] | 32b move from accum w/opt rt shift |
| MTA[.G] | rS, {mD, mDh, mDl} | Move to accumulator (or guard bits) |
| **Vector Addressing Operations** | | |
| LT | rT, displacement (base) | Load twinword |
| ST | rT, displacement (base) | Store twinword |
| LopP[.Cn] | rT, (pointer) stride | Load w/ptr incr, opt circular buffer |
| Lop = {LB, LBU, LH, LHU, LW, LT} | | (byte, halfword, twin; sign/unsign) |
| SopP[.Cn] | rT, (pointer) stride | Store w/ptr incr, opt circular buffer |
| Sop = {SB, SH, SW, ST} | | (byte, halfword, twin) |
| **Extensions to MIPS ALU Operations** | | |
| SLLV2 | rD, rT, rS | Dual left logical shifts |
| SRLV2 | rD, rT, rS | Dual right logical shifts |
| SRAV2 | rD, rT, rS | Dual right arithmetic shifts |
| ADDR[.S] | rD, rS, rT | Add, optional saturation |
| SUBR[.S] | rD, rS, rT | Subtract, optional saturation |
| ADDR2[.S] | rD, rS, rT | Dual add, optional saturation |
| SUBR2[.S] | rD, rS, rT | Dual subtract, optional saturation |
| SLTR2 | rD, rS, rT | Dual set on less than |
| **New ALU Operations** | | |
| MIN | rD, rS, rT | Minimum |
| MAX | rD, rS, rT | Maximum |
| MIN2 | rD, rS, rT | Dual minimum |
| MAX2 | rD, rS, rT | Dual maximum |
| ABSR[.S] | rD, rT | Absolute, optional saturation |
| ABSR2[.S] | rD, rT | Dual absolute, optional saturation |
| MUX2{XX} | rD, rS, rT | Dual MUX |
| XX = [.HH], [.HL], [.LH], [.LL] | | (select rS or rT halfwords into rD) |
| NORM | rD, rT | Normalize |
| BITREV | rD, rT, rS | Bit reverse rT, logical right shift by rS |
| **Conditional Operations** | | |
| CMVEQZ[.H][.L] | rD, rS, rT | Conditional move (based on rT = 0) |
| CMVNEZ[.H][.L] | rD, rS, rT | Conditional move (based on rT ≠ 0) |

Table 1. Radiax instructions add powerful DSP features to the MIPS instruction set.

Accumulation has 40 bits of precision with eight guard bits for overflow protection. Fractional arithmetic is implemented by the program's interpretation of the 16-, 32-, or 40-bit quantities, and it's controlled by a bit in the MMD register. When fractional mode is selected, the MAC units shift the result of any Radiax multiply left by one bit, to maintain the alignment of the implied radix point. Since -1 can be represented in fractional format but +1 cannot, the dual MAC detects when both operands of a multiply are equal to -1 and generates the approximately correct product: zero sign bit (representing a positive result) and all ones for the remaining bits.
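The fractional-mode behavior can be sketched for 16-bit operands as follows. The names are hypothetical, and the sketch assumes the common convention in which a 16-bit fraction has one sign bit and 15 fraction bits, with the product held as a 32-bit fraction.

```python
Q15_MIN = -0x8000       # represents -1.0 in 16-bit fractional format
Q31_MAX = 0x7FFFFFFF    # zero sign bit, all ones below: just under +1.0


def fractional_multiply(a, b):
    """Multiply two signed 16-bit fractional values into a 32-bit result."""
    if a == Q15_MIN and b == Q15_MIN:
        # (-1) * (-1) = +1 cannot be represented, so return the
        # closest representable value instead.
        return Q31_MAX
    # The one-bit left shift realigns the implied radix point.
    return (a * b) << 1
```

For example, 0.5 times 0.5 is `0x4000 * 0x4000`, which after the shift gives `0x20000000`, the 32-bit fractional encoding of 0.25.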

The accumulation units can add the product to, or subtract it from, one of the eight accumulator registers. They can perform this operation with or without saturation. The LX5280 instructions include a multiply-add and a multiply-subtract, with and without saturation. A bit in the MMD register determines whether saturation is performed on the full 40 bits or on only 32. The latter capability is useful for emulating the results from architectures without guard bits.
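The difference between 40-bit and 32-bit saturation can be seen in a small model. The function names and the `saturate_bits` parameter are hypothetical stand-ins for the opcode and MMD settings, and wrapping to 40 bits when saturation is off is an assumption of this sketch.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed integer of the given width."""
    high = (1 << (bits - 1)) - 1
    low = -(1 << (bits - 1))
    return max(low, min(high, value))


def accumulate(acc, product, subtract=False, saturate_bits=None):
    """Add or subtract a product, optionally saturating at 40 or 32 bits.

    When saturate_bits is None the result wraps to a signed 40-bit
    value, which is an assumption about the non-saturating behavior.
    """
    result = acc - product if subtract else acc + product
    if saturate_bits is not None:
        return saturate(result, saturate_bits)
    result &= (1 << 40) - 1
    return result - (1 << 40) if result & (1 << 39) else result
```

Adding 1 to `0x7FFFFFFF` clamps at `0x7FFFFFFF` with 32-bit saturation but yields `0x80000000` with 40-bit saturation, because the guard bits absorb the overflow.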

A round instruction works on one or two accumulator registers to reduce precision prior to storage. The rounding mode is selectable in the MMD register. The output scaler is used to right-shift (scale) the accumulator when it is transferred to the general register file.
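Rounding and output scaling might look like the following in software. Both functions are hypothetical, and round-half-up is only an assumed mode; the article says the rounding mode is selectable without listing the choices. The sketch treats the accumulator as a 32-bit fraction whose upper 16 bits are kept.

```python
def round_to_16(acc, shift=0):
    """Right-shift acc by `shift`, then round to its upper 16 bits.

    Round-half-up is assumed here; the hardware's selectable modes
    are not described in the article.
    """
    scaled = acc >> shift          # arithmetic shift on Python ints
    return (scaled + 0x8000) >> 16


def move_from_accumulator(acc, shift=0):
    """Model the output scaler: right-shift, then keep the low 32 bits."""
    return (acc >> shift) & 0xFFFFFFFF
```

Rounding before storage discards the low-order bits with less bias than plain truncation, which is why DSP code usually rounds once at the end of an accumulation rather than after every product.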

The dual MAC units also execute the 32-bit MULT(U) and DIV(U) instructions in the MIPS ISA. For MULT(U), one of the 16-bit MAC datapaths works iteratively to produce the 64-bit product in five cycles. The MMD mode bits have no effect on the operation of the standard MIPS instructions. For DIV(U), the dual MAC units rely on a separate divider, which completes the 32-bit operation in 19 cycles. The 32-bit quotient is loaded into the lower 32 bits of m0L, m1L, m2L, or m3L, and the remainder is loaded into the upper 32 bits of the other accumulator in the target pair. The divide operations have no special support for fractional arithmetic.

The LX5280 has a seven-stage pipeline, two more than the conventional MIPS pipeline, to ease overall timing of the core. One stage was added for decoding multiple instructions and the other to simplify interfacing with different cache sizes and memory-cell types. This helps to boost the clock frequency, but it also increases the chip area, power dissipation, and branch-prediction penalty.

Figure 2. LX5280 has two MAC units and a division accelerator.

| Algorithm                               | Cycles                                  | Example (200 MHz)           |
|-----------------------------------------|-----------------------------------------|-----------------------------|
| Dot product (2 vectors, length N)       | 0.5N + 8                                | 0.68 µsec (N = 256)         |
| Real FIR (N coefficients, M samples)    | (0.5N + 3) M                            | 9.5 µsec (N = 32, M = 100)  |
| Complex FIR (N coefficients, M samples) | (2N + 4) M                              | 34 µsec (N = 32, M = 100)   |
| Complex FFT (N-point, radix-2)          | log<sub>2</sub>N (3N + 5) + 0.5N + 20   | 156 µsec (N = 1,024)        |

Table 2. Key DSP loops show good performance in simulation.

### Improved Performance, Better Roadmap

Lexra expects a performance-optimized LX5280 to run at 200 MHz in a typical 0.18-micron IC process. The company expects the core area (excluding memories) to be 6 mm<sup>2</sup> and the power dissipation to be 225 mW at 1.8 V. A lower-power chip could run at 50 MHz and dissipate 20 mW at 1.0 V in the same process.

Table 2 shows the cycle count and execution time of some key DSP loops in simulation. These results are impressive compared with other embedded-processor cores such as the ARM9E, and they are comparable to those of leading DSPs. Lexra expects the synthesizable RTL, along with a configuration tool, synthesis and timing scripts, testability scripts, and a regression suite, to be available in November. A layout version with a GDS2 database will be ready next February.
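The Table 2 execution times follow directly from the cycle formulas and the 200-MHz clock, as this short check shows. The function names are hypothetical; the formulas and parameter values are those listed in the table.

```python
import math

CLOCK_MHZ = 200  # clock rate assumed in Table 2


def dot_product_cycles(n):
    """Dot product of two length-n vectors."""
    return 0.5 * n + 8


def real_fir_cycles(n, m):
    """Real FIR with n coefficients over m samples."""
    return (0.5 * n + 3) * m


def complex_fir_cycles(n, m):
    """Complex FIR with n coefficients over m samples."""
    return (2 * n + 4) * m


def complex_fft_cycles(n):
    """Radix-2 complex FFT with n points."""
    return math.log2(n) * (3 * n + 5) + 0.5 * n + 20


def microseconds(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a cycle count to execution time in microseconds."""
    return cycles / clock_mhz
```

For instance, `microseconds(dot_product_cycles(256))` evaluates 136 cycles at 200 MHz, matching the 0.68 µsec entry, and the 1,024-point FFT comes to 31,302 cycles, or about 156 µsec.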

Lexra recently announced the LX4280 (see MPR 8/2/99, p. 13), a superscalar processor similar to the LX5280 but without DSP extensions. Lexra may also decide to implement a scalar version of the LX5280 with Radiax for cost-sensitive markets. Another possibility is a 64-bit datapath version of Radiax, extending the current SIMD width for higher performance. It could use either scalar or superscalar pipelines.

Green Hills' Multi will be the first third-party development tool to support Radiax with an assembler, optimizing C/C++ compiler, source-level debugger, execution profiler, and version-control tools. California Advanced Software Tools is developing a cycle-accurate instruction-set simulator. Embedded Performance (EPI) will also support the LX5280 with a source-level debugger and EJTAG emulator.

The LX5280 with Radiax blurs the line between DSPs and microcontrollers. If Lexra can deliver on its promises, the LX5280 will make a strong impact on the embedded market.

Krishna Yarlagadda is a leading figure in the DSP community, having founded three DSP companies: ZSP, AVAJ, and Hellosoft. He was also a key contributor to the SuperSparc and UltraSparc programs at Sun. Krishna contributes regularly to industry DSP journals and conferences.