

## NEC DECANTS MERLOT

New MP98 Media Processor Hits 1 GOPS at 1W

By Peter N. Glaskowsky {3/20/00-01}

NEC has described a single-chip four-way multiprocessor for multimedia applications, marking the company's entry into the increasingly crowded media-processor market. The company's paper at February's International Solid-State Circuits Conference (ISSCC) and information on NEC's Web site describe the unique architecture of the new MP98 media processor (formerly codenamed Merlot) as well as other details that are worth a closer look.

Key to the new architecture is a control-flow model NEC dubs FOPE, for fork-once parallel execution. In the FOPE model, a thread has exactly one opportunity to issue an explicit fork, enabling parallel execution of multiple instances of the thread. For example, the first time through a loop, a thread may fork off a second instance to handle the second iteration of the loop. The first instance may not issue another fork, but the second instance may, if it knows the loop will be executed a third time—and so on. Threads thus created may be executed speculatively, and their results discarded if the thread is aborted. NEC says this model greatly reduces the complexity of the control-flow circuitry compared with more general execution models, allowing a complete hardware solution in a relatively simple device.
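The fork-once rule described above can be illustrated with a short Python sketch. This is a toy model under stated assumptions, not NEC's hardware or programming interface: the `Thread` class, its `fork` method, and the `run_loop` helper are hypothetical names, speculation and aborts are left out, and the loop is assumed to run at least once.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """One thread instance under the fork-once rule (illustrative model only)."""
    iteration: int
    has_forked: bool = False

    def fork(self) -> "Thread":
        # A thread gets exactly one opportunity to fork.
        if self.has_forked:
            raise RuntimeError("fork-once violation: thread already forked")
        self.has_forked = True
        # The child handles the next loop iteration.
        return Thread(iteration=self.iteration + 1)


def run_loop(trip_count: int) -> list[int]:
    """Chain one thread per iteration; each is forked by its predecessor.

    Assumes trip_count >= 1 (hypothetical helper, not an NEC API).
    """
    threads = [Thread(iteration=0)]
    while threads[-1].iteration + 1 < trip_count:
        # Only the newest thread still holds an unused fork.
        threads.append(threads[-1].fork())
    return [t.iteration for t in threads]
```

Calling `run_loop(4)` builds a chain of four threads, one per iteration, in which no thread forks more than once. A second call to `fork` on the same `Thread` raises an error, which is the restriction NEC credits with simplifying the control-flow circuitry.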

The MP98, the first implementation of the FOPE model, is a 125MHz media processor with four processor cores, a large shared register file, a 64K instruction cache, and an eight-bank 64K data cache. Each of the four cores can issue two instructions per clock, allowing NEC to claim a peak execution rate of 1 GOPS (billion operations per second) on 32-bit data. Execution of each thread is done in program order, avoiding the additional complexity of out-of-order execution.

Figure 1. NEC's MP98 features four superscalar processing elements, caches, and external interfaces for SDRAM and PCI.

## Price & Availability

NEC has not announced products based on the MP98 design. For more information, visit NEC's Web site at www.labs.nec.co.jp/MP98/.

Both pipelines in each core have 32-bit ALUs that can operate in a two-wide 16-bit SIMD configuration. The A pipe adds a 32-bit divide unit as well as a 32-bit multiplier that also supports a two-wide 16-bit SIMD mode. The A-pipe multiply unit can also perform multiply-add operations. Including the effect of the SIMD and MAC functions, the MP98's peak performance is up to 3 GOPS on 16-bit data.
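The 1-GOPS and 3-GOPS peak figures quoted for the MP98 can be reproduced with simple arithmetic. The sketch below takes the clock rate, core count, and issue width from the text. The per-pipe breakdown for 16-bit data is an assumption (NEC's exact accounting is not given here): a two-lane SIMD multiply-add in the A pipe is counted as four operations, and a two-lane SIMD ALU operation in the B pipe as two.

```python
CLOCK_HZ = 125_000_000  # 125MHz, from the article
CORES = 4               # four processor cores, from the article
ISSUE_WIDTH = 2         # instructions per clock per core, from the article
SIMD_LANES = 2          # two-wide 16-bit SIMD, from the article


def peak_gops(ops_per_core_per_clock: int) -> float:
    """Peak rate in billions of operations per second."""
    return CLOCK_HZ * CORES * ops_per_core_per_clock / 1e9


# 32-bit data: one operation per issued instruction.
gops_32 = peak_gops(ISSUE_WIDTH)

# 16-bit data (assumed breakdown, not stated by NEC):
a_pipe_ops = SIMD_LANES * 2  # multiply + add in each lane
b_pipe_ops = SIMD_LANES * 1  # one ALU operation in each lane
gops_16 = peak_gops(a_pipe_ops + b_pipe_ops)

print(f"32-bit peak: {gops_32} GOPS, 16-bit peak: {gops_16} GOPS")
```

Under these assumptions each core performs six 16-bit operations per clock, which matches the 3-GOPS claim.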

Figure 1 shows a block diagram of the MP98, which also includes a memory-management unit (unusual among media processors, since most are designed to act as coprocessors for more conventional CPUs), a two-channel SDRAM controller, and a 32-bit PCI interface.

The MP98's register file incorporates specific adaptations for the FOPE model. Because a fork operation creates a new thread that inherits the register values from its parent, the chip includes special logic to duplicate the register values from one processing element (PE) to another in a single cycle. This duplication can occur in only one direction, from a lower-numbered PE to the next-higher-numbered PE.

Figure 2. The NEC MP98 media processor contains about 14 million transistors. The chip is about 110mm<sup>2</sup> in size and built in a 0.15-micron process.

Four independent register sets are provided, one for each PE. Each PE has 32 architectural registers, each 32 bits long, but each PE can access registers from the other PEs. To support full-speed read access, each of the four sets has four read ports and two write ports. The duplication logic requires an extra set of registers for each PE, doubling the total size of the register file to 1,024 bytes. NEC says the register file occupies 17.2mm<sup>2</sup> in a 0.15-micron process. Access times are 2.1ns and 5.0ns for writes and reads, respectively, at 1.3V. The extra complexity of the duplication logic, however, partially offsets the control-logic simplifications allowed by the FOPE model.
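The register-file organization described above can be sketched in Python to check the 1,024-byte figure and to show the one-directional copy on a fork. The `RegisterFile` class and its method names are hypothetical. The text does not say whether the highest-numbered PE can fork to PE 0, so this sketch simply rejects that case as an assumption.

```python
NUM_PES = 4       # from the article
REGS_PER_PE = 32  # from the article
REG_BYTES = 4     # 32-bit registers, from the article
MASK32 = 0xFFFFFFFF

ARCHITECTURAL_BYTES = NUM_PES * REGS_PER_PE * REG_BYTES
# The duplication logic needs an extra register set per PE, doubling the size.
TOTAL_BYTES = 2 * ARCHITECTURAL_BYTES


class RegisterFile:
    """Toy model: one register set per PE, copied to the next PE on a fork."""

    def __init__(self) -> None:
        self.sets = [[0] * REGS_PER_PE for _ in range(NUM_PES)]

    def write(self, pe: int, reg: int, value: int) -> None:
        self.sets[pe][reg] = value & MASK32

    def read(self, pe: int, reg: int) -> int:
        return self.sets[pe][reg]

    def fork_copy(self, parent_pe: int) -> int:
        """Copy the parent's registers to the next-higher-numbered PE."""
        child_pe = parent_pe + 1
        if child_pe >= NUM_PES:
            # Assumption: wraparound is not modeled because the text is silent on it.
            raise ValueError("no higher-numbered PE available in this sketch")
        self.sets[child_pe] = list(self.sets[parent_pe])
        return child_pe
```

The copy is a snapshot: a write to the parent's registers after `fork_copy` does not change the child's values, which mirrors the way a forked thread inherits its parent's state at the moment of the fork.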

NEC also included special logic to eliminate register-read requests when the read will be satisfied by operand forwarding within the core. The company says this eliminates 39% of register accesses in an MPEG-2 decoding program, reducing power consumption within the register bank.
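The idea of skipping register reads that forwarding will satisfy can be illustrated by counting source operands produced by a recent instruction. This is a deliberately simplified sketch: the instruction format, the `count_register_reads` function, and the one-instruction forwarding window are all assumptions, and the model ignores the MP98's dual pipelines. It will not reproduce NEC's 39% figure, which comes from a real MPEG-2 program.

```python
def count_register_reads(program, forward_window=1):
    """Count source reads and how many forwarding could satisfy.

    program: list of (dest, src1, src2) register numbers (hypothetical format).
    forward_window: how many preceding instructions can forward a result
                    (assumed value, not taken from NEC's design).
    Returns (total_reads, eliminated_reads).
    """
    total = 0
    eliminated = 0
    for i, (_, *sources) in enumerate(program):
        start = max(0, i - forward_window)
        # Destinations written by instructions still inside the forwarding window.
        recent_dests = {dest for dest, *_ in program[start:i]}
        for src in sources:
            total += 1
            if src in recent_dests:
                # The operand arrives by forwarding; no register-file read is needed.
                eliminated += 1
    return total, eliminated


# Example: r1 = r2 op r3; r4 = r1 op r2; r5 = r4 op r1
example = [(1, 2, 3), (4, 1, 2), (5, 4, 1)]
total_reads, eliminated_reads = count_register_reads(example)
```

In this three-instruction example, two of the six source reads are satisfied by forwarding from the immediately preceding instruction.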

A store-reservation buffer (SRB) between the PEs and the data cache allows speculative stores to be buffered and later committed or aborted. The SRB contains six entries per PE in NEC's first implementation. Each entry holds one 32-bit data word with its associated address for one speculative store. Only stores that may affect the state of other threads are buffered. Store instructions identify whether the store is local to a single thread; if so, the store may be forwarded to the cache. When not otherwise needed, each SRB entry acts as a single fully-associative cache entry, slightly reducing cache traffic.
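A minimal Python sketch of the buffer-then-commit behavior described above follows. The `StoreReservationBuffer` class and its methods are hypothetical, the data cache is stood in for by a plain dictionary, and raising an error when all six entries are full is an assumption, since the text does not say what the hardware does in that case. The use of idle entries as a small cache is not modeled.

```python
SRB_ENTRIES_PER_PE = 6  # from the article
MASK32 = 0xFFFFFFFF


class StoreReservationBuffer:
    """Toy model of one PE's SRB in front of a dictionary-based 'cache'."""

    def __init__(self, cache: dict) -> None:
        self.cache = cache
        self.entries = []  # list of (address, 32-bit data) pairs

    def store(self, address: int, data: int, local: bool) -> None:
        if local:
            # Thread-local stores bypass the buffer and go to the cache.
            self.cache[address] = data & MASK32
            return
        if len(self.entries) >= SRB_ENTRIES_PER_PE:
            # Assumption: the real full-buffer behavior is not described.
            raise RuntimeError("SRB full")
        self.entries.append((address, data & MASK32))

    def commit(self) -> None:
        """The thread is confirmed: write buffered stores to the cache in order."""
        for address, data in self.entries:
            self.cache[address] = data
        self.entries.clear()

    def abort(self) -> None:
        """The thread is aborted: discard its speculative stores."""
        self.entries.clear()
```

A speculative store stays invisible in the cache until `commit` is called, and `abort` leaves the cache untouched, which is the property that lets other threads ignore the side effects of a discarded thread.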

The 64K data cache itself is divided into eight banks to improve the odds of interleaving accesses from the PEs and external (PCI) agents. The cache is four-way set-associative, and each bank has a single read/write port. Up to six banks may be active simultaneously when there is no bank contention. The instruction cache is also 64K in size and four-way set-associative.
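Bank contention in a cache of this kind can be sketched as a simple per-cycle arbitration. The bank count and the six-bank limit come from the text. The 32-byte line size, the choice of low-order line-address bits to select a bank, the first-come priority, and the `bank_of` and `arbitrate` names are all assumptions made for illustration.

```python
NUM_BANKS = 8         # from the article
MAX_ACTIVE_BANKS = 6  # from the article
LINE_BYTES = 32       # assumption: the line size is not given in the article


def bank_of(address: int) -> int:
    """Assumed mapping: consecutive cache lines rotate through the banks."""
    return (address // LINE_BYTES) % NUM_BANKS


def arbitrate(addresses):
    """Split one cycle's requests into (granted, stalled) lists.

    Each bank has a single port, so it serves at most one request per cycle,
    and at most MAX_ACTIVE_BANKS banks are active at once.
    """
    granted, stalled = [], []
    busy = set()
    for addr in addresses:
        bank = bank_of(addr)
        if bank in busy or len(busy) >= MAX_ACTIVE_BANKS:
            stalled.append(addr)
        else:
            busy.add(bank)
            granted.append(addr)
    return granted, stalled
```

With this mapping, addresses 0 and 256 fall in the same bank and cannot both be served in one cycle, while requests to consecutive lines spread across banks and proceed in parallel.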

The MP98's SDRAM controller manages one or two 64-bit interfaces with support for error-correcting code (ECC). The single-channel option reduces cost when bandwidth needs are modest, while the two-channel option will be better suited to more demanding applications. The MP98 also includes a standard 32-bit, 33MHz PCI interface for connecting to peripheral chips.

## Power Consumption Is Low

NEC designed the MP98 for low-power operation. According to NEC's simulations, the chip should reach its 125MHz target speed at 1.3V, and supply voltages from 1.2V to 1.8V should yield correspondingly lower or higher speeds. All I/O will use 3.3V signaling. Figure 2 shows a die plot of the 500-pin chip, which consumes only 0.22mW in the sleep state and 9.56mW when the caches are enabled. Fully powered up but idle, the chip consumes just 23.2mW. NEC says the chip's original unoptimized design required 1.07W when executing a representative MPEG-2 inverse discrete-cosine transform decoding algorithm. The extra logic that eliminates register reads made redundant by operand forwarding and that uses the SRB as a small cache, among other optimizations, reduced power consumption for this algorithm to 0.92W.
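The size of the power saving follows directly from the two figures NEC quotes. The short calculation below uses only those numbers; the variable names are illustrative.

```python
unoptimized_w = 1.07  # original design, from the article
optimized_w = 0.92    # with the power optimizations, from the article

saving_w = unoptimized_w - optimized_w
saving_pct = 100 * saving_w / unoptimized_w

print(f"Saving: {saving_w:.2f}W ({saving_pct:.0f}% of the unoptimized figure)")
```

The optimizations trim about 0.15W, or roughly 14% of the unoptimized power, on this workload.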


Though NEC has not described any plans to produce the MP98 or use it in specific applications, the new design appears to be competitive in overall complexity, performance, and power consumption with commercial media processors such as Fujitsu's FR-V (see *MPR 08/02/99-04*, "Fujitsu FR-V Architecture Bets On VLIW"). With 14 million transistors in a 0.15-micron process, the 110mm<sup>2</sup> device should be reasonably affordable to manufacture. Performance should be more than adequate for consumer-electronics products such as DVD players, also one of Fujitsu's target markets. If NEC decides to put this new architecture into production, we would expect to see the first chips later this year or early in 2001. ♢

To subscribe to Microprocessor Report, phone 408.328.3900 or visit www.MDRonline.com