# PowerPC 602 Aims for Consumer Products

## 3DO Multiplayer Is First Design Win for 603 Derivative

#### by Linley Gwennap

Leaving no stone unturned in the search for higher volumes, the PowerPC partners have developed a new processor intended for high-end consumer products. This category (which includes emerging markets such as game machines, set-top boxes, and PDAs) is ripe for RISC chips: the performance demands are too great for traditional CISC devices, and there are fewer software-compatibility requirements. To gain design wins, however, RISC contenders must shed excess cost and power requirements while retaining superior performance.

The PowerPC 602, developed jointly by IBM and Motorola, aims to do just that. The new processor was designed quickly by starting with the low-end but powerful 603 (*see* **071402.PDF**) and removing features to further reduce cost and power consumption. The changes include reducing the size of the on-chip caches, eliminating superscalar dispatch, and simplifying the system bus. As a result, the 602 uses half the power of the 603 and has one-third its manufacturing cost.



Figure 1. The PowerPC 602 issues one instruction per cycle to one of four function units. Branches are predicted earlier and removed from the instruction stream.

Although several of these changes reduce performance, the 602 caches have been modified to handle stores more quickly. In addition, the integer multiply latencies have been significantly improved. These changes will boost the performance of the chip in graphics-intensive environments.

By no coincidence, the first design win for the 602 is 3DO's next-generation M2 accelerator (*see 0812MSB.PDF*). The 3DO system generates complex 3D graphics on the fly, requiring significant compute power, yet the current unit sells for \$399 (including a CD-ROM drive), leaving little budget for the processor subsystem. The speedy yet inexpensive 602 should fill the bill for 3DO's next generation and may find its way into similar products.

#### Function Units Simplified for Low Cost

Like other embedded PowerPC chips, such as Motorola's 505 and IBM's 403 (*see 080601.PDF*), the 602 is limited to issuing one instruction per cycle, compared with two for the 603. Sticking to a scalar design greatly simplifies the logic for instruction dispatch, branch prediction, and instruction completion; the die area of these components was reduced by about half. One instruction per cycle is taken from the instruction cache and loaded into a four-entry queue; the bottom entry is eligible for dispatch on each cycle.

In some situations, the 602 can execute two instructions per cycle, as it retains the 603's ability to preprocess branches, a technique known as branch folding. When a branch is fetched from the instruction cache, its target address is predicted and the fetch stream is immediately redirected if necessary. The branch itself is not stored in the instruction queue; thus, the dispatch unit never sees the branch. But because the instruction cache can provide only one instruction per cycle, the chip cannot sustain a superscalar execution rate.

As Figure 1 shows, the 602 retains the separate function units of the 603, allowing the processor to continue dispatching instructions even if one unit is busy with a long-latency operation. The new design also retains the 603's completion unit, which allows instructions to execute out of order while completing in order. The 602 has a four-entry completion unit, limiting the number of instructions that can execute out of order to three. The 602 can complete one instruction per cycle.

Unlike the 403, the 602 has a floating-point unit. The 602's FPU, however, stores and calculates only single-precision (SP) values. This change nearly halves the size of the 603's double-precision FPU while maintaining similar performance on SP values. Specifically, the 602 can execute SP multiply-accumulate instructions in a single cycle with three cycles of latency. Table 1 shows the performance of other floating-point operations. Double-precision instructions are trapped and emulated in software, maintaining full PowerPC compatibility.

#### MICROPROCESSOR REPORT

The integer unit is taken directly from the 603. The only change is to enhance performance rather than decrease size: the 602 includes an improved multiply circuit. Because multimedia applications often perform integer multiplication, particularly with small operands, the 602 is designed to execute an integer multiply in a single cycle, with a two-cycle latency, if one operand contains 8 or fewer significant bits (that is, the upper 24 bits are either all zeroes or all ones). Table 1 shows the cycle times of other integer multiply operations.
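The operand-size rule for the fast multiplier can be illustrated with a small sketch. The cycle counts come from Table 1; the function names are hypothetical, and two points are assumptions rather than statements from the article: that the 16-bit case follows the same "upper bits all zeroes or all ones" test as the 8-bit case, and that either operand may be the small one.

```python
# (throughput, latency) in cycles for the 602, copied from Table 1.
MULTIPLY_TIMING_602 = {8: (1, 2), 16: (2, 3), 32: (4, 5)}


def significant_bits_class(operand: int) -> int:
    """Classify a 32-bit operand as having 8, 16, or 32 significant bits.

    The 8-bit test follows the article: the upper 24 bits are all zeroes
    or all ones. The 16-bit test is an assumed analogue of that rule.
    """
    value = operand & 0xFFFFFFFF  # view the operand as a 32-bit pattern
    for width in (8, 16):
        upper = value >> width
        all_ones = (1 << (32 - width)) - 1
        if upper == 0 or upper == all_ones:
            return width
    return 32


def multiply_timing_602(a: int, b: int) -> tuple[int, int]:
    """Return (throughput, latency) in cycles for a * b on the 602."""
    # Assumption: the narrower of the two operands selects the timing.
    width = min(significant_bits_class(a), significant_bits_class(b))
    return MULTIPLY_TIMING_602[width]


if __name__ == "__main__":
    for a, b in [(200, 123456789), (-3, 123456789),
                 (0x1234, 123456789), (123456789, 123456789)]:
        print(a, b, multiply_timing_602(a, b))
```

Small pixel or audio coefficients such as 200 or -3 fall into the single-cycle class, which is the case the designers were targeting.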

The 603's load/store unit was simplified by removing hardware support for complex PowerPC instructions such as string moves. The 602 will trap these instructions, which can be emulated in software, if desired.

#### Simple Yet Fast Cache Design

The 602 cache subsystem has been completely revamped from the 603's design. The most obvious change is the reduction in capacity. The separate instruction and data caches are each 4K, half the size of the 603's caches. Although this change greatly reduces performance on typical desktop software, the 602 will perform well on most embedded applications, which typically have a smaller working set.

To further reduce the cache area, the 602 eliminates the read and write buffers found in the 603. These buffers transfer data into and out of the 603's data cache in a single cycle, reducing the number of cycles in which the cache is occupied during a cache refill. Normally, eliminating these buffers would stall the cache during the entire refill operation.

To improve performance in these situations, both the instruction and data caches in the 602 can supply data during the dead cycles in a refill (that is, when the cache is otherwise unoccupied), a technique known as hit under miss. Because the 602 allows some out-of-order execution, this feature allows the CPU to execute up to three additional instructions during the refill process. The designers estimate that the hit-under-miss design boosts overall performance by 3%.

Stores that hit in the cache execute in one cycle on the 602, half the time taken by the 603. This change aids graphics programs that generate images, producing a large number of stores. The 602 also supports unaligned loads and stores in hardware. If the unaligned data is contained within an aligned doubleword, it can be accessed in a single cycle, since the data cache provides an aligned 64-bit value and can extract an arbitrary 32-bit word from this value. If the unaligned value crosses a doubleword boundary, the 602 takes two cycles to deliver the requested value.

| Operation                        | 602 Throughput | 602 Latency | 603 Throughput | 603 Latency |
|----------------------------------|----------------|-------------|----------------|-------------|
| Integer multiply, $8 \times 32$  | 1 cycle        | 2 cycles    | 2 cycles       | 2 cycles    |
| Integer multiply, $16 \times 32$ | 2 cycles       | 3 cycles    | 3 cycles       | 3 cycles    |
| Integer multiply, $32 \times 32$ | 4 cycles       | 5 cycles    | 5 cycles       | 5 cycles    |
| Integer divide                   | 37 cycles      | 37 cycles   | 37 cycles      | 37 cycles   |
| FP add (SP)                      | 1 cycle        | 3 cycles    | 1 cycle        | 3 cycles    |
| FP multiply (SP)                 | 1 cycle        | 3 cycles    | 1 cycle        | 3 cycles    |
| FP multiply-add (SP)             | 1 cycle        | 3 cycles    | 1 cycle        | 3 cycles    |
| FP divide (SP)                   | 16 cycles      | 18 cycles   | 16 cycles      | 18 cycles   |

Table 1. The 602 is the same as the 603 on FP multiplication and significantly faster on integer multiplication.
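The unaligned-access rule described above (one cycle when the data sits inside an aligned doubleword, two cycles when it crosses a doubleword boundary) can be sketched as follows. The function name is hypothetical, and the sketch assumes a cache hit and an access of 1, 2, or 4 bytes; it says nothing about miss timing.

```python
DOUBLEWORD_BYTES = 8  # the data cache delivers an aligned 64-bit value


def data_access_cycles(address: int, size: int = 4) -> int:
    """Estimate cache-hit access cycles for a load or store on the 602.

    Returns 1 if every byte lies in one aligned doubleword, 2 if the
    access straddles a doubleword boundary (per the article's rule).
    """
    if size not in (1, 2, 4):
        raise ValueError("sketch only models 1-, 2-, and 4-byte accesses")
    first_doubleword = address // DOUBLEWORD_BYTES
    last_doubleword = (address + size - 1) // DOUBLEWORD_BYTES
    return 1 if first_doubleword == last_doubleword else 2


if __name__ == "__main__":
    for addr in (0x1000, 0x1003, 0x1004, 0x1005):
        print(hex(addr), data_access_cycles(addr))
```

A word at offset 3 within a doubleword is misaligned but still costs one cycle, while a word at offset 5 spills into the next doubleword and costs two.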

The instruction TLB and data TLB each have 32 entries, half the size of the 603's units. Each TLB also includes four block entries capable of mapping 128K–256M each. Table 2 summarizes the differences between the 602 and 603.

#### Narrow Bus Reduces System Cost

Unlike the other 60x processors, the 602 implements a multiplexed system bus. This bus combines 32 address bits and 64 data bits on the same set of signals, reducing the pin count of the 602 as well as all other devices that connect to the bus, and reducing board space as well. These factors all reduce overall system cost.

The downside of the multiplexed bus is lower performance due to the overhead of sending the address and data at different times on the same wires. A burst read transaction requires seven cycles to load 32 bytes, but only four of these cycles actually transfer data; a burst store takes five or six cycles. At 33 MHz, the maximum sustainable load bandwidth is 152 Mbytes/s, which should be adequate for a low-cost memory subsystem.

| Feature                        | 602               | 603               |
|--------------------------------|-------------------|-------------------|
| CPU Clock Speed (max)          | 66 MHz            | 80 MHz            |
| Instruction Fetch per Cycle    | 1 instr           | 2 instr           |
| Instruction Queue              | 4 entry           | 8 entry           |
| Instruction Dispatch per Cycle | 1 + branch        | 2 + branch        |
| Completion Unit                | 4 entry           | 5 entry           |
| Integer Unit                   | fast multiply     | standard          |
| Floating-Point Unit            | single precision  | single / double   |
| Load/Store Unit                | simplified        | standard          |
| Instruction / Data Cache Size  | 4K / 4K           | 8K / 8K           |
| Load/Store Buffers             | no                | yes               |
| Hit under Miss                 | yes               | no                |
| Store Latency                  | 1 cycle           | 2 cycles          |
| Instruction / Data TLB Size    | 32 / 32 entry     | 64 / 64 entry     |
| Bus Type                       | multiplexed       | nonmux'd          |
| Bus Data Width                 | 64 bits           | 32 or 64 bits     |
| Power Dissipation (typ)        | 1.2 W             | 2.5 W             |
| Package Type                   | PQFP-144          | CQFP-240          |
| Process Type                   | 0.65µ, 4M         | 0.65µ, 4M         |
| Die Size                       | 50 mm<sup>2</sup> | 85 mm<sup>2</sup> |
| Estimated Mfg Cost             | \$14\*            | \$45\*            |
| SPECint92 (est)                | 40 int            | 75 int            |
| First Volume Shipments         | 2H95              | 3Q94              |

Table 2. The 602 is similar to the 603 in many aspects, but most of the storage units have been halved in capacity. (Source: Somerset except \*MDR estimates)

#### Price & Availability

Motorola and IBM have not announced pricing for the PowerPC 602. They expect to begin volume production in 2H95. Contact Motorola at 800.845.MOTO or IBM Microelectronics at 800.IBM.0181.

The bus operates at 1/2 or 1/3 of the CPU speed. The 602 takes its input from the bus clock and uses a PLL to derive the CPU clock frequency. The initial 602 parts will operate at 66 MHz; given the speed of the base 603 design, 80-MHz parts are likely to follow, with even faster parts resulting from future process shrinks.
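The bus-clock and bandwidth figures quoted for the multiplexed bus can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the "66 MHz" and "33 MHz" figures are treated as the nominal 66.67 and 33.33 MHz, a Mbyte is taken as 10^6 bytes, and the function names are hypothetical.

```python
BURST_BYTES = 32        # bytes moved by one burst read (from the article)
BURST_READ_CYCLES = 7   # total bus cycles per burst read
BURST_DATA_CYCLES = 4   # cycles that actually carry data


def bus_clock_hz(cpu_hz: float, divider: int) -> float:
    """Bus clock for a CPU clock, given the 1/2 or 1/3 bus ratio."""
    if divider not in (2, 3):
        raise ValueError("the 602 bus runs at 1/2 or 1/3 of CPU speed")
    return cpu_hz / divider


def burst_read_bandwidth(bus_hz: float) -> float:
    """Peak sustainable load bandwidth in bytes per second."""
    return BURST_BYTES * bus_hz / BURST_READ_CYCLES


if __name__ == "__main__":
    cpu_hz = 200e6 / 3  # assumed nominal value behind "66 MHz"
    for divider in (2, 3):
        bus_hz = bus_clock_hz(cpu_hz, divider)
        mbytes = burst_read_bandwidth(bus_hz) / 1e6
        print(f"bus {bus_hz / 1e6:.1f} MHz -> {mbytes:.0f} Mbytes/s")
    print(f"bus utilization: {BURST_DATA_CYCLES / BURST_READ_CYCLES:.0%}")
```

With the 1/2 ratio the sketch lands on the article's 152 Mbytes/s; only four of every seven cycles carry data, which is the cost of multiplexing address and data on the same wires.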

Like the 603, the 602 supports a three-state subset of the MESI consistency protocol. This subset is adequate for maintaining consistency with DMA devices, but the 602 does not support multiprocessor configurations.

The 602 also borrows power-management features from its predecessor. Function blocks are not clocked when they are inactive, keeping the typical power dissipation to less than 1.2 W using a 3.3-V supply. Power can be further reduced by placing the 602 in doze, nap, or sleep modes.

The new chip uses the same 0.65-micron four-layer-metal process as the 603 and 604. Dubbed CMOS-5L, it packs the 602's 1.0 million transistors onto a die measuring 50 mm<sup>2</sup>, as Figure 2 shows. According to the MPR Cost Model, the 602 costs just \$14 to build, about a third of the estimated manufacturing cost of the 603.

Figure 2. The PowerPC 602, with 1.0 million transistors, measures $7.1 \times 7.1$ mm in a 0.65-micron four-layer-metal CMOS process.
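The die-size and cost comparisons quoted above can be checked against the numbers in Table 2 and the Figure 2 caption. The sketch below is only arithmetic on those published figures (the cost values are MDR estimates, not vendor data); the helper name `ratio` is hypothetical.

```python
def ratio(part: float, whole: float) -> float:
    """Return part / whole, rejecting a zero denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return part / whole


DIE_SIDE_MM = 7.1                 # from the Figure 2 caption
die_area_mm2 = DIE_SIDE_MM ** 2   # square die, so area = side squared

cost_ratio = ratio(14, 45)        # 602 vs. 603 estimated manufacturing cost
area_ratio = ratio(50, 85)        # 602 vs. 603 die size in mm^2
power_ratio = ratio(1.2, 2.5)     # 602 vs. 603 typical power in watts

if __name__ == "__main__":
    print(f"die area   : {die_area_mm2:.1f} mm^2")
    print(f"cost ratio : {cost_ratio:.2f}")
    print(f"area ratio : {area_ratio:.2f}")
    print(f"power ratio: {power_ratio:.2f}")
```

The results (roughly 50.4 mm<sup>2</sup>, and ratios near 0.31, 0.59, and 0.48) agree with the article's "about a third" cost claim and "half the power" claim. The cost falls faster than the die area, which is consistent with the cheaper plastic package also contributing to the savings.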

Along with the smaller die, a major cost savings comes from packaging the chip in a 144-pin PQFP. The low power dissipation of the 602 obviates the need for the more expensive ceramic packaging used by other 6xx-family chips. The package cost is also reduced by the multiplexed bus, which eliminates 32 pins.

#### A Powerful Multimedia Engine

With a fast integer multiplier and high-performance single-precision FPU, the 602 is ready to handle demanding multimedia applications such as sound processing, graphics, and animation. Yet the low cost of the chip allows it to compete with dedicated DSPs, and its PowerPC compatibility will make code development much easier. Furthermore, applications developed for the 602 will also run on desktop PowerPC systems.

This feature set makes the 602 well-suited for the 3DO game machine. A similar application is Apple's Pippin, a game machine that offers compatibility with the Power Macintosh. The first Pippin systems will use the 603, but the 602 appears to be a logical cost-reduction option for the second generation.

The performance of the 602, particularly the fast integer multiplier, could allow it to be used in PDAs, performing some DSP functions in the CPU. It is expensive for this market, however, and has no on-chip peripherals. Motorola is preparing a specialized PowerPC chip with integrated peripherals for the PDA market (*see* **0901MSB.PDF**). The partners may also develop an FPU-less version of the 602 for this market.

Motorola and IBM estimate that the 602, without a level-two cache, will deliver 40 SPECint92 at 66 MHz. This performance is impressive for such a tiny processor but makes it unlikely that the 602 will be used in a traditional desktop system. Apple's 68K emulation will do poorly on the 602 due to the small caches, and even in native mode, the chip will lag Pentium PCs.

The vendors did not provide a Dhrystone MIPS rating for the 602, but it should be well above the 40–50 MIPS of the 403 and 505. Neither vendor has announced pricing; we expect the 602 to list for about \$50 and sell for less than \$30 in very high volumes. The 403 and 505 both list for about \$50 but will drop in price by 2H95.

With both IBM and Motorola working to expand their own embedded PowerPC lines, and with designs like the 602 coming from the Somerset Design Center, there will be a plethora of embedded PowerPC options available by the end of this year. If all goes well, the 602 could give PowerPC a few high-volume design wins in emerging consumer products, pushing the architecture into the stratosphere of the embedded market.