# **ARM9 Doubles ARM Performance in '98**

*Five-Stage Pipeline, Dual Caches Make ARM9 More Conventional*

by Jim Turley

With the formal announcement of ARM9TDMI at October's Microprocessor Forum, Advanced RISC Machines plugs a hole in its midrange line. The new core will bridge the gap between today's ARM7 and StrongArm processors when it enters production in the middle of 1998. ARM expects ARM9 parts to run at up to 150 MHz, far faster than most ARM7 chips today but still well below the 200+ MHz speeds of StrongArm.

To gain the extra speed, ARM has finally extended its characteristically short three-stage pipeline to five stages and moved to separate instruction and data caches. These changes will allow midrange ARM chips to reach the same triple-digit clock speeds as MIPS, PowerPC, SuperH, and other competitive 32-bit processors. Because StrongArm is single sourced, for most customers ARM9 marks the new high end of the ARM product line.

#### Pipeline Grows, Latency Stays About Equal

Ian Devereux, chief engineer of the ARM9 development team, described the changes made between ARM7 and ARM9. (The little-used ARM8 has been produced since 1996 but is not part of the main product roadmap.) Basically, ARM came to the same conclusion as generations of other CPU designers: to reach higher clock rates, you have to extend the pipeline. Thus, ARM7's claustrophobic three-stage pipeline has given way to a more conventional five-stage pipe.

ARM9's pipeline now looks identical to StrongArm's (see MPR 11/13/95, p. 16). The trivial fetch stage from ARM7 remains, but the overstuffed execute stage is now spread among no fewer than four stages, with some work being done in the decode stage. Figure 1 maps the differences between the ARM7 and ARM9 pipelines.

Figure 1. ARM9's five-stage pipeline relieves congestion in the third stage of the company's earlier ARM7 microarchitecture.

In the new core, Thumb-to-ARM instruction "decompression," or mapping, is skipped completely. Rather than serially translate 16-bit Thumb instructions into their 32-bit equivalents and then decode the standard ARM instructions, ARM9TDMI simply decodes the incoming instructions, ARM or Thumb, in one go. (In ARM's nomenclature, the "TDMI" suffix identifies the Thumb compression, debug unit, hardware multiplier, and in-circuit emulator tap, all of which are mandatory elements of the ARM9 core.)

With these changes, Devereux predicted ARM9 will reach 120–150 MHz in commodity 0.35-micron processes. These speeds are three times faster than those of most ARM7 chips, with the exception of VLSI Technology's extensively replumbed 120-MHz part (see MPR 10/6/97, p. 9).

Even with the longer pipeline, ARM9 executes nearly all instructions in the same number of cycles as ARM7. One exception is integer multiply, which now takes one cycle longer than usual due to a critical path forwarding results. Multiply-accumulate instructions, however, hide this extra latency in the final addition, so MAC operations perform as before. Loads and stores are now fully pipelined, although there remains a one-cycle load-use penalty.

Mispredicted branches incur a three-cycle penalty, just the same as ARM7 but one cycle longer than either ARM8 or StrongArm. ARM8 includes static branch prediction, and StrongArm calculates branch addresses in the decode stage, two enhancements that ARM9 does not include. Devereux said his team felt the additional gates weren't worth the marginal increase in per-clock performance.

In addition to the pipeline alterations, ARM9 splits ARM7's unified internal bus and cache in two, giving the new core a Harvard architecture for the first time. The separate buses allow ARM9 to avoid conflicts between instruction fetches and operand transfers much more easily than its predecessor, cutting the load/store time to a single cycle. It also means ASIC designers now have four internal buses (two addresses, instructions, and data) to deal with.

Overall, these changes give ARM9 an average clocks-per-instruction ratio of 1.5, according to ARM, substantially better than ARM7's 1.9 CPI but a shade worse than the 1.4 CPI for ARM8 and StrongArm. The lack of the latter chips' branch enhancements accounts for the small backward progress in performance per clock cycle. On the other hand, ARM's conditional execution can often eliminate trivial branch constructs. CPI is, of course, only half of the performance equation, and ARM9's faster clock speeds are its biggest benefit over earlier generations.

#### Two Design Macros Include Caches, Bus, MMUs

Symbios Logic and staunch ARM supporter VLSI Technology are the only announced licensees for ARM9 so far. Both plan to make ARM9 cores ready in 1Q98. They'll also offer two larger, more complete macros, the ARM910T and the 940T. The two are identical except for their MMUs.
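As a rough check on the core's CPI figures quoted above, throughput in native instructions per second is simply clock rate divided by CPI. The sketch below applies that relation to ARM's numbers. The function names are illustrative, and the last calculation assumes, as a simplification, that the entire 0.1-CPI gap to ARM8 and StrongArm comes from one extra penalty cycle per affected branch.

```python
# CPI figures as quoted by ARM in the article.
CPI = {"ARM7": 1.9, "ARM8": 1.4, "ARM9": 1.5, "StrongArm": 1.4}


def native_mips(clock_mhz: float, cpi: float) -> float:
    """Millions of native instructions per second: clock (MHz) / CPI."""
    return clock_mhz / cpi


def per_clock_speedup(old: str, new: str) -> float:
    """Speedup of `new` over `old` at the same clock rate (ratio of CPIs)."""
    return CPI[old] / CPI[new]


def implied_penalized_branch_fraction(extra_cycles: float = 1.0) -> float:
    """Fraction of instructions that would have to pay `extra_cycles`.

    ASSUMPTION (not from ARM): the whole ARM9-vs-StrongArm CPI gap is
    branch penalty, at one extra cycle per affected branch.
    """
    return (CPI["ARM9"] - CPI["StrongArm"]) / extra_cycles


if __name__ == "__main__":
    for mhz in (120.0, 150.0):  # ARM9 clock range predicted for 0.35 micron
        mips = native_mips(mhz, CPI["ARM9"])
        print(f"ARM9 at {mhz:.0f} MHz: {mips:.0f} native MIPS")
    speedup = per_clock_speedup("ARM7", "ARM9")
    print(f"ARM9 vs. ARM7 at equal clock: {speedup:.2f}x")
    fraction = implied_penalized_branch_fraction()
    print(f"Implied penalized-branch fraction: {fraction:.0%}")
```

These are native instruction rates, not benchmark results. At equal clock speed the CPI improvement is worth roughly a quarter more throughput, which is why the higher clock rate, not the CPI, carries most of ARM9's gain.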

The 910T will pair the ARM9TDMI core with a pair of 4K caches. The chip will also have a write buffer, a generic internal bus interface, and an MMU for Windows CE. The 940T strips down the 910T's MMU to a set of eight "region descriptors" that control caching properties on selectable address boundaries. Both macros will have Thumb code compression as standard equipment.

In VLSI's 0.35-micron process, the 940T will measure 15 mm<sup>2</sup>, according to that company. The ARM9 core itself accounts for only 4 mm<sup>2</sup> of that total. At 150 MHz, the core superset will draw 675 mW from its 3.0-V supply.

While two-thirds of a watt is hardly power hungry, it is five times what the SA-110 uses at the same speed. The SA-110 has the advantages of Digital's spectacular fabrication process, the extensive hand-tweaking Digital performed on its wonder chip, and, of course, the chip's split power supply, with the core running at just 1.65 V.
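The effect of the SA-110's lower core voltage can be estimated with the usual first-order rule that CMOS dynamic power scales with the square of supply voltage at a fixed clock rate and switched capacitance. The sketch below applies that rule to the 940T figures above. It is an illustration under that assumption only: it ignores process, circuit, and I/O differences, and the function names are hypothetical.

```python
def mw_per_mhz(power_mw: float, clock_mhz: float) -> float:
    """Power efficiency in milliwatts per megahertz."""
    return power_mw / clock_mhz


def voltage_scaling_factor(v_new: float, v_old: float) -> float:
    """First-order dynamic-power ratio at equal clock and capacitance.

    ASSUMPTION: power is proportional to V^2 (leakage and I/O ignored).
    """
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    power_940t_mw = 675.0   # 940T at 150 MHz, per VLSI
    clock_mhz = 150.0
    v_940t = 3.0            # 940T supply voltage
    v_sa110_core = 1.65     # SA-110 core voltage

    print(f"940T: {mw_per_mhz(power_940t_mw, clock_mhz):.1f} mW/MHz")

    factor = voltage_scaling_factor(v_sa110_core, v_940t)
    print(f"Voltage scaling factor: {factor:.2f}")
    # Hypothetical: the 940T design run at the SA-110's core voltage.
    print(f"940T scaled to 1.65 V: {power_940t_mw * factor:.0f} mW")
```

By this estimate, the lower voltage alone would account for a bit more than a threefold power saving; the rest of the SA-110's advantage would then have to come from the process and hand-tuning noted above.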

Compared with somewhat more pedestrian processors, the 940T still shines. Within the 100 MHz-and-up club, Hitachi's SH7708 compares well at 700 mW and has natural code-density benefits similar to those of Thumb.

#### ARM7 Expected to Survive at Low End

ARM expects to see first silicon of the ARM9 core and the 940T macro within the next few weeks. Production is optimistically scheduled for 1Q98, but 2H98 seems more likely. The 910T, with its more involved MMU, should tape out in 2Q98, which may put production into early 1999.

That schedule leaves ASIC designers with another year to work on ARM7 designs before they can realistically consider starting up an ARM9-based chip. For most ARM users, ARM9 will rapidly eclipse ARM7, just as ARM7 did the now forgotten ARM6. For the devoted few for whom cost is the ultimate goal, however, ARM7 will live on.

Even though the ARM9 core is about one-third larger than ARM7 in the same process, it still measures just 4 mm<sup>2</sup>, which is too small for most designers to worry about. Within the context of a sizable ASIC, the extra gates are irrelevant.

Power consumption between the two is also comparable. Even though ARM9 is more complex, its designers got more creative about conserving power. Performance is obviously better with the newer version, but in terms of mW/MHz, ARM9 is not radically more power hungry than ARM7. At similar clock rates, the two are about equal.

All of this would spell a rapid displacement of ARM7 by ARM9 starting late in 1998, if so many ARM customers weren't fanatically concerned about cost. For makers of high-volume automotive and wireless equipment, a few pennies or a few milliwatts are important. In antilock brake (ABS) or mobile GSM units, ARM9's caches, buses, and performance aren't yet worth even a little extra silicon.

ARM8, the forgotten processor, was never intended for ASIC development. It was essentially a one-off, designed for a particular customer (ARM part-owner Acorn) for computer systems. With neither the Thumb compatibility of ARM7 nor the performance of StrongArm, ARM8 was a detour in the company's roadmap.

The other unknown quantity is StrongArm. Never a force in the ASIC business, the high-end version has lost some of its imposing presence now that its future is a subject of negotiation between designer Digital and presumptive owner Intel. Theoretically, ARM9 fills the space between ARM7 and StrongArm. In most designers' minds, however, ARM9 is now the high end. ARM will continue to nurse ARM7 at the low end for its extremely cost-sensitive customers.

The ARM9TDMI core will begin sampling from Symbios Logic and VLSI Technology in 1Q98. For more information, call ARM (Cambridge, U.K.) 44.1223.400.400 or visit *www.arm.com/CoInfo/PressRel/ARM9Launch.*

### **Enabling Faster Toys**

ARM and its chips have become a seemingly unstoppable force in the narrow but expanding market for portable communications gear. With ARM9, the company is laying the groundwork for a big jump in the capability of such devices. Cellular-telephone vendors, among others, are predicting vastly increased responsibilities for the humble cell phone: organizer, two-way text pager, Web browser, e-mail terminal, GPS locator, and more. Although the RF portion of such devices will remain fairly consistent, the computing requirement will increase considerably. Hence, the need for faster portable processors.

Faster, more capable, and more expensive consumer items such as these will be able to carry the burden of a somewhat more expensive microprocessor license. That, in turn, will give ARM—and other CPU vendors—room to increase revenue and continue on the rapid growth curve embedded vendors have been enjoying.

Ian Devereux of ARM (Advanced RISC Machines) elaborates on the changes made to ARM9.