# Fujitsu Extends SparcLite Family

**FPU and DMA "Racetrack" Let 86934 Sustain 55 MFLOPS**

#### by Curtis P. Feigel and Morris Enfield

Aiming at high-end graphics and imaging products, Fujitsu has revealed the 86934, the latest in its SparcLite series of RISC processors. Also called SparcPix, the device is the first of the family to incorporate an FPU. It teams a set of FIFOs, a DMA controller, and an SDRAM interface to transfer FPU data at 480 Mbytes/s, allowing the 60-MHz embedded processor to sustain 55 MFLOPS.

The SPARC architecture has been a staple of the workstation market since Sun created it in the mid-1980s. The '934 is the continuation of the SparcLite embedded processor line that Fujitsu began with the 86930 (see MPR 11/14/90, p. 1). Its core is similar to the previous four variants, but the '934 incorporates a large 8K instruction cache and is the first to be implemented in a process smaller than 0.8-micron, one of the factors that allows it to run at 60 MHz. It maintains software compatibility with other SPARC processors.

Although the processor issues only one instruction per clock, it supports concurrent operation of the integer and floating-point units, so math operations with long latency need not get in the way of integer execution. To reduce interrupt latency, the integer-divide operation (which takes many cycles) may be paused by an interrupt and resumed once the interrupt is handled.

## FPU and DMA Work in Concert

Fujitsu analyzed graphics-related embedded applications and found that, at the high end, they typically deal with large arrays—so the '934 is built with vector computations in mind. These types of applications handle much more data than can fit within an on-chip cache. The company realized that a high-bandwidth pathway into and out of the FPU would be critical to sustaining fast math operations.

Figure 1 shows Fujitsu's solution to the problem: what it calls a data "racetrack." The FPU has direct access to a set of six FIFOs, each 64 words deep, that are filled and emptied by DMA transfers from the external SDRAM (synchronous DRAM) (*see 070205.PDF*) array. The end entry of each FIFO is mapped to an FPU register. Once the DMA controller is set up, it appears to the FPU that these registers automatically load operands and store results without any intervention from the integer unit.

A typical application might set up the FIFOs as two sets of three buffers, with each set handling two operands and one result. The FPU operates on one set of FIFOs while the DMA controller empties and refills the other set. Thus, the DMA controller can relieve application programs from the chores of address generation and data movement, significantly improving vector performance and array-type operations.
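The hand-off between the two FIFO sets can be pictured with a small software model. The sketch below is only an illustration of the double-buffering idea, not Fujitsu's programming interface: Python lists stand in for the hardware FIFOs, the function names (`new_set`, `dma_fill`, `dma_drain`, `fpu_compute`, `vector_multiply`) are hypothetical, and the element-wise multiply is an arbitrary choice of operation. The model also runs sequentially, whereas on the chip the DMA transfers and the FPU work would proceed at the same time.

```python
FIFO_DEPTH = 64  # words per FIFO, as stated in the article


def new_set():
    """One hypothetical FIFO set: two operand buffers and one result buffer."""
    return {"a": [], "b": [], "r": []}


def dma_fill(fifo_set, a, b, start):
    """Model a DMA transfer loading up to FIFO_DEPTH operand pairs."""
    fifo_set["a"] = a[start:start + FIFO_DEPTH]
    fifo_set["b"] = b[start:start + FIFO_DEPTH]


def dma_drain(fifo_set, out):
    """Model a DMA transfer storing the result FIFO back to memory."""
    out.extend(fifo_set["r"])
    fifo_set["r"] = []


def fpu_compute(fifo_set):
    """Model the FPU consuming operands and producing results (a multiply here)."""
    fifo_set["r"] = [x * y for x, y in zip(fifo_set["a"], fifo_set["b"])]


def vector_multiply(a, b):
    """Multiply two equal-length vectors using two alternating FIFO sets."""
    sets = [new_set(), new_set()]
    out = []
    active = 0
    dma_fill(sets[active], a, b, 0)  # prime the first set
    for start in range(0, len(a), FIFO_DEPTH):
        idle = 1 - active
        # In hardware the next three steps would overlap in time;
        # this sequential model shows only the order of the hand-off.
        dma_drain(sets[idle], out)                      # empty earlier results
        dma_fill(sets[idle], a, b, start + FIFO_DEPTH)  # refill for the next block
        fpu_compute(sets[active])                       # work on the current block
        active = idle
    dma_drain(sets[1 - active], out)  # results of the last computed block
    return out


if __name__ == "__main__":
    a = [float(i) for i in range(130)]
    b = [2.0] * 130
    print(vector_multiply(a, b)[:4])
```

The application code in this model never computes an address inside the arithmetic step, which is the point of the arrangement: address generation and data movement live entirely in the fill and drain routines.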

As an example, Fujitsu coded the inner loop of a typical proportional-integral motion-control application once using the '934's DMA controller to feed the FPU and once without it. The standard code executed in 32 cycles and produced a control output every 0.53 µs. The DMA-fed code executed in 10 cycles and produced a control output every 0.16 µs, making it 3.2 times as fast.
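The loop times follow directly from the cycle counts and the 60-MHz clock. The short calculation below is a sanity check on that arithmetic rather than anything from Fujitsu; the function name `loop_time_us` is an assumption made for illustration.

```python
CLOCK_HZ = 60e6  # 60-MHz processor clock, from the article


def loop_time_us(cycles, clock_hz=CLOCK_HZ):
    """Time in microseconds for one pass through a loop of `cycles` clocks."""
    return cycles / clock_hz * 1e6


standard_us = loop_time_us(32)  # inner loop without DMA assistance
dma_fed_us = loop_time_us(10)   # inner loop with the DMA controller feeding the FPU
speedup = standard_us / dma_fed_us

if __name__ == "__main__":
    print(f"standard: {standard_us:.3f} us")
    print(f"DMA-fed:  {dma_fed_us:.3f} us")
    print(f"speedup:  {speedup:.1f}x")
```

At 60 MHz, 32 cycles take about 0.533 µs and 10 cycles about 0.167 µs, so the DMA-fed loop runs 3.2 times as fast as the standard one.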

The '934's FPU implements the standard SPARC v8 floating-point instructions to maintain compatibility, but the FIFO registers can be accessed only via instructions that are an enhancement to the standard set. This means that, to get full performance, programmers will have to recode their applications.

## RISC and DSP Merge in Vectors

Fujitsu's data racetrack is a compelling concept. For typical image-processing applications, such as page composition on high-end network printers, different processes may each handle every pixel just a few times. If each datum were handled hundreds or thousands of times, the algorithm could be accelerated by a faster processor. In fact, there are some intermediate variables that are handled repeatedly and may benefit from on-chip cache. But an image contains too many pixels to cache, which led Fujitsu to borrow a concept from DSPs, where pixel source and destination data must be expected to reside off chip in large memory that is both fast and affordable, and where there must be an efficient interface between the processor and this large memory.

Figure 1. The '934's FIFOs, DMA controller, and SDRAM interface form a data "racetrack" that can automatically feed operands to the FPU and store the results, allowing the chip to sustain 55 MFLOPS.

|                      | 86934       | 86933 | 86932      | 86931      | 86930            |
|----------------------|-------------|-------|------------|------------|------------------|
| Introduction Date    | 1994        | 1992  | 1993       | 1992       | 1991             |
| Clock Speed (MHz)    | 30, 60      | 20    | 20, 40     | 20, 40     | 20, 30, 40       |
| MIPS (sustained)*    | 28, 56      | 17    | 19, 37     | 19, 37     | 19, 28, 37       |
| MFLOPS               | 25–55       | 0.5   | 0.5–1.0    | 0.5–1.0    | 0.5–1.0          |
| Instruction Cache    | 8K          | none  | 8K         | 2K         | 2K               |
| Data Cache           | 2K          | none  | 2K         | 2K         | 2K               |
| Supply Voltage       | 3.3 V       | 5 V   | 5 V        | 5 V        | 5 V              |
| DRAM Support         | yes         | yes   | yes        | yes        | yes              |
| SDRAM Support        | yes         | no    | no         | no         | no               |
| Timers               | 1           | 1     | 1          | 5          | 1                |
| Serial Ports         | none        | none  | none       | 2          | none             |
| Interrupt Controller | yes         | no    | no         | yes        | no               |
| DMA Controller       | 2 ch.       | no    | 2 ch.      | no         | no               |
| FPU                  | yes         | no    | no         | no         | no               |
| MMU                  | no          | no    | yes        | no         | no               |
| Price (1,000s)       | \$80, \$110 | \$18  | \$65, \$95 | \$45, \$65 | \$35, \$42, \$50 |

Table 1. The 86934's speed and price place it squarely at the high end of the SparcLite line. \*Based on the Dhrystone 2.1 benchmark. (Source: Fujitsu)

Mapping the FIFOs into normal FPU register space retains the general-purpose character of the device, and it provides easier access for high-level languages with vector preprocessors and feature-cognizant compilers. When a preprocessor can recognize a data set as a vector structure, it can assign the set to a FIFO and initiate system service calls for DMA between memory and FIFO as needed. Data that is fetched to FIFO and used a few times doesn't thrash the cache.

A preprocessor and compiler have significant influence on performance. To get maximum performance from DSPs, direct-control coding is routine. To buffer the user from some of the difficulty, DSP vendors offer hand-coded libraries of common functions. Fujitsu plans to introduce such software tools for the '934 but is not ready to reveal its partners just yet.

The racetrack arrangement achieves a balance between processing and data-transfer speed, and it eliminates, for vector operations, the need for loads and stores. The FPU's three-stage pipeline executes most single- and double-precision floating-point operations in one cycle (the exceptions are divide and square root).

In theory, this rate requires the racetrack to handle data at 720 Mbytes/s (two 32-bit operands and one 32-bit result per clock at 60 MHz). The 64-bit-wide SDRAM interface, in burst mode, operates at 480 Mbytes/s, which Fujitsu claims is sufficient to realize 52–56 MFLOPS on real code. To maximize throughput, the BIU (bus interface unit) supports four-word bursts to the FIFOs, and the DMA controller's two channels support contiguous-block and chained-block transfers.
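The two bandwidth figures can be reproduced from the clock rate and the word widths. The sketch below is a back-of-the-envelope check, assuming one floating-point operation per clock and taking "Mbytes" to mean 10^6 bytes; the constant names are illustrative. It does not attempt to model how real code reaches the MFLOPS range that Fujitsu claims.

```python
CLOCK_HZ = 60_000_000  # 60-MHz clock, from the article
WORD_BYTES = 4         # one 32-bit word
WORDS_PER_OP = 3       # two operands in, one result out
SDRAM_BUS_BYTES = 8    # 64-bit-wide SDRAM interface

# Traffic needed if every clock consumes two operands and produces one result.
required_bytes_per_s = CLOCK_HZ * WORDS_PER_OP * WORD_BYTES

# Peak burst rate if the SDRAM interface moves one 64-bit word per clock.
available_bytes_per_s = CLOCK_HZ * SDRAM_BUS_BYTES

required_mbytes = required_bytes_per_s // 10**6
available_mbytes = available_bytes_per_s // 10**6

if __name__ == "__main__":
    print(f"required:  {required_mbytes} Mbytes/s")
    print(f"available: {available_mbytes} Mbytes/s")
    print(f"ratio:     {available_bytes_per_s / required_bytes_per_s:.2f}")
```

The interface therefore supplies two-thirds of the theoretical peak demand.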

## Bus Interface Supports SDRAM

As with previous SparcLite processors, the '934 uses separate 32-bit address and data buses. The new device can run its bus interface at the processor's internal speed or at half that rate. With its prefetch and write buffers, loads and stores can occur as fast as every clock while the instruction unit operates out of cache.

Like its predecessors, the '934's BIU supports programmable wait-state generation, address decoding for chip-select outputs, and same-page detection for page-mode DRAMs. Unlike previous SparcLite chips, when the bus has been granted to another master, the '934's chip-select logic monitors the address bus, so designers need not provide external chip selects.

The '934 also provides an interface that permits direct connection to as many as four banks of SDRAM, each of which may be 32 or 64 bits wide. The SDRAM controller can handle devices up to 64 Mbits, and memory size can range from 2M up to 512M.

In addition to its 32 address lines, the chip provides four lines to identify alternate address spaces, allowing protected regions that isolate applications programs from operating-system or kernel routines. The processor can boot from ROMs that are 8, 16, or 32 bits wide.

## Lockable Caches Help Embedded Apps

As in the other SparcLite processors, the '934's data and instruction caches are accessed from independent internal instruction and data buses (although the buses are combined for external access). The 8K instruction cache uses a 32-byte line, while the 2K data cache has a 16-byte line. Both caches are two-way set associative and employ a write-through policy and an LRU replacement algorithm.
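The cache organization implied by these figures can be worked out with a few divisions. The sketch below assumes that "8K" and "2K" mean kilobytes and uses the conventional definition of a set as the group of lines that share one index; the function name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (total lines, sets, index bits) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    index_bits = sets.bit_length() - 1  # exact log2 when sets is a power of two
    return lines, sets, index_bits


# Instruction cache: 8 Kbytes, 32-byte lines, two-way set associative.
icache = cache_geometry(8 * 1024, 32, 2)

# Data cache: 2 Kbytes, 16-byte lines, two-way set associative.
dcache = cache_geometry(2 * 1024, 16, 2)

if __name__ == "__main__":
    print("instruction cache (lines, sets, index bits):", icache)
    print("data cache        (lines, sets, index bits):", dcache)
```

Under these assumptions the instruction cache holds 256 lines in 128 sets and the data cache holds 128 lines in 64 sets.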

The processor provides global control bits that can lock either or both caches, preserving all valid entries in the locked space. One entire set of cache lines also can be locked while the other set continues to function as a direct-mapped cache. In addition, a local cache-locking mode permits dynamic locking of selected instructions or data entries. This set of features lets software establish a deterministic response to interrupts and guarantee low-latency access to critical code.

To enhance throughput, the processor has a prefetch buffer and a write buffer (both single-level). A control register allows the programmer to choose between burst mode and prefetch mode. Burst mode accesses memory to completely fill a four-word cache line, but prefetch mode works across cache lines: on an instruction-cache miss, the prefetch buffer performs a read-ahead of the next sequential instruction, ignoring cache-line boundaries. The processor resolves data-cache misses one word at a time, filling each 32-bit word singly from the prefetch buffer rather than filling an entire cache line.

#### MICROPROCESSOR REPORT

## Core Is Common Across Family

Table 1 shows that the SparcLite family now has five members. At the low end is the 86933, which has no cache and runs at 20 MHz. The midrange includes the 40-MHz 86930, with 2K each of instruction and data cache, and the 86931, which also incorporates an interrupt controller, four timers, and four serial interfaces. The former holder of the high-end position, the 40-MHz 86932, has a 2K data cache and a larger 8K instruction cache as well as a two-channel DMA controller and an MMU with a 16-entry TLB.

The new '934 is derived from the 86932 with an altered mix of peripheral units, as Figure 1 shows. Internally, the '934 implements the SPARC v8 architecture to maintain software compatibility with its siblings and with its cousins used in workstations. But the '934 is designed specifically for math-intensive embedded applications, as is demonstrated by its two-order-of-magnitude advantage over the '932 in MFLOPS.

While the performance difference is mostly due to the floating-point hardware, some of the gain comes from a shrink in process size and a higher clock speed. Earlier SparcLite processors are implemented in a 0.8-micron, three-layer-metal CMOS process and operate at 5 V. In contrast, the new '934 is built in a 0.5-micron, three-layer-metal process known as CS-50 (*see* **080504.PDF**) that lets its clock run 50% faster. The chip dissipates less than 3 W at 60 MHz. Initially, it will be packaged in a 256-pin ceramic QFP, but a plastic package will be offered later.

## Range of Tools Already in Place

The '934 incorporates an industry-standard JTAG port for boundary-scan testing, but it also supports hardware emulation with on-chip breakpoint and single-step logic. It includes a 10-bit bus (eight data bits) that lets external hardware trace operations between the core and the cache even if no external address is generated. The design is carried over from the original developed for the first SparcLite, the '930, in collaboration with Step Engineering (Sunnyvale, Calif.); it works with that company's ICE (in-circuit emulator), supplying information into a trace buffer in the ICE.

According to Fujitsu, because the narrow bus runs at processor speed, it suffices to provide full tracing of branches (which are limited to 22 bits of offset from the program counter). If jumps (which can require the bus to transmit more address information) occur close together in the code, the hardware can slow the processor until all the trace information is transmitted. With the processor's five-stage pipeline, it is difficult to place two jumps close enough together to cause this problem.

Users can develop applications on two leading platforms: PC compatibles and SPARC workstations. Sun workstations are widespread and make a useful development platform with the addition of suitable compiler tools. Compilers, ROM-able monitors, real-time operating systems, debuggers, and other development tools are available from numerous sources. Fujitsu also has SparcLite evaluation boards as well as Verilog behavioral models of the processor.

## Price & Availability

Samples of the 60-MHz 86934 SparcLite processor are available now at a cost of \$190. Production units will cost \$110 in 1,000-unit quantities and are promised for late in 3Q94. For further information, contact Fujitsu at 408.456.1260; fax 408.526.9515.

## Speed Comes at a Price

The '934 is not the only embedded processor with an FPU: the AMD 29050 and the Intel i960KB are the volume leaders in this league. LSI's 33050 and IDT's R3081, both based on the MIPS architecture, also have FPUs, as does NEC's V810. The i960KB costs about \$30, much less than the '934, but sustains only 5.2 MFLOPS, about one-tenth the floating-point performance of the Fujitsu chip. The MIPS processors and the NEC chip also fall far behind the '934 in FP performance, although these other chips carry low prices similar to that of the i960KB.

The 29050 provides a much better match for the SparcLite chip on FP benchmarks but still lags for many applications. At \$140 in 10,000-piece lots, it is also more expensive than the '934, which lists for \$110 in 1,000-unit quantities, giving Fujitsu a significant price/performance edge.
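A rough dollars-per-MFLOPS figure makes the comparison concrete. The article gives both a price and a floating-point rating only for the '934 and the i960KB, so the sketch below is limited to those two chips. It uses the 55-MFLOPS headline figure and the "about \$30" price as quoted, and the names `CHIPS` and `dollars_per_mflops` are illustrative assumptions.

```python
# (price in dollars, sustained MFLOPS), both taken from the article text
CHIPS = {
    "86934": (110.0, 55.0),   # 1,000-unit price; headline MFLOPS figure
    "i960KB": (30.0, 5.2),    # "about $30"; 5.2 MFLOPS
}


def dollars_per_mflops(price, mflops):
    """Cost of each MFLOPS of floating-point throughput."""
    return price / mflops


ratios = {name: dollars_per_mflops(*spec) for name, spec in CHIPS.items()}

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"{name}: ${value:.2f} per MFLOPS")
```

On these numbers the '934 costs \$2.00 per MFLOPS against roughly \$5.77 for the i960KB, although the much lower absolute price of the Intel part still matters for designs that do not need the throughput.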

Besides the V810, the 86934 is the only Japanese embedded processor with an FPU. Clearly, the '934 is aimed at a niche market, but this fits Fujitsu's approach: as the only vendor of SPARC processors for embedded applications, it carefully focuses on selected market segments where it can do well. The company works closely with a few key accounts to provide devices targeted for their particular applications. It will continue in this vein and promises to describe new members of the SparcLite family at this year's Microprocessor Forum. The fact that the company manages to continue evolving the processor indicates that a suitable market for it exists, despite its low profile. ◆