ARTIC FAQ: Cashing in on the ARTIC960 caches

History
The Problem
The Solution
The Implementation
Other Cache Features

Content by RadiSys Corporation (archived original). Edited by Major Tom.


History

The original Micro Channel ARTIC960 adapter used an Intel 80960 CA processor. The CA processor had a 1 KB instruction cache and no data cache. After the Micro Channel ARTIC960 adapter was released, it was quickly upgraded to the 80960 CF processor, which has a 4 KB instruction cache and a 1 KB data cache. The ARTIC960 PCI adapter also uses the 80960 CF. The ARTIC960 and ARTIC960 PCI adapters both have a split memory architecture, with a bank of memory optimized for the 80960 (instruction memory) and a bank of memory optimized for other local bus masters (packet memory). Initially, the ARTIC960 kernel enabled the instruction cache on both the CA and CF processors, but it left the data cache disabled on the CF-based adapters. The June, 1996 release (1) (or later) of the ARTIC960 firmware changes this by offering options to enable and control use of the 80960 CF data cache.

The ARTIC960Hx adapter uses an 80960 HD processor. It has 16 KB of instruction cache and 8 KB of data cache. The ARTIC960 firmware supports options to enable and control use of the 80960 HD data cache. The ARTIC960Rx adapter uses an 80960 RP processor. It has 4 KB of instruction cache and 2 KB of data cache. The ARTIC960 firmware does not support the 80960 RP data cache. Both the ARTIC960Hx PCI and ARTIC960Rx PCI adapters use a single bank of memory for both instructions and data.

The Problem

None of the 80960 CA, CF, and HD processors has cache snoop logic. On an ARTIC960 adapter, this means that accesses by the other local bus masters (Miami & Vero on the Cx based cards, Miami PCI-PCI & Volant on the Hx card) are not detected by the 80960. This presents a cache coherency problem for the adapter. Prior to the June, 1996 ARTIC960 firmware release, only the instruction cache was enabled on the ARTIC960 adapters. Since no local bus master accesses the instruction memory during normal operation, the only cache coherency problem arose after a new process was loaded from the system. In this case, the Miami chip acts as a master on the adapter local bus and writes the new process to adapter memory. To handle this situation, the kernel invalidates the instruction cache whenever a new process is started. This ensures that no stale data is in the instruction cache. The data cache, however, is still disabled in this scenario.

The problem with enabling the 80960 data cache is that the other masters on the adapter local bus constantly access the packet memory on the adapter. If the data cache were simply enabled, software would have to manage the cache manually to ensure that it remained coherent as the various masters on the bus all vied for access to memory. For the majority of applications, the overhead of managing this cache coherency in software would have been tremendous and would have offset any potential performance gains from the data cache.
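The stale-data hazard described above can be illustrated with a toy model. This is only a sketch in Python, not ARTIC960 code: the class name, the dictionary standing in for packet memory, and the address and values are all made up for illustration.

```python
class NonSnoopingCache:
    """Toy read cache that never observes writes made by other bus masters."""

    def __init__(self, memory):
        self.memory = memory   # shared adapter memory, modeled as a dict
        self.lines = {}        # address -> cached copy of the value

    def read(self, addr):
        # A hit returns the cached copy without looking at memory again.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def invalidate(self):
        # Software-managed coherency: throw away every cached copy.
        self.lines.clear()


memory = {0x20000100: 0x1111}          # hypothetical packet memory location
cache = NonSnoopingCache(memory)

first = cache.read(0x20000100)         # miss: fills the cache from memory
memory[0x20000100] = 0x2222            # another bus master writes memory directly
stale = cache.read(0x20000100)         # hit: the processor still sees the old value
cache.invalidate()
fresh = cache.read(0x20000100)         # miss again: picks up the new value

print(hex(first), hex(stale), hex(fresh))
```

Without snoop logic, the second read returns the old value until software invalidates the cache, which is the bookkeeping that would have to happen around every access by another master.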

The Solution

To solve this problem and reap the benefits of the 80960 data cache, some hardware assistance in managing the cache was needed. An engineering change (2) that provides this assist was made to the ARTIC960 MCA adapter and was included on all ARTIC960 PCI adapters.(3) This hardware assist was also designed into the ARTIC960Hx PCI adapter.

There are two memories on the ARTIC960Cx based adapters: instruction and packet. Instruction memory is optimized for accesses by the 80960, and packet memory is optimized for accesses from masters on the local bus (Miami and Vero). Packet memory starts at address 0x20000000 on the adapter local bus and instruction memory starts at 0x22000000. The 80960 CA/CF bus controller divides the flat 4 GB memory space into sixteen 256 MB regions. Each region can be independently configured in terms of bus width, number of wait states, byte ordering, caching, etc. Both the ARTIC960 instruction and packet memory reside in region 2. The ARTIC960 firmware enables the instruction cache and disables the data cache for region 2.
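As a quick check of the region arithmetic: sixteen 256 MB regions means the region number is simply the top four bits of a 32-bit address. The helper below is an illustrative Python sketch (the function and constant names are not part of any ARTIC960 API).

```python
REGION_SIZE = 1 << 28   # 256 MB; 16 regions cover the 4 GB address space


def region_of(addr):
    """Return the region number (0-15) selected by the top four address bits."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    return addr >> 28


PACKET_BASE = 0x20000000        # packet memory base, from the text
INSTRUCTION_BASE = 0x22000000   # instruction memory base, from the text

# Both bases decode to region 2.
print(region_of(PACKET_BASE), region_of(INSTRUCTION_BASE))
```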

On the ARTIC960Hx, there is only a single memory, which resides at 0xA0000000. It is used for both instructions and data. The 80960 Hx bus controller also divides the flat 4 GB memory space into sixteen 256 MB regions. However, the 80960 Hx memory controller only allows the physical characteristics of these regions (bus width, wait states, etc.) to be controlled individually. The byte ordering of the memory and the use of the data cache are controlled through a separate set of logical (as opposed to physical) configuration registers.

The hardware assist involves remapping physical memory into multiple regions on the bus from the 80960 viewpoint. For the ARTIC960Cx based adapters, the assist maps the physical memory in region 2 to region A also (as depicted in the diagram to the right). Put another way, references by the 80960 to region A are remapped by the adapter to region 2. This allows the 80960 to access the same physical memory location via either region 2 or A. For example, the memory at 0x22001020 can also be accessed at 0xA2001020.
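The aliasing on the Cx based adapters amounts to swapping the top four address bits between 0x2 and 0xA while keeping the 28-bit offset. A small Python sketch of that arithmetic follows; the function names are illustrative only.

```python
REGION_SHIFT = 28
OFFSET_MASK = (1 << REGION_SHIFT) - 1


def alias_in_region(addr, region):
    """Move a 32-bit address into another 256 MB region, keeping its offset."""
    return (region << REGION_SHIFT) | (addr & OFFSET_MASK)


def cx_aliases(addr):
    """Return the region 2 and region A addresses of one physical location."""
    if addr >> REGION_SHIFT not in (0x2, 0xA):
        raise ValueError("address is outside regions 2 and A")
    return alias_in_region(addr, 0x2), alias_in_region(addr, 0xA)


# The example from the text: 0x22001020 is also visible at 0xA2001020.
print([hex(a) for a in cx_aliases(0x22001020)])
```

Starting from either alias gives the same pair, so `cx_aliases(0xA2001020)` returns the same two addresses.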

On the ARTIC960Hx PCI adapter, a similar hardware assist is used. In this case, the 512 MB address range 0xA0000000-0xBFFFFFFF (two 256 MB regions) is divided into four 128 MB windows, and the memory is mapped once into each window, four times in total. This allows the 80960 Hx to access the same physical memory location via an address of 0xA0xxxxxx, 0xA8xxxxxx, 0xB0xxxxxx or 0xB8xxxxxx. For example, the memory at 0xA0002060 can also be accessed at 0xA8002060, 0xB0002060 or 0xB8002060.
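The four Hx aliases of a location can be computed from its offset within a 128 MB window. The sketch below is illustrative Python, not firmware code; the names are made up, and it says nothing about which window carries which cache or byte-order setting.

```python
HX_BASE = 0xA0000000       # start of the mapped range, from the text
WINDOW_SIZE = 0x08000000   # 128 MB
WINDOW_COUNT = 4


def hx_aliases(addr):
    """Return all four addresses that reach the same physical location."""
    if not HX_BASE <= addr < HX_BASE + WINDOW_COUNT * WINDOW_SIZE:
        raise ValueError("address is outside the 0xA0000000-0xBFFFFFFF range")
    offset = (addr - HX_BASE) % WINDOW_SIZE
    return [HX_BASE + i * WINDOW_SIZE + offset for i in range(WINDOW_COUNT)]


# The example from the text: 0xA0002060 and its three other mappings.
print([hex(a) for a in hx_aliases(0xA0002060)])
```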

So now that there are multiple mappings of memory, how does the firmware take advantage of the data cache? The June, 1996 release (or later) of the ARTIC960 firmware uses this hardware assist to allow use of the 80960 data cache by configuring one mapping with the data cache disabled and one with the data cache enabled. Additionally, on the ARTIC960Hx, the firmware makes use of the two extra mappings to provide both little-endian and big-endian views of memory, each available with the data cache disabled or enabled.

What about the ARTIC960Rx PCI adapter? Since the memory controller for the 80960RP is totally integrated into the RP, the simple multiple memory mapping hardware assist can't be used. Although the 80960RP also has logical memory configuration capability, the lack of a dual memory mapping makes the implementation much more involved. At this point in time, we have chosen not to support the data cache on the ARTIC960Rx PCI adapter.

The Implementation

Now that the ARTIC960 adapter has a memory region with data cache enabled, how does software take advantage of it? First of all, if you do nothing, things work just as they did before and the data cache is not used, thus ensuring backwards compatibility. The first step to take advantage of the data cache is to tell the kernel to enable it by passing the kernel the parameter "DATA_CACHE=YES" when it is loaded. The default for the "DATA_CACHE" kernel parameter is NO. The next step, which yields an immediate benefit from the data cache, is to tell the loader (ricload) to load all the ARTIC960 firmware (ric_kern.rel, ric_mcio.rel, ric_scb.rel) with full data caching. This is done via the new parameter "-dn", which is passed to ricload. The -dn switch currently has 4 valid values for n:

n    Meaning
0    None of the process is loaded as cached (the default).
1    The process stack will be cached.
2    The process data section will be cached.
3    Both the process stack and data sections will be cached.

The loader will use this data cache parameter to determine how the process's memory should be allocated and will relocate the process accordingly during the load.
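The four values of n read naturally as a two-bit mask: bit 0 selects the stack and bit 1 selects the data section. That reading is an inference from the table rather than something the documentation states, and the names below are hypothetical. A short Python sketch:

```python
STACK_CACHED = 0x1   # assumed meaning of bit 0, inferred from the table
DATA_CACHED = 0x2    # assumed meaning of bit 1, inferred from the table


def decode_cache_option(n):
    """Describe which parts of a process are loaded into cached memory."""
    if n not in (0, 1, 2, 3):
        raise ValueError("n must be 0, 1, 2 or 3")
    return {"stack": bool(n & STACK_CACHED), "data": bool(n & DATA_CACHED)}


for n in range(4):
    print(n, decode_cache_option(n))
```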

Virtually all ARTIC960 processes should be able to use the -d 3 parameter when loading to have both their data section and stack loaded and relocated to the cacheable region. The only exception is a process that allows a master other than the 80960 to directly access its default data section or stack. Typical processes do not do this, since the buffers accessed by these devices are normally allocated separately from the process load image to ensure that they are optimally located in packet memory.(4)

So, most applications can take immediate advantage of the data cache by loading both the ARTIC960 firmware and the application processes with both the stack and data sections cached. Note that this can be done without any changes to the existing application code. Additionally, for those applications willing to make minor changes, the kernel's memory allocation APIs now have an additional option bit (MEM_DCACHE) to request the cached address for the allocated memory. For allocated memory that is only accessed by the 80960, simply adding this bit to the CreateMem() or MallocMem() call options will cause that allocated memory to be cached.

Other Cache Features

Beyond the data cache features in the ARTIC960 firmware, a few other kernel parameters allow the application to fine tune how the 80960 caches are used.

Stack Frame Caching

One of the features of the 80960 is that it internally caches the local register set (a stack frame) to improve the performance of calls and returns. On a typical subroutine call, the local register set, or stack frame, is not physically pushed into memory. Instead, it is merely pushed into the internal register cache. Only when successive subroutine calls fill the register frame cache do frames spill out into physical memory. Conversely, only when successive returns empty the register frame cache are frames fetched back from memory into the frame cache. The number of frames cached can be configured by the kernel REG_CACHE parameter. The default setting for this parameter is REG_CACHE=7. The allowable range of values for REG_CACHE is 5-15 inclusive.

The obvious immediate temptation is to configure REG_CACHE to its maximum value of 15; however, there is a drawback that can cause such a setting to actually degrade performance. On a process context switch, the kernel switches process stacks. To do this, it must flush the 80960 internal stack frame cache. If the configured value of REG_CACHE is too high, the benefits of the frame caching may be negated by the drawbacks at context switch time. The optimal setting for REG_CACHE is highly application dependent. If the application processes in general make many nested subroutine calls and run for long durations before yielding to a context switch, higher values may improve performance. If the application doesn't nest subroutine calls very deeply or runs for only short periods before yielding to a context switch, lower values should improve performance. How closely a process stays to a median call depth and how often context switches occur are the key factors affecting the optimal setting for REG_CACHE.

Note also that configuring REG_CACHE with values greater than 5 uses 16 bytes of the 80960 internal data RAM for each stack cache frame greater than 5. For instance, the kernel default setting of 7 uses 32 bytes of the 80960 internal data RAM.
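The internal data RAM cost follows directly from the figures above: 16 bytes for each frame beyond the first five, within the allowed range of 5 to 15. A small Python helper (an illustrative name, not a firmware function) makes the arithmetic explicit:

```python
def reg_cache_ram_bytes(frames):
    """Internal data RAM consumed by a given REG_CACHE setting, per the text."""
    if not 5 <= frames <= 15:
        raise ValueError("REG_CACHE must be between 5 and 15 inclusive")
    return (frames - 5) * 16


# The default of 7 costs 32 bytes; the maximum of 15 costs 160 bytes.
print(reg_cache_ram_bytes(7), reg_cache_ram_bytes(15))
```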

Instruction Caching

The ARTIC960 kernel also has the option to operate with the instruction cache disabled. This can be controlled with the kernel INSTR_CACHE parameter. The default setting is INSTR_CACHE=YES. To disable the instruction cache, simply load the kernel with the parameter INSTR_CACHE=NO.

Another set of options available on the ARTIC960Hx & ARTIC960Rx PCI adapters is the capability to pin certain portions of the kernel in one "way" of the instruction cache. These options are PIN_KERN_INT_CODE=YES, to pin code critical to interrupt-intensive applications (like first level interrupt handlers), and PIN_KERN_PROC_CODE=YES, to pin code critical to process-intensive applications (like the kernel dispatching routines). Both of these parameters default to NO if unspecified.

In all cases, developers should test & measure the performance effects of the caching options individually with their applications to determine what configuration best suits the application. In some cases, improper configuration can actually hurt total system performance.

Notes

  • (1) The June, 1996 release is ARTIC960 AIX Support Version 1.1.4 or ARTIC960 OS/2 Support Version 1.1.1.
  • (2) To determine if an ARTIC960 MCA adapter has the data cache engineering change, the status utility (ricstat) can be used to display the configuration of the adapter.
  • (3) To fully take advantage of the kernel data cache capabilities, the adapter ROM version must be at least 1.5.2 for ARTIC960 MCA or 2.0.4 for ARTIC960 PCI.
  • (4) Note that on ARTIC960 MCA & ARTIC960 PCI, the DMA channel descriptor blocks (CDBs) and posted status areas must also be allocated in non-cached memory since they are directly accessed by Miami and Vero.

Content created and/or collected by:
Louis F. Ohland, Peter H. Wendt, David L. Beem, William R. Walsh, Tatsuo Sunagawa, Tomáš Slavotínek, Jim Shorney, Tim N. Clarke, Kevin Bowling, and many others.

Ardent Tool of Capitalism is maintained by Tomáš Slavotínek.
Last update: 29 Sep 2024