The June, 1996 release(1) (or later) of the ARTIC960 firmware changes this by offering options to enable and control use of the 80960 CF data cache.
The ARTIC960Hx adapter uses an 80960HD processor, which has 16 KB of instruction cache and 8 KB of data cache. The ARTIC960 firmware supports options to enable and control use of the 80960HD data cache. The ARTIC960Rx adapter uses an 80960RP processor, which has 4 KB of instruction cache and 2 KB of data cache. The ARTIC960 firmware does not support the 80960RP data cache. Both the ARTIC960Hx PCI and ARTIC960Rx PCI adapters use a single bank of memory for both instructions and data.
The problem with enabling the 80960 data cache is that the other masters on the adapter local bus constantly access the packet memory on the adapter. If the data cache were simply enabled, software would have to manage the cache manually to keep it coherent as the various bus masters competed for access to memory. For most applications, the overhead of managing cache coherency in software would have been large enough to offset any performance gained from the data cache.
To solve this problem and reap the benefits of the 80960 data cache, some hardware assistance in managing the cache was needed. An engineering change(2) that provides this assist was made to the Micro Channel ARTIC960 MCA adapter and was incorporated into all ARTIC960 PCI adapters.(3) This hardware assist was also designed into the ARTIC960Hx PCI adapter.
There are two memories on the ARTIC960Cx-based adapters: instruction memory and packet memory. Instruction memory is optimized for accesses by the 80960, and packet memory is optimized for accesses from masters on the local bus (Miami and Vero). Packet memory starts at address 0x20000000 on the adapter local bus and instruction memory starts at 0x22000000. The 80960CA/CF bus controller divides the flat 4 GB memory space into sixteen 256 MB regions. Each region can be independently configured in terms of bus width, number of wait states, byte ordering, caching, etc. Both the ARTIC960 instruction and packet memories reside in region 2. The ARTIC960 firmware enables the instruction cache and disables the data cache for region 2.
On the ARTIC960Hx, there is only a single memory, which resides at 0xA0000000 and is used for both instructions and data. The 80960Hx bus controller also divides the flat 4 GB memory space into sixteen 256 MB regions. However, the 80960Hx memory controller only allows the physical characteristics of these regions (bus width, wait states, etc.) to be controlled individually. Byte ordering (endianness) and data caching are controlled through a separate set of logical (as opposed to physical) configuration registers.
The hardware assist involves remapping physical memory into multiple regions on the bus from the 80960 viewpoint. For the ARTIC960Cx based adapters, the assist maps the physical memory in region 2 to region A also (as depicted in the diagram to the right). Put another way, references by the 80960 to region A are remapped by the adapter to region 2. This allows the 80960 to access the same physical memory location via either region 2 or A. For example, the memory at 0x22001020 can also be accessed at 0xA2001020.
On the ARTIC960Hx PCI adapter, a similar hardware assist is used. In this case, the 512 MB address range 0xA0000000-0xBFFFFFFF is divided into four 128 MB windows, each of which maps the same physical memory. This allows the 80960Hx to access the same physical memory location via an address of 0xA0xxxxxx, 0xA8xxxxxx, 0xB0xxxxxx or 0xB8xxxxxx. For example, the memory at 0xA0002060 can also be accessed at 0xA8002060, 0xB0002060 or 0xB8002060.
So now that there are multiple mappings of memory, how does the firmware take advantage of the data cache? The June 1996 release (or later) of the ARTIC960 firmware uses this hardware assist to enable the 80960 data cache by configuring one mapping with the data cache disabled and another with the data cache enabled. On the ARTIC960Hx, the firmware additionally uses the two extra mappings to provide both little-endian and big-endian views of memory, each available with the data cache disabled and enabled.
What about the ARTIC960Rx PCI adapter? Because the memory controller for the 80960RP is fully integrated into the RP, the simple multiple-mapping hardware assist can't be used. Although the 80960RP also has logical memory configuration capability, the lack of a dual memory mapping makes the implementation much more involved. At this time, we have chosen not to support the data cache on the ARTIC960Rx PCI adapter.
Now that the ARTIC960 adapter has a memory region with the data cache enabled, how does software take advantage of it? First, if you do nothing, everything works just as it did before and the data cache is not used, ensuring backward compatibility. The first step in using the data cache is to tell the kernel to enable it by passing the kernel the parameter "DATA_CACHE=YES" when it is loaded. The default for the "DATA_CACHE" kernel parameter is NO. The next step, which yields an immediate benefit, is to tell the loader (ricload) to load all the ARTIC960 firmware (ric_kern.rel, ric_mcio.rel, ric_scb.rel) with full data caching. This is done via the new "-dn" parameter passed to ricload. The -dn switch currently has 4 valid values for n.
Virtually all ARTIC960 processes should be able to use the -d 3 parameter when loading to have both their data section and stack loaded and relocated to the cacheable region. The only exception is a process that allows a master other than the 80960 to directly access its default data section or stack. Typical processes do not do this, since the buffers accessed by these devices are normally allocated separately from the process load image so that they are optimally located in packet memory.(4)
So most applications can take immediate advantage of the data cache by loading both the ARTIC960 firmware and the application processes with both the stack and data sections cached. Note that this requires no changes to existing application code. Additionally, for applications willing to make minor changes, the kernel's memory allocation APIs now have an additional option bit (MEM_DCACHE) to request the cached address for the allocated memory. For allocated memory that is accessed only by the 80960, simply adding this bit to the CreateMem() or MallocMem() call options causes that memory to be cached.
Stack Frame Caching
One of the features of the 80960 is that it internally caches the local register set (a stack frame) to improve the performance of calls and returns. On a typical subroutine call, the local register set, or stack frame, is not pushed physically into memory. Instead, it is merely pushed into the internal register cache. Only when successive subroutine calls fill the register frame cache do frames spill out into physical memory. Conversely, only when successive returns empty the register frame cache are frames fetched back from memory into the frame cache. The number of frames cached can be configured by the kernel REG_CACHE parameter. The default setting for this parameter is REG_CACHE=7. The allowable range of values for REG_CACHE is 5-15 inclusive.
The obvious immediate temptation is to configure REG_CACHE to its maximum value of 15; however, there is a drawback that can cause such a setting to actually degrade performance. On a process context switch, the kernel switches process stacks. To do this, it must flush the 80960 internal stack frame cache. If the configured value of REG_CACHE is too high, the benefits of the frame caching may be negated by the drawbacks at context switch time. The optimal setting for REG_CACHE is highly application dependent. If the application processes in general make many nested subroutine calls and run for long durations before yielding to a context switch, higher values may improve performance. If the application doesn't nest subroutine calls very deeply or runs for only short periods before yielding to a context switch, lower values should improve performance. How closely a process stays to a median call depth and how often context switches occur are the key factors affecting the optimal setting for REG_CACHE.
Note also that configuring REG_CACHE with a value greater than 5 uses 16 bytes of the 80960 internal data RAM for each cached frame beyond 5. For instance, the kernel default setting of 7 uses 32 bytes of the 80960 internal data RAM.
Instruction Caching
The ARTIC960 kernel also has the option to operate with the instruction cache disabled. This can be controlled with the kernel INSTR_CACHE parameter. The default setting is INSTR_CACHE=YES. To disable the instruction cache, simply load the kernel with the parameter INSTR_CACHE=NO.
Another set of options available on the ARTIC960Hx & ARTIC960Rx PCI adapters is the capability to pin certain portions of the kernel in one "way" of the instruction cache. These options are PIN_KERN_INT_CODE=YES to pin code critical to interrupt intensive applications (like first level interrupt handlers) and PIN_KERN_PROC_CODE=YES to pin code critical to process-intensive applications (like the kernel dispatching routines). Both of these parameters default to NO if unspecified.
In all cases, developers should test and measure the performance effects of each caching option individually with their applications to determine which configuration best suits the application. In some cases, improper configuration can actually hurt total system performance.
1999 RadiSys Corporation