History
The Problem
The Solution
The Implementation
Other Cache Features
Content by RadiSys Corporation (original archived HERE).
Edited by Major Tom.
History
The original
Micro Channel ARTIC960 adapter used an Intel 80960 CA processor. The CA processor
had a 1 KB instruction cache and no data cache. After the Micro Channel
ARTIC960 adapter was released, it was quickly upgraded to the 80960 CF processor
which has a 4 KB instruction cache and a 1 KB data cache. The ARTIC960 PCI
adapter also uses the 80960 CF. The ARTIC960 and ARTIC960 PCI adapters both
had a split memory architecture with a bank of memory optimized for the
80960 (instruction memory) and a bank of memory optimized for other local
bus masters (packet memory). Initially, the ARTIC960 kernel enabled the
instruction cache on both the CA and CF processors, but it left the data
cache disabled on the CF based adapters. The June, 1996 release (1)
(or later) of the ARTIC960 firmware changes this by offering options to
enable and control use of the 80960 CF data cache.
The ARTIC960Hx
adapter uses an 80960HD processor. It has 16 KB of instruction cache and
8 KB of data cache. The ARTIC960 firmware supports options to enable and
control use of the 80960 HD data cache. The ARTIC960Rx adapter uses an
80960RP processor. It has 4 KB of instruction cache and 2 KB of data cache.
The ARTIC960 firmware does not support the 80960 RP data cache. Both
the ARTIC960Hx PCI and ARTIC960Rx PCI adapters use a single bank of memory
for both instructions and data.
The Problem
The 80960 CA, CF, and HD processors
all lack cache snoop logic. On an ARTIC960 adapter,
this means that accesses by the other local bus masters (Miami & Vero on
the Cx based cards, Miami PCI-PCI & Volant on the Hx card) are not detected
by the 80960. This presents a cache coherency problem for the adapter. Prior
to the June, 1996 ARTIC960 firmware release, only the instruction cache
was enabled on the ARTIC960 adapters. Since no local busmaster accesses
the instruction memory, the only cache coherency problems were after a new
process load from the system. In this case, the Miami chip acts as a master
on the adapter local bus and writes the new process to adapter memory. To
handle this situation, the kernel invalidates the instruction cache whenever
a new process is started. This ensures that no stale data is in the instruction
cache. The data cache, however, is still disabled in this scenario.
The problem
with enabling the 80960 data cache is that the other masters on the adapter
local bus constantly access the packet memory on the adapter. If the data
cache were simply enabled, software would have to manually manage the cache
to ensure that it remained coherent as the various masters on the bus
all vied for access to memory. For the majority of applications, the overhead
of managing this cache coherency in software would have been tremendous
and would have offset any potential performance gains from the data cache.
The Solution
To solve this problem and reap the benefits of the 80960 data cache, some
hardware assistance in managing the cache was needed. An engineering change
(2) that provides this assist was made to the Micro
Channel ARTIC960 adapter and was included on all ARTIC960 PCI adapters. (3)
This hardware assist was also designed into the ARTIC960Hx PCI adapter.
There are
two memories on the ARTIC960Cx based adapters: instruction and packet. Instruction
memory is optimized for accesses by the 80960 and packet memory is optimized
for accesses from masters on the local bus (Miami and Vero). Packet memory
starts at address 0x20000000 on the adapter local bus and instruction
memory starts at 0x22000000. The 80960 CA/CF bus controller divides the
flat 4 GB memory space into sixteen 256 MB regions. Each region can be independently
configured in terms of bus width, number of wait states, byte ordering,
caching, etc. Both the ARTIC960 instruction and packet memory reside in
region 2. The ARTIC960 firmware enables the instruction cache and disables
the data cache for region 2.
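As a rough illustration of the region arithmetic described above, the following Python sketch computes which 256 MB region a 32-bit address falls in. The function and constant names are made up for this example and are not part of the ARTIC960 firmware; the only facts used are the sixteen 256 MB regions and the two base addresses given in the text.

```python
# Sketch only: names below are illustrative, not firmware identifiers.
REGION_SHIFT = 28  # 4 GB / 16 regions = 256 MB = 2**28 bytes per region

PACKET_MEMORY_BASE = 0x20000000       # from the text
INSTRUCTION_MEMORY_BASE = 0x22000000  # from the text


def region_of(address: int) -> int:
    """Return the 256 MB region number (0x0-0xF) containing a 32-bit address."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    return address >> REGION_SHIFT


if __name__ == "__main__":
    for base in (PACKET_MEMORY_BASE, INSTRUCTION_MEMORY_BASE):
        print(f"0x{base:08X} is in region 0x{region_of(base):X}")
```

Both base addresses land in region 2, which is why a single region configuration covers both memories on the Cx based adapters.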
On the ARTIC960
Hx, there is only a single memory which resides at 0xA0000000. It is used
for both instructions and data. The 80960 Hx bus controller also divides
the flat 4 GB memory space into sixteen 256 MB regions. However, the 80960
Hx memory controller only allows the physical characteristics of these
regions (bus width, wait states, etc.) to be controlled individually. The
endianness of the memory and the data caching are controlled through a separate
set of logical (as opposed to physical) configuration registers.
The hardware assist involves remapping physical memory into multiple regions
on the bus from the 80960 viewpoint. For the ARTIC960Cx based adapters,
the assist maps the physical memory in region 2 to region A also (as depicted
in the diagram to the right). Put another way, references by the 80960
to region A are remapped by the adapter to region 2. This allows the 80960
to access the same physical memory location via either region 2 or A.
For example, the memory at 0x22001020 can also be accessed at 0xA2001020.
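The dual mapping on the Cx based adapters amounts to swapping the top four address bits between 2 and A while keeping the offset within the region. The Python sketch below shows that arithmetic only; the helper name is hypothetical, and it does not model which of the two mappings the firmware configures as cached.

```python
# Sketch only: cx_alias is an illustrative helper, not a firmware API.
REGION_SHIFT = 28
OFFSET_MASK = (1 << REGION_SHIFT) - 1  # low 28 bits: offset inside a region
CX_ALIASED_REGIONS = (0x2, 0xA)        # regions that see the same physical memory


def cx_alias(address: int, region: int) -> int:
    """Return the address of the same physical location as seen through `region`."""
    if region not in CX_ALIASED_REGIONS:
        raise ValueError("only regions 0x2 and 0xA are aliased")
    if (address >> REGION_SHIFT) not in CX_ALIASED_REGIONS:
        raise ValueError("address is not in region 0x2 or 0xA")
    return (region << REGION_SHIFT) | (address & OFFSET_MASK)


if __name__ == "__main__":
    print(hex(cx_alias(0x22001020, 0xA)))
    print(hex(cx_alias(0xA2001020, 0x2)))
```

Converting in either direction round-trips, so software can translate a pointer obtained through one mapping into the other.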
On the ARTIC960
Hx PCI adapter, a similar hardware assist is used. In this case, the 512 MB
range 0xA0000000-0xBFFFFFFF is divided into four 128 MB windows and the memory
is mapped once into each window. This allows the 80960 Hx to access the same
physical memory location via an address of 0xA0xxxxxx, 0xA8xxxxxx, 0xB0xxxxxx
or 0xB8xxxxxx. For example, the memory at 0xA0002060 can also be accessed
at 0xA8002060, 0xB0002060 or 0xB8002060.
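The four Hx mappings can be enumerated from any one of them by keeping the offset within a 128 MB window and substituting each window base. This is a minimal Python sketch of that calculation; the names are hypothetical, and the text does not say which window carries which endian or cache setting, so the sketch makes no such assignment.

```python
# Sketch only: names are illustrative, not firmware identifiers.
HX_WINDOW_SIZE = 0x08000000  # 128 MB
HX_WINDOW_BASES = (0xA0000000, 0xA8000000, 0xB0000000, 0xB8000000)


def hx_aliases(address: int) -> list:
    """Return all four addresses that reach the same physical location."""
    low = HX_WINDOW_BASES[0]
    high = HX_WINDOW_BASES[-1] + HX_WINDOW_SIZE
    if not low <= address < high:
        raise ValueError("address is outside the mapped Hx memory windows")
    offset = address & (HX_WINDOW_SIZE - 1)  # position inside the window
    return [base + offset for base in HX_WINDOW_BASES]


if __name__ == "__main__":
    print([hex(a) for a in hx_aliases(0xA0002060)])
```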
So now that
there are multiple mappings of memory, how does the firmware take advantage
of the data cache? The June, 1996 release (or later) of the ARTIC960 firmware
takes advantage of this hardware assist to allow use of the 80960 data
cache by configuring one mapping without data cache enabled and one with
the data cache enabled. Additionally, on the ARTIC960Hx, the firmware makes
use of the two extra mappings to provide both little-endian and big-endian
views of memory, each with the data cache disabled and enabled.
What about
the ARTIC960Rx PCI adapter? Since the memory controller for the 80960RP
is totally integrated into the RP, the simple multiple memory mapping
hardware assist can't be used. Although the 80960RP also has logical memory
configuration capability, the lack of a dual memory mapping makes the
implementation much more involved. At this point in time, we have chosen
not to support the data cache on the ARTIC960Rx PCI adapter.
The Implementation
Now that
the ARTIC960 adapter has a memory region with data cache enabled, how
does software take advantage of it? First of all, if you do nothing, things
work just as they did before and the data cache is not used, thus ensuring
backwards compatibility. The first step to take advantage of the data
cache is to tell the kernel to enable it by passing the kernel the parameter
"DATA_CACHE=YES" when it is loaded. The default for the "DATA_CACHE"
kernel parameter is NO. The next step to immediately benefit from
the data cache is to tell the loader (ricload) to load all the
ARTIC960 firmware (ric_kern.rel, ric_mcio.rel, ric_scb.rel) with full
data caching. This is done via the new parameter "-dn" which
is passed to ricload. The -dn switch currently has
4 valid values for n.
n | Meaning |
0 | None of the process is loaded as cached (the default). |
1 | The process stack will be cached. |
2 | The process data section will be cached. |
3 | Both the process stack and data sections will be cached. |
The loader will
use this data cache parameter to determine how the process' memory should
be allocated and relocate the process accordingly during the load.
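The four values in the table behave like a two-bit mask, with one bit for the stack and one for the data section. The Python sketch below decodes a value on that reading. Treating n as a bit mask is an assumption that happens to match the table, and the function name is made up; this is not ricload's actual implementation.

```python
# Sketch only: the bit-mask interpretation is an assumption consistent with the table.
STACK_CACHED_BIT = 1  # n = 1: process stack cached
DATA_CACHED_BIT = 2   # n = 2: process data section cached


def decode_dcache_option(n: int) -> dict:
    """Decode the loader's data cache value n (0-3) into two flags."""
    if n not in (0, 1, 2, 3):
        raise ValueError("n must be 0, 1, 2 or 3")
    return {
        "stack_cached": bool(n & STACK_CACHED_BIT),
        "data_cached": bool(n & DATA_CACHED_BIT),
    }


if __name__ == "__main__":
    for n in range(4):
        print(n, decode_dcache_option(n))
```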
Virtually all ARTIC960 processes should be able to use the -d 3
parameter when loading to have both their data section and stack loaded and
relocated to the cache-able region. The only exception to this would be if the
process was allowing a master other than the 80960 to directly access its
default data section or stack. For typical processes, this is not done since
the buffers accessed by these devices are normally allocated separately from
the process load image to ensure that they are optimally located in packet
memory.(4)
So, most applications can take immediate advantage of the data cache by
loading both the ARTIC960 firmware and the application processes with both the
stack and data sections cached. Note that this can be done without any
changes to the existing application code. Additionally, for those applications
willing to make minor changes, the kernel's memory allocation APIs now
have an additional option bit (MEM_DCACHE) to request the cached
address for the allocated memory. For allocated memory that is only accessed
by the 80960, simply adding this bit to the CreateMem() or
MallocMem() call options will cause that allocated memory to be
cached.
Other Cache Features
Beyond the data
cache features in the ARTIC960 firmware, a few other kernel parameters allow
the application to fine tune how the 80960 caches are used.
Stack Frame Caching
One of the features
of the 80960 is that it internally caches the local register set (a stack
frame) to improve the performance of calls and returns. On a typical subroutine
call, the local register set, or stack frame, is not pushed physically into
memory. Instead, it is merely pushed into the internal register cache. Only
when the successive subroutine calls fill the register frame cache do frames
spill out into physical memory. Conversely, only when successive returns
empty the register frame cache are frames fetched back from memory into
the frame cache. The number of frames cached can be configured by the kernel
REG_CACHE parameter. The default setting for this parameter is
REG_CACHE=7. The allowable range of values for REG_CACHE is
5-15 inclusive.
The obvious
immediate temptation is to configure REG_CACHE to its maximum value
of 15; however, there is a drawback that can cause such a setting to actually
degrade performance. On a process context switch, the kernel switches
process stacks. To do this, it must flush the 80960 internal stack frame
cache. If the configured value of REG_CACHE is too high, the benefits
of the frame caching may be negated by the drawbacks at context switch
time. The optimal setting for REG_CACHE is highly application dependent.
If the application processes in general make many nested subroutine calls
and run for long durations before yielding to a context switch, higher
values may improve performance. If the application doesn't nest subroutine
calls very deeply or runs for only short periods before yielding to a
context switch, lower values should improve performance. How closely a
process stays to a median call depth and how often context switches occur
are the key factors affecting the optimal setting for REG_CACHE.
Note also
that configuring REG_CACHE with values greater than 5 uses 16 bytes
of the 80960 internal data RAM for each stack cache frame greater than
5. For instance, the kernel default setting of 7 uses 32 bytes of the
80960 internal data RAM.
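The internal data RAM cost follows directly from the figures above: 16 bytes for every frame beyond 5, within the allowed range of 5 to 15. A small Python sketch of that calculation, with a made-up function name, might look like this:

```python
# Sketch only: figures come from the text; the function name is illustrative.
REG_CACHE_MIN = 5
REG_CACHE_MAX = 15
BYTES_PER_EXTRA_FRAME = 16


def reg_cache_data_ram_bytes(reg_cache: int = 7) -> int:
    """Return the internal data RAM consumed by a given REG_CACHE setting."""
    if not REG_CACHE_MIN <= reg_cache <= REG_CACHE_MAX:
        raise ValueError("REG_CACHE must be between 5 and 15 inclusive")
    return (reg_cache - REG_CACHE_MIN) * BYTES_PER_EXTRA_FRAME


if __name__ == "__main__":
    for setting in (5, 7, 15):
        print(setting, reg_cache_data_ram_bytes(setting))
```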
Instruction Caching
The ARTIC960
kernel also has the option to operate with the instruction cache disabled.
This can be controlled with the kernel INSTR_CACHE parameter. The
default setting is INSTR_CACHE=YES. To disable the instruction cache,
simply load the kernel with the parameter INSTR_CACHE=NO.
Another
set of options available on the ARTIC960Hx & ARTIC960Rx PCI adapters is
the capability to pin certain portions of the kernel in one "way" of the
instruction cache. These options are PIN_KERN_INT_CODE=YES to pin
code critical to interrupt intensive applications (like first level interrupt
handlers) and PIN_KERN_PROC_CODE=YES to pin code critical to process-intensive
applications (like the kernel dispatching routines). Both of these parameters
default to NO if unspecified.
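The cache-related kernel parameters and their documented defaults can be collected in one place. The Python sketch below merges a set of overrides onto those defaults and rejects values outside the documented choices. It is only a way to reason about a configuration before loading the kernel: the function name is hypothetical, it does not reproduce how the kernel parses its parameters, and it does not check that the pinning options are only available on the ARTIC960Hx and ARTIC960Rx PCI adapters.

```python
# Sketch only: parameter names and defaults come from the text;
# the validation helper itself is illustrative.
KERNEL_CACHE_DEFAULTS = {
    "DATA_CACHE": "NO",
    "INSTR_CACHE": "YES",
    "REG_CACHE": 7,
    "PIN_KERN_INT_CODE": "NO",
    "PIN_KERN_PROC_CODE": "NO",
}


def resolve_cache_parameters(overrides: dict) -> dict:
    """Apply overrides to the documented defaults, validating each value."""
    params = dict(KERNEL_CACHE_DEFAULTS)
    for name, value in overrides.items():
        if name not in params:
            raise KeyError(f"unknown cache parameter: {name}")
        if name == "REG_CACHE":
            if not isinstance(value, int) or not 5 <= value <= 15:
                raise ValueError("REG_CACHE must be an integer from 5 to 15")
        elif value not in ("YES", "NO"):
            raise ValueError(f"{name} must be YES or NO")
        params[name] = value
    return params


if __name__ == "__main__":
    print(resolve_cache_parameters({"DATA_CACHE": "YES", "REG_CACHE": 9}))
```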
In all cases,
developers should test & measure the performance effects of the caching
options individually with their applications to determine what configuration
best suits the application. In some cases, improper configuration can
actually hurt total system performance.
Notes