-----------------------------------------------------------------
FLOATING-POINT DIVISION WITH OPTIONAL CHECKING TO ENSURE FULL
RESULT PRECISION
-----------------------------------------------------------------
Questions pertaining to documents and code related to the Intel
FDIV software patch can be directed to software-support@intel.com
or FAXed to (408)765-5165 attention FDIV PATCH.
CONTENTS
--------
REVISION HISTORY
OBJECTIVE
BACKGROUND
PATCH PROCESS STRATEGY
THE CORE ALGORITHM
RECOMPILATION
ISVs - END-USER MODIFICATIONS
PATCH IMPLEMENTATION
IMPLEMENTATION VARIATIONS
LIBRARIES
TECHNICAL NOTES
    Running the Patch on Processors Preceding the
    Intel486(tm) Processor
    Detecting the Floating-Point Unit
    Mnemonic Interpretation
    Scaling Factor
    Scaling Exceptions
    Precision Loss
    FPU Status Word
RELATED PROOFS
    Safety of the Logarithmic Instructions FYL2X and FYL2XP1
        FYL2X
        FYL2XP1
    Identifying Problematic Divisors
TESTING AND VALIDATION
CYCLE COST CONSIDERATIONS
APPENDIX A Division with optional checking functions summary
APPENDIX B Division with optional checking files summary
APPENDIX C Patch code revision history
REVISION HISTORY
----------------
011395
------
Added revision history section to patch document.
Retitled the section "Running the Patch on an Intel386(tm) Pro-
cessor" to "Running the Patch on Processors Preceding the
Intel486(tm) Processor." Added text indicating the need to check
for the presence of a floating-point unit.
Added Table 2, Execution times of FDIV patch with memory
operands, under CYCLE COST CONSIDERATIONS.
Added APPENDIX C containing patch code revision history.
OBJECTIVE
---------
The following document describes an Intel-approved software ap-
proach to floating-point division that utilizes proved software
algorithms and existing hardware instructions. Using this ap-
proach overcomes the possibility of a reduction in precision due
to a floating-point division flaw in some steppings of the
Pentium(tm) processor.
The objectives of this approved approach are to provide a method
for floating-point division that
1. Ensures floating-point division result precision on
all Intel386 processors and beyond, in all precision
modes.
2. Has been optimized for efficient performance on
current and future Intel processors. This
optimization has been accomplished through the
hand-coding of assembly routines that include such
techniques as the elimination of branching code and
the avoidance of CPU stalls.
A number of software patches have been proposed that may be suit-
ed to avoiding a potential division flaw. Note that Intel's pro-
posed software workaround, or patch, does not disable the
floating-point unit on susceptible Pentium(tm) processors. Hand
coded and optimized assembly routines were developed that contin-
ue to utilize the hardware floating-point division operation with
additional operations executed only in the rare case that a given
division is known to be susceptible to a floating-point division
flaw.
The software patch presented is intended to be implemented at the
compilation and software development levels. End-users should
use recompiled code where available. Recompiled code with a
patch such as the one that Intel is providing will allow the
fastest patched executable speed. Patches that interrupt
executables to override floating-point division operations with
alternate solutions incur the added expense of interrupts. The
disabling
of the floating-point unit as a patch for the floating-point
division flaw will slow all floating-point calculations, includ-
ing those such as FADD that are unaffected by the FDIV flaw.
The utilization of other patches, such as those that disable the
floating-point unit, can be of use to those who must execute code
that has not been coded or recompiled with incorporation of a
software patch.
BACKGROUND
----------
Certain steppings of the Intel Pentium(tm) processors have exhi-
bited a flaw in the floating-point unit that may result in some
loss of precision in division results. This precision loss can
manifest itself in bit positions 13 and beyond of the mantissa of
a floating-point division result and may occur in any of the
three (single, double, and extended) precisions, independent of
rounding mode. The floating-point division flaw can affect
floating-point division instructions such as FDIV and FDIVR as
well as functions utilizing hardware division instructions in-
cluding FPTAN, FPATAN, FPREM, and FPREM1. Because the flaw af-
fects a maximum of 5 sparsely populated divisor value ranges out
of 1024 possible ranges and particular combinations of operands,
precision is only affected in approximately 1 of 9 billion ran-
domly fed floating-point division operations.
Intel is determined to provide a safeguard against floating-point
division inaccuracies expediently and on all processors. To ac-
complish these goals, a unique collaboration was formed between
Intel and experts in the industry. Analysis and software patches
for the Pentium(tm) processor floating-point division flaw have
been devised at Intel utilizing expert input from Cleve Moler,
Terje Mathisen, Tim Coe, and Peter Tang. Coe has been able to
precisely simulate the FDIV flaw and provide proofs of correct-
ness for the techniques described in this document. Moler
developed a software adjustment technique. He has implemented an
FDIV patch in MATLAB and is currently verifying the result.
Mathisen devised a table-driven check of floating-point divisor
values, wrote an initial version of software patch code, and as-
sisted Intel with instruction-level optimization in assembly
code. Intel extended these techniques further to provide imple-
mentation flexibility to software developers and to minimize the
clock count of the floating-point division precision correction.
The resulting workaround can be implemented by replacing each
floating-point division instruction with a macro that expands in
line.
PATCH PROCESS STRATEGY
----------------------
To avoid slowing the execution of already correct floating-point
division operations, the software patch first establishes whether
an imprecise result is possible before executing a correction.
The recommended software solution involves a code expansion of
each FDIV-type instruction into a macro that includes a call to
an error checking and adjustment routine, described in more de-
tail later in this document. The routine includes several steps
needed to eliminate any potential precision loss from the
Pentium(tm) floating-point division flaw, as follows.
    1.  Test a global flag that indicates whether or not a
        processor is flawed. If the processor is not a
        Pentium(tm) processor containing the floating-point
        division flaw, a normal floating-point hardware
        division is applied to the original operands.
        Otherwise,
    2.  Perform an operand range check for those processors
        which contain the flaw.
        a.  If the range test indicates that a divisor
            is not in a susceptible numeric range,
            return the result of a normal floating-point
            hardware division applied to the original
            operands. Otherwise,
        b.  If the range test indicates that a divisor
            is in a susceptible numeric range,
            1.  Perform a software adjustment of the
                numerator and denominator.
            2.  Execute a hardware division with the
                adjusted operands.
            3.  Return the full precision result of
                the division.
The software patch should be generated and executed by default
rather than under a specific option flag. In particular, "blend-
ed code," code targeted towards multiple processors, should in-
corporate the software fix. This guarantees that even execution
of applications not optimized for the Pentium(tm) processor will
be protected against floating-point precision reduction should
they unexpectedly be executed on a Pentium(tm) processor with the
floating-point division flaw. An option should be provided to
disable the patch during execution.
In order to accommodate the requests of implementors of the Intel
software patch, several variations to the basic software correc-
tion and its implementation have been developed by Intel.
Developers can choose a suggested option that is technically
correct in their environment, minimally intrusive to their
current production schedules, and allows for the fastest
turnaround. If a developer requires a solution not encompassed by
the existing patches, Intel can assist in developing appropriate
techniques.
THE CORE ALGORITHM
------------------
The core of the division with optional checking process is accom-
plished through several steps and should not be modified.
The core algorithm need only be executed on flawed Pentium(tm)
processors. The existence of a Pentium(tm) processor with the
floating-point division processor flaw is identified at run time
by executing a floating-point division instruction with operands
known to induce a loss of precision.
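As a minimal sketch of such a run-time check, the C fragment
below divides one widely published operand pair (4195835 by
3145727) and inspects the residual. The operand pair and the name
fdiv_detect_sketch are illustrative assumptions; this document
does not list the operands used by Intel's own detection routine.
The return values follow the fdiv_chk_flag convention described
under PATCH IMPLEMENTATION (1 for an unaffected processor, -1 for
a flawed one).

```c
/* Hypothetical C stand-in for a run-time flaw check.
 * Assumption: the operand pair 4195835 / 3145727 is a widely
 * published test case; it is not taken from this document. */
static int fdiv_detect_sketch(void)
{
    /* volatile forces the division to be executed by the hardware
     * at run time instead of being folded by the compiler. */
    volatile double x = 4195835.0;
    volatile double y = 3145727.0;
    double residual = x - (x / y) * y;

    /* A correct divider leaves a residual of (nearly) zero; a
     * flawed divider is widely reported to leave 256. */
    if (residual < 1.0 && residual > -1.0)
        return 1;   /* no checking required */
    return -1;      /* flawed processor: use division with checking */
}
```

The result would be stored once in the global flag so that later
divisions only need to test that flag.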
During the first part of the core algorithm, a range check is
performed. Only divisors in identifiable ranges indicate divi-
sion operations susceptible to a floating-point division result
precision loss. An early proof indicated that a maximum of 5 out
of 16 ranges, or 31% of ranges, of divisor values identify sus-
ceptible divisions. Tim Coe and Peter Tang have proved that
there are a maximum of 5 out of 1024 ranges of divisor values
that constitute susceptible divisions, or a potential of less
than 1%. Peter Tang independently verified a proof of 5 suscep-
tible numeric bands out of 128. That proof currently supports
the core algorithm as it was available at the start of patch code
testing and validation.
Consider this representation of a normal divisor.
+------+-----------+-----+--------+-------------------------+
| sign |    exp    | 1.  |  1111  | 111 . . . . . . . . . . |
+------+-----------+-----+--------+-------------------------+
                      |  |(RANGE)                           |
                      |  +------------ mantissa ------------+
                      +-- zero if denormal
                          Figure 1.
RANGE refers to the 4 bits of the mantissa seen in the figure
above. In order for a reduction in precision to possibly affect
a floating-point division result, these 4 bits must be equal to
the decimal values 1, 4, 7, 10, or 13. Furthermore, the subse-
quent three bits must all be ones (i.e. equal to a 3-bit value of
7 decimal).
Consider the single-precision floating-point number 14.999999.
This number is known to be a divisor susceptible to a reduced
precision division. Its hexadecimal value is 416FFFFF. Its binary
representation can be seen in Figure 2.
+------+-----------+-----+--------+-------------------------+
|  0   | 1000 0010 | 1.  |  1101  | 111 1111 1111 1111 1111 |
+------+-----------+-----+--------+-------------------------+
                      |  |(RANGE)                           |
                      |  +------------ mantissa ------------+
                      +-- hidden bit
                          zero if denormal
                          Figure 2.
The first part of the range-check algorithm tests to see if the
divisor in question is a denormal number. If the divisor value
is found to be denormal, it is scaled up by 2^64 to normalize
the value before continuing with the range check process.
Next, the three bits following RANGE are masked. If any of those
bits equals zero, the core algorithm executes a hardware
floating-point division with the original operands, then exits.
Using the Figure 2 example, 14.999999 is not denormal and the
three bits following RANGE are all ones. Therefore, the algo-
rithm continues.
Next, an efficient table-lookup scheme developed by Terje
Mathisen is employed to detect a divisor whose RANGE value is 1,
4, 7, 10, or 13. A table is initialized with 16 elements, as seen
in Figure 3. Positions 1, 4, 7, 10, and 13 are set to one while
the remaining positions are set to zero.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|0|1|0|0|1|0|0|1|0|0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
element 0              element 15
            Figure 3.
When the RANGE value is used as an index into the table, a one or
zero is returned. If a zero is returned, no precision reduction
will occur in a division result, and the core algorithm executes
a hardware floating-point division with the original operands,
then exits.
In the Figure 2 example, the 4 RANGE bits equal 13. Indexing 13
into the table (table[13]) returns a value of one, so the core
algorithm is not yet complete.
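The range check described above can be sketched in C for a
single-precision divisor. This is an illustration only, not
Intel's assembly routine: the function name, the use of a 32-bit
float rather than the 80-bit format, and the handling of special
values are all assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

/* Lookup table from Figure 3: entries 1, 4, 7, 10 and 13 are one. */
static const unsigned char range_table[16] =
    { 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0 };

/* Hypothetical helper: returns 1 if a single-precision divisor lies
 * in a range that may be susceptible, 0 otherwise. */
static int divisor_may_be_susceptible(float divisor)
{
    uint32_t bits, fraction;

    memcpy(&bits, &divisor, sizeof bits);

    /* Denormal: exponent field is zero and the fraction is not.
     * Scale by 2^64 (a power of two, so the mantissa bits are
     * preserved) to normalize before examining the mantissa. */
    if (((bits >> 23) & 0xFFu) == 0 && (bits & 0x7FFFFFu) != 0) {
        divisor *= 18446744073709551616.0f;   /* 2^64 */
        memcpy(&bits, &divisor, sizeof bits);
    }

    fraction = bits & 0x7FFFFFu;      /* 23 bits after the hidden bit */

    /* The three bits following RANGE must all be ones. */
    if (((fraction >> 16) & 0x7u) != 0x7u)
        return 0;

    /* RANGE is the top four fraction bits; index the table with it. */
    return range_table[fraction >> 19];
}
```

For the Figure 2 divisor, 14.999999 (hexadecimal 416FFFFF), the
fraction yields a RANGE of 13 followed by three one bits, so the
sketch reports the divisor as possibly susceptible.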
At this point in the algorithm, it is known that the divisor
falls into a numeric range that may be susceptible to the
Pentium(tm) processor floating-point division flaw. Both
numerator and divisor are multiplied by 15/16, so the quotient
is multiplied by (15/16) / (15/16) = 1 and is mathematically
unchanged. The scaling factor of 15/16 shifts the divisor out of
the susceptible ranges and into ranges known to be immune to the
floating-point division flaw. That is, a RANGE value of 1 is
shifted to 0, 4 to 3, 7 to 6, and so on. A hardware
floating-point division with the scaled operands is performed and
the algorithm exits.
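The scaling step can be illustrated with the short C sketch
below. It assumes that long double maps to the 80-bit extended
format, which is compiler and platform dependent, and the
function name is hypothetical. It does not reproduce the
precision-control handling of the assembly routines.

```c
/* Hypothetical illustration of the 15/16 scaling step.
 * Assumption: long double is the x87 80-bit extended format, so
 * scaling a double operand by 15/16 is exact. */
static double scaled_divide(double numerator, double divisor)
{
    const long double scale = 15.0L / 16.0L;   /* 0.9375, exact */
    long double n = (long double)numerator * scale;
    long double d = (long double)divisor * scale;

    /* (n * 15/16) / (d * 15/16) equals n / d mathematically. */
    return (double)(n / d);
}
```

With the Figure 2 divisor, 14.999999 * 15/16 is slightly below
14.0625, whose RANGE bits are 1100 (12) rather than 1101 (13),
so the scaled divisor no longer falls into a susceptible range.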
In single and double precision modes, division results will have
full precision and conform to IEEE standards. In extended modes,
at most one bit of precision may be lost due to the extra
floating-point operations on the operands.
RECOMPILATION
-------------
The Intel software patch can optimally be implemented at the com-
pilation level. The workaround can be implemented in a compiler
by following the guidelines under the section entitled PATCH IM-
PLEMENTATION. Compilers should be modified to generate a
software patch in place of each originally generated FDIV-type
instruction as described later. Source-level applications can
then be recompiled to incorporate the Intel workaround.
Though a compiler should generate patch code by default, an op-
tion should be available to disable patch code generation during
compilation. When such an option is specified, a compiler should
still be able to check for the presence of the floating-point
division flaw and issue a warning if present.
ISVs - END-USER MODIFICATIONS
-----------------------------
The presented software solution can be validly applied at the
source level; however, modifying an application at the
source-code level in this way can be error-prone and can cause
problems in future generations of the application. Implementing
the division_with_checking routine directly within a source-level
application is not as straightforward as during assembly code
generation for an application. Scanning specifically for
floating-point division symbols in a C program is a complicated
process, for example. Multiple divisions in a statement or within
conditional
blocks prevent the automatic expansion of a macro in place of a
normal floating-point division expression. The software patch
should preferably be implemented at the compilation level, where
operations have been broken into single independent instructions
and immediately succeeding function calls can therefore be
tolerated.
It is recommended that ISVs recompile their applications with
compilers modified to generate the FDIV-instruction patch expan-
sion. This will guarantee that application users receive the
precision available from processors not containing the floating-
point unit flaw with minimal impact on application execution
speed.
Individuals with access to their own applications' low-level as-
sembly source codes can incorporate the provided macros for
floating-point division with checking much as a compiler would
before code output.
If during a particular run an individual is certain that an ap-
plication will not be affected by a reduction in floating-point
division accuracy, the fdiv_chk_flag global variable can be
turned off to maximize execution speed of that application.
PATCH IMPLEMENTATION
--------------------
A summary of some precision checking and restoring
(division_with_checking) functions is provided in Appendix A.
Refer to appropriate assembly source files for the actual Intel
division_with_checking routines. Throughout the remainder of
this document, the token division_with_checking is used to indi-
cate any of the floating-point division patch routines provided
by Intel.
The preferred method of retaining precision involves compiler in-
clusion of a conditionally executed division_with_checking rou-
tine at each FDIV instruction originally generated. This process
can be expanded by compilers at a very late compilation phase so
as to preserve compiler state and minimize the changes required
to existing compilers. Assembly code developers can directly in-
corporate calls to the patch code in their routines.
This late-phase design simplifies the implementation process for
compiler vendors by avoiding the disruption of the quality as-
surance process that would be caused by mid-compiler modifica-
tions. In addition, compiler optimization will proceed normally
since a call will not be inserted in place of an FDIV in inter-
mediate code. Insertion of a call during earlier phases of com-
pilation could potentially turn off optimization around
floating-point divisions.
In order to incur minimal cycle cost, the division_with_checking
routines should be called only during execution of applications
on individual processors known to exhibit the floating-point
division flaw. To this end, the software patch process includes
the use of conditional code enclosing the appropriate
division_with_checking routines.
In the assembly code example presented in this section, the first
operand of a two-operand instruction is the result destination.
Consider an FDIVR (reverse division) instruction. The format
used in section examples is
        FDIVR DEST, SRC
        DEST <- SRC / DEST
The actual instruction used in this section's example will be
        opcode     mnemonic
        d8 fd      fdivr st, st(5)
The top of the stack will hold the result of a floating-point
division of the fifth (zero-based) stack position by the top of
the stack.
At the affected FDIV-type instruction, macro expansion and gen-
eration of a conditional block around the fdivr instruction
should take place as seen below.
        if (fdiv_chk_flag == 1) {
            fdivr st, st(5)
        }
        else
            division_with_checking FDIVR expansion
The fdiv_chk_flag global variable is a three-state variable whose
value is set within the division_with_checking routines. It is
initialized to 0 when declared, set to 1 on processors not re-
quiring floating-point division with checking, or set to -1 when
the executing Pentium(tm) processor exhibits the floating-point
division flaw.
The value of fdiv_chk_flag is set by the function fdiv_detect
during the first invocation of a division_with_checking routine.
The fdiv_detect function stores a value to fdiv_chk_flag depend-
ing upon the status of the executing processor. It also returns
the new value of fdiv_chk_flag to the calling function. The
division_with_checking routine is called when fdiv_chk_flag is
not equal to 1, and proceeds when fdiv_chk_flag is equal to -1.
        cmp   fdiv_chk_flag, $1        ; compare global to 1
        jne   L1                       ; if not 1 jump to L1
        fdivr st, st(5)                ; else do hw fdivr
        jmp   L2                       ; then jump to L2
L1:
        division_with_checking FDIVR expansion
                                       ; do fdivr w/checking
                                       ; during the first
                                       ; invocation, this
                                       ; routine also sets
                                       ; fdiv_chk_flag
L2:
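The three-state flag logic can also be sketched in C. The stub
functions below are hypothetical placeholders, not the Intel
routines: the detection stub always reports an unaffected
processor, and the checking stub omits the range check and
scaling. The counter exists only to show that detection runs
once.

```c
/* 0 = not yet determined, 1 = no checking required,
 * -1 = flawed processor detected (as described in the text). */
static int fdiv_chk_flag = 0;
static int detect_calls = 0;      /* illustration only */

/* Hypothetical stand-in for fdiv_detect: stores and returns the
 * new flag value. A real version would run a test division. */
static int fdiv_detect_stub(void)
{
    detect_calls++;
    fdiv_chk_flag = 1;
    return fdiv_chk_flag;
}

/* Hypothetical stand-in for a division_with_checking routine. */
static double division_with_checking_stub(double n, double d)
{
    if (fdiv_chk_flag == 0)
        fdiv_detect_stub();       /* first invocation sets the flag */
    if (fdiv_chk_flag == 1)
        return n / d;             /* processor is not affected */
    /* fdiv_chk_flag == -1: the range check and scaling of the
     * core algorithm would be performed here. */
    return n / d;
}

/* Expansion generated in place of each division. */
static double checked_divide(double n, double d)
{
    if (fdiv_chk_flag == 1)
        return n / d;             /* fast path: plain hardware divide */
    return division_with_checking_stub(n, d);
}
```

After the first call to checked_divide, every later division on
an unaffected processor costs only the compare and branch.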
Next, insert a call to an appropriate division_with_checking rou-
tine (in this example, fdiv_r) within the conditional block in
addition to the original FDIV-type instruction. The code for the
current example then resembles the sequence below.
        cmp   fdiv_chk_flag, $1        ; compare global to 1
        jne   L1                       ; if not 1 jump to L1
        fdivr st, st(5)                ; else do hw fdivr
        jmp   L2                       ; then jump to L2
L1:
        call  fdiv_r
L2:
Note that the caller must save and restore condition codes and
may need to save and restore register contents around calls to
division_with_checking functions. The eflags register contents
are destroyed by the division_with_checking routines, and the eax
register contents might be overwritten as specified below.
When performing register-register divisions, as in the current
example, the eax register is used to convey information to the
division_with_checking procedure to be executed. Therefore, the
caller must at times save and restore eax around invocations of
division_with_checking functions. Alternatively, the eax infor-
mation can be pushed onto the stack with simple modifications to
the relevant division_with_checking routines.
For a division with register operands, one operand is taken from
the top of the floating-point stack, and the stack position
number of the second operand needs to be recorded in the eax re-
gister along with additional information about the intended
floating-point division as pictured in Figure 4. Only the 6
lowest bits of the eax register are used for this initialization
so potential operation in 16-bit mode is still valid.
-----------------------------------------------------------------
  5   4   3   2   1   0     Bit position
+---+---+---+---+---+---+
|   |   |   |   |   |   |
+---+---+---+---+---+---+
  |       |   |   |   |
  +-------+   |   |   +-- pop (DIVP)
  Indicates   |   +------ reverse (DIVR)
  stack       +---------- True  - result is at ST(bits 3 to 5)
  position                False - result is at ST(0)
                       Figure 4.
             Register Initialization of eax
            for register-register FDIV patch
-----------------------------------------------------------------
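The bit layout of Figure 4 can be expressed as a small C helper.
The function and parameter names are illustrative assumptions;
only the bit positions come from the figure.

```c
/* Hypothetical helper that builds the low six bits of eax for a
 * register-register division, following Figure 4:
 *   bits 5-3  stack position of the second operand
 *   bit  2    1 = result goes to ST(stack position), 0 = ST(0)
 *   bit  1    1 = reverse division (DIVR)
 *   bit  0    1 = pop after the division (DIVP)              */
static unsigned fdiv_eax_value(unsigned stack_pos, int result_at_sti,
                               int reverse, int pop)
{
    return ((stack_pos & 0x7u) << 3)
         | (result_at_sti ? 0x4u : 0x0u)
         | (reverse       ? 0x2u : 0x0u)
         | (pop           ? 0x1u : 0x0u);
}
```

For fdivr st, st(5), the call fdiv_eax_value(5, 0, 1, 0) yields
binary 101010, or decimal 42.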
Alternatively, when floating-point divisions involve memory
operands, the associated division_with_checking routines expect
that memory operands have been pushed onto the top of the user
stack. This process avoids prefix overrides. Push memory in-
structions can be used to accomplish the memory operand setup on
the user stack, eliminating the need for additional register as-
signment.
In the given example, the lowest six bits of eax should be set to
101010 (decimal 42). That is, the second operand is at stack
position 5 (101), the top of stack will hold the result (0),
reverse division is specified (1), and no pop will be executed
(0).
        cmp   fdiv_chk_flag, $1        ; compare global to 1
        jne   L1                       ; if not 1 jump to L1
        fdivr st, st(5)                ; else do hw fdivr
        jmp   L2                       ; then jump to L2
L1:
        push  eax                      ; save eax
        mov   eax, $42                 ; load eax for fdiv_r
        call  fdiv_r                   ; do div w/checking
        fstp  result                   ; get the div result
        pop   eax                      ; restore eax
L2:
The division_with_checking routines such as fdiv_r return the
division result on the floating-point stack.
In UNIX format, the destination and source operands are reversed.
Hence, the preceding example would be translated to the subse-
quent code.
**      cmp   $1, fdiv_chk_flag        ; compare global to 1
        jne   L1                       ; if not 1 jump to L1
**      fdivr %st(5), %st              ; else do hw fdivr
        jmp   L2                       ; then jump to L2
L1:
**      push  %eax                     ; save eax
**      mov   $42, %eax                ; load eax for fdiv_r
        call  fdiv_r                   ; do div w/checking
        fstp  result                   ; get the div result
**      pop   %eax                     ; restore eax
L2:
Asterisks indicate instructions modified during translation to
UNIX format.
IMPLEMENTATION VARIATIONS
-------------------------
The core algorithm that performs the divisions and assures accu-
racy should not be modified. There are subtle end cases that
must be accounted for to provide results equivalent to the FDIV
operation executing on processors not containing the floating-
point division flaw.
The previous sections describe the preferred software solution
for overcoming the Pentium(tm) processor floating-point division
flaw. This solution is likely to accommodate most environments.
A variation of the preferred resolution may be necessary, as
described in the following paragraphs.
If the fdiv_chk_flag global can be set in a program's startup
code, the checking routines can be modified to eliminate the set-
ting and testing of the fdiv_chk_flag global variable. This el-
iminates one mandatory call to a division_with_checking routine
and an additional compare within checking routines on processors
that are not susceptible to the floating-point division flaw.
There may be cases where testing a global variable is not practi-
cal and the first test of the software patch will be to see if
the divisor falls into a problematic range.
Intel has developed 16-bit DOS, 32-bit DOS, and UNIX versions of
the division_with_checking assembly routines. The code operates
in extended-precision mode. The control word is saved and re-
stored within the division_with_checking code. If it is known
that the processor is always operating in 80-bit precision mode,
the control word save and restore code can be deleted.
Performing the scaling and result adjustment for all floating-
point divisions falling within susceptible ranges without regard
to the presence of a floating-point division flaw on the execut-
ing processor is never recommended as this needlessly increases
processing time.
LIBRARIES
---------
Code needs to be modified so that floating-point division
instructions are replaced with floating-point division macros
that ensure full-precision division results. It is essential that
libraries as well as hand-coded and compiler-generated code be
made safe.
FDIV may occur in many library routines, especially the hyperbol-
ic routines. Other instructions including transcendentals that
may be present in library code need to be addressed. These are
currently known to include FPTAN, FPATAN, FPREM, and FPREM1. The
logarithmic instructions FYL2X and FYL2XP1 are safe from the
floating-point division flaw, as is proved in the RELATED PROOFS
section.
An implementation of FPTAN using a hardware division instruction
as well as a 64-bit software version of FPATAN are available.
Software implementations of FPREM and FPREM1 have also been
developed.
TECHNICAL NOTES
---------------
Intel has developed 16-bit DOS, 32-bit DOS, and UNIX versions of
the division_with_checking assembly routines. The code operates
in extended-precision mode and is designed for Intel386(tm) pro-
cessors and beyond.
Hardware behavior is not mimicked exactly in the FDIV workaround
code. This section includes explanations of technical details and
a summary of the differences between straightforward hardware
division and hardware division within the context of the proposed
software workaround.
Running the Patch on Processors Preceding the Intel486(tm) Pro-
cessor
-----------------------------------------------------------------
The Intel patch code has been specifically written to run on
Intel486(tm) processors and beyond. It should be guaranteed that
the fdiv_chk_flag global variable is set before attempting to
execute any workaround code on processors earlier than the
Intel486(tm).
The fdiv_detect routine does a check for the Pentium(tm) proces-
sor floating-point division flaw with a sample division and can
be run on all processors in the Intel Architecture family having
floating-point units. This routine initializes fdiv_chk_flag.
Prior to executing fdiv_detect, the presence of a floating-point
unit must be established. This cannot be established in the
fdiv_detect routine as such a check would require the incorrect
reinitialization of the floating-point unit when checking for the
FDIV flaw.
Detecting the Floating-Point Unit
---------------------------------
Since some operating systems provide means of disabling the
floating-point unit, applications need to be aware that the in-
formation they need is whether the OS has enabled the FPU, rather
than whether the FPU exists.
Old 16-bit binaries typically handled the absence of an FPU with
built-in emulators. Most 32-bit operating systems provide emula-
tion capability so applications do not need to provide their own.
Hence, if a user requests that the operating system turn off the
floating-point unit on 32-bit operating systems, floating-point
operations will be emulated by the 32-bit OS. Alternatively, if
a user requests that the operating system turn off the
floating-point unit on a 16-bit operating system, floating-point
instructions will be skipped.
16-bit applications should continue to use the FINIT sequence to
detect if the floating-point unit is present. For compatibility
with older processors, the CPUID instruction should not be used
to check for an FPU.
For 32-bit applications where most environments already provide
FPU functionality by default, it is not necessary for applica-
tions to test for the presence of the FPU explicitly.
Mnemonic Interpretation
-----------------------
Mnemonics, opcodes, and their descriptions adhere to the
Pentium(tm) Processor User's Manual. In particular, the mnemonic
FDIVRP ST(x), ST represents the opcode DE F0+x and the mnemonic
FDIVP ST(x), ST represents the opcode DE F8+x. Note that the
UNIX assembler erroneously attaches each of these mnemonics to
the other's opcode (e.g. FDIVRP ST(x), ST represents DE F8+x).
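The opcode assignments stated above can be written out as a short
C sketch. The function names are hypothetical; the byte values
are those given in this section for the Pentium(tm) Processor
User's Manual mnemonics, not the swapped UNIX assembler usage.

```c
/* Hypothetical helpers returning the two opcode bytes as a
 * 16-bit value (first byte in the high half), using the manual's
 * mnemonic definitions for stack register ST(x), x = 0..7. */
static unsigned fdivrp_opcode(unsigned x)
{
    return (0xDEu << 8) | (0xF0u + (x & 0x7u));   /* DE F0+x */
}

static unsigned fdivp_opcode(unsigned x)
{
    return (0xDEu << 8) | (0xF8u + (x & 0x7u));   /* DE F8+x */
}
```

A tool that patches binaries or emits opcodes directly would key
on these byte values rather than on the assembler mnemonic.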
Scaling Factor
--------------
The scaling factor of 15/16 was chosen to guarantee that an
operand lying within one of the five flaw-susceptible ranges of
numbers will be scaled to a safe region. This guarantee is
trivially proven by testing the endpoints of the five known
numeric bands. Refer to Statistical Analysis of Floating Point
Flaw In the Pentium(tm) Processor (1994) (Sharangpani, Barton)
for the boundaries of the potentially unsafe regions.
Scaling Exceptions
------------------
Because the scaling factor is less than one, it introduces the
possibility of an underflow when the numerator is multiplied by
it. If the result of the final division is to be either 32 or 64
bits, this can be addressed by performing the scaling in extended
precision. Since the smallest normal extended-precision magnitude
is 2^-16382, no single or double-precision input operand has the
possibility of becoming a denormal when multiplied. If the
scaling factor were greater than one, a similar argument shows
that overflow is not possible.
Unfortunately, the possibility of underflow persists for 80-bit
operations with numbers having magnitude less than
(16/15)*2^-16382. Masking the underflow exception while doing the
scaling avoids the trap that would ordinarily occur. However, the
underflow bit is sticky, and hence a spurious underflow would
still be reflected.
Precision Loss
--------------
o Hardware division within the context of the FDIV software patch
employs different algorithms than a simple hardware division. In
order to avoid excessive performance degradation, a few varia-
tions in the resulting precision between the two division possi-
bilities may be observed.
o Newton-Raphson methods are typically less precise due to final
roundings. Because the precision-restoring algorithm in the FDIV
and FPTAN patch routines introduces two additional floating-point
operations to the computation of a division, the precision of the
operation is reduced by 1 ULP (unit of least precision). This
means that with the given algorithm, an 80-bit precision division
result is reduced from 64 bits of mantissa precision to 63. By
doing all scaling in extended precision and then dividing in ei-
ther single or double-precision accuracy, results of full preci-
sion are produced in single and double-precision modes.
o When applied to small denormal numerators, the FDIV patch code
may produce slightly different results in the least significant
binary digits. This potentially occurs when the numerator is
denormal, and hence has a reduced number of significant digits.
For example, when an extended precision denormal has 6 leading
zeros, that number only has 58 significant digits. When such a
number is used in a division, the result will only have 58 signi-
ficant digits.
o If the inputs to FDIV or FPTAN patch routines are not exactly
representable as singles or doubles, the result may differ by up
to 1 ULP. Exactly representable single and double operands will
produce exact results.
o The FPATAN patch routine result may differ in extended preci-
sion by as much as 3 ULPs. For single and double, the FPATAN
patch routine result precision may differ by as much as 1.5 ULPs.
FPU Status Word
---------------
o The assembly routines provided by Intel should not be called
from code with exceptions unmasked where the values of the flags
denormal, inexact, or underflow are utilized.
o Hardware division within the context of the FDIV software patch
employs different algorithms than a simple hardware division. In
order to avoid excessive performance degradation, a few varia-
tions in the resulting FPU status word between the two division
possibilities may be observed.
o The inexact flag after scaling and a hardware division may not
be the same as a hardware division of the original operands.
Sometimes using the original operands results in an inexact ex-
ception while using the scaled operands does not, and vice versa.
o The denormal bit may be set differently after division within
the context of any of the patch routines.
o The FDIV patch may set the underflow flag for divisions by
denormals when underflow would not otherwise be set.
o FDIV and FPTAN patch routines may set C1 differently when
called with precision control set to extended.
o FDIV and FPTAN patch routines may set C1 differently if the in-
put operands were not exactly representable as singles or doubles
and the precision control is set to single or double, respective-
ly. Exactly representable single and double operands will pro-
duce exact results.
o FPREM and FPREM1 patch routines may not set C0, C1, and C3
identically if the given instruction performs an incomplete
reduction.
o The patch code for FPATAN may set the values of C0, C2, and C3
differently than the hardware instruction. Similarly, the patch
for FPTAN may set C0 and C3 differently than the hardware. C0,
C2, and C3 are marked undefined for these instructions in the
reference manual, so proper existing code should not rely upon
specific values for them regardless.
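When comparing a patched division against a hardware division in
a test harness, the flags listed above can be masked out before
the status words are compared. The following C sketch uses the
standard x87 status word bit positions. The macro and function
names are hypothetical, and the mask is the union of every flag
named in these notes, so it is coarser than any single routine
requires.

```c
#include <stdint.h>

/* x87 status word bit masks (names are illustrative). */
#define SW_DENORMAL   0x0002u  /* DE, bit 1  */
#define SW_UNDERFLOW  0x0010u  /* UE, bit 4  */
#define SW_INEXACT    0x0020u  /* PE, bit 5  */
#define SW_C0         0x0100u  /* bit 8  */
#define SW_C1         0x0200u  /* bit 9  */
#define SW_C2         0x0400u  /* bit 10 */
#define SW_C3         0x4000u  /* bit 14 */

/* Assumption: treat every flag mentioned in the notes as one that
   may legitimately differ, regardless of which routine ran. */
#define SW_MAY_DIFFER (SW_DENORMAL | SW_UNDERFLOW | SW_INEXACT | \
                       SW_C0 | SW_C1 | SW_C2 | SW_C3)

/* Hypothetical test-harness helper: nonzero when the two status
   words agree in every bit outside SW_MAY_DIFFER. */
int status_words_agree(uint16_t hw, uint16_t patched)
{
    return ((hw ^ patched) & (uint16_t)~SW_MAY_DIFFER) == 0;
}
```

A difference in the invalid-operation or zero-divide flag is
still reported, since those bits are outside the mask.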
RELATED PROOFS
--------------
Safety of the Logarithmic Instructions FYL2X and FYL2XP1
--------------------------------------------------------
Peter Tang has proved the immunity of FYL2X and FYL2XP1 from the
Pentium(tm) processor floating-point division flaw. Proofs fol-
low.
FYL2X
-----
The table-driven, polynomial-based algorithm for FYL2X employs
one division for arguments x in the range 7/8 < x < 9/8 and one
division for arguments in the range |x - 1| >= 1/8, that is,
0 < x <= 7/8 or x >= 9/8. Both divisions are used for argument
transformation. The division flaw does not affect this algorithm.
For 7/8 < x < 9/8, the division is correct, and therefore FYL2X
is unaffected for input arguments in this range. The reason is
that for x in this range, the transformation uses 1+x as the
denominator. This transformation is quite standard. Because
15/8 < 1+x < 17/8, the bit pattern for 1+x is either
2^1 * 1.0000????.... or 1.111?????.....
Both bit patterns are safe denominators.
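The claim about the leading bits of 1+x can be checked numerically
with a small C sketch. It works in double precision rather than
extended precision, which is an assumption made for portability,
and the function names are hypothetical. The check accepts
leading fraction bits 0000, 1110, and 1111, none of which is
among the case values 1, 4, 7, 10, and 13 listed later in this
document (assuming those case numbers denote the four leading
fraction bits of the divisor).

```c
/* Return the first four fraction bits b1 b2 b3 b4 of a positive,
   finite, nonzero value written as 2^e * 1.b1 b2 b3 b4 ... */
int leading_fraction_bits(double v)
{
    /* Normalize v into the interval [1, 2). */
    while (v >= 2.0)
        v *= 0.5;
    while (v < 1.0)
        v *= 2.0;
    return (int)((v - 1.0) * 16.0);
}

/* Hypothetical check: nonzero when 1+x has the form 1.0000...
   or 1.111..., as stated for 7/8 < x < 9/8. */
int near_one_denominator_is_safe(double x)
{
    int bits = leading_fraction_bits(1.0 + x);

    return bits == 0 || bits >= 14;
}
```

For x = 0.9 the denominator 1.9 has leading bits 1110, and for
x = 1.1 the denominator 2.1 normalizes to 1.05 with leading bits
0000.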
For |x - 1| >= 1/8, the denominator has a bit pattern of 2^m *
1.b1 b2 b3 b4 b5 b6 b7 .... where (b6 b7) = (1 0) or (0 1). The
reason is that the denominator is obtained by x+c where c is ba-
sically the leading bits of x. Precisely, for
x = 2^k * 1. b1 b2 b3 b4 b5 ? ? ? ? ? ? ...
we have
c = 2^k * 1. b1 b2 b3 b4 b5 1 0 0 0 .... 0 0 0
Thus,
x + c = 2^(k+1) * 1. b1 b2 b3 b4 b5 b6 b7 ? ? ?
where (b6 b7) is (1 0) or (0 1). To be more explicit, x + c is
    2^k *    1. b1 b2 b3 b4 b5 ?  ?  ?  ?  ?  ?  ...
  + 2^k *    1. b1 b2 b3 b4 b5 1  0  0  0  0  0  ...
  ---------------------------------------------------
    2^k * 1 b1. b2 b3 b4 b5 0  ?  ?  ?  ?  ?  ?  ...
  + 2^k       . 0  0  0  0  0  1  ?  ?  ?  ?  ?  ...
  ---------------------------------------------------
    2^k * 1 b1. b2 b3 b4 b5 b6 b7 ?  ?  ?  ?  ?
Here b6 and b7 cannot both be ones (b7 == 1 implies the ?s
above are zeroes, making b6 = 0+0 = 0).
Since b5, b6, and b7 must all be ones when the flaw is encoun-
tered, this range is also safe.
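The argument for x + c can also be exercised in integer
arithmetic. The C sketch below assumes a 64-bit significand with
the integer bit in bit 63, so that b1 to b5 occupy bits 62 to 58.
The function names are hypothetical. The sketch only illustrates
the bit argument above and is not the FYL2X implementation.

```c
#include <stdint.h>

/* Build c from x: keep the integer bit and b1..b5 (bits 63..58),
   then append 1 0 0 0 ... starting at bit 57. */
uint64_t make_c(uint64_t x)
{
    return (x & 0xFC00000000000000ULL) | (1ULL << 57);
}

/* Significand of x + c, renormalized to 2^(k+1) * 1.b1 b2 ...
   Halving each term first avoids 64-bit overflow. Only the
   lowest bit of x is dropped, since c is even. */
uint64_t denominator_significand(uint64_t x)
{
    return (x >> 1) + (make_c(x) >> 1);
}

/* Nonzero when b6 (bit 57) and b7 (bit 56) of the normalized
   denominator are both ones. */
int b6_and_b7_both_set(uint64_t denom)
{
    uint64_t mask = (1ULL << 57) | (1ULL << 56);

    return (denom & mask) == mask;
}
```

For any normalized x, b6_and_b7_both_set(denominator_significand(x))
is expected to return zero, in line with the proof.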
FYL2XP1
-------
The algorithm is very much the same as FYL2X, and the impact of
division on it is also the same as it is on FYL2X.
For |x| < 1/8, a division is used where 2+x is a safe denomina-
tor. For |x| >= 1/8, overwrite x by 1+x and again use x+c as a
denominator.
Identifying Problematic Divisors
--------------------------------
Tim Coe has proved the validity of checking the bit patterns in-
dicating divisors at risk described previously in this document.
Tim Coe and Peter Tang are currently preparing a formal proof
that will be submitted for publication in the near future. A main
thrust of the proof is to establish that the following two digit
sequences and P-D table accesses are the only paths to addressing
the flawed P-D entries:
For cases 1, 4, 7, 10, and 13 ==>
Cycle | Q digit  | P-D entry        | Minimum magnitude
      | selected | accessed         | of ignored partial
      |          |                  | remainder
--------------------------------------------------------
B-3   | -1 or -2 | no restriction   | no restriction
      |          |                  |
B-2   | -2       | maximum entry    | 125/512
      |          | for -2 digit     |
      |          |                  |
B-1   | +2       | flawed P-D entry | 14/64
      |          | less 1/8         |
      |          |                  |
B     | +2 ==> 0 | flawed P-D entry | 0
      |          |                  |
For cases 1, 7, and 13 ==>
Cycle | Q digit  | P-D entry        | Minimum magnitude
      | selected | accessed         | of ignored partial
      |          |                  | remainder
--------------------------------------------------------
B-3   | -1 or -2 | no restriction   | no restriction
      |          |                  |
B-2   | -1       | maximum entry    | 125/512
      |          | for -1 digit     |
      |          |                  |
B-1   | +2       | flawed P-D entry | 14/64
      |          | less 1/8         |
      |          |                  |
B     | +2 ==> 0 | flawed P-D entry | 0
      |          |                  |
The partial remainder has the form:
P-D table <> ignored
 address  <> portion
   XXXX.XXXxxxxxxxxxxx...
           xxxxxxxxxxx...
0 <= ignored < 1/4
Start from cycle B and work backwards. Either establish
algebraically that alternatives cannot occur, or assume some
alternative can occur and derive a contradiction. Progressing
backwards, establish that two ones, then three, then four, and
finally six ones in positions 2^(-5) to 2^(-10) of the divisor
are required to address the flawed P-D entry. Use preliminary
restrictions on the divisor to establish earlier entries in the
above tables, and then use these facts to establish tighter
restrictions on the divisor.
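The divisor condition that the proof arrives at can be written as
a simple bit test. The C sketch below assumes a 64-bit significand
with the integer bit in bit 63, and it assumes that the case
numbers 1, 4, 7, 10, and 13 are the values of the four fraction
bits b1 to b4 that precede the six ones. The function name is
hypothetical, and the sketch is not the range test used by the
patch code.

```c
#include <stdint.h>

/* Hypothetical divisor screen. Assumes the case numbers in the
   text are the 4-bit values of b1..b4, and that b5..b10 are the
   bits in positions 2^(-5) to 2^(-10). */
int divisor_is_at_risk(uint64_t significand)
{
    unsigned lead = (unsigned)((significand >> 59) & 0xF);  /* b1..b4  */
    unsigned ones = (unsigned)((significand >> 53) & 0x3F); /* b5..b10 */

    if (ones != 0x3F)
        return 0;
    return lead == 1 || lead == 4 || lead == 7 ||
           lead == 10 || lead == 13;
}
```

For example, the integer 3145727 (0x2FFFFF), a widely reported
problem divisor, normalizes to the significand 0xBFFFFC0000000000.
Its bits b1 to b4 are 0111 (case 7) and b5 to b10 are all ones, so
the function reports it as at risk.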
TESTING AND VALIDATION
----------------------
Intel has performed multiple levels of testing and validation,
including:
1. Core routines
2. Compiler builds
3. Random tests
The Intel FDIV software patch has been incorporated into Intel's
own compiler and tested extensively for correctness. In addition
to recompilation and execution of the entire production compiler
test suite, specific division tests were designed using test vec-
tors containing billions of random division operand values.
These random division tests were recompiled with Intel's compiler
and all subsequently executed divisions completed without error,
and with full precision to the extent of the exceptions noted in
this document. Compiler vendors implementing Intel's software
patch will perform additional testing and validation of the pro-
cedure.
CYCLE COST CONSIDERATIONS
-------------------------
Performance impact from the software patch will be minimal. The
multiple steps proposed in the preceding sections are optimal as
they ensure absolute resolution of a possible FDIV flaw and pro-
vide the best possible performance. The cost on systems without
the flaw is insignificant. On the other hand, performing the
range test at all times would waste processor cycles on proces-
sors that do not exhibit the floating-point division flaw.
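One way to keep the cost low on processors without the flaw is to
probe for the flaw once and select the division routine
accordingly. The C sketch below is an illustration only. It uses
the widely published operand pair 4195835 and 3145727 as the
probe, which is an assumption and not necessarily the detection
used by the patch code, and all function names are hypothetical.

```c
typedef double (*divide_fn)(double, double);

/* Returns nonzero when the divider mishandles a widely published
   problem operand pair. The result is cached, so the probe
   division runs only once. */
int fdiv_flaw_present(void)
{
    static int checked = 0;
    static int flawed = 0;

    if (!checked) {
        /* volatile keeps the compiler from folding the arithmetic. */
        volatile double x = 4195835.0;
        volatile double y = 3145727.0;
        double residual = x - (x / y) * y;

        if (residual < 0.0)
            residual = -residual;
        /* Ordinary rounding leaves a residual far below 1; a
           flawed divider is reported to be off by much more. */
        flawed = residual > 1.0;
        checked = 1;
    }
    return flawed;
}

static double hardware_divide(double x, double y)
{
    return x / y;
}

/* Pick the divide routine once. 'patched' stands for whatever
   checked divide routine the caller supplies. */
divide_fn select_divide(divide_fn patched)
{
    return fdiv_flaw_present() ? patched : hardware_divide;
}
```

On a processor without the flaw, select_divide returns the plain
hardware division, so the only added cost is the one-time probe.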
Table 1 includes execution times of tests run on various proces-
sors. Software tests consisting of multiple divisions performed
with and without the register-register FDIV software patch were
compiled using a recent Intel Reference Compiler for UNIX and
timed on the systems listed in the table columns. Table 2
displays execution times of the FDIV patch using memory operands.
The testing and timing code included a loop like the following.
for (i=0; i