Integers are a great thing, but they have two main limitations: they cannot represent fractions, and their range is limited.
If we want to represent fractions, we could do so just like we do in the decimal system: include places with values less than 1. We do this by introducing a binary point (just like a decimal point) and then numbering positions to the right of the binary point with negative exponents:
                                        binary point
                                             v
    2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0  .  2^-1  2^-2  2^-3
Now we can represent, e.g., 2.5 as 10.1, which is (1 * 2) + (1 * 1/2).
The nice thing is that we can still store these numbers as integers -- we just change the way they are interpreted on input and output. Numbers which are stored internally as integers, but are given a binary point for input and output, are called fixed-point numbers. They are fixed point because we assume that the binary point is fixed at some position.
For example, we could have a 32-bit fixed-point representation in which the binary point was to the right of bit number 1:
0000000000000000000000000000000.0
or we could have a fixed-point representation in which we assumed the
binary point was to the left of bit number 31:
.00000000000000000000000000000000
What are we giving up by gaining the ability to represent fractions? We
are giving up range. For example, in the last case, we can only
represent quantities between zero and one. In fact, if the binary point
is to the right of bit number B, then we can represent (using unsigned
representation) the numbers from 0 up to just under 2^(32-B). To interpret
any bit pattern, just take its value as an unsigned integer and divide it by
2^B. So one way to think of a fixed-point number is an integer times
some negative power of two:
A * 2^(-B)
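To make this interpretation rule concrete, here is a small C sketch (the names FRAC_BITS, fixed_to_double, and double_to_fixed are made up for illustration; it assumes a 32-bit unsigned representation with B bits to the right of the binary point):

#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 1   /* B: the number of bits to the right of the binary point */

/* Interpret a bit pattern as (unsigned integer) * 2^(-B). */
double fixed_to_double(uint32_t bits)
{
    return (double)bits / (double)(1u << FRAC_BITS);
}

/* Convert a real value to the nearest fixed-point bit pattern. */
uint32_t double_to_fixed(double x)
{
    return (uint32_t)(x * (1u << FRAC_BITS) + 0.5);
}

int main(void)
{
    uint32_t bits = double_to_fixed(2.5);   /* bit pattern 101, i.e. 10.1 binary */
    printf("bits = %u, value = %g\n", (unsigned)bits, fixed_to_double(bits));
    return 0;
}

Running this prints bits = 5 and value = 2.5, matching the 10.1 example above.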
How is arithmetic done with fixed-point quantities? Addition is
straightforward - you can simply add the two bit patterns as integers
and the result has the binary point in the same place. However,
multiplication is trickier. Consider:
(x * 2^(-B)) * (y * 2^(-B)) = xy * 2^(-2B)
which means that the binary point is now at position 2B, rather than
position B as before. Here is an example. Using a fixed-point
representation in which the binary point is to the right of bit number
one, multiply 3.5 by 2.5:
11.1 = 3.5 (note that 111 = 7, and 3.5 = 7/2)
10.1 = 2.5 (note that 101 = 5, and 2.5 = 5/2)
Now, multiplying 111 by 101, we get 100011, so the binary point needs to
go at position 2 (giving 1000.11), which is the right answer of 8.75.
But our format only has one place to the right of the binary point, so to store the result we must shift it back into that form, dropping the low bit: 01000.1. This is called renormalization - putting the result of a calculation into the form expected for storage. So the result we get is actually 1000.1, which is 8.5 - an error of 0.25. This occurs because with only one place to the right of the binary point, we can only represent 8.5 or 9.0 -- nothing in between.
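Here is the same multiplication as a small C sketch (fixed_mul and FRAC_BITS are made-up names; a 64-bit intermediate is assumed so the full product fits before renormalization):

#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 1   /* B: bits to the right of the binary point */

/* Multiply two fixed-point values.  The raw integer product has 2*B
 * fraction bits, so we shift right by B to renormalize; the shift
 * truncates, which is where the 0.25 error above comes from. */
uint32_t fixed_mul(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;       /* binary point now at position 2*B */
    return (uint32_t)(product >> FRAC_BITS);  /* back to B fraction bits */
}

int main(void)
{
    uint32_t x = 7;                 /* 11.1 binary = 3.5 */
    uint32_t y = 5;                 /* 10.1 binary = 2.5 */
    uint32_t z = fixed_mul(x, y);   /* 100011 >> 1 = 10001 = 1000.1 binary */

    printf("result bits = %u, value = %g (true answer: 8.75)\n",
           (unsigned)z, z / (double)(1u << FRAC_BITS));
    return 0;
}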
Some terms:
Clearly representing numbers as different as 10^23 and 10^-23 using place value is very expensive - we'd need a lot of bits. So we take a lesson from the way scientists and engineers represent numbers - scientific notation. In scientific notation we represent a number as
significand * base^exponent
Since the exponent is a much smaller number than the number itself,
we save a lot of storage this way. In all floating point
representations, the base is fixed (almost always it is 2) so we don't
need to store the base either.
Thus, another definition of precision would be: the number of digits in the significand.
To completely specify a floating point representation, we need four numbers: the base (B), the precision (p), e-min, and e-max (smallest and largest exponents).
Floating point representations are by nature inexact. There are two reasons for this inexactness: first, most real values (1/3 in decimal, or 1/10 in binary) cannot be written exactly in any finite number of digits; second, the result of an operation on two representable numbers may itself not be representable, and so must be rounded.
Many representations could be used for the same number:
3.00 * 10^2
0.30 * 10^3
0.03 * 10^4
all of these encode the number 300. However, it is a bad thing to have
multiple representations for the same number, because of equality
comparisons. We want each number to have only one representation so
that we can compare for equality simply by comparing bit patterns.
For this reason we define what is called normalized form for an f.p. number. A number is in normalized form if its leftmost digit is non-zero (of course, in binary this means that the leftmost digit is 1). Normalized form makes our f.p. representations unique.
Of course, this makes it impossible to represent zero, so we define zero as a special case. We use the exponent (e-min - 1) to signify that the number is zero. This is a good choice, since it will be smaller than any number and so won't need special treatment during comparisons. Of course it will need special treatment during arithmetic -- if the exponent is equal to e-min - 1 the number has to be treated as if it were zero.
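As a toy illustration of these ideas (this is not IEEE 754; the format and all the names here are invented for the sketch), here is a C decoder for a decimal floating-point representation with base 10, precision p = 3, e-min = -5, and e-max = 5, where the reserved exponent e-min - 1 encodes zero:

#include <stdio.h>

#define BASE  10
#define PREC  3     /* three significand digits */
#define EMIN  (-5)
#define EMAX  5

/* sig holds the significand digits as an integer 100..999, representing
 * the normalized value d.dd (so 215 means 2.15).  The reserved exponent
 * EMIN - 1 means the number is zero. */
double decode(int sig, int exp)
{
    double value;

    if (exp == EMIN - 1)
        return 0.0;                 /* special case: zero */

    value = sig / 100.0;            /* 100..999  ->  1.00..9.99 */
    while (exp > 0) { value *= BASE; exp--; }
    while (exp < 0) { value /= BASE; exp++; }
    return value;
}

int main(void)
{
    printf("%g\n", decode(300, 2));        /* 3.00 * 10^2 = 300 */
    printf("%g\n", decode(125, -5));       /* 1.25 * 10^-5 */
    printf("%g\n", decode(0, EMIN - 1));   /* the zero encoding */
    return 0;
}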
To add or subtract two floating-point numbers, we:
1) Bring the significands into alignment (same exponent)
2) Do fixed-point (i.e., integer) addition / subtraction
3) Do floating-point normalization on the result
Examples.
  x = 2.15 * 10^12  =   2.15                  * 10^12
- y = 1.25 * 10^-5  = - 0.0000000000000000125 * 10^12
                        ------------------------------
                        2.1499999999999999875 * 10^12
which, when normalized, yields 2.15 * 10^12.
Another example:
  x = 2.15 * 10^12  =   2.15 * 10^12
- y = 1.25 * 10^12  = - 1.25 * 10^12
                        -------------
                        0.90 * 10^12
which, when normalized, yields 9.00 * 10^11.
A third example, this time with significands that are very close in value: subtract y = 9.93 * 10^0 from x = 1.01 * 10^1. To align the significands, y must be rewritten with exponent 1, which forces us to drop its last digit:

  x = 1.01 * 10^1  =   1.01 * 10^1
- y = 9.93 * 10^0  = - 0.99 * 10^1
                       ------------
               x - y = 0.02 * 10^1
which, when normalized, is 2.00 * 10^-1. However, the true (accurate)
answer is 1.70 * 10^-1. Our calculated answer is off by 30 ulps (units
in the last place), and is wrong in every position!
What happened -- well, notice that when we aligned the significands, we were forced to drop a digit from the representation of y. We lost an important digit, because x and y are very close in value, so they only differ significantly in the last digits.
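The same effect shows up with hardware floats. Here is a short C demonstration (the exact digits printed depend on the machine, but on an IEEE 754 system both results are visibly off in their last digits):

#include <stdio.h>

int main(void)
{
    /* Neither 10.1 nor 9.93 is exactly representable in binary, and
     * subtracting two nearly equal values amplifies that small
     * representation error. */
    float  xf = 10.1f, yf = 9.93f;
    double xd = 10.1,  yd = 9.93;

    printf("float : 10.1 - 9.93 = %.7f\n", xf - yf);
    printf("double: 10.1 - 9.93 = %.16f\n", xd - yd);
    printf("exact : 10.1 - 9.93 = 0.17\n");
    return 0;
}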
When the CPU fetches an fp instruction, it simply passes it to the FPU for execution. Most fp instructions take longer than a single cycle, so the CPU goes on ahead and fetches and executes subsequent instructions while the FPU is working.
This is called instruction-level parallelism. The CPU and the FPU can be working in parallel (that is, simultaneously) on different instructions.
Now, what happens if we try to use a register whose value the FPU hasn't finished computing yet? The FPU marks a register as invalid as soon as it is known to be the destination of an instruction. If the CPU tries to store from (or otherwise use) that register, it stalls (simply waits) until the FPU clears the invalid flag. So the programmer needn't worry about getting stale data, but the program might run slowly without the programmer being able to tell why.
The FPU can also compare fp numbers. It has its own set of condition codes, and the CPU can test those fp condition codes by means of special conditional branches, called floating-point branch instructions.
The FPU has 32 registers, %f0 through %f31. There are no register windows in the FPU. To access the registers, use the regular load and store instructions: ld and st. So we could write code such as:
ld  [%fp + 64], %f2    ! load a word from the stack into fp register %f2
set var_m, %o0         ! put the address of var_m into %o0
st  %f0, [%o0]         ! store fp register %f0 to that address
There are single, double, and extended precision fp ops. We will just
cover the single precision ops:
fadds     add
fsubs     subtract
fmuls     multiply
fdivs     divide
fsqrts    square root
fmovs     move data between fp registers
fnegs     negate
fabss     absolute value
fitos     convert integer to single float
fstoi     convert single float to integer
fcmps     compare
There is an additional case to consider for comparisons. Suppose one of
the operands is NaN (not-a-number). Then we cannot establish an ordering, so
there is an unordered case as well as less than, equal, and greater than.
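A brief C illustration of the unordered case (this assumes a C99 <math.h> that provides the NAN macro and isnan, as IEEE 754 systems do):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 1.0;
    double y = NAN;                    /* quiet not-a-number */

    /* Every ordered comparison involving a NaN is false ... */
    printf("x <  y : %d\n", x <  y);
    printf("x == y : %d\n", x == y);
    printf("x >  y : %d\n", x >  y);
    /* ... so the only relation that holds is "unordered". */
    printf("isnan(y): %d\n", isnan(y) != 0);
    return 0;
}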
ANSI C defines three floating-point types: float, double, and long double.
There is no guaranteed relationship between these and the types defined in
IEEE 754, because computers with C compilers do not necessarily support
IEEE 754. However, on the SPARC, the correspondence is as follows:
ANSI C          IEEE 754
------          --------
float           single precision
double          double precision
long double     extended precision
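One way to see what a particular machine actually provides for these three types is to print the constants from <float.h> (the numbers you get depend on the platform; this sketch makes no assumptions about them):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Size and number of significant decimal digits of each type. */
    printf("float      : %lu bytes, %d decimal digits\n",
           (unsigned long)sizeof(float), FLT_DIG);
    printf("double     : %lu bytes, %d decimal digits\n",
           (unsigned long)sizeof(double), DBL_DIG);
    printf("long double: %lu bytes, %d decimal digits\n",
           (unsigned long)sizeof(long double), LDBL_DIG);
    return 0;
}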
Here is an example program that illustrates some floating point error
issues, using ANSI C.
#include <stdio.h>
#include <math.h>

int main(void)
{
    long double d;
    float pi = 3.14159265358979323846f;
    double dpi = 3.14159265358979323846;
    long double ldpi = 3.14159265358979323846L;

    printf(" actual = 3.14159265358979323846\n");
    printf(" float pi = %.7f\n", pi);
    printf(" double pi = %.16f\n", dpi);
    printf("long double pi = %.35Lf\n", ldpi);
    printf("\n");
    printf(" float pi/10.0f = %.7f\n", pi/10.0f);
    printf(" double pi/10.0 = %.16f\n", dpi/10.0);
    printf("long double pi/10.0L = %.35Lf\n", ldpi/10.0L);
    printf("\n");
    printf(" float pi*0.1f = %.7f\n", pi*0.1f);
    printf(" double pi*0.1 = %.16f\n", dpi*0.1);
    printf("long double pi*0.1L = %.35Lf\n", ldpi*0.1L);
    printf("\n");
    d = ldpi*0.1L - ldpi/10.0L;
    printf("ldpi*0.1L - ldpi/10.0L = %.35Lg\n", d);
    return 0;
}
When we run this program on a SPARC machine, we get:
 actual = 3.14159265358979323846
 float pi = 3.1415927
 double pi = 3.1415926535897931
long double pi = 3.14159265358979323845999999999999997

 float pi/10.0f = 0.3141593
 double pi/10.0 = 0.3141592653589793
long double pi/10.0L = 0.31415926535897932384599999999999998

 float pi*0.1f = 0.3141593
 double pi*0.1 = 0.3141592653589793
long double pi*0.1L = 0.31415926535897932384600000000000003

ldpi*0.1L - ldpi/10.0L = 4.8148248609680896326399448564623183e-35
This example illustrates a number of points about IEEE 754 and ANSI C.
First of all, the default type for a floating-point constant is double; to
get a float or a long double you need to use a suffix (f or L). Second,
there are two kinds of rounding errors shown: representation error (none of
the three types can hold pi exactly, as the first group of output lines
shows), and arithmetic error (0.1 itself cannot be stored exactly in binary,
so multiplying by the stored 0.1 is not quite the same as dividing by 10,
which is why ldpi*0.1L - ldpi/10.0L is not zero).
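Here is one more small C demonstration of the representation error behind the second point (the exact sum printed depends on the float format, but on IEEE 754 single precision it will not be exactly 1):

#include <stdio.h>

int main(void)
{
    float sum = 0.0f;
    int i;

    /* 0.1 has no exact binary representation, so ten copies of the
     * stored value do not add up to exactly 1.0. */
    for (i = 0; i < 10; i++)
        sum += 0.1f;

    printf("sum         = %.9f\n", sum);
    printf("sum == 1.0f : %s\n", sum == 1.0f ? "true" : "false");
    return 0;
}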