Floating Point Number Representation
 Representing Real Numbers
 IEEE SinglePrecision Representation
 IEEE DoublePrecision Representation
 Further Reading
Representing Real Numbers
No human system of numeration can give a unique representation to every real number; there are just too many of them. So it is conventional to use approximations. For instance, the assertion that π (pi) is 3.14159 is, strictly speaking, false, since π is actually slightly larger than 3.14159; but in practice we sometimes use 3.14159 in calculations involving π because it is a good enough approximation of π.
One approach to representing real numbers, then, is to specify some
tolerance epsilon
and to say that a real number x
can be approximated by any number in the range from x 
epsilon
to x + epsilon
. Then, if a system of
numeration can represent selected numbers that are never more than twice
epsilon
apart, every real number has a representable
approximation. For instance, in the United States, the prices of stocks
are given in dollars and eighths of a dollar, and rounded to the nearest
eighth of a dollar; this corresponds to a tolerance of onesixteenth of a
dollar. In retail commerce, however, the conventional tolerance is half a
cent; that is, prices are rounded to the nearest cent. In this case, we
can represent a sum of money as an whole number of cents, or equivalently
as a number of dollars that is specified to two decimal places.
Fixedpoint Representation
Such a representation, in which each real number is represented by a numeral for an approximation to some fixed number of decimal places, is called a fixedpoint representation. However, fixedpoint representations are unsatisfactory for most applications involving real numbers, for two reasons:

Real numbers that are very small are not clearly distinguished.
For instance, in a fixedpoint representation using two decimal
places, 14/1000 and 6/1000 would both be represented as
0.01
, even though the former is more than twice as large as the latter. Moreover, both 4/1000 and 7/10000 would be represented as0.00
, and they don't even have the same sign! If one is counting dollars, these differences are probably irrelevant, but there are a lot of scientific and technological applications (navigation systems, for instance) in which they are critical.  In many cases, real numbers that are very large are not known accurately enough for a fixedpoint representation to be selected. For instance, the speed of light in a vacuum is roughly 299792458 meters per second, but no one knows whether it is best approximated by 299792458.63 or 299792458.16 or 299792457.88, and there is no reason to prefer one of these fixedpoint representations to the others.
These two considerations really have the same underlying cause: When approximating some real number, what we usually know is some number of significant digits at the beginning of the number. A fixedpoint system of representation conceals some of what we know when the number is small, because some of these significant digits are too far to the right of the decimal point, and demands more than we know when the number is large, because the number of digits demanded by the representation system is greater than the number that we can provide.
Scientific Notation
Scientists and engineers long ago learned to cope with this problem by
using scientific notation, in which a number is expressed as the
product of a mantissa and some power of ten. The mantissa is a
signed number with an absolute value greater than or equal to one and less
than ten. So, for instance, the speed of light in vacuum is
2.99792458 x 10^{8}
meters per second, and one can
specify only the digits about which one is completely confident.
Using scientific notation, one can easily see both that 1.4 x
10^{2}
is more than twice as large as 6 x
10^{3}
, and that both are close to 1 x
10^{2}
; and one can easily distinguish 4 x
10^{3}
and 7 x 10^{4}
as small
numbers of opposite sign. The rules for calculating with
scientificnotation numerals are a little more complicated, but the
benefits are enormous.
The three things that vary in scientific notation are
 the sign,
 the absolute value of the mantissa, and
 the exponent on the power of ten.
A system of numeration for real numbers that is adapted to computers will typically store the same three data  a sign, a mantissa, and an exponent  into an allocated region of storage. By contrast with fixedpoint representations, these computer analogues of scientific notation are described as floatingpoint representations.
The exponent does not always indicate a power of ten; sometimes powers
of sixteen are used instead, or, most commonly of all, powers of two.
The numerals will be somewhat different depending how this choice is
made. For instance, the real number 0.125 will be expressed
as 1.25 x 10^{1}
if powers of ten are used, or
as 2 x 16^{1}
if powers of sixteen are used, or
as 1 x 2^{3}
if powers of two are used. The
absolute value of the mantissa is, however, always greater than or
equal to 1 and less than the base of numeration.
The particular system used on MathLAN computers was formulated and
recommended as a standard by the Institute of
Electrical and Electronics Engineers and is the most commonly used
numeration system for computer representation of real numbers.
Actually, their standard includes several variants of the system,
depending on how much storage is available for a real number. We'll
discuss two of these variants, both of which use binary numeration and
powers of 2: the IEEE singleprecision representation, which
fits in thirtytwo bits, and the IEEE doubleprecision
representation, which occupies sixtyfour bits. The C standard does
not specify how much precision
its float
or double
types must provide. However,
most modern computers and compilers follow this standard.
IEEE SinglePrecision Representation
We'll begin with singleprecision numbers. In the IEEE singleprecision
representation of a real number, one bit is reserved for the sign, and
it is set to 0
for a positive number and to 1
for a negative. A representation of the exponent is stored in the
next eight bits, and the remaining twentythree bits are occupied by a
representation of the mantissa of the number.
++++ ±exponent mantissa  ++++ 1 8 23 bits
The exponent, which is a signed integer in the range from 126 to 127, is represented neither as a signed magnitude nor as a twoscomplement number, but as a biased value. The idea here is that the integers in the desired range of exponents are first adjusted by adding a fixed "bias" to each one. The bias is chosen to be large enough to convert every integer in the range into a positive integer, which is then stored as a binary numeral. The IEEE singleprecision representation uses a bias of 127.
For example, the exponent 5 is represented by the eightbit pattern
01111010
, which is the binary numeral for 122, since 5 + 127
= 122. The least exponent, 126, is represented by 00000001
(since 126 + 127 = 1); the greatest exponent, 127, is represented by the
binary numeral for 254, 11111110
.
Because the base of numeration is two, the mantissa is always a number greater than or equal to one and less than two. Its fractional part is represented, using binary numeration, as a sum of the negative powers of two. Only the part of the mantissa that comes after the binary point is actually stored, because the bit to the left of the binary point is completely predictable. It is always 1, because the mantissa is always greater than or equal to one and less than two. This suppressed digit at the beginning of the mantissa is called the hidden bit.
Thus, the mantissa's stored columns represent the following numbers.
2^{1}  2^{2}  2^{3}  2^{4}  2^{5}  ...  2^{22}  2^{23}
For instance, the binary numeral for 21/16 is 1.0101  one, plus no
halves, plus one quarter, plus no eighths, plus one sixteenth. (The
"decimal point" in this case is actually a "binary point," separating
the digit in the units place from the digits representing multiples of
negative powers of two.) Dropping the leading hidden bit, the mantissa
would be stored as 0101
followed by nineteen zeros.
Somewhat surprisingly, this means that approximations are used for many real numbers that can be represented exactly in decimal numeration. For instance, 7/5 is 1.4 exactly in decimal numeration, but the .4 part cannot be expressed as a sum of powers of two; 7/5 has an infinite binary expansion 1.0110011001100110011, just as in decimal numeration a fraction like 27/11 leads to an infinite decimal expansion 2.4545. In a singleprecision representation, the expansion is rounded off at the twentythird digit after the binary point. Thus 7/5 is not actually stored as 7/5, but (in single precision) as the very close approximation 11744051/8388608, which can be expressed as a sum of powers of two.
Here's an example. Consider the thirtytwobit word
1 10000111 00101100000000000000000
The number represented by this sequence of bits is negative (because the
sign bit is 1
). It has an exponent of 8 (because the exponent
bits, 10000111
, form the binary numeral for 135, and therefore
represent the exponent 8 when the bias of 127 is removed). Its mantissa is
expressed by the binary numeral 1.00101100000000000000000
,
where the initial 1
is the hidden bit and the remaining digits
are taken from the right end of the word. This last binary numeral
expresses the number 75/64 (one, plus no halves, plus no quarters, plus one
eighth, plus no sixteenths, plus one thirtysecond, plus one sixtyfourth).
So the complete number is (75/64) x 2^{8}, which is 300.0.
Conversely, let us find the IEEE singleprecision representation of, say,
5.75. The sign bit is 0
, since the number is positive.
5.75 is 23/4; to express this as the product of a power of two
and a mantissa greater than or equal to one and less than two, one must
factor out 2^{2}: 23/4 = (23/16) x 2^{2}, so the
exponent is 2. It will be stored, with a bias of 127,
as 10000001
(the binary numeral for 129). The complete
mantissa, extended to twentythree digits after the binary point,
is 1.01110000000000000000000
. We line up the sign bit,
the biased representation of the exponent, and the digits following the
binary point in the mantissa and get
0 10000001 01110000000000000000000
The greatest number that has an exact IEEE singleprecision representation is 340282346638528859811704183484516925440.0 (2^{128} – 2^{104}), which is represented by
0 11111110 11111111111111111111111
and the least is the negative of this number, which has the same
representation except that the sign bit is 1
.
What About Zero?
The alert reader will have noticed that there is a serious gap in this
scheme, as so far described: What about 0.0? It is not possible to
represent 0.0 as the product of a power of two and a mantissa
greater than or equal to one, so none of the representations described
above will do. However, not all the possible settings of thirtytwo
switches have been used for numbers. Recall that the exponents permitted
in IEEE singleprecision reals range from 126 to +127, so that the binary
numerals for the biased exponents range from 00000001
to
11111110
. We haven't yet used any of the bit patterns in
which the exponent bits are all zeroes or all ones.
In the IEEE system, the allzero exponent is used for numbers that are very close to zero  closer than 2^{126}, which is the least of the positive reals that can be represented in the part of the system described above. It is
0 00000001 00000000000000000000000
Such tiny numbers are expressed in a slightly different form of
scientific notation: The exponent is held fixed at 126, and the
mantissa is a number greater than or equal to zero and less
than one. So, for instance, the mantissa used for 3 x
2^{129} is 0.011
(3 x 2^{129} is a
quarter plus an eighth of 2^{126}). Once again, only the part
of the mantissa that follows the binary point is stored explicitly, so
the representation of 3 x 2^{129} is
0 00000000 01100000000000000000000
(sign positive, exponent 126, mantissa
0.01100000000000000000000
). Mantissas less than one are
said to be "unnormalized" (because the "normal form" is the one in
which the mantissa is greater than or equal to one and less than the
base of numeration), so an allzero exponent indicates an unnormalized
number.
Because the hidden bit for unnormalized numbers is zero, they have one fewer significant digits in the mantissa. As a result, unnormalized numbers are stored with slightly less accuracy than normalized numbers. However, without this special convention for the allzero exponent it would not be possible to represent them at all; the designers of the IEEE standard felt that a degraded approximation is better than none.
This convention allows zero to be represented in two different ways. In
one, the sign bit is 0
, the exponent bits are
00000000
, and the visible bits of the mantissa are
00000000000000000000000
—yielding
+0.00000000000000000000000 x 2^{126}, or 0. The other
representation is the same, except that the sign bit is 1
.
(So, as in signedmagnitude representations, there is a "nonnegative"
and a "nonpositive" zero.)
The least positive real number that can be represented exactly in this way is 2^{149}, which is stored as
0 00000000 00000000000000000000001
Other Special Values
When the exponent bits are all ones, the value represented is not a real number at all, but a conventional signal of a computation that has gone wrong, either by going above the greatest representable real or below the least, or by attempting some undefined arithmetic operation, such as dividing by zero or taking the logarithm of a negative number. For instance, the thirtytwo bits
0 11111111 00000000000000000000000
represent "positive infinity", a pseudonumber that indicates that some
unrepresentably large quantity was generated by an arithmetic
operation. Changing the sign bit to 1
yields a
representation of negative infinity, an indication of a similar problem
at the other end of the range. If some of the bits of the mantissa
are 1
s, the pseudonumber is called
a NaN
("Not a Number"); trying to compute 0/0,
for instance, typically produces a NaN
. It should be
clear that positive infinity, negative infinity, and NaN
are not real numbers, although some programming languages will
try to do something sensible if they appear in places that are normally
occupied by real numbers. The value of such attempts at recovery is
questionable, however, since the appearance of a pseudonumber is
supposed to be a danger signal to the programmer and usually results
from a programming error.
IEEE DoublePrecision Representation
IEEE doubleprecision representations are quite similar. A
doubleprecision real begins with a sign bit, with the same interpretation
as in a singleprecision representation. The next eleven bits are used for
the exponent, which is an integer in the range from 1022 to +1023; a bias
of 1023 is added to the exponent, and the result is stored in a binary
numeral (the smallest is 00000000001
, the largest
11111111110
). The remaining fiftytwo bits are used for
the mantissa, and as above only the digits following the binary point
of the mantissa are actually stored; the 1
that precedes
the binary point is once again a "hidden bit." As in singleprecision
representations, the allzero exponent is used for unnormalized numbers
and (with an allzero mantissa) for 0, and the allone exponent is used
for the pseudonumbers positive infinity, negative infinity, and NaN.
The greatest real number that can be represented exactly as a
doubleprecision real is 2^{1024} – 2^{971}, and
the least positive real that can be so represented is
2^{1074}.
Most C compilers for modern computers provide IEEE doubleprecision
representations by means of the double
data type.
Further Reading
Those interested in learning more should read (according to the title)
Goldberg, D. (1991). What every computer scientist should know about floatingpoint arithmetic. ACM Computing Surveys, 23(1), 548.