Libraries and Floating Point Support Guide

Single precision data type for IEEE 754 arithmetic

A float value is 32 bits wide. The structure is shown in Figure 1.

Figure 1. IEEE 754 single-precision floating-point format

  Bit 31: S (1 bit) | Bits 30-23: Exp (8 bits) | Bits 22-0: Frac (23 bits)

The S field gives the sign of the number. It is 0 for positive, or 1 for negative.

The Exp field gives the exponent of the number, as a power of two. It is biased by 0x7F (127), so that very small numbers have exponents near zero and very large numbers have exponents near 0xFF (255).

So, for example:

  • if Exp = 0x7D (125), the number is between 0.25 and 0.5 (not including 0.5)

  • if Exp = 0x7E (126), the number is between 0.5 and 1.0 (not including 1.0)

  • if Exp = 0x7F (127), the number is between 1.0 and 2.0 (not including 2.0)

  • if Exp = 0x80 (128), the number is between 2.0 and 4.0 (not including 4.0)

  • if Exp = 0x81 (129), the number is between 4.0 and 8.0 (not including 8.0).
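The field layout and the exponent ranges listed above can be checked on a host machine with a short C sketch. The helper names float_bits, float_sign, float_exp, and float_frac are hypothetical and are not part of the ARM libraries. The sketch assumes that float is the 32-bit IEEE 754 single-precision type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the 32-bit IEEE 754 single-precision type. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: copy the bit pattern of a float into an integer.
   memcpy avoids the aliasing problems of a pointer cast. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* S field: bit 31. */
unsigned float_sign(float f)
{
    return (unsigned)(float_bits(f) >> 31);
}

/* Exp field: bits 30 to 23, still biased by 0x7F. */
unsigned float_exp(float f)
{
    return (unsigned)((float_bits(f) >> 23) & 0xFFu);
}

/* Frac field: bits 22 to 0. */
uint32_t float_frac(float f)
{
    return float_bits(f) & 0x7FFFFFu;
}
```

For example, float_exp(0.3f) returns 0x7D because 0.3 lies between 0.25 and 0.5, and float_exp(3.0f) returns 0x80 because 3.0 lies between 2.0 and 4.0.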

The Frac field gives the fractional part of the number. It usually has an implicit 1 bit on the front, which is not stored in order to save space.

So if Exp is 0x7F, for example:

  • if Frac = 00000000000000000000000 (binary), the number is 1.0

  • if Frac = 10000000000000000000000 (binary), the number is 1.5

  • if Frac = 01000000000000000000000 (binary), the number is 1.25

  • if Frac = 11000000000000000000000 (binary), the number is 1.75.
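The Frac examples above can be reproduced by assembling a bit pattern from its three fields. This is a minimal sketch. The function name make_float is hypothetical, and the code assumes that float is the 32-bit IEEE 754 single-precision type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the 32-bit IEEE 754 single-precision type. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: build a float from its S, Exp and Frac fields.
   Each argument is masked to the width of its field. */
float make_float(unsigned s, unsigned exp, uint32_t frac)
{
    uint32_t bits = ((uint32_t)(s & 1u) << 31)      /* S:    bit 31       */
                  | ((uint32_t)(exp & 0xFFu) << 23) /* Exp:  bits 30 to 23 */
                  | (frac & 0x7FFFFFu);             /* Frac: bits 22 to 0  */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With Exp set to 0x7F, the 23-bit binary patterns in the list correspond to the hexadecimal Frac values 0x000000, 0x400000, 0x200000, and 0x600000, so make_float(0, 0x7F, 0x400000) gives 1.5 and make_float(0, 0x7F, 0x600000) gives 1.75.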

So in general, the numeric value of a bit pattern in this format is given by the formula:

(-1)^S * 2^(Exp - 0x7F) * (1 + Frac * 2^(-23))

Numbers stored in this form are called normalized numbers.
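The formula for normalized numbers can be evaluated directly and compared with the value the hardware or library reports. The following is a sketch only. The name decode_normalized is hypothetical, the code assumes that float is the 32-bit IEEE 754 single-precision type, and it is valid only when Exp is between 1 and 254.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the 32-bit IEEE 754 single-precision type. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: evaluate
       (-1)^S * 2^(Exp - 0x7F) * (1 + Frac * 2^(-23))
   for a normalized number (Exp in the range 1 to 254).
   The arithmetic is done in double, which holds every step exactly. */
double decode_normalized(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    unsigned s    = (unsigned)(bits >> 31);
    int      exp  = (int)((bits >> 23) & 0xFFu);
    uint32_t frac = bits & 0x7FFFFFu;

    /* 1 + Frac * 2^(-23); 8388608 is 2^23. */
    double value = 1.0 + (double)frac / 8388608.0;

    /* Scale by 2^(Exp - 0x7F) using repeated multiply or divide by 2. */
    int e = exp - 0x7F;
    while (e > 0) { value *= 2.0; e--; }
    while (e < 0) { value /= 2.0; e++; }

    return s ? -value : value;
}
```

Because every intermediate result is exactly representable in a double, decode_normalized(x) is expected to compare equal to (double)x for any normalized float x.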

The minimum and maximum exponent values, 0 and 255, are special cases. Exponent 255 is used to represent infinity and to store Not a Number (NaN) values. Infinity can occur as a result of dividing a nonzero number by zero, or as a result of computing a value that is too large to store in this format. NaN values are used for special purposes, such as representing the result of an invalid operation like dividing zero by zero. Infinity is stored by setting Exp to 255 and Frac to all zeros. If Exp is 255 and Frac is nonzero, the bit pattern represents a NaN.
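The Exp = 255 rules can be expressed as two tests on a raw bit pattern. This is an illustrative sketch. The names float_bits, is_infinity_bits, and is_nan_bits are hypothetical, and the code assumes that float is the 32-bit IEEE 754 single-precision type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the 32-bit IEEE 754 single-precision type. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: copy the bit pattern of a float into an integer. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Infinity: Exp is 255 and Frac is all zeros. The S bit selects
   positive or negative infinity and is ignored here. */
int is_infinity_bits(uint32_t bits)
{
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) == 0;
}

/* NaN: Exp is 255 and Frac is nonzero. */
int is_nan_bits(uint32_t bits)
{
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) != 0;
}
```

For example, 0x7F800000 and 0xFF800000 are positive and negative infinity, while 0x7FC00000 has Exp = 255 and a nonzero Frac, so it is a NaN.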

Exponent 0 is used to represent very small numbers in a special way. If Exp is zero, then the Frac field has no implicit 1 on the front. This means that the format can store 0.0, by setting both Exp and Frac to all 0 bits. It also means that numbers that are too small to store using Exp >= 1 are stored with less precision than the ordinary 23 bits. These are called denormals.
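The Exp = 0 case can be sketched in the same way. This section does not give a formula for it, so the code below relies on the standard IEEE 754 interpretation, in which the value is (-1)^S * 2^(-126) * (Frac * 2^(-23)), that is, Frac * 2^(-149). The names is_denormal and decode_exp_zero are hypothetical, and the code assumes that float is the 32-bit IEEE 754 single-precision type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is the 32-bit IEEE 754 single-precision type. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: a denormal has Exp equal to 0 and a nonzero Frac.
   Exp = 0 with Frac = 0 is zero itself, which is not a denormal. */
int is_denormal(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return ((bits >> 23) & 0xFFu) == 0 && (bits & 0x7FFFFFu) != 0;
}

/* Hypothetical helper: decode a bit pattern whose Exp field is 0.
   There is no implicit 1 bit, so the value is Frac * 2^(-149)
   (standard IEEE 754 interpretation, assumed here).
   The caller must ensure that Exp is 0. */
double decode_exp_zero(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    unsigned s    = (unsigned)(bits >> 31);
    uint32_t frac = bits & 0x7FFFFFu;

    double value = (double)frac;
    for (int i = 0; i < 149; i++) {
        value /= 2.0;   /* exact in double at every step */
    }
    return s ? -value : value;
}
```

With Frac = 1 the result is 2^(-149), the smallest positive value the format can store. With Frac = 0 the result is 0.0, which matches the all-zero bit pattern described above.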

Copyright © 2007-2008, 2011-2012 ARM. All rights reserved. ARM DUI 0378D
Non-Confidential ID062912
