Wednesday, September 25, 2024

AIML: Floating Point Arithmetic Representation

In the world of computing, floating point arithmetic is the commonly used way of representing data as real numbers. Depending on the use case and the application, the data processed by the GPU may be signed (-1027 to +1027) or unsigned (0 to 65535), and it may be a whole number (123) or a decimal with a fractional part (1.23). To address these disparate data types and support a wide range of values, a common representation was agreed upon and standardized by IEEE as the IEEE-754 format. The IEEE-754 FP32 datatype, commonly referred to as single precision floating point, uses 32 bits divided into a 1-bit sign, an 8-bit biased exponent and a 23-bit mantissa.


The IEEE-754 format comprises three parts, described below (a short decoding sketch follows the list):

  • Sign field - This single-bit field indicates whether the number is positive or negative: a value of 0 represents a positive number and a value of 1 a negative number.
  • Biased Exponent – This field encodes both positive and negative exponents by adding the actual exponent to a fixed bias, which avoids the need for a separate sign bit for the exponent. For the single precision FP32 datatype, the bias is 127, so a stored value of 128 results in an exponent of 1 while 126 results in an exponent of -1.
  • Mantissa – This is the fractional part of the normalized binary number, padded with trailing zeros to fill the 23-bit field.
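
To make these three fields concrete, below is a minimal Python sketch that splits a raw 32-bit pattern into its sign, biased exponent and mantissa fields and reconstructs the value of a normalized number (the helper name decode_fp32 and the example bit pattern are illustrative choices, not part of any library):

    def decode_fp32(bits: int):
        """Split a raw 32-bit IEEE-754 pattern into its fields (normalized numbers only)."""
        sign = (bits >> 31) & 0x1              # 1-bit sign
        biased_exponent = (bits >> 23) & 0xFF  # 8-bit biased exponent
        mantissa = bits & 0x7FFFFF             # 23-bit mantissa
        # Value of a normalized number: (-1)^sign x 1.mantissa x 2^(biased_exponent - 127)
        value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (biased_exponent - 127)
        return sign, biased_exponent, mantissa, value

    # Example: 0x42F60000 decodes to sign 0, biased exponent 133 (actual exponent 6) and value 123.0
    print(decode_fp32(0x42F60000))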

Example


To better understand the conversion and the format, let us take 129.375 as an example input and convert it into FP32 format.
  • As a first step, let us convert 129.375 into binary format. 
    • The binary value of 129 is 10000001
    • The binary value of 0.375 is 0.011
    • The dotted representation of 129.375 is 10000001.011 
  • Now 10000001.011 can be converted into exponential form and represented as 1.0000001011 x 2^7
  • Let’s use the above to represent the value in IEEE-754 FP32 format:
    • Sign = 0
    • Exponent value is 7. So biased exponent is 127+7 = 134 (10000110)
    • Normalized mantissa = 00000010110000000000000 (the fraction bits 0000001011 padded with trailing 0s to 23 bits)
  • The IEEE-754 format for FP32 single precision is 0 10000110 00000010110000000000000 (the sketch below verifies this result).
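
The same result can be checked programmatically. The short Python sketch below uses only the standard struct module to pack 129.375 as an IEEE-754 single precision value and print its sign, biased exponent and mantissa fields (the function name float_to_fp32_bits is just an illustrative choice):

    import struct

    def float_to_fp32_bits(value: float) -> str:
        """Return the FP32 bit pattern of value as 'sign exponent mantissa'."""
        # Pack as big-endian single precision, then reinterpret the same 4 bytes as a 32-bit unsigned integer.
        (bits,) = struct.unpack(">I", struct.pack(">f", value))
        b = f"{bits:032b}"
        return f"{b[0]} {b[1:9]} {b[9:]}"

    print(float_to_fp32_bits(129.375))
    # Prints: 0 10000110 00000010110000000000000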

Several online calculators can convert any value into FP32 format. One such calculator is available here - https://www.h-schmidt.net/FloatConverter/IEEE754.html

While the above example is explained with FP32, which is one of the commonly used formats, there are other floating-point formats used for deep learning. Below is a consolidated table listing the number of bits used for the sign, exponent and mantissa in each such format:

Format        Total bits    Sign    Exponent    Mantissa
FP64          64            1       11          52
FP32          32            1       8           23
FP16          16            1       5           10
BF16          16            1       8           7
FP8 (E4M3)    8             1       4           3
FP8 (E5M2)    8             1       5           2

Each of these formats supports a different range and precision, where range defines the limits of the numbers that can be represented (min to max) while precision defines the distance between successive representable numbers. While FP64 offers more range and precision compared to FP8, it is more compute intensive.
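
The difference in precision is easy to observe by round-tripping the same value through different storage formats. The Python sketch below uses only the standard struct module (the input value is an arbitrary example) to store a number as FP16, FP32 and FP64 and print what survives:

    import struct

    value = 129.3756789
    # 'e' = half precision (FP16), 'f' = single precision (FP32), 'd' = double precision (FP64)
    for name, fmt in (("FP16", "e"), ("FP32", "f"), ("FP64", "d")):
        stored = struct.unpack(fmt, struct.pack(fmt, value))[0]
        print(name, stored)
    # FP16 keeps roughly 3 decimal digits, FP32 about 7 and FP64 about 15-16.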
