
Wednesday, September 25, 2024

AIML: Floating Point Arithmetic Representation

In the world of computing, floating point arithmetic is the commonly used representation of data as real numbers. Depending on the use case and the application, the data processed by the GPU may be a signed number (for example, -32768 to +32767) or an unsigned number (0 to 65535). It can also be a whole number (123) or a decimal with a fractional part (1.23). To address these disparate data types and support a wide range of values, a common representation was agreed upon and standardized by the IEEE as the IEEE-754 format. The layout of the IEEE-754 FP32 datatype, commonly referred to as single-precision floating point, is shown below:


The IEEE-754 format comprises three parts, as below:

  • Sign field – This field indicates whether the number is positive or negative. It is a single bit, with a value of 0 representing a positive number and a value of 1 a negative number.
  • Biased Exponent – This field represents the exponent as either positive or negative by storing the actual exponent plus a fixed bias, which avoids the need for a separate sign bit for the exponent. For the single-precision FP32 datatype, the bias is 127, so a stored value of 128 results in an exponent of 1 while 126 results in an exponent of -1.
  • Mantissa – This is the fractional part of the normalized real number represented in binary (the leading 1 is implicit and not stored), padded with trailing zeros to fill the width of the field.

Example


To better understand the conversion and the format, let us take an example of 129.375 as input and convert the same into FP32 format.
  • As a first step, let us convert 129.375 into binary format. 
    • The binary value of 129 is 10000001
    • The binary value of 0.375 is 0.011
    • The dotted representation of 129.375 is 10000001.011 
  • Now 10000001.011 can be converted into exponential form and represented as 1.0000001011 x 2^7
  • Let’s use the above to represent the value in IEEE-754 FP32 format:
    • Sign = 0
    • Exponent value is 7. So biased exponent is 127+7 = 134 (10000110)
    • Normalized mantissa = 00000010110000000000000 (the implicit leading 1 is dropped and 0s are appended to reach 23 bits)
  • The IEEE-754 format for FP32 single precision is 0 10000110 00000010110000000000000.
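
To cross-check the conversion, the short Python sketch below (using only the standard struct module; the function name fp32_bits is just an illustrative choice) prints the FP32 bit pattern of any value:

    import struct

    def fp32_bits(value):
        # Pack the value as a big-endian IEEE-754 single-precision float and
        # reinterpret the same 4 bytes as an unsigned 32-bit integer.
        raw = struct.unpack(">I", struct.pack(">f", value))[0]
        bits = format(raw, "032b")
        # Split into sign (1 bit), biased exponent (8 bits) and mantissa (23 bits).
        return bits[0] + " " + bits[1:9] + " " + bits[9:]

    print(fp32_bits(129.375))   # prints: 0 10000110 00000010110000000000000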

Several online calculators are available to convert any value into FP32 format. One such calculator is available here - https://www.h-schmidt.net/FloatConverter/IEEE754.html

While the above example uses FP32, which is one of the most commonly used formats, there are other floating-point formats used for deep learning. Below is a consolidated table defining the number of bits used for the sign, exponent and mantissa in each such format:



Each of these formats supports a different range and precision, where range defines the limits of the representable numbers (min to max) while precision defines the distance between successive representable numbers. While FP64 offers more range and precision compared to FP8, it is considerably more compute intensive.
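
As a rough illustration of this trade-off, the NumPy sketch below (assuming NumPy is installed; FP8 and BF16 are omitted because base NumPy does not expose them) compares the largest representable value and the gap between adjacent values near 1.0:

    import numpy as np

    # Compare the range (largest representable value) and precision (gap to the
    # next representable number above 1.0) of the formats NumPy exposes.
    for dtype in (np.float16, np.float32, np.float64):
        info = np.finfo(dtype)
        print(info.dtype, "max:", info.max, "spacing at 1.0:", np.spacing(dtype(1.0)))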

Saturday, October 28, 2023

Generative AI - Transformer Architecture

Introduction

    One of the recent buzzwords in the industry, spanning different verticals, is “Generative AI”. Generative AI, or GenAI for short, is a type of Artificial Intelligence (AI) that produces new content such as text, images, audio and synthetic data based on what it has learned from existing content.



GenAI is a subset of deep learning that leverages a foundation model: a large language model (LLM) pre-trained on a large quantity of data (petabytes) with numerous parameters (billions), which can then produce different downstream outcomes.

The input used to train the foundation model can be documents, websites, files, etc., which are natural language inputs. As readers may be aware, any interaction with an AI/ML model, such as training the model or sending a query and receiving a response, is performed using numeric values. Accordingly, natural language processing (NLP) techniques are used to convert the text in the documents, websites, etc. into numeric values. Some of the firmly established, state-of-the-art techniques used for NLP modeling are the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The sequential nature of modeling language with these techniques comes with its own disadvantages, such as:

  • Costly and time-consuming labeling of data to train the model
  • Slowness due to lack of parallel processing
  • Difficulty handling long sequences

In this article, we will discuss the revolutionary Transformer architecture, which acts as the fundamental concept behind Large Language Models.


Basic Architecture

    The Transformer architecture was originally published in the 2017 paper “Attention Is All You Need” and drastically improved the performance of LLM models. Unlike traditional techniques such as RNN or LSTM, the Transformer leverages a mathematical way of finding the pattern and relevancy between elements, which eliminates the need to label the data for training. This mathematical technique, referred to as the attention map or self-attention, allows the model to identify the relevancy of each element by creating a matrix-like map and assigning a different attention score to each element in the map. Furthermore, this mathematical method naturally lends itself to parallel data processing, making it much faster compared to the traditional techniques. We will dive deeper into the architecture and explain the concept.

The Transformer architecture comprises two distinct components, as below:
  • Encoder
  • Decoder
A simple pictorial representation of the architecture is as below:



Encoder Component


    The Encoder component comprises six identical layers, where each layer has two sub-layers. The encoders encode the input sequence, enriched with additional details, into a sequence of continuous representations as it is passed through each layer within the encoder stack. While each encoder layer within the stack applies the same transformation (or attention mapping) logic to the input sequence, each layer uses different weight and bias parameters. The initial layers identify basic patterns while the final layers perform more advanced mappings. The resulting output is fed to the decoder.

If we zoom into the encoder layer, we can see that it is made of two sub-layers as below:
  • Multi-Head Attention sub-layer
  • Feed Forward Neural Network sub-layer 

A simple pictorial representation of the encoder layer is as below:



    The input data is fed as tokenized vectors in the form of Query, Key and Value (Data Tokenization is available here), which are passed through multiple attention heads. Each head performs a similar calculation to derive an attention score, and the scores are then merged to produce the final score for this encoder layer. The output from each encoder layer is represented as an attention vector that helps identify the relationships between the different elements in the sequence.
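
To make the attention calculation concrete, here is a minimal NumPy sketch of the scaled dot-product attention computed inside each head (the shapes and values are purely illustrative, and the learned projections of the multi-head sub-layer are omitted):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                          # relevancy of each query to each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax -> attention scores
        return weights @ V                                       # weighted sum of the values

    # Toy example: a sequence of 4 tokens, each represented by an 8-dimensional vector.
    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(4, 8))
    print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)

In the full multi-head sub-layer, this computation is repeated with several independently learned projections of Q, K and V, and the per-head outputs are concatenated and projected to form the layer output.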

The encoded output from the self-attention sub-layer is normalized and fed to the next sub-layer, the feed-forward network. The feed-forward neural network transforms the attention vector into a form that is acceptable to the next encoder layer or to the decoder.
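
A sketch of this normalize-and-feed-forward step might look as follows (the hidden dimension of 32 and all weights are arbitrary illustrative choices, and the exact placement of the normalization varies between implementations):

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each token vector to zero mean and unit variance.
        return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

    def feed_forward(x, W1, b1, W2, b2):
        # Two linear transformations with a ReLU in between, applied
        # independently to every position (token) in the sequence.
        return np.maximum(0, x @ W1 + b1) @ W2 + b2

    # Toy shapes: 4 tokens, model dimension 8, hidden dimension 32.
    rng = np.random.default_rng(1)
    x = rng.normal(size=(4, 8))                             # output of the attention sub-layer
    W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
    W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
    out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))   # residual connection + normalization
    print(out.shape)                                        # (4, 8) - ready for the next layer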


Decoder Component

The Decoder component comprises six identical layers, where each layer has three sub-layers. The decoders learn to decode the encoded representation in order to perform different tasks.



The decoder component is fed two types of data, as below:
  • Attention vector (from Encoder)
  • Target Sequence (Encoded as Q, K, V)

The Masked Multi-Head Attention sub-layer performs functionality similar to that of the encoder, calculating scores to identify the relevance of and relationships between the elements in the sequence. While this appears similar to what the encoder does, there is a difference: when processing the target sequence, the attention sub-layer of the decoder must not have access to future words. For example, if the target sequence is “The sun sets in the west”, the attention layer should mask “west” while using “The sun sets in the” to predict the next word. This is why the layer is named the Masked Multi-Head Attention sub-layer.
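
A minimal sketch of such a causal (look-ahead) mask applied to the attention scores could look like this (function and variable names are illustrative):

    import numpy as np

    def masked_attention(Q, K, V):
        # Same scaled dot-product attention as in the encoder, but each position
        # is prevented from attending to later (future) positions.
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal = future
        scores = np.where(mask, -1e9, scores)                   # future positions get ~ -infinity
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V

    # Toy target sequence of 6 tokens (e.g. "The sun sets in the west"), model dimension 8.
    rng = np.random.default_rng(2)
    Q = K = V = rng.normal(size=(6, 8))
    print(masked_attention(Q, K, V).shape)                      # (6, 8)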

The output of this sub-layer is normalized and fed to the next sub-layer, the Encoder-Decoder Attention. This sub-layer receives the representation of the target sequence as the Query and the encoder output as the Key and Value, and it computes an attention score for each target-sequence element, influenced by the attention vector received from the encoder.

This is further normalized by the normalization step and passed to the feed-forward sub-layer, which in turn produces the final output vector.
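
Putting these pieces together, a sketch of the encoder-decoder (cross) attention step might look like this (all names and shapes are hypothetical):

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention (same computation as in the encoder section).
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (weights / weights.sum(axis=-1, keepdims=True)) @ V

    # Hypothetical shapes: encoder output for a 4-token source sequence and the
    # decoder representation of a 6-token target sequence (model dimension 8).
    rng = np.random.default_rng(3)
    enc_out = rng.normal(size=(4, 8))
    dec_x = rng.normal(size=(6, 8))

    # Cross-attention: Queries come from the decoder (target sequence),
    # Keys and Values come from the encoder output.
    print(attention(dec_x, enc_out, enc_out).shape)   # (6, 8) - one enriched vector per target token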