Saturday, October 28, 2023

Generative AI - Transformer Architecture

Introduction

    One of the recent buzzwords in the industry, spanning different verticals, is “Generative AI”. Generative AI, or GenAI for short, is a type of Artificial Intelligence (AI) that produces new content such as text, images, audio and synthetic data based on patterns learned from existing content.



GenAI is a subset of deep learning that leverages a foundation model: a large language model (LLM) pre-trained on a large quantity of data (petabytes) with numerous parameters (billions) to produce different downstream outcomes.

The input used to train the foundation model can be documents, websites, files, etc., which are natural language-based inputs. As readers might be aware, any interaction with AI/ML models, such as training the model or sending a query and receiving a response, is performed using numeric values. Accordingly, natural language processing (NLP) techniques are used to convert the text in the documents, websites, etc. into numeric values. Some of the firmly established, state-of-the-art techniques used for NLP modeling are the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The sequential nature of modeling language with these techniques comes with its own disadvantages, such as:

  • Costly and time-consuming labeling of data to train the model
  • Slowness due to the lack of parallel processing
  • Difficulty handling long sequences

In this article, we will discuss the revolutionary Transformer architecture, which acts as the fundamental concept behind Large Language Models.


Basic Architecture

    The transformer architecture was originally introduced in the 2017 paper Attention Is All You Need to drastically improve the performance of language models. Unlike traditional techniques such as RNN or LSTM, the transformer leverages a mathematical way of finding the patterns and relevancy between elements, which allows the model to learn these relationships directly from raw text without manually labeled training data. This mathematical technique, referred to as self-attention (or an attention map), allows the model to identify the relevancy of each element by creating a matrix-like map and assigning a different attention score to each element in the map. Further, the method naturally lends itself to parallel data processing, making it much faster compared to the traditional techniques. We will dive deeper into the architecture and explain the concept.
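Below is a minimal NumPy sketch of this idea (an illustration, not the paper's exact formulation): attention scores are computed between every pair of elements and arranged in a matrix, and each element's output becomes a weighted blend of all the value vectors. The sequence length, dimensions and random inputs are illustrative assumptions only.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Similarity of every query with every key -> (seq_len x seq_len) score matrix
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        # Softmax turns the scores into attention weights that sum to 1 per row
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Each output element is a weighted blend of all the value vectors
        return weights @ V, weights

    # Toy example: 4 tokens, each represented by an 8-dimensional vector
    np.random.seed(0)
    x = np.random.randn(4, 8)
    output, attention_map = scaled_dot_product_attention(x, x, x)
    print(attention_map.round(2))  # 4x4 map: how strongly each token attends to every other token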

The transformer architecture comprises two distinct components:
  • Encoder
  • Decoder
A simple pictorial representation of the architecture is shown below:



Encoder Component


    The Encoder component comprises six identical layers, each of which has two sub-layers. The encoders encode the input sequence, enrich it with additional details, and convert it into a sequence of continuous representations as it passes through each layer in the encoder stack. While each encoder layer in the stack applies the same transformation (or attention-mapping) logic to the input sequence, each layer uses its own weight and bias parameters. The initial layers identify basic patterns while the final layers perform more advanced mappings. The resulting output is fed to the decoder.
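As a rough sketch of this stacking (purely illustrative), the snippet below passes a toy input through six layers, each holding its own randomly initialized parameters; encoder_layer here is a hypothetical placeholder whose real sub-layers are sketched further below.

    import numpy as np

    def encoder_layer(x, params):
        # Placeholder transformation; the actual sub-layers (attention + feed-forward)
        # are sketched in the following sections.
        return np.tanh(x @ params)

    np.random.seed(0)
    x = np.random.randn(4, 8)                                  # 4 input tokens, 8-dim representations
    layer_params = [np.random.randn(8, 8) for _ in range(6)]   # separate weights for each of the 6 layers

    for params in layer_params:   # the sequence flows through all six layers in order
        x = encoder_layer(x, params)

    print(x.shape)  # (4, 8): continuous representations handed to the decoder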

If we zoom into the encoder layer, we can see that it is made up of two sub-layers:
  • Multi-Head Attention sub-layer
  • Feed Forward Neural Network sub-layer 

A simple pictorial representation of the encoder layer is shown below:



    The input data is fed as tokenized vectors in the form of Query, Key and Value (Data Tokenization is available here), which are passed through multiple attention heads. Each head performs a similar calculation to derive attention scores, and the per-head results are then merged to produce the final output of this encoder layer. The output from each encoder layer is represented as an attention vector that helps identify the relationships between the different elements in the sequence.
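A minimal sketch of this multi-head idea is shown below, assuming two heads and toy dimensions. In the actual architecture, Q, K and V are produced by learned linear projections for each head; the simple slicing used here is an illustrative simplification.

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention, as in the earlier sketch
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def multi_head_attention(Q, K, V, num_heads=2):
        d_head = Q.shape[-1] // num_heads
        head_outputs = []
        for h in range(num_heads):
            # Each head works on its own slice of Q, K and V
            sl = slice(h * d_head, (h + 1) * d_head)
            head_outputs.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
        # Merge (concatenate) the per-head results back into one vector per token
        return np.concatenate(head_outputs, axis=-1)

    np.random.seed(0)
    x = np.random.randn(4, 8)                    # 4 tokens, model dimension 8
    print(multi_head_attention(x, x, x).shape)   # (4, 8)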

The encoded output from the self-attention sub-layer is normalized and fed to the next sub-layer, the feed-forward network. The feed-forward neural network transforms the attention vector into a form that is acceptable to the next encoder layer or to the decoder.
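Putting the two sub-layers together, below is a minimal sketch of one encoder layer: self-attention followed by a residual connection and normalization, then a position-wise feed-forward network with the same residual-plus-normalization step. The weights are random placeholders, not trained values.

    import numpy as np

    def layer_norm(x, eps=1e-6):
        return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    def encoder_layer(x, W1, W2):
        # Sub-layer 1: self-attention with a residual connection, then normalization
        x = layer_norm(x + attention(x, x, x))
        # Sub-layer 2: ReLU feed-forward network, again with residual + normalization
        ffn = np.maximum(0, x @ W1) @ W2
        return layer_norm(x + ffn)

    np.random.seed(0)
    d_model, d_ff = 8, 32
    x = np.random.randn(4, d_model)         # 4 tokens
    W1 = np.random.randn(d_model, d_ff)
    W2 = np.random.randn(d_ff, d_model)
    print(encoder_layer(x, W1, W2).shape)   # (4, 8): passed to the next layer or the decoder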


Decoder Component

The Decoder component comprises six identical layers, each of which has three sub-layers. The decoders learn to decode the representation in order to perform different tasks.



The decoder component is fed two types of data:
  • Attention vector (from Encoder)
  • Target Sequence (Encoded as Q, K, V)

The Masked Multi-Head Attention sub-layer performs a function similar to the encoder's attention: it calculates the scores and identifies the relevance and relationships between the elements in the sequence. While this appears similar to the functionality performed by the encoder, there is a difference: the attention sub-layer of the decoder, while processing the target sequence, must not have access to future words. For example, if the target sequence is “The sun sets in the west”, the attention layer should mask “west” while using the sequence “The sun sets in the” to predict the next word. This is why the layer is named the Masked Multi-Head Attention sub-layer.
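A minimal sketch of such a look-ahead mask is shown below: positions to the right of the current token are set to negative infinity before the softmax, so they receive zero attention weight. The sequence length and random scores are illustrative only.

    import numpy as np

    seq_len = 6   # e.g. the six tokens of "The sun sets in the west"
    # Upper-triangular mask: -inf above the diagonal hides future positions
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    np.random.seed(0)
    scores = np.random.randn(seq_len, seq_len)   # raw attention scores (illustrative)
    masked_scores = scores + mask                # future positions become -inf
    weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    print(weights.round(2))   # row i has zero weight on every position j > i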

The output of this sub-layer is normalized and fed to the next sub-layer, the Encoder-Decoder Attention. This sub-layer receives the representation of the encoder output as the Key and Value and the representation of the target sequence as the Query, and it computes an attention score for each target-sequence element, influenced by the attention vector received from the encoder.
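A minimal sketch of this cross-attention step is shown below; the shapes and random values simply stand in for real encoder and decoder representations.

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    np.random.seed(0)
    encoder_output = np.random.randn(5, 8)   # 5 source tokens from the encoder stack
    decoder_state  = np.random.randn(3, 8)   # 3 target tokens processed so far

    # Each target token queries the full encoder output for relevant source context
    context = attention(decoder_state, encoder_output, encoder_output)
    print(context.shape)   # (3, 8): one context vector per target token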

This is further normalized using the normalization sub-layer and passed to the feed-forward sub-layer, which in turn produces the output vector.
