
Saturday, October 28, 2023

Generative AI - Data Tokenization and Embedding

While the transformer architecture is primarily targeted at producing new content (such as text), models built on it, like any other AI/ML models, rely on numeric values as input to perform the mathematical computations needed for learning and prediction. This is no different for NLP, so natural language needs to be converted into numeric values. This conversion is known as data tokenization.



The data tokenizer is responsible for converting the input into numeric tokens that can be consumed by the AI/ML model. Different tokenization methods are available, and the choice of method is an implementation decision.

Below is an example output from the OpenAI tokenizer page, where a numeric token ID is generated for each word in the sentence.


Readers can use https://platform.openai.com/tokenizer to play around with the tokenizer.

There are other options and SDK modules, such as Hugging Face, that can be used as well. Below is sample Python code that converts text into token IDs and back.

from transformers import AutoTokenizer
import os
from huggingface_hub import login

# Log in to the Hugging Face Hub (token read from an environment variable)
login(token=os.environ.get("<Removed>"))

text = "BGP is a routing protocol"

# Load the tokenizer for the target model and convert the text into token IDs
tokenizer = AutoTokenizer.from_pretrained("TinyPixel/Llama-2-7B-bf16-sharded")
tokenized_text = tokenizer(text)["input_ids"]

# Decode the token IDs back into text
decoded_text = tokenizer.decode(tokenized_text)

Embedding Layer


The embedding layer of the transformer receives the token IDs as input and converts them into vectors embedded with additional details such as context, semantic similarity, and correlation and relevance to other words/tokens in the sequence. For example, semantic similarity and context can capture that king relates to queen as man relates to woman.
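As a minimal sketch (assuming PyTorch, which the Hugging Face stack above already depends on; the vocabulary size, embedding dimension and token IDs below are purely illustrative), the embedding layer can be thought of as a learnable lookup table that maps each token ID to a dense vector:

import torch
import torch.nn as nn

vocab_size = 32000   # illustrative vocabulary size
d_model = 512        # illustrative embedding dimension

# Lookup table: one learnable d_model-sized vector per token ID
embedding = nn.Embedding(vocab_size, d_model)

# Token IDs produced by a tokenizer (illustrative values)
token_ids = torch.tensor([[101, 2023, 2003, 1037, 24471]])

token_vectors = embedding(token_ids)
print(token_vectors.shape)   # torch.Size([1, 5, 512]) -> (batch, sequence length, d_model)

The context and semantic relationships are not hard-coded here; they emerge in the vector values as the embedding weights are trained along with the rest of the model.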

Positional Encoding


One of the primary advantages of the transformer architecture is that it is capable of parallel processing, unlike traditional methods that process the input sequentially. To process the vectors in parallel, we need to encode the position details for each element in the sequence so that parallel processing can still identify where each element sits in the sequence. This is performed by applying a computation based on the position of the element in the sequence pos, the length of the encoding vector d_model, and the index value within the vector i.
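A minimal sketch of the sinusoidal encoding described in the original "Attention Is All You Need" paper is given below (assuming PyTorch; positional_encoding is an illustrative helper and the sizes are arbitrary). It shows how pos, d_model and i combine into the position values that get added to the token embeddings:

import torch
import math

def positional_encoding(seq_len, d_model):
    # pos: position of the element in the sequence
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    # i: index within the encoding vector (even indices only)
    i = torch.arange(0, d_model, 2, dtype=torch.float)
    # Scale factor 1 / 10000^(i/d_model) from the original paper
    div_term = torch.exp(-math.log(10000.0) * i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)   # even indices use sine
    pe[:, 1::2] = torch.cos(pos * div_term)   # odd indices use cosine
    return pe

# One d_model-sized position vector per element, added to the token embeddings
pe = positional_encoding(seq_len=5, d_model=512)
print(pe.shape)   # torch.Size([5, 512])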


A pictorial representation of the technique to convert the text and embed it with position and other details is shown below:

This position-embedded vector is fed as input to the Encoder component of the transformer architecture.

Generative AI - Transformer Architecture

Introduction

    One of the recent buzzwords in the industry, spanning different verticals, is “Generative AI”. Generative AI, or GenAI for short, is a type of Artificial Intelligence (AI) that produces new content such as text, images, audio and synthetic data based on what it has learned from existing content.



GenAI is a subset of deep learning that leverages a foundation model: a large language model (LLM) pre-trained on a large quantity of data (petabytes) with numerous parameters (billions) to produce different downstream outcomes.

The input used to train the foundation model can be documents, websites, files, etc., which are natural-language inputs. As readers might be aware, any interaction with AI/ML models, such as training the model or sending a query and receiving a response, is performed using numeric values. Accordingly, natural language processing techniques are used to convert the text in the documents, websites, etc. into numeric values. Some of the firmly established and dominant state-of-the-art techniques used for NLP modeling are the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The sequential way these techniques model language comes with its own disadvantages, such as:

  • Costly and time-consuming labeling of data to train the model
  • Slowness due to lack of parallel processing
  • Difficulty handling long sequences

In this article, we will discuss the Transformer architecture, which serves as the fundamental concept behind Large Language Models.


Basic Architecture

    The transformer architecture was originally published in the 2017 paper "Attention Is All You Need" and drastically improved the performance of LLMs. Unlike traditional techniques such as RNNs or LSTMs, the transformer leverages a mathematical way of finding the pattern and relevance between elements that eliminates the need to label the data for training. This mathematical technique, referred to as the attention map or self-attention, allows the model to identify the relevance of each element by creating a matrix-like map and assigning a different attention score to each element in the map. Further, the mathematical method naturally lends itself to parallel data processing, making it faster compared to the traditional techniques. We will dive deeper into the architecture and explain the concept below.
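As a rough illustration of that attention-map idea (assuming PyTorch; the Query, Key and Value tensors are random placeholders), each element of the sequence is scored against every other element, and a softmax turns the scores into the attention map:

import torch
import torch.nn.functional as F

d_model, seq_len = 64, 4

# Illustrative Query, Key and Value matrices, one row per sequence element
Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

# Scaled dot-product attention: score every element against every other element
scores = Q @ K.transpose(0, 1) / (d_model ** 0.5)   # (seq_len, seq_len) attention map
weights = F.softmax(scores, dim=-1)                 # each row of scores sums to 1
output = weights @ V                                # weighted mix of the Value vectors

print(weights.shape, output.shape)   # torch.Size([4, 4]) torch.Size([4, 64])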

The transformer architecture comprises two distinct components:
  • Encoder
  • Decoder
A simple pictorial representation of the architecture is as below:



Encoder Component


    The Encoder component comprises six identical layers, where each layer has two sub-layers. The encoders encode and map the input sequence, enriching it with additional details and converting it into a sequence of continuous representations as it passes through each layer within the encoder stack. While each encoder layer within the stack applies the same transformation (or attention-mapping) logic to the input sequence, each layer uses different weight and bias parameters. The initial layers identify basic patterns while the final layers perform more advanced mapping. The resulting output is fed to the decoder.
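For a rough feel of this stacking (using PyTorch's built-in transformer modules as a stand-in, not the exact implementation from the paper; all sizes are illustrative), six identical encoder layers can be sketched as:

import torch
import torch.nn as nn

d_model = 512

# One encoder layer = multi-head attention sub-layer + feed-forward sub-layer
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)

# Stack of 6 identical layers; each copy keeps its own weight and bias parameters
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Illustrative input: one sequence of 5 position-encoded token vectors
x = torch.randn(1, 5, d_model)
encoded = encoder(x)    # continuous representation passed on to the decoder
print(encoded.shape)    # torch.Size([1, 5, 512])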

If we zoom into the encoder layer, we can see that it is made of two sub-layers as below:
  • Multi-Head Attention sub-layer
  • Feed Forward Neural Network sub-layer 

A simple pictorial representation of the encoder layer is as below:



    The input data is fed as tokenized vectors in the form of Query, Key and Value (Data Tokenization is available here), which are passed through multiple attention-score heads. Each head performs a similar calculation to derive attention scores, and the scores are then merged to produce a final score for this encoder layer. The output from each encoder layer is represented as an attention vector that helps identify the relationship between different elements in the sequence.
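A brief sketch of that multi-head split-and-merge (using PyTorch's nn.MultiheadAttention as a stand-in; shapes are illustrative) is shown below. The module internally splits Query, Key and Value across 8 heads, lets each head compute its own scores, and merges the results into a single attention vector:

import torch
import torch.nn as nn

d_model, n_heads = 512, 8

# Multi-head attention: splits Q, K and V across 8 heads and merges the per-head results
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

# Self-attention: Query, Key and Value all come from the same input sequence
x = torch.randn(1, 5, d_model)              # (batch, sequence length, d_model)
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)    # torch.Size([1, 5, 512]) - merged attention vector
print(attn_weights.shape)   # torch.Size([1, 5, 5])   - attention scores between elements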

The encoded output from the self-attention sub-layer is normalized and fed to the next sub-layer, the feed-forward network. The feed-forward neural network transforms the attention vector into a form that is acceptable to the next encoder layer or the decoder.
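A minimal sketch of that step (assuming PyTorch; the inner dimension of 2048 follows the original paper and the input tensor is a placeholder for the attention output) normalizes the attention vector and passes it through the position-wise feed-forward network:

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048

# Position-wise feed-forward network: two linear layers with a ReLU in between
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
norm = nn.LayerNorm(d_model)

attention_vector = torch.randn(1, 5, d_model)   # output of the self-attention sub-layer

ffn_output = feed_forward(norm(attention_vector))
print(ffn_output.shape)                         # torch.Size([1, 5, 512])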


Decoder Component

The Decoder component comprises six identical layers, where each layer has three sub-layers. The decoders learn to decode the representation to perform different tasks.



The decoder component is fed with two types of data, as below (a brief sketch follows the list):
  • Attention vector (from Encoder)
  • Target Sequence (Encoded as Q, K, V)
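
A brief sketch of how those two inputs are consumed (using PyTorch's built-in decoder modules as a stand-in; shapes are illustrative) is shown below:

import torch
import torch.nn as nn

d_model = 512

# One decoder layer = masked self-attention + encoder-decoder attention + feed-forward
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

encoder_output = torch.randn(1, 7, d_model)    # attention vector from the encoder
target_sequence = torch.randn(1, 5, d_model)   # embedded target sequence (Q, K, V)

decoded = decoder(tgt=target_sequence, memory=encoder_output)
print(decoded.shape)    # torch.Size([1, 5, 512])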

The masked multi-head attention performs functionality similar to that of the encoder: it calculates the scores and identifies the relevance and relationship between elements in the sequence. While this appears similar to what the encoder does, there is a difference: the decoder's attention sub-layer, while processing the target sequence, must not have access to future words. For example, if the target sequence is “The sun sets in the west”, the attention layer should mask “west” while using the sequence “The sun sets in the” to predict the next word. This is why the layer is named the Masked Multi-Head Attention sub-layer.
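A small sketch of that masking (assuming PyTorch; the sentence is split on whitespace purely for illustration) builds the look-ahead mask that hides future positions such as “west” from the earlier positions:

import torch

target = "The sun sets in the west".split()
seq_len = len(target)   # 6 positions

# Look-ahead mask: position i may only attend to positions <= i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Masked positions are set to -inf before the softmax, so their attention weight becomes 0
scores = torch.randn(seq_len, seq_len)          # illustrative raw attention scores
masked_scores = scores.masked_fill(mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[4])   # row for the second "the": zero weight on the future token "west"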

The output of this sub-layer is normalized and fed to the next sub-layer, which connects the decoder to the encoder output. The Encoder-Decoder Attention receives the representation of the encoder output as the Key and Value and the representation of the target sequence as the Query. This sub-layer computes the attention score for each target-sequence element, influenced by the attention scores in the attention vector received from the encoder.
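A short sketch of that encoder-decoder attention (again using PyTorch's nn.MultiheadAttention as a stand-in; shapes are illustrative) shows the decoder representation acting as the Query against the encoder output serving as Key and Value:

import torch
import torch.nn as nn

d_model = 512
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

encoder_output = torch.randn(1, 7, d_model)   # attention vector from the encoder (source length 7)
decoder_state = torch.randn(1, 5, d_model)    # masked self-attention output (target length 5)

# Query from the decoder; Key and Value from the encoder output
output, weights = cross_attention(decoder_state, encoder_output, encoder_output)
print(output.shape)    # torch.Size([1, 5, 512])
print(weights.shape)   # torch.Size([1, 5, 7]) - each target element scored against the source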

This is further normalized using the normalization sub-layer and passed to the feed-forward sub-layer, which in turn produces the output vector.