Saturday, October 28, 2023

Generative AI - Data Tokenization and Embedding

While the transformer architecture is primarily targeted at producing new content (such as text), models built on it, like any other AI/ML models, rely on numeric values as input to perform the mathematical computations for learning and prediction. NLP is no different, so natural language needs to be converted into numeric values. This conversion is known as data tokenization.



The data tokenizer is responsible for preparing the input as numeric tokens that can be consumed by the AI/ML model. Different tokenization methods are available, and the choice of method is an implementation detail.

Below is an example output from the OpenAI tokenizer page, where a numeric token ID is generated for each word in the sentence.


Readers can leverage https://platform.openai.com/tokenizer to play around with the tokenizer.

There are other options and SDK modules, such as Hugging Face's transformers library, that can be used as well. Below is sample Python code that converts text into token IDs.

from transformers import AutoTokenizer
from huggingface_hub import login
import os

# Authenticate with Hugging Face (the environment variable name was removed in the original post)
login(token=os.environ.get("<Removed>"))

text = "BGP is a routing protocol"
# Load the tokenizer associated with the Llama-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained("TinyPixel/Llama-2-7B-bf16-sharded")
# Convert the text into a list of numeric token IDs
tokenized_text = tokenizer(text)["input_ids"]
# Convert the token IDs back into text to verify the round trip
decoded_text = tokenizer.decode(tokenized_text)
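
To inspect the result (the exact values depend on the tokenizer's vocabulary and special tokens), the token IDs and the round-tripped text can be printed:

print(tokenized_text)   # a list of integers, one per subword token
print(decoded_text)     # the reconstructed sentence, possibly prefixed by special tokens such as <s>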

Embedding Layer


The embedding layer of the transformer is responsible for receiving the token IDs as input and converting them into vectors that embed additional details such as context, semantic similarity, and correlation and relevance to the other words/tokens in the sequence. For example, semantic similarity and context are what allow the model to relate king to queen in the same way that man relates to woman.
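
As a minimal sketch (not the exact layer of any particular model), the embedding step can be viewed as a learnable lookup table that maps each token ID to a dense vector. The vocabulary size, vector dimension, and token ID values below are illustrative assumptions:

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not tied to a specific model)
vocab_size = 32000   # number of entries in the tokenizer vocabulary
d_model = 512        # dimensionality of each embedding vector

# A learnable lookup table: row k holds the vector for token ID k
embedding = nn.Embedding(vocab_size, d_model)

# Token IDs as produced by a tokenizer (placeholder values), shape: (batch, sequence_length)
token_ids = torch.tensor([[12, 345, 678, 9]])

# Each ID is replaced by its d_model-dimensional vector
vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([1, 4, 512])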

Positional Encoding


One of the primary advantages of the transformer architecture is that it is capable of parallel processing, unlike traditional methods that process the input sequentially. To process the vectors in parallel, we need to encode the position details of each element in the sequence so that parallel processing can still identify where each element sits in the sequence. This is performed by applying a computation to the position of the element in the sequence (pos), the length of the encoding vector (d_model), and the index within the vector (i), as sketched below.
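
The post does not spell the formula out, but the computation over pos, i, and d_model referenced here is commonly the sinusoidal encoding from the original transformer paper ("Attention Is All You Need"). Below is a minimal NumPy sketch under that assumption, with illustrative dimensions:

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding (assumes d_model is even):
    #   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    #   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]            # pos: 0 .. seq_len-1
    two_i = np.arange(0, d_model, 2)                          # even vector indices (2i)
    angle_rates = positions / np.power(10000, two_i / d_model)
    pe[:, 0::2] = np.sin(angle_rates)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angle_rates)   # odd dimensions use cosine
    return pe

# Illustrative sizes (assumptions): 10 tokens, 512-dimensional embeddings
pe = positional_encoding(10, 512)
print(pe.shape)   # (10, 512)
# The encoding is added element-wise to the token embedding vectors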


A pictorial representation of the technique to convert the text and embed it with position and other details is shown below:

This position-embedded vector is fed as input to the Encoder component of the transformer architecture.
