Wednesday, September 25, 2024

AIML: Floating Point Arithmetic Representation

In the world of computing, floating point arithmetic is the commonly used representation of data as real numbers. Depending on the use case and the application, the data processed by the GPU may be a signed number (-1027 to +1027) or an unsigned number (0 to 65535). It can also be a whole number (123) or a decimal with a fractional part (1.23). To address these disparate data types and support a wide range of values, a common representation was agreed upon and standardized by the IEEE as the IEEE-754 format. The layout of the IEEE-754 FP32 datatype, commonly referred to as single precision floating point, is shown below:


The IEEE-754 format comprises three parts as below:

  • Sign field – This single-bit field indicates whether the number is positive or negative; a value of 0 represents a positive number and a value of 1 a negative number.
  • Biased Exponent – This field represents both positive and negative exponents by adding a fixed bias to the actual exponent. For the single precision FP32 datatype, the bias is 127, so a stored value of 128 results in an exponent of 1 while 126 results in an exponent of -1. Informally, a set MSB in this field indicates a positive exponent.
  • Mantissa – This is the fractional part of the normalized binary number, padded with trailing zeros to fill the width of the field (23 bits for FP32).

Example


To better understand the conversion and the format, let us take 129.375 as input and convert it into FP32 format.
  • As a first step, let us convert 129.375 into binary format. 
    • The binary value of 129 is 10000001
    • The binary value of 0.375 is 0.011
    • The dotted representation of 129.375 is 10000001.011 
  • Now 10000001.011 can be converted into exponential form and represented as 1.0000001011 x 2^7
  • Let’s use the above to represent the value in IEEE-754 FP32 format:
    • Sign = 0
    • Exponent value is 7. So biased exponent is 127+7 = 134 (10000110)
    • Normalized mantissa = 00000010110000000000000 (the fraction bits 0000001011 padded with 0s to 23 bits)
  • The IEEE-754 format for FP32 single precision is 0 10000110 00000010110000000000000.
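
The above conversion can be cross-checked programmatically. Below is a minimal Python sketch (the helper name fp32_bits is illustrative) that uses the standard struct module to pack a float into its IEEE-754 single precision bit pattern:

import struct

def fp32_bits(value):
    # Pack the float as IEEE-754 single precision (big-endian), then
    # reinterpret the same 4 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"
    # Split into sign (1 bit), biased exponent (8 bits) and mantissa (23 bits).
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"

print(fp32_bits(129.375))  # 0 10000110 00000010110000000000000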

There are different online calculators to convert any value to FP32 format. One such online calculator is available here - https://www.h-schmidt.net/FloatConverter/IEEE754.html 

While the above example used FP32, one of the most commonly used formats, there are other floating-point formats used for deep learning. Below is a consolidated table defining the number of bits for the sign, exponent and mantissa of each such format:

Format        Sign    Exponent    Mantissa
FP64          1       11          52
FP32          1       8           23
FP16          1       5           10
BF16          1       8           7
FP8 (E4M3)    1       4           3
FP8 (E5M2)    1       5           2

Each of these formats supports a different range and precision, where range defines the limits of the number representation (min to max) while precision defines the distance between successive representable numbers. While FP64 offers more range and precision compared to FP8, it is far more compute intensive.
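
The range/precision trade-off can be observed with a short NumPy sketch comparing the largest representable value (range) and the gap between adjacent representable numbers around 1.0 (precision) for FP16 and FP32:

import numpy as np

# Range: the largest finite value each format can represent.
print(np.finfo(np.float16).max)     # 65504.0
print(np.finfo(np.float32).max)     # ~3.4028235e+38

# Precision: the distance from 1.0 to the next representable number.
print(np.spacing(np.float16(1.0)))  # ~0.000977
print(np.spacing(np.float32(1.0)))  # ~1.1920929e-07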

Wednesday, November 8, 2023

Google Cloud VPC Configuration


Virtual Private Cloud

Virtual Private Cloud (VPC) is the virtual instance of the network within Google Cloud that provides connectivity for compute and other resources. Unlike AWS, a VPC in Google Cloud is global, so compute resources created in any region or Availability Zone (AZ) can be part of the same VPC and communicate among themselves by default. Subnets, however, are regional resources and so cannot span different regions. Each region can be assigned one or more subnets from within the VPC.

In this article, we will discuss three different ways of creating a VPC in Google Cloud:

  • Google Cloud Console 
  • gcloud CLI
  • Terraform


Google Cloud Console

Below is a quick one-minute video showing the steps to create the VPC.



This section explains the use of the Google Cloud console to create a new VPC using the below simple steps:

  1. Go to console.cloud.google.com/networking to create a new VPC network. Note that for each project, a default VPC is created with a subnet assigned for each region; this “default” VPC can be deleted by the admin to avoid any confusion.
  2. Use the CREATE VPC NETWORK button to create a new VPC for this project. It takes the user to the VPC creation page.
  3. Configure a unique name for the VPC and set the MTU value based on the requirement. 
  4. Select the relevant subnet creation mode.
    • Custom will let the user manually create the subnets on a per-region basis.
    • Automatic will assign subnets for all the regions automatically.
  5. Configure the relevant firewall rules to apply to the VPC. Any instance connected to this VPC will be subject to these firewall rules.
  6. Select the relevant Dynamic routing mode, which influences how the prefixes/routes learned from external networks via Cloud Router are propagated within the VPC. This can be changed even after creating the VPC.
    • Regional will instruct the cloud router to make the externally learned routes available only to the instances within the same region.
    • Global will instruct the cloud router to make the externally learned routes available for all the instances irrespective of the region.

gcloud CLI


The gcloud command line is another approach to configuring the VPC. Below is an example of configuring a new VPC with the following parameters:
  • Name – nyacorp-vpc1
  • Project – nyacorp
  • Description – “VPC1 for NYACORP”
  • Subnet Creation Mode – Automatic
  • MTU – 1500
  • Dynamic Routing Mode – Regional.

gcloud compute networks create nyacorp-vpc1 --project=nyacorp --description="VPC1 for NYACORP" --subnet-mode=auto --mtu=1500 --bgp-routing-mode=regional

Any additional Firewall rules can be created using the below CLI:

gcloud compute firewall-rules create nyacorp-vpc1-allow-custom --project=nyacorp --network=projects/nyacorp/global/networks/nyacorp-vpc1 --description="Allows connection from any source to any instance on the network using custom protocols." --direction=INGRESS --priority=65534 --source-ranges=10.128.0.0/9 --action=ALLOW --rules=all

VPC Configuration using Terraform


The basic terraform configuration involves the below steps:
  • Define the provider as google with the project and credentials details. Optionally include the region if the configuration is region specific.
  • Define the google_compute_network resource where the VPC-specific details are configured, such as:
    • name (mandatory)
    • description (optional)
    • auto_create_subnetworks, defining whether subnets are created automatically
    • routing_mode, defining the dynamic routing mode
  • When auto_create_subnetworks is disabled (set to false), subnets are configured using the google_compute_subnetwork resource.

The sample terraform configuration is as below:

provider "google" {
project="Terraform-Project"
credentials = "${file("credentials.json")}"
}

resource "google_compute_network" "nyacorp-vpc1" {
# Ref - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_network
name = "nyacorp-vpc1"
description = "VPC1 for NYACORP"
auto_create_subnetworks = false
routing_mode = "REGIONAL"
}

resource "google_compute_subnetwork" "public" {
# Ref - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_subnetwork
name = "public-subnet"
ip_cidr_range = "10.1.0.0/20"
region = "us-east1"
network = google_compute_network.nyacorp-vpc1.id
}

Once the terraform configuration is applied using “terraform apply”, the relevant VPC and subnet will be configured.

Sunday, November 5, 2023

Cisco SDWAN Edge Device Onboarding

The Cisco SDWAN Edge device onboarding process can be explained at a high level as below:

 

  • Choose one of the below provisioning options to onboard the edge device:
    • Automatic Provisioning
    • Semi-Automatic/Bootstrap Provisioning
    • Manual Provisioning
  • Populate the device Chassis ID, Serial Number, Organization Name and Certificate in the relevant entities:
    • When Automatic provisioning is used, the above details must be populated in the ZTP/PnP portal and in the allowed list configured on the vBond controller.
    • When other provisioning options are used, the above details must be populated in the allowed list configured on the vBond controller.
  • Power on the edge device and follow the below procedure:
    • If Automatic provisioning is used, no further action is required; the device follows zero-touch provisioning.
    • If Bootstrap provisioning is used, the basic configuration file needs to be loaded from a USB drive while booting the edge device.
    • If Manual provisioning is used, the basic configuration to reach the vBond controller is configured manually using the CLI.
  • The device will authenticate and communicate with the relevant controllers and join the fabric.

 

Edge Pre-Onboarding Process

 

Each edge device is preloaded with a root certificate, a unique serial number and a chassis ID during the manufacturing process itself. For vEdge devices, the certificate is stored in the TPM chip, while for cEdge devices the certificate is stored in the SUDI chip.

The Cisco SDWAN solution leverages a whitelist (allowed-list) model. The serial number, chassis ID and certificate details are required to be populated in the PnP/ZTP connect portal and in the vBond controller for the respective organization.

 

Edge Onboarding Process

 

The onboarding process can be simplified into the below steps:

  • Obtaining the vBond information
  • vBond Session Establishment
  • vManage Session Establishment
  • vSmart and Fabric Join

Obtaining the vBond information 

 

When the Automatic provisioning option is used, the procedure is as below (see the sketch after this list):

  • Upon coming up, the edge device leverages DHCP to configure an IP address on the VPN0 (transport) interface. This IP address is expected to have internet access in order to reach the ZTP/PnP server and the vBond controller.
  • The edge device leverages DNS to resolve the ZTP/PnP server IP address:
    • For vEdge, ztp.viptela.com is resolved to obtain the ZTP server address.
    • For cEdge, devicehelper.cisco.com is resolved to obtain the PnP server address.
  • Upon receiving the ZTP/PnP server details, the edge device authenticates with the connect portal. The Serial Number and Chassis ID are used to identify the Smart Account associated with the customer, and the relevant vBond information is shared with the edge device.
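
As a minimal illustration of the DNS resolution step above, the below Python sketch resolves the two well-known hostnames (the output depends on the local DNS environment):

import socket

# The edge device resolves a well-known hostname to locate the
# ZTP (vEdge) or PnP (cEdge) provisioning service.
for host in ("ztp.viptela.com", "devicehelper.cisco.com"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as err:
        print(host, "resolution failed:", err)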

 

When the Bootstrap provisioning option is used, the basic configuration file including the vBond information is loaded onto a USB drive that is connected to the device.

 

When the Manual provisioning option is used, the administrator is expected to manually configure the edge device using the CLI.

 

With any of the above provisioning options used, the edge device will now have the vBond IP address.

 

vBond Session Establishment

 

The session establishment between the edge device and the vBond is as below:

 

  • Upon receiving the vBond IP address, the edge device establishes a secure transient Datagram TLS (DTLS) connection over the UDP port range 12346-12445.
  • The vBond controller shares its root-CA-signed certificate with the edge device so that the device can validate the integrity of the vBond controller. In addition, it sends a 256-bit nonce value to the edge device (see the sketch after this list).
  • The edge device validates the root of trust for the certificate received from the vBond controller:
    • If the validation fails, the session is terminated.
    • If the validation succeeds, the edge device responds with its Serial Number, Chassis ID and certificate, along with the nonce value signed using its private key.
  • The vBond controller validates the certificate and checks whether the serial number and chassis ID are in the allowed list:
    • If the certificate validation succeeds and the edge device is in the allowed list, it replies with the vManage and vSmart controller details.
    • If the certificate validation fails or the edge device is not in the allowed list, the session is terminated.
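
The nonce exchange above is a standard challenge/response pattern. As a simplified illustration only, the below sketch uses an HMAC over a shared secret in place of the real certificate-based signature (the actual protocol signs the nonce with the device private key stored in the TPM/SUDI chip):

import hashlib
import hmac
import os

# Controller side: generate a 256-bit nonce and send it to the device.
nonce = os.urandom(32)

# Device side: prove possession of a secret by signing the nonce.
device_secret = b"per-device-secret"  # stand-in for the device private key
response = hmac.new(device_secret, nonce, hashlib.sha256).digest()

# Controller side: recompute and compare to authenticate the device.
expected = hmac.new(device_secret, nonce, hashlib.sha256).digest()
print(hmac.compare_digest(response, expected))  # True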

 

vManage Session Establishment

 

The session establishment between the edge device and the vManage controller is as below:

 

  • Upon receiving the vManage IP address, the edge device establishes a secure transient Datagram TLS (DTLS) connection over the UDP port range 12346-13065.
  • The vManage controller shares its root-CA-signed certificate with the edge device so that the device can validate the integrity of the controller.
  • The edge device validates the root of trust for the certificate received from the vManage controller:
    • If the validation fails, the session is terminated.
    • If the validation succeeds, the edge device responds with its Serial Number, Chassis ID and certificate.
  • The vManage controller validates the certificate and checks whether the serial number and chassis ID are in the allowed list:
    • If the certificate validation succeeds and the edge device is in the allowed list, it pushes the relevant configuration through NETCONF over SSH.
    • If the certificate validation fails or the edge device is not in the allowed list, the session is terminated.

 

vSmart and Fabric Join

 

The session establishment between the edge device and the vSmart controller is as below:

 

  • The edge device establishes a secure transient Datagram TLS (DTLS) connection over the UDP port range 12346-13065 to the vSmart controller.
  • An OMP session is established between the vSmart controller and the edge device for route and policy updates.

Saturday, October 28, 2023

Generative AI - Data Tokenization and Embedding

While the transformer architecture is primarily targeted at producing new content (such as text), like any other AI/ML model, models using the transformer architecture rely on numeric values as input to perform all the mathematical computations for learning and prediction. This is no different for NLP, and so natural language needs to be converted to numeric values. This conversion is known as data tokenization.



The data tokenizer is responsible for preparing the input as numeric tokens that can be consumed by the AI/ML model. Different tokenization methods are available, and the selection of the method is an implementation matter.

Below is an example output from OpenAI tokenizer page where a numeric tokenID is generated for each word in the sentence.


The readers can leverage https://platform.openai.com/tokenizer to play around with the tokenizer.

There are other options, such as the Hugging Face SDK modules, that can be used as well. Below is sample Python code that converts text into token IDs.

from transformers import AutoTokenizer
import os
from huggingface_hub import login

# Authenticate to the Hugging Face Hub; the environment variable holding
# the access token was redacted in the original post.
login(token=os.environ.get("<Removed>"))

text = "BGP is a routing protocol"
tokenizer = AutoTokenizer.from_pretrained("TinyPixel/Llama-2-7B-bf16-sharded")
# Convert the text into numeric token IDs, then decode back to verify.
tokenized_text = tokenizer(text)["input_ids"]
decoded_text = tokenizer.decode(tokenized_text)
print(tokenized_text, decoded_text)

Embedding Layer


The embedding layer of the transformer is responsible for receiving the token IDs as input and converting them into vectors embedded with additional details such as context, semantic similarity, and correlation/relevance to other words/tokens in the sequence. For example, semantic similarity and context can capture that king relates to queen as man relates to woman.
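
A minimal PyTorch sketch of such an embedding lookup is shown below; the vocabulary size, embedding dimension and token IDs are illustrative values rather than those of any specific model:

import torch

# Hypothetical vocabulary of 32000 tokens, each mapped to a 512-dim vector.
embedding = torch.nn.Embedding(num_embeddings=32000, embedding_dim=512)

token_ids = torch.tensor([[422, 1736, 338]])  # illustrative token IDs
vectors = embedding(token_ids)                # one vector per token
print(vectors.shape)                          # torch.Size([1, 3, 512])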

Positional Encoding


One of the primary advantages of the transformer architecture is that it is capable of parallel processing, which differs from traditional methods that process the input sequentially. In order to process the vectors in parallel, we need to encode the position details for each element in the sequence so that parallel processing can still identify the position of each element. This is performed by applying a computation over the position of the element in the sequence (pos), the length of the encoding vector (d_model), and the index value within the vector (i).
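
A minimal NumPy sketch of the sinusoidal positional encoding described in the Attention Is All You Need paper is shown below; even vector indices use sine while odd indices use cosine:

import numpy as np

def positional_encoding(seq_len, d_model):
    # pos is the position in the sequence, i the index within the vector.
    pos = np.arange(seq_len)[:, np.newaxis]  # shape (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]    # shape (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])     # even indices: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])     # odd indices: cosine
    return pe

print(positional_encoding(4, 8).round(3))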


A pictorial representation of the technique to convert the text and embed it with position and other details is shown below:

This position-embedded vector is fed as input to the Encoder component of the transformer architecture.

Generative AI - Transformer Architecture

Introduction

    One of the recent buzzwords in the industry, spanning different verticals, is “Generative AI”. Generative AI, or GenAI for short, is a type of Artificial Intelligence (AI) that produces new content such as text, images, audio and synthetic data based on learning from existing content.



GenAI is a subset of deep learning that leverages a foundation model: a large language model (LLM) pre-trained with a large quantity of data (petabytes) and numerous parameters (billions) to produce different downstream outcomes.

The input used to train the foundation model can be documents, websites, files, etc., which are natural language-based inputs. As the readers might be aware, any interaction with AI/ML models, such as training the model or sending a query and receiving a response, is performed using numeric values. Accordingly, natural language processing (NLP) techniques are used to convert the text in the documents, websites, etc. into numeric values. Some of the firmly established, state-of-the-art techniques used for NLP modeling are the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The sequential nature of modeling the language with these techniques comes with its own disadvantages, such as:

  • Costly and time-consuming labeling of data to train the model
  • Slowness due to the lack of parallel processing
  • Difficulty handling long sequences

In this article, we will discuss the evolutionary Transformer Architecture which acts as the fundamental concept for Large Language Models. 


Basic Architecture

    The transformer architecture was originally published in the Attention Is All You Need paper in 2017 and drastically improved the performance of LLM models. Unlike traditional techniques such as RNN or LSTM, the transformer leverages a mathematical way of finding the pattern and relevancy between elements that eliminates the need to label the data for training. This mathematical technique, referred to as attention mapping or self-attention, allows the model to identify the relevancy of each element by creating a matrix-like map and assigning different attention scores to each element in the map. Further, the mathematical method naturally allows parallel data processing, making it much faster compared to the traditional techniques. We will dive deeper into the architecture and explain the concept.

The transformer architecture comprises 2 distinct components as below:
  • Encoder
  • Decoder
A simple pictorial representation of the architecture is as below:



Encoder Component


    The Encoder component comprises 6 identical layers, where each layer has two sub-layers. The encoders encode the input sequence, enrich it with additional details, and convert it into a sequence of continuous representations as it passes through each layer within the encoder stack. While each encoder layer within the stack employs the same transformation (or attention mapping) logic on the input sequence, each layer uses different weight and bias parameters. The initial layers identify basic patterns while the final layers perform more advanced mapping. The resulting output is fed to the decoder.

If we zoom into the encoder layer, we can see that it is made of two sub-layers as below:
  • Multi-Head Attention sub-layer
  • Feed Forward Neural Network sub-layer 

A simple pictorial representation of the encoder layer is as below:



    The input data is fed as tokenized vectors in the form of Query, Key and Value (Data Tokenization is covered here) which are passed through multiple attention heads. Each head performs a similar calculation to derive an attention score, and the scores are then merged to produce a final score for the encoder layer. The output from each encoder layer is represented as an attention vector that helps identify the relationship between different elements in the sequence.
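
As a minimal single-head illustration of this score computation, the NumPy sketch below implements the scaled dot-product attention from the original paper, softmax(QK^T / sqrt(d_k)) V; the input matrices are random placeholders:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Relevancy of each query to each key, scaled by sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys (max subtracted for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values.
    return weights @ V

# Illustrative 4-token sequence with 8-dimensional representations.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)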

The encoded output from the self-attention sub-layer is normalized and fed to the next sub-layer, the feed-forward network. The feed-forward neural network transforms the attention vector into a form that is acceptable to the next encoder layer or the decoder.
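
The feed-forward sub-layer in the original paper is a position-wise network applying two linear transformations with a ReLU in between, FFN(x) = max(0, xW1 + b1)W2 + b2. Below is a minimal sketch with illustrative dimensions:

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU activation in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Illustrative dimensions: model width 8, inner width 32.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)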


Decoder Component

The Decoder component comprises 6 identical layers, where each layer has three sub-layers. The decoders learn to decode the representation to perform different tasks.



The decoder component is fed with 2 types of data as below:
  • Attention vector (from Encoder)
  • Target Sequence (Encoded as Q, K, V)

The Masked Multi-Head Attention sub-layer performs functionality similar to the encoder's, calculating scores to identify the relevance and relationship between elements in the sequence. While this appears similar to the encoder, there is a difference: the attention sub-layer of the decoder, while using the target sequence, must not have access to future words. For example, if the target sequence is “The sun sets in the west”, the attention layer should mask “west” while using “The sun sets in the” to predict the next word. This is why the layer is named the Masked Multi-Head Attention sub-layer (a sketch of the mask follows below).
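
A minimal sketch of such a causal (look-ahead) mask is shown below; True marks the future positions whose scores are set to -inf before the softmax so that they receive zero attention weight:

import numpy as np

def causal_mask(seq_len):
    # Upper triangle (above the diagonal): positions a token must not see.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

print(causal_mask(4))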

The output of the sub-layer is normalized and fed to the next sub-layer. The Encoder-Decoder Attention sub-layer receives the representation of the encoder output as Key and Value, and the representation of the target sequence (from the previous decoder sub-layer) as Query. This sub-layer computes the attention score for each target sequence element, influenced by the attention scores from the attention vector received from the encoder.

This is further normalized using the normalization sub-layer and passed to the feed-forward sub-layer, which in turn produces the output vector.

Saturday, August 26, 2023

AWS Nitro Cards for VPC Networking

 The Nitro cards are a family of cards that offload I/O functions such as networking, security, and virtualization for service acceleration and to improve overall system performance. The Nitro cards are built for specific purposes and have their own System on Chip (SoC) software. There are 4 different types of Nitro cards as below:

  • Nitro card for VPC
  • Nitro card for EBS – NVMe based storage.
  • Nitro card for instance storage
  • Nitro card for system control (brain behind the Nitro system)

Nitro card for VPC is a network card that handles VPC traffic. It is similar to a physical PCIe card with network ports on one end and a PCIe bus on the other end. It uses SR-IOV to create different virtual network functions to offer enhanced network functionality. In addition to the network I/O functionality, the Nitro card also performs other VPC functionalities such as:

  • Overlay Encapsulation and Decapsulation
  • Security Groups
  • Traffic Rate Limiter
  • Traffic Routing

These OS-bypass capable Nitro cards come with an optional capability to create an Elastic Fabric Adapter (EFA), which is primarily targeted at High Performance Computing (HPC) and Machine Learning (ML) applications. The traditional behavior of processing network packets via the kernel’s TCP/IP stack is not sufficient to address such low-latency requirements. This is addressed using the below enhancements:


A new open-source library known as libfabric was developed as part of the OpenFabrics Interfaces (OFI) working group, which aims to create a family of application programming interfaces (APIs) that expose the network data directly from the NIC to the middleware and/or applications. The libfabric API is leveraged (and installed as a module on the host) to bypass the host kernel TCP/IP stack and create message queues directly to the Nitro card drivers. This helps achieve high performance and low latency for the applications.

Any HPC or ML application may have distributed workloads performing the computing and needing to exchange data among themselves in a low-latency manner. While the use of OFI/libfabric helps bypass the kernel, this is not sufficient for inter-host communication over the fabric. To address such requirements, a cloud-optimized transport protocol referred to as Scalable Reliable Datagram (SRD) is used by the Elastic Fabric Adapter (EFA) in the overlay virtual network for communication between workloads distributed across different servers. SRD, which is inspired by InfiniBand, combines the positive characteristics of both UDP and TCP to improve overall performance. SRD allows ECMP spraying to take advantage of the multiple paths available between hosts. It also supports out-of-order packet delivery and lets the upper layer handle the re-ordering.

More details about SRD are available here - https://ieeexplore.ieee.org/document/9167399 

The 2nd generation of EFA was introduced in late 2022 for 6th generation compute and memory optimized instances. EFA v2 enables full RDMA semantics and bumps the bandwidth from 100 Gbps to 200 Gbps.

A quick comparison of Nitro generation cards below:


Thursday, August 24, 2023

Elastic Network Interface (ENI)

 Elastic Network Interface (ENI) is a logical VPC component that acts as a virtual network interface card connecting a resource, such as an EC2 or DB instance, to the respective subnet in the VPC. The name “Elastic” comes from the fact that an ENI can be created independently ahead of time, with its IP address and other details, and then associated with the relevant instance at launch time.

By default, each EC2 instance is created with one ENI, created and managed by AWS. Multiple ENIs can be attached to the same resource, such as an EC2 instance, based on the business need. When more than one ENI is attached to a resource, the primary ENI cannot be detached from a running instance while the secondary ones can be.


ENI Limitations

  • NIC teaming for higher bandwidth or resiliency is not supported.
  • The number of ENI per instance is limited by the instance type.


The sample configuration to create the ENI from AWS portal is as shown below:


Below are the basic steps involved in creating the ENI:

  • (Mandatory) Define the subnet to which the ENI must be associated to.
  • (Mandatory) Define the Private IPv4 Address (Auto-Assign or custom create)
  • (Mandatory) Associate one or more security-groups to the ENI
  • (Optional) Enable Elastic Fabric Adapter
  • (Optional) Configure the relevant Tags

As could be noted, the private IPv4 address can be auto-assigned by AWS or custom-created to retain a specific IP address. The same creation steps can also be scripted, as shown in the sketch below.
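
Below is a minimal boto3 sketch; the subnet and security group IDs are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholders: replace with real subnet and security group IDs.
response = ec2.create_network_interface(
    SubnetId="subnet-0123456789abcdef0",
    Groups=["sg-0123456789abcdef0"],
    Description="NyaCorp ENI",
    # PrivateIpAddress omitted: AWS auto-assigns an address from the subnet.
)
print(response["NetworkInterface"]["NetworkInterfaceId"])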

The optional Elastic Fabric Adapter (EFA) field can be selected to enhance network scalability for high performance computing (HPC) applications. The EFAs are a type of ENI where the Message Passing Interface (MPI) leverages a library known as libfabric that bypasses the kernel and talks directly to the underlying EFA hardware. More details are available here.


Terraform Configuration

provider "aws" {
region = "us-east-1"
access_key = "<Removed>"
secret_key = "<Removed>"
}

data "aws_subnets" "subnets" {
}

data "aws_security_groups" "sg_groups" {
}

resource "aws_network_interface" "NyaCorp-ENI" {
subnet_id = "${data.aws_subnets.subnets.id}"
security_groups = ["${data.aws_security_groups.sg_groups.vpc_ids[0]}"]
}

It could be noted that the terraform configuration file doesn't have the private_ips field configured. When this field is missing, AWS will auto-assign the IPv4 address for the ENI.