Transformers: How Do They Transform Your Data?


Diving into the Transformers architecture and what makes them unbeatable at language tasks

Image by the author

In the rapidly evolving landscape of artificial intelligence and machine learning, one innovation stands out for its profound impact on how we process, understand, and generate data: Transformers. Transformers have revolutionized the field of natural language processing (NLP) and beyond, powering some of today’s most advanced AI applications. But what exactly are Transformers, and how do they manage to transform data in such groundbreaking ways? This article demystifies the inner workings of Transformer models, focusing on the encoder architecture. We will start by going through the implementation of a Transformer encoder in Python, breaking down its main components. Then, we will visualize how Transformers process and adapt input data during training.

While this blog doesn’t cover every architectural detail, it provides an implementation and an overall understanding of the transformative power of Transformers. For an in-depth explanation of Transformers, I suggest you look at the excellent Stanford CS224-n course.

I also recommend following the GitHub repository associated with this article for more details. 😊

What is a Transformer encoder architecture?

The Transformer model from Attention Is All You Need

This picture shows the original Transformer architecture, combining an encoder and a decoder for sequence-to-sequence language tasks.

In this article, we will focus on the encoder architecture (the red block on the picture). This is what the popular BERT model is using under the hood: the primary focus is on understanding and representing the data, rather than generating sequences. It can be used for a variety of applications: text classification, named-entity recognition (NER), extractive question answering, etc.

So, how is the data actually transformed by this architecture? We will explain each component in detail, but here is an overview of the process.

  • The input text is tokenized: the Python string is transformed into a list of tokens (numbers)
  • Each token is passed through an Embedding layer that outputs a vector representation for each token
  • The embeddings are then further encoded with a Positional Encoding layer, adding information about the position of each token in the sequence
  • These new embeddings are transformed by a series of Encoder Layers, using a self-attention mechanism
  • A task-specific head can be added. For example, we will later use a classification head to classify movie reviews as positive or negative

It is important to know that the Transformer architecture transforms the embedding vectors by mapping them from one representation in a high-dimensional space to another within the same space, applying a series of complex transformations.

Implementing an encoder architecture in Python

The Positional Encoder layer

Unlike RNN models, the attention mechanism makes no use of the order of the input sequence. The PositionalEncoder class adds positional encodings to the input embeddings, using two mathematical functions: cosine and sine.

Positional encoding matrix definition from Attention Is All You Need
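For reference, the sinusoidal definition from the paper can be written as follows, where pos is the position of the token in the sequence, i indexes pairs of embedding dimensions, and d_model is the embedding dimension:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
```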

Note that positional encodings don’t contain trainable parameters: they are the results of deterministic computations, which makes this method very tractable. Also, sine and cosine functions take values between -1 and 1 and have useful periodicity properties to help the model learn patterns about the relative positions of words.
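To make the deterministic computation concrete, here is a minimal pure-Python sketch of the sinusoidal encoding, without PyTorch. The function name positional_encoding is an illustrative assumption and is not part of the article’s repository.

```python
import math


def positional_encoding(max_length, d_model):
    """Return a max_length x d_model list of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_length)]
    for pos in range(max_length):
        for i in range(0, d_model, 2):
            # Even index i and odd index i + 1 share the same frequency
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_length=4, d_model=6)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
print(pe[1])  # position 1: values differ per dimension but stay within [-1, 1]
```

Nothing here is learned: calling the function twice with the same arguments always gives the same matrix, which is why the encodings can be computed once and reused.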

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add a batch dimension; a buffer is not trained but follows the module across devices
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]  # update embeddings
        return x

Multi-Head Self-Attention

The self-attention mechanism is the key component of the encoder architecture. Let’s ignore the “multi-head” for now. Attention is a way to determine for each token (i.e. each embedding) the relevance of all other embeddings to that token, to obtain a more refined and contextually relevant encoding.

How does “it” pay attention to other words of the sequence? (The Illustrated Transformer)

There are 4 steps in the self-attention mechanism.

  • Use matrices Q, K, and V to respectively transform the inputs “query”, “key” and “value”. Note that for self-attention, query, key, and values are all equal to our input embedding
  • Compute the attention scores as the dot product between the query and the key, a similarity measure (unlike cosine similarity, it is not normalized by the vector lengths). Scores are scaled by the square root of the key dimension (head_dim in the multi-head implementation) to stabilize the gradients during training
  • Use a softmax layer to make these scores probabilities
  • The output is the weighted average of the values, using the attention scores as the weights
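These steps can be sketched for a single head in pure Python. This is a simplified illustration, not the article’s implementation: it assumes the Q, K, and V projections are the identity, so queries, keys, and values are all the raw input vectors, and the names softmax and self_attention are hypothetical helpers.

```python
import math


def softmax(row):
    # Subtract the max for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Softmax turns the scores into weights that sum to 1
        weights = softmax(scores)
        # Output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-dimensional "embeddings"; identity projections are assumed
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token of the sequence, including itself.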

Mathematically, this corresponds to the following formula.

The Attention Mechanism from Attention Is All You Need
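For reference, the scaled dot-product attention from the paper can be written as follows, where d_k is the dimension of the key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```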

What does “multi-head” mean? Basically, we can apply the described self-attention mechanism process several times, in parallel, and concatenate and project the outputs. This allows each head to focus on different semantic aspects of the sentence.

We start by defining the number of heads, the dimension of the embeddings (d_model), and the dimension of each head (head_dim). We also initialize the Q, K, and V matrices (linear layers), and the final projection layer.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads  # d_model must be divisible by num_heads

        # Q, K, V projections and the final output projection
        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.output_linear = nn.Linear(d_model, d_model)

When using multi-head attention, we apply each attention head with a reduced dimension (head_dim instead of d_model) as in the original paper, making the total computational cost similar to a one-head attention layer with full dimensionality. Note this is a logical split only. What makes multi-head attention so powerful is that it can still be represented via a single matrix operation, making computations very efficient on GPUs.
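As a rough illustration of this logical split, the sketch below slices one embedding vector into per-head chunks and concatenates them back. It uses plain Python lists instead of tensors, and the helper names split_heads and merge_heads are assumptions made for this example only.

```python
def split_heads(embedding, num_heads):
    """Slice one d_model-sized vector into num_heads chunks of size head_dim."""
    d_model = len(embedding)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    return [embedding[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head chunks back into one d_model-sized vector."""
    return [value for head in heads for value in head]


embedding = list(range(8))                   # d_model = 8
heads = split_heads(embedding, num_heads=4)  # head_dim = 2
print(heads)               # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(merge_heads(heads))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

No value is created or lost by the split: each head simply works on its own slice of the projected embedding.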

    def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        # Move heads next to the batch dimension, then merge both: (batch_size * num_heads, seq_length, head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

We compute the attention scores and use a mask to avoid using attention on padded tokens. We apply a softmax activation to make these scores probabilities.

    def compute_attention(self, query, key, mask=None):
        # Compute dot-product attention scores
        # dimensions of query and key are (batch_size * num_heads, seq_length, head_dim)
        scores = query @ key.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Now, dimensions of scores is (batch_size * num_heads, seq_length, seq_length)
        if mask is not None:
            # split_heads stores the batch index first, so expose (batch_size, num_heads, seq_length, seq_length)
            scores = scores.view(-1, self.num_heads, mask.shape[1], mask.shape[2])
            # Broadcast the (batch_size, seq_length, seq_length) mask over the heads dimension
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e20)  # mask to avoid attention on padding tokens
            scores = scores.view(-1, mask.shape[1], mask.shape[2])  # reshape back to original shape
        # Normalize attention scores into attention weights
        attention_weights = F.softmax(scores, dim=-1)

        return attention_weights
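To see why filling masked scores with a very large negative number works, here is a small pure-Python sketch of a masked softmax over one row of scores. The function masked_softmax is a hypothetical helper written for this illustration.

```python
import math


def masked_softmax(scores, mask, fill_value=-1e20):
    """Softmax over one row of scores, ignoring positions where mask is 0."""
    filled = [s if keep else fill_value for s, keep in zip(scores, mask)]
    m = max(filled)
    # exp of a hugely negative number underflows to 0.0, so masked positions get no weight
    exps = [math.exp(v - m) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by two padding tokens
weights = masked_softmax([2.0, 1.0, 0.5, 0.5], mask=[1, 1, 0, 0])
print(weights)
```

The padded positions receive a weight of exactly zero and the remaining weights still sum to 1. If every position of a row is masked, the weights fall back to a uniform distribution instead of producing NaN values, because the fill value is finite.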

The forward method performs the multi-head logical split and computes the attention weights. Then, we get the output by multiplying these weights by the values. Finally, we reshape the output and project it with a linear layer.

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        query = self.split_heads(self.query_linear(query), batch_size)
        key = self.split_heads(self.key_linear(key), batch_size)
        value = self.split_heads(self.value_linear(value), batch_size)

        attention_weights = self.compute_attention(query, key, mask)

        # Multiply attention weights by values, concatenate and linearly project outputs
        output = torch.matmul(attention_weights, value)
        output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output_linear(output)

The Encoder Layer

This is the main component of the architecture, which leverages multi-head self-attention. We first implement a simple class to perform a feed-forward operation through 2 dense layers.

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the hidden dimension
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to the embedding dimension
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

We can now code the logic for the encoder layer. We start by applying self-attention to the input, which gives a vector of the same dimension. We then use our mini feed-forward network with Layer Norm layers. Note that we also use skip connections before applying normalization.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # skip connection and normalization
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output))  # skip connection and normalization
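The “skip connection, then normalization” pattern can be illustrated on a single vector in pure Python. This is a simplified sketch: it omits dropout and the learnable scale and shift of nn.LayerNorm (which start at 1 and 0), and the names layer_norm and add_and_norm are assumptions for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_output):
    """Skip connection (x + sublayer output) followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])


x = [1.0, 2.0, 3.0, 4.0]                # embedding entering the sub-layer
sublayer_output = [0.5, -0.5, 0.5, -0.5]  # e.g. attention or feed-forward output
y = add_and_norm(x, sublayer_output)
print([round(v, 3) for v in y])
```

The output keeps the same dimension as the input, which is what allows several encoder layers to be stacked one after another.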

Putting Everything Together

It’s time to create our final model. We pass our data through an embedding layer. This transforms our raw tokens (integers) into a numerical vector. We then apply our positional encoder and several (num_layers) encoder layers.

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    def forward(self, x, mask):
        x = self.embedding(x)  # token ids -> embedding vectors
        x = self.positional_encoding(x)  # add position information
        for layer in self.layers:
            x = layer(x, mask)
        return x

We also create a ClassifierHead class which is used to transform the final embedding into class probabilities for our classification task.

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        logits = self.fc(x[:, 0, :])  # first token corresponds to the classification token
        return F.softmax(logits, dim=-1)

Note that the dense and softmax layers are only applied on the first embedding (corresponding to the first token of our input sequence). This is because when tokenizing the text, the first token is the [CLS] token which stands for “classification.” The [CLS] token is designed to aggregate the entire sequence’s information into a single embedding vector, serving as a summary representation that can be used for classification tasks.

Note: the concept of including a [CLS] token originates from BERT, which was initially trained on tasks like next-sentence prediction. The [CLS] token was inserted to predict the probability that sentence B follows sentence A, with a [SEP] token separating the 2 sentences. For our model, the [SEP] token simply marks the end of the input sentence, as shown below.

[CLS] Token in BERT Architecture (All About AI)

When you think about it, it’s really mind-blowing that this single [CLS] embedding is able to capture so much information about the entire sequence, thanks to the self-attention mechanism’s ability to weigh and synthesize the importance of every piece of the text in relation to each other.

Training and visualization

Hopefully, the previous section gives you a better understanding of how our Transformer model transforms the input data. We will now write our training pipeline for our binary classification task using the IMDB dataset (movie reviews). Then, we will visualize the embedding of the [CLS] token during the training process to see how our model transformed it.

We first define our hyperparameters, as well as a BERT tokenizer. In the GitHub repository, you can see that I also coded a function to select a subset of the dataset with only 1200 train and 200 test examples.

from datasets import load_dataset
from transformers import AutoTokenizer

num_classes = 2  # binary classification
d_model = 256  # dimension of the embedding vectors
num_heads = 4  # number of heads for self-attention
num_layers = 4  # number of encoder layers
d_ff = 512  # dimension of the dense layers in the encoder layers
sequence_length = 256  # maximum sequence length
dropout = 0.4  # dropout to avoid overfitting
num_epochs = 20
batch_size = 32

loss_function = torch.nn.CrossEntropyLoss()

dataset = load_dataset("imdb")
dataset = balance_and_create_dataset(dataset, 1200, 200)  # check GitHub repo

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=sequence_length)

You can try to use the BERT tokenizer on one of the sentences.


Every sequence should start with the token 101, corresponding to [CLS], followed by some non-zero integers and padded with zeros if the sequence length is smaller than 256. Note that these zeros are ignored during the self-attention computation using our “mask”.
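The layout described above can be mimicked without downloading any tokenizer. The sketch below is only an illustration: the ids 101, 102, and 0 are the [CLS], [SEP], and [PAD] ids of bert-base-uncased, but the word ids 2023 and 3185 are arbitrary placeholders, and pad_and_mask and pairwise_mask are hypothetical helpers, not functions from the transformers library.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids of bert-base-uncased


def pad_and_mask(token_ids, max_length):
    """Wrap ids with [CLS]/[SEP], pad with zeros, and build the attention mask."""
    ids = [CLS_ID] + token_ids[:max_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [PAD_ID] * (max_length - len(ids))
    return ids, attention_mask


def pairwise_mask(attention_mask):
    """Turn the mask vector into a matrix: 1 only where both tokens are real."""
    return [[a & b for b in attention_mask] for a in attention_mask]


ids, mask = pad_and_mask([2023, 3185], max_length=6)  # placeholder word ids
print(ids)   # [101, 2023, 3185, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
for row in pairwise_mask(mask):
    print(row)
```

The matrix form is the same idea as the two attention_mask lines in the training loop: an entry is kept only when both the attending token and the attended token are real tokens.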

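from torch.utils.data import DataLoader

# The original mapping call is truncated in the source. tokenize_batch is a placeholder
# name (an assumption) for a function that tokenizes, pads and truncates each review.
def tokenize_batch(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_batch, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataloader = DataLoader(tokenized_datasets['train'], batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(tokenized_datasets['test'], batch_size=batch_size, shuffle=True)

vocab_size = tokenizer.vocab_size

encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

# A single optimizer updates both the encoder and the classification head
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)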

We can now write our train function:

from tqdm import tqdm

dic_embeddings = {}  # epoch -> [reduced embeddings, labels], used for the visualization

def train(dataloader, encoder, classifier, optimizer, loss_function, num_epochs):
    for epoch in range(num_epochs):
        # Collect and store embeddings before each epoch starts for visualization purposes (check repo)
        all_embeddings, all_labels = collect_embeddings(encoder, dataloader)
        reduced_embeddings = visualize_embeddings(all_embeddings, all_labels, epoch, show=False)
        dic_embeddings[epoch] = [reduced_embeddings, all_labels]

        correct_predictions = 0
        total_predictions = 0
        for batch in tqdm(dataloader, desc="Training"):
            optimizer.zero_grad()  # reset gradients accumulated by the previous batch
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']  # indicate where padded tokens are
            # These 2 lines make the attention_mask a matrix instead of a vector
            attention_mask = attention_mask.unsqueeze(-1)
            attention_mask = attention_mask & attention_mask.transpose(1, 2)
            labels = batch['label']
            output = encoder(input_ids, attention_mask)
            classification = classifier(output)
            loss = loss_function(classification, labels)
            loss.backward()  # backpropagate the loss
            optimizer.step()  # update encoder and classifier weights
            preds = torch.argmax(classification, dim=1)
            correct_predictions += torch.sum(preds == labels).item()
            total_predictions += labels.size(0)

        epoch_accuracy = correct_predictions / total_predictions
        print(f'Epoch {epoch} Training Accuracy: {epoch_accuracy:.4f}')

You can find the collect_embeddings and visualize_embeddings functions in the GitHub repo. They store the [CLS] token embedding for each sentence of the training set, apply a dimensionality reduction technique called t-SNE to make them 2D vectors (instead of 256-dimensional vectors), and save an animated plot.

Let’s visualize the results.

Projected [CLS] embeddings for each training point (blue corresponds to positive sentences, red corresponds to negative sentences)

Observing the plot of projected [CLS] embeddings for each training point, we can see the clear distinction between positive (blue) and negative (red) sentences after a few epochs. This visual shows the remarkable capability of the Transformer architecture to adapt embeddings over time and highlights the power of the self-attention mechanism. The data is transformed in such a way that embeddings for each class are well separated, thereby significantly simplifying the task for the classifier head.


As we conclude our exploration of the Transformer architecture, it’s evident that these models are adept at tailoring data to a given task. With the use of positional encoding and multi-head self-attention, Transformers go beyond mere data processing: they interpret and understand information with a level of sophistication previously unseen. The ability to dynamically weigh the relevance of different parts of the input data allows for a more nuanced understanding and representation of the input text. This enhances performance across a wide array of downstream tasks, including text classification, question answering, named entity recognition, and more.

Now that you have a better understanding of the encoder architecture, you are ready to delve into decoder and encoder-decoder models, which are very similar to what we have just explored. Decoders play a pivotal role in generative tasks and are at the core of the popular GPT models.


[1] Vaswani, Ashish, et al. “Attention Is All You Need.” 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[2] “The Illustrated Transformer.” Jay Alammar’s Blog, June 2018.

[3] Official PyTorch Implementation of the Transformer Architecture. GitHub repository, PyTorch.

[4] Manning, Christopher, et al. “CS224n: Natural Language Processing with Deep Learning.” Stanford University, Stanford CS224N NLP course.


Transformers: How Do They Transform Your Data? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
