A Simple Example ChatGPT Used to Explain the Transformer

Paul Xiong
6 min read · May 11, 2023

Certainly! Let’s consider a simple example to explain how the Transformer architecture works. Suppose we have a sentence: “I love pizza.”

1. Tokenization: The first step is to tokenize the input sentence into individual words or subwords. In this case, the sentence might be tokenized into [“I”, “love”, “pizza”].
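As a rough sketch, a whitespace tokenizer for this example could look like the following (real Transformers use learned subword tokenizers such as BPE or WordPiece, so this is a simplification):

```python
def tokenize(sentence):
    # Strip the trailing period and split on whitespace.
    # Subword tokenizers would instead split words into learned pieces.
    return sentence.replace(".", "").split()

tokens = tokenize("I love pizza.")
print(tokens)  # ['I', 'love', 'pizza']
```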

2. Embedding: Each token is then converted into a vector representation called an embedding. These embeddings capture the meaning and context of the words. For example, “I” might be represented as [0.2, 0.4, -0.1], “love” as [0.5, 0.9, 0.3], and “pizza” as [0.8, 0.2, 0.6].
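In code, the embedding step is just a lookup into a table of vectors. Here is a minimal sketch using the illustrative values above (in a trained model these vectors are learned parameters, not hand-picked numbers):

```python
# Toy embedding table: each token maps to a 3-dimensional vector.
# The values are the illustrative ones from the text, not learned.
embedding_table = {
    "I":     [0.2, 0.4, -0.1],
    "love":  [0.5, 0.9, 0.3],
    "pizza": [0.8, 0.2, 0.6],
}

tokens = ["I", "love", "pizza"]
E = [embedding_table[t] for t in tokens]  # one vector per token
```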

3. Positional Encoding: To account for the order of the words in the sentence, positional encoding is added to the embeddings. This helps the model understand the sequential nature of the input.
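One common choice (used in the original Transformer) is sinusoidal positional encoding, where each position gets a fixed pattern of sines and cosines that is added to the token's embedding. A small sketch, assuming a 3-dimensional embedding to match the example:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal encoding: even dimensions use sin, odd dimensions use cos,
    # with frequencies that decrease as the dimension index grows.
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Add the encoding for position 0 to the embedding of "I".
emb_I = [0.2, 0.4, -0.1]
pe_0 = positional_encoding(0, 3)          # position 0 -> [0.0, 1.0, 0.0]
encoded_I = [e + p for e, p in zip(emb_I, pe_0)]
```

Because the encoding depends only on the position, the same word at a different position in the sentence ends up with a different input vector, which is how order information reaches the model.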

4. Self-Attention: The self-attention mechanism allows the model to weigh the importance of different words in the sentence when understanding the context. It calculates attention scores between each word and every other word in the sentence. These attention scores represent how much each word should focus on other words.

The attention scores in the self-attention mechanism of the Transformer architecture can be represented as a matrix or a set of matrices, depending on the specific implementation. Let’s consider an example with the sentence “I love pizza” to illustrate how the attention scores might look.

For simplicity, let’s assume we have three words in the sentence: [“I”, “love”, “pizza”]. We’ll denote the embeddings of these words as E, and the attention scores as A.

In the self-attention mechanism, attention scores are calculated by comparing each word with every other word in the sentence. The attention scores determine the relevance or importance of each word with respect to others.

Let’s assume the attention scores are represented as a matrix A with dimensions (3x3) for our example. The entry A(i, j) in the matrix represents the attention score of the i-th word attending to the j-th word.

For our example sentence “I love pizza,” the attention scores matrix A might look like:


A = | A(1,1) A(1,2) A(1,3) |
    | A(2,1) A(2,2) A(2,3) |
    | A(3,1) A(3,2) A(3,3) |

Each entry in the attention scores matrix represents the attention weight or importance of a word attending to another word. Higher attention scores indicate greater relevance or importance.

In practice, attention scores are often calculated using a softmax function over a compatibility or similarity measure between pairs of words. The softmax function ensures that the attention scores sum to 1 along each row, representing a valid attention distribution.
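The softmax step above can be sketched in code. For simplicity this version scores each pair of words by the scaled dot product of their raw embeddings; a real Transformer first projects the embeddings into separate query and key vectors before comparing them:

```python
import math

def softmax(xs):
    # Exponentiate and normalize so the values sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(E):
    # A(i, j) = softmax over j of (E_i . E_j) / sqrt(d).
    # (Simplification: raw embeddings stand in for queries and keys.)
    d = len(E[0])
    A = []
    for e_i in E:
        row = [sum(a * b for a, b in zip(e_i, e_j)) / math.sqrt(d) for e_j in E]
        A.append(softmax(row))
    return A

E = [[0.2, 0.4, -0.1], [0.5, 0.9, 0.3], [0.8, 0.2, 0.6]]
A = attention_scores(E)  # 3x3 matrix; every row sums to 1
```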

It’s important to note that the specific values of the attention scores will depend on the input sentence and the learned parameters of the Transformer model. The model learns to assign attention scores based on the context and the relationships between words in the input sequence.

5. Contextual Representation: Using the attention scores, the model generates a weighted sum of the embeddings of all the words in the sentence. This produces a contextually rich representation for each word, capturing its relationship with other words in the sentence.

In step 5 of the Transformer architecture, the contextual representation is obtained by applying the attention scores to the embeddings of the words in the input sentence. This process generates a weighted sum of the embeddings, where the weights are determined by the attention scores. Here’s how the contextual representation might look based on the example sentence “I love pizza”:

Let’s denote the contextual representations as C, the embeddings as E, and the attention scores as A.

  • Compute the attention weights: The raw attention scores are passed through a softmax so that each row sums to 1; the resulting values are the attention weights, which say how strongly each word attends to every other word.
  • Calculate the contextual representation: The contextual representation of a word is a weighted sum of the embeddings of all the words in the sentence, with the weights taken from that word’s row of the attention matrix.

For example, let’s assume the embeddings E for the words [“I”, “love”, “pizza”] are as follows:

E(I) = [0.2, 0.4, -0.1]
E(love) = [0.5, 0.9, 0.3]
E(pizza) = [0.8, 0.2, 0.6]

And the attention scores A are as follows:

A = | A(1,1) A(1,2) A(1,3) |
    | A(2,1) A(2,2) A(2,3) |
    | A(3,1) A(3,2) A(3,3) |

To compute the contextual representation C for each word, we multiply the attention weights by the respective embeddings and sum them up. The contextual representation C(i) for the i-th word can be calculated as:


C(i) = A(i,1) * E(I) + A(i,2) * E(love) + A(i,3) * E(pizza)

So, for example, the contextual representation C(1) for the word “I” would be:


C(1) = A(1,1) * E(I) + A(1,2) * E(love) + A(1,3) * E(pizza)

The same process applies to calculate C(2) and C(3) for the words “love” and “pizza” respectively.
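The weighted sum above translates directly into code. The attention weights below are made-up numbers chosen only so each row sums to 1, not values from a trained model:

```python
def contextual_representation(A, E):
    # C(i) = sum over j of A(i, j) * E(j): each output vector is a
    # weighted sum of all embeddings, weighted by row i of A.
    C = []
    for row in A:
        c = [0.0] * len(E[0])
        for weight, e in zip(row, E):
            c = [ci + weight * ei for ci, ei in zip(c, e)]
        C.append(c)
    return C

E = [[0.2, 0.4, -0.1], [0.5, 0.9, 0.3], [0.8, 0.2, 0.6]]
A = [[0.6, 0.3, 0.1],          # hypothetical attention weights,
     [0.2, 0.5, 0.3],          # each row sums to 1
     [0.1, 0.2, 0.7]]

C = contextual_representation(A, E)
# C[0] mixes all three embeddings, dominated by E("I") with weight 0.6.
```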

The resulting contextual representations capture the contextual information of each word in the input sentence, taking into account the relationships between words as determined by the attention scores. These representations serve as inputs for further layers in the Transformer model, such as the feed-forward network, to make predictions or perform downstream tasks.

6. Feed-Forward Network: The contextual representations of the words are then passed through a feed-forward neural network, which applies non-linear transformations to further refine the representations.

In step 6 of the Transformer architecture, the input to the feed-forward network is the contextual representation obtained from step 5. Let’s denote the input to the feed-forward network as X_in and the output as X_out. Here’s how the input and output might look based on the example sentence “I love pizza”:

Assuming the contextual representations C for the words [“I”, “love”, “pizza”] are as follows:

C(I) = [0.3, 0.1, 0.5]
C(love) = [0.7, 0.2, 0.4]
C(pizza) = [0.4, 0.6, 0.9]

In the actual Transformer, the feed-forward network is applied to each position independently (position-wise), so each contextual representation is transformed separately rather than concatenated into one long vector:

X_in(I) = C(I) = [0.3, 0.1, 0.5]
X_in(love) = C(love) = [0.7, 0.2, 0.4]
X_in(pizza) = C(pizza) = [0.4, 0.6, 0.9]

The feed-forward network applies non-linear transformations to this input representation to further process and refine the information. It typically consists of multiple layers with intermediate activation functions.
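A minimal sketch of such a network is two linear layers with a ReLU in between, applied to one token's vector at a time. The weights below are tiny hypothetical values for illustration, not learned parameters:

```python
def relu(x):
    return max(0.0, x)

def linear(x, W, b):
    # y = W x + b, with W given as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def feed_forward(c, W1, b1, W2, b2):
    # Position-wise FFN: FFN(x) = W2 * relu(W1 * x + b1) + b2,
    # applied to each token's contextual vector independently.
    hidden = [relu(h) for h in linear(c, W1, b1)]
    return linear(hidden, W2, b2)

# Hypothetical weights: 3 -> 2 hidden units -> 3 outputs.
W1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0],
      [0.0, 1.0],
      [1.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

x_out = feed_forward([0.3, 0.1, 0.5], W1, b1, W2, b2)
```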

The output X_out of the feed-forward network will depend on the specific architecture and parameters of the model. The output may undergo additional transformations or be used for various downstream tasks, such as sentiment analysis, language translation, or text generation.

It’s important to note that the specific values of the input and output will depend on the learned parameters of the Transformer model and the input sentence. The model learns to transform the input based on the context and the relationships between words, capturing higher-level representations and patterns in the data.

7. Output: Finally, the output of the transformer model can be used for various tasks such as language translation, sentiment analysis, or text generation.

The key idea behind the Transformer architecture is the self-attention mechanism, which allows the model to capture dependencies between words without relying on recurrent or convolutional structures. This enables Transformers to handle long-range dependencies and process input in parallel, leading to improved performance in various natural language processing tasks.

Please note that this example is a simplified explanation of the Transformer architecture, and the actual implementation can involve more complex components and additional layers.

Here’s a text diagram summarizing the total steps of the Transformer architecture explained above, using the example sentence “I love pizza”:

+-----------------------------------+
|               Input               |
|-----------------------------------|
| Sentence: "I love pizza"          |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|           Tokenization            |
|-----------------------------------|
| Tokens: ["I", "love", "pizza"]    |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|             Embedding             |
|-----------------------------------|
| "I":     [0.2, 0.4, -0.1]         |
| "love":  [0.5, 0.9, 0.3]          |
| "pizza": [0.8, 0.2, 0.6]          |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|        Positional Encoding        |
|-----------------------------------|
| Encoded embeddings (position      |
| information added to each vector) |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|           Self-Attention          |
|-----------------------------------|
| Attention scores:                 |
| | A(1,1) A(1,2) A(1,3) |          |
| | A(2,1) A(2,2) A(2,3) |          |
| | A(3,1) A(3,2) A(3,3) |          |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|     Contextual Representation     |
|-----------------------------------|
| C(I):     [0.3, 0.1, 0.5]         |
| C(love):  [0.7, 0.2, 0.4]         |
| C(pizza): [0.4, 0.6, 0.9]         |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|       Feed-Forward Network        |
|-----------------------------------|
| Input:  X_in (the C vectors)      |
| Output: X_out                     |
+-----------------------------------+
                 |
                 v
+-----------------------------------+
|              Output               |
|-----------------------------------|
| Prediction, translation,          |
| sentiment analysis, etc.          |
+-----------------------------------+
