How the “Multi words mask matrix” was built, with an example:

Paul Xiong
May 30, 2023

This code randomly selects a percentage of tokens to mask, replaces them with [MASK], and then generates a masked word activity matrix using a learned feature-creation matrix. Each row of the resulting matrix scores the pairs formed between one token and every token in the input sequence, and the matrix is used during self-attention to ensure that a token cannot attend to masked-out positions in the sequence.

input_sequence:

“the quick brown fox jumps over the lazy dog”

tokens via tokens = input_sequence.split():

tokens:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

tokens after random masking:

['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']
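
The masking code itself isn't shown here; a minimal sketch of that step, assuming a 15% mask probability and random.sample for choosing positions (both are illustrative choices, not from the original), could look like this:

    import random

    def randomly_mask(tokens, mask_probability=0.15):
        # Pick a random subset of positions to mask (at least one).
        num_to_mask = max(1, int(len(tokens) * mask_probability))
        mask_positions = set(random.sample(range(len(tokens)), num_to_mask))
        # Replace the chosen tokens with the [MASK] placeholder.
        return ['[MASK]' if i in mask_positions else tok for i, tok in enumerate(tokens)]

    tokens = "the quick brown fox jumps over the lazy dog".split()
    tokens = randomly_mask(tokens)
    # possible output: ['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']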

feature_creation_matrix via feature_creation_matrix = torch.randn((len(tokens), 2)):

feature_creation_matrix:
tensor([[-0.0312, 0.8866],
[-0.6970, 0.6740],
[-0.1193, -0.5010],
[ 0.2698, -0.8354],
[ 0.2350, 0.1044],
[-1.0366, -0.9134],
[ 0.6098, -0.7495],
[-0.0932, 0.5929],
[ 0.5193, -0.4662]])

Initialize an empty masked word activity matrix via masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens))):

masked_word_activity_matrix:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Iterate over each token and generate features for all possible pairs with that token (pair_vectors). Inside the loop, i is the index of the current token:

    token_vector = feature_creation_matrix[i]
    # output: token_vector = tensor([-1.5081, -1.8666])
    pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
    # output: pair_vectors = tensor([-6.7708, 0.6501, -0.4261, 2.4651, 2.2849, 1.5554, -1.5681, 2.8133, 3.6422])
    pair_vectors[:i] = float('-inf')
    pair_vectors[i+1:] = float('-inf')
    # output: pair_vectors = tensor([-inf, -inf, -inf, -inf, -inf, -inf, -inf, 2.8133, -inf])
    attention_scores = torch.softmax(pair_vectors, dim=0)
    # output: attention_scores = tensor([0., 0., 0., 0., 0., 0., 0., 1., 0.])

Set the entries that should not be attended to (in this code, every position before and after the current token i) to -infinity so they are not considered during self-attention:

    pair_vectors[:i] = float('-inf')
    pair_vectors[i+1:] = float('-inf')
    print(f"pair_vectors = {pair_vectors}")
    # output (here from the iteration with i = 5): tensor([-inf, -inf, -inf, -inf, -inf, 1.2442, -inf, -inf, -inf])

Then we get the attention scores. Because every entry except position i is -inf, the softmax puts all of its weight on position i, so each row of attention scores is one-hot:

    attention_scores = torch.softmax(pair_vectors, dim=0)
    print(f"pair_vectors = {pair_vectors}")
    # output (here from the iteration with i = 6): pair_vectors = tensor([-inf, -inf, -inf, -inf, -inf, -inf, 2.1547, -inf, -inf])

Line by line, this fills in masked_word_activity_matrix. The end result:

    masked_word_activity_matrix[i] = attention_scores
    print(f'masked_word_activity_matrix = {masked_word_activity_matrix}')
#output: masked_word_activity_matrix= tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.]])
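
Putting the pieces together, here is a self-contained sketch of the whole loop, using the same variable names as above (the random seed is not fixed, so the intermediate numbers will differ from the printed outputs, but the final matrix always comes out as the 9x9 identity):

    import torch

    tokens = ['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']

    # Learned (here: randomly initialized) feature-creation matrix, one 2-d vector per token.
    feature_creation_matrix = torch.randn((len(tokens), 2))

    # Start from an all-zeros masked word activity matrix.
    masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens)))

    for i in range(len(tokens)):
        # Feature vector of the current token.
        token_vector = feature_creation_matrix[i]
        # Dot product of every token's features with the current token's features.
        pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
        # Block out every position before and after the current token.
        pair_vectors[:i] = float('-inf')
        pair_vectors[i+1:] = float('-inf')
        # Softmax over a vector with a single finite entry puts all of the weight on position i.
        attention_scores = torch.softmax(pair_vectors, dim=0)
        masked_word_activity_matrix[i] = attention_scores

    print(masked_word_activity_matrix)  # a 9x9 identity matrix, as shown above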

The full source code is available to run on Colab.
