How the “multi-word mask matrix” is built, with an example:
This code randomly selects a percentage of tokens to mask, replaces them with [MASK], and then builds a masked word activity matrix from a feature creation matrix (learned in a real model; random here for illustration). Each row of the resulting matrix holds attention scores for one token against every token in the input sequence, and it is used during self-attention so that each token can only attend to the positions that survive the mask.
input_sequence:
“the quick brown fox jumps over the lazy dog”
tokens via tokens = input_sequence.split():
tokens:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
after random masking:
['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']
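The masking step itself is not shown above. A minimal sketch, assuming a 15% mask rate with positions chosen uniformly at random (mask_percentage, num_to_mask, and mask_indices are illustrative names, not taken from the original code):
import random

mask_percentage = 0.15   # assumed rate; the original only says "a percentage"
num_to_mask = max(1, int(len(tokens) * mask_percentage))        # mask at least one token
mask_indices = random.sample(range(len(tokens)), num_to_mask)   # positions to replace
tokens = ['[MASK]' if i in mask_indices else tok for i, tok in enumerate(tokens)]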
feature_creation_matrix via feature_creation_matrix = torch.randn((len(tokens), 2)):
feature_creation_matrix:
tensor([[-0.0312, 0.8866],
[-0.6970, 0.6740],
[-0.1193, -0.5010],
[ 0.2698, -0.8354],
[ 0.2350, 0.1044],
[-1.0366, -0.9134],
[ 0.6098, -0.7495],
[-0.0932, 0.5929],
[ 0.5193, -0.4662]])
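Here the matrix is random noise for illustration; in a real model the “learned feature creation matrix” mentioned above would typically come from an embedding layer. A hedged sketch of that variant (the vocab mapping, token_ids, and the nn.Embedding usage are assumptions, not part of the original code):
import torch
import torch.nn as nn

vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}    # token -> id
token_ids = torch.tensor([vocab[tok] for tok in tokens])               # shape: (len(tokens),)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=2)   # trained with the model
feature_creation_matrix = embedding(token_ids).detach()                # shape: (len(tokens), 2)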
Initialize an empty masked word activity matrix via masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens))):
masked_word_activity_matrix:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Iterate over each token i and generate features for all pairs involving that token (pair_vectors):
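The loop header itself does not appear in the original, so assuming the straightforward form, the steps below run once per position i (a consolidated version follows the final matrix):
for i in range(len(tokens)):   # assumed loop header
    # 1) look up token i's feature vector
    # 2) score it against every token -> pair_vectors
    # 3) mask every position except i with -inf
    # 4) softmax -> attention_scores, written into row i of the matrix
    ...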
token_vector = feature_creation_matrix[i]
# output: token_vector= tensor([-1.5081, -1.8666])
pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
# output: pair_vectors= tensor([-6.7708, 0.6501, -0.4261, 2.4651, 2.2849, 1.5554, -1.5681, 2.8133, 3.6422])
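The matmul/unsqueeze/squeeze combination above is just a matrix-vector product: each entry of pair_vectors is the dot product of one row of feature_creation_matrix with token_vector. An equivalent, shorter form (this rewrite is a suggestion, not part of the original code):
pair_vectors = feature_creation_matrix @ token_vector   # shape: (len(tokens),)
# or: pair_vectors = torch.mv(feature_creation_matrix, token_vector)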
Set every entry before and after position i to -infinity so those positions are not considered during self-attention:
pair_vectors[:i] = float('-inf')
pair_vectors[i+1:] = float('-inf')
# output: pair_vectors= tensor([ -inf, -inf, -inf, -inf, -inf, -inf, -inf, 2.8133, -inf])
attention_scores = torch.softmax(pair_vectors, dim=0)
# the softmax of the masked vector above is one-hot at position 7:
# attention_scores= tensor([0., 0., 0., 0., 0., 0., 0., 1., 0.])
Because exp(-inf) = 0, the softmax leaves all of its weight on the one finite entry, position i. The same steps repeat for every position: compute pair_vectors, set every entry before and after position i to -infinity so those positions are not considered during self-attention, then take the softmax. Outputs captured from other iterations of the loop:
# unmasked pair_vectors on one iteration:
# tensor([ 0.7870, 0.6193, -0.4404, -0.7490, 0.0852, -0.7774, -0.6836, 0.5286, -0.4295])
# masked pair_vectors on the iteration with i = 5:
# tensor([ -inf, -inf, -inf, -inf, -inf, 1.2442, -inf, -inf, -inf])
# masked pair_vectors on the iteration with i = 6:
# tensor([ -inf, -inf, -inf, -inf, -inf, -inf, 2.1547, -inf, -inf])
Line by line, the attention scores fill in masked_word_activity_matrix.
The end result (since each row's softmax puts all of its weight on position i, the matrix comes out as the identity):
masked_word_activity_matrix[i] = attention_scores
print(f'masked_word_activity_matrix= {masked_word_activity_matrix}')
#output: masked_word_activity_matrix= tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.]])
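Putting the pieces together, a minimal end-to-end sketch of the walkthrough above (the 15% mask rate and the explicit loop header are assumptions; everything else mirrors the snippets shown):
import random
import torch

input_sequence = "the quick brown fox jumps over the lazy dog"
tokens = input_sequence.split()

# randomly replace a percentage of tokens with [MASK] (rate assumed to be 15%)
mask_percentage = 0.15
num_to_mask = max(1, int(len(tokens) * mask_percentage))
mask_indices = random.sample(range(len(tokens)), num_to_mask)
tokens = ['[MASK]' if i in mask_indices else tok for i, tok in enumerate(tokens)]

# random stand-in for the learned feature creation matrix
feature_creation_matrix = torch.randn((len(tokens), 2))

# one row of attention scores per token
masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens)))

for i in range(len(tokens)):
    token_vector = feature_creation_matrix[i]
    # score token i against every token in the sequence
    pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
    # mask everything before and after position i
    pair_vectors[:i] = float('-inf')
    pair_vectors[i + 1:] = float('-inf')
    # softmax over the masked scores; only position i is finite, so row i is one-hot
    attention_scores = torch.softmax(pair_vectors, dim=0)
    masked_word_activity_matrix[i] = attention_scores

print(masked_word_activity_matrix)   # prints the 9x9 identity matrix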