How the “Multi words mask matrix” was built, with an example:

Paul Xiong
2 min read · May 30, 2023

This code randomly selects a percentage of tokens to mask, replaces them with [MASK], and then builds a masked word activity matrix using a feature creation matrix (drawn at random here, standing in for a learned one). The resulting matrix holds a score for every pair of tokens in the input sequence and is used during self-attention so that each token can attend only to the non-masked, non-future tokens in the sequence.

input_sequence:

“the quick brown fox jumps over the lazy dog”

tokens via tokens = input_sequence.split():

tokens:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

after random masking:

['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']
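The masking step itself isn't shown in the article; a minimal sketch of how a fraction of tokens could be replaced with [MASK] (the mask_fraction name and the 15% rate are assumptions, not taken from the original):

import random

mask_fraction = 0.15  # assumed masking rate; the article does not state the percentage
num_to_mask = max(1, int(len(tokens) * mask_fraction))

# pick token positions at random and overwrite them with the [MASK] placeholder
mask_positions = random.sample(range(len(tokens)), num_to_mask)
tokens = ['[MASK]' if i in mask_positions else tok for i, tok in enumerate(tokens)]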

feature_creation_matrix via feature_creation_matrix = torch.randn((len(tokens), 2)):

feature_creation_matrix:
tensor([[-0.0312, 0.8866],
[-0.6970, 0.6740],
[-0.1193, -0.5010],
[ 0.2698, -0.8354],
[ 0.2350, 0.1044],
[-1.0366, -0.9134],
[ 0.6098, -0.7495],
[-0.0932, 0.5929],
[ 0.5193, -0.4662]])

Initialize an empty masked word activity matrix via masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens))):

masked_word_activity_matrix:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])

Iterate over each token and generate features for all possible pairs of tokens, pair_vectors (the printed outputs below were captured across different loop iterations and random draws, so they don't line up exactly with the feature_creation_matrix above; in the masked vector, i is 7, the position of 'lazy'):

token_vector = feature_creation_matrix[i]
# output: token_vector= tensor([-1.5081, -1.8666])
pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
# output: pair_vectors= tensor([-6.7708, 0.6501, -0.4261, 2.4651, 2.2849, 1.5554, -1.5681, 2.8133, 3.6422])
pair_vectors[:i] = float('-inf')
pair_vectors[i+1:] = float('-inf')
# output: pair_vectors= tensor([ -inf, -inf, -inf, -inf, -inf, -inf, -inf, 2.8133, -inf])
attention_scores = torch.softmax(pair_vectors, dim=0)
# output: attention_scores= tensor([0., 0., 0., 0., 0., 0., 0., 1., 0.])
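One way to read the matmul: pair_vectors[j] is simply the dot product between row j of feature_creation_matrix and token_vector, i.e. a raw pairwise score between token j and token i. A small sanity check, not part of the original article, that recomputes the same scores with explicit dot products (run it before the -inf masking):

raw_scores = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
# one explicit dot product per token should reproduce the matmul result
dot_scores = torch.stack([torch.dot(row, token_vector) for row in feature_creation_matrix])
assert torch.allclose(raw_scores, dot_scores)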

Set any entries corresponding to [MASK] tokens or future tokens to -infinity so they are not considered during self-attention. (Note that as written, the two slice assignments pair_vectors[:i] and pair_vectors[i+1:] mask every position except i itself, which is why each row of the final matrix ends up one-hot.)

pair_vectors[:i] = float('-inf')
pair_vectors[i+1:] = float('-inf')
print(f"pair_vectors= {pair_vectors}")
# output: pair_vectors= tensor([ -inf, -inf, -inf, -inf, -inf, 1.2442, -inf, -inf, -inf])
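Setting a position to -inf guarantees it gets exactly zero weight after the softmax, because exp(-inf) is 0. A quick standalone check (not from the article):

import torch

scores = torch.tensor([float('-inf'), float('-inf'), 1.2442, float('-inf')])
weights = torch.softmax(scores, dim=0)
print(weights)  # tensor([0., 0., 1., 0.]) -- the single finite entry takes all the probability mass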

Then we get the attention scores:

attention_scores = torch.softmax(pair_vectors, dim=0)
print(f"attention_scores= {attention_scores}")
# input for this iteration: pair_vectors= tensor([ -inf, -inf, -inf, -inf, -inf, -inf, 2.1547, -inf, -inf])
# output: attention_scores= tensor([0., 0., 0., 0., 0., 0., 1., 0., 0.])

Line by line, this fills in masked_word_activity_matrix.

The end result, with each row's softmax putting all of its weight on position i:

masked_word_activity_matrix[i] = attention_scores
print(f'masked_word_activity_matrix= {masked_word_activity_matrix}')
# output: masked_word_activity_matrix= tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.]])
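Putting the pieces together, a self-contained sketch of the whole procedure (it follows the snippets above; the 15% masking rate and the fixed seeds are assumptions added for reproducibility, not part of the original):

import random
import torch

torch.manual_seed(0)  # assumed: seeds added so the random values are reproducible
random.seed(0)

input_sequence = "the quick brown fox jumps over the lazy dog"
tokens = input_sequence.split()

# randomly replace a fraction of tokens with [MASK] (rate assumed; the article does not state it)
mask_fraction = 0.15
num_to_mask = max(1, int(len(tokens) * mask_fraction))
mask_positions = set(random.sample(range(len(tokens)), num_to_mask))
tokens = ['[MASK]' if i in mask_positions else tok for i, tok in enumerate(tokens)]

# random feature creation matrix: one 2-dimensional feature vector per token
feature_creation_matrix = torch.randn((len(tokens), 2))

masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens)))

for i in range(len(tokens)):
    token_vector = feature_creation_matrix[i]
    # pairwise scores between token i and every token in the sequence
    pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
    # mask out every position except i itself, exactly as in the walkthrough above
    pair_vectors[:i] = float('-inf')
    pair_vectors[i+1:] = float('-inf')
    attention_scores = torch.softmax(pair_vectors, dim=0)
    masked_word_activity_matrix[i] = attention_scores

print(masked_word_activity_matrix)  # identity matrix, as in the output above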

The source code is available on Colab.

