How the “multi-word mask matrix” is built, with an example:
This code randomly selects a percentage of tokens to mask, replaces them with [MASK], and then builds a masked word activity matrix from a feature creation matrix (learned in a real model; random here for illustration). Each row of the resulting matrix holds attention scores for one token against every token in the input sequence, and it is used during self-attention so that each token can only attend to the positions that survive the mask.
input_sequence:
“the quick brown fox jumps over the lazy dog”
tokens via tokens = input_sequence.split():
tokens:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
after random masking:
['the', 'quick', 'brown', '[MASK]', 'jumps', 'over', 'the', 'lazy', 'dog']
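The masking step itself is not shown above. A minimal sketch, assuming a 15% mask rate with positions chosen uniformly at random (mask_percentage, num_to_mask, and mask_indices are illustrative names, not taken from the original code):
import random

mask_percentage = 0.15   # assumed rate; the original only says "a percentage"
num_to_mask = max(1, int(len(tokens) * mask_percentage))        # mask at least one token
mask_indices = random.sample(range(len(tokens)), num_to_mask)   # positions to replace
tokens = ['[MASK]' if i in mask_indices else tok for i, tok in enumerate(tokens)]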
feature_creation_matrix via feature_creation_matrix = torch.randn((len(tokens), 2)):
feature_creation_matrix:
tensor([[-0.0312, 0.8866],
[-0.6970, 0.6740],
[-0.1193, -0.5010],
[ 0.2698, -0.8354],
[ 0.2350, 0.1044],
[-1.0366, -0.9134],
[ 0.6098, -0.7495],
[-0.0932, 0.5929],
[ 0.5193, -0.4662]])
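Here the matrix is random noise for illustration; in a real model the “learned feature creation matrix” mentioned above would typically come from an embedding layer. A hedged sketch of that variant (the vocab mapping, token_ids, and the nn.Embedding usage are assumptions, not part of the original code):
import torch
import torch.nn as nn

vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}    # token -> id
token_ids = torch.tensor([vocab[tok] for tok in tokens])               # shape: (len(tokens),)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=2)   # trained with the model
feature_creation_matrix = embedding(token_ids).detach()                # shape: (len(tokens), 2)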
Initialize an empty masked word activity matrix via masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens))):
masked_word_activity_matrix:
tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Iterate over each token i and generate features for all pairs involving that token (pair_vectors):
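The loop header itself does not appear in the original, so assuming the straightforward form, the steps below run once per position i (a consolidated version follows the final matrix):
for i in range(len(tokens)):   # assumed loop header
    # 1) look up token i's feature vector
    # 2) score it against every token -> pair_vectors
    # 3) mask every position except i with -inf
    # 4) softmax -> attention_scores, written into row i of the matrix
    ...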
token_vector = feature_creation_matrix[i]
# output: token_vector= tensor([-1.5081, -1.8666])
pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
# output: pair_vectors= tensor([-6.7708, 0.6501, -0.4261, 2.4651, 2.2849, 1.5554, -1.5681, 2.8133, 3.6422])
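The matmul/unsqueeze/squeeze combination above is just a matrix-vector product: each entry of pair_vectors is the dot product of one row of feature_creation_matrix with token_vector. An equivalent, shorter form (this rewrite is a suggestion, not part of the original code):
pair_vectors = feature_creation_matrix @ token_vector   # shape: (len(tokens),)
# or: pair_vectors = torch.mv(feature_creation_matrix, token_vector)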
Set every entry before and after position i to -infinity so those positions are not considered during self-attention:
pair_vectors[:i] = float('-inf')
pair_vectors[i+1:] = float('-inf')
# output: pair_vectors= tensor([ -inf, -inf, -inf, -inf, -inf, -inf, -inf, 2.8133, -inf])
attention_scores = torch.softmax(pair_vectors, dim=0)
# the softmax of the masked vector above is one-hot at position 7:
# attention_scores= tensor([0., 0., 0., 0., 0., 0., 0., 1., 0.])
Because exp(-inf) = 0, the softmax leaves all of its weight on the one finite entry, position i. The same steps repeat for every position: compute pair_vectors, set every entry before and after position i to -infinity so those positions are not considered during self-attention, then take the softmax. Outputs captured from other iterations of the loop:
# unmasked pair_vectors on one iteration:
# tensor([ 0.7870, 0.6193, -0.4404, -0.7490, 0.0852, -0.7774, -0.6836, 0.5286, -0.4295])
# masked pair_vectors on the iteration with i = 5:
# tensor([ -inf, -inf, -inf, -inf, -inf, 1.2442, -inf, -inf, -inf])
# masked pair_vectors on the iteration with i = 6:
# tensor([ -inf, -inf, -inf, -inf, -inf, -inf, 2.1547, -inf, -inf])
Line by line, the attention scores fill in masked_word_activity_matrix.
The end result (since each row's softmax puts all of its weight on position i, the matrix comes out as the identity):
masked_word_activity_matrix[i] = attention_scores
print(f'masked_word_activity_matrix= {masked_word_activity_matrix}')
#output: masked_word_activity_matrix= tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.]])
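Putting the pieces together, a minimal end-to-end sketch of the walkthrough above (the 15% mask rate and the explicit loop header are assumptions; everything else mirrors the snippets shown):
import random
import torch

input_sequence = "the quick brown fox jumps over the lazy dog"
tokens = input_sequence.split()

# randomly replace a percentage of tokens with [MASK] (rate assumed to be 15%)
mask_percentage = 0.15
num_to_mask = max(1, int(len(tokens) * mask_percentage))
mask_indices = random.sample(range(len(tokens)), num_to_mask)
tokens = ['[MASK]' if i in mask_indices else tok for i, tok in enumerate(tokens)]

# random stand-in for the learned feature creation matrix
feature_creation_matrix = torch.randn((len(tokens), 2))

# one row of attention scores per token
masked_word_activity_matrix = torch.zeros((len(tokens), len(tokens)))

for i in range(len(tokens)):
    token_vector = feature_creation_matrix[i]
    # score token i against every token in the sequence
    pair_vectors = torch.matmul(feature_creation_matrix, token_vector.unsqueeze(1)).squeeze()
    # mask everything before and after position i
    pair_vectors[:i] = float('-inf')
    pair_vectors[i + 1:] = float('-inf')
    # softmax over the masked scores; only position i is finite, so row i is one-hot
    attention_scores = torch.softmax(pair_vectors, dim=0)
    masked_word_activity_matrix[i] = attention_scores

print(masked_word_activity_matrix)   # prints the 9x9 identity matrix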