q: what are some options for positional encoding of word embeddings?

a: Positional encoding is a critical component of transformer models: it injects information about the relative or absolute position of tokens in a sequence. Since the transformer architecture has no inherent notion of sequence order (unlike recurrent or convolutional neural networks), positional encodings are added to the input embeddings to provide this information.

Here are some common approaches for positional encoding of word embeddings:

1. Sinusoidal Positional Encoding (Original Transformer)

This method was introduced in the original Transformer paper (Vaswani et al., 2017). It generates positional encodings using sine and cosine functions of different frequencies. The idea is that each dimension of the positional encoding corresponds to a different frequency, allowing the model to generalize to sequences longer than those seen during training.

The formula for the sinusoidal positional encoding is:

  • For a position pos and embedding dimension pair index i, the positional encoding is defined as:
    • \( \text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d}) \)
    • \( \text{PE}(pos, 2i+1) = \cos(pos / 10000^{2i/d}) \)

Where:

  • pos is the position (integer index) of the token in the sequence.
  • i indexes the dimension pair, so dimensions 2i and 2i+1 share one frequency.
  • d is the dimensionality of the embeddings.

This encoding creates a smooth, continuous representation of positions that allows the transformer model to capture both local and global dependencies in a sequence.

Here’s an example of how to implement it:

import torch
import math

def get_positional_encoding(seq_len, embed_size):
    # Fixed sinusoidal positional-encoding matrix of shape [seq_len, embed_size]
    pe = torch.zeros(seq_len, embed_size)
    for pos in range(seq_len):
        for i in range(0, embed_size, 2):
            # Even dimensions use sine, odd dimensions use cosine,
            # each pair sharing the frequency 1 / 10000**(i / embed_size)
            pe[pos, i] = math.sin(pos / 10000 ** (i / embed_size))
            if i + 1 < embed_size:  # guard for odd embedding sizes
                pe[pos, i + 1] = math.cos(pos / 10000 ** (i / embed_size))
    return pe

# Example usage
seq_len = 10      # Sequence length
embed_size = 16   # Embedding size
pos_encoding = get_positional_encoding(seq_len, embed_size)  # [10, 16]
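
The resulting matrix is typically added to the token embeddings before the first transformer layer. Continuing the example above (token_embeddings here is a hypothetical batch of embeddings):

# Hypothetical batch of token embeddings: [batch, seq_len, embed_size]
token_embeddings = torch.randn(2, seq_len, embed_size)

# Broadcast the positional encoding over the batch dimension and add it
x = token_embeddings + pos_encoding.unsqueeze(0)  # [2, 10, 16]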

2. Learned Positional Encoding

Instead of using a fixed sinusoidal encoding, this method allows the model to learn the positional encodings as part of the training process. A matrix of trainable parameters (of shape [max_seq_len, embed_size]) is initialized randomly and then updated through backpropagation during training. This allows the model to learn the most appropriate encoding for the task at hand.

To use learned positional encoding, you can add a learnable embedding layer for positions.

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, embed_size):
        super(LearnedPositionalEncoding, self).__init__()
        self.position_embedding = nn.Embedding(max_len, embed_size)

    def forward(self, x):
        seq_len = x.size(1)  # The sequence length from the input tensor
        positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0)  # Create position indices
        return self.position_embedding(positions)  # [1, seq_len, embed_size], broadcastable over the batch

# Example usage
max_len = 100  # Maximum sequence length
embed_size = 16  # Embedding size
learned_pos_encoding = LearnedPositionalEncoding(max_len, embed_size)
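
In practice, the returned positional embeddings are added to the token embeddings. Continuing the example above (token_embeddings is a hypothetical batch):

# Hypothetical batch of token embeddings: [batch, seq_len, embed_size]
token_embeddings = torch.randn(2, 10, embed_size)

# The module returns a [1, seq_len, embed_size] tensor that broadcasts over the batch
x = token_embeddings + learned_pos_encoding(token_embeddings)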

3. Relative Positional Encoding (Integrated into Attention)

While not a standalone encoding added to the input embeddings, relative positional information can be integrated into the attention mechanism itself. Instead of explicit positional vectors, the attention score between two tokens is adjusted by a learned or precomputed term that depends only on their relative distance (Shaw et al., 2018). A minimal sketch is shown below.

This approach appears in models such as Transformer-XL and T5, and in BERT-style variants like DeBERTa. Because only relative distances matter, the model can handle sequences of varying lengths and learn relationships between words independently of their absolute positions.
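
Here is a minimal, single-head sketch of this idea in the style of Shaw et al.'s relative position representations added to the keys (the class and parameter names are illustrative, not from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    # Single-head self-attention with relative position embeddings added to the keys
    def __init__(self, embed_size, max_rel_dist=16):
        super().__init__()
        self.q_proj = nn.Linear(embed_size, embed_size)
        self.k_proj = nn.Linear(embed_size, embed_size)
        self.v_proj = nn.Linear(embed_size, embed_size)
        self.max_rel_dist = max_rel_dist
        # One embedding per clipped relative distance in [-max_rel_dist, max_rel_dist]
        self.rel_key_emb = nn.Embedding(2 * max_rel_dist + 1, embed_size)

    def forward(self, x):                                   # x: [batch, seq_len, embed_size]
        batch, seq_len, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Matrix of clipped relative distances, shifted to be non-negative indices
        pos = torch.arange(seq_len, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel_k = self.rel_key_emb(rel + self.max_rel_dist)   # [seq_len, seq_len, embed_size]

        # Content-content scores plus content-position scores
        scores = q @ k.transpose(-2, -1)                     # [batch, seq_len, seq_len]
        scores = scores + torch.einsum('bqd,qkd->bqk', q, rel_k)
        attn = F.softmax(scores / d ** 0.5, dim=-1)
        return attn @ v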

4. Rotary Positional Embedding (RoPE)

Rotary Positional Embedding (RoPE, Su et al., 2021) is a newer approach that encodes position directly inside the attention mechanism. Each pair of query and key dimensions is rotated by an angle proportional to the token's position; because the dot product of two rotated vectors depends only on the difference of their rotation angles, the resulting attention scores depend only on the relative offset between tokens. RoPE is used in models such as GPT-NeoX and LLaMA.

Equivalently, queries and keys can be viewed as complex numbers multiplied by a position-dependent complex exponential, which makes positional information inherent to the attention computation rather than something added to the input embeddings.
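
A minimal sketch of how the rotation can be applied to queries and keys (the function name and shapes are illustrative; production implementations usually cache the cos/sin tables and operate per attention head):

import torch

def apply_rope(x, base=10000.0):
    # Apply rotary positional embedding to a [batch, seq_len, dim] tensor of
    # queries or keys (dim is assumed to be even)
    batch, seq_len, dim = x.shape
    # One rotation frequency per pair of dimensions, as in the sinusoidal encoding
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=x.device).float() / dim))
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()                   # [seq_len, dim/2]

    x1, x2 = x[..., 0::2], x[..., 1::2]                     # split into dimension pairs
    # Rotate each (x1, x2) pair by its position-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                              # back to [batch, seq_len, dim]

# Example: rotate queries and keys before computing attention scores
q = apply_rope(torch.randn(2, 10, 16))
k = apply_rope(torch.randn(2, 10, 16))
scores = q @ k.transpose(-2, -1)  # scores now depend only on relative positions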

5. Relative Positional Encoding (Attention Bias)

Relative positional encoding adjusts the attention mechanism so that tokens’ attention is computed based on their relative distances instead of absolute positions. For example, T5 and Transformer-XL use relative positional encodings, where the positional bias is learned and added to the attention scores before applying softmax.

The focus is on the difference between token positions rather than on each absolute position as a unique value; a simplified sketch of such a bias follows.
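
As an illustration, a learned bias table indexed by clipped relative distance can be added to the attention logits (a simplified sketch; T5's actual implementation additionally buckets larger distances logarithmically):

import torch
import torch.nn as nn

class RelativeAttentionBias(nn.Module):
    # Learned bias added to attention logits, indexed by clipped relative distance
    def __init__(self, num_heads, max_rel_dist=16):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        self.bias = nn.Embedding(2 * max_rel_dist + 1, num_heads)

    def forward(self, seq_len, device=None):
        pos = torch.arange(seq_len, device=device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        # [seq_len, seq_len, num_heads] -> [1, num_heads, seq_len, seq_len]
        return self.bias(rel + self.max_rel_dist).permute(2, 0, 1).unsqueeze(0)

# Example: add the bias to raw attention scores before the softmax
num_heads, seq_len = 4, 10
scores = torch.randn(2, num_heads, seq_len, seq_len)        # [batch, heads, query, key]
scores = scores + RelativeAttentionBias(num_heads)(seq_len)
attn = torch.softmax(scores, dim=-1)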

6. Combining Absolute and Relative Positional Encoding

Some architectures combine absolute and relative positional information to get the benefits of both: absolute encodings provide global position context, while relative encodings help the model capture local, order-dependent relationships. DeBERTa, for example, uses relative-position-aware attention throughout its layers and additionally injects absolute position information near the output of the network.

7. Fixed Positional Embeddings Using Other Functions

Besides the sinusoidal functions, other types of fixed (non-learned) functions have been explored in research. For example:

  • Polynomial or exponential functions of the position can be used to encode positional information (see the illustrative sketch below).
  • Hand-designed encodings tied to task-specific structure, such as attention window sizes, can also be explored.
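
Purely as an illustration of a non-sinusoidal fixed encoding (not a method from a specific paper), each dimension could hold a different power of the normalized position:

import torch

def polynomial_positional_encoding(seq_len, embed_size):
    # Illustrative fixed encoding: dimension k holds (pos / seq_len) ** (k + 1)
    pos = torch.arange(seq_len).float().unsqueeze(1) / seq_len      # [seq_len, 1], in [0, 1)
    powers = torch.arange(1, embed_size + 1).float().unsqueeze(0)   # [1, embed_size]
    return pos ** powers                                            # [seq_len, embed_size]

pe = polynomial_positional_encoding(10, 16)  # added to token embeddings like the sinusoidal version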

Conclusion

Here is a summary of the positional encoding options:

  1. Sinusoidal Positional Encoding: Fixed encoding based on sine and cosine functions of different frequencies.
  2. Learned Positional Encoding: A trainable embedding matrix for absolute positions.
  3. Relative Positional Encoding: Relative distances between tokens are embedded directly into the attention mechanism.
  4. Rotary Positional Embedding (RoPE): Queries and keys are rotated by position-dependent angles, so attention depends only on relative offsets.
  5. Relative Attention Bias: A learned bias, indexed by relative distance, is added to the attention scores (as in T5 and Transformer-XL).

Each method has advantages depending on the task. Sinusoidal and learned encodings are the simplest and most common in general-purpose transformers, while relative and rotary encodings are increasingly popular in recent architectures, particularly for tasks involving long sequences and long-range dependencies.