
Name Generation

This is an abstractive summarisation model: new words are generated to represent the ideas in the input text, rather than extracting existing words from it. It has been implemented using TensorFlow and Keras in Python.

Model Architecture

Sequence2Sequence

A Seq2Seq model takes a sequence of objects (words, letters, time series values, etc.) and outputs another sequence of objects. The ‘black box’ in between is a complex structure of numerous Recurrent Neural Networks (RNNs) that first transforms an input sequence of varying length into a fixed-length vector representation, which is then used to generate an output sequence. The two major components of the Seq2Seq model are the encoder and the decoder.

[Figure: Seq2Seq "black box" between input and output sequences]

Encoder-Decoder

The model relies on an encoder-decoder architecture – a combination of layered RNNs arranged in a way that allows them to encode a word sequence and then pass that encoded sequence to a decoder network to produce an output. The input sequence is first tokenised (transformed from a collection of words into a collection of integers, one per word) and then fed word by word into the encoder. The encoder transforms the sequence into a new, abstracted state, which, after being passed to the decoder, becomes the basis for producing an output sequence. Both the encoder and the decoder of the model are LSTM networks; Long Short-Term Memory (LSTM) models are a variant of the recurrent neural network.

[Figure: encoder-decoder architecture]
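For illustration, the following is a minimal Keras sketch of such an encoder-decoder setup; the vocabulary size, embedding size and LSTM dimensions are placeholder values, not the ones used in our model.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Placeholder hyperparameters (illustrative only)
vocab_size = 20000      # size of the tokenised vocabulary
embedding_dim = 128
latent_dim = 256        # dimensionality of the LSTM hidden state

# Encoder: reads the tokenised input sequence and keeps only its final states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(vocab_size, embedding_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]   # the "encoder vector"

# Decoder: generates the output sequence, initialised with the encoder states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_outputs = Dense(vocab_size, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.summary()
```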

The encoder reads the entire input sequence word by word, producing a sequence of encoder hidden states. At each time step, a new token is read and the hidden state is updated with the new information. Upon reaching the end of the input sequence, the encoder puts out a fixed-length representation of the input, regardless of input length, called the encoder vector. The encoder vector is the final hidden state and is used to initialise the decoder. In contrast to the encoder, which takes in input data and condenses it into a final, abstracted state, the decoder is trained to output a new sequence word by word, given the previous word at each time step. It is initialised by receiving the encoder vector as its first hidden state, as well as a “start” token indicating the start point of the output sequence. The true output sequence is unknown to the decoder while decoding; it only knows the last encoder hidden state and the previous word (the “start” token at the first step, or the word from the preceding time step), which it receives at each time step. The decoder is free to generate any word from the vocabulary.
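At inference time, this decoding can be sketched as a simple loop. The sketch below assumes hypothetical encoder_model / decoder_model inference models built from the trained layers above, plus word_to_index / index_to_word mappings from the tokeniser.

```python
import numpy as np

def decode_sequence(input_seq, encoder_model, decoder_model,
                    word_to_index, index_to_word, max_len=30):
    # Encode the input once; the final states become the decoder's first hidden state.
    # (Assumes encoder_model outputs exactly these two state tensors.)
    state_h, state_c = encoder_model.predict(input_seq, verbose=0)

    # Start decoding with the special "start" token.
    target_seq = np.array([[word_to_index["<start>"]]])
    decoded_words = []

    for _ in range(max_len):
        probs, state_h, state_c = decoder_model.predict(
            [target_seq, state_h, state_c], verbose=0)
        next_index = int(np.argmax(probs[0, -1, :]))   # greedy choice over the vocabulary
        next_word = index_to_word[next_index]
        if next_word == "<end>":                       # stop at the "end" token
            break
        decoded_words.append(next_word)
        # The generated word is fed back in as the next decoder input.
        target_seq = np.array([[next_index]])

    return " ".join(decoded_words)
```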

Attention

LSTM units begin to struggle with this task as inputs grow longer. With long input sequences, the final state vector output by the encoder may lose important contextual information from earlier points in the sequence. Through every iteration of the encoder RNN, the hidden state is updated with new information, slowly moving away from the early inputs. One solution to this issue is called attention. Attention helps the encoder-decoder model focus on the relevant sections or words of the input text when predicting the next output token, which mitigates the loss of context from earlier chunks of an input sequence. With attention, instead of a one-shot context vector based on the last (hidden) state of the encoder, the context vector is constructed from all hidden states of the encoder.

When combining the hidden states into the final encoder output (decoder input), each vector (state) gets its own weight, initialised randomly. These weights are re-calculated by the alignment model, another neural network trained in parallel with the decoder, which checks how well the last decoder output fits the different states passed over from the encoder. Depending on the respective fit scores, the alignment model's weights are optimised via backpropagation. Through this dynamic weighting, the importance of the different hidden states varies across input-output instances, allowing the model to pay more attention (weight / importance) to different encoder states depending on the input.

The attention mechanism works as follows:

  1. The encoder reads the input text and produces a hidden state for each time step.
  2. The combined encoder state is the first input to the decoder.
  3. The decoder puts out the first decoder hidden state.
  4. A score (scalar) is obtained by an alignment model, also called a score function (blue). In this model, this is an addition / concatenation of the decoder and encoder hidden states (i.e. tensors).
  5. The final scores are then obtained by applying a softmax layer.
  6. Each encoder hidden state is multiplied by its score/weight.
  7. All weighted encoder hidden states are added together to form the context vector (dark green).
  8. The input to the next decoder step is the concatenation of the word generated at the previous decoder time step (pink) and the context vector from the current time step.
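A minimal sketch of this additive (Bahdanau-style) scoring, written as a standalone Keras layer; the number of units is illustrative and the layer is not lifted from our code.

```python
import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    """Scores each encoder hidden state against the current decoder hidden state
    and returns the weighted context vector plus the attention weights."""

    def __init__(self, units):
        super().__init__()
        self.W_enc = tf.keras.layers.Dense(units)   # projects the encoder states
        self.W_dec = tf.keras.layers.Dense(units)   # projects the decoder state
        self.V = tf.keras.layers.Dense(1)           # produces one scalar score per time step

    def call(self, decoder_state, encoder_outputs):
        # decoder_state:    (batch, dec_units)
        # encoder_outputs:  (batch, src_len, enc_units)
        dec = tf.expand_dims(decoder_state, 1)                        # (batch, 1, dec_units)
        scores = self.V(tf.nn.tanh(self.W_enc(encoder_outputs) +
                                   self.W_dec(dec)))                  # (batch, src_len, 1)
        weights = tf.nn.softmax(scores, axis=1)                       # attention weights
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)    # (batch, enc_units)
        return context, weights
```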

Training data

We used the Gigaword dataset from TensorFlow Datasets to train our model, training on about a million document-summary pairs. The model's accuracy could be improved by training it on more data; however, we did not have the resources to do so. The training data was cleaned as follows (a code sketch of these steps appears after the list):

  • Contraction mapping
  • Removing text inside parenthesis
  • Removing digits (replaced by hashes in Gigaword)
  • Removing unknown words such as names
  • Removing words with only one character
  • Removing 's
  • Only selecting the entries within a certain length threshold according to the graph below
[Figure: entry length graphs]

  • A rare word threshold was chosen such that around 60% of words are removed: 5 for summaries, 7 for headlines
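A hedged sketch of these cleaning steps, assuming the Gigaword split from TensorFlow Datasets and a hypothetical (abbreviated) contraction_mapping; the length threshold is illustrative and the rare-word filtering is omitted.

```python
import re
import tensorflow_datasets as tfds

# Hypothetical contraction map; the real dictionary is much larger.
contraction_mapping = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text, max_len=40):
    text = text.lower()
    # Contraction mapping
    text = " ".join(contraction_mapping.get(w, w) for w in text.split())
    # Remove text inside parentheses
    text = re.sub(r"\([^)]*\)", "", text)
    # Remove digits and the '#' placeholders Gigaword uses for them
    text = re.sub(r"[#\d]+", "", text)
    # Remove 's
    text = re.sub(r"'s\b", "", text)
    # Keep letters and whitespace only, then drop single-character words
    text = re.sub(r"[^a-z\s]", "", text)
    words = [w for w in text.split() if len(w) > 1]
    # Only keep entries within the length threshold
    return " ".join(words) if len(words) <= max_len else None

# Gigaword provides 'document' / 'summary' pairs.
ds = tfds.load("gigaword", split="train[:1%]")
pairs = [(clean_text(ex["document"].numpy().decode("utf-8")),
          clean_text(ex["summary"].numpy().decode("utf-8")))
         for ex in ds]
```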

The training process

Training makes use of early stopping: training halts once the validation loss increases. The validation loss is the sum of the errors over the examples in the validation set and is measured after each epoch.

[Figure: training and validation loss per epoch]

As can be seen in the graph, the validation loss (blue line) increased at epoch 16, stopping the training of the model.
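In Keras this behaviour can be sketched with the EarlyStopping callback; the training arrays and batch settings below are placeholders.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop as soon as the validation loss increases (patience=0)
# and keep the weights from the best epoch.
early_stop = EarlyStopping(monitor="val_loss", patience=0,
                           restore_best_weights=True)

# Hypothetical call; x_train / y_train_in / y_train_out stand in for the
# prepared encoder inputs, decoder inputs and decoder targets.
model.fit([x_train, y_train_in], y_train_out,
          validation_data=([x_val, y_val_in], y_val_out),
          epochs=50, batch_size=128,
          callbacks=[early_stop])
```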

Speech to text

As we are already using Azure for hosting, we decided to use Azure's speech-to-text service. It is also very reliable and includes punctuation.
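A minimal sketch of a transcription call with the azure-cognitiveservices-speech SDK; the subscription key, region and file name are placeholders.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials and input file
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
audio_config = speechsdk.audio.AudioConfig(filename="recording.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# recognize_once transcribes a single utterance; the returned text is punctuated.
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```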

Abstractive summarisation

We are using the pre-trained model 'bart-large-cnn'. It is an encoder-decoder seq2seq model with a bidirectional encoder and an autoregressive decoder. To ensure that the summary is a reasonable length, we set the maximum length of the summary to half of the length of the input text and the minimum length to a tenth of the input text.
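A sketch of such a summarisation call with the Hugging Face transformers pipeline, assuming 'facebook/bart-large-cnn' is the checkpoint in question; the length limits are approximated in words here, whereas the pipeline counts tokens.

```python
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

def summarise(text):
    n_words = len(text.split())
    # Maximum length: half of the input; minimum length: a tenth of the input.
    return summariser(text,
                      max_length=max(1, n_words // 2),
                      min_length=max(1, n_words // 10))[0]["summary_text"]
```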

Semantic search

We are using the pre-trained model 'all-MiniLM-L6-v2'. This sentence-embedding model works by embedding all of the entries in the corpus into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from the corpus are found.
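A minimal sketch of this search with the sentence-transformers library; the corpus and query strings are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus entries
corpus = ["Notes on the encoder-decoder model.",
          "Minutes from the last project meeting.",
          "Azure speech-to-text configuration."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the query into the same vector space and return the closest entries.
query_embedding = model.encode("How is speech transcribed?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```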