Transformers are a type of machine learning model that specializes in processing and interpreting sequential data, which makes them well suited to natural language processing tasks. To better understand what a machine learning transformer is and how it operates, let’s take a closer look at transformer models and the mechanisms that drive them.
This article will cover:
- Sequence-to-Sequence Models
- The Transformer Neural Network Architecture
- The Attention Mechanism
- Differences Between Transformers and RNNs/LSTMs
Sequence-to-Sequence Models
Sequence-to-sequence models are a type of NLP model used to convert a sequence of one type into a sequence of another type, for example a sentence in one language into its translation in another. Sequence-to-sequence models can be built from various types of networks, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
Traditional sequence-to-sequence models like RNNs and LSTMs are not the focus of this article, but an understanding of them is necessary to appreciate how transformer models operate and why they handle long sequences better than traditional sequence-to-sequence models.
In brief, RNN models and LSTM models consist of encoder and decoder networks that process the input data one time step at a time. The encoder network is responsible for forming an encoded representation of the words in the input data. At every time step the encoder takes the current input word and the hidden state from the previous time step in the series. The hidden state values are updated as the data proceeds through the network, until the last time step, where a “context vector” is generated. The context vector is then passed to the decoder network, which generates the target sequence by predicting the most likely output word at each time step.
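The encoder loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming a plain tanh RNN cell; the function name `rnn_encode`, the weight names, and the random stand-in word vectors are hypothetical and not taken from any particular library.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over `inputs` and return every hidden state.

    inputs: array of shape (seq_len, input_dim), one word vector per time step.
    Returns an array of shape (seq_len, hidden_dim).
    """
    hidden = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:  # one time step per input word, strictly in order
        # The new hidden state depends on the current word and the previous state.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 4, 3
inputs = rng.normal(size=(seq_len, input_dim))          # stand-in word vectors
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # untrained, random weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

encoder_states = rnn_encode(inputs, W_xh, W_hh, b_h)
context_vector = encoder_states[-1]  # only the final state is handed to the decoder
print(context_vector.shape)  # (3,)
```

Note that the whole input, however long, ends up compressed into a single vector of size `hidden_dim`, which is the limitation that attention is meant to address.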
These models can be augmented through the use of an “attention mechanism”. An attention mechanism defines which portions of the input sequence the network should focus on to generate the proper output. To put that another way, an attention mechanism lets the model process one input word while also attending to the relevant information contained in the other input words. Words that don’t contain relevant information receive low attention weights, so they contribute little to the result.
The Transformer Neural Network Architecture
We will go into the attention mechanism in more detail later on, but for now let’s take a look at the architecture of a transformer neural network at a higher level.
In general, a transformer neural network looks something like the following:
While this general structure may change between networks, the core pieces remain the same: positional encodings, word vectors, an attention mechanism, and a feed-forward neural network.
Positional Encodings and Word Vectors
A transformer neural network operates by taking a sequence of inputs and converting it into two other sequences: a sequence of word vector embeddings and a sequence of positional encodings.
Word vector embeddings are just the text represented in a numerical format that the neural network can process. Meanwhile, the positional encodings are vectorized representations containing information about the position of the current word in the input sentence, in relation to other words.
Other text-based neural network models like RNNs and LSTMs also use vectors to represent the words in input data. These vector embeddings map each word to a fixed vector, but this is limiting because words can be used at different positions and in different contexts. A transformer network addresses this problem by making word values more flexible, using positional encodings computed with sinusoidal functions so that the word vectors take on different values depending on the position of the word in the sentence.
This allows the neural network model to preserve information regarding the relative position of the input words, even after the vectors move through the layers of the transformer network.
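As a rough illustration of how sinusoidal position information can be combined with word vectors, the sketch below uses the sine/cosine formulation from the original Transformer paper. That choice is an assumption here, since the article does not give a formula. The function name, the random embedding table, and the token ids are hypothetical, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]      # 0, 2, 4, ...
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embedding_table = rng.normal(size=(vocab_size, d_model))  # stand-in for learned embeddings

token_ids = np.array([3, 7, 3])            # the same word (id 3) at positions 0 and 2
word_vectors = embedding_table[token_ids]  # identical rows for the repeated word
inputs = word_vectors + sinusoidal_positional_encoding(len(token_ids), d_model)

print(np.allclose(word_vectors[0], word_vectors[2]))  # True
print(np.allclose(inputs[0], inputs[2]))              # False
```

The repeated word starts out with the same embedding at both positions, but after the positional encodings are added the two vectors differ, so later layers can tell the positions apart.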
The positional encodings and the word vector embeddings are summed together and then passed into both the encoder and decoder networks. While transformer neural networks use encoder/decoder schemas just like RNNs and LSTMs, one major difference is that all the input data is fed into the network at the same time, whereas in RNNs/LSTMs the data is passed in sequentially.
The encoder networks are responsible for converting the inputs into representations the network can learn from, while the decoder networks do the opposite and convert the encodings into a probability distribution used to generate the most likely words in the output sentence. Crucially, both the encoder and decoder networks have an attention mechanism.
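To make the last step of the decoder concrete, here is a small sketch of turning raw decoder scores into a probability distribution and picking the most likely word. The toy vocabulary and the score values are made up for illustration, and greedy selection with `argmax` is just one possible decoding strategy.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

vocab = ["the", "squirrel", "ran", "away"]   # hypothetical toy vocabulary
logits = np.array([0.5, 2.0, 1.0, -1.0])     # hypothetical decoder scores, one per word

probs = softmax(logits)
next_word = vocab[int(np.argmax(probs))]     # greedy choice: the most likely word
print(next_word)  # squirrel
```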
Because GPUs are capable of parallel processing, multiple attention mechanisms are used in parallel, calculating the relevant information for all the input words. This ability to pay attention to multiple words at a time, dubbed “multi-head” attention, helps the neural network learn the context of a word within a sentence, and it’s one of the primary advantages that transformer networks have over RNNs and LSTMs.
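A minimal sketch of multi-head attention is shown below, assuming the common scaled dot-product formulation with separate query, key, and value projections. The function and weight names are hypothetical, the weights are random rather than trained, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model). Returns (output, attention_weights)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    # Every head scores every word against every other word in one batched matmul.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    # Concatenate the heads back into one vector per word, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))  # stand-in for embedded input words
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(output.shape, weights.shape)  # (4, 8) (2, 4, 4)
```

All heads and all positions are computed with a handful of matrix multiplications and no loop over time steps, which is what makes this computation a good fit for parallel hardware.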
The Attention Mechanism
The attention mechanism is the most important part of a transformer network. It is what enables transformer models to go beyond the limited context of a typical RNN or LSTM model. Traditional sequence-to-sequence models discard all of the intermediate encoder states and use only the final state, the context vector, when initializing the decoder network to generate predictions from an input sequence.
Discarding everything but the final context vector works okay when the input sequences are fairly short. Yet as the length of an input sequence increases, the model’s performance degrades. This is because it becomes quite difficult to summarize a long input sequence as a single vector. The solution is to increase the “attention” of the model and use the intermediate encoder states to construct context vectors for the decoder.
The attention mechanism defines how important the other input tokens are when the encoding for any given token is created. For example, “it” is a general pronoun, often used to refer to animals when their sex isn’t known. In a sentence about a squirrel, an attention mechanism would let a transformer model determine that “it” refers to the squirrel, because the model can examine all the relevant words in the input sentence.
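The squirrel example can be mimicked with a toy self-attention computation. The sketch below assumes scaled dot-product attention and skips the learned query/key/value projections for brevity. The four tokens and their three-dimensional vectors are hand-picked purely for illustration, with the vector for “it” deliberately placed close to the one for “squirrel”. A real model would learn such relationships from data.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (outputs, weights); each row of `weights` sums to 1."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

tokens = ["the", "squirrel", "hid", "it"]
# Hand-picked toy vectors (hypothetical, not learned embeddings).
x = np.array([
    [1.0, 0.0, 0.0],   # the
    [0.0, 2.0, 0.0],   # squirrel
    [0.0, 0.0, 1.0],   # hid
    [0.0, 1.5, 0.2],   # it
])

# Self-attention: queries, keys, and values all come from the same sequence.
outputs, weights = scaled_dot_product_attention(x, x, x)

it_index = tokens.index("it")
most_attended = tokens[int(np.argmax(weights[it_index]))]
print(most_attended)  # squirrel
```

The row of `weights` for “it” is a distribution over all four tokens, and the new representation of “it” in `outputs` is the corresponding weighted mix of their vectors.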
An attention mechanism can be used in three different ways: encoder-to-decoder, encoder-only, and decoder-only.
Encoder-decoder attention lets the decoder consider the input sequence when generating an output, while the encoder-only and decoder-only attention mechanisms let the encoder and decoder consider all parts of the input sequence and of the output sequence generated so far, respectively.
The construction of an attention mechanism can be divided into five steps:
- Computing a score for all encoder states
- Calculating the attention weights
- Computing context vectors
- Updating the context vector with the previous time step output
- Generating output with the decoder
The first step is to have the decoder compute a score for all the encoder states. This is typically done with a small feed-forward neural network, trained together with the rest of the model, that compares the decoder’s previous state with each encoder state. When the decoder generates the first word of the output sequence, it has no internal/hidden state of its own yet, so the encoder’s last state is typically used as the decoder’s previous state.
To calculate the attention weights, a softmax function is applied to the scores, turning them into a probability distribution over the encoder states.
Once the attention weights have been calculated, the context vector needs to be computed. This is done by multiplying the encoder hidden state at every time step by its attention weight and summing the results.
After the context vector is computed, it’s used alongside the word generated in the previous time step to generate the next word in the output sequence. Because the decoder has no previous output to refer to in the first time step, a special “start” token is often used instead.
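The first three steps can be sketched as follows. This is a minimal illustration that assumes additive scoring, where a one-layer feed-forward network scores each encoder state. Other scoring functions, such as a plain dot product, are also used in practice. All names (`additive_attention`, `W_enc`, `W_dec`, `v`) are hypothetical and the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """decoder_state: (hidden,), encoder_states: (seq_len, hidden)."""
    # Step 1: score every encoder state against the decoder's previous state
    # with a small feed-forward network (additive scoring is an assumption).
    scores = np.tanh(encoder_states @ W_enc.T + decoder_state @ W_dec.T) @ v
    # Step 2: softmax turns the scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Step 3: the context vector is the weighted sum of the encoder states.
    context = weights @ encoder_states
    return context, weights

rng = np.random.default_rng(0)
seq_len, hidden_dim, attn_dim = 6, 4, 5
encoder_states = rng.normal(size=(seq_len, hidden_dim))  # stand-in encoder outputs
W_enc = rng.normal(size=(attn_dim, hidden_dim))
W_dec = rng.normal(size=(attn_dim, hidden_dim))
v = rng.normal(size=attn_dim)

# At the first output step the encoder's last state stands in for the decoder state.
decoder_state = encoder_states[-1]
context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
print(weights.round(3), context.shape)
```

In the remaining two steps, the resulting `context` would be combined with the previously generated word (or the “start” token) and fed to the decoder to produce the next word. Unlike the single final context vector of a plain encoder/decoder model, a fresh context vector is computed for every output step.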
Differences Between Transformers and RNNs/LSTMs
Let’s quickly cover some of the differences between transformers and RNNs/LSTMs.
RNNs process inputs sequentially, maintaining a hidden state vector that is altered by each input word as it moves through the network. The hidden states of an RNN typically contain very little relevant information regarding the earlier inputs. New inputs often overwrite the current state, which causes information loss and degrades performance on longer sequences.
In contrast, transformer models process the entire input sequence at once. The attention mechanism allows every output word to be informed by every input and hidden state, making the network more reliable for long pieces of text.
LSTMs are a modified version of RNNs, adjusted to handle longer input sequences. The LSTM architecture uses structures called “gates”: “input gates”, “output gates”, and “forget gates”. The gated design deals with the information loss common to RNN models. Data is still processed sequentially, however, and the architecture’s recurrent design makes LSTM models difficult to train using parallel computing, making the training time longer overall.
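For readers who want to see the gates in action, here is a minimal sketch of a single LSTM time step using the standard gate equations. Packing all four weight blocks into one matrix `W`, along with the function and variable names, is an implementation choice made here for brevity, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W: (4 * hidden_dim, input_dim + hidden_dim), b: (4 * hidden_dim,).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate: how much new information to write
    f = sigmoid(z[1 * n:2 * n])   # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values for the cell state
    c_t = f * c_prev + i * g      # updated cell state
    h_t = o * np.tanh(c_t)        # updated hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4
inputs = rng.normal(size=(seq_len, input_dim))  # stand-in word vectors
W = rng.normal(size=(4 * hidden_dim, input_dim + hidden_dim)) * 0.1
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in inputs:  # each step needs the previous step's h and c
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The loop makes the sequential dependency visible: step t cannot start until step t-1 has finished, which is why this computation is hard to parallelize across time steps.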
Engineers working with LSTMs would frequently add attention mechanisms to the network, which was known to improve the performance of the model. However, it was eventually discovered that the attention mechanism alone, without the recurrent network, was enough to achieve high accuracy. This discovery led to the creation of transformer networks, which rely on attention mechanisms and take advantage of the parallel computing offered by GPUs.