"Transformer" is well-known as the basis of large language models (LLMs). Vaswani et al. proposed this architecture in the paper "Attention Is All You Need" [Reference 1]. Until then, in NLP, "recurrent neural networks (RNNs)" are widely used (Note 1). After the appearance of Transformer, this situation has changed drastically. Now, Transformer is used not only for text but also for other modalities, such as images or music. It is now treated as general architecture, not restricted in NLP.
Starting with this article, we would like to explain the Transformer architecture.
The background of the Transformer
In order to show you what was revolutionary about the Transformer, we would like to give you an overview of the limitations of earlier models.
Since the Transformer first demonstrated its performance on machine translation tasks, we will again use machine translation as an example.
Figure 1 An image of the machine translation task.
In this example, the Japanese sentence "私はゲームがとても好きです。" is being translated into English.
Figure 1 shows an example in which the Japanese sentence "私はゲームがとても好きです。" is being translated into English. The output is generated token by token: "I like" has already been generated, and the next token is now being produced. Using the information from the input sentence, the model computes that the token "games" has a high probability of being generated next, which leads to the correct translation (Note 2).
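The following is a minimal sketch of this token-by-token generation with a greedy choice of the most probable next token. It is not the actual model: next_token_probs is a hypothetical stand-in for a translation model that would compute these probabilities from the encoded input sentence and the tokens generated so far.

```python
def next_token_probs(source, prefix):
    # Toy probability table keyed by the already-generated prefix.
    # A real model would compute these from the input sentence "source".
    table = {
        (): {"I": 0.9, "You": 0.1},
        ("I",): {"like": 0.8, "am": 0.2},
        ("I", "like"): {"games": 0.7, "very": 0.3},
        ("I", "like", "games"): {"very": 0.6, "<eos>": 0.4},
        ("I", "like", "games", "very"): {"much": 0.9, "<eos>": 0.1},
        ("I", "like", "games", "very", "much"): {".": 0.7, "<eos>": 0.3},
        ("I", "like", "games", "very", "much", "."): {"<eos>": 1.0},
    }
    return table[tuple(prefix)]

def greedy_translate(source, max_len=10):
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(source, prefix)
        token = max(probs, key=probs.get)  # pick the highest-probability token
        if token == "<eos>":               # stop at the end-of-sentence token
            break
        prefix.append(token)
    return prefix

print(greedy_translate("私はゲームがとても好きです。"))
# ['I', 'like', 'games', 'very', 'much', '.']
```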
Next, let us look at how the information in the input sentence is handled. In an RNN, as the word "recurrent" in its name suggests, the processing is carried out recurrently.
Figure 2 The recurrent processing of the input text
This process takes the "state" of the RNN after the processing up to the immediately preceding token has finished, and uses it to perform the next step. For example, to process "とても", the processing up to "が" must already be complete; only after "とても" has been processed can "好き" be processed. This is repeated for all tokens, from "私" to "。", to obtain the final "state" of the RNN. Such a "state" is called a "hidden state", because, unlike the input and output sentences, it is handled inside the model and is hidden from the outside.
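As a minimal sketch of this recurrent update, consider a plain ("vanilla") RNN cell. The dimensions, weights, and token embeddings below are random toy values for illustration, not those of any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # state -> state
W_x = rng.normal(scale=0.1, size=(hidden_size, embed_size))   # token -> state

tokens = ["私", "は", "ゲーム", "が", "とても", "好き", "です", "。"]
embeddings = {t: rng.normal(size=embed_size) for t in tokens}

h = np.zeros(hidden_size)          # initial hidden state
for t in tokens:                   # each step needs the previous state
    h = np.tanh(W_h @ h + W_x @ embeddings[t])

print(h)  # the final hidden state, summarizing the whole input sentence
```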
Such recurrent processing has the advantage that the same procedure can be applied regardless of the length of the text: whether the text is 10 words or 10,000 words long, the mechanism handles it in the same way.
However, because each token can only be processed after the previous token has been processed, it is difficult to parallelize the computation and perform matrix operations at high speed on a GPU, which has become standard practice in deep learning. In addition, the positional relationship between tokens is reflected too strongly: relationships with distant tokens are hard to capture, while nearby tokens have an outsized influence.
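The toy contrast below (with made-up dimensions) illustrates the parallelization point. The recurrent update must walk through the tokens one at a time, so the work cannot be spread across time steps; a position-wise transformation, by contrast, touches every token independently and fits into a single matrix product that a GPU can execute in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 4
X = rng.normal(size=(seq_len, d))      # one embedding vector per token
W = rng.normal(scale=0.1, size=(d, d))

# Sequential: step t cannot start before step t-1 has finished.
h = np.zeros(d)
for x in X:
    h = np.tanh(W @ h + x)

# Parallelizable: all tokens are transformed at once in a single matmul.
Y = np.tanh(X @ W.T)

print(h.shape, Y.shape)  # (4,) vs. (8, 4)
```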
The Transformer emerged as an answer to these problems. Strictly speaking, "attention" (in NLP) first appeared as an auxiliary mechanism alongside recurrent processing, and this attention, which had initially played only a supporting role, later became the main player and led to the Transformer; we will explain this in the next article.
In this article, we have outlined the background of the Transformer. In the next article, we would like to discuss self-attention and the various innovations that make it practical (multi-head attention, masking on the decoder side, and so on).
(Note 1) RNNs have variants such as "Long Short-Term Memory (LSTM)" [Reference 2] and the "Gated Recurrent Unit (GRU)" [Reference 3], but they are not described here in detail.
(Note 2) To convey the idea, the diagram here looks as if the information of the input sentence were added to the arrow extending from "I like" to the next word. In practice, however, it is common to illustrate the output of the "Encoder", which summarizes the information of the input sentence, as being placed at the start of the "Decoder", which produces the final output. The paper that proposed sequence-to-sequence learning [Ref. 4] gives a formula in which the encoded information is referenced at every step of the output, as shown below, and states that the hidden state of the decoder is initialized with the representation given by the last hidden state of the encoder. The image below was created based on the formula in that paper.
Figure 3 The equation of the sequence-to-sequence model
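For reference, the equation shown in Figure 3 is our transcription of the formula in [Ref. 4], where x_1, ..., x_T is the input sequence, y_1, ..., y_{T'} is the output sequence, and v is the fixed-dimensional representation of the input produced by the encoder (its last hidden state):

```latex
p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T)
  = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})
```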
References
[1]
Paper Title: "Attention is all you need"
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin..
Conferences or Journals: In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.
[2]
Paper Title: "Long Short-Term Memory"
Authors: Sepp Hochreiter, Jürgen Schmidhuber.
Conferences or Journals: Neural Computation, 1997.
[3]
Paper Title: "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling"
Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio.
Conferences or Journals: Deep Learning and Representation Learning Workshop (NeurIPS 2014).
[4]
Paper Title: "Sequence to Sequence Learning with Neural Networks"
Authors: Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
Conferences or Journals: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.