- Quantification of Words
Computers are often said to handle everything in terms of 0s and 1s: they express and process information using "0 or 1" as the smallest unit, the bit. I mention this here because it is a significant barrier when we want computers to handle words.
In this section, I would like to discuss the quantification of "words," which is necessary for computers to handle them, in other words, to perform natural language processing.
- To Handle "Words" with AI
In Section 2, I reviewed the history of natural language processing (NLP) using deep learning (DL) over the past decade. Word2vec, which emerged in 2013, is one of the most important research projects.
Word2vec, as its name (Word To Vector) suggests, is a method for converting words into vectors. For example, as shown in the image below, subtracting the vector of the word France from the vector of the word Paris and adding the vector of the word Italy yields a vector close to that of the word Rome (this example can be found in reference [1]).
This vector addition and subtraction can be interpreted as follows: the word Paris carries the meaning "capital of France," and subtracting the meaning "France" from it leaves the meaning "capital." Adding "Italy" to that remainder gives us the meaning "capital of Italy," which is close to the information that the word "Rome" carries.
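The arithmetic above can be sketched with toy vectors. Note that these three-dimensional vectors are invented for illustration (real Word2vec vectors have hundreds of dimensions learned from data), and the nearest-neighbor search uses cosine similarity, the measure typically used with word vectors.

```python
import math

# Hypothetical hand-made word vectors; the components are chosen so that
# the first dimension loosely means "capital city", the second "France",
# and the third "Italy". Real learned vectors are not interpretable this way.
vectors = {
    "paris":  [0.9, 0.8, 0.1],
    "france": [0.1, 0.9, 0.0],
    "italy":  [0.1, 0.0, 0.9],
    "rome":   [0.9, 0.0, 0.8],
    "banana": [0.0, 0.1, 0.1],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Paris - France + Italy
query = add(sub(vectors["paris"], vectors["france"]), vectors["italy"])

# Nearest remaining word by cosine similarity
nearest = max((w for w in vectors if w not in ("paris", "france", "italy")),
              key=lambda w: cosine(query, vectors[w]))
print(nearest)  # with these toy vectors: rome
```

With honest learned vectors the same query lands near "Rome" among tens of thousands of candidates, which is what made the result striking.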
You may be thinking, "What's so great about that?" However, at the time, the ability to convert words into vectors that could be added and subtracted was seen as a breakthrough.
As I wrote at the beginning of this section, computers handle things with "0 and 1." In other words, things that are easier to quantify are more accessible for computers to handle, while things that are harder to quantify are more challenging for computers to handle.
At first glance, this may seem to contradict the term "natural language"; however, language is a kind of artificial convention that does not originally exist in nature. For example, it is only by convention that the symbol "banana" refers to that yellow fruit. In explaining the existence of such rules, the terms "signifier" and "signified" proposed by Ferdinand de Saussure, the "father of modern linguistics," are often cited.
So what should we do to convert such "words" into numerical values? For example, given the symbol "banana," what numbers could describe it? The color of the object we call a "banana" can be expressed in terms of RGB (the three primary colors of light). However, describing the concept of a banana in numerical values is not easy.
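To see why this is hard, consider the most naive quantification: assign each word a one-hot vector. The tiny vocabulary below is hypothetical, but it shows the problem: every pair of distinct words is equally dissimilar, so the numbers carry no meaning at all.

```python
# Naive quantification: one-hot vectors over a tiny hypothetical vocabulary.
vocab = ["banana", "apple", "car"]
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Every distinct pair has dot product 0: "banana" is no closer to
# "apple" than it is to "car". One-hot codes identify words but
# encode nothing about what they mean.
print(dot(one_hot["banana"], one_hot["apple"]))  # 0
print(dot(one_hot["banana"], one_hot["car"]))    # 0
```

Word2vec's contribution was precisely to replace these meaningless codes with dense vectors in which geometric closeness reflects closeness of meaning.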
It can be said that Word2vec was a breakthrough in this respect.
- The Path Opened by Word2vec
So far, I have referred to it collectively as Word2vec, but the name actually covers two types of architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. The two are very similar in that both learn from the "current word" (the t-th word) in a sentence and the words surrounding it (in the figure below, the two words before and after the current word). CBOW learns to predict the current word from the surrounding words; Skip-gram learns to predict the surrounding words from the current word. Both are simple networks that can learn word vectors reflecting the relationships between words.
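The difference between the two architectures is easiest to see in the training pairs they extract from a sentence. The sketch below (with an example sentence of my own and a window of two words on each side, matching the description above) builds both kinds of pairs; the actual models then train a small network on these pairs.

```python
# Training pairs derived from one sentence, window = 2 words on each side.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

cbow_pairs, skipgram_pairs = [], []
for t, current in enumerate(sentence):
    # Context = up to `window` words on each side of position t.
    context = [sentence[i]
               for i in range(max(0, t - window),
                              min(len(sentence), t + window + 1))
               if i != t]
    # CBOW: predict the current (t-th) word from its context words.
    cbow_pairs.append((context, current))
    # Skip-gram: predict each context word from the current word.
    skipgram_pairs.extend((current, c) for c in context)

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Because the pairs are so cheap to generate and the networks are shallow, Word2vec could be trained on corpora of billions of words, which is a large part of why it worked so well.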
Word2vec showed that word vectors can be learned effectively with the approaches above, and this pioneering work accelerated research into the meaning of words.
Subsequently, word vectors called GloVe [2] were published and used in many studies. fastText [3], which introduced the concept of "subwords," also had a significant impact on subsequent research.
Furthermore, the study of context-aware word vectors, which asks what meaning a word has in the particular sentence it appears in, began to attract attention with the advent of ELMo [4]. This trend strongly influenced the emergence of BERT, a representative "pre-trained model," and led to today's large-scale language models.
It is also a natural next step: if we have vectors representing word meanings, we can try to create vectors representing the meaning of a sentence, or of an entire document. What we use for natural language understanding in "SQUARE ENIX AI Tech Preview: THE PORTOPIA SERIAL MURDER CASE" is a "sentence-level vector" that emerged from this stream of research.
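One of the simplest ways to build a sentence vector out of word vectors is to average them. To be clear, this is only a common baseline, not necessarily the method used in the title above, and the three-dimensional word vectors below are invented for illustration; modern sentence encoders learn sentence vectors directly rather than averaging.

```python
# Baseline sentence vector: the element-wise mean of the word vectors.
# (Toy hand-made vectors; real ones are learned and much higher-dimensional.)
word_vectors = {
    "the": [0.1, 0.0, 0.2],
    "cat": [0.8, 0.3, 0.1],
    "sat": [0.2, 0.7, 0.4],
}

def sentence_vector(words, vecs):
    dims = len(next(iter(vecs.values())))
    total = [0.0] * dims
    for w in words:
        for d in range(dims):
            total[d] += vecs[w][d]
    return [x / len(words) for x in total]

v = sentence_vector(["the", "cat", "sat"], word_vectors)
print(v)  # element-wise mean of the three word vectors
```

Averaging ignores word order ("dog bites man" and "man bites dog" get the same vector), which is one reason research moved on to learned, context-aware sentence encoders.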
This section discussed handling words as vectors, focusing on Word2vec. There is still much more to say about the quantification of words, but this section has already grown long, so I will stop here and continue discussing such vectors in the next issue.
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." In Proceedings of Workshop at ICLR, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." In Proceedings of EMNLP, 2014.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. "Enriching Word Vectors with Subword Information." TACL, 2017.
[4] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. "Deep Contextualized Word Representations." In Proceedings of NAACL-HLT, 2018.