July 28, 2023 5:00 PM

In the previous section, we looked at how words are converted into numbers so that computers can handle them. We focused on Word2vec, which creates vectors that capture the meaning of words well enough that they can even be added and subtracted.

In this section, we will continue looking at numerical representations of words.

**- One-hot Vector to Represent Words**

Word2vec was a breakthrough, but computers were of course able to handle natural language before it appeared.

Even before Word2vec, it was common practice to replace "tokens" (letters, words, etc.) with numbers (indexes) and then turn those indexes into vectors. For example, "dog" is converted to 1, "cat" to 2, "kitten" to 3, "kotatsu (「こたつ」 in Japanese)" to 4, "mikan (mikan orange)" to 5, "apple" to 6, and so on. As shown in the figure below, "kitten," with index 3, can be represented by a vector whose third element is 1 and whose other elements are all 0. Such a vector is called a "one-hot vector."
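The index-to-vector step can be sketched in a few lines of Python. This is a minimal illustration using the six-word vocabulary from the example; the function name `one_hot` is our own and not taken from any library.

```python
# Toy vocabulary from the example; indexes start at 1 as in the text.
vocab = {"dog": 1, "cat": 2, "kitten": 3, "kotatsu": 4, "mikan": 5, "apple": 6}

def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1  # index 3 -> third element (list position 2)
    return vec

print(one_hot("kitten", vocab))  # [0, 0, 1, 0, 0, 0]
```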

While this method is intuitive and straightforward, it has some well-known drawbacks. First, it cannot express relationships between words. In this example, the distances between "dog" and "cat," between "cat" and "kitten," and between "cat" and "kotatsu" are all √2, because any two distinct one-hot vectors differ in exactly two positions. We would like "cat" and "kitten" to be closer to each other than the other pairs, but one-hot vectors cannot express this: the distance is the same for every pair of different words.
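The √2 claim is easy to check numerically. The sketch below rebuilds the one-hot vectors for the first four words of the example and compares their distances; the helper names are ours.

```python
import math

def one_hot(index, size):
    """One-hot vector for a 1-based index."""
    vec = [0.0] * size
    vec[index - 1] = 1.0
    return vec

def euclidean(u, v):
    """Ordinary straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog, cat, kitten, kotatsu = (one_hot(i, 6) for i in (1, 2, 3, 4))

pairs = {
    "dog-cat": (dog, cat),
    "cat-kitten": (cat, kitten),
    "cat-kotatsu": (cat, kotatsu),
}
for name, (u, v) in pairs.items():
    # Every pair prints the same value, sqrt(2), rounded to 4 places.
    print(name, round(euclidean(u, v), 4))
```

Every pair differs in exactly two positions, so the sum of squared differences is always 2, whatever the two words mean.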

Another problem is that the number of vector dimensions grows in proportion to the number of words (the vocabulary) to be handled. If you are dealing with only two words, "dog" and "cat," a two-dimensional vector will suffice, but if you are dealing with 10,000 words, you will need a vector with 10,000 dimensions. If we try to represent every word in a dictionary this way, each word becomes a very long vector in which all elements but one are zero, which is inefficient.

Today, one-hot vectors are only occasionally processed as is; more often they serve as an intermediate step for turning words into numbers. In short, the text is first converted to one-hot vectors and then to a representation such as Word2vec. This process is called "embedding." In NLP with deep learning, the embedding is sometimes learned while the neural network is trained, and sometimes pre-trained word vectors such as Word2vec are used instead.
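One way to picture the embedding step: multiplying a one-hot vector by a matrix of word vectors simply selects one row of that matrix. The sketch below assumes a hypothetical 6-word, 3-dimensional embedding matrix whose values are made up for illustration; in practice they would be learned or loaded from a pre-trained model.

```python
# Hypothetical embedding matrix: one row per vocabulary entry, 3 dimensions each.
# The numbers are invented for illustration; real values would be learned.
EMBEDDINGS = [
    [0.9, 0.1, 0.0],  # 1: dog
    [0.8, 0.3, 0.1],  # 2: cat
    [0.7, 0.4, 0.1],  # 3: kitten
    [0.0, 0.2, 0.9],  # 4: kotatsu
    [0.1, 0.9, 0.3],  # 5: mikan
    [0.1, 0.8, 0.2],  # 6: apple
]

def one_hot(index, size):
    """One-hot vector for a 1-based index."""
    vec = [0] * size
    vec[index - 1] = 1
    return vec

def embed_by_matmul(one_hot_vec, matrix):
    """Vector-matrix product of a one-hot row vector and the embedding matrix."""
    dims = len(matrix[0])
    return [sum(h * row[d] for h, row in zip(one_hot_vec, matrix)) for d in range(dims)]

def embed_by_lookup(index, matrix):
    """Same result without the multiplication: just pick the row."""
    return matrix[index - 1]

kitten = 3
print(embed_by_matmul(one_hot(kitten, len(EMBEDDINGS)), EMBEDDINGS))  # [0.7, 0.4, 0.1]
print(embed_by_lookup(kitten, EMBEDDINGS))                            # [0.7, 0.4, 0.1]
```

Because the two functions agree, an embedding can be implemented as a table lookup by index, and the long one-hot vector never has to be materialized.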

To address the vocabulary-size problem, "subwords" are widely used. In this approach, the size of the vocabulary is fixed in advance, and the frequency with which character sequences occur in the data set is then used to decide where to split text into tokens. The method is efficient because all processing stays within a vocabulary of predetermined size, and it is flexible enough to handle unknown words that are not in the vocabulary, since they can be broken into smaller known pieces. We will explain subwords in more detail when we discuss "tokenization," the process of splitting sentences into the tokens to be processed. Large Language Models (LLMs) have a fixed limit on the number of tokens they can handle, and usage fees are sometimes based on the number of tokens. Knowing more about tokens should make the background of such limits and fees easier to picture.
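As a rough illustration of the idea, here is a toy sketch in the style of byte-pair encoding (BPE), one well-known subword scheme. The text above does not say which method it has in mind, so treat this choice, the tiny word-count corpus, and all function names as our assumptions. The vocabulary budget is set by `num_merges`: the final vocabulary is the initial characters plus one new subword per merge.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} corpus."""
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def tokenize(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Made-up corpus: word -> number of occurrences.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_merges(word_counts, num_merges=4)

# "lowest" never appears in the corpus, yet it splits into known pieces.
print(tokenize("lowest", merges))  # ['low', 'est']
```

An unseen word is never rejected: in the worst case it falls back to single characters, which is what makes a fixed-size vocabulary workable.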

**- Word Vectors in "Poincaré Space"**

In the one-hot example above, I wrote that the distances between "dog" and "cat," "cat" and "kitten," and "cat" and "kotatsu" are all √2. The distance used there is what mathematics calls "Euclidean distance."

Geometry, systematized by the ancient mathematician Euclid in his book "Elements" (considered one of the world's best-sellers), has long been treated as the foundation of mathematics. Starting from intuitively acceptable "axioms," various theorems have been stated and proved, for example, "the sum of the interior angles of a triangle is 180 degrees."

However, it is now known that some things cannot be adequately described by such "intuitively acceptable" axioms. You may have heard the theory of relativity or black holes explained as "space is distorted by the mass of an object." Such "distorted space" cannot be described well by Euclidean geometry. "Non-Euclidean geometry" deals with these non-intuitive spaces, and in it "distance" is defined differently.

This raises a question: "In dealing with natural language, might it be better to define a different distance instead of the distance of Euclidean geometry?" A typical example is a study called "Poincaré embedding," which showed a significant gain in efficiency compared with word vectors based on Euclidean distance [1]. With Poincaré embeddings, five dimensions were enough to obtain a representation as good as or better than the 200 dimensions used by conventional word vectors.

How was this possible? This is due to the hierarchy and distribution of the vocabulary of natural languages.

For example, "Golden Retriever" and "Shiba Inu (柴犬)" are subdivided concepts of "dog". On the other hand, "mammal" implies something broader than "dog," and "animal" is an even larger concept. Thus, words do not all exist on the same layer but have a hierarchical structure.

It has been pointed out that such a hierarchy is easier to express in Poincaré space, a model of non-Euclidean (hyperbolic) geometry, than in Euclidean space.
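To get a feel for why, the sketch below computes the distance between two points inside the unit ball using the formula given in [1]: d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The three sample points are our own hypothetical choices, not data from the paper: a "general" concept placed at the center and two "specific" concepts placed near the edge.

```python
import math

def euclidean(u, v):
    """Ordinary straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (formula from [1])."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

# Hypothetical placement: a general concept at the center, specific ones near the edge.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

print("Euclidean root-leaf:", euclidean(root, leaf_a))
print("Euclidean leaf-leaf:", euclidean(leaf_a, leaf_b))
print("Poincare  root-leaf:", poincare_distance(root, leaf_a))
print("Poincare  leaf-leaf:", poincare_distance(leaf_a, leaf_b))
```

In this sketch the Poincaré distance between the two leaves comes out close to the length of the path that goes through the root, which is how distances behave in a tree. Distances also stretch rapidly near the boundary, so there is room for many mutually distant specific concepts even in a few dimensions.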

In this section, continuing from the previous one, we have looked at how words are represented as numbers. What we wanted to convey is that one of the difficulties of natural language processing is "how to handle words with a computer in the first place." At the same time, this is what gives natural language processing research its depth and interest.

In the next article, I would like to talk about the Transformer, the network architecture on which today's LLMs are based.

References

[1] Maximilian Nickel and Douwe Kiela. "Poincaré embeddings for learning hierarchical representations." In Proceedings of NIPS, 2017.