10 February, 2024
Easier to Understand: Natural Language Processing
Dive into the world of Natural Language Processing (NLP) with this beginner-friendly guide.
Nowadays, we're no longer surprised by machines that can converse as naturally and fluently as a person. The remarkable progress of machine learning in recent years has opened up a whole new world to explore. However, for those like me who haven't formally studied AI, finding resources to learn from can be quite challenging: some materials are too detailed and hard to follow, while others are too superficial.
So in this blog series, I will recount my journey of stumbling through the Natural Language Processing (NLP) branch - the foundation of modern systems like ChatGPT. I'm not an expert, and my approach to learning isn't by the book, so it will be quite free-form. The articles will try to strike a balance between being too simplistic and diving into overly complex details.
This "Easier to Understand" series will consist of 2 parts:
- Part 1: Natural Language Processing
- Part 2: What is a Transformer?
Why process language?
This seemingly "stupid" question doesn't come out of nowhere. Have you ever wondered why in English we add -ed or -ing to verbs to convey information about time? And in French/Italian/German... oh boy, the verb conjugates differently for each pronoun, and you need to pay attention to feminine or masculine forms before conjugating.
Natural language is very complex, but it's how humans communicate with each other every day. It would be really useful if we could at least find a way to "study" and "dissect" natural language.
Here's an example: to translate a document from English to Korean, you need to know both languages.
But because all languages must share some commonalities, you could write software to "analyze" the text, extract the common meaning, and turn it into "numbers". This software can even reverse the process, turning "numbers" back into text. Applied to the task above, you can automatically convert English ==> numbers ==> Korean. Isn't that useful?
In reality, the process of converting from natural language to numbers is called "encoding", and the reverse process is called "decoding". What we casually called "software" above is actually a "language model", which can be understood as a "virtual brain model".
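As a toy illustration of encoding and decoding (a real language model learns rich vector representations; this fixed lookup table is only a sketch of the idea), we can map words to numbers and back:

```python
# Toy "encoder"/"decoder": map words to numbers and back.
# A real language model learns its representations from data;
# this hand-made vocabulary is purely illustrative.
vocab = ["hello", "world", "how", "are", "you"]
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

def encode(text):
    """Turn a sentence into a list of numbers."""
    return [word_to_id[w] for w in text.split()]

def decode(ids):
    """Turn the numbers back into text."""
    return " ".join(id_to_word[i] for i in ids)

ids = encode("how are you")
print(ids)          # [2, 3, 4]
print(decode(ids))  # how are you
```

A real model replaces this lookup table with learned vectors, which is exactly what the next section is about.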
Word embedding
Before thinking about digitizing an entire sentence or paragraph, we first need to look at the smallest units of language: words and letters.
The idea of word embedding (word vector) is to find a way to turn words (or parts of a word) into vector form, with the purpose of capturing part of the meaning of individual words. One of the most typical examples of word vectors is the ability to "add" words with individual meanings to find a word with a similar meaning.
For example, the word "grandfather" actually refers to a "male" who is "old", so if we add the vectors "male" + "old", we will find a vector close to the word "grandfather".
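The "male" + "old" ≈ "grandfather" idea can be sketched with a few hand-made vectors (real embeddings are learned from data and have far more dimensions; the numbers and words below are made up for illustration):

```python
import numpy as np

# Hand-made 3-dimensional "embeddings". Real embeddings are learned
# and have hundreds of dimensions; here the axes loosely mean
# (maleness, age, something else) just to make the point.
vectors = {
    "male":        np.array([1.0, 0.0, 0.0]),
    "old":         np.array([0.0, 1.0, 0.0]),
    "grandfather": np.array([0.9, 0.9, 0.1]),
    "girl":        np.array([-1.0, -0.8, 0.0]),
}

def cosine(a, b):
    """Similarity between two vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Add" the meanings, then find the closest remaining word.
query = vectors["male"] + vectors["old"]
best = max((w for w in vectors if w not in ("male", "old")),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # grandfather
```

With real learned embeddings (e.g. word2vec), the same nearest-neighbour search produces the famous "king - man + woman ≈ queen" style results.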
Of course, this is a simplified explanation, because in reality the same word often has different meanings in different contexts. To capture these multiple layers of meaning, the vector space is not 2- or 3-dimensional, but typically has hundreds or even thousands of dimensions.
To create a model that can convert ordinary words into vectors, one needs a lot of data as input for training. During this process, the machine analyzes the connections between words within sentences, paragraphs, and whole texts in order to "adjust" the list of vectors.
Token
Converting a complete word into a vector is simple. But here's the problem: natural language is... alive, meaning it changes over time.
Every time a new word appears, the model has to be retrained, which is very inefficient.
So instead of converting an entire word into a vector, why don't we "divide" the word into different parts and then convert those small parts into vectors? For example, in French, "bonjour" is composed of "bon" (meaning good) and "jour" (meaning day). Surprisingly, this idea actually applies to all languages: Vietnamese writing is composed of initial consonant - vowel - final consonant - tone, Chinese characters are composed of "radicals", for example 好 (good) is composed of 女 (woman) and 子 (child),...
In natural language processing, these small components are called "tokens". The process of separating an entire sentence containing different characters (words) into tokens is called tokenization.
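A crude way to see subword tokenization in action is a greedy longest-match splitter (real tokenizers such as the BPE used by ChatGPT learn their vocabulary from data; the tiny vocabulary here is made up):

```python
def tokenize(word, vocab):
    """Greedy longest-match subword tokenization.
    Real tokenizers (e.g. BPE) are smarter; this is only a sketch."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest matching piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown piece: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"bon", "jour", "soir"}
print(tokenize("bonjour", vocab))  # ['bon', 'jour']
print(tokenize("bonsoir", vocab))  # ['bon', 'soir']
```

Note how "bonjour" splits into exactly the "bon" + "jour" pieces from the example above, and how a word outside the vocabulary degrades into single characters, just like "Cộng" does in ChatGPT.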
Here's an example of how ChatGPT tokenizes a text:
However, breaking words into smaller parts also causes a problem: the model will have to process more tokens, requiring more time to process a text. For example, if you input Vietnamese into ChatGPT, some words will be separated into many small tokens (e.g., the word "Cộng" is separated into C-ộ-ng), so generally ChatGPT is quite slow when using Vietnamese:
Convolutional neural network
Now we can convert a paragraph into many tokens, and each token into a vector. The next thing we need to do is find some operation to analyze and process this data set.
One of the simplest ways is to create a convolutional neural network (CNN) and feed all the tokens into it for processing. The CNN is actually one of the first techniques invented for image recognition. To understand how convolutional neural networks work, you can read this blog post I wrote when I was still a "young buffalo" (in 2017).
Translator's note: "Trẻ trâu" (young buffalo) is a Vietnamese slang term often used to describe inexperienced or immature young people. The author is humorously referring to his younger, less experienced self.
The idea here is, instead of inputting each pixel of the image into the CNN, we will replace the pixels with tokens, and the output will be other tokens (e.g., input Vietnamese and the output will be Korean)
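The "sliding over tokens instead of pixels" idea can be sketched as a 1-D convolution over a sequence of token vectors (all numbers below are random; the point is only the shapes and the sliding window):

```python
import numpy as np

# A 1-D convolution over token vectors: a small filter slides over
# the sequence, looking at `width` neighbouring tokens at a time.
# Weights are random here; in a real model they are learned.
rng = np.random.default_rng(0)

seq_len, dim, width, out_dim = 8, 4, 3, 5
tokens = rng.normal(size=(seq_len, dim))         # 8 tokens, 4-dim vectors
filters = rng.normal(size=(width * dim, out_dim))

outputs = []
for i in range(seq_len - width + 1):
    window = tokens[i:i + width].reshape(-1)     # 3 neighbouring tokens, flattened
    outputs.append(window @ filters)             # one output vector per position
outputs = np.array(outputs)

print(outputs.shape)  # (6, 5): one 5-dim output vector per window position
```

Notice that the filter only ever sees `width` tokens at once: to relate tokens that are far apart, you need either bigger windows or many stacked layers, which is exactly the limitation described below.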
Although convolutional neural networks can theoretically solve the problem at hand, in practice they have some major limitations, such as:
- They can only use a limited number of input tokens, and every increase in input length makes the computation significantly more expensive.
- They need a lot of neurons to work as desired, which makes training very inefficient.
In practice, convolutional neural networks work quite well with images, because in the worst case you can crop the image to a fixed size. For example, building a model to recognize handwritten digits (the MNIST dataset) is a classic, easy-to-experiment exercise for newcomers to machine learning.
Recurrent neural network
The next idea is this: instead of inputting 1000 tokens at once, why don't we input one token at a time?
With this idea, our model will output a vector, and this vector will be used to represent the "total" meaning of the tokens you have entered up to that point. In other words, we want to convert an entire paragraph into a vector.
Each time you input a new token, you combine the old vector with the new token to create a new vector. Because the old vector is reused to create the new one, the name of this technique contains the word "recurrent".
For example, the sentence "How are you?", when I input each token "How" "are" "you" "?" into it, what happens is:
- input vector 0 = (no information yet)
- input token: "How" + input vector 0 => vector 1 = (hmm, seems to be asking something)
- input token: "are" + input vector 1 => vector 2 = (hmm, definitely a question)
- input token: "you" + input vector 2 => vector 3 = (ah, a greeting question)
- input token: "?" + input vector 3 => vector 4 = (definitely a greeting question)
Thus, you have encoded the question into a single vector (we only keep vector 4, the final one). The next task is to have a model decode this vector: for example, when the model sees a vector of the "greeting question" type, it responds with "I'm fine, thank you". In fact, we can reuse the structure of the encoder for the decoder; we just need to reverse it.
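The step-by-step encoding above can be sketched as a minimal recurrent loop (weights and token embeddings are random here; in a real model both are learned from data):

```python
import numpy as np

# A minimal recurrent step: each new token is combined with the
# previous hidden vector to produce the next one. All weights are
# random for illustration; a real RNN learns them.
rng = np.random.default_rng(0)
dim = 8

W_h = rng.normal(scale=0.5, size=(dim, dim))  # how the old vector is reused
W_x = rng.normal(scale=0.5, size=(dim, dim))  # how the new token comes in

# Pretend embeddings for the tokens of "How are you ?"
tokens = {w: rng.normal(size=dim) for w in ["How", "are", "you", "?"]}

h = np.zeros(dim)  # vector 0: no information yet
for word in ["How", "are", "you", "?"]:
    # old vector + new token => new vector
    h = np.tanh(W_h @ h + W_x @ tokens[word])

print(h.shape)  # (8,): the whole question encoded as one vector
```

The final `h` plays the role of "vector 4" in the walkthrough above: a single vector standing in for the whole sentence, ready to be handed to a decoder.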
In practice, the idea of Recurrent neural network (RNN) is applied in many different forms, such as LSTM (Long short-term memory) or GRU (Gated recurrent units). Simply put, these types of neural networks use different techniques to convert a series of tokens (also called a sequence) into a vector.
The difficulty with RNNs is that they often suffer from the "vanishing gradient" problem: as the sequence grows, the training signal passed back through the network shrinks toward zero (or, in the "exploding gradient" case, blows up), the neurons saturate, and no matter how many tokens you add they can't transmit useful information anymore. This leads to the model "forgetting" what it's saying and repeating the sentence indefinitely like this like this like this like this like this like this like this like this like this like this ^C
Another problem is that training an RNN is usually very inefficient, mostly because you have to feed tokens in one at a time instead of processing 1000 tokens at once.
So how have we solved these problems? How does ChatGPT work? Let's move on to the next article to learn more about the Transformer architecture.
Link to the next article: https://blog.ngxson.com/de-hieu-hon-transformer-la-gi-gpt-hoat-dong-the-nao/
References
- https://distill.pub/2019/memorization-in-rnns/
- https://en.wikipedia.org/wiki/Natural_language_processing
- https://ai.stackexchange.com/questions/20075/why-does-the-transformer-do-better-than-rnn-and-lstm-in-long-range-context-depen
- https://www.tensorflow.org/tutorials/structured_data/time_series
- https://www.tensorflow.org/text/tutorials/word2vec