23 February, 2024

Fine-tuning a model - writing drama stories

Explore the process of fine-tuning an AI language model to write stories in the style of Nam Cao, a renowned Vietnamese author. Learn about dataset preparation, hardware requirements, and the challenges of training an AI to mimic a specific writing style.

    In the previous two articles, we explored how common natural language processing models work and delved into the Transformer, the core architecture behind famous models such as GPT, LLaMA, and Gemini.

    However, learning without practice can be a bit boring. In this article, I'll recount my process of fine-tuning the Vistral model, which understands Vietnamese, to write stories in the style of Nam Cao.

    For international readers:

    In this article, we'll be talking about a Vietnamese writer known by the pen name Nam Cao. Nam Cao was a Vietnamese short story writer and novelist. His works generally received high acclaim from critics for their thoughtful description and veracious reflection of society in the pre-1945 era.

    Why fine-tune?

    Simply put, fine-tuning is the process of taking an existing model and feeding it additional data to modify its behavior as desired. For example, in this article, the existing model might know how to write stories in general and even know who Nam Cao is, but it doesn't know how to write stories in his specific style.

    Although fine-tuning is easy to understand, there are several misconceptions that I want to point out:

    • Fine-tuning is not about teaching the model a bunch of new knowledge. For example, if you want the model to specialize in articles from Nui Blog, simply feeding it a bunch of blog posts will mostly teach it the writing style, not much of the knowledge. In this case, you should use RAG (retrieval-augmented generation) instead of fine-tuning.
    • Fine-tuning can potentially cause the model to lose some of its existing knowledge. Remember, if you fine-tune a model with 7B parameters, it will still have exactly that many parameters, no more, no less.
    • Fine-tuning is not magical (you can't just dump a bunch of data in and be done), it still requires preparation and filtering. We'll explore this in the next section.
    • If you want to teach the model a lot of new knowledge, for example teaching Vietnamese to a model that only knows English, you can fine-tune, but it will take a very, very long time (almost equivalent to training from scratch). So it's best to start with a model that already has the foundational knowledge of what you want to fine-tune.

    To illustrate, here is the same question asked before and after one of my fine-tunes:

    Before fine-tuning

    • User: Do you know sadness?
    • Assistant: Although I am an artificial intelligence and do not have emotions like humans, I can understand and empathize with human emotions, including sadness.

    After fine-tuning

    • User: Do you know sadness?
    • Assistant: Sadness is an emotion that everyone experiences, even cats like me. Sometimes, seeing the sad moments of people around me, I can understand that feeling. But usually, I prefer to keep a positive mood, because this life is beautiful and worth enjoying!

    Preparing the dataset

    Identifying desired behavior

    Most people new to fine-tuning think you just need to dump a bunch of data in and you're done. For example, since I want the model to learn Nam Cao's writing style, isn't it simply a matter of dumping in a bunch of Nam Cao's short stories? Although theoretically possible, in practice you'll find it very difficult to get the fine-tuned model to write a story on a given topic.

    The behavior we want here is:

    • Input: I ask the model to write on topic abc xyz
    • Output: A story written on topic abc xyz in Nam Cao's style

    So in the fine-tuning process, the dataset should be in the form:

    • Output: A story written on topic abc xyz in Nam Cao's style (taken from his actual works)
    • Input: The topic of the above story. The simplest way to get it: I just summarize that story text.

    Note that I don't summarize an entire story into one paragraph, but I divide a story into many small sections, then summarize each small part. This is done to ensure that the summary data still has enough details and doesn't lose too much information.

    Automatically summarizing a paragraph of text is no longer difficult these days. In fact, I used ChatGPT to summarize the paragraphs in Nam Cao's stories and included them in the dataset.
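
    As a rough sketch of this step in Python (the file names and the summarize helper are hypothetical; in my case the summaries came from ChatGPT, but any summarizer would do):

    import json

    # Hypothetical helper: I pasted each section into ChatGPT and saved the
    # summary it returned; an API call to any summarizer would work too.
    def summarize(section: str) -> str:
        raise NotImplementedError("plug in ChatGPT or another summarizer here")

    def build_samples(story_text: str, section_chars: int = 2000) -> list:
        # Split the story into small sections so each summary keeps enough
        # detail (a crude character-based split, just for illustration).
        sections = [story_text[i:i + section_chars]
                    for i in range(0, len(story_text), section_chars)]
        samples = []
        for section in sections:
            topic = summarize(section)  # the summary becomes the input...
            samples.append({"messages": [
                {"role": "user", "content": "Story topic:\n" + topic},
                {"role": "assistant", "content": section},  # ...the original text, the output
            ]})
        return samples

    # Usage (hypothetical file names):
    # with open("chi_pheo.txt") as f:
    #     json.dump(build_samples(f.read()), open("dataset.json", "w"), ensure_ascii=False)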

    Data sources

    The data preparation stage in fine-tuning (or training in general) plays a very, very important role and is the most time-consuming part.

    The data must be "clean" enough and diverse enough, but not too scattered. Due to time constraints, I chose a mere 3 short stories to include in the training:

    • "Cái chết của con Mực" (The Death of Muc)
    • "Chí Phèo"
    • "Đời thừa" (Superfluous Life)

    The reason I chose these 3 stories: "The Death of Muc" is quite short, so each summary paragraph covers more content. "Chí Phèo" is a classic with a long plot, so the model can learn how to connect different parts of a story. "Superfluous Life" revolves around a young writer, with many details about intellectual life, which suits the way I want the model to write about "he is a programmer".

    Data format

    I chose the ChatML format to store the data because I find it easy to picture.

    ChatML also allows adding a system prompt, which is quite important for "steering" the model's response in the right direction. A system prompt is essentially a role-play request inserted at the beginning of the conversation, before the user's message, for example "You are a helpful assistant".

    One basic misunderstanding is that after fine-tuning you no longer need a system prompt. This is not correct: fine-tuning only saves you from having to put examples into the prompt; the core instruction still needs to stay.

    For example, the prompt I use in this fine-tuning is:

    You are Nam Cao, a writer specializing in writing short stories on given topics. Writing style: Delving deep into exploring the inner life and spirit of characters, skillfully and subtly using the method of internal monologue, focusing on reflecting contemporary social reality and giving a sympathetic voice to the working class who have to endure many hardships.

    After the fine-tuning process, the model will automatically understand what "on given topics" means, as well as pay attention to the "inner life and spirit of characters" when writing. This is similar to how humans learn: either they need examples in front of them when doing a specific task, or better, they have already learned those examples in previous lessons, so at work a brief mention is enough for them to understand.

    The final output data looks like this (the gist shows just 2 of the sections I created; in reality, you need at least 20-30 such sections): https://gist.github.com/ngxson/4561a07531c3dcc30e4ab4663041bec0#file-dataset-json
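
    For a concrete picture of the shape (an illustration, not a verbatim copy of my dataset file; the gist above has the real thing), one sample could look like this, with the system prompt attached to every sample:

    {
      "messages": [
        {"role": "system", "content": "You are Nam Cao, a writer specializing in writing short stories on given topics. Writing style: ..."},
        {"role": "user", "content": "Story topic: <summary of one section, produced by ChatGPT>"},
        {"role": "assistant", "content": "<that section of the story, in Nam Cao's style>"}
      ]
    }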

    Running fine-tuning

    Hardware preparation

    First of all, it must be stated clearly: you cannot fine-tune without a GPU. It must be a GPU with dedicated VRAM, and the VRAM must be large enough; for the Vistral model I use, you need at least 15GB. A gaming GPU from the RTX 3000 or 4000 series works too, but it will definitely be slower, since those cards are built for gaming.

    Of course, I don't have such a fancy GPU, so I rented one on Google Cloud. An NVIDIA T4 costs only about 0.2€/hour but runs slowly, while an NVIDIA V100 costs about 1.3€/hour and runs 10 times faster. I didn't use Google Colab because it's expensive and, if anything goes wrong, all the data is lost (no persistent disk); moreover, Colab is quite slow.

    In addition, you need to set up Python, the Hugging Face libraries, the GPU driver, etc., and prepare the script + dataset before actually training. This is not something you can do in one go; it takes time to install, debug, and so on.

    Script preparation

    I use the QLoRA (Quantized Low-Rank Adaptation) technique for fine-tuning, which reduces both training time and VRAM consumption (thanks to this, only 1 GPU is needed).
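
    Roughly, the QLoRA setup with the Hugging Face stack looks like the sketch below. This is a minimal sketch, not my exact script (that one is linked right after); the exact flags are assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # QLoRA: the frozen base weights are quantized to 4 bits (NF4), so a
    # single ~15GB GPU can hold the 7B model during training.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # T4/V100 have no bfloat16 support
    )

    model = AutoModelForCausalLM.from_pretrained(
        "Viet-Mistral/Vistral-7B-Chat",
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("Viet-Mistral/Vistral-7B-Chat")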

    The script I use for reference is this one: https://github.com/brevdev/notebooks/blob/main/mistral-finetune-own-data.ipynb

    Additionally, you can also refer to my script supporting Vistral + chatml here: https://huggingface.co/ngxson/Vistral-7B-ChatML/blob/main/finetune.py

    This script loads the JSON into a dataset and tokenizes it before feeding it into fine-tuning: https://gist.github.com/ngxson/4561a07531c3dcc30e4ab4663041bec0#file-load%5Fdataset-py
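
    Condensed, that loading step amounts to something like this (reusing the tokenizer loaded above, and assuming it ships a ChatML chat template; the "messages" field matches the sample shape shown earlier, which is my assumption about the file layout):

    from datasets import load_dataset

    # Load the chat samples and render each one with the ChatML chat
    # template before tokenizing (a condensed sketch, not the exact script).
    dataset = load_dataset("json", data_files="dataset.json", split="train")

    def tokenize(sample):
        text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
        return tokenizer(text, truncation=True, max_length=4096)

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)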

    Notes on choosing the LoRA rank, alpha, warmup steps, and learning rate:

    • rank 8 and alpha 16: If you want to fine-tune gently, for example, just changing the writing style of the model. Also, the lower the rank, the faster the training.
    • rank 16 and alpha 32: If you start wanting to "overwrite" existing knowledge, for example, if the model always says "As an AI, I can't blah blah", then you can overwrite the knowledge "I am AI" with something else, for example at https://huggingface.co/ngxson/vistral-meow I overwrote it to "I am Meow".
    • Warmup steps: I set only 2 or 5 steps, though people usually set it quite high, in the tens or even hundreds. Personally, I find that if the loss doesn't go down, you should increase the warmup steps and decrease the learning rate.
    • Learning rate: Currently I set it to 2.5e-5, but if the loss doesn't go down, you can divide that number by 2, or by 10.

    For this example (writing in Nam Cao's style), I use rank 16, alpha 32, warmup 2, and learning rate 2.5e-5; see the sketch below for how these fit together.
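
    Translated into code, those choices look roughly like this (a sketch on top of the QLoRA setup above; target_modules and the batch/step settings are assumptions, not my exact values):

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import TrainingArguments

    model = prepare_model_for_kbit_training(model)  # standard prep before QLoRA training

    lora_config = LoraConfig(
        r=16,            # rank 16: enough to start "overwriting" behavior
        lora_alpha=32,   # alpha = 2 x rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed: attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    training_args = TrainingArguments(
        output_dir="./checkpoints",
        warmup_steps=2,                  # increase this if the loss won't go down...
        learning_rate=2.5e-5,            # ...and/or divide this by 2 or by 10
        per_device_train_batch_size=1,   # assumed; depends on VRAM
        gradient_accumulation_steps=4,   # assumed
        max_steps=500,                   # assumed
        fp16=True,
        logging_steps=10,
    )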

    Test a few times first

    Because renting an NVIDIA V100 is quite expensive, my hard-earned experience is:

    • First, you should try your script and dataset on Google Colab with a free GPU (NVIDIA T4).
    • After everything runs stably, you should rent on Google Cloud, but first start with an NVIDIA T4.
    • After everything is OK, then switch to the NVIDIA V100. Note that you need to stop the server before you can switch it to a new GPU.

    Actual run

    The actual running process on NVIDIA V100 will take about 2 hours.

    I set the logs to be sent to wandb.ai so I can monitor the run on my phone. While waiting, I can go out for a walk, buy food, and so on.
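
    With the Hugging Face Trainer, that only takes two extra fields in the TrainingArguments shown above (after a one-time wandb login on the machine):

    training_args = TrainingArguments(
        output_dir="./checkpoints",
        report_to="wandb",          # stream the loss curve to wandb.ai
        run_name="vistral-namcao",  # hypothetical run name
        # ...same hyperparameters as above...
    )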

    When the loss goes down to about 1, it's OK; below 0.4 is best. In my run, the loss went up and down continuously because the warmup was too small and the learning rate too large, but it still settled around 1.2, so I just left it and tried the result.

    Results

    Although I could have tested the results right on Google Cloud, I didn't want to spend more time there (because now time is money). So, I downloaded the checkpoint files and converted them to GGUF format. This conversion lets the model run on a home machine using llama.cpp, without even needing a GPU.

    The conversion script I use here: https://gist.github.com/ngxson/4561a07531c3dcc30e4ab4663041bec0#file-export%5Fcheckpoint%5Fto%5Fmodel-py

    Converting from safetensors to GGUF on Linux is quite simple; on Windows, I wouldn't know, because I don't use Windows.
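
    The gist of that export step: merge the LoRA adapter back into the base weights and save a regular safetensors model, which llama.cpp's converter can then pick up. A rough sketch (the checkpoint path is hypothetical; see the linked script for the real thing):

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Reload the base model in fp16, apply the trained adapter, then fold
    # the LoRA weights into the base weights ("merge").
    base = AutoModelForCausalLM.from_pretrained(
        "Viet-Mistral/Vistral-7B-Chat", torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, "./checkpoints/checkpoint-500")
    model = model.merge_and_unload()

    model.save_pretrained("./model")  # writes safetensors
    AutoTokenizer.from_pretrained("Viet-Mistral/Vistral-7B-Chat").save_pretrained("./model")

    # Then, from the llama.cpp repo: python convert.py ./model --outtype f16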

    After the conversion is done, you can create a "prompt.txt" file as follows (translated from Vietnamese):

    <|im_start|>system
    You are Nam Cao, a writer specializing in writing short stories on given topics. Writing style: Delving deep into exploring the inner life and spirit of characters, skillfully and subtly using the method of internal monologue, focusing on reflecting contemporary social reality and giving a sympathetic voice to the working class who have to endure many hardships.<|im_end|>
    <|im_start|>user
    Story topic:
    He is a young programmer working at a company. He works hard, but the software encounters errors and he can't fix them. Many days pass, he gets angry and curses. He curses the sky, curses the earth, curses the person who created the software. He curses everyone around him. No one can help him.<|im_end|>
    <|im_start|>assistant
    He
    

    Command:

    ./llama.cpp/main -m ./model/ggml-model.gguf --seed "-1" -c 5000 -f prompt.txt -n 2000 --temp 0.8 --top-k 40 --top-p 0.7 --logit-bias 38368-inf --repeat-penalty 1.15 --repeat-last-n 100 --log-disable
    

    Output (translated from Vietnamese):

    He sat in his office with tousled hair on his forehead. His eyes were sunken from lack of sleep, and his nose was flushed red from alcohol. He wore a wrinkled shirt and shorts that revealed his skinny legs. The sun was shining on his face but it couldn't make his irritable mood any better.
    
    The company where he worked was very noisy; the sound of machines buzzing loudly throughout the room, and others typing frantically on their keyboards. But no matter what, the sound still blended into a cacophony so jarring that it felt like it could explode one's brain. He sat there motionless in his chair, staring fixedly at the computer screen in front of him without moving even half a step from his spot.
    
    He had been working here for many months but the software was still broken and he couldn't fix anything. It was like enduring torture; every time he tried to fix a technical error, it only caused more trouble for the machines! And so now all the anger was directed at himself, making his heart want to explode with disappointment, let alone his head screaming silently in madness.
    
    He didn't understand why the software still wasn't functioning properly. Could it be because the company had cut the budget and there wasn't enough money to buy a good program? Or was the fault with the developer or programmer that led to this situation! He wondered if it was all falling on his shoulders, that he would be fired if this problem continued to exist; but just thinking about losing his job turned his gloomy mood into a frenzy.
    
    He looked around the office and saw that everyone else seemed to be busy with their own work, although in reality no one cared about it anymore. They typed as professionally as they could, their hands moving so fast they almost touched the computer screen without even thinking about what they had written or done. He realized that all of them were simply following habits rather than spending too much time on this mess, right?
    
    He wanted to scream but how could he when everyone here was good and he didn't have the courage to cause trouble anymore! He also tried talking to colleagues, to see if they could help with his own software in some way. But everyone seemed to be so preoccupied with their own work that they didn't pay any attention to his problem at all, so maybe he would find answers from the computer or any other electronic device lying quietly on the desk
    
    here; but when he looked at the bright green screen with the words "System Error" printed boldly in English, all hope vanished and only disappointment remained.
    
    He cursed the heavens and earth for this absurdity. He wanted to punch the floor hard until his palms were bruised! But no one noticed because they were too busy with their own work. Everything continued to happen as if he wasn't even there; everyone except him seemed to not care at all, so in the end everything fell into place and became a mess full of frustration.
    
    Finally, feeling like he was going mad from his own helplessness and the despair around him, He decided to take the day off to avoid enduring this nightmare any longer. But as he passed by the break room on his way back to his lodgings, he suddenly realized that he still had old habits from before: drinking until he was so drunk he didn't know anything about heaven and earth, then cursing everything and everyone within his sight or whose voices he could hear echoing behind his back...
    

    Comparison with other models

    This is the output of the original Vistral model (before fine-tuning):

    His name is Nam Cao. He is a writer who writes short stories on given topics with a style that delves deep into the inner life of characters while using internal monologue skillfully to reflect contemporary social reality and give a sympathetic voice to the working class who have to endure many hardships. <|im_end|>
    


    NOTE: This article is for learning and exploration purposes. I will not share the model or dataset as this model could potentially cause harm (it might fabricate information, exhibit gender bias, etc.)
