Fine-tuning an LLM to write docs like it’s 1995 – Fabrizio Ferri Benedetti

In my predictions for 2030 I wrote that technical writers would use specialized LLMs running locally on powerful hardware. I see signs of this move to “local first” among engineering pundits, but we’re not there yet, partly because of how much more powerful the connected frontier models are. However, this does not mean that we cannot experiment. That’s exactly what I did last week, trying to fine-tune an instruct model to write like a software technical writer from the 80s and 90s.

Calling upon old technical writing skills for research

To teach a personalized, local model to write like a technical writer from the 90s, you need a lot of written sources. For example, if I wanted to get a model to write like me, this blog would not suffice, as it barely has 100k words at the time of this post. Proper training needs more samples, and they are neither easy to obtain nor easy to produce. The only quick way is to use an existing corpus. Where could I get one?

Meet Bitsavers, a website that collects and scans old computer manuals and brochures. It’s an incredibly valuable repository of computer history and ancient technical writing, and mirrors of it are available everywhere. Since I am fond of Microsoft manuals from the 90s, I chose the Microsoft archive as a source of training material. The collection contains out-of-print documents published between 1977 and 2005: over 37 million words, covering older systems and SDKs.

I downloaded the OCR’d text files and stripped artifacts and clutter (like indexes and front matter) from the content using a good old Python script. I then used a cheap and fast model via OpenRouter, gemma-4-26b, to classify each paragraph as “keep” or “drop” based on its readability. This second pass cost about $8. Even with this two-pass cleaning, the training data retained some noise that I only discovered later, but it was good enough for my tests.

Per Claude’s advice, I split the text cleanly into training examples at paragraph and section boundaries, breaking at headings and keeping code blocks whole, and limiting each chunk to about 512 tokens. Each chunk was paired with a synthetic instruction drawn from a template. I ended up with 192,456 examples in JSONL format (one JSON object per line). I could also have used a smaller model to generate better instructions and questions, but I’m an impatient person.
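To make the chunking step more concrete, here is a minimal sketch of how paragraphs could be packed into examples and written out as JSONL. It is not the script used for this experiment: the token count is approximated by whitespace-separated words, and the `INSTRUCTION_TEMPLATES` list and the `instruction`/`output` field names are hypothetical.

```python
import json
import random

# Hypothetical instruction templates; the real ones are not shown in this post.
INSTRUCTION_TEMPLATES = [
    "Write a section of a 1990s software manual.",
    "Continue this technical documentation in the same style.",
]


def count_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def chunk_paragraphs(text, max_tokens=512):
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    A single paragraph longer than max_tokens is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = count_tokens(para)
        # Start a new chunk when this paragraph would exceed the budget.
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def to_jsonl(chunks, seed=0):
    """Pair each chunk with a template instruction, one JSON object per line."""
    rng = random.Random(seed)
    lines = []
    for chunk in chunks:
        record = {
            "instruction": rng.choice(INSTRUCTION_TEMPLATES),
            "output": chunk,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

A real pipeline would likely count tokens with the target model’s tokenizer and treat headings and code blocks as hard boundaries; this sketch only respects paragraph breaks.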

💡 A note on content: This is an independent, non-commercial research project and is not affiliated with, sponsored, or endorsed by Microsoft. I used these out-of-print manuals only for personal style-shifting experimentation. The corpus, training data, and resulting adapters are not being distributed, and the fine-tuned models are completely local to my machine.

Fine-tuning as an alternative to training from scratch

In an ideal world, I would have several million dollars lying around, ready to be wasted on building my own LLM, Fabrice. Since I am far from rich (I would not be writing this otherwise), the alternative to Fabrice is fine-tuning, which involves changing the “weights” of a model so that each generated token is conditioned by the training material. I like to imagine fine-tuning as slightly steering the trajectory of a giant iceberg using tugboats: just a little, to get the desired effect.

Why fine-tuning and not retrieval-augmented generation (RAG)? Because in this experiment I was not so much interested in retrieving facts, a scenario where RAG excels, as in making an LLM behave and write in a specific style, regardless of what it knows about the context. Compared to full training, fine-tuning does not require huge amounts of data, so it is cheaper. Also, just because: I’ve always wanted to try fine-tuning as a technique and see how much it could do.

To avoid spending days or weeks fine-tuning a model on my computer, which has an old graphics card, I relied on RunPod, an online service for AI developers that offers on-demand pods with pre-configured GPUs and tools at a (relatively) low price. For example, for less than $6 per hour, you can rent an Nvidia B200 card (192 GB of memory). The service has a convenient API with configurable auto-recharge and cost control mechanisms.

Enter a world full of mysterious words

Having decided to fine-tune a model, I consulted Claude on the most sensible ways to achieve this. We settled on QLoRA (Quantized Low-Rank Adaptation), which achieves fine-tuning not by changing each weight of the LLM, but by “freezing” them and putting an adapter on top: a small file that reshapes the model’s behavior (a bit like a mask, if you like). The Q in QLoRA means that the frozen base model is quantized, i.e., compressed, during training, which reduces memory requirements.
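For readers who prefer code to metaphors, here is a toy sketch of the low-rank adapter idea in plain Python. It follows the usual LoRA formulation, where the output is the frozen layer’s output plus a scaled low-rank correction, `y = W x + (alpha / r) * B (A x)`. The matrix sizes, values, and the `alpha` default are illustrative assumptions, and the sketch leaves out the quantization part entirely.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=16):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen base weight matrix; only A and B would be trained.
    """
    r = len(A)  # rank = number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Toy sizes: 3 outputs, 4 inputs, rank 2.
W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # frozen base weights (3x4)
A = [[0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5]]         # adapter matrix A (2x4)
B_zero = [[0, 0], [0, 0], [0, 0]]                # adapter matrix B (3x2), all zeros
B_trained = [[1, 0], [0, 1], [0, 0]]             # pretend values after training
x = [1, 2, 3, 4]

# With B at zero the adapter changes nothing: the model behaves like the base.
untouched = lora_forward(W, A, B_zero, x)
# Once B is non-zero, the output is nudged without W ever changing.
nudged = lora_forward(W, A, B_trained, x, alpha=2)
```

Starting `B` at zero is the common LoRA initialization, which is why a freshly created adapter leaves the base model’s behavior intact until training moves it.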

Are you still with me? Good. If you think it’s dense, that’s because it is.

Doing anything with an LLM at home these days is all about compromise: you either sacrifice time, spend money, or curtail your ambitions. I tried to strike a balance to get something worthwhile in less than a weekend. I chose to try fine-tuning on two models, Llama 3.1 8B Instruct and Qwen 2.5 7B Instruct. At their size (about 8B parameters) they run comfortably on a MacBook Air. I also tested the Llama base model (which is not trained to answer questions).

I tested fine-tuning under several different conditions: varying the amount of training material (a subset vs. the full corpus), the number of epochs (training rounds), and structural parameters such as rank. I have only superficial knowledge of all this, but I trusted my agent, whom I happily questioned every step of the way, to make the right choices. For example, 3 epochs may result in “overfitting” in some cases; in the world of LLMs, this means training so hard that the model starts memorizing the material instead of generalizing from it. Fun times.

Adapters can only be applied to the base model they were fine-tuned for. After training each adapter, I exported it to my laptop, converted and quantized it into a GGUF LoRA file, and then registered it as a local Ollama model that I could run on my laptop for benchmarking purposes. The local conversion approach is fast and requires no GPU, although inference is somewhat slower than with a fully merged model. For the current test, I didn’t care that much about speed.
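Registering an adapter with Ollama is done through a Modelfile, where `FROM` names the base model and `ADAPTER` points at the GGUF LoRA file. The sketch below only builds the Modelfile text; the `llama3.1:8b` tag and the `./adapter.gguf` path are placeholder assumptions, not the exact names used in this experiment.

```python
def build_modelfile(base_model, adapter_path):
    """Return Modelfile text that layers a GGUF LoRA adapter on a base model."""
    return f"FROM {base_model}\nADAPTER {adapter_path}\n"


# Placeholder model tag and adapter path (assumptions for illustration).
modelfile = build_modelfile("llama3.1:8b", "./adapter.gguf")
print(modelfile)
```

Saving that text as `Modelfile` and running `ollama create <name> -f Modelfile` would then register the combination as a local model. The base model in `FROM` has to be the same one the adapter was trained on.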

It probably took a full day, including breaks, to train the adapters for all conditions, at a total cost of $50. Along the way, I lost two adapters: RunPod is unforgiving about budget and immediately deletes pods when your balance hits zero (lesson learned, yes). Claude took care of setting up each run and tracking it through RunPod’s API. Claude Code’s /goal command was quite helpful for looping through each step (in retrospect, I would have run it in YOLO mode).

This table shows all the models I compared and their conditions:

Did the style transfer after fine-tuning?

I gave each model the same prompts:

  • Document malloc(), a staple C function, which the training material may cover.
  • Document a hypothetical ConnectWifi() Win32 API function, which is not present in the training material.
  • Explain what a REST API is in 1990s Microsoft style (anachronistic test).
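Running the same prompts against every model is easy to script. The following is a minimal sketch of such a harness, not the one used here: the prompt wording is paraphrased from the list above, and `generate` is a hypothetical callable that you would implement yourself, for example as a wrapper around a local model runner.

```python
from itertools import product

# Prompt wording paraphrased from the list above (an assumption).
PROMPTS = {
    "malloc": "Document the malloc() C function.",
    "connectwifi": "Document the ConnectWifi() Win32 API function.",
    "rest": "Explain what a REST API is in 1990s Microsoft style.",
}


def run_benchmark(models, prompts, generate):
    """Call generate(model, prompt) for every pair.

    Returns a dict keyed by (model name, prompt key) so that answers to the
    same prompt can be compared side by side across models.
    """
    results = {}
    for model, (key, prompt) in product(models, prompts.items()):
        results[(model, key)] = generate(model, prompt)
    return results


def fake_generate(model, prompt):
    """Stand-in generator so the sketch runs without any model installed."""
    return f"[{model}] {prompt}"


# Hypothetical model names, for illustration only.
results = run_benchmark(["base", "tuned"], PROMPTS, fake_generate)
```

Keying the results by model and prompt makes it straightforward to dump them into a comparison table afterwards.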

You can see all the questions and answers in this summary.

In the malloc() test, the unmodified models generated modern Markdown docs in README style, while the fine-tuned models used a period-correct structure consisting of a synopsis block, a return value section, and so on. For the hypothetical ConnectWifi() function, only the 3-epoch models maintained the fiction and documented it as if it were real, while the others broke the fourth wall, falling back on their internal knowledge and resisting the training.

The REST API test was also quite interesting: Llama Instruct 40k failed, producing dull marketing prose. Claude attributed this to the heavy reinforcement training (RLHF) that Llama goes through to make it friendly and approachable. The Qwen fine-tunes maintained the register better, producing structured documents that used formal headers and HTTP method names as verbs. Qwen 192k was the strongest, opening like a chapter from the Windows 2000 Resource Kit.

Let me repeat this: a 7B model, trained on 1990s documentation and tested on 2000s concepts, produced a solid chapter that could be mistaken for actual period content. The style transferred. Very good. On the other hand, the base model, which is trained not to answer questions but to autocomplete text, failed miserably, spewing hundreds of lines of near-random garbage from the raw corpus. There is no notion of “answer this question” or “complete this” in the base model.

I finished the experiment by comparing the effect of rank between Qwen models, using ranks 8 and 16 with 1 epoch. If I understood it correctly, rank 8 means that each adapter matrix can only describe 8 independent patterns. It’s like having 8 dials to tune. With so few dials, the adapter can’t be too clever: it must commit fully to the strongest, most repeated patterns in the training data. Rank 16 is, theoretically, more expressive and subtle.
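The “dials” can be counted. A LoRA adapter for a weight matrix with `d_out` rows and `d_in` columns stores two small matrices, one of size `rank × d_in` and one of size `d_out × rank`, so it adds `rank * (d_in + d_out)` trainable parameters. The sketch below compares that with the full matrix; the 4096 × 4096 shape is an illustrative assumption, not a measurement from the models in this post.

```python
def full_params(d_out, d_in):
    """Number of weights in a full d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable weights in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Illustrative layer shape (an assumption).
D_OUT, D_IN = 4096, 4096

full = full_params(D_OUT, D_IN)
for rank in (8, 16):
    adapter = lora_params(D_OUT, D_IN, rank)
    share = 100 * adapter / full
    print(f"rank {rank}: {adapter:,} adapter weights ({share:.2f}% of {full:,})")
```

Doubling the rank doubles the adapter’s size, yet in both cases it stays well under 1% of the layer it modifies, which is why adapters are such small files.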

The rank comparison shows that smaller adapters, with fewer degrees of freedom, commit to the fiction more readily than larger adapters; a rank 16 adapter can more easily “escape” the corpus. It also turned out that combining only 1 epoch with the medium rank of 16 caused hallucinations more frequently: the adapter is expressive enough to reach for associated concepts, but not strong enough to anchor on what the prompt is trying to say. Rank and epochs seem to interact: it’s like using a sound mixer. Interestingly, the cheaper the adapter, the more honest the impersonation.

Fine-tuned models are convincing impersonators, but they are not replacements

The fine-tuned models were great impersonators of Microsoft technical writers of the late 90s. The corpus influenced the models’ style and voice as well as some of their knowledge, while mostly preserving their ability to describe novel concepts. This is a relatively inexpensive process that can produce effective small models for tasks such as reviewing style or drafting new documents while following in-house style guidelines.

However, getting there is not an easy journey. Fine-tuning a model, although inexpensive, requires a good amount of high-quality training data, which is not easy to prepare. Even once you’ve got it in your hands, you still need to choose a base model that makes sense and is able to accept additional training. And then, the many parameters at your disposal make steering the fine-tuned model to a good place a time-consuming proposition.

What is reassuring is that such models cannot replace human technical writers, only augment them. Fine-tuned models lack judgment just as much as their non-tuned siblings, and require copious amounts of steering. Fabrice will have to wait.


