There's no shortage of analogies for explaining what an LLM is capable of - one of the best, though, comes from this New Yorker article describing it as a "blurry JPEG of the web". The metaphor is particularly useful for capturing many of the technical aspects of what an LLM really "is". In a very quick and handwavey way, when we talk about an LLM, we're typically talking about two principal components: first, an algorithm designed to manipulate a large blob of numbers, and second, the particular arrangement of that blob of numbers.

To make it really, really dumbed down, this large blob of numbers is basically some big structure of values (think lists of lists of numbers, something you could just as easily punch into a JSON file). The values start out random, and the algorithm tweaks them ever so carefully so that, when we encode an input text and bang it against these numbers iteratively, a single number is punched out the other side. That number is the ith entry in some internal dictionary, representing the guess for the next token (e.g. the input text is "I like to eat fruits like an ..." and the guessed word is index 897, or "apple"). The work product, in essence, is the blob of numbers.
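To make that "ith entry in an internal dictionary" idea concrete, here's a toy sketch using a small off-the-shelf generative model (gpt2 here is just a stand-in choice, not anything used later in this post): run the text through the network and grab the highest-scoring index over the vocabulary.

# Toy illustration: the model emits a score for every entry in its vocabulary,
# and the "guess" for the next token is just the index with the highest score.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("I like to eat fruits like an", return_tensors="pt")
with torch.no_grad():
    logits = lm(**inputs).logits       # shape: (1, sequence_length, vocab_size)

next_id = int(logits[0, -1].argmax())  # the index in the internal dictionary
print(next_id, tok.decode([next_id]))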

These blobs of numbers can be used to do things like the work that ChatGPT does, where the goal is to generate that next ith value, but they can also be paused a step earlier in the production of that number - not to represent the most likely next word, but to do the semi-related task of capturing the "essence" of the input text. This is called an embedding, or, for these models, more accurately, a sentence embedding. These sentence embeddings look like a long list of numbers of some fixed length (e.g. a five-dimensional embedding could look like [-0.86, -0.64, 0.42, 0.86, 0.46]). These models typically have hundreds of dimensions - even the smallest heavily used embedder is 384 dimensions. In essence, these numbers represent the "home address" of a text, such that some other text that conveys the same or a very similar meaning would produce a list of numbers that are very close, element by element (e.g. in a simple example, "I like apples" would produce [0.1, 0.3, 0.4], while "I love apples" would produce [0.1, 0.2, 0.4], and "That man is mean" would produce [0.8, 0.6, 0.8]).
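To make that concrete, here's a quick sketch using sentence-transformers; all-MiniLM-L6-v2 is my assumption of the kind of small, 384-dimensional embedder referenced above, not necessarily a model used later in this post.

# Each sentence becomes one fixed-length list of numbers (384 here), and
# sentences with similar meanings land at nearby "addresses".
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(["I like apples", "I love apples", "That man is mean"])
print(embeddings.shape)  # (3, 384)

# Cosine similarity: the first two sentences score much closer to each other
# than either does to the third.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))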

The routine for training these models is to give the model some text like "I like to eat fruits like an apple", randomly mask some words to get "I like to [MASK] fruits like an [MASK]", then ask the model to "recall" what the masked words are. When the model gets it right, we strengthen the associations that led it to that outcome; when it gets it wrong, we adjust those associations so the correct word becomes more likely the next time around. When a model leaves the "factory floor," as it were, it is typically a generalist unless advertised otherwise - the model is as good at predicting text about SaaS businesses as it is about knitting, the Hundred Years' War, or argon. This is great for generalist domain tasks, but in many real-life applications we want to make the model a "specialist" in a specific domain.
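To picture that masking step, here's a toy sketch - a word-level stand-in for the token-level masking real training uses:

import random

def mask_words(text, mask_prob=0.2, mask_token="[MASK]"):
    # Randomly hide a fraction of the words, the way masked-language-model
    # training hides a fraction of tokens and asks the model to recover them.
    words = text.split()
    return " ".join(mask_token if random.random() < mask_prob else w for w in words)

print(mask_words("I like to eat fruits like an apple"))
# e.g. "I like to [MASK] fruits like an [MASK]"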

To "make" a model a "specialist", we engage in a "finetuning" procedure. Lots of companies provide this as an extension of their core models, but we can also just as easily do it with offline models. When we "fine-tune" a model, we are, in essence, teaching a machine to shift its attentional dimensions to the sub-territory of linguistic space on a "map" - we're doing the model equivalent of zooming into a map to consider the detail we actually care about.

[Figures: a useless map of the entire territory vs. a useful map of the territory we care about]

It's still early days with LLMs, relatively speaking. It's become super easy to deploy models, run them offline, and play with their output, but there are surprisingly few rigorous, concrete answers to a question like "How much data is enough for finetuning an LLM?"

Nearly 700 words in, that's the goal of this post (never mistake me for knowing how to SEO). I've wanted to see a concrete answer to this and haven't found one - so it's time. For this post, I wanted to show very clearly how much additional benefit finetuning confers on a domain-specific project - how much juice you get from the squeeze - and, crucially, how much squeeze you need to apply in order to get that juice.

For this project, I have ≈120k descriptions of BringATrailer.com auctions, alongside the final bid price. That is, we have ≈120k lines of JSON in a file that look like:

{
  "price": 67911.0,
  "text": "This 1996 Porsche 911 Turbo is finished in black over black leather and is powered
  by a twin-turbocharged 3.6L flat-six that drives all four wheels through a six-speed
  manual transaxle with a limited-slip differential. Equipment includes fog lights,
  headlight washers, a sunroof, 19″ Victor Equipment wheels, a fixed rear spoiler,
  power-adjustable front sport seats, a Ruf-branded steering wheel, a cassette stereo,
  and automatic climate control. The seller acquired the car in 2006 and has since
  added 106k of the 168k miles shown. The turbochargers, engine mounts, and timing
  chain were replaced in 2024. This 993 Turbo is offered with manufacturer’s
  literature, a tool kit, service records, and a clean California title in the
  seller’s name."
}
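A minimal sketch for loading a file like that, assuming a hypothetical auctions.jsonl with one object per line:

import json

def load_auctions(path="auctions.jsonl"):  # hypothetical filename
    # Each line is one {"price": ..., "text": ...} object.
    texts, prices = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            texts.append(record["text"])
            prices.append(record["price"])
    return texts, prices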

For this demonstration, my questions are twofold - first, how much does finetuning help for a given task as compared to using a "factory floor" model, and second, how much data do I have to finetune with in order to get that result? My goal is to convert the text descriptions into embeddings, or "linguistic addresses", then use those addresses as I would any other set of features in predicting a target variable (in this case, the sale price of the car). In effect, we are asking: as we increasingly zoom into the linguistic subspace of interest for a given domain, how much more accurately can we predict the corresponding outcome of interest?

In this example, we'll scale up the amount of data we finetune on linearly - as we do, how much more accurate does our model get? We can measure that as the mean absolute error (i.e. using the embeddings, predict the price of each car, take the absolute difference between each guess and the real value, and average those differences), but we can also report the median absolute error, which is a bit more robust to outliers (which, for BringATrailer, are pretty reasonable to expect, as some weird cars may just sell for way more or way less than their immediate "neighbors"). R2 is also a useful metric, so we'll throw it in. In the code below, I define a finetuning procedure that I invite you to use in your own work. Briefly, it pulls all the text, [MASK]'s 20% of the tokens, breaks the text up by the input token length of the provided model, then finetunes the model. Finally, it saves the result so that it can be loaded with a typical SentenceTransformer(MODEL_NAME) invocation. We also do a quick sanity check to make sure the model we save and actually use is consistent with the model we trained.
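The full code is linked at the end of the post; as a preview, here's a condensed sketch of what such a masked-language-model finetuning procedure can look like with the Hugging Face Trainer. The base model name, hyperparameters, and output path are illustrative assumptions, not necessarily what I used.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from sentence_transformers import SentenceTransformer

BASE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed base model
OUTPUT_DIR = "bat-finetuned-embedder"                  # assumed output path

def finetune_mlm(texts, base_model=BASE_MODEL, output_dir=OUTPUT_DIR,
                 mask_prob=0.2, epochs=1):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForMaskedLM.from_pretrained(base_model)

    # Break each description up by the model's input token length.
    dataset = [
        tokenizer(t, truncation=True, max_length=tokenizer.model_max_length)
        for t in texts
    ]

    # The collator randomly [MASK]s 20% of the tokens in every batch.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=mask_prob
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir,
                               num_train_epochs=epochs,
                               per_device_train_batch_size=32),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()

    # Save the encoder so it can be reloaded with SentenceTransformer(...).
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)

    # Sanity check: the saved model loads and produces embeddings.
    embedder = SentenceTransformer(output_dir)
    assert embedder.encode(["sanity check"]).shape[0] == 1
    return embedder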

I've also defined a simple routine for training a basic XGBRegressor, which uses the trained embedding model to predict prices for the full corpus of ≈120k auctions. Even though we only finetune on some subset, we still predict on the full set, and we report the R2, mean absolute error, and median absolute error. So, how much data do we need to provide in order to get good results? On one extreme, we use the "factory floor" model and its "generalist" embeddings to predict the auction prices. On the other extreme, we finetune on everything in our dataset, which takes the most time and effort but may end up being the most accurate. Between these extremes, we'd like to know how much data is required to reach the point of diminishing returns on the finetuning process, and, ideally, we'd like to understand the general "shape" of the accuracy gains (e.g. is it linear, where accuracy just slowly creeps up with increasingly larger batches, or is it exponential, where the first few batches help the most and increasingly large batches matter less, since it's unlikely they contain relatively more "informative" data than their smaller predecessors?).
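Here's a minimal sketch of that evaluation routine, assuming the embedder and data-loading helpers above; the train/test split and XGBRegressor settings are illustrative, not the exact configuration behind the numbers reported below.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def evaluate_embedder(model_path, texts, prices):
    # Convert every description into its embedding "address"...
    embedder = SentenceTransformer(model_path)
    X = embedder.encode(texts, batch_size=256, show_progress_bar=True)
    y = np.asarray(prices)

    # ...then treat those addresses as ordinary features for a regressor.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    reg = XGBRegressor(n_estimators=500, learning_rate=0.05)
    reg.fit(X_train, y_train)
    preds = reg.predict(X_test)

    return {
        "r2": r2_score(y_test, preds),
        "mean_abs_error": mean_absolute_error(y_test, preds),
        "median_abs_error": median_absolute_error(y_test, preds),
    }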

In practice, I find that even modest amounts of data quickly increase the accuracy of our scratch model (remember, we don't particularly care about how predictive this model is in absolute terms; we just want to know the relative increase in predictive power as we scale up the finetuning procedure). Out of the box, the model performs at an R2 of 0.52, a median absolute error of $11,280.72, and a mean absolute error of $19,617.56. If we go to the extreme and finetune on the full dataset, we achieve an R2 of 0.64 (a 23% boost), a median absolute error of $9,690.27 (a 14% reduction), and a mean absolute error of $17,946.93 (a 9% reduction). Clearly, increasing the finetuning scale is doing something - but it's a modest effect. In other words, it's certainly enough to warrant the procedure, but it is not a panacea, and it doesn't eliminate the remaining unexplained variance.

What's interesting, however, is that we already see basically the full benefits of finetuning very early. If we only finetune with 50,000 auction descriptions, we're getting nearly all of the squeeze from the modeling effort with an R2 of 0.62, a median absolute error of $10,145.99, and a mean absolute error of $18,767.09. Still certainly some room to go, but the majority has been accounted for with only a 42% sample of the data. In my own semi-lay interpretation, I think the handwavey way to explain this is that "once a model, in the process of finetuning, has seen enough data such that we're on the back of the curve in terms of relative surprisal for any particular document, we're at the point of diminishing returns of the finetuning procedure".

From the R2, we're left with the impression that the effect follows some sort of exponential curve with asymptotic attenuation - fast early gains that level off (though the effect is moderate; note the vertical axis limits).

Median and mean absolute error, meanwhile, seem to drop more linearly, and don't appear to hit an asymptotic cutoff.

So - after a long-winded test, what's the value here? I think this jointly shows several things:

  1. Finetuning an embedding model will likely always provide some tangible benefit,
  2. Those benefits are typically modest but real,
  3. We should expect an immediate jump in R2 followed by a long tail of slow improvement,
  4. We should expect the error to drop roughly linearly, and we can probably estimate the slope of that drop with only a few tests.

The code for the entire test procedure here is available below! I encourage you to try it out.