Fine-tuning on a Budget

00:00:05:15 - 00:00:36:21
Lori MacVittie
You're listening to Pop Goes the Stack, the show that keeps tabs on emerging tech so you don't have to pretend you read the white paper. I'm Lori MacVittie. And yeah, I read the footnotes, too. This week I'm joined by our co-host, who is not a model, I'm assured: Joel Moses. And an expert from within our AI Center of Excellence, Dmitry Kit, who trains models and understands the topic for today, which is LoRA.

00:00:36:24 - 00:01:05:12
Lori MacVittie
That's low-rank adaptation, and it's a not-so-secret weapon for customizing LLMs without melting your GPU or blowing out your budget. Now, training and fine-tuning full-scale models is expensive. It's slow and increasingly impractical outside hyperscalers because of the number of parameters and the cost associated with just building out that infrastructure. So LoRA flips the equation.

00:01:05:12 - 00:01:25:25
Lori MacVittie
It gives you targeted intelligence without having to retrain the whole brain. Right? It's not even like fine-tuning; it goes beyond that. So you can do all sorts of things with LoRA. And to dig into how it works and what you can do with it, let's kick it off. Dmitry, welcome.

00:01:27:02 - 00:01:29:27
Dmitry Kit
Thank you. Glad to be here.

00:01:30:00 - 00:01:51:12
Joel Moses
So let's start with the basics. First of all, let's talk about fine-tuning, fine-tuning an open source model, for example. When would you choose to fully fine-tune as opposed to using the fine-tuning mechanism specified in LoRA? We'll get into the differences in a moment, but what are some of the benefits, and what are some of the risks and drawbacks, of full fine-tuning?

00:01:51:14 - 00:02:17:29
Dmitry Kit
Yeah. So these models are trained on a lot of data from the internet, and that allows them to generalize and allows us to do prompt engineering. By tailoring the prompt to explain what problem we're trying to solve, the large language model is going to use all that information to try to, you know, target your specific problem.

00:02:18:01 - 00:02:59:01
Dmitry Kit
Now, there are things that these organizations do to bias the model towards certain responses that they're interested in. It goes through reinforcement learning from human feedback, where, after this large-scale training, they bias the behaviors of the model. And if you want to bias it away from those kinds of instructions, then you may want to consider doing full-scale fine-tuning, because you want to override the natural tendencies of these networks.

00:02:59:03 - 00:03:22:03
Dmitry Kit
However, if you do fine-tuning and you force the model to behave in a very specific way, then it might actually start forgetting. And that's the biggest problem with full-scale fine-tuning: you start discarding information that might have actually allowed the network to generalize to the kinds of problems you want to solve.

00:03:22:06 - 00:03:32:07
Dmitry Kit
And so it's a balance: asking it to bias its behavior towards what you want without actually losing all the general knowledge it gained.

00:03:32:10 - 00:03:55:15
Joel Moses
Sort of accidentally blowing holes in the brain, so to speak. So effectively, fine-tuning risks messing with the established weights that the model has relied upon to build its corpus, and that can sometimes be very dangerous. Now, how is LoRA a different approach, and why is it better than full fine-tuning?

00:03:55:17 - 00:04:20:15
Dmitry Kit
Yeah. So LoRA is a type of adapter technique. What that means is it actually sits beside the base model, and you can target it to only affect certain portions of the large language model. You can also affect every portion, but it's still kind of a sidecar to the original model.

00:04:20:18 - 00:04:58:18
Dmitry Kit
And the thing about LoRA, low-rank adaptation, is that it uses matrix decomposition. What that allows it to do is represent a matrix in a very efficient way. And once it's representing it that way, it can transform it back into the original size of that layer's matrix, and then, through addition, it can basically change the behavior and tendencies of that layer to do something different than it otherwise would.
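
To make that concrete, here is a minimal sketch of the low-rank update applied to a single weight matrix. The layer size, rank, and scaling factor are illustrative assumptions, not numbers from the episode.

```python
# Minimal sketch of LoRA's low-rank update on one frozen weight matrix.
# Layer size, rank, and scaling are illustrative assumptions.
import torch

d_out, d_in, r = 4096, 4096, 8        # one layer of a 7B-class model; rank r is small
W = torch.randn(d_out, d_in)          # frozen pretrained weight, never updated
A = torch.randn(r, d_in) * 0.01       # trainable low-rank factor (r x d_in)
B = torch.zeros(d_out, r)             # trainable low-rank factor (d_out x r), starts at zero
alpha = 16                            # scaling applied to the update

delta_W = (alpha / r) * (B @ A)       # decomposed update expanded back to the full layer size
W_adapted = W + delta_W               # the "addition" step that shifts the layer's behavior

print(d_out * d_in)                   # ~16.8M parameters in the frozen matrix
print(r * (d_in + d_out))             # ~65K trainable parameters in the adapter
```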

00:04:58:21 - 00:05:25:07
Dmitry Kit
So in our work, we often focus on the attention layers, and also on how that attention affects the final distribution over the tokens. So we're basically saying: look, use your entire knowledge base to figure out what information you need, but start biasing it towards the inputs that we think are important for the problem that we want to solve.

00:05:25:09 - 00:05:53:04
Dmitry Kit
And then, after you've figured out what's important in the inputs that we gave you, start biasing which next tokens should be generated based on those interesting elements. That actually allows us to freeze and not affect most of the network. So in practice, we can take a 7 billion parameter model and only have maybe 10 million parameters to train.
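
As a rough sketch of how that looks in practice with an adapter library, here's a hedged example using Hugging Face's peft package. The base model name, rank, and the q_proj/v_proj module names are assumptions about a Llama-style 7B checkpoint, not the team's actual setup.

```python
# Sketch: attach a LoRA adapter to just the attention projections of a 7B model.
# Model name, rank, and target module names are assumptions, not from the episode.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # low rank of the decomposition
    lora_alpha=16,                        # scaling applied to the low-rank update
    target_modules=["q_proj", "v_proj"],  # only the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)      # base weights stay frozen
model.print_trainable_parameters()        # a few million trainable parameters out of ~7B
```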

00:05:53:06 - 00:06:03:14
Joel Moses
Okay. And so to continue the medical analogy of the brain, it's a bit like a targeted gamma knife rather than a lobotomy, which full fine-tuning can be.

00:06:03:14 - 00:06:19:15
Dmitry Kit
Exactly. Yeah. And, you know, the idea here is you really do not want to overwrite most of the functionality. You do want to keep it fairly static. But you do want to bias and push it just a little bit in certain places towards what you want.

00:06:19:15 - 00:06:21:09
Joel Moses
I see.

00:06:21:11 - 00:06:53:28
Lori MacVittie
So I'm thinking about a case like languages, right? You've got a specific language you use, and it may be constrained; it may even be a little different from what you'd find out in the wild. And if you wanted to make sure that the model was representing your version of that language correctly, you might build a LoRA to sit next to it and say: no, when you see "if", I know you normally want to see this other thing, but we want to see that thing.

00:06:54:00 - 00:07:15:14
Lori MacVittie
And so it trains it to basically be more specific to what you want, as opposed to just the general knowledge. So the general knowledge is still there. It still understands the language and how languages are built, conditional statements and all. But it's going to be more specific and choose what you think should be there, based on your implementation as opposed to someone else's.

00:07:15:17 - 00:07:45:06
Dmitry Kit
Yeah, that's actually a great example, so I can give a more concrete example like that. In one project, we needed to have a large language model generate regular expressions in a very specific way. And regular expressions, you can think of them as a different language. It turns out that models kind of know the form of regular expressions, but they are definitely biased

00:07:45:09 - 00:08:14:00
Dmitry Kit
towards a certain kind of regular expression they want to produce. So what we did was automatically generate a synthetic data set of about 30,000 examples, and we trained a LoRA so that it took a prompt in a very specific format, consistent with the problem we were trying to solve, and output only the regular expression that we needed based on that prompt.

00:08:14:00 - 00:08:37:20
Dmitry Kit
This allowed us to not be too verbose in the prompt, because we'd biased the model to already accept a prompt of that form, and we biased it to output just enough tokens to generate what we needed without being too verbose. Thereby we save time on prompt engineering, but also on the output token side too.

00:08:37:23 - 00:09:00:27
Dmitry Kit
Because we were very consistent there. And because large language models tend to reason in terms of tokens, sub-words, if you need to do a dot-star in a regular expression, it might already have a pattern for dot-star. But if you need to do something a little more complicated, all of a sudden it might bias towards tokens that already have bigger patterns in them.

00:09:00:27 - 00:09:15:09
Dmitry Kit
And so it starts messing up the regular expression. So if you focus it on building up the regular expression character by character, you can actually get a much more consistent output.
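
A hedged sketch of what a training run like this could look like: a JSONL file of synthetic prompt/regex pairs fed through the standard Hugging Face Trainer with a LoRA adapter attached. The file name, prompt format, and hyperparameters are illustrative, not the team's actual pipeline.

```python
# Sketch: train only a LoRA adapter on ~30,000 synthetic prompt -> regex pairs.
# File name, prompt template, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"                 # placeholder base model
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token

# Each record: {"prompt": "...", "regex": "..."} in a fixed, terse format.
ds = load_dataset("json", data_files="regex_pairs.jsonl")["train"]

def tokenize(example):
    text = example["prompt"] + "\n" + example["regex"] + tok.eos_token
    return tok(text, truncation=True, max_length=256)

ds = ds.map(tokenize, remove_columns=ds.column_names)

model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(model_id),
    LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="regex-lora", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("regex-lora")                   # saves only the small adapter weights
```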

00:09:15:11 - 00:09:20:23
Joel Moses
Interesting. And adding that through full fine tuning would have been risky to the model itself.

00:09:20:25 - 00:09:43:15
Dmitry Kit
Yeah, we noticed that it could already generate regular expressions; they just weren't good. So,

Lori MacVittie
There were good ones?

Dmitry Kit
Well, fair enough. That's a positive, that's great, right? We know that it has knowledge of the topic; we just needed to bias it towards the kind of regular expressions we were interested in.

00:09:43:17 - 00:09:44:23
Joel Moses
Right. Okay.

00:09:44:25 - 00:10:02:29
Lori MacVittie
It almost sounds like, I know fine-tuning is more costly. Certainly less than training a model from the ground up, right? It's like adopting a kid at, you know, 16 instead of raising them up. It costs a lot more when you have to start from scratch and teach them everything than when you get them

00:10:02:29 - 00:10:35:28
Lori MacVittie
and they know mostly everything, and you just want to kind of refine. But it almost sounds like it isn't so much about the cost, because the difference between the cost of doing LoRA versus fine-tuning is significant, but it's not like, oh my goodness, we can never do this. It sounds like it's almost more about risk, and what you're trying to do, and making sure that you're not losing all the capabilities of that model, rather than just: well, it's cheaper.

00:10:36:00 - 00:10:58:25
Dmitry Kit
I mean, that's great, but I would say time is also important. Fully fine-tuning a model of 7 billion parameters is going to take you a while, and it may actually require quite a number of GPUs. With LoRA, you can do that more efficiently, and you can do it in hours versus days.

00:10:58:27 - 00:11:11:21
Dmitry Kit
So for the regular expression example, it took us about six hours to train on 30,000 examples on a single A10G machine, which is under $200.

00:11:11:23 - 00:11:37:11
Joel Moses
Wow.

Lori MacVittie
Wow.

Joel Moses
Now, it also has a downstream impact on cost, or the number of tokens necessary to satisfy the very specific things that you're incorporating, correct? So instead of a general model having a token cost that's much higher for, say, generating regular expressions, LoRA can give you the answers that you're seeking much more accurately, but at a lower token cost, right?

00:11:37:14 - 00:12:09:23
Dmitry Kit
Yes. You can use smaller prompts, which matters if you're self-hosting these things. The larger the prompt, the more cache memory you need, and that prevents you from potentially processing multiple prompts at the same time; it reduces parallelism. So by reducing the prompt, because you don't have to provide all that context, you've already biased the model towards the kind of outputs you care about.

00:12:09:26 - 00:12:45:10
Dmitry Kit
You can actually reduce the context you need to provide, and that reduces memory costs, but also response time. Response time for us was very important, because this regular expression large language model was used in an automated workflow where we submitted hundreds of requests to it. If it was too verbose, you would have to wait quite a while for it to finish generating whatever it was generating and then take the part you needed. By making sure that it was only focusing on generating the kind of tokens that we needed for that project,

00:12:45:12 - 00:12:51:29
Dmitry Kit
we were able to scale this up to hundreds of requests, you know, within a reasonable amount of time.
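
To put rough numbers on the cache-memory point, here is a back-of-the-envelope sketch. The layer count and hidden size are those of a typical 7B architecture, and the prompt lengths are made up to contrast a verbose prompt with a trimmed one.

```python
# Back-of-the-envelope KV-cache size per request for a typical 7B model.
# Dimensions and prompt lengths are illustrative; exact sizes vary by architecture.
layers, hidden, bytes_per_value = 32, 4096, 2           # fp16 keys and values
per_token = 2 * layers * hidden * bytes_per_value        # keys + values for one token
print(per_token / 1e6, "MB per token")                   # roughly 0.5 MB

for prompt_len in (2000, 200):                           # verbose prompt vs. LoRA-trimmed prompt
    gb = prompt_len * per_token / 1e9
    print(prompt_len, "tokens ->", round(gb, 2), "GB of cache per request")
```

Roughly ten times less cache per request means roughly ten times more requests can fit in the same GPU memory, which is the parallelism win being described here.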

00:12:51:29 - 00:13:15:10
Joel Moses
Got it. Talk for a moment about LoRA and its effect on latency. I know that one of the things about LoRA is that you can actually reconstitute, or merge it for inference, and not measurably affect the latency relative to the original model. What other downstream impacts on latency are there?

00:13:15:12 - 00:13:39:01
Dmitry Kit
So it does add latency as a sidecar, because you definitely have to take another path through the network to calculate the weights that you then modulate the original network with. The libraries these days do allow you to merge these networks together, so that you get a merged base model that is now behaving the way you want it to behave.

00:13:39:01 - 00:14:02:03
Dmitry Kit
So you've kind of overwritten the original model. We tend not to want to do that, because keeping these LoRAs as adapters actually allows us to load the original model once, and then we can unload and load new LoRA adapters and change the behavior of the base model based on the use case we're working on at the time.

00:14:02:05 - 00:14:28:23
Dmitry Kit
That way, loading these base models takes a few minutes because they're really large, but LoRAs are actually very, very small, so you can load them within a very reasonable amount of time. And if you're doing multiple use cases, you can flip them in and out and get different behaviors, with, yes, some additional latency. Slight.

00:14:28:25 - 00:14:40:13
Dmitry Kit
But only slight. And then, if you do want the final model to operate at about the same latency, you just merge that in, and now you have a new large language model.
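
Here's a hedged sketch of those two modes with the peft library: keeping the LoRA as a swappable sidecar versus merging it into the base weights for inference. The model name and adapter paths are placeholders.

```python
# Sketch: adapter-as-sidecar vs. merged-for-inference. Paths and names are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # slow, load once

# Mode 1: keep the LoRA separate; adapters are tiny and quick to swap per use case.
model = PeftModel.from_pretrained(base, "adapters/regex-lora", adapter_name="regex")
model.load_adapter("adapters/log-parsing-lora", adapter_name="logs")
model.set_adapter("logs")                 # change behavior without reloading the base model

# Mode 2: fold the active adapter into the base weights; no sidecar pass, no extra latency.
merged = model.merge_and_unload()
merged.save_pretrained("llama-7b-logs-merged")
```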

00:14:40:15 - 00:15:06:12
Joel Moses
Okay. So just to confirm, and to lay this out for the audience: there are multiple modes of operation for a LoRA-type approach. One is to use an adapter, which keeps everything separate. But once you've done your weighting and your decomposition, you can also take that and move it back into the original model, without adapting, without messing with the weights.

00:15:06:15 - 00:15:15:26
Joel Moses
Is that correct? That's basically merging it and allowing it to operate as a model. Not an adapter, but the original model, where it will react the way that you want.

00:15:15:28 - 00:15:16:28
Dmitry Kit
Correct.

00:15:17:01 - 00:15:24:04
Joel Moses
Okay. But adapters do have certain advantages, as you said. They're quicker to load and things like that.

00:15:24:07 - 00:15:51:25
Lori MacVittie
Well, I'm looking at things like agents, right? I'm hearing a lot about how agents might make decisions and choose things they shouldn't. And it almost sounds to me, call me crazy, like LoRA could be a mechanism to more tightly focus the models making decisions in your environment. Maybe you've got processes that are not standard, or you like things done the way you like them done.

00:15:51:27 - 00:16:02:06
Lori MacVittie
You could use LoRA to actually guide them toward the right decision a little bit better without, you know, busting the bank, as it were.

00:16:02:08 - 00:16:29:06
Dmitry Kit
That's a great insight. In our group, we deal with a lot of log data from various customers, and various customers operate in very different ways. So there's actually value in having LoRA adapters for different customers, and then depending on which customer's data you're processing, you may want to load them in or remove them, and so on.

00:16:29:08 - 00:16:55:01
Dmitry Kit
So there are definitely advantages there, because these models can get into the tens of gigabytes, and you don't necessarily want to train or keep track of full models for, you know, 1,000 customers. But adapters, you can kind of have a catalog of them. And the cool thing is that the agent itself could decide, oh, I actually need knowledge about this customer, and load in the adapter it needs.
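
A small sketch of what that per-customer catalog could look like on top of the adapter-swapping shown earlier; the customer IDs and adapter paths are made up for illustration.

```python
# Sketch: a catalog of per-customer LoRA adapters sharing one base model.
# Customer IDs and adapter paths are hypothetical.
ADAPTER_CATALOG = {
    "customer_a": "adapters/customer-a-logs",
    "customer_b": "adapters/customer-b-logs",
}

def activate_adapter_for(model, customer_id):
    """Load the customer's adapter if it isn't resident yet, then make it active."""
    if customer_id not in getattr(model, "peft_config", {}):
        model.load_adapter(ADAPTER_CATALOG[customer_id], adapter_name=customer_id)
    model.set_adapter(customer_id)
    return model
```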

00:16:55:20 - 00:17:00:17
Joel Moses
Okay, selecting the right adapter based on the context of what they're being queried with. That's interesting.

00:17:00:18 - 00:17:28:23
Lori MacVittie
Right: partner, customer, maybe role, group, right? You could start to see applications of this beyond just, hey, we can get our model nailed down a little bit better, but actually in how you're building out, maybe, the future agentic ops that you're going to want one day. This may be one of the ways that we get there a little faster.

Joel Moses
Got it

00:17:28:26 - 00:17:30:28
Dmitry Kit
Yeah.

00:17:31:00 - 00:17:31:19
Joel Moses
That sounds great.

00:17:31:20 - 00:17:56:02
Lori MacVittie
Well, what else did we learn? I mean, we learned there are future applications that are very cool. Not that we see a lot of it yet, but it's a possibility. But with respect to someone who's trying to choose: how do they approach what they need to do with a model, and make a choice between fine-tuning, using the base model, or LoRA? What did we learn?

00:17:56:04 - 00:18:29:02
Joel Moses
Well, from my perspective, I learned a little bit about the underlying implementation. The idea that you can run them as adapters, or you can incorporate them, merge them back into the original model. And that the choice to do so is a trade-off between facility, the ability to react quickly, and latency. And that's a perfectly reasonable choice for a lot of customers to make when they're dealing with the need to produce more accurate output.

00:18:29:04 - 00:18:38:00
Joel Moses
You know, what's one takeaway that you'd like to offer to some of our listeners, Dmitry?

00:18:38:03 - 00:19:06:19
Dmitry Kit
It's that prompt engineering is good, but it does take time, and there's a cost associated with it. There are also costs associated with input and output tokens. And when we start getting into automated workflows, where agents potentially interact with large language models multiple times, hosting your own model may actually start becoming a cost-saving decision, right?

00:19:06:21 - 00:19:38:01
Dmitry Kit
Where you control it. And then, you know, there's the concern: well, what if these models are not as good as ChatGPT, or not as good as these trillion-parameter models? And the answer is LoRA. You don't need hundreds of thousands of examples, and you can often generate them synthetically. And you can use distillation, where you use larger language models to generate data for training.

00:19:38:03 - 00:19:55:03
Dmitry Kit
So getting data is often not even very challenging. And with a few tens of thousands of examples, you can get an adapter to a large language model of 7 billion parameters, for example, that actually behaves at the level of these larger models.
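
A hedged sketch of that distillation step: prompting a larger teacher model to synthesize (prompt, regex) training pairs for the adapter. The teacher model name, seed tasks, and output file are all assumptions for illustration.

```python
# Sketch: distillation-style data generation with a larger "teacher" model.
# Teacher model name, seed tasks, and file name are illustrative assumptions.
import json
from transformers import pipeline

teacher = pipeline("text-generation", model="meta-llama/Llama-2-70b-chat-hf")

seed_tasks = ["match an IPv4 address", "match an ISO-8601 date"]   # thousands in practice

with open("regex_pairs.jsonl", "w") as f:
    for task in seed_tasks:
        prompt = f"Write only a regular expression that will {task}."
        full = teacher(prompt, max_new_tokens=64, do_sample=False)[0]["generated_text"]
        regex = full[len(prompt):].strip()              # keep just the completion
        f.write(json.dumps({"prompt": prompt, "regex": regex}) + "\n")
```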

00:19:55:06 - 00:20:04:18
Joel Moses
And if listeners want to find out a little bit more about LoRA, is there a resource that you found particularly useful that can help them out?

00:20:08:25 - 00:20:20:11
Dmitry Kit
Reach out to our team, and we'll hook you up. I also want to just mention that we're talking about less than $1,000 for the fine-tuning.

Lori MacVittie
Wow.

00:20:20:13 - 00:20:22:28
Joel Moses
That's awesome, that's awesome.

00:20:23:01 - 00:20:45:08
Lori MacVittie
And that's where we're going to wrap it. Under $1,000 is really, really affordable for tuning up your model and getting it a little more focused on what you care about instead of the general internet. So that's a wrap for Pop Goes the Stack. Hit subscribe, so you never have to fake a white paper opinion again.

00:20:45:10 - 00:20:48:21
Lori MacVittie
And I'll keep reading the citations. You keep sounding brilliant.

Creators and Guests

Joel Moses
Host
Distinguished Engineer and VP, Strategic Engineering at F5, Joel has over 30 years of industry experience in the cybersecurity and networking fields. He holds several US patents related to encryption techniques.
Lori MacVittie
Host
Distinguished Engineer and Chief Evangelist at F5, Lori has more than 25 years of industry experience spanning application development, IT architecture, and network and systems operations. She co-authored the CADD profile for ANSI NCITS 320-1998 and is a prolific author with books spanning security, cloud, and enterprise architecture.
Dmitry Kit
Guest
Dmitry Kit, PhD is a Senior Principal Data Scientist at F5. He has over 15 years of experience in machine learning operations and artificial intelligence in both academia and industry. He brings deep expertise in developing scalable machine learning systems, optimizing data-driven workflows, and applying AI to solve complex business problems. He holds a PhD in Computer Science from The University of Texas at Austin and has previously held roles at Amazon and Hitachi Vantara, where he contributed to advancing AI-driven solutions across enterprise environments.