Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)
As part of my experiments, I added a cosine annealing scheduler to the LoRA finetuning scripts and observed that it noticeably improved SGD performance. A limitation of my experiments is that I explored only two settings: LoRA enabled for just the query and value weight matrices, and LoRA enabled for all layers.

As the original LoRA paper outlines, LoRA introduces an additional scaling coefficient, alpha, for applying the LoRA weights to the pretrained weights during the forward pass. Choosing alpha as two times r is a common rule of thumb when using LoRA for LLMs, but I was curious whether this still holds for larger r values.

One of the main takeaways is that LoRA allows us to finetune 7B-parameter LLMs on a single GPU. In this particular case, using QLoRA with the best setting (17.86 GB with AdamW), training takes about 3 hours for 50k training examples.

It's worth noting that if memory is a concern, LoRA can also be used for further pretraining existing pretrained LLMs on domain-specific datasets.
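The cosine annealing setup mentioned above is straightforward to reproduce. As a minimal sketch (the `Linear` toy model here is a hypothetical stand-in for the trainable LoRA parameters, and the hyperparameter values are illustrative, not the ones from my runs), PyTorch's built-in `CosineAnnealingLR` can be attached to the SGD optimizer like so:

```python
import torch

# Hypothetical stand-in for the trainable LoRA parameters of a larger model
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# T_max is the number of scheduler steps over which the learning rate
# decays from the initial lr toward eta_min (default 0) along a cosine curve
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(100):
    optimizer.zero_grad()
    # Dummy objective; in the finetuning scripts this would be the LM loss
    loss = model(torch.randn(4, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate after each optimizer step
```

In practice you would call `scheduler.step()` once per batch (or per epoch, depending on how `T_max` is defined), so the learning rate glides down smoothly instead of staying flat, which is what seemed to help SGD here.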
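To make the scaling coefficient concrete, here is a minimal sketch of a LoRA update module following the parameterization in the original paper: the low-rank product is scaled by alpha / r before being added to the frozen layer's output. The class name and initialization details are my own illustrative choices, not code from the finetuning scripts:

```python
import torch

class LoRALayer(torch.nn.Module):
    """Minimal LoRA sketch: adds scaling * (x @ A @ B) to a frozen layer's output."""

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A gets a small random init, B starts at zero so the LoRA update
        # is a no-op at the beginning of finetuning
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        # alpha = 2 * rank is the common rule of thumb discussed above
        self.scaling = alpha / rank

    def forward(self, x):
        return self.scaling * (x @ self.A @ self.B)
```

With alpha fixed at twice the rank, the effective scaling stays constant at 2 regardless of r, which is exactly why it is worth checking empirically whether that heuristic still behaves well as r grows.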