Finetune Llama 3 - 2x faster + 6x longer context + 68% less VRAM

Apr 23, 2024 • By Daniel & Michael


| Model | GPU | Speed | VRAM |
| --- | --- | --- | --- |
| Llama-3 8B | 1x L4 24GB | 205% faster | -63% VRAM |
| Llama-3 70B | 1x A100 80GB | 183% faster | -68% VRAM |

You can now finetune Meta’s latest Llama 3 (8B) model 2x faster and use 63% less memory than Flash Attention 2 (FA2) + Hugging Face (HF). Llama 3 (70B) is 1.8x faster and uses 68% less VRAM.

On 1xA100 80GB GPU, Llama-3 70B with Unsloth can fit 48K total tokens (8192 * bsz of 5) vs 7K tokens without Unsloth. That's 6x longer context lengths!

We uploaded a Colab notebook to finetune Llama-3 8B on a free Tesla T4: Llama-3 8b Notebook. We also uploaded pre-quantized 4bit models for 4x faster downloading to our Hugging Face page which includes Llama-3 70b Instruct and Base in 4bit form.
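
If you want to load the pre-quantized weights yourself, a minimal sketch looks like the following (the repo id below is the 8B 4bit upload and is an assumption here; double-check the exact name on our Hugging Face page):

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4bit Llama-3 8B (downloads ~4x faster than the full bf16 weights).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",  # repo id assumed; see our Hugging Face page
    max_seq_length = 2048,
    dtype = None,          # auto-detect: float16 on T4/V100, bfloat16 on Ampere+
    load_in_4bit = True,
)
```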

Someone from our community tested LoRA fine-tuning of bf16 Llama 3 8B and it only used 16GB of VRAM.

P.S. Don't forget to ⭐ Star us on GitHub and join our Discord server ❤️

Llama 3 performance benchmarks

| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 (baseline) |
| --- | --- | --- | --- | --- | --- |
| Llama-3 8B | 24GB | 2x | 63% | 3x longer | 1x |
| Llama-3 70B | 80GB | 1.8x | 68% | 6x longer | 1x |
We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down).
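
For reference, here is a minimal sketch of that benchmark-style setup, assuming the usual Unsloth + TRL recipe from our notebooks (the learning rate, step count and toy dataset below are purely illustrative; only the rank, target modules, batch size and gradient accumulation match the benchmark description):

```python
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Rank-32 QLoRA adapters on all linear layers, as in the benchmarks.
model = FastLanguageModel.get_peft_model(
    model,                              # the 4bit model loaded earlier
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

# Tiny stand-in dataset with a "text" column; the benchmarks used the Alpaca dataset,
# formatted into a single text field per example.
dataset = Dataset.from_dict({"text": ["### Instruction:\nSay hi.\n\n### Response:\nHi!"] * 64})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,   # batch size of 2
        gradient_accumulation_steps = 4,   # gradient accumulation steps of 4
        learning_rate = 2e-4,              # illustrative
        max_steps = 60,                    # illustrative
        logging_steps = 1,
        output_dir = "outputs",
    ),
)
trainer.train()
```
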
🦙 6x longer context lengths
By using Unsloth’s latest long context support, Llama-3 70B can now easily fit on a 48GB GPU card, allowing you to finetune on ~7K context lengths, whilst HF + FA2 might only allow you to finetune on sequence lengths of 2, or will simply OOM.

On an A100 80GB SXM machine, Unsloth allows 6x longer context lengths with only +1.9% overhead, letting you finetune on 48K sequence lengths vs 7.5K. Below is the VRAM vs context length data we gathered experimentally, showing the stark advantage of Unsloth over HF + FA2 for long context finetuning.
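
As a rough sketch of the long-context setup (not the exact benchmark script), the main knobs are the `max_seq_length` you pass at load time and Unsloth's offloaded gradient checkpointing, enabled via `use_gradient_checkpointing = "unsloth"`; the 70B repo id below is again an assumed 4bit upload name, so check it against our Hugging Face page:

```python
from unsloth import FastLanguageModel

# Long-context QLoRA sketch for Llama-3 70B on an 80GB A100.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-70b-bnb-4bit",   # repo id assumed
    max_seq_length = 48000,                        # ~48K tokens, ~6x what HF + FA2 fits
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",        # offloads activations to save VRAM
)
```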

Llama 3 (70B) max. context length

| GPU VRAM | Unsloth (new) | Unsloth (old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 48 GB | 7,698 | 2,875 | OOM |
| 80 GB | 48,053 | 18,332 | 7,433 |
In all our experiments, we used QLoRA with a rank of 32 and applied LoRA adapters to all linear layers (q, k, v, o, gate, up, down). We used a batch size of 1, and repeated the data to fill the maximum context window.
🦙 Llama 3 (8B) finetuning fits in 8GB
By using a batch size of 1 and a LoRA rank of 32 on all linear layers, HF + FA2 unfortunately fails or OOMs on 8GB GPU cards (it needs ~9GB of memory), whilst Unsloth comfortably allows 2K context lengths. On a 24GB consumer card, Unsloth allows 20K context lengths, or 3.5x longer contexts than HF + FA2.

The table below shows context length vs VRAM consumption, tested on an L4 GPU via Colab.

Llama 3 (8B) max. context length

| GPU VRAM | Unsloth (new) | Unsloth (old) | Hugging Face + FA2 |
| --- | --- | --- | --- |
| 8 GB | 1,983 | 1,594 | OOM |
| 12 GB | 6,638 | 5,352 | 1,044 |
| 16 GB | 11,292 | 9,110 | 2,663 |
| 24 GB | 20,601 | 16,626 | 5,901 |
| 40 GB | 39,219 | 31,657 | 12,377 |
| 48 GB | 48,528 | 39,172 | 15,615 |
| 80 GB | 85,765 | 69,235 | 28,567 |
🦙 Llama 3 Quirks
There are a few weird “bugs” and quirks with Llama-3 as well! First, the tokenizer seems not to add the BOS token, unlike Llama-2. Hugging Face added a fix today, and we quickly resolved it inside Unsloth! We tested both scenarios and saw virtually no difference between adding and not adding the BOS token.

A more unfortunate “bug” or quirk is that Llama-3’s base (not instruct) model has untrained tokens, namely:

<|reserved_special_token_{0->250}|>
<|eot_id|>
<|start_header_id|>
<|end_header_id|>
We tweeted about this a few days ago here. Essentially, if you use these untrained tokens (for example by applying the instruct chat template to the base model), the gradients will be NaN. As first noticed by Geronimo, the fix is simply to set these untrained tokens to the mean embedding vector.

However, from our investigations, you cannot simply take the mean as-is, since it is biased by the untrained rows. You must first set the untrained token embeddings to 0 (in bfloat16 these vectors are not exactly 0 but around 1e-23), then sum all rows, and finally divide by the number of trained tokens (total tokens minus untrained tokens). We found 287 untrained tokens in total.
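
As a rough illustration of that procedure (a hedged sketch, not the exact code inside Unsloth; the 1e-16 detection threshold and applying the same fix to the lm_head are our assumptions here):

```python
import torch

# Fix untrained token rows by setting them to the mean of the *trained* rows only.
with torch.no_grad():
    embed_matrix   = model.get_input_embeddings().weight    # (vocab_size, hidden_dim)
    lm_head_matrix = model.get_output_embeddings().weight   # (vocab_size, hidden_dim)

    # Untrained rows are ~1e-23 in bfloat16 rather than exactly 0, so use a small threshold.
    untrained = embed_matrix.abs().max(dim = 1).values < 1e-16
    n_trained = (~untrained).sum()

    for W in (embed_matrix, lm_head_matrix):
        W[untrained] = 0                          # zero untrained rows so they don't bias the sum
        mean_vector = W.sum(dim = 0) / n_trained  # divide by the number of trained tokens only
        W[untrained] = mean_vector.to(W.dtype)
```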

Unsloth’s new release now automatically fixes this for you during finetuning.
💕 Thank you! 
Feel free to support us via our Ko-fi donation page. Huge shout out to: h3n0r1k (once again thank you), Jascha, safetyBot, Patleeman, Alberto, Pichet, Tseng, Stephen, abhi, sumak, Anoop, lhl & fefo who are new supporters! 🙏

As always, be sure to join our Discord server for help or just to show your support! You can also follow us on Twitter and Substack.
Thank you for reading!
Daniel & Michael Han 🦥
23 April 2024

Phi 3 support soon...
