Meta's new Llama 4 models can now be fine-tuned and run with Unsloth. Llama 4 Scout (17B, 16 experts) is the best model for its size, with a 10M context window. Llama 4 Maverick (17B, 128 experts) surpasses GPT-4o and rivals DeepSeek V3 in reasoning and coding.
Unsloth makes Llama-4-Scout (109B) training fit in 71GB of VRAM, and it is the only framework that supports QLoRA 4-bit training of Llama 4. Scout will fit on a single 80GB H100 GPU.
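For illustration, a minimal QLoRA sketch with Unsloth might look like the following. The repo id, sequence length, and LoRA hyperparameters here are assumptions for the example, not the exact settings used in our benchmarks:

```python
# Minimal QLoRA sketch with Unsloth (illustrative; repo id and
# hyperparameters are assumptions, not the benchmarked configuration).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-4-Scout-17B-16E-Instruct",  # assumed repo id
    max_seq_length=8192,
    load_in_4bit=True,   # QLoRA: base weights quantized to 4-bit
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",  # trades compute for VRAM on long contexts
)
```

From here, the PEFT-wrapped model can be passed to a standard TRL `SFTTrainer` loop as usual.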
Unsloth makes Llama 4 fine-tuning 1.5x faster, with 50% less VRAM use and 8x longer context lengths than environments with Flash Attention 2.
We uploaded Llama 4 to Hugging Face here, including dynamic GGUF, dynamic 4-bit, and 16-bit versions. The 16-bit and FP8 versions run anywhere (e.g., vLLM), but the 4-bit and 8-bit versions only work in Unsloth for training and inference.
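As a rough sketch, serving a 16-bit upload with vLLM could look like this; the repo id and parallelism setting are assumptions, since 16-bit Scout does not fit on a single GPU:

```python
# Rough vLLM inference sketch for a 16-bit Llama 4 Scout upload
# (repo id and tensor_parallel_size are illustrative assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Llama-4-Scout-17B-16E-Instruct",  # assumed 16-bit repo id
    tensor_parallel_size=4,  # 16-bit Scout needs multiple GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize mixture-of-experts routing in two sentences."], params)
print(out[0].outputs[0].text)
```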
Unsloth now also supports EVERYTHING*, including: full fine-tuning, 8-bit training, pretraining, ALL transformer-style models (Mixtral, MoE, Cohere, etc.) and ANY training algorithm, such as GRPO with VLMs.
Performance benchmarks
| Model | VRAM | 🦥 Unsloth speed | 🦥 VRAM reduction | 🦥 Longer context | 🤗 Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 4 Scout | 80GB | 1.5x | >50% | 8x longer | OOM |
We tested Llama-4-Scout-Instruct on an 80GB A100 and did 4-bit QLoRA on all linear layers (Q, K, V, O, gate, up, and down) with rank = 32 and a batch size of 1. We padded all sequences to a fixed maximum sequence length to mimic long-context fine-tuning workloads.
QLoRA fine-tuning for Llama 4 Scout in 4-bit precision is currently only supported by 🦥Unsloth. Other frameworks do not yet support 4-bit fine-tuning for Llama 4, resulting in significantly higher VRAM usage (300GB+) and frequent out-of-memory (OOM) errors. To ensure a fair comparison, we benchmarked Llama 4 Scout with LoRA instead of QLoRA across all platforms; otherwise, Unsloth would appear to use up to 4x less VRAM thanks to proper QLoRA support.
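To make the padded long-context workload concrete, it can be approximated with a standard Hugging Face tokenizer call; the dataset column name and sequence length below are assumptions for illustration, not part of our benchmark code:

```python
# Sketch of the padded-sequence setup used to mimic long-context workloads:
# every example is padded to the same fixed length, so each training step
# exercises the full context window. Column name and length are assumptions.
def tokenize_to_fixed_length(batch, tokenizer, max_seq_length=8192):
    return tokenizer(
        batch["text"],             # assumed dataset column
        truncation=True,
        padding="max_length",      # pad every sequence to max_seq_length
        max_length=max_seq_length,
    )
```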
💕 Thank you!
A huge thank you to everyone for using & sharing Unsloth - we really appreciate it. And of course, thank you to Meta for building the models. 🙏