Large language models (LLMs) like GPT-4, BLOOM, and LLaMA have achieved remarkable capabilities by scaling up to billions of parameters. However, deploying these massive models for inference or fine-tuning is challenging due to their immense memory requirements. In this technical blog, we will explore techniques for estimating and optimizing memory consumption during LLM inference and fine-tuning across various hardware setups.
Understanding Memory Requirements
The memory required to load an LLM is primarily determined by the number of parameters and the numerical precision used to store the parameters. A simple rule of thumb is:
- Loading a model with X billion parameters requires roughly 4X GB of VRAM in 32-bit float precision
- Loading a model with X billion parameters requires roughly 2X GB of VRAM in 16-bit bfloat16/float16 precision
For example, loading the 175B parameter GPT-3 model would require approximately 350GB of VRAM in bfloat16 precision. As of today, the largest commercially available GPUs like the NVIDIA A100 and H100 offer only 80GB of VRAM, necessitating tensor parallelism and model parallelism techniques.
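To make the rule of thumb concrete, here is a minimal sketch of a weight-memory estimator (the helper name and the per-dtype byte table are illustrative choices, not part of any library, and it counts only the weights):

```python
# Back-of-envelope estimate of the VRAM needed just to hold the weights.
# It deliberately ignores activations and other runtime buffers.

BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "int8": 1,
    "int4": 0.5,
}

def weight_memory_gb(num_params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate GB of memory required to load the model weights."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(175, "bfloat16"))   # GPT-3 175B  -> ~350 GB
print(weight_memory_gb(15.5, "bfloat16"))  # ~15B model  -> ~31 GB
```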
During inference, the memory footprint is dominated by the model parameters and the temporary activation tensors produced. A high-level estimate for the peak memory usage during inference is the sum of the memory required to load the model parameters and the memory for activations.
Quantifying Inference Memory
Let’s quantify the memory requirements for inference using the OctoCoder model, which has around 15 billion parameters (~31GB in bfloat16). We’ll use the Hugging Face Transformers library to load the model and generate text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Load OctoCoder in bfloat16 and place it on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    pad_token_id=0,
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Question: Please write a Python function to convert bytes to gigabytes.\n\nAnswer:"
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]

def bytes_to_gigabytes(bytes):
    return bytes / 1024 / 1024 / 1024

# Peak GPU memory allocated so far (loading the model + generation)
bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
The measured peak GPU memory usage is around 29GB, which aligns with our estimate of roughly 31GB for loading the model parameters in bfloat16 precision.
Optimizing Inference Memory with Quantization
While bfloat16 is the common precision used for training LLMs, researchers have found that quantizing the model weights to lower precision data types like 8-bit integers (int8) or 4-bit integers can significantly reduce memory usage with minimal accuracy loss for inference tasks like text generation.
Let’s see the memory savings from 8-bit and 4-bit quantization of the OctoCoder model, starting with 8-bit:
```python
# 8-bit quantization
# Free the bfloat16 model and reset the peak-memory counter first, so the
# measurement below reflects only the 8-bit model.
del model, pipe
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder",
    load_in_8bit=True,
    pad_token_id=0,
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
And now with 4-bit quantization:
```python
# 4-bit quantization
# Again free the previous model and reset the peak-memory counter.
del model, pipe
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder",
    load_in_4bit=True,
    low_cpu_mem_usage=True,
    pad_token_id=0,
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):]
bytes_to_gigabytes(torch.cuda.max_memory_allocated())
```
With 8-bit quantization, the memory requirement drops from roughly 31GB to about 15GB, while 4-bit quantization further reduces it to around 9.5GB. This allows running the 15B-parameter OctoCoder model on consumer GPUs like the RTX 3090 (24GB VRAM).
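These measurements line up with a simple back-of-envelope calculation. Assuming a parameter count of roughly 15.5 billion (the post rounds it to 15 billion), the weight-only estimates are:

```python
# Rough weight-only memory estimates, assuming ~15.5B parameters.
num_params = 15.5e9

print(num_params * 2 / 1e9)    # bfloat16: ~31 GB
print(num_params * 1 / 1e9)    # int8:     ~15.5 GB
print(num_params * 0.5 / 1e9)  # 4-bit:    ~7.75 GB
```

The measured 9.5GB for 4-bit sits somewhat above the ~7.75GB of packed weights, most likely because activations and quantization metadata add overhead on top of the weights themselves.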
However, note that more aggressive quantization like 4-bit can sometimes lead to accuracy degradation compared to 8-bit or bfloat16 precision. There’s a trade-off between memory savings and accuracy that users should evaluate for their use case.
Quantization is a powerful technique that can enable LLM deployment on resource-constrained environments like cloud instances, edge devices, or even mobile phones by drastically reducing the memory footprint.
Estimating Memory for Fine-Tuning
While quantization is primarily used for efficient inference, techniques like tensor parallelism and model parallelism are crucial for managing memory requirements during the training or fine-tuning of large language models.
The peak memory consumption during fine-tuning is typically 3-4 times higher than inference due to additional memory requirements for:
- Gradients
- Optimizer states
- Activations from the forward pass stored for backpropagation
A conservative estimate is that fine-tuning an LLM with X billion parameters requires around 4 * (2X) = 8X GB of VRAM in bfloat16 precision.
For example, fine-tuning the 7B parameter LLaMA model would require approximately 7 * 8 = 56GB of VRAM in bfloat16 precision. This already exceeds the capacity of most single GPUs, and larger models quickly outgrow even an 80GB A100 or H100, necessitating distributed fine-tuning techniques.
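The same rule of thumb is easy to script. The helper below is a minimal sketch; the function name and the fixed 4x multiplier are taken directly from the conservative estimate above, not from a precise model of any particular optimizer:

```python
# Conservative fine-tuning memory estimate: weights, gradients, optimizer
# states, and activations are lumped into a single ~4x multiplier over the
# bfloat16 weight memory, matching the 8X-GB rule of thumb above.

def finetune_memory_gb(num_params_billion: float,
                       bytes_per_param: float = 2.0,   # bfloat16 weights
                       overhead_factor: float = 4.0) -> float:
    weight_gb = num_params_billion * 1e9 * bytes_per_param / 1e9
    return overhead_factor * weight_gb

print(finetune_memory_gb(7))     # LLaMA-7B      -> ~56 GB
print(finetune_memory_gb(15.5))  # ~15B OctoCoder -> ~124 GB
```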
Distributed Fine-Tuning Techniques
Several distributed fine-tuning methods have been proposed to overcome GPU memory constraints for large models:
- Data Parallelism: The classic data parallelism approach replicates the entire model across multiple GPUs while splitting and distributing the training data batches. This reduces training time linearly with the number of GPUs but does not reduce the peak memory requirement on each GPU.
- ZeRO Stage 3: An advanced form of data parallelism that partitions the model parameters, gradients, and optimizer states across GPUs. It reduces memory compared to classic data parallelism by keeping only the required partitioned data on each GPU during different phases of training.
- Tensor Parallelism: Instead of replicating the model, tensor parallelism divides the model parameters into rows or columns and distributes them across GPUs. Each GPU operates on a partitioned set of parameters, gradients, and optimizer states, leading to substantial memory savings (see the sketch after this list).
- Pipeline Parallelism: This technique partitions the model layers across different GPUs/workers, with each device executing a subset of the layers. Activations are passed between workers, reducing peak memory but increasing communication overhead.
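To make the tensor parallelism idea concrete, here is a minimal single-process sketch in plain PyTorch. It splits a linear layer's weight matrix along its output dimension into two shards and shows that computing the shards separately and concatenating the results reproduces the full layer's output; in a real setup each shard would live on a different GPU and the gather would be a collective communication step. The shapes and names are illustrative, not taken from any particular library:

```python
import torch

torch.manual_seed(0)

hidden, out_features = 8, 6
x = torch.randn(4, hidden)                  # a batch of activations
weight = torch.randn(out_features, hidden)  # full linear-layer weight

# Split the weight along the output dimension across two "devices".
w_shard_0, w_shard_1 = weight.chunk(2, dim=0)

# Each device computes only its slice of the output...
y0 = x @ w_shard_0.T
y1 = x @ w_shard_1.T

# ...and the slices are gathered to form the full result.
y_parallel = torch.cat([y0, y1], dim=-1)
y_full = x @ weight.T

print(torch.allclose(y_parallel, y_full))  # True: same output, but each
                                           # device holds only half the
                                           # weights, gradients, and
                                           # optimizer states
```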
Estimating memory usage for these distributed methods is non-trivial as the distribution of parameters, gradients, activations, and optimizer states varies across techniques. Moreover, different components like the transformer body and language modeling head may exhibit different memory allocation behaviors.
The LLMem Solution
Researchers recently proposed LLMem, a solution that accurately estimates GPU memory consumption when applying distributed fine-tuning methods to LLMs across multiple GPUs.
LLMem considers factors like recombining parameters before computation (ZeRO Stage 3), output gathering in the backward pass (tensor parallelism), and the different memory allocation strategies for the transformer body and language modeling head.
Experimental results show that LLMem can estimate peak GPU memory usage for fine-tuning LLMs on a single GPU with error rates of up to 1.6%, outperforming the state-of-the-art DNNMem’s average error rate of 42.6%. When applying distributed fine-tuning methods to LLMs with over a billion parameters on multiple GPUs, LLMem achieves an impressive average error rate of 3.0%.