QLoRA: 70B on a Single GPU

Three innovations

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit quantization. It rests on three key ideas:

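1. NF4 (NormalFloat4) quantization

A 4-bit data type optimized for normally distributed weights. Neural network weights are usually close to Gaussian, so NF4 allocates its quantization levels in proportion to the Gaussian distribution. That gives higher precision near 0, where most weights cluster.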
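
To make the idea of Gaussian-proportional levels concrete, here is a minimal sketch that places 16 levels at evenly spaced quantiles of a standard normal and quantizes one block of weights with absmax scaling. It is a simplified illustration, not the exact NF4 codebook: the real data type is built slightly asymmetrically so that 0 is represented exactly, and the function names below are made up for this example.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Return 2**bits levels at evenly spaced quantiles of N(0, 1),
    rescaled so the outermost levels sit at about -1 and +1."""
    n = 2 ** bits
    normal = NormalDist()
    # Midpoint probabilities (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    quantiles = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(weights, levels):
    """Map each weight to the index of its nearest level after absmax scaling."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look each 4-bit code up in the level table and undo the scaling."""
    return [levels[c] * absmax for c in codes]


block = [0.01, -0.02, 0.3, -0.9]
levels = normal_quantile_levels()
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
```

The gap between neighbouring levels is smallest around 0 and widest at the tails, which is the property the paragraph above describes.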

2. Double quantization

The quantization scaling constants are themselves quantized (FP32 → FP8), saving roughly 0.37 additional bits per parameter.
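
The ~0.37-bit figure can be reproduced with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block); those numbers are not stated in this lesson, and the function names are illustrative only.

```python
def single_quant_overhead(block_size=64, const_bits=32):
    """Bits of overhead per parameter with one FP32 constant per block."""
    return const_bits / block_size


def double_quant_overhead(block_size=64, second_block_size=256,
                          quantized_const_bits=8, second_const_bits=32):
    """Overhead when the constants are stored in 8 bits, plus one FP32
    constant for every `second_block_size` first-level constants."""
    first_level = quantized_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


before = single_quant_overhead()   # 32 / 64 = 0.5 bits per parameter
after = double_quant_overhead()    # 8 / 64 + 32 / (64 * 256)
saving = before - after
print(f"overhead: {before:.3f} -> {after:.3f} bits/param (saves {saving:.3f})")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, which for a 70B model works out to roughly 3 GB.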

3. Paged optimizer

Uses CUDA unified memory to absorb GPU memory spikes during gradient computation. When GPU memory fills up, optimizer states are temporarily paged out to CPU RAM.
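
In practice you rarely implement paging yourself. One common way to turn it on, assuming you train with the Hugging Face `Trainer` and have bitsandbytes installed, is to pick a paged optimizer by name. The output directory and batch settings below are placeholder values for illustration, not recommendations from this lesson.

```python
def make_training_args(output_dir="qlora-out"):
    """Build TrainingArguments that select a paged AdamW optimizer."""
    # Imported lazily so this snippet can be read without the libraries installed.
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,            # placeholder path
        optim="paged_adamw_32bit",        # optimizer state can spill to CPU RAM
        gradient_checkpointing=True,      # trades compute for activation memory
        per_device_train_batch_size=1,    # placeholder value
        gradient_accumulation_steps=16,   # placeholder value
        bf16=True,                        # assumes a GPU with bfloat16 support
    )
```

`paged_adamw_8bit` is another accepted value for `optim` if optimizer state memory is still too large.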

Memory savings

Model	Full FT (fp16)	LoRA (fp16)	QLoRA (4-bit)
7B	~60 GB	~18 GB	~6 GB
13B	~120 GB	~32 GB	~10 GB
70B	~600 GB	~160 GB	~36 GB

QLoRA brought 7B fine-tuning onto a single RTX 3090/4090 (24 GB) and 70B fine-tuning onto a single A100 (80 GB). Few techniques have done as much to make fine-tuning accessible.
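
As a rough sanity check on the table, the memory taken by the frozen base weights alone is just parameter count times bits per parameter. The sketch below assumes 16 bits for fp16 and about 4.127 bits for NF4 (4 bits plus the quantization-constant overhead the QLoRA paper reports with double quantization). It deliberately ignores adapters, gradients, optimizer state and activations, so it gives a lower bound rather than the table's totals.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n, 16)
    nf4 = weight_memory_gb(n, 4.127)  # assumed: 4 bits + ~0.127 bits overhead
    print(f"{name}: fp16 weights ~{fp16:.0f} GB, 4-bit weights ~{nf4:.1f} GB")
```

Whatever the table shows beyond these weight-only numbers has to come from the training-time extras, and that share grows with batch size and sequence length.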

Code

QLoRA setup: bitsandbytes + PEFT (Python)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # double quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training: freezes the base weights,
# upcasts the remaining non-quantized parameters (e.g. layer norms) to fp32
# for numeric stability, and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # recommended for QLoRA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~42M || all params: ~8B || trainable%: 0.52%
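
The ~42M figure in the comment above can be checked by hand. This sketch assumes the published Llama-3.1-8B shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers) and that `"all-linear"` adapts the seven linear projections in each decoder layer while skipping the output head. The helper name is illustrative.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in) and B (out x r) to each adapted linear layer."""
    return r * (in_features + out_features)


# Assumed Llama-3.1-8B dimensions (not stated in this lesson).
HIDDEN, KV_DIM, INTERMEDIATE, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for each linear module in one decoder layer.
linear_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, r=16) for i, o in linear_shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 41,943,040
```

Doubling `r` doubles this count, so rank is the main knob for trading adapter capacity against memory.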
