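Sampling, minus the myths: temperature, top-p, repetition_penalty

Three knobs everyone has an opinion about

temperature, top_p, repetition_penalty. These three sampling parameters sit at the top of nearly every chat API call signature, accumulate Twitter folklore at a furious pace, and are mostly understood through hearsay. Let's look at what each one actually does to the logits the model produces, and at when each one helps, hurts, or is just superstition.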

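temperature: sharpen or soften the distribution

Right before sampling, the model produces a logit (a real-valued score) for every token in the vocabulary. Softmax converts those scores into a probability distribution. temperature divides the logits before the softmax: a low temperature sharpens the distribution (the most likely token becomes more dominant), and a high temperature flattens it (rare tokens get a real chance).

temperature=0 is greedy decoding: always pick the argmax. Implementations special-case it, since dividing by zero is undefined. Reproducible, sometimes repetitive, rarely surprising.
temperature=0.7-1.0 is the default zone for natural-feeling generation. Many chat APIs default to a value in this range.
temperature ≥ 1.5 starts to feel creative-adjacent, then quickly turns incoherent. Useful for brainstorming; bad for instruction-following.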
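To make the "divide the logits, then softmax" step concrete, here is a minimal standard-library sketch. The function name softmax_with_temperature and the four-token logits are made up for illustration; real samplers do the same arithmetic on tensors over the full vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        # temperature=0 is handled as plain argmax (greedy), not by dividing.
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Running it, the first token's probability should shrink as the temperature rises, while the last token's probability grows.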

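top_p: truncate the distribution before sampling

top_p (also called nucleus sampling) keeps only the smallest set of tokens whose cumulative probability is at least p, and samples from that set. Setting top_p=0.95 means "sample from the top tokens that together cover 95% of the probability mass; ignore the long tail." It is the complementary lever to temperature: temperature reshapes the distribution; top_p clips it.

A combination many chat setups use is temperature=0.7, top_p=0.95 (exact defaults vary by provider): moderate sharpening, plus a trim that guards against the worst nonsense in the tail.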
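The clipping step can be sketched in a few lines. This is an illustrative standard-library version, not the code of any particular library; top_p_filter and the five-token distribution are hypothetical names and values.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability >= top_p.

    Returns a dict {token_id: renormalized probability} for the kept tokens.
    """
    # Visit token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return {i: probs[i] / mass for i in kept}

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # token ids that survive the cut
```

With these numbers the first two tokens cover 0.8 of the mass, which is short of 0.9, so the third token is also kept and the last two are dropped.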

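repetition_penalty: stop echoing recent tokens

When a model falls in love with one word and starts saying it on every line, repetition_penalty scales down the logits of recently emitted tokens by a small factor (typical values 1.05-1.15). In common implementations that means dividing positive logits by the factor and multiplying negative ones, so the token becomes less likely either way. Use it on small instruct models that get stuck in loops. Don't use it for creative writing where some repetition is intentional.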
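A minimal sketch of the penalty, assuming the common convention of dividing positive logits and multiplying negative ones. The function name apply_repetition_penalty and the sample values are hypothetical; libraries differ in how many recent tokens they look at.

```python
def apply_repetition_penalty(logits, recent_token_ids, penalty=1.1):
    """Return a copy of logits with recently emitted tokens made less likely."""
    out = list(logits)
    # Each token is penalized once, no matter how often it appeared recently.
    for tid in set(recent_token_ids):
        if out[tid] > 0:
            out[tid] /= penalty   # shrink a positive logit toward zero
        else:
            out[tid] *= penalty   # push a negative logit further down
    return out

# Hypothetical logits for a 4-token vocabulary; tokens 0 and 2 were just emitted.
logits = [3.0, 1.0, -2.0, 0.5]
recent = [0, 2, 0]
print(apply_repetition_penalty(logits, recent, penalty=1.1))
```

Tokens 1 and 3 pass through unchanged; only the recently seen ids move, and both move downward.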

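What this lesson asks of you

Run the same prompt three times with different sampling settings. Notice that the temperature=0 output is reproducible, deterministic, and slightly boring. Notice that the high-temperature output is creative or unhinged depending on luck. Notice that adding top_p tames the worst of the high-temperature output without flattening it back to greedy. Once you watch it happen, the folklore dies.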

Code

Three sampling regimes: same prompt, watch the variance·python

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
prompt = "Write one short imaginative sentence about clouds:"

print("--- temp=0 (greedy, reproducible) ---")
print(generate(model, tok, prompt=prompt, max_tokens=25,
               sampler=make_sampler(temp=0.0), verbose=False))

print()
print("--- temp=0.7, top_p=0.95 (the default zone) ---")
print(generate(model, tok, prompt=prompt, max_tokens=25,
               sampler=make_sampler(temp=0.7, top_p=0.95), verbose=False))

print()
print("--- temp=1.5, top_p=0.95 (creative-adjacent) ---")
print(generate(model, tok, prompt=prompt, max_tokens=25,
               sampler=make_sampler(temp=1.5, top_p=0.95), verbose=False))

# Verified outputs (2026-05-03, Llama-3.2-1B-Instruct-4bit):
#   temp=0    : "As the sun set over the rolling hills, a lone cloud drifted lazily..."  (cliche, but stable)
#   temp=0.7  : something natural-sounding, varies per run
#   temp=1.5  : "They danced on the sun-kissed rooftops in intricate waltz patterns."   (creative)

Reproducibility: seed first, then sample·python

import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
sampler = make_sampler(temp=0.7, top_p=0.95)

mx.random.seed(123)
out1 = generate(model, tok, prompt="One sentence about Mars:", max_tokens=20, sampler=sampler, verbose=False)

mx.random.seed(123)
out2 = generate(model, tok, prompt="One sentence about Mars:", max_tokens=20, sampler=sampler, verbose=False)

assert out1 == out2, "Seeded sampling should be reproducible"
print("Seeded twice, identical output:")
print(out1)

Exercise

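Run the three-regime block. Then add a fourth call with temp=2.5, top_p=1.0 (aggressive temperature, no top-p clipping) and observe the output: is it creative, or just incoherent? Now add a fifth with temp=2.5, top_p=0.9 and compare. The point of the exercise is to get a feel for top_p as a guardrail you can tune independently of temperature, and for the fact that very high temperatures need a tighter top_p to stay readable.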
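Before running the real calls, a toy calculation can show why the fifth setting behaves differently from the fourth. The sketch below counts how many tokens remain eligible under each setting. The logits are a made-up, steadily decaying 1000-token vocabulary, not taken from any model, so only the ordering of the counts carries over, not the numbers themselves.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p):
    """Number of tokens in the smallest set whose cumulative mass >= top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # rounding kept the sum just under top_p: keep everything

# Hypothetical logits: each token scores 0.5 lower than the one before it.
logits = [-0.5 * i for i in range(1000)]

for temp, top_p in [(0.7, 0.95), (2.5, 1.0), (2.5, 0.9)]:
    size = nucleus_size(softmax(logits, temp), top_p)
    print(f"temp={temp}, top_p={top_p}: {size} tokens eligible")
```

On this toy distribution, temp=2.5 with top_p=1.0 leaves far more tokens eligible than temp=2.5 with top_p=0.9, which is the guardrail effect the exercise is meant to make visible.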