Quiz · 6 questions

🔀 Mixture of Experts (MoE)

Many experts per layer, only a few fire per token: frontier capacity at an affordable compute cost.


01 DeepSeek-V3 has 671B total parameters, with 37B active per token. What does that actually mean?

Hint

Memory is *what could be touched*; compute is *what was touched*.
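The hint can be turned into rough arithmetic. The sketch below uses only the two figures from the question (671B total, 37B active); the 1-byte-per-parameter storage format and the "about 2 FLOPs per active parameter per token" rule of thumb are assumptions for illustration, not figures from this page.

```python
# Figures taken from the question: 671B total parameters, 37B active per token.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9

# Assumption for illustration only: weights stored at 1 byte per parameter.
BYTES_PER_PARAM = 1


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters that a single token actually passes through."""
    return active / total


def weight_memory_gb(total: float, bytes_per_param: float) -> float:
    """Memory needed to hold ALL weights, since any expert may be selected."""
    return total * bytes_per_param / 1e9


def approx_forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token
    (one multiply and one add), ignoring attention over the context."""
    return 2 * active


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    print(f"weights to hold: {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM):.0f} GB")
    print(f"forward FLOPs/token: {approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

Under these assumptions, memory scales with the 671B that could be touched, while per-token compute scales with the 37B that was touched, roughly 18 times less than a dense model of the same total size.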

02 What is the only structural difference between a dense Transformer and an MoE Transformer?

Hint

Print the two architectures side by side and only one box will differ.
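One way to picture the box that differs is a toy feed-forward step, sketched below in plain Python. Everything here is hypothetical and simplified: the experts are scalar functions instead of neural sub-networks, the router scores are passed in by hand instead of being computed by a learned layer, and the names `dense_ffn` and `moe_ffn` are made up for this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dense_ffn(x, ffn):
    """Dense block: every token goes through the single feed-forward network."""
    return ffn(x)


def moe_ffn(x, experts, router_scores, k):
    """MoE block: pick the k highest-scoring experts for this token and
    mix their outputs, weighted by a softmax over the selected scores."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_scores[i],
                 reverse=True)[:k]
    gates = softmax([router_scores[i] for i in top])
    output = sum(g * experts[i](x) for g, i in zip(gates, top))
    return output, top


if __name__ == "__main__":
    # Four toy "experts": each just scales its input by a different factor.
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    out, chosen = moe_ffn(1.0, experts, [0.1, 2.0, 0.3, 2.0], k=2)
    print(chosen, out)
```

In this sketch the attention, residual, and normalization parts of a layer are deliberately left out, since the comparison is only about the feed-forward slot.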

03 Why is 'expert collapse' a problem in MoE training?

Hint

A rich-get-richer failure mode.

04 DeepSeek-R1 and DeepSeek-V3 share the same MoE backbone (671B-A37B, 256+1 experts, top-8). What makes R1 a 'reasoning model'?

Hint

Same wiring, different schooling.

05 What did DeepSeek-V3 introduce as an alternative to the standard auxiliary load-balancing loss?

Hint

The trick is to make the router *naturally* prefer underused experts, without *telling* it to.
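A minimal sketch of one way to get that effect, loosely following the bias-based balancing idea described in the DeepSeek-V3 technical report: a per-expert bias is added to the router scores only when selecting experts, and is nudged after each batch according to load. The function names, the fixed step size, and the plain score normalization used for gating are assumptions made for this illustration, not the model's actual implementation.

```python
def route_top_k(scores, bias, k):
    """Select the k experts with the highest (score + bias).
    The bias influences selection only."""
    return sorted(range(len(scores)),
                  key=lambda i: scores[i] + bias[i],
                  reverse=True)[:k]


def gate_weights(scores, chosen):
    """Mixing weights come from the ORIGINAL scores of the chosen experts,
    so the bias never leaks into the output (assumes positive scores)."""
    total = sum(scores[i] for i in chosen)
    return [scores[i] / total for i in chosen]


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step_size)
        elif l < mean_load:
            new_bias.append(b + step_size)
        else:
            new_bias.append(b)
    return new_bias


def balance_step(batch_scores, bias, k, step_size):
    """Route one batch of tokens, count per-expert load, then adjust the bias."""
    load = [0] * len(bias)
    for scores in batch_scores:
        for i in route_top_k(scores, bias, k):
            load[i] += 1
    return update_bias(bias, load, step_size), load


if __name__ == "__main__":
    # Four tokens that all prefer expert 0, with expert 1 a close second.
    batch = [[1.0, 0.9, 0.2, 0.1] for _ in range(4)]
    bias = [0.0, 0.0, 0.0, 0.0]
    bias, load = balance_step(batch, bias, k=1, step_size=0.1)
    print(load, bias, route_top_k(batch[0], bias, 1))
```

No balancing term is added to the training loss in this sketch; the only pressure toward balance is the bias adjustment between batches.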

06 True or false: 'Each expert in an MoE model specializes in a semantic topic (e.g., one in science, one in code).'

Hint

Reality is more mundane than the marketing diagram.
