Gradient accumulation

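The problem: a batch size you can't afford

Some models, transformers in particular, train best at large batch sizes, but a large batch does not always fit in GPU memory. Gradient accumulation decouples the statistical batch from the physical batch: run several small micro-batches, sum their gradients, and call the optimizer once only when that sum represents the batch you wanted. The weight update is (nearly) identical to what a single pass over the large batch would produce.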
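To make the "nearly identical" claim concrete, here is a minimal sketch in plain Python with no framework. It assumes a toy one-weight model (pred = w * x) with a mean-squared-error loss and equal-sized micro-batches. The data and the helper name mse_grad are made up for illustration. Under those assumptions, the gradient of one batch of 8 matches the accumulated gradient of four micro-batches of 2, up to floating-point rounding.

```python
# Toy model: pred = w * x, loss = mean squared error over the batch.
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]          # illustrative inputs
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]       # illustrative targets

# One large batch of 8 examples.
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2 examples, each gradient scaled by 1/4 and summed.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated = 0.0
for i in range(accumulation_steps):
    lo = i * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)
```

The match depends on the micro-batches being the same size. In a real model, anything that computes statistics per batch (batch normalization, for example) sees only the micro-batch, which is one reason the equivalence is "nearly" rather than exact.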

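Keeping the math honest

Two details make this correct. First, divide each micro-batch's loss by accumulation_steps so the accumulated gradient is an average rather than a sum; otherwise the effective learning rate grows by a factor equal to the number of accumulation steps. Second, apply and reset at the same boundary so each optimizer step consumes exactly N micro-batches. Below is the manual-loop form written with a TensorFlow tape; keep the accumulator as a model attribute and fold the logic into train_step(), and you have the production version.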
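The learning-rate point is easy to check by hand. The sketch below assumes plain SGD (w = w - lr * grad) and uses made-up per-micro-batch gradients. The function sgd_step is a hypothetical helper, not a library call. It shows that forgetting the division gives the same update as keeping it and multiplying the learning rate by N.

```python
def sgd_step(w, grad, lr):
    """One plain SGD update."""
    return w - lr * grad


micro_grads = [0.8, 1.2, 1.0, 1.0]  # hypothetical per-micro-batch gradients
N = len(micro_grads)
lr = 0.1
w0 = 1.0

averaged = sum(g / N for g in micro_grads)  # loss was divided by N
summed = sum(micro_grads)                   # division was forgotten

w_sum = sgd_step(w0, summed, lr)            # step taken with the raw sum
w_big_lr = sgd_step(w0, averaged, lr * N)   # averaged gradient, lr scaled by N

print(w_sum, w_big_lr)
```

With adaptive optimizers the effect of a constant gradient scale can look different, so treat this as an illustration of the SGD case only.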

Code

Gradient accumulation (TensorFlow backend)·python

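import tensorflow as tf

# Assumes model, loss_fn, optimizer, and dataset are already defined.
accumulation_steps = 4  # Effective batch = micro-batch size x 4
accumulated = None  # Running sum of per-variable gradients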

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        # Divide so the summed gradient is an AVERAGE
        loss = loss_fn(y, y_pred) / accumulation_steps

    grads = tape.gradient(loss, model.trainable_variables)

    # Accumulate
    if accumulated is None:
        accumulated = grads
    else:
        accumulated = [a + g for a, g in zip(accumulated, grads)]

    # Apply every N micro-batches, then reset
    if (step + 1) % accumulation_steps == 0:
        optimizer.apply_gradients(
            zip(accumulated, model.trainable_variables)
        )
        accumulated = None
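One thing the loop above leaves open: if the dataset yields a number of micro-batches that is not a multiple of accumulation_steps, the last partial group sits in accumulated and is never applied. Dropping it is one reasonable choice. Another is to flush it with a rescale so it is still an average. The sketch below shows the second option in plain Python. GradientAccumulator is a hypothetical helper, the gradients are plain floats standing in for tensors, and the applied list stands in for calls to optimizer.apply_gradients.

```python
class GradientAccumulator:
    """Hypothetical helper: sums per-variable gradients across micro-batches."""

    def __init__(self, accumulation_steps):
        self.accumulation_steps = accumulation_steps
        self.count = 0
        self.total = None

    def add(self, grads):
        """Add one micro-batch's gradients (already scaled by 1/N).

        Returns True when a full group of N micro-batches has been collected.
        """
        if self.total is None:
            self.total = list(grads)
        else:
            self.total = [a + g for a, g in zip(self.total, grads)]
        self.count += 1
        return self.count == self.accumulation_steps

    def pending(self):
        """True if some micro-batches are accumulated but not yet applied."""
        return self.count > 0

    def flush(self):
        """Return the accumulated gradients and reset.

        A partial group of k < N micro-batches is rescaled by N / k so the
        result is still the average over the micro-batches actually seen.
        """
        scale = self.accumulation_steps / self.count
        out = [g * scale for g in self.total]
        self.total, self.count = None, 0
        return out


N = 4
acc = GradientAccumulator(N)
applied = []  # stands in for optimizer.apply_gradients calls

for step in range(10):  # 10 micro-batches: two full groups plus 2 left over
    micro_grads = [1.0 / N, 2.0 / N]  # pretend gradients for two variables
    if acc.add(micro_grads):
        applied.append(acc.flush())

# Apply the trailing partial group instead of silently dropping it.
if acc.pending():
    applied.append(acc.flush())

print(len(applied), applied[-1])
```

Keeping the count and the running sum in one object also keeps apply and reset at the same boundary, which is the second correctness detail from the previous section.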
