Neural Network Debugging Strategies

A systematic approach used by experienced ML engineers

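Neural networks fail in ways that are harder to debug than most software. The loss curve doesn't always show what went wrong, and a bug can sit quietly for many epochs before it becomes obvious. Here is a systematic strategy.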

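1. Validate the data first, always. Many training failures trace back to data bugs. Before debugging the model, check shapes, value ranges (are the inputs normalized?), the label distribution, and that there are no NaN or inf values.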

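2. Overfit a single batch. If the model can't drive the loss to roughly 0 on one batch within 100 epochs, there is a fundamental bug: the wrong loss, the wrong output activation, a label format mismatch, or a learning rate that is too small. Catching it on one batch saves hours.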

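3. Use run_eagerly to get a full Python traceback. When a tracing error is opaque, compile(..., run_eagerly=True) makes the model execute eagerly as plain Python instead of a traced graph, so you get a proper stack trace. Eager mode is slower, so turn it off again once the bug is found.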

4. Watch for the standard symptoms. Each common bug has its own signature.
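For step 3, a minimal sketch of switching on eager execution might look like this. The tiny model, the synthetic data, and the name build_model are assumptions made for illustration; only compile(..., run_eagerly=True) comes from the text above.

```python
import numpy as np
import tensorflow as tf


def build_model(num_features=4, num_classes=3):
    """Tiny hypothetical classifier, used only to demonstrate run_eagerly."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(num_classes),  # raw logits, no softmax
    ])


model = build_model()
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    run_eagerly=True,  # run each step as ordinary Python; for debugging only
)

# Synthetic data standing in for one real batch (an assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4)).astype("float32")
y = rng.integers(0, 3, size=(16,))

# With run_eagerly=True, an exception raised inside a layer or loss points
# at the offending Python line instead of a graph-tracing frame.
history = model.fit(x, y, epochs=1, verbose=0)
final_loss = history.history["loss"][-1]
```

Because every step runs without graph compilation, you can also put a breakpoint or a print inside a custom layer while this flag is on.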

Code

Data sanity check + overfit one batch·python

import tensorflow as tf
import numpy as np

# Assumes `train_dataset` (a tf.data.Dataset yielding (x, y) batches with
# float inputs) and a compiled Keras `model` are already defined.

# Step 1: verify data on the first 10 batches
for x, y in train_dataset.take(10):
    assert not tf.reduce_any(tf.math.is_nan(x)), "NaN in inputs!"
    assert not tf.reduce_any(tf.math.is_inf(x)), "Inf in inputs!"
    print(f"x shape: {x.shape}, y shape: {y.shape}")
    print(f"x range: [{x.numpy().min():.3f}, {x.numpy().max():.3f}]")
    print(f"y unique: {np.unique(y.numpy())}")

# Step 2: overfit one batch
# Note: this trains `model` in place; rebuild it before the real run.
single_x, single_y = next(iter(train_dataset))
history = model.fit(single_x, single_y, epochs=100, verbose=0)
print(f"Final loss on single batch: {history.history['loss'][-1]:.6f}")
# If not ~0, the bug is in: loss / output activation / label format / LR
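The check above prints ranges and unique labels but doesn't count how often each label occurs. A rough sketch of a fuller per-batch report, in plain NumPy, could look like this. The function name summarize_batch, its dictionary keys, and the synthetic batch are all hypothetical, and it assumes integer class labels.

```python
import numpy as np


def summarize_batch(x, y, num_classes):
    """Return basic sanity statistics for one batch of inputs and integer labels."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y)
    # Count how many samples carry each label, including labels with zero samples.
    counts = np.bincount(y.ravel(), minlength=num_classes)
    return {
        "x_shape": x.shape,
        "y_shape": y.shape,
        "all_finite": bool(np.isfinite(x).all()),  # False if any NaN or inf
        "x_min": float(x.min()),
        "x_max": float(x.max()),
        "x_mean": float(x.mean()),
        "x_std": float(x.std()),
        "label_counts": counts.tolist(),
        "labels_in_range": bool(y.min() >= 0 and y.max() < num_classes),
    }


# Synthetic batch: 6 samples, 2 features, 3 classes (illustration only).
x = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0],
              [0.2, 0.8], [0.9, 0.1], [0.4, 0.6]])
y = np.array([0, 0, 0, 0, 1, 2])

report = summarize_batch(x, y, num_classes=3)
print(report["label_counts"])  # [4, 1, 1]
```

A skewed label_counts like the one here is an early hint of the "output always one class" symptom, and x_mean / x_std far from 0 and 1 suggest the inputs were never normalized.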

Common bug signature table·python

import numpy as np
import tensorflow as tf

# Symptom                          | Likely cause
# ---------------------------------+------------------------------------------
# Loss stuck at ~log(num_classes)  | Bad init or LR too low; try LR x10
# Loss NaN after first batch       | LR too high, unnormalized inputs, or
#                                  | wrong loss/activation pair
# Train acc 100%, val acc bad      | Overfitting; add Dropout or L2, or
#                                  | use a smaller model
# Train ≈ val, both bad            | Underfitting; bigger model or higher LR
# Output always one class          | Class imbalance; use class_weight or
#                                  | focal loss, and check the final activation

# Detect a NaN/Inf loss with a callback
class NaNDetector(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        loss = (logs or {}).get("loss")
        # np.isfinite is False for NaN, +inf, and -inf
        if loss is not None and not np.isfinite(loss):
            print(f"\n⚠️  NaN/Inf at batch {batch}!")
            self.model.stop_training = True
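The "Loss stuck at ~log(num_classes)" row in the table follows from a small piece of arithmetic: a classifier that assigns every class the same probability has a cross-entropy of exactly log(num_classes), using the natural log. The sketch below works that out in NumPy. The helper names and the 10-class setup are assumptions chosen for illustration.

```python
import math

import numpy as np


def uniform_cross_entropy(num_classes):
    """Cross-entropy when every class gets probability 1 / num_classes."""
    return math.log(num_classes)


def cross_entropy(probs, labels):
    """Mean cross-entropy of integer labels under row-wise class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    # Pick the probability assigned to the true class of each sample.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


num_classes = 10
labels = np.array([3, 7, 1, 9])
uniform_probs = np.full((len(labels), num_classes), 1.0 / num_classes)

baseline = uniform_cross_entropy(num_classes)
observed = cross_entropy(uniform_probs, labels)
print(f"{baseline:.4f} {observed:.4f}")  # 2.3026 2.3026
```

So a 10-class model whose loss hovers around 2.30 has learned nothing beyond a uniform guess, which is why that value is worth computing before training starts.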
