MultiWorker, TF_CONFIG, and the Performance Recipe

From a Single Machine to a Fleet

MultiWorkerMirroredStrategy extends MirroredStrategy to multiple machines. Each machine keeps a copy of the model on every one of its GPUs, and gradients are synchronized with an NCCL all-reduce across both workers and devices.
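
To make the synchronization step concrete, here is a minimal pure-Python sketch of what an all-reduce produces. It is not how NCCL works internally, and `all_reduce_sum` and the gradient values are hypothetical. It assumes each replica scales its local gradient by the global batch size, so that the summed result matches what a single device would compute on the whole batch.

```python
def all_reduce_sum(per_replica_values):
    """Element-wise sum across replicas; every replica receives the same result."""
    return [sum(vals) for vals in zip(*per_replica_values)]

# Hypothetical per-example gradients for one parameter, split over 4 replicas
# (2 workers x 2 GPUs), with 2 examples per replica.
per_replica_examples = [[1.0, 3.0], [5.0, 7.0], [2.0, 4.0], [6.0, 8.0]]
global_batch = sum(len(examples) for examples in per_replica_examples)

# Each replica scales its local gradient sum by the GLOBAL batch size.
local_grads = [[sum(examples) / global_batch] for examples in per_replica_examples]

# After the all-reduce, every replica holds the same synchronized gradient.
synced = all_reduce_sum(local_grads)

# Reference: the mean gradient a single device would compute on the full batch.
all_examples = [g for examples in per_replica_examples for g in examples]
reference = sum(all_examples) / len(all_examples)

print(synced, reference)
```

Because every replica applies the same synchronized gradient, the model copies stay identical from step to step.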

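TF_CONFIG is a JSON environment variable that tells each worker who its peers are and what its own role is. The cluster field lists the addresses of all workers; the task field gives this worker's type and index.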
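
The two fields can be generated per worker rather than written by hand. The sketch below uses only the standard library; `make_tf_config` and the host addresses are hypothetical names chosen for illustration.

```python
import json

def make_tf_config(worker_hosts, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(worker_hosts):
        raise ValueError(f"index {index} is out of range for {len(worker_hosts)} workers")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},    # identical on every worker
        "task": {"type": "worker", "index": index},   # unique per worker
    })

# Hypothetical addresses, for illustration only.
hosts = ["10.0.0.1:12345", "10.0.0.2:12345", "10.0.0.3:12345"]
configs = [make_tf_config(hosts, i) for i in range(len(hosts))]

for cfg in configs:
    parsed = json.loads(cfg)
    task = parsed["task"]
    own_address = parsed["cluster"]["worker"][task["index"]]
    print(task["type"], task["index"], own_address)
```

On each machine the matching string would be assigned to `os.environ["TF_CONFIG"]` before the strategy is created.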

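Performance recipe before going distributed: always optimize a single GPU first. Does the Profiler show the run as less than 5% input bound? Is mixed precision on? Is XLA on? Is the batch size the largest that memory allows? Each of these can yield a 2x to 5x speedup before you take on the complexity of distribution.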

Code

MultiWorker setup: TF_CONFIG · python

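import json
import os

import tensorflow as tf

# Set TF_CONFIG on each worker BEFORE creating the strategy.
# Worker 0 (chief):
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # Placeholders: use real host:port addresses, identical on every worker.
        "worker": ["host1:port", "host2:port", "host3:port"],
    },
    "task": {"type": "worker", "index": 0},   # index differs on each worker
})

# Strategy setup: same code on every worker
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL,
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication_options,
)

# The global batch is what one training step consumes across ALL replicas.
NUM_WORKERS = 3
GPUS_PER_WORKER = 2
BATCH_PER_REPLICA = 64
GLOBAL_BATCH = BATCH_PER_REPLICA * NUM_WORKERS * GPUS_PER_WORKER   # 384

with strategy.scope():
    # build_and_compile_model() is defined elsewhere; variables must be
    # created inside the scope so they are mirrored on every replica.
    model = build_and_compile_model()

# `dataset` is defined elsewhere and should be batched with GLOBAL_BATCH;
# each global batch is then split across the replicas.
model.fit(dataset, epochs=50)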
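
The batch arithmetic above determines how many optimizer steps make up an epoch. Here is a small sketch of that calculation in plain Python; the helper names and the dataset size of 50,000 examples are hypothetical.

```python
import math

def global_batch_size(batch_per_replica, num_workers, gpus_per_worker):
    """Examples consumed per training step across every replica in the cluster."""
    return batch_per_replica * num_workers * gpus_per_worker

def steps_per_epoch(num_examples, global_batch, drop_remainder=True):
    """Number of optimizer steps needed to pass over the dataset once."""
    if drop_remainder:
        return num_examples // global_batch
    return math.ceil(num_examples / global_batch)

global_batch = global_batch_size(batch_per_replica=64, num_workers=3, gpus_per_worker=2)
steps = steps_per_epoch(num_examples=50_000, global_batch=global_batch)

print(global_batch, steps)
```

Adding workers raises the global batch and lowers the step count per epoch, which is why the learning rate usually has to be rescaled along with it.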

Performance recipe stack · python

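import tensorflow as tf
from tensorflow.keras import mixed_precision

# Assumed to be defined elsewhere: x_train, y_train, preprocess,
# steps_per_epoch, and model (built after the policy below is set).

# 1. Mixed precision (matches hardware). Set the policy BEFORE building the
#    model, because layers read it when they are constructed.
mixed_precision.set_global_policy('mixed_float16')   # GPU
# or 'mixed_bfloat16'                                  # TPU

# 2. Linear LR scaling with batch size (reference batch of 32)
BASE_LR = 1e-3
GLOBAL_BATCH_SIZE = 512
WARMUP_EPOCHS = 5
PEAK_LR = BASE_LR * (GLOBAL_BATCH_SIZE / 32)
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,     # warmup starts here
    decay_steps=1000,
    warmup_target=PEAK_LR,         # warmup is skipped if warmup_target is None
    warmup_steps=WARMUP_EPOCHS * steps_per_epoch,
)
# Any optimizer works; Adam is used here as an example.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# 3. AUTOTUNE pipeline: cache before shuffle, prefetch last
AUTOTUNE = tf.data.AUTOTUNE
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache()                       # only safe if preprocess is deterministic
    .shuffle(10000)                # after cache, so each epoch is reshuffled
    .batch(GLOBAL_BATCH_SIZE, drop_remainder=True)   # drop_remainder for TPU
    .prefetch(AUTOTUNE)            # overlap input work with training
)

# 4. XLA compilation
model.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    jit_compile=True,    # auto on TPU, opt-in on GPU
)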
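
To see what the linear learning-rate scaling and warmup produce numerically, here is a plain-Python sketch of the schedule shape. It follows the general warmup-then-cosine form rather than reproducing the Keras implementation exactly. The function names and the assumption of 100 steps per epoch are hypothetical.

```python
import math

def scaled_lr(base_lr, global_batch, reference_batch=32):
    """Linear scaling rule: the LR grows in proportion to the batch size."""
    return base_lr * global_batch / reference_batch

def warmup_cosine_lr(step, peak_lr, warmup_steps, decay_steps):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0 over decay_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

peak = scaled_lr(base_lr=1e-3, global_batch=512)   # 16x the base rate
warmup_steps = 5 * 100                             # 5 epochs x 100 steps (assumed)
decay_steps = 1000

for step in (0, 250, 500, 1000, 1500):
    lr = warmup_cosine_lr(step, peak, warmup_steps, decay_steps)
    print(step, round(lr, 6))
```

The warmup phase matters here because the scaled peak rate is 16 times the base rate, and jumping to it at step 0 can destabilize early training.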
