History Arc

Three inflection points

The modern history of deep learning is short and intense.

2012, AlexNet: a CNN trained on two GPUs cut ImageNet top-5 error nearly in half, from roughly 26% for the runner-up to about 15%, seemingly overnight. A community that had been skeptical of neural nets pivoted within about a year.

2014–2017, the architecture rush: VGG, GoogLeNet, ResNet, attention, seq2seq, batch normalization, Adam. The toolbox we still use today was built in these years.

2017, Transformer: "Attention Is All You Need" proposed a recurrence-free architecture that scales horizontally on GPUs and TPUs. In 2018 BERT and GPT took over NLP, and by 2020 Transformers had taken over vision (ViT) and speech as well.

The ingredients that made it possible

Algorithmic genius alone didn't do it. Three things had to align: data (ImageNet, the open web), compute (GPUs going from a graphics accessory to a general-purpose tensor engine), and differentiable software (Theano, Caffe, then TensorFlow and PyTorch). Researchers in 1995 had similar ideas, but they had none of the three.

Tip: when you read "X invented backprop in 1986", the right reaction is "and then it went largely unused at scale for about 25 years, because the data and the GPUs weren't there." Algorithms wait for infrastructure.

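What this tells you about the next 10 years

Every inflection point is obvious in hindsight and shocking in the moment. The arc hasn't stopped: foundation models, multi-modal training, and reasoning-oriented post-training are the current frontier. The pattern is the same: better representations, more data, more compute, better software. Building intuition for which ingredient is the binding constraint on your own problem is more useful than memorizing dates.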

Principle: every named architecture in this quest is a frozen snapshot of one inflection point. The goal isn't to memorize them; it's to recognize the pattern when the next inflection point arrives.

Code

The architecture rush in one screen (Python)

# Milestones in chronological order: (year, name, why it mattered).
milestones = [
    (2012, "AlexNet",         "First CNN to win ImageNet by a huge margin"),
    (2014, "VGG",             "Deeper is better, with simple 3x3 convs"),
    (2014, "GoogLeNet",       "Inception modules; multi-scale within a layer"),
    (2014, "Seq2Seq",         "Encoder-decoder for translation"),
    (2014, "Adam",            "Per-parameter learning rates; default optimizer"),
    (2015, "Batch Norm",      "Normalization that made deeper nets trainable"),
    (2015, "ResNet",          "Residual connections; 100+ layer networks work"),
    (2017, "Transformer",     "Attention-only; no recurrence; massively parallel"),
    (2018, "BERT / GPT-1",    "Pretrain a transformer, fine-tune on tasks"),
    (2020, "ViT",             "Transformers eat vision too"),
    (2022, "Diffusion + LLM", "Generative scaling on text and images"),
    (2024, "Reasoning LLMs",  "Test-time compute as a first-class scaling axis"),
]

# Pad every name to the longest one so the "why" column lines up.
width = max(len(name) for _, name, _ in milestones)
for year, name, why in milestones:
    print(f"{year}  {name:{width}}  {why}")
