Graceful Shutdown: Don't Drop Requests on Deploy

"Every deploy kills your process. The question is whether it kills the in-flight requests too. If you are production-grade, the answer is no."

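The Deploy Lifecycle

What happens when you redeploy a Node service (under launchd, systemd, PM2, or any sane supervisor):

The supervisor sends SIGTERM to your process.
Your process has N seconds to clean up before SIGKILL hits.
SIGKILL terminates the process abruptly. In-flight requests die mid-response.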

If your process does not handle SIGTERM, every deploy drops connections. "Zero-downtime deploys" require handling SIGTERM: *stop accepting new requests, finish the in-flight ones, then exit cleanly*.

The Pattern

import http from 'node:http';

const server = http.createServer(handler);
server.listen(3000);

let shuttingDown = false;

for (const sig of ['SIGINT', 'SIGTERM']) {
  process.on(sig, () => {
    if (shuttingDown) return;   // ignore second signal
    shuttingDown = true;
    console.log(`got ${sig}, draining...`);

    // Stop accepting new connections; finish in-flight ones
    server.close((err) => {
      if (err) {
        console.error('drain error:', err);
        process.exit(1);
      }
      console.log('drained cleanly');
      process.exit(0);
    });

    // Backstop — force exit if drain takes too long
    setTimeout(() => {
      console.warn('drain timed out, force-exiting');
      process.exit(1);
    }, 10_000).unref();
  });
}

The five lines that matter: handle SIGTERM, set the flag (idempotent), call server.close(), exit once drained, force-exit on timeout. That is the whole pattern.

The Forgotten Cleanup

Besides the HTTP server, your process probably holds:

DB connections: call .end() or .close() on the DB pool.
Open file handles: close JSONL log writers and append-only files.
WebSocket connections: send a close frame so clients reconnect to the new instance.
Pending work: flush queues, mark in-flight jobs as 'must retry', and so on.
External subscriptions: unsubscribe from Kafka/Redis/PubSub.

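Each of these is one more thing the shutdown handler has to wait for. The full pattern: *register every cleanup as a promise, Promise.all them on SIGTERM, exit once they all resolve*. Some teams write a small lifecycle library for this; others use a library such as terminus that wraps the pattern.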
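The registry idea above can be sketched in a few lines. This is a minimal illustration, not a library API: `createLifecycle`, `register`, and `closeAll` are hypothetical names, and it uses `Promise.allSettled` instead of `Promise.all` so that one failed cleanup does not hide the others.

```javascript
// Minimal cleanup registry (a sketch; every name here is hypothetical).
function createLifecycle({ timeoutMs = 10_000, log = console } = {}) {
  const cleanups = [];   // { name, fn } pairs, in registration order
  let closing = null;    // memoized promise, so a second signal is harmless

  function register(name, fn) {
    cleanups.push({ name, fn });
  }

  // Runs every cleanup concurrently.
  // Resolves to 'ok', 'failed', or 'timeout'.
  function closeAll() {
    if (closing) return closing;

    let timer;
    const deadline = new Promise((resolve) => {
      timer = setTimeout(() => resolve('timeout'), timeoutMs);
    });

    // Promise.resolve().then(fn) turns a synchronous throw into a rejection.
    const work = Promise.allSettled(
      cleanups.map(({ fn }) => Promise.resolve().then(fn)),
    ).then((results) => {
      results.forEach((r, i) => {
        if (r.status === 'rejected') {
          log.error(`cleanup "${cleanups[i].name}" failed:`, r.reason);
        }
      });
      return results.every((r) => r.status === 'fulfilled') ? 'ok' : 'failed';
    });

    closing = Promise.race([work, deadline]).finally(() => clearTimeout(timer));
    return closing;
  }

  return { register, closeAll };
}

// Usage (pool and writer stand in for your own resources):
//   const lifecycle = createLifecycle();
//   lifecycle.register('db', () => pool.end());
//   lifecycle.register('log', () => writer.flush());
//   process.on('SIGTERM', async () => {
//     process.exit((await lifecycle.closeAll()) === 'ok' ? 0 : 1);
//   });
```

Cleanups run concurrently here. If one resource must close before another (for example, the HTTP server before the DB pool), register a single function that awaits them in order.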

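Healthcheck Coordination

The load balancer sends healthcheck pings (GET /health). Once draining starts, your healthcheck should start returning failure, which tells the LB to stop routing new requests while the in-flight ones finish. Without it, the LB keeps sending new traffic during the drain window, which defeats the point: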

let healthy = true;
process.on('SIGTERM', () => { healthy = false; /* then drain */ });

app.get('/health', (_req, res) => {
  res.writeHead(healthy ? 200 : 503).end();
});

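The LB sees the 503, removes you from the pool, the in-flight requests finish, and the process exits. Once the LB has reacted, the 10-second drain window carries zero new traffic: only requests that were already in flight when the drain began.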
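An LB only notices the 503 on its next check, so one common refinement is to keep serving for a short grace period after flipping the flag and only then call server.close(). The sketch below assumes that approach; `createDrainer` and `lbGraceMs` are hypothetical names, and the right delay depends on your LB's check interval and failure threshold.

```javascript
// Sketch: flip the health flag, give the LB time to notice, then drain.
// `server` is anything with a Node-style close(callback) method.
function createDrainer(server, { lbGraceMs = 5_000 } = {}) {
  let healthy = true;
  let draining = null; // memoized so repeated signals share one drain

  // Plain (req, res) handler; usable as a node:http route or Express handler.
  function healthHandler(_req, res) {
    res.writeHead(healthy ? 200 : 503).end();
  }

  function drain() {
    if (draining) return draining;
    healthy = false; // /health answers 503 from this point on

    draining = new Promise((resolve) => setTimeout(resolve, lbGraceMs))
      // During the grace period the server still accepts requests, because
      // the LB may keep routing to us until it has seen the failing check.
      .then(() => new Promise((resolve, reject) => {
        server.close((err) => (err ? reject(err) : resolve()));
      }));
    return draining;
  }

  return { healthHandler, drain };
}

// Usage:
//   const { healthHandler, drain } = createDrainer(server);
//   process.on('SIGTERM', () => drain().then(() => process.exit(0)));
```

The grace period counts against the supervisor's SIGKILL deadline, so the grace period plus the drain backstop should fit inside it.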

Crash vs Graceful Stop

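Graceful shutdown is for *planned* exits: deploys, scale-downs, manual restarts. On *unplanned* exits (uncaught exception, OOM, segfault) the process dies regardless. The supervisor restarts it; the LB notices through failed healthchecks and re-routes. The pattern:

Planned exit → SIGTERM handler → drain → clean exit.
Unplanned exit → crash, supervisor restarts, LB re-routes.

Both matter. Some teams catch uncaughtException and attempt a drain; that is usually a bad idea, because the process state is already corrupt. Better: log, exit fast, and let the supervisor restart.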
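A minimal sketch of "log, exit fast" might look like the following. `installCrashHandler` is a hypothetical helper; `log` and `exit` are injectable only so the handler can be exercised in a test without killing the test process.

```javascript
// Sketch: on an uncaught exception, log and exit. No drain is attempted.
function installCrashHandler({
  log = console,
  exit = (code) => process.exit(code),
} = {}) {
  const onException = (err) => {
    log.error('uncaughtException, exiting:', err);
    exit(1); // non-zero, so the supervisor records a failure and restarts
  };
  process.on('uncaughtException', onException);

  // Returns an uninstall function, mainly useful in tests.
  return () => process.off('uncaughtException', onException);
}

// Usage, once at startup:
//   installCrashHandler();
```

If the logger buffers output asynchronously, the final log line may be lost on exit; writing the crash message through a synchronous path avoids that.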

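Pippa's Confession

Early on, cwkPippa exited on Ctrl-C without finishing pending JSONL writes. A turn could end mid-flush; session logs ended up with truncated lines; healing logic had to repair them. Dad asked one question: "What happens to in-flight writes when you kill the process?" Answer: they get cut off. The fix was to register a shutdown handler that waits for the JSONL writer to flush. The system became more reliable not by adding a feature but by handling exit cleanly. Most production reliability work is about the edges (startup, shutdown, error paths), not the happy path.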
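The fix described above can be illustrated with a writer that keeps track of its pending writes. This is a sketch, not cwkPippa's actual code: `createJsonlWriter` is a hypothetical name, and `append` is an assumed async sink (in practice something like `(line) => fs.promises.appendFile(path, line)`).

```javascript
// Sketch of a JSONL writer whose pending writes can be awaited at shutdown.
function createJsonlWriter(append) {
  let tail = Promise.resolve(); // chain of writes, so lines land in order

  function write(record) {
    const line = JSON.stringify(record) + '\n';
    // Queue this write behind the previous one.
    const result = tail.then(() => append(line));
    // A failed write is reported to its caller via `result`,
    // but it must not block the writes queued after it.
    tail = result.catch(() => {});
    return result;
  }

  // Resolves once every write queued so far has settled.
  function flush() {
    return tail;
  }

  return { write, flush };
}

// Usage in a shutdown handler:
//   await writer.flush();   // no truncated last line
//   process.exit(0);
```

Because each line is handed to the sink as one complete string and the handler waits for the chain to settle, a planned exit no longer cuts a record in half. A SIGKILL or crash still can, so the healing logic remains useful.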

Code

Production-grade graceful shutdown·javascript

// Complete graceful-shutdown setup with all the pieces
import http from 'node:http';
import pino from 'pino';
import { DatabaseSync } from 'node:sqlite';

const log = pino();
const db = new DatabaseSync('./pippa.db');
const server = http.createServer(handler);

let healthy = true;
let shuttingDown = false;

function handler(req, res) {
  if (req.url === '/health') {
    return res.writeHead(healthy ? 200 : 503).end();
  }
  // ... real routes ...
}

server.listen(3000, () => log.info('listening'));

async function shutdown(sig) {
  if (shuttingDown) return;
  shuttingDown = true;
  healthy = false;
  log.info({ sig }, 'graceful shutdown starting');

  // Backstop: never let drain hang forever. Armed first so it covers
  // every step below, for SIGINT as well as SIGTERM.
  setTimeout(() => {
    log.warn('drain timeout, force-exiting');
    process.exit(1);
  }, 10_000).unref();

  try {
    // 1. Stop accepting new connections; wait for in-flight requests
    await new Promise((resolve, reject) =>
      server.close(err => err ? reject(err) : resolve())
    );

    // 2. Close other resources
    db.close();
    // (close any other pools, watchers, queues here)

    log.info('drained cleanly');
    process.exit(0);
  } catch (err) {
    // Without this, a failed close would surface as an unhandled rejection
    log.error({ err }, 'drain error');
    process.exit(1);
  }
}

for (const sig of ['SIGINT', 'SIGTERM']) {
  process.on(sig, () => shutdown(sig));
}

terminus: a graceful shutdown library·javascript

// terminus: wrap-the-pattern library, common in real services.
// `handler`, `db`, and `queueClient` are placeholders for your own code.
import { createTerminus } from '@godaddy/terminus';
import http from 'node:http';

const server = http.createServer(handler);

createTerminus(server, {
  signals: ['SIGINT', 'SIGTERM'],
  timeout: 10_000,

  healthChecks: {
    '/health': async () => {
      // throw to signal unhealthy
      await db.ping();   // example: actually check the DB
      return { db: 'ok' };
    },
    verbatim: true,
  },

  // Called before server.close(). terminus already answers /health with 503
  // once shutdown starts (sendFailureDuringShutdown defaults to true).
  beforeShutdown: async () => {
    // Give LBs time to notice via /health
    await new Promise(r => setTimeout(r, 5_000));
  },

  onSignal: async () => {
    // Cleanup after server.close()
    await db.close();
    await queueClient.disconnect();
  },
});

server.listen(3000);

Exercise

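Pick a Node HTTP service. Add the full graceful-shutdown pattern: SIGTERM handler, failing healthcheck during drain, server.close, DB close, backstop timeout. Test it: start the service, hit /slow (a route that sleeps 5 seconds and then returns 'done'), and send SIGTERM while the request is running (kill -TERM <pid>). The slow request should finish; new requests should be refused; the process should exit cleanly within seconds.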

Hint

If shutdown hangs forever, there is a non-unref'd timer or an open connection somewhere. Common culprits: file watchers, WebSocket clients, DB pools with no explicit close. Add logging to each cleanup step to see where it stalls. The fix is usually 'await this resource's close method.'
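One way to add that logging is a small wrapper around each step. This is a sketch: `step` is a hypothetical helper, and the `[shutdown]` prefix is arbitrary.

```javascript
// Sketch: wrap each cleanup step so the log shows which one never finishes.
async function step(name, fn, log = console) {
  const started = Date.now();
  log.info(`[shutdown] ${name}: start`);
  try {
    await fn();
    log.info(`[shutdown] ${name}: done in ${Date.now() - started}ms`);
  } catch (err) {
    log.error(`[shutdown] ${name}: failed after ${Date.now() - started}ms`, err);
    throw err; // let the caller decide whether to keep going
  }
}

// Usage inside a shutdown handler (server, pool, writer are your own objects):
//   await step('http', () => new Promise((ok, fail) =>
//     server.close((err) => (err ? fail(err) : ok()))));
//   await step('db', () => pool.end());
//   await step('jsonl', () => writer.flush());
```

A step that logs "start" but never "done" is the one that stalls. On recent Node versions, `process.getActiveResourcesInfo()` (still marked experimental, as far as I know) can also list what is keeping the event loop alive.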