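Chunking strategies that don't lose the argument

A 500-token chunk is a default, not a law

A common starter chunk size is 500 tokens with a 50-token overlap. It is a default that works for general prose. For legal documents (where a single paragraph can run 2,000 tokens), code (where the boundaries are functions and classes), and chat logs (where the turn is the unit of meaning), it is a disaster. Chunk size has to follow the structure of the source.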
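
The starter default described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `sliding_window_chunks` is hypothetical, and it assumes the text has already been split into a list of tokens (whitespace splitting is only a rough stand-in for a real tokenizer, which will count tokens differently).

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], size: int = 500, overlap: int = 50
) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper: `tokens` is assumed to come from whatever
    tokenizer the embedding model uses.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Usage: whitespace splitting as a crude approximation of tokenization.
words = "one two three four five six seven eight nine ten".split()
small_chunks = sliding_window_chunks(words, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each window starts three tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.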

Strategies that work

Structural chunking: split on headings, sections, function definitions, or log entries. This aligns chunks with the way the document was written.
Sliding window with overlap: for general prose with light structure. The overlap keeps sentences near a boundary attached to their neighbors.
Semantic chunking: embedding-based; split where the embedding distance changes sharply. Useful for free-form text.
Contextual headers: prepend a one-sentence summary of the parent section to each chunk so retrieval does not lose the topic.
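
As one way to picture semantic chunking from the list above, here is a rough sketch that starts a new chunk whenever the cosine distance between consecutive sentence embeddings exceeds a threshold. The names `semantic_chunks` and `toy_embed` and the threshold value are assumptions made for illustration; `toy_embed` is a keyword-count stand-in, and a real pipeline would pass in an actual embedding model and tune the threshold on its own corpus.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.5,  # assumed value; tune per corpus and model
) -> List[List[str]]:
    """Group consecutive sentences, splitting where the embedding jumps."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine_distance(previous, current) > threshold:
            chunks.append([sentence])  # sharp change: start a new chunk
        else:
            chunks[-1].append(sentence)
        previous = current
    return chunks


def toy_embed(sentence: str) -> Vector:
    """Toy stand-in for an embedding model: counts two keywords."""
    return [sentence.count("cat"), sentence.count("tax")]


sentences = ["the cat sat", "a cat slept", "tax is due", "tax forms"]
grouped = semantic_chunks(sentences, toy_embed)
```

Comparing only adjacent sentences keeps the sketch short; smoothing over a small window of sentences is a common refinement when single sentences are noisy.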

The argument-preservation test

For each chunking strategy, take a long argument from your corpus (a multi-paragraph claim, a step-by-step procedure). Ask: does the strategy keep the argument together? If a chunk break splits the conclusion from its premises, retrieval surfaces one side without the other, so the model works from half the case.
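
The check above can be approximated mechanically. The sketch below assumes each argument is hand-labeled with a few verbatim snippets (a premise and a conclusion, say) and counts the argument as preserved when at least one chunk contains all of them. The function names and the substring-matching approach are illustrative assumptions; it also ignores the case where overlap duplicates text across chunks.

```python
from typing import Dict, List, Sequence


def argument_preserved(chunks: Sequence[str], parts: Sequence[str]) -> bool:
    """True if some single chunk contains every labeled part of the argument."""
    return any(all(part in chunk for part in parts) for chunk in chunks)


def broken_arguments(
    chunks: Sequence[str], arguments: Dict[str, Sequence[str]]
) -> List[str]:
    """Names of the arguments that no single chunk holds together."""
    return [
        name
        for name, parts in arguments.items()
        if not argument_preserved(chunks, parts)
    ]


# Hypothetical hand-labeled argument: premise and conclusion snippets.
arguments = {
    "cache-staleness": ["All caches expire", "stale reads happen"],
}

together = ["All caches expire. Therefore stale reads happen.", "Unrelated."]
split_up = ["All caches expire.", "Therefore stale reads happen."]

kept = broken_arguments(together, arguments)
lost = broken_arguments(split_up, arguments)
```

Running the same labeled arguments against the output of each strategy gives a simple side-by-side comparison of which strategies break which arguments.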

Code

Contextual chunk header (Anthropic style) · python

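def add_context_header(chunk: str, source_title: str, section: str) -> str:
    """Wrap a chunk in tags naming its source document and parent section."""
    # The tagged string, not the bare chunk, is what gets embedded and indexed.
    return (
        f"<source>{source_title}</source>\n"
        f"<section>{section}</section>\n"
        f"<content>\n{chunk}\n</content>"
    )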

# After embedding the contextualized chunk, retrieval becomes
# robust to keyword drift across sections.
