Building Semantic Search by Hand

~28 min · practice, numpy

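Why build it yourself before installing a vector DB

Every vector database hides the same three operations behind a fancy API: embed, store, and nearest-neighbor query. If you can write those three in about 30 lines of NumPy, you'll read vector-DB docs faster, debug bad results faster, and stop blindly trusting default values.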

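The whole loop on one screen

What you'll do:

Embed a small corpus with a local model.
Store the vectors in a NumPy matrix and the documents in a parallel list.
Embed the query, compute cosine similarity against every row, and return the top-k.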

At its core, this is what Chroma and pgvector do under the hood; they just add persistence, metadata filtering, and indexes for scale. The retrieval logic is the same.
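Before bringing in a model, here is a minimal NumPy-only sketch of the store and query steps. The 3-dimensional vectors are made-up stand-ins for real embeddings (an assumption for illustration, not model output), and `cosine_all` is a hypothetical helper name. It shows why normalizing once at store time lets a plain dot product stand in for cosine similarity.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; a real model returns hundreds of dimensions.
raw = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 2.0],
])
query = np.array([2.0, 0.5, 0.0])

def cosine_all(matrix: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Cosine similarity between q and every row, straight from the definition."""
    return (matrix @ q) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))

# Normalize once at store time; afterwards a dot product gives the same scores.
unit_rows = raw / np.linalg.norm(raw, axis=1, keepdims=True)
unit_q = query / np.linalg.norm(query)

by_definition = cosine_all(raw, query)
by_dot = unit_rows @ unit_q
ranking = np.argsort(-by_dot)          # row indices, best match first

print(np.allclose(by_definition, by_dot), ranking)
```

The two score arrays agree up to floating-point error, so the ranking is the same whichever way you compute it.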

Code

30-line semantic search·python

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

docs = [
    'Cancel your annual subscription from Account → Billing.',
    'Refunds are processed within 5 business days.',
    'Pippa is an AI daughter built on Claude Code.',
    'The moon rises tonight at 8:42 PM in Seoul.',
    'Vector search measures meaning, not keyword overlap.',
]

M = model.encode(docs, normalize_embeddings=True)   # (5, 384)

def search(query: str, k: int = 3):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = M @ q                       # for unit vectors, cosine == dot product
    top = np.argsort(-scores)[:k]        # indices of the k highest scores, best first
    return [(float(scores[i]), docs[i]) for i in top]

for score, doc in search('how do I get my money back'):
    print(f'{score:.3f}  {doc}')
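`np.argsort` sorts every score even though only k of them are returned. For a corpus this small that's fine. As a rough sketch of one alternative for bigger corpora, `np.argpartition` can pick out the k largest first and then sort only those. `top_k_indices` is a hypothetical helper name, not part of the lesson code.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first, without sorting the whole array."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition moves the k largest scores into the last k slots, in no particular order.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort just those k candidates by descending score.
    return candidates[np.argsort(-scores[candidates])]

example_scores = np.array([0.1, 0.9, 0.5, 0.7])
print(top_k_indices(example_scores, 2))
```

In `search`, this would replace the `np.argsort(-scores)[:k]` line. Whether it's actually faster depends on corpus size, so measure before switching.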

Adding metadata·python

records = [
    {'text': docs[0], 'category': 'billing'},
    {'text': docs[1], 'category': 'billing'},
    {'text': docs[2], 'category': 'about'},
    {'text': docs[3], 'category': 'weather'},
    {'text': docs[4], 'category': 'about'},
]

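def search_filtered(query: str, where: dict, k: int = 3):
    q = model.encode([query], normalize_embeddings=True)[0]
    # Keep only records whose metadata matches every key/value pair in `where`.
    # (key/value instead of k/v, so the loop variable doesn't shadow the top-k parameter.)
    mask = np.array([all(r.get(key) == value for key, value in where.items()) for r in records])
    if not mask.any():
        return []
    indices = np.where(mask)[0]          # row numbers that passed the filter
    scores = M[indices] @ q              # score only the surviving rows
    order = np.argsort(-scores)[:k]
    return [(float(scores[j]), records[indices[j]]) for j in order]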

print(search_filtered('refund', where={'category': 'billing'}))

Exercise

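Build the 30-line semantic search above with 100 of your own documents and time the query loop. Then scale to 10,000 documents and time it again to see where it starts to slow down. That threshold is where you need an index, which is the topic of the next track.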
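If you want a starting point for the timing, here's a rough harness. It assumes random unit vectors in place of real embeddings (384 dimensions to match the model above), so it measures only the brute-force search step, not the time spent embedding the query. `random_unit_rows` and `time_queries` are hypothetical helper names.

```python
import time
import numpy as np

def random_unit_rows(n: int, dim: int = 384, seed: int = 0) -> np.ndarray:
    """Random unit-length rows standing in for real embeddings (assumption)."""
    rng = np.random.default_rng(seed)
    rows = rng.standard_normal((n, dim)).astype(np.float32)
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def time_queries(matrix: np.ndarray, n_queries: int = 20, k: int = 3) -> float:
    """Average seconds per brute-force top-k query against `matrix`."""
    queries = random_unit_rows(n_queries, matrix.shape[1], seed=1)
    start = time.perf_counter()
    for q in queries:
        scores = matrix @ q
        np.argsort(-scores)[:k]
    return (time.perf_counter() - start) / n_queries

for n in (100, 10_000):
    per_query = time_queries(random_unit_rows(n))
    print(f'{n:>6} rows: {per_query * 1e6:.1f} microseconds per query')
```

Swap in your own `M` from `model.encode` to time real data, and keep adding sizes to the loop until the per-query time stops being acceptable for your use case.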
