Streaming, Tool Use, Structured Output

~30 min · inference, streaming


Streaming is a single keyword

client.chat_completion(..., stream=True) returns a generator that yields ChatCompletionStreamOutput objects. Each chunk carries its text at .choices[0].delta.content. The shape mirrors the OpenAI streaming format, so UI code that already handles an OpenAI stream can handle the HF stream too.
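
A UI usually needs the full reply as well as the individual deltas. The sketch below shows one way to accumulate them. It uses stand-in chunk objects (the helper fake_chunk is hypothetical and only imitates the .choices[0].delta.content attribute path described above), so it runs without a network call. The guard for chunks with an empty choices list is an assumption about provider behavior, not something the lesson states.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Hypothetical stand-in with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a stream into the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # assumption: some providers may send such chunks
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None or empty on some chunks
            parts.append(piece)
    return "".join(parts)


demo = [fake_chunk("1, "), fake_chunk(None), fake_chunk("2, "), fake_chunk("3")]
full_text = collect_stream(demo)
print(full_text)
```

With a real stream, the same function should accept the generator returned by chat_completion(..., stream=True), provided the chunks follow the shape described above.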

Tool calling

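Pass tools=[{...}] as a list of JSON-schema-shaped dicts. The model responds with tool_calls on the assistant message. Execute the call locally, append the result as a {"role": "tool", ...} message, and loop. The contract is OpenAI-compatible: the dicts have the same shape.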
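
The "execute locally, append the result" step can be sketched as a small dispatcher. This is a minimal illustration, not the library's API: TOOL_REGISTRY, run_tool_call, and the canned get_weather function are hypothetical names, and the sketch assumes the tool arguments arrive as a JSON string.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical local tool; returns canned data instead of calling a real API."""
    return {"city": city, "temp_c": 22, "conditions": "clear"}


# Maps the tool name the model asks for to the local function that implements it.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and build the {"role": "tool"} message to append."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            result = func(**json.loads(arguments))
        except (json.JSONDecodeError, TypeError) as exc:
            # Malformed JSON, or arguments that do not match the function signature.
            result = {"error": f"bad arguments: {exc}"}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


message = run_tool_call("call_1", "get_weather", '{"city": "Seoul"}')
print(message)
```

Returning errors as tool content, rather than raising, gives the model a chance to correct its call on the next turn of the loop.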

Structured output

There are three approaches to JSON mode: (1) prompt plus validation, (2) response_format={"type": "json_object"} where the provider supports it, (3) a Pydantic-driven library such as outlines or instructor. Approach 3 is the most reliable across providers.
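
Approach 1 (prompt plus validation) can be sketched with the standard library alone. The names here are illustrative assumptions: REQUIRED_FIELDS, validate_reply, and generate_json are hypothetical, and generate stands in for whatever function sends a prompt to the model and returns its text. The demo uses canned replies instead of a real model.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"city": str, "temp_c": (int, float)}


def validate_reply(raw: str) -> dict:
    """Parse a model reply as JSON and check required keys and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"wrong type for field: {key}")
    return data


def generate_json(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call `generate` until its reply validates, feeding the error back each time."""
    last_error = None
    for _ in range(max_attempts):
        if last_error is None:
            full_prompt = prompt
        else:
            full_prompt = f"{prompt}\nPrevious reply was invalid ({last_error}). Reply with JSON only."
        try:
            return validate_reply(generate(full_prompt))
        except ValueError as exc:
            last_error = exc
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")


# Stand-in for a model call: fails once, then returns valid JSON.
replies = iter(["Sure! Here is the weather.", '{"city": "Seoul", "temp_c": 22}'])
result = generate_json(lambda _prompt: next(replies), "Weather in Seoul as JSON")
print(result)
```

A Pydantic model would replace the hand-written checks in validate_reply; the retry loop stays the same.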

Code

Streaming chat·python

from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct", provider="hf-inference")

stream = client.chat_completion(
    messages=[{"role": "user", "content": "Count from 1 to 5 slowly."}],
    max_tokens=80,
    stream=True,
)

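for chunk in stream:
    # Defensive guard: some providers may emit chunks with no choices; skip them.
    if not chunk.choices:
        continue
    # delta.content can be None, so fall back to an empty string.
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()  # final newline once the stream ends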

Tool call loop·python

from huggingface_hub import InferenceClient
import json

client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct", provider="together")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Seoul?"}]
resp = client.chat_completion(messages=messages, tools=tools, max_tokens=120)
choice = resp.choices[0]

if choice.message.tool_calls:
    call = choice.message.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Pretend to run the tool
    tool_result = {"city": args["city"], "temp_c": 22, "conditions": "clear"}
    messages.append(choice.message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(tool_result)})
    final = client.chat_completion(messages=messages, max_tokens=120)
    print(final.choices[0].message.content)


Exercise

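Wire up a streaming chat loop that prints each chunk as it arrives. Then add one tool (get_time, which returns a fixed string). Verify that the model decides whether to call it. Note how streaming behavior changes when the response is a tool call versus a plain message.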
