Best Practices - TokenLab

Model Selection

Choosing the right model can significantly affect both cost and quality.

Task-Based Recommendations

Task	Recommended models	Why
Simple Q&A	`gpt-5-mini`, `gemini-3.5-flash`	Fast, cheap, and good enough
Complex reasoning	`gpt-5.4`, `claude-opus-4-6`, `deepseek-r1`	Stronger logic and planning
Coding	`claude-sonnet-4-6`, `gpt-4o`, `deepseek-v3-2`	Optimized for code
Creative writing	`claude-sonnet-4-6`, `gpt-4o`	Better prose quality
Vision/images	`gpt-4o`, `claude-sonnet-4-6`, `gemini-3.5-flash`	Native vision support
Long context	`gemini-2.5-pro`, `claude-sonnet-4-6`	1M+ token window
Cost-sensitive	`gpt-5-mini`, `gemini-3.5-flash`, `deepseek-v3-2`	Best value for money
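
One way to apply the table in code is a small lookup that maps a task category to its recommended models. This is only a sketch: the model names are copied from the table, while the dictionary keys and the `pick_model` helper are hypothetical names chosen for illustration.

```python
# Model names come from the recommendations table; the task keys and the
# helper name are hypothetical labels chosen for this sketch.
RECOMMENDED_MODELS: dict[str, list[str]] = {
    "simple_qa": ["gpt-5-mini", "gemini-3.5-flash"],
    "complex_reasoning": ["gpt-5.4", "claude-opus-4-6", "deepseek-r1"],
    "coding": ["claude-sonnet-4-6", "gpt-4o", "deepseek-v3-2"],
    "creative_writing": ["claude-sonnet-4-6", "gpt-4o"],
    "vision": ["gpt-4o", "claude-sonnet-4-6", "gemini-3.5-flash"],
    "long_context": ["gemini-2.5-pro", "claude-sonnet-4-6"],
    "cost_sensitive": ["gpt-5-mini", "gemini-3.5-flash", "deepseek-v3-2"],
}


def pick_model(task: str, default: str = "gpt-5-mini") -> str:
    """Return the first recommended model for a task, or a cheap default."""
    candidates = RECOMMENDED_MODELS.get(task)
    return candidates[0] if candidates else default
```

Keeping the mapping in one place means a model swap is a one-line change rather than a search through the codebase.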

Cost Tiers

$$$$ Premium: gpt-5.4, claude-opus-4-6
$$$  Standard: claude-sonnet-4-6, gpt-4o
$$   Budget:   gpt-5-mini, gemini-3.5-flash
$    Economy:  deepseek-v3-2, deepseek-r1

Cost Optimization

1. Start with a Smaller Model

# `client` is the OpenAI-compatible client configured in section 7
def smart_query(question: str, complexity: str = "auto"):
    """Use cheaper models for simple tasks."""

    if complexity == "complex":
        model = "gpt-4o"
    else:
        # "simple" and "auto" both start on the cheap model; the caller can
        # re-run with complexity="complex" if the answer is not good enough
        model = "gpt-5-mini"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}]
    )
    return response
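
The "start cheap, escalate if needed" idea can be made explicit with a loop over models ordered from cheapest to most capable. The sketch below is an assumption about how escalation might look, not a TokenLab feature: `ask` and `is_good_enough` are hypothetical callables you supply (for example, a wrapper around `client.chat.completions.create` and a length or format check).

```python
from typing import Callable


def escalating_query(
    question: str,
    ask: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
    models: tuple[str, ...] = ("gpt-5-mini", "gpt-4o"),
) -> tuple[str, str]:
    """Try models from cheapest to most capable; return (model, answer)."""
    if not models:
        raise ValueError("models must not be empty")
    model, answer = models[0], ""
    for model in models:
        answer = ask(model, question)  # hypothetical: ask(model, question) -> text
        if is_good_enough(answer):
            break
    # If no answer passed the check, the last model's answer is returned.
    return model, answer
```

The quality check is the hard part in practice; a cheap heuristic (non-empty, valid JSON, minimum length) is usually enough to decide whether a second, more expensive call is worth it.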

2. Set max_tokens

Always set a reasonable max_tokens limit:

# ❌ Bad: No limit, could generate thousands of tokens
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this article"}]
)

# ✅ Good: Limit response length
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this article"}],
    max_tokens=500  # Reasonable limit for a summary
)
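
A limit can cut a response off mid-sentence, so it helps to detect that case. The sketch below assumes the response follows the OpenAI chat completions shape, where `finish_reason` is `"length"` when generation stopped at the token limit; `was_truncated` is a hypothetical helper name, and the demo uses a stand-in object rather than a real API call.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in object shaped like a chat completion response (no network call).
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # True
```

When this returns True, the caller can either raise max_tokens for that request or ask the model for a shorter answer.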

3. Optimize Prompts

# ❌ Verbose prompt (more input tokens)
prompt = """
I would like you to please help me by analyzing the following text
and providing a comprehensive summary of the main points. Please be
thorough but also concise in your response. The text is as follows:
{text}
"""

# ✅ Concise prompt (fewer tokens)
prompt = "Summarize the key points:\n{text}"
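
To get a feel for the savings, the two prompt templates can be compared with a rough token estimate. The sketch assumes the common rule of thumb of about four characters per token for English text; real counts depend on the model's tokenizer, so treat the number as an approximation. `estimate_tokens` is a hypothetical helper.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


verbose = (
    "I would like you to please help me by analyzing the following text "
    "and providing a comprehensive summary of the main points. Please be "
    "thorough but also concise in your response. The text is as follows:\n"
)
concise = "Summarize the key points:\n"

saved = estimate_tokens(verbose) - estimate_tokens(concise)
print(f"~{saved} input tokens saved per request")
```

The saving per request is small, but it is paid on every call, so it adds up for high-volume workloads.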

4. Batch Similar Requests

# ❌ Many small requests
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}]
    )

# ✅ Fewer larger requests
combined_prompt = "\n".join([f"{i+1}. {q}" for i, q in enumerate(questions)])
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Answer each question:\n{combined_prompt}"}]
)
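
A batched reply has to be split back into one answer per question. The sketch below assumes the model answers in the same numbered format as the combined prompt ("1. ...", "2. ..."), which is not guaranteed; `split_numbered_answers` is a hypothetical helper, and missing numbers become empty strings so positions stay aligned with the questions.

```python
import re


def split_numbered_answers(text: str, expected: int) -> list[str]:
    """Split a reply of the form '1. ...' / '2. ...' into a list of answers."""
    # Capture each number and the text up to the next numbered line (or the end).
    pattern = re.compile(
        r"^\s*(\d+)[.)]\s+(.*?)(?=^\s*\d+[.)]\s|\Z)",
        re.DOTALL | re.MULTILINE,
    )
    answers = {int(num): body.strip() for num, body in pattern.findall(text)}
    return [answers.get(i, "") for i in range(1, expected + 1)]
```

If an entry comes back empty, re-asking that single question is usually cheaper than repeating the whole batch.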

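Performance Optimization

5. Use Streaming for UX

Streaming improves perceived performance because users see the first tokens as soon as they are generated instead of waiting for the full response: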

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a long essay"}],
    stream=True
)

for chunk in stream:
    # Some chunks may carry no choices, so check before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

6. Choose Fast Models for Interactive Use

Use case	Recommended	Latency
Chat UI	`gpt-5-mini`, `gemini-3.5-flash`	~200 ms to first token
Tab completion	`claude-haiku-4-5`	~150 ms to first token
Background processing	`gpt-4o`, `claude-sonnet-4-6`	~500 ms to first token

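7. Set Timeouts

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOKENLAB_API_KEY"],  # Read the key from the environment (see section 11)
    base_url="https://api.tokenlab.sh/v1",
    timeout=60.0  # 60-second timeout
)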

Reliability

8. Implement Retries

import time
from openai import RateLimitError, APIError

def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the rate-limit error
            wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
        except APIError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)
    raise RuntimeError("Max retries exceeded")
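
When many clients are rate limited at the same moment, a fixed `2 ** attempt` wait makes them all retry together. A common refinement is to add random jitter and cap the delay. The sketch below shows a "full jitter" variant; `backoff_delay` and its default `base` and `cap` values are assumptions for illustration, not TokenLab recommendations.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, upper)
```

In a retry loop, `time.sleep(backoff_delay(attempt))` would replace the fixed wait.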

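9. Handle Errors Gracefully

from openai import APIStatusError, AuthenticationError, RateLimitError

# notify_admin, add_to_queue, schedule_retry, and request are
# application-specific placeholders
try:
    response = client.chat.completions.create(...)
except AuthenticationError:
    # Check API key
    notify_admin("Invalid API key")
except RateLimitError:
    # Queue for later or use backup
    add_to_queue(request)
except APIStatusError as e:
    # APIStatusError carries the HTTP status code of the failed response
    if e.status_code == 402:
        notify_admin("Balance low")
    elif e.status_code >= 500:
        # Server error, retry later
        schedule_retry(request)
    else:
        raise  # Unexpected status: do not swallow it silently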
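
The same decisions can be written as a pure function of the HTTP status code, which is easier to unit test than exception handlers. This is a sketch: the action labels are hypothetical, the 402 case follows the low-balance handling above, and 401 and 429 are the standard HTTP codes for authentication failure and rate limiting.

```python
def action_for_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse handling decision."""
    if status_code == 401:
        return "check_api_key"   # authentication failed
    if status_code == 402:
        return "top_up_balance"  # balance too low
    if status_code == 429 or status_code >= 500:
        return "retry_later"     # rate limit or server-side error
    return "fail"                # other client errors are not retryable
```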

10. Use Fallback Models

from openai import APIError

FALLBACK_CHAIN = ["gpt-4o", "claude-sonnet-4-6", "gemini-3.5-flash"]

def chat_with_fallback(messages):
    for model in FALLBACK_CHAIN:
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except APIError:
            continue  # Try the next model in the chain
    raise RuntimeError("All models failed")
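
When every model in the chain fails, it helps to know why each one failed and which model eventually answered. The sketch below is a generic variant that records each error and returns the winning model name; `first_success` and the `call` parameter are hypothetical, and real code would catch the SDK's error types rather than `Exception`.

```python
from typing import Callable, Sequence


def first_success(models: Sequence[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Return (model, result) from the first model whose call succeeds."""
    errors: list[str] = []
    for model in models:
        try:
            return model, call(model)
        except Exception as exc:  # sketch only: narrow this to SDK errors in real code
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

Logging the returned model name makes it visible when traffic is silently running on a fallback.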

Security

11. Protect API Keys

# ❌ Never hardcode keys
client = OpenAI(api_key="sk-abc123...")

# ✅ Use environment variables
import os
client = OpenAI(api_key=os.environ["TOKENLAB_API_KEY"])
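
`os.environ[...]` raises a bare `KeyError` when the variable is missing, which can be confusing at startup. A small loader can fail with a clearer message instead. `load_api_key` is a hypothetical helper; the variable name `TOKENLAB_API_KEY` is the one used above.

```python
import os
from typing import Mapping


def load_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "TOKENLAB_API_KEY",
) -> str:
    """Read the API key from the environment, failing with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the application")
    return key
```

Passing `env` as a parameter keeps the function testable without touching the real environment.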

12. Validate User Input

def validate_message(content: str) -> bool:
    """Validate user input before sending to API."""
    if len(content) > 100000:
        raise ValueError("Message too long")
    # Add other validation as needed
    return True
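
The "other validation" might include rejecting empty messages and stripping control characters before the length check. The sketch below is one possible shape under those assumptions; `clean_message` is a hypothetical helper, and the 100,000-character default mirrors the limit above.

```python
def clean_message(content: str, max_chars: int = 100_000) -> str:
    """Strip control characters, then reject empty or oversized input."""
    if not isinstance(content, str):
        raise TypeError("Message must be a string")
    # Keep newlines and tabs; drop other non-printable characters.
    cleaned = "".join(ch for ch in content if ch in "\n\t" or ch.isprintable())
    cleaned = cleaned.strip()
    if not cleaned:
        raise ValueError("Message is empty")
    if len(cleaned) > max_chars:
        raise ValueError("Message too long")
    return cleaned
```

Returning the cleaned string, rather than a boolean, lets the caller send exactly what was validated.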

13. Set API Key Limits

Create separate API keys, each with its own spending limit, for:

Development/testing
Production
Different applications

Monitoring

14. Track Usage

Regularly check the following in your dashboard:

Token usage by model
Cost breakdown
Cache hit rate
Error rate

15. Log Important Metrics

import logging

# The root logger defaults to WARNING, so enable INFO to see these records
logging.basicConfig(level=logging.INFO)

response = client.chat.completions.create(...)

logging.info({
    "model": response.model,
    "prompt_tokens": response.usage.prompt_tokens,
    "completion_tokens": response.usage.completion_tokens,
    "total_tokens": response.usage.total_tokens,
})
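
Per-request log lines are easier to act on when they are also aggregated per model. The sketch below keeps running totals in memory; `UsageTracker` is a hypothetical class, and a production setup would more likely export these numbers to a metrics system.

```python
from collections import defaultdict


class UsageTracker:
    """Accumulate request counts and token usage per model."""

    def __init__(self) -> None:
        self._totals = defaultdict(
            lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0}
        )

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> None:
        entry = self._totals[model]
        entry["requests"] += 1
        entry["prompt_tokens"] += prompt_tokens
        entry["completion_tokens"] += completion_tokens

    def summary(self) -> dict:
        # Return plain dict copies so callers cannot mutate the running totals.
        return {model: dict(entry) for model, entry in self._totals.items()}
```

After each call, `tracker.record(response.model, response.usage.prompt_tokens, response.usage.completion_tokens)` would feed it from the same fields that are logged above.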

16. Set Up Alerts

Configure low-balance alerts in the dashboard to prevent service interruptions.

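Checklist

Cost Optimization

Right model used for each task
max_tokens limits set
Prompts are concise
Caching enabled where appropriate
Similar requests batched

Performance

Streaming for interactive UX
Fast models for real-time use
Timeouts configured

Reliability

Retry logic implemented
Error handling in place
Fallback models configured

Security

API keys stored in environment variables
Input validated
Separate keys for dev/prod
Spending limits set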
