Best Practices

1. Use Appropriate Models

Choose a model based on your use case: smaller models (3B-7B parameters) are faster and cheaper and handle simple tasks such as classification and extraction well, while larger models (13B+) produce higher-quality output for complex reasoning.
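
If your application mixes task types, route each request to the cheapest model tier that can handle it. A minimal sketch in Python; the task labels and model identifiers are hypothetical placeholders for whatever your provider offers:

```python
def pick_model(task: str) -> str:
    """Route simple tasks to a small model and everything else to a large one."""
    simple_tasks = {"classify", "extract", "short_summary"}  # hypothetical task labels
    # "small-7b" and "large-13b" are placeholder names, not real model identifiers.
    return "small-7b" if task in simple_tasks else "large-13b"
```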

2. Implement Retry Logic

Retry transient failures (network errors, 5xx responses) with exponential backoff plus jitter so clients don't retry in lockstep. For rate-limit errors (HTTP 429), honor the Retry-After header rather than guessing a delay.
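
A minimal sketch using the requests library, assuming a JSON POST endpoint and a Retry-After value expressed in seconds (the header may also be an HTTP date, which this sketch doesn't handle):

```python
import random
import time

import requests


def post_with_retries(url: str, payload: dict, max_retries: int = 5) -> requests.Response:
    """POST with exponential backoff; honor Retry-After on 429 responses."""
    resp = None
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code != 429 and resp.status_code < 500:
            return resp  # success, or a client error that retrying won't fix
        retry_after = resp.headers.get("Retry-After")
        if resp.status_code == 429 and retry_after is not None:
            delay = float(retry_after)              # server told us how long to wait
        else:
            delay = 2 ** attempt + random.random()  # exponential backoff with jitter
        time.sleep(delay)
    resp.raise_for_status()  # retries exhausted: surface the last error
    return resp
```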

3. Stream for Long Responses

Enable streaming so tokens reach the user as they are generated instead of after the full completion finishes. Total generation time is unchanged, but perceived latency drops sharply, which matters most for long responses.
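
A sketch of consuming a server-sent-events style stream with requests. The "stream": true request flag, the "data: " line prefix, the [DONE] sentinel, and the chunk shape are assumptions modeled on common streaming APIs; check your provider's documentation:

```python
import json

import requests


def stream_completion(url: str, payload: dict):
    """Yield text chunks from an SSE-style streaming completions endpoint."""
    body = {**payload, "stream": True}  # request streaming; flag name is an assumption
    with requests.post(url, json=body, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip keep-alives and blank lines
            data = line[len(b"data: "):]
            if data == b"[DONE]":  # common end-of-stream sentinel
                break
            chunk = json.loads(data)
            # The chunk shape below is an assumption; adjust to your API's schema.
            yield chunk["choices"][0].get("text", "")


# Usage: print tokens as they arrive rather than waiting for the full response.
# for piece in stream_completion(API_URL, {"model": "small-7b", "prompt": "Hi"}):
#     print(piece, end="", flush=True)
```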

4. Cache Embeddings

Embedding the same text with the same model always returns the same vector, so recomputing it is wasted spend. Cache embeddings, keyed by model and input text, to cut API calls and cost.
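
A minimal in-memory sketch; embed_fn stands in for whatever function calls your embeddings endpoint, and a production version would use a persistent store (Redis, SQLite) instead of a dict:

```python
import hashlib

_cache: dict[str, list[float]] = {}


def cached_embedding(text: str, model: str, embed_fn) -> list[float]:
    """Return a cached embedding, calling the API only on a cache miss."""
    # Key on model + text: the same text embeds differently under different models.
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text, model)  # one API call per unique input
    return _cache[key]
```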

5. Monitor Usage

Track token usage and costs in the console, and log per-request usage in your application as well; knowing which prompts consume the most tokens shows where shorter prompts, caching, or a smaller model will save the most.
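
A small sketch that accumulates token counts from responses. The "usage" field names follow a common response shape and are an assumption; check your provider's actual schema:

```python
from collections import Counter

usage_totals: Counter = Counter()


def record_usage(response_json: dict, label: str = "default") -> None:
    """Accumulate prompt and completion token counts per feature label."""
    usage = response_json.get("usage", {})  # field names are an assumed schema
    usage_totals[f"{label}:prompt_tokens"] += usage.get("prompt_tokens", 0)
    usage_totals[f"{label}:completion_tokens"] += usage.get("completion_tokens", 0)


# Usage: record_usage(resp.json(), label="search"); inspect usage_totals periodically.
```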