Best Practices
1. Use Appropriate Models
Choose models based on your use case. Smaller models (3B-7B) are faster and cheaper and work well for simple tasks such as classification or extraction, while larger models (13B+) provide better quality for complex reasoning.
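A minimal routing sketch; the model identifiers (`small-3b`, `large-13b`) and task categories below are placeholders, not names from this API:

```python
# Placeholder task categories that a small model handles well.
SIMPLE_TASKS = {"classification", "extraction", "summarization"}

def choose_model(task: str) -> str:
    """Route simple tasks to a small, cheap model and
    complex reasoning to a larger one."""
    if task in SIMPLE_TASKS:
        return "small-3b"   # placeholder id for a 3B-class model
    return "large-13b"      # placeholder id for a 13B+ model

print(choose_model("classification"))  # -> small-3b
print(choose_model("reasoning"))       # -> large-13b
```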
2. Implement Retry Logic
Retry transient failures (HTTP 429 and 5xx) with exponential backoff and jitter. When a rate-limit response includes a Retry-After header, wait at least that long before retrying.
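A sketch using the `requests` library; the endpoint URL and payload shape are placeholders, and the code assumes Retry-After is given in seconds (it may also be an HTTP date):

```python
import random
import time

import requests

def post_with_retries(url: str, payload: dict, max_retries: int = 5) -> requests.Response:
    """POST with exponential backoff plus jitter.

    Honors the Retry-After header on 429 responses and retries
    5xx errors as transient failures.
    """
    for attempt in range(max_retries):
        resp = requests.post(url, json=payload, timeout=30)
        if resp.status_code < 500 and resp.status_code != 429:
            return resp  # success or a non-retryable client error
        # Base delay doubles each attempt (1s, 2s, 4s, ...) plus jitter.
        delay = 2 ** attempt + random.uniform(0, 1)
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            # Respect the server's hint if it asks for a longer wait.
            delay = max(delay, float(retry_after))
        time.sleep(delay)
    raise RuntimeError(f"Giving up after {max_retries} attempts")
```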
3. Stream for Long Responses
Enable streaming so tokens are displayed as they are generated rather than after the full completion finishes. This reduces perceived latency, especially for longer completions.
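A sketch of consuming a streamed response with `requests`; the endpoint URL, the `data:`-prefixed server-sent-events format, the `[DONE]` sentinel, and the chunk field names are assumptions based on common LLM APIs, not confirmed for this one:

```python
import json

import requests

# Placeholder endpoint; substitute your provider's completions URL.
URL = "https://api.example.com/v1/completions"

def stream_completion(prompt: str) -> None:
    """Print tokens as they arrive instead of waiting for the
    full completion (assumes an SSE stream of JSON chunks)."""
    payload = {"prompt": prompt, "stream": True}
    with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":  # assumed end-of-stream sentinel
                break
            chunk = json.loads(data)
            # Field name is an assumption; adjust to your API's schema.
            print(chunk.get("text", ""), end="", flush=True)
```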
4. Cache Embeddings
Embeddings are deterministic for a given model and input text. Cache them, keyed by model and text, to reduce API calls and costs.
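A sketch of an in-memory cache keyed by a hash of the model and text; `get_embedding` is a hypothetical stand-in for the real embeddings call, and the model id is a placeholder:

```python
import hashlib

# In-memory cache; swap for Redis or disk storage in production.
_cache: dict[str, list[float]] = {}

def get_embedding(text: str) -> list[float]:
    """Hypothetical stand-in for the actual embeddings API call."""
    raise NotImplementedError

def cached_embedding(text: str, model: str = "embed-v1") -> list[float]:
    """Return a cached embedding when available. Embeddings are
    deterministic per (model, text), so entries never go stale."""
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = get_embedding(text)
    return _cache[key]
```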
5. Monitor Usage
Track your token usage and costs in the console to spot expensive requests and tune prompts or model choice accordingly.
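A sketch of client-side tracking to complement the console, assuming responses carry a `usage` object with `prompt_tokens` and `completion_tokens` counts (a common API convention, not confirmed here):

```python
from collections import Counter

usage_totals: Counter = Counter()

def record_usage(response: dict, model: str) -> None:
    """Accumulate token counts per model from each API response.

    The `usage` field names are assumptions; adjust them to
    match your API's actual response schema."""
    usage = response.get("usage", {})
    usage_totals[f"{model}:prompt"] += usage.get("prompt_tokens", 0)
    usage_totals[f"{model}:completion"] += usage.get("completion_tokens", 0)

# Example: record one response, then inspect the running totals.
record_usage({"usage": {"prompt_tokens": 120, "completion_tokens": 450}}, "large-13b")
print(dict(usage_totals))  # {'large-13b:prompt': 120, 'large-13b:completion': 450}
```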