# Batch Evaluation for Cost Optimization

When running large-scale evaluations, cost can be a significant factor. Ragas now supports OpenAI's Batch API, which offers **up to 50% cost savings** compared to regular API calls, making it ideal for non-urgent evaluation workloads.

## What is Batch Evaluation?

OpenAI's Batch API allows you to submit multiple requests for asynchronous processing at half the cost of synchronous requests. Batch jobs are processed within 24 hours and have separate rate limits, making them perfect for large-scale evaluations where immediate results aren't required.
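
Under the hood, a batch job is a JSONL file of chat-completion requests: you upload the file, create a batch against it, poll until it finishes, and download the results. The sketch below shows that raw flow with the `openai` client purely for context; the file name and prompt are illustrative, and the batch evaluator described in this guide is intended to take care of these steps for you.

```python
# Minimal sketch of the raw OpenAI Batch API flow (for context only).
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one request per line to a JSONL file
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Say hello."}],
        },
    }
]
with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# 2. Upload the file and create the batch job
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll for completion, then download the output file
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    output = client.files.content(batch.output_file_id)
    print(output.text)
```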

### Key Benefits

- **50% Cost Savings** on both input and output tokens (see the quick arithmetic below)
- **Higher Rate Limits** that don't interfere with real-time usage
- **Guaranteed Processing** within 24 hours (often much sooner)
- **Large Scale Support** up to 50,000 requests per batch
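
To make the 50% figure concrete, here is a quick back-of-the-envelope calculation. It assumes gpt-4o-mini list prices at the time of writing ($0.15 per 1M input tokens and $0.60 per 1M output tokens, halved for batch) and made-up token counts, so treat it as an illustration rather than a quote.

```python
# Illustrative savings estimate; prices and token counts are assumptions.
input_tokens = 2_000_000   # e.g. 1,000 samples x ~2,000 prompt tokens
output_tokens = 250_000    # e.g. 1,000 samples x ~250 completion tokens

regular_cost = input_tokens / 1e6 * 0.15 + output_tokens / 1e6 * 0.60
batch_cost = input_tokens / 1e6 * 0.075 + output_tokens / 1e6 * 0.30

print(f"Regular API: ${regular_cost:.2f}")  # ~$0.45
print(f"Batch API:   ${batch_cost:.2f}")    # ~$0.23, roughly half
```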

## Quick Start

### Basic Batch Evaluation

```python
import os
from ragas.batch_evaluation import BatchEvaluator, estimate_batch_cost_savings
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

# Ensure you have your OpenAI API key set
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Setup LLM with batch support (automatically detected)
llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
faithfulness = Faithfulness(llm=llm)

# Prepare your evaluation samples
samples = [
    SingleTurnSample(
        user_input="What is the capital of France?",
        response="The capital of France is Paris.",
        retrieved_contexts=["Paris is the capital city of France."]
    ),
    # ... more samples
]

# Create batch evaluator
evaluator = BatchEvaluator(metrics=[faithfulness])

# Run batch evaluation (blocks until completion)
results = evaluator.evaluate(samples, wait_for_completion=True)

# Check results
for result in results:
    print(f"Metric: {result.metric_name}")
    print(f"Job ID: {result.job_id}")
    print(f"Success Rate: {result.success_rate:.2%}")
    print(f"Sample Count: {result.sample_count}")
```

### Cost Estimation

Before running batch evaluations, you can estimate your cost savings:

```python
from ragas.batch_evaluation import estimate_batch_cost_savings

# Estimate costs for 1000 samples
cost_info = estimate_batch_cost_savings(
    sample_count=1000,
    metrics=[faithfulness],
    regular_cost_per_1k_tokens=0.00015, # gpt-4o-mini input: $0.15 per 1M tokens = $0.00015 per 1K
    batch_discount=0.5 # 50% savings
)

print(f"Regular API Cost: ${cost_info['regular_cost']}")
print(f"Batch API Cost: ${cost_info['batch_cost']}")
print(f"Total Savings: ${cost_info['savings']} ({cost_info['savings_percentage']}%)")
```

### Asynchronous Batch Evaluation

For non-blocking operations, use async evaluation:

```python
import asyncio

async def run_batch_evaluation():
    evaluator = BatchEvaluator(metrics=[faithfulness])

    # Submit jobs without waiting
    results = await evaluator.aevaluate(
        samples=samples,
        wait_for_completion=False # Don't block
    )

    # Jobs are submitted, check back later
    for result in results:
        print(f"Submitted job {result.job_id} for {result.metric_name}")

# Run async evaluation
asyncio.run(run_batch_evaluation())
```
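
When you submit without waiting, you need a way to check on the jobs later. One approach, assuming `result.job_id` is the id of the underlying OpenAI batch (verify this against your ragas version), is to poll the batch directly with the OpenAI client:

```python
# Check on previously submitted jobs. Assumes the stored job ids are
# OpenAI batch ids -- confirm this for your ragas version before relying on it.
from openai import OpenAI

client = OpenAI()

submitted_job_ids = ["batch_abc123"]  # placeholder: ids you saved at submission time

for job_id in submitted_job_ids:
    batch = client.batches.retrieve(job_id)
    print(f"{job_id}: {batch.status}")  # e.g. validating, in_progress, completed
```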

## Checking Batch Support

Not all LLMs support batch evaluation. Here's how to check:

```python
# Check if metric supports batch evaluation
if faithfulness.supports_batch_evaluation():
    print(f"✅ {faithfulness.name} supports batch evaluation")
else:
    print(f"❌ {faithfulness.name} requires regular API")

# Check LLM batch support
if llm.supports_batch_api():
    print("✅ LLM supports batch processing")
else:
    print("❌ LLM does not support batch processing")
```

## Supported Models

Currently, batch evaluation is supported for:
- OpenAI models (ChatOpenAI, AzureChatOpenAI)
- Metrics that use these LLMs and have batch support (see below)

### Supported Metrics

- ✅ Faithfulness (partial support)
- 🔄 More metrics coming soon...

Metrics that do not yet support batch evaluation automatically fall back to regular API calls.
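
If you prefer to route metrics yourself rather than rely on the automatic fallback (for example, to keep batch and regular results separate), you can split them with the same `supports_batch_evaluation()` check shown above. In this sketch, `faithfulness`, `samples`, and `eval_dataset` come from the other examples in this guide, and `other_metric` is a placeholder for any additional metric you have configured:

```python
# Optional: split metrics into batch-capable and regular groups yourself.
from ragas import evaluate
from ragas.batch_evaluation import BatchEvaluator

metrics = [faithfulness, other_metric]  # other_metric is a placeholder

batch_metrics = [m for m in metrics if m.supports_batch_evaluation()]
regular_metrics = [m for m in metrics if not m.supports_batch_evaluation()]

if batch_metrics:
    batch_results = BatchEvaluator(metrics=batch_metrics).evaluate(samples)
if regular_metrics:
    regular_results = evaluate(dataset=eval_dataset, metrics=regular_metrics)
```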

## Configuration Options

### BatchEvaluator Parameters

```python
evaluator = BatchEvaluator(
    metrics=metrics,
    max_batch_size=1000, # Max samples per batch
    poll_interval=300.0, # Status check interval (5 minutes)
    timeout=86400.0 # Max wait time (24 hours)
)
```

### Custom Metadata

Add metadata to track your batch jobs:

```python
results = evaluator.evaluate(
    samples=samples,
    metadata={
        "experiment": "model_comparison",
        "version": "v1.0",
        "dataset": "production_qa"
    }
)
```

## Best Practices

### When to Use Batch Evaluation

✅ **Ideal for:**
- Large-scale evaluations (100+ samples)
- Non-urgent evaluation workloads
- Cost optimization scenarios
- Regular evaluation pipelines

❌ **Avoid for:**
- Real-time evaluation needs
- Interactive applications
- Small datasets (<50 samples)
- Time-sensitive workflows

### Optimization Tips

1. **Batch Size**: Use 1000-5000 samples per batch for optimal performance
2. **Model Selection**: Use cost-effective models like `gpt-4o-mini`
3. **Concurrent Processing**: Submit multiple metrics simultaneously
4. **Monitoring**: Set up logging for long-running jobs, as in the snippet and combined example below

```python
import logging

# Enable batch evaluation logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('ragas.batch_evaluation')
```
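
Putting these tips together, a large non-urgent run might be configured as follows. The numbers are illustrative rather than requirements, and `samples` is the list prepared in the Quick Start example:

```python
# Illustrative configuration for a large, non-urgent evaluation run.
import logging

from langchain_openai import ChatOpenAI
from ragas.batch_evaluation import BatchEvaluator
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

logging.basicConfig(level=logging.INFO)  # tip 4: keep long-running jobs visible

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # tip 2: cost-effective model

evaluator = BatchEvaluator(
    metrics=[Faithfulness(llm=llm)],  # tip 3: submit all batch-capable metrics together
    max_batch_size=5000,              # tip 1: a few thousand samples per batch
    poll_interval=600.0,              # check status every 10 minutes
)
results = evaluator.evaluate(samples, wait_for_completion=True)
```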

## Error Handling

```python
try:
    results = evaluator.evaluate(samples)

    for result in results:
        if result.errors:
            print(f"❌ Errors in {result.metric_name}:")
            for error in result.errors:
                print(f"  - {error}")
        else:
            print(f"✅ {result.metric_name}: {result.success_rate:.2%} success")

except Exception as e:
    print(f"Batch evaluation failed: {e}")
```

## Low-Level Batch API

For advanced use cases, you can use the low-level batch API directly:

```python
from ragas.llms.batch_api import create_batch_api, BatchRequest
from openai import OpenAI

# Direct batch API usage
client = OpenAI()
batch_api = create_batch_api(client)

# Create custom requests
requests = [
    BatchRequest(
        custom_id="eval-1",
        body={
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": "Evaluate this response..."}]
        }
    )
]

# Submit batch job
batch_job = batch_api.create_batch(requests)
print(f"Batch job created: {batch_job.batch_id}")

# Monitor progress
status = batch_job.get_status()
print(f"Status: {status.value}")

# Retrieve results when complete
if status.value == "completed":
    results = batch_job.get_results()
    for result in results:
        print(f"Response for {result.custom_id}: {result.response}")
```
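
For more than a handful of requests you would typically build the request list from your samples instead of writing it by hand. Here is a sketch using the same `BatchRequest` fields and `batch_api` object as above; the prompt template is a stand-in for illustration, not the prompt ragas constructs internally:

```python
# Build one request per evaluation sample (prompt text is illustrative only).
requests = [
    BatchRequest(
        custom_id=f"eval-{i}",
        body={
            "model": "gpt-4o-mini",
            "messages": [
                {
                    "role": "user",
                    "content": (
                        f"Question: {sample.user_input}\n"
                        f"Answer: {sample.response}\n"
                        "Is the answer supported by the retrieved context? Reply yes or no."
                    ),
                }
            ],
        },
    )
    for i, sample in enumerate(samples)
]

batch_job = batch_api.create_batch(requests)
print(f"Submitted {len(requests)} requests in batch {batch_job.batch_id}")
```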

## Troubleshooting

### Common Issues

**Issue**: "Batch API not supported for this LLM"
```python
# Solution: Use an OpenAI-based LLM
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
```

**Issue**: "Metric does not support batch evaluation"
```python
# Solution: Check metric support or wait for future updates
if not metric.supports_batch_evaluation():
    print(f"Metric {metric.name} will use regular API")
```

**Issue**: Timeout waiting for batch completion
```python
# Solution: Use non-blocking evaluation or increase timeout
results = evaluator.evaluate(
    samples,
    wait_for_completion=False # Don't wait
)
# Or increase timeout
evaluator = BatchEvaluator(metrics=[faithfulness], timeout=172800.0) # 48 hours
```

## Migration from Regular Evaluation

Converting existing evaluations to use batch processing is simple:

### Before (Regular API)
```python
from ragas import evaluate
from ragas.metrics import Faithfulness

results = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness(llm=llm)]
)
```

### After (Batch API)
```python
from ragas.batch_evaluation import BatchEvaluator
from ragas.metrics import Faithfulness

# Convert dataset to samples if needed
samples = [sample for sample in eval_dataset]

evaluator = BatchEvaluator(metrics=[Faithfulness(llm=llm)])
results = evaluator.evaluate(samples)
```

The batch API provides significant cost savings while maintaining the same evaluation quality, making it an excellent choice for large-scale evaluation workloads.