Scientific Benchmarks
Rigorous evaluation of ArcRouter's multi-model consensus approach against single-model baselines. All results are reproducible, auto-graded, and include statistical confidence intervals.
Benchmark Run: Completed February 21, 2026. Tested 4 models across 6 datasets (172 total test cases). Grading improvements include word-to-number normalization, multiple-choice extraction, Python code execution sandbox, and bootstrap confidence intervals (1000 resamples).
Complete Results
Comprehensive benchmark across 6 datasets. Confidence intervals computed via bootstrap resampling (95% CI, 1000 samples).
| Dataset (cases) | ArcRouter Free | ArcRouter Paid | GPT-4o-mini | Claude Opus 4.6 |
|---|---|---|---|---|
| factual_custom (25) | 100.0% ±0.0 | 88.0% ±14.0 | 100.0% ±0.0 | 100.0% ±0.0 |
| factual_hard (35) | 96.6% ±5.2 | 0.0% (FAIL) | 97.1% ±4.3 | 100.0% ±0.0 |
| gsm8k_sample (22) | 100.0% | 90.0% ±15.0 | 90.0% ±15.0 | 100.0% ±0.0 |
| gsm8k_hard (25) | 100.0% ±0.0 | 0.0% (FAIL) | 100.0% ±0.0 | 100.0% ±0.0 |
| mmlu_subset (50) | 35.3% ±16.2 | 0.0% (FAIL) | 80.0% ±12.0 | 44.0% ±14.0 |
| humaneval_sample (10) | 100.0% | 0.0% (FAIL) | 50.0% ±30.0 | 50.0% ±30.0 |
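| AVERAGE | 88.6% | 89.0%* (2/6) | 86.2% | 82.3% |

Note (*): the paid-tier average covers only the 2 of 6 datasets on which it returned results; the other four failed entirely (see Known Issues, P0).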
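As a minimal sketch of the percentile bootstrap described above (1000 resamples, 95% CI), the function below resamples a list of 0/1 grading outcomes with replacement and reads the interval off the sorted resample means. The function name, the fixed seed, and the index rounding are assumptions made for illustration; the actual benchmark script may differ.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for accuracy over 0/1 grading outcomes.

    Returns (point_estimate, lower_bound, upper_bound) as fractions in [0, 1].
    """
    if not outcomes:
        raise ValueError("need at least one graded case")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(outcomes)
    # Each resample draws n cases with replacement and records its accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * (n_resamples - 1))]
    upper = means[int((1.0 - alpha) * (n_resamples - 1))]
    return sum(outcomes) / n, lower, upper


# Hypothetical dataset: 20 correct answers out of 25 graded cases.
point, lower, upper = bootstrap_accuracy_ci([1] * 20 + [0] * 5)
print(f"accuracy={point:.1%}, 95% CI=[{lower:.1%}, {upper:.1%}]")
```

When every case is graded correct, every resample is also all-correct, so the interval collapses to a single point. That matches the ±0.0 entries in the table.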
Quality per Dollar
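On this run, the ArcRouter free tier supports the core thesis that free or cheap models combined by consensus can match or exceed expensive single models: it averaged 88.6% across the six datasets, against 86.2% for GPT-4o-mini and 82.3% for Claude Opus 4.6.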
| Model | Avg Accuracy | Cost/Request | Quality/$ (accuracy % ÷ cost) |
|---|---|---|---|
| ArcRouter Free | 88.6% | $0.000 | ∞ |
| GPT-4o-mini | 86.2% | ~$0.00015 | 574,667 |
| Claude Opus 4.6 | 82.3% | ~$0.015 | 5,487 |
| ArcRouter Paid | BROKEN | $0.002 | — |
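The quality-per-dollar column can be reproduced with a one-line ratio. The sketch below assumes the metric is average accuracy in percentage points divided by cost per request, and that a zero cost maps to infinity; the function name is hypothetical and the source does not spell out the formula.

```python
import math


def quality_per_dollar(accuracy_pct, cost_per_request):
    """Accuracy (percentage points) per dollar of request cost.

    Assumption: a free request (cost 0) is reported as infinity.
    """
    if cost_per_request < 0:
        raise ValueError("cost must be non-negative")
    if cost_per_request == 0:
        return math.inf
    return accuracy_pct / cost_per_request


# Inputs taken from the table above.
print(round(quality_per_dollar(82.3, 0.015)))  # Claude Opus 4.6
print(quality_per_dollar(88.6, 0.0))           # ArcRouter Free
```

Because the per-request costs are approximate, the ratio is best read as an order-of-magnitude comparison rather than a precise figure.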
ArcRouter Free achieved 100% accuracy on HumanEval code generation (10 problems), while GPT-4o-mini and Claude Opus both scored 50%.
MMLU STEM (50 multiple-choice questions) was the weakest dataset for ArcRouter Free and Claude Opus. GPT-4o-mini led with 80.0%, while ArcRouter Free (35.3%) and Claude Opus (44.0%) both scored below 45%.
Confidence Calibration
Calibration metrics measure whether ArcRouter's confidence scores are trustworthy. Free tier data based on 127 completed test cases (45 failed due to API errors).
Range across datasets. Lower is better. The model tends to be underconfident (for example, it reports 80% confidence on answers that are 100% accurate).
Range across datasets. Lower is better. Measures probabilistic prediction quality.
Example: factual_custom Calibration
Based on 21 completed cases (4 API failures).
| Confidence | Cases | Accuracy | Interpretation |
|---|---|---|---|
| 0.2–0.4 | 2 | 100% | Underconfident (good) |
| 0.4–0.6 | 7 | 100% | Underconfident (good) |
| 0.6–0.8 | 2 | 100% | Slightly underconfident |
| 0.8–1.0 | 10 | 100% | Well-calibrated ✓ |
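A table like the one above can be built by bucketing each answer by its reported confidence and comparing per-bucket accuracy with mean confidence. The sketch below also computes two standard summary metrics, expected calibration error (ECE) and the Brier score. That these are the metrics ArcRouter reports is an assumption, as are the function name, the five equal-width bins, and the record layout.

```python
def calibration_report(records, n_bins=5):
    """Summarise calibration for (confidence, correct) pairs.

    records: list of (confidence in [0, 1], correct as bool).
    Returns (rows, ece, brier), where each row is
    (bin_low, bin_high, n_cases, accuracy, mean_confidence).
    """
    if not records:
        raise ValueError("need at least one record")
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, 1.0 if correct else 0.0))

    total = len(records)
    rows, ece = [], 0.0
    for i, members in enumerate(bins):
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # ECE: gap between accuracy and confidence, weighted by bin size.
        ece += len(members) / total * abs(accuracy - mean_conf)
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), accuracy, mean_conf))

    # Brier score: mean squared error of confidence against the 0/1 outcome.
    brier = sum((c - (1.0 if ok else 0.0)) ** 2 for c, ok in records) / total
    return rows, ece, brier


# Hypothetical records: all answers correct, with mixed confidence.
rows, ece, brier = calibration_report([(0.9, True)] * 10 + [(0.5, True)] * 7)
for low, high, n, acc, conf in rows:
    print(f"{low:.1f}-{high:.1f}: n={n}, accuracy={acc:.0%}, mean confidence={conf:.2f}")
```

In this example every answer is correct, so any confidence below 1.0 counts as underconfidence and raises both metrics.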
Methodology
Datasets
- factual_custom (25): Basic Q&A with ground truth
- factual_hard (35): Adversarial questions (hallucination traps, trick questions)
- gsm8k_sample (22): Standard math word problems
- gsm8k_hard (25): Multi-step math (3-5 reasoning steps)
- mmlu_subset (50): STEM multiple-choice (10 each: Bio, Chem, Physics, CS, Math)
- humaneval_sample (10): Python code generation with test execution
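For the code-generation dataset, grading by test execution can be sketched as below: the generated solution and its tests are written to a temporary file and run in a separate Python process with a timeout and no shell. The function name, the temp-file layout, and the `-I` isolated-mode flag are assumptions for illustration, not the benchmark's actual grader. Note that a subprocess with a timeout bounds runtime but does not restrict filesystem or network access.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution_code, test_code, timeout_s=5.0):
    """Return True if the solution plus its tests exit with status 0 in time."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: ignore user site and env vars
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,  # the child is killed when the timeout expires
                shell=False,        # argument list is passed directly, no shell parsing
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# A failing assertion makes the child exit non-zero, which counts as a fail.
print(run_candidate("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```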
Grading Improvements (Feb 21, 2026)
- Word-to-number normalization: "seven" → "7" (fixes false negatives)
- Multiple-choice extraction: Priority-ordered patterns to extract A/B/C/D from verbose responses
- Code execution sandbox: Python subprocess with 5s timeout, shell=false (security)
- Bootstrap confidence intervals: 1000 resamples, percentile method (95% CI)
- Council size tracking: Identifies single-model fallbacks (indicates flaky models)
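The first two grading steps in the list above can be sketched as small text helpers. The word list, the specific regular expressions, and their priority order are illustrative assumptions; the evaluator's actual patterns are not given in the source.

```python
import re

# Hypothetical vocabulary; a real grader would likely cover more number words.
_WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12", "thirteen": "13",
    "fourteen": "14", "fifteen": "15", "sixteen": "16", "seventeen": "17",
    "eighteen": "18", "nineteen": "19", "twenty": "20",
}
_NUMBER_WORD = re.compile(r"\b(" + "|".join(_WORD_TO_DIGIT) + r")\b", re.IGNORECASE)


def normalize_number_words(text):
    """Replace whole-word number names with digits, e.g. "seven" -> "7"."""
    return _NUMBER_WORD.sub(lambda m: _WORD_TO_DIGIT[m.group(1).lower()], text)


# Patterns are tried in order; the first one that matches wins.
_CHOICE_PATTERNS = [
    # 1. Explicit statement: "answer is (C)", "Final answer: B"
    re.compile(r"(?i:\b(?:final\s+)?answer\s*(?:is\s*:?|:))\s*\(?([A-D])\)?(?![A-Za-z])"),
    # 2. A line that contains only the letter: "B", "(B)", "B."
    re.compile(r"^\s*\(?([A-D])[).:]?\s*$", re.MULTILINE),
    # 3. Fallback: the first parenthesised letter anywhere in the response
    re.compile(r"\(([A-D])\)"),
]


def extract_choice(response):
    """Return the chosen option letter (A-D), or None if nothing matches."""
    for pattern in _CHOICE_PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None


print(normalize_number_words("The answer is seven"))
print(extract_choice("(A) looks wrong. The answer is (C)."))
```

Ordering the patterns from most to least explicit keeps a stray "(A)" in the reasoning from overriding a later, explicit "The answer is (C)".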
Model Configurations
- ArcRouter Free: 3-8 free models, Jaccard word-overlap consensus
- ArcRouter Paid: 3-5 cheap paid models, embedding-based semantic similarity
- GPT-4o-mini: openai/gpt-4o-mini via OpenRouter
- Claude Opus 4.6: anthropic/claude-opus-4-6 via OpenRouter
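To illustrate what Jaccard word-overlap consensus might look like for the free tier, the sketch below scores each response by its mean Jaccard similarity to the other responses and picks the highest. The tokenisation and the selection rule are assumptions; the source names only the similarity measure, not how the winner is chosen.

```python
import re


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pick_consensus(responses):
    """Return (index, score) of the response most similar to the others."""
    if not responses:
        raise ValueError("need at least one response")
    if len(responses) == 1:
        return 0, 1.0  # single-model fallback: there is nothing to compare against
    best_idx, best_score = 0, -1.0
    for i, response in enumerate(responses):
        others = [r for j, r in enumerate(responses) if j != i]
        score = sum(jaccard(response, other) for other in others) / len(others)
        if score > best_score:  # strict comparison keeps the earliest response on ties
            best_idx, best_score = i, score
    return best_idx, best_score


answers = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Berlin",
]
print(pick_consensus(answers))
```

Word overlap ignores word order and synonyms, which may be one reason the paid tier is described as using embedding-based similarity instead.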
Reproducibility
All benchmark code is open source at api/scripts/benchmark.ts. Each result records the git commit hash, dataset path, evaluator version, and API endpoint. To avoid HTTP 429 (rate-limit) errors, requests are spaced by a fixed delay: 2 s for the free tier, 1 s for the paid tier, and 0.5 s for OpenRouter.
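The fixed-delay pacing can be sketched as a simple loop. The delay values come from the text above; the function name, the `call` callback, and the error-recording format are hypothetical and do not reflect the actual runner.

```python
import time

# Per-target delays in seconds, as stated in the text.
DELAYS_S = {"free": 2.0, "paid": 1.0, "openrouter": 0.5}


def run_with_delay(cases, target, call, sleep=time.sleep):
    """Run call(case) for each case, pausing between consecutive requests.

    Failures are recorded rather than raised, so one bad request does not
    abort the whole dataset.
    """
    delay = DELAYS_S[target]
    results = []
    for i, case in enumerate(cases):
        if i > 0:
            sleep(delay)  # no pause is needed before the first request
        try:
            results.append(("ok", call(case)))
        except Exception as exc:  # e.g. an HTTP 500 surfaced by the client
            results.append(("error", repr(exc)))
    return results


# Example with a stand-in for the API call and a no-op sleep.
print(run_with_delay(["q1", "q2"], "openrouter", str.upper, sleep=lambda s: None))
```

Passing `sleep` as a parameter makes the pacing easy to test without waiting in real time.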
Known Issues
P0: Paid Tier API Failures
Status: Production blocker. The ArcRouter Paid tier (budget="low") fails with a 100% error rate on 4 of 6 datasets: factual_hard (0/35), gsm8k_hard (0/25), mmlu_subset (0/50), humaneval_sample (0/10). All requests return 500 Internal Server Error.
Hypothesis: Paid tier models may have stricter rate limits or the embedding API is hitting quota. Requires investigation of Worker logs and OpenRouter API responses.
P1: Free Tier API Reliability
Status: The free tier experienced a 16-36% API failure rate across datasets (45 failures out of 172 total attempts). The failures are primarily 500 Internal Server Errors during consensus processing.
Impact: Reduces effective sample size for calibration metrics. May indicate flaky free models timing out or returning malformed responses.
P2: MMLU Performance Gap
Status: ArcRouter Free (35.3%) and Claude Opus (44.0%) both significantly underperform GPT-4o-mini (80.0%) on MMLU STEM multiple-choice questions.
Hypothesis: Free tier models may lack STEM knowledge. Multiple-choice format may favor models trained on academic benchmarks (GPT-4o-mini). Requires analysis of per-category breakdown (Bio/Chem/Physics/CS/Math).
P3: Small Council Sizes
Status: The free tier shows high single-model fallback rates (48-56% on some datasets). Average council size is 1.4-1.7 models instead of the 3-5 target.
Impact: Reduces consensus value proposition when only 1-2 models respond. May be caused by aggressive timeouts, flaky free models, or model selection logic.
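The two figures quoted for this issue can be derived from per-request council sizes. The sketch below assumes each completed request logs how many models contributed to its answer; the function name and input format are hypothetical.

```python
def council_stats(council_sizes):
    """Return (mean council size, single-model fallback rate).

    council_sizes: one integer per completed request, giving the number of
    models that contributed a response.
    """
    if not council_sizes:
        raise ValueError("need at least one completed request")
    n = len(council_sizes)
    mean_size = sum(council_sizes) / n
    fallback_rate = sum(1 for size in council_sizes if size == 1) / n
    return mean_size, fallback_rate


# Hypothetical log for five requests.
mean_size, fallback_rate = council_stats([1, 1, 2, 3, 1])
print(f"mean council size={mean_size:.1f}, single-model fallback={fallback_rate:.0%}")
```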
Transparency Note: All benchmark results are reported as measured, including failures. This research is ongoing and will be updated as issues are resolved. Production deployment is blocked pending P0 fix (paid tier failures). Benchmark runner: api/scripts/benchmark-all.sh. Comparison tool: api/scripts/benchmark-compare.ts.