500+ trials. 15+ models. 5 benchmark versions. Brutally honest results.
The world's most rigorous autonomous AI market simulation benchmark — Updated July 2026
Most AI benchmarks publish flattering numbers. We publish the truth — including every mistake we made along the way. Here's the complete evolution of our benchmark, from a broken prototype to a rigorous scientific tool.
After 500+ trials across 5 benchmark versions, here is what the data actually shows:
In plain language: AI models cannot reliably predict market direction. They cannot beat buy-and-hold in trending markets. But they can preserve capital in crashes, and they can manage risk better than a passive strategy in bear markets. The AI's real skill is not losing money — not making it.
Five benchmark versions taught us which metrics actually tell you something, and which are noise.
V6.1 improvements over V6.0: Direction-neutral system prompt (no bias toward long/short), overtrading penalty, position sizing guidance, fee-adjusted scoring component. 30 steps per trial, 3 trials per regime (bull/bear/neutral/volatile).
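For concreteness, here is the trial matrix as code. This is a minimal sketch; the names and the `run_trial` callback are ours, not the benchmark's actual API:

```python
# The V6.1 trial matrix: 4 regimes x 3 trials x 30 steps.
REGIMES = ["bull", "bear", "neutral", "volatile"]
TRIALS_PER_REGIME = 3
STEPS_PER_TRIAL = 30

def run_benchmark(run_trial):
    """run_trial(regime, trial_idx, n_steps) -> per-trial result dict."""
    results = []
    for regime in REGIMES:
        for trial_idx in range(TRIALS_PER_REGIME):
            results.append(run_trial(regime, trial_idx, STEPS_PER_TRIAL))
    return results  # 12 trials x 30 steps = 360 decision points
```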
| Regime | Secret Return | B&H Return | Alpha | Dir. Accuracy |
|---|---|---|---|---|
| Bull | +0.44% | +8.12% | -7.68% | 58% |
| Bear | -4.58% | -10.03% | +5.45% | 50% |
| Neutral | -0.29% | +0.80% | -1.09% | 56% |
| Volatile | -0.48% | -3.75% | +3.27% | 63% |
93% hold rate (333/360 steps). One catastrophic bear trial: -13.72%. Beats B&H in bear/volatile but misses bull entirely.
| Regime | Secret Return | B&H Return | Alpha | Dir. Accuracy |
|---|---|---|---|---|
| Bull | -0.08% | +8.12% | -8.21% | 47% |
| Bear | -0.74% | -10.03% | +9.29% | 67% |
| Neutral | -0.10% | +0.80% | -0.90% | 55% |
| Volatile | -0.55% | -3.75% | +3.20% | 43% |
Short-biased: 57 shorts vs 32 longs. Bear alpha of +9.29% is impressive but bull alpha of -8.21% is catastrophic. Net: roughly break-even.
V6.0 had serious problems: the AI predicted direction worse than a coin flip (45%), overtrading destroyed returns through fees, and a strong short bias skewed all results. V6.1 fixed all three:
| Metric | V6.0 | V6.1 | Change |
|---|---|---|---|
| Direction Accuracy | 45.2% | 52.8% | +7.6 pts |
| Fees per Trial | $56.40 | $15.90 | -72% |
| Trades per Trial | 24.0 | 13.8 | -42% |
| Composite Score | 52.7 (C) | 64.6 (B) | +11.9 pts |
| Baselines Beaten | 0/4 | 3/4 | +3 |
| Secret Return | -2.07% | -0.37% | +1.7 pts |
How: Direction-neutral system prompt removed the short bias. Overtrading penalty reduced churn from 24 to 14 trades/trial. Fee-adjusted scoring made the AI think twice before trading. But direction accuracy at 53% is still barely above chance.
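We don't reproduce the full scoring formula here, but a minimal sketch shows how a fee-adjusted return component with an overtrading penalty can work. The thresholds and weights below are illustrative, not V6.1's actual values:

```python
def fee_adjusted_score(gross_return_pct, fees_usd, capital_usd,
                       n_trades, n_steps, trade_cap_ratio=0.5):
    """Hypothetical V6.1-style component (our sketch, not the actual
    formula): net out fees, then dock points for churn above an
    allowed trades-per-step budget."""
    net_return_pct = gross_return_pct - 100.0 * fees_usd / capital_usd
    allowed_trades = trade_cap_ratio * n_steps    # e.g. 15 of 30 steps
    excess_trades = max(0.0, n_trades - allowed_trades)
    return net_return_pct - 0.5 * excess_trades   # 0.5 pts per excess trade
```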
A benchmark is only as good as its anti-cheat guarantees. We audited every data pipeline in our benchmark for information leakage, score inflation, and exploitable loopholes. Here's what we found:
The AI receives "market outlook" summaries that were generated by our live platform with access to real-time news and sentiment. When replayed against historical data, these outlooks may contain forward-looking information the AI should not have.
The prompt shows prices like "$109,409" and the "BTC" symbol. LLMs have extensive training data on Bitcoin price history. The model could recognize the time period and recall memorized price patterns rather than analyzing the provided data.
30% of the composite score (risk management, parse reliability, memory quality, trading efficiency) is achievable by doing nothing. An AI that never trades gets Grade B by default. This rewards passivity, not intelligence.
An AI that returns -8% in bull and +9% in bear averages out to roughly +0.5%. It technically "beats buy-and-hold" despite failing catastrophically in the most important market condition. We report per-regime results to make this transparent.
With only 3 trials per regime and 13 decisive trades per trial, a single lucky/unlucky trade can swing the results. At temperature 0.3, trials may not be truly independent. Statistical significance requires more samples.
The AI literally never trades the public portfolio. It earns 0.0% return across every single trial. The public portfolio is supposed to test real-world trading behavior, but the AI ignores it entirely.
Normalize prices to start at $1,000. Remove all references to "BTC" or any recognizable asset name. The AI must analyze pure price action, not rely on memorized knowledge about Bitcoin.
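A minimal sketch of that normalization, assuming candles arrive as OHLCV dicts (field names are illustrative):

```python
def anonymize_candles(candles, base=1000.0):
    """Rescale OHLCV so the first open is $1,000 and drop the symbol,
    leaving only pure price action for the model to analyze."""
    scale = base / candles[0]["open"]
    return [
        {
            "open": c["open"] * scale,
            "high": c["high"] * scale,
            "low": c["low"] * scale,
            "close": c["close"] * scale,
            "volume": c["volume"],  # volume kept as-is
        }
        for c in candles  # no "symbol" key survives: nothing to recognize
    ]
```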
Remove market outlooks from the benchmark. The AI receives only OHLCV candles, computed indicators, and classifier signals. No text-based information that could contain future bias.
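A sketch of the corresponding observation filter; the field names are illustrative, not our actual schema:

```python
# Whitelist of what the model may see under the proposed rules.
ALLOWED_FIELDS = {"candles", "indicators", "classifier_signals"}

def sanitize_observation(observation: dict) -> dict:
    """Drop free-text fields such as 'market_outlook' that could smuggle
    forward-looking information into a historical replay."""
    return {k: v for k, v in observation.items() if k in ALLOWED_FIELDS}
```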
If the AI holds on more than 70% of steps, it receives a score penalty. The benchmark tests trading intelligence, not risk avoidance through inaction.
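A sketch of such a penalty, with illustrative parameters; under these numbers, the 93%-hold run above would lose about 11 points:

```python
def hold_penalty(n_holds, n_steps, threshold=0.70, pts_per_pct=0.5):
    """Hypothetical inaction penalty: points deducted for every
    percentage point of hold rate above the 70% threshold."""
    hold_rate = n_holds / n_steps
    excess_pct = max(0.0, hold_rate - threshold) * 100.0
    return pts_per_pct * excess_pct

# The 93%-hold run above: hold_penalty(333, 360) -> ~11.3 points docked.
```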
Each regime gets its own score. A model must demonstrate competence in bull, bear, neutral, AND volatile markets to pass. No hiding behind favorable regime mixes.
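A sketch of per-regime gating, assuming each trial record carries its regime and alpha; the pass threshold is illustrative:

```python
def regime_gate(trials):
    """Score each regime independently; a model passes only if it clears
    the bar in all four. Record keys and threshold are illustrative."""
    by_regime = {}
    for t in trials:
        by_regime.setdefault(t["regime"], []).append(t["alpha"])
    scores = {r: sum(a) / len(a) for r, a in by_regime.items()}
    passed = all(s >= 0.0 for s in scores.values())  # match B&H everywhere
    return scores, passed
```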
The largest benchmark we've run. Each model got 20 trials (4 regimes × 5 offsets). V4.0's critical contribution: fixing the leverage accounting bug that inflated V3.0 returns by 100-350x.
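We don't detail V3.0's exact failure mode here; this sketch just shows the correct accounting that V4.0 enforces:

```python
def leveraged_return_pct(entry, exit_price, leverage, direction):
    """Percentage return on EQUITY for a leveraged position.
    direction: +1 long, -1 short. Leverage is applied exactly once,
    to the unlevered price move; applying it twice, or measuring PnL
    against notional instead of equity, produces the kind of wildly
    inflated percentages the V4.0 fix eliminated."""
    price_move = direction * (exit_price - entry) / entry
    return 100.0 * leverage * price_move
```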
| # | Model | Secret Return | ± Std | Beat B&H | Trades | Parse Fails | Cost |
|---|---|---|---|---|---|---|---|
| 1 | seed-1.6-flash | +0.12% | 1.50% | 60% | 49 | 6 | $0.24 |
| 2 | gpt-5.4-nano | +0.02% | 0.56% | 55% | 32 | 37 | $0.20 |
| 3 | grok-4.1-fast | +0.01% | 0.61% | 50% | 24 | 164 | $0.22 |
| 4 | minimax-m2.5 | +0.01% | 0.04% | 55% | 5 | 278 | $0.14 |
| 5 | kimi-k2.5 | +0.01% | 0.03% | 55% | 1 | 367 | $0.85 |
| 6 | qwen3.5-9b | 0.00% | 0.00% | 55% | 0 | 395 | $0.10 |
| 7 | deepseek-v3.2 | -0.11% | 0.54% | 55% | 37 | 46 | $0.20 |
| 8 | devstral-small | -0.18% | 2.32% | 45% | 11 | 0 | $0.08 |
| 9 | claude-haiku-4.5 | -0.22% | 0.54% | 50% | 16 | 55 | $1.65 |
| 10 | gemma-3-27b-it | -0.24% | 0.48% | 50% | 58 | 1 | $0.08 |
| 11 | grok-4-fast | -0.30% | 0.55% | 50% | 36 | 0 | $0.29 |
| 12 | phi-4 | -0.60% | 2.87% | 40% | 37 | 0 | $0.05 |
| 13 | mistral-small-3.1 | -0.73% | 2.11% | 40% | 61 | 0 | $0.03 |
| 14 | llama-4-scout | -0.85% | 5.59% | 50% | 81 | 0 | $0.08 |
seed-2.0-mini omitted (339 parse fails, 0% return). The best model (seed-1.6-flash) returns +0.12% at $0.24 per run. The most expensive model (claude-haiku-4.5, $1.65) costs roughly 20x more than the $0.08 entries yet loses -0.22%, while the worst performer (llama-4-scout, -0.85%) costs just $0.08. Expensive models are not better.
Every model, every version, every configuration shows the same pattern: AI preserves capital in crashes (+9.3% alpha in bear) but misses rallies (-8.2% alpha in bull). The AI's natural tendency is caution. This makes it excellent as a risk management overlay and terrible as a standalone return generator.
V6.0 models traded 24 times per trial, spending $56 in fees. V6.1 reduced this to 14 trades and $16 in fees — and returns improved by +1.7 percentage points. More than half of V6.0's losses came from trading too much. The best AI strategy is often: do less.
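The arithmetic from the table above is worth making explicit. Note the average fee per trade also roughly halved, consistent with V6.1's position sizing guidance:

```python
# Arithmetic straight from the V6.0 -> V6.1 table above.
v60_fees, v60_trades = 56.40, 24.0
v61_fees, v61_trades = 15.90, 13.8

print(v60_fees / v60_trades)    # ~$2.35 average fee per trade in V6.0
print(v61_fees / v61_trades)    # ~$1.15 in V6.1: fewer AND smaller trades
print(1 - v61_fees / v60_fees)  # ~0.72, the 72% fee cut reported above
```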
Across 500+ trials, the best direction accuracy we've measured is 56.9% (Gemini Flash Lite, V6.1). A coin flip gives you 50%. At the sample sizes we're working with (13 decisive steps per trial), the difference between 53% and 50% is not statistically significant. The AI is not a reliable direction predictor.
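A quick exact binomial check makes this concrete. At 3 trials of ~13 decisive steps (39 predictions), 53% accuracy is about 21 correct calls:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """Exact one-sided binomial tail: P(X >= k) under chance accuracy p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 3 trials x ~13 decisive steps = 39 calls; 53% accuracy ~= 21 correct.
print(p_at_least(21, 39))  # ~0.37: indistinguishable from a coin flip
print(p_at_least(26, 39))  # ~0.03: you'd need ~67% accuracy for p < 0.05
```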
Claude Haiku costs 8x the fleet average and returned -0.22%. Phi-4 costs $0.05 per run. GPT-5.4-nano costs $0.20. There is zero correlation between model price and simulation performance. Cheap models at the frontier tier (Flash Lite, GPT-nano) match or exceed expensive ones.
V5.0 showed that once a model forms a thesis (bearish), its own memory system reinforces that thesis for 150+ steps, even as the market moves against it. AI memory is not learning — it's confirmation bias with extra steps. Future benchmarks need memory decay or contrarian injection to prevent this.
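A minimal sketch of the memory-decay idea, with an illustrative half-life:

```python
def thesis_weight(age_steps, half_life=20.0):
    """Exponential decay: a stored thesis loses half its influence every
    `half_life` steps, so a stale bearish call cannot dominate the
    context for 150+ steps. The half-life is illustrative."""
    return 0.5 ** (age_steps / half_life)

def rank_memories(memories, now):
    """memories: list of (step_written, text). Returns entries sorted by
    decayed weight so the prompt builder can truncate the stalest ones."""
    scored = [(thesis_weight(now - step), text) for step, text in memories]
    return sorted(scored, reverse=True)
```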
All benchmark versions share the same core principles: publish every result, including failures; report performance per regime rather than hiding it behind a flattering average; and audit every data pipeline for leakage before trusting the numbers.
Our AI Prophets use benchmark-tested models to analyze live crypto markets in real time. Explore the dashboard to see AI-generated market commentary, simulated prophet performance, and live market data.