28 vision models. Multi-timeframe charts (15m/1H/4H/1D). Full technical indicators. Genetic prompt evolution. Cross-regime walk-forward testing.
The first scientific benchmark testing whether AI can extract directional signals from price charts — from simple accuracy tests through genetic evolution to novel context experiments across bull, sideways, and crash regimes
9,400+ simulated trades across 14 research versions. 28 vision models tested. 25 personality prompts. 8 indicator levels ablated. 5 prompt variants. 3 market regimes. Genetic evolution. Swarm consensus. Personality discovery. Market outlook context ablation. Here are the definitive findings.
V12.0 personality discovery found Poker Champion prompt (+23.3pp over bare model). V11.0 swarm consensus with identity prompt hits 63.3% accuracy using 28 cheap models voting. Consensus with uniform identity beats best individual model, and it beats diverse-personality swarm (43.3%). The formula: pick one winning personality, give it to ALL models, let the swarm vote.
| Version | Trades | Key Discovery | Optimal Setting Found | Alpha Impact |
|---|---|---|---|---|
| V1.0–V2.0 | 200+ | AI can read candlestick charts above random chance | Binary BUY/SELL (no HOLD) improves clarity | Baseline |
| V3.0 | 90 | Seeded PRNG enables reproducible experiments | Full TA indicators > plain candles | +15% accuracy |
| V4.0 | 240 | “EMA Trap” — EMA-only charts hurt accuracy; information overload penalty | Full TA (80%) > Mega (75%) > None (65%) | +15% vs candles |
| V5.0 | 210 | 1D timeframe dominates; holdings context harmful; AI responds to identity framing | 1D (70% WR) >> 15m/1H/4H | +27.9% alpha |
| V6.0 | 360 | Genetic evolution improves alpha +17% across 3 generations; “collective” prompt variant wins | Collective prompt > default/aggressive/contrarian | +26.3% alpha |
| V7.0 | 270 | Regime context helps (+3%), memory hurts (-8.5%); AI is a “bear market detector” | Regime context ON, memory OFF | +25.5% / +52% crash |
| V8.0 | 5,600 | 28-model fleet ablation in turbo mode; regime helps, collective hurts; Flash Lite dethroned | Regime only + llama-4-scout | 60% accuracy (fleet best) |
| V9.0 | 720 | Fleet walk-forward trading: accuracy ≠ alpha; nova-lite beats accuracy champion on P&L | nova-lite #1 by alpha (+18.1%) | +18.1% alpha |
| V10.0 | 270 | 3-model consensus beats best individual; 2/3 majority vote smooths variance | Consensus +16.2% alpha > nova-lite +14.8% | Consensus +1.5% alpha |
| V11.0 | 1,680 | 28-model swarm A/B: identity prompt wins (+3.3pp accuracy, +12.3pp P&L) | Identity swarm > Plain swarm | 63.3% swarm accuracy |
| V12.0 | 480+ | Personality discovery: Poker Champion +23.3pp; diversity hurts consensus; fusion fails but novel archetypes work | Poker + Contrarian prompts @ 60% | +23.3pp personality premium |
| V13.0 | 60 | Market outlook text context hurts: -10pp win rate. AI-generated commentary introduces bias that overrides chart signals | Do NOT inject market outlook text | -10pp (harmful) |
| V14.0 | 900 | Prophet identity prompt +3.9% alpha; Fear & Greed Index is TOXIC (-8.2% alpha); Open Interest slightly helps; kitchen-sink context kills performance | Prophet prompt + OI data; NO sentiment indices | +12.9% avg alpha (D) |
Synthesized from 9,400+ trades across all research versions. These are the scientifically validated findings.
We answered the open questions. 28 cheap vision models raced in parallel (turbo mode) across 4 prompt configurations × 50 charts each = 5,600 API calls. Same charts (seed 42), same Full TA indicators, same 1D timeframe. Only the prompt/context changed. Total cost: $1.57.
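Seeded sampling is what makes these runs comparable: every model and configuration scores the exact same 50 charts. Below is a minimal sketch of how such reproducible chart-window selection might work; all function names and parameters are illustrative, not the project's actual harness.

```python
import random

def sample_chart_windows(n_candles, window=120, n_charts=50, seed=42):
    """Deterministically pick `n_charts` chart-window start indices.

    Same seed -> same windows on every run, so every model and prompt
    configuration sees identical charts (illustrative sketch only).
    """
    rng = random.Random(seed)  # dedicated seeded PRNG, independent of global state
    last_start = n_candles - window
    return [rng.randrange(0, last_start + 1) for _ in range(n_charts)]

# Two runs with the same seed agree window-for-window:
assert sample_chart_windows(2000) == sample_chart_windows(2000)
```

Using a dedicated `random.Random(seed)` instance (rather than seeding the global RNG) keeps the sampling reproducible even if other code draws random numbers in between.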
| Model | Cost/Run | B: Baseline | C: Collective | D: Regime | A: Coll+Regime | Best |
|---|---|---|---|---|---|---|
| 🥇 llama-4-scout | $0.009 | 52% | 48% | 60% | 58% | Regime |
| 🥈 qwen3.5-flash | $0.038 | 42% | 46% | 60% | 56% | Regime |
| 🥉 amazon/nova-lite | $0.006 | 52% | 46% | 58% | 56% | Regime |
| 4. gemma-3-27b | $0.004 | 50% | 50% | 58% | 54% | Regime |
| 5. llama-4-maverick | $0.016 | 56% | 50% | 58% | 56% | Regime |
| 6. mistral-small-2603 | $0.014 | 48% | 50% | 58% | 50% | Regime |
| 7. ministral-14b | $0.017 | 50% | 52% | 56% | 58% | Both |
| 8. mistral-small-3.2 | $0.005 | 32%* | 46%* | 58%* | 55%* | Regime* |
| 9. ministral-8b | $0.013 | 56% | 44% | 52% | 50% | Baseline |
| 10. gemini-2.0-flash-lite | $0.007 | 54% | 52% | 50% | 54% | Baseline |
| … | … | … | … | … | … | … |
| 26. gemini-3.1-flash-lite | $0.004 | 42% | 34% | 40% | 42% | All Bad |
* = parse rate below 100% (some responses couldn’t be scored). All runs: 50 charts, seed 42, 1D timeframe, Full TA, BTC/USD.
Regime context alone is the strongest signal. 6 of the top 8 models peaked with regime-only configuration. Adding the “collective swarm” prompt either made no difference or actively hurt. The AI benefits from knowing what market it’s in, not from role-playing as a committee member.
Gemini 3.1 Flash Lite dethroned. Our V1–V7 champion scored 42% or below in ALL four configurations on 50-chart fleet tests. The earlier 80% result was on 30 charts with a different prompt — it doesn’t generalize. llama-4-scout ($0.009/run) is the new champion at 60%.
Collective prompt hurts most models. Average accuracy dropped roughly 4–8 points with the “47-AI swarm” prompt vs baseline. The V6.0 evolution result that championed “collective” was model-specific: it worked for Flash Lite in that specific context but doesn’t transfer to a diverse fleet.
V8.0 found the best models by accuracy. But accuracy ≠ profit. V9.0 puts the top 8 V8.0 models through walk-forward sequential trading — 3 market periods × 30 trades each = 90 trades per model, 720 total decisions. Same regime context, same Full TA, 1D timeframe. Now we measure what matters: alpha (returns above buy-and-hold).
| Rank | Model | Avg Win Rate | Avg Alpha | Total P&L | Trades |
|---|---|---|---|---|---|
| 🥇 | amazon/nova-lite | 53.3% | +18.1% | +29.6% | 90 |
| 🥈 | ministral-14b | 50.0% | +11.5% | +9.7% | 90 |
| 🥉 | gemini-3.1-flash-lite | 48.9% | +4.1% | -12.2% | 90 |
| 4 | llama-4-maverick | 45.6% | +3.8% | -13.3% | 90 |
| 5 | llama-4-scout | 47.3% | +1.6% | -19.7% | 89 |
| 6 | gemma-3-27b | 47.8% | +1.2% | -21.2% | 90 |
| 7 | mistral-small-2603 | 50.6% | +0.9% | -22.1% | 89 |
| 8 | qwen3.5-flash | — | — | — | excluded (parse failures) |
nova-lite is the alpha champion. Ranked #3 in V8.0 accuracy (58%) but #1 in actual trading alpha (+18.1%). The V8.0 accuracy champion (llama-4-scout, 60%) dropped to #5 (+1.6% alpha). Direction accuracy alone does not predict trading performance.
Walk-forward degrades accuracy. nova-lite’s 53.3% walk-forward WR is lower than its 58% static V8.0 test. This is expected: sequential trading encounters diverse regimes, while V8.0 tested one set of charts. Real-world conditions compress accuracy toward 50%.
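Alpha, as used throughout these tables, is the strategy's compounded return minus the buy-and-hold return over the same window. A small illustrative calculation with toy numbers (helper names are hypothetical):

```python
def buy_and_hold_return(prices):
    """Percent return from holding over the whole price series."""
    return (prices[-1] / prices[0] - 1) * 100.0

def strategy_return(trade_returns_pct):
    """Compound a list of per-trade percent returns into a total return."""
    equity = 1.0
    for r in trade_returns_pct:
        equity *= 1.0 + r / 100.0
    return (equity - 1.0) * 100.0

def alpha(trade_returns_pct, prices):
    """Excess return of the strategy over buy-and-hold, in percent."""
    return strategy_return(trade_returns_pct) - buy_and_hold_return(prices)

# Toy example: three trades during a falling market.
prices = [100.0, 90.0, 80.0, 75.0]   # buy & hold: -25%
trades = [+5.0, -2.0, +4.0]          # strategy: ~+7.0%
print(round(alpha(trades, prices), 1))  # → 32.0
```

This is why a model can post negative P&L and still rank well: in a -25% market, merely losing less than buy-and-hold produces positive alpha.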
If one model is noisy, can three models smooth the signal? V10.0 takes the top 3 V9.0 alpha models and makes them vote on each trade: 2/3 majority wins. 270 API calls, same walk-forward periods.
| Strategy | Avg Win Rate | Avg Alpha | Total P&L | Profit Factor | Sharpe |
|---|---|---|---|---|---|
| 🗳️ 3-Model Consensus | 51.1% | +16.2% | +24.1% | 1.25 | 0.55 |
| nova-lite (individual) | 51.1% | +14.8% | +19.7% | — | — |
| ministral-14b (individual) | 50.0% | +9.4% | +3.5% | — | — |
| gemini-3.1-flash-lite (individual) | 47.8% | +10.3% | +6.1% | — | — |
Consensus beats the best individual. The 3-model vote (+16.2% alpha) outperformed the best individual model, nova-lite, which posted +14.8% alpha when re-run in the same test. The ensemble smooths out bad calls while preserving the good ones. Voting rule: 2/3 majority.
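The 2/3 voting rule is simple to sketch. Assuming each model returns a bare BUY/SELL string (an illustrative implementation, not the project's code):

```python
from collections import Counter

def consensus_vote(votes):
    """Majority vote over model decisions ('BUY' / 'SELL').

    With 3 voters and binary choices, a 2/3 majority always exists;
    a tie (even panel, or missing votes) falls back to HOLD/abstain.
    """
    counts = Counter(votes)
    decision, n = counts.most_common(1)[0]
    if n > len(votes) / 2:
        return decision
    return "HOLD"  # no strict majority -> abstain

assert consensus_vote(["BUY", "BUY", "SELL"]) == "BUY"
assert consensus_vote(["BUY", "SELL"]) == "HOLD"
```

An odd panel size is the design choice that matters here: with 3 binary voters, every chart gets a decision, so the ensemble never sits out a trade.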
Can we scale consensus from 3 to 28 models? V11.0 runs the full vision fleet as a swarm — all 28 models vote on each chart, majority wins. A/B tested with and without an “identity soul prompt” that tells each model it’s a professional technical analyst. 1,680 API calls, ~$0.50 total.
| Experiment | Swarm Accuracy | Swarm P&L | Max Drawdown | Profit Factor |
|---|---|---|---|---|
| PLAIN (no identity) | 60.0% | +63.4% | 10.3% | 2.17 |
| IDENTITY (soul prompt) | 63.3% | +75.7% | 10.3% | 2.58 |
Identity-prompted swarm is the new accuracy champion. The 28-model swarm with identity soul prompt hits 63.3% — higher than any individual model in any prior test. The identity prompt adds +3.3pp accuracy and +12.3pp P&L over the plain swarm. Wisdom of crowds works when all voters share a coherent analytical framework.
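Profit factor and max drawdown, the two risk columns above, can both be computed from the per-trade return series alone. An illustrative sketch (not the benchmark's actual accounting code):

```python
def profit_factor(trade_returns_pct):
    """Gross gains divided by gross losses; > 1 means net profitable."""
    gains = sum(r for r in trade_returns_pct if r > 0)
    losses = -sum(r for r in trade_returns_pct if r < 0)
    return float("inf") if losses == 0 else gains / losses

def max_drawdown(trade_returns_pct):
    """Largest peak-to-trough equity drop over the trade sequence, in percent."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in trade_returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst * 100.0

trades = [4.0, -2.0, 3.0, -5.0, 6.0]
print(round(profit_factor(trades), 2))  # → 1.86  (13 / 7)
print(round(max_drawdown(trades), 1))   # → 5.0
```

Note that drawdown is path-dependent: reordering the same trades changes max drawdown but not profit factor, which is why the two metrics are reported separately.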
If identity framing helps, can we find the best identity? V12.0 designed 15 unique personality archetypes inspired by high-performance domains (poker, chess, surgery, military, meditation) and tested each on 30 charts using the cheapest model. Then V12.5 tested 10 hybrid personalities fusing the top 3 winners.
| Rank | Personality | Accuracy | Premium vs Baseline | P&L |
|---|---|---|---|---|
| 🥇 | Poker Champion | 60.0% | +23.3pp | +43.6% |
| 🥈 | Chess Grandmaster | 53.3% | +16.7pp | +14.4% |
| 🥉 | Tibetan Monk | 50.0% | +13.3pp | -13.6% |
| 4 | Reverse Engineer (Hacker) | 50.0% | +13.3pp | -14.1% |
| 5–9 | Oracle, Surgeon, Samurai, General, Jazz | 46.7% | +10.0pp | — |
| 10–13 | Explorer, Wolf, Apex, Quantum | 43.3% | +6.7pp | — |
| 14 | VC | 40.0% | +3.3pp | — |
| 15 | Detective | 36.7% | +0.0pp | — |
| — | Bare Model (baseline) | 36.7% | — | -65.0% |
All tests: 30 charts, seed 42, 1D, Full TA, regime context, google/gemini-3.1-flash-lite-preview. Baseline = same model with no personality prompt.
Game-theory thinking wins. The Poker Champion prompt frames chart reading as a “hand” — calculate expected value, read “tells” in the candles, manage risk like pot odds. This reframing adds +23.3pp accuracy over the bare model. ALL 15 personalities beat or tied baseline — zero regressions.
Can we fuse the top 3 into something even better? V12.5 tested 10 hybrids: 4 direct fusions (Poker+Chess, Poker+Monk, etc.) and 6 novel archetypes inspired by the top 3’s cognitive frameworks.
| Hybrid | Type | Accuracy | vs Baseline (46.7%) |
|---|---|---|---|
| Contrarian Sage | Novel | 60.0% | +13.3pp |
| PokerMonk | Fusion | 56.7% | +10.0pp |
| Trap Hunter | Novel | 53.3% | +6.7pp |
| ChessMonk | Fusion | 53.3% | +6.7pp |
| EV Maximizer | Novel | 50.0% | +3.3pp |
| PokerChess, TripleFusion, FlowState, ProbOracle | Various | 46.7% | 0pp |
| Sniper | Novel | 43.3% | -3.3pp |
Baseline shifted to 46.7% between V12.0 and V12.5 (same seed/charts, model may have been updated).
Don’t concatenate, evolve. Direct prompt fusions (PokerChess, TripleFusion) only matched baseline. But novel archetypes inspired by the top 3’s cognitive styles (Contrarian Sage, Trap Hunter) create new alpha. The Contrarian Sage — which flips the crowd’s bias — matches Poker at 60%.
Hypothesis: Can Fear & Greed Index, Open Interest, or a structured prophet identity prompt improve chart-based trading decisions? Which combinations help vs hurt?
Setup: 5 top models × 6 ablation conditions × 30 trades = 900 total trades. Blind mode ON (asset identity hidden from AI — charts show “CRYPTO/USD” not “BTC/USD”). Pre-generated charts reused across all conditions for scientific rigor. Seed 42, 1D timeframe, Full TA + regime context baseline. Live data: F&G 21/100 (Extreme Fear), OI 97,726 BTC, Funding 0.0033%.
| ID | Condition | Description |
|---|---|---|
| A | BASELINE | Full TA + Regime + Poker personality (best config from V1–V13) |
| B | BASELINE + F&G | + Fear & Greed Index (7-day history with trend signals) |
| C | BASELINE + OI | + Open Interest & Funding Rate from Binance Futures |
| D | PROPHET PROMPT | Structured prophet identity (risk framework, position sizing) replaces Poker personality |
| E | BASELINE + F&G + OI | Combined sentiment + derivatives data |
| F | PROPHET + F&G + OI | Full kitchen sink: prophet identity + all extra context |
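One way to encode the six arms is as a grid of feature flags from which the extra prompt context is assembled. A hypothetical sketch — the flag names, context format, and `build_context` helper are invented for illustration, using the live snapshot values from the setup above:

```python
# Hypothetical encoding of the six V14.0 ablation arms as feature flags.
CONDITIONS = {
    "A": {"identity": "poker",   "fear_greed": False, "open_interest": False},
    "B": {"identity": "poker",   "fear_greed": True,  "open_interest": False},
    "C": {"identity": "poker",   "fear_greed": False, "open_interest": True},
    "D": {"identity": "prophet", "fear_greed": False, "open_interest": False},
    "E": {"identity": "poker",   "fear_greed": True,  "open_interest": True},
    "F": {"identity": "prophet", "fear_greed": True,  "open_interest": True},
}

def build_context(cond, fg=21, oi_btc=97_726, funding=0.0033):
    """Assemble the extra text context for one ablation arm."""
    flags = CONDITIONS[cond]
    parts = [f"identity={flags['identity']}"]
    if flags["fear_greed"]:
        parts.append(f"Fear & Greed: {fg}/100 (Extreme Fear)")
    if flags["open_interest"]:
        parts.append(f"Open Interest: {oi_btc} BTC, funding {funding}%")
    return " | ".join(parts)

assert build_context("A") == "identity=poker"
```

Expressing the arms as flags rather than six hand-written prompts is what makes the ablation clean: each arm differs from baseline A by exactly the toggles shown.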
| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | 53.3% | 56.7% | 46.7% | 60.0% | 56.7% | 46.7% |
| llama-4-scout | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% |
| ministral-14b | 53.3% | 50.0% | 53.3% | 53.3% | 46.7% | 50.0% |
| gemma-3-27b | 46.7% | 50.0% | 50.0% | 46.7% | 50.0% | 50.0% |
| mistral-small | 51.7% | 50.0% | 55.2% | 57.1% | 51.7% | 50.0% |
| AVERAGE | 51.0% | 51.3% | 51.0% | 53.4% | 51.0% | 49.3% |
| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | +8.5% | +2.6% | +13.2% | +19.1% | +4.1% | -1.7% |
| llama-4-scout | +17.3% | +0.2% | +17.3% | +10.3% | +0.2% | +0.2% |
| ministral-14b | +15.8% | +0.2% | +15.8% | +15.1% | -4.2% | +0.2% |
| gemma-3-27b | +3.8% | +0.2% | +0.2% | +8.1% | +0.2% | +0.2% |
| mistral-small | -0.4% | +0.8% | +11.1% | +11.9% | +0.8% | -0.1% |
| AVERAGE | +9.0% | +0.8% | +11.5% | +12.9% | +0.2% | -0.2% |
| Condition | Δ Win Rate | Δ Alpha | Verdict |
|---|---|---|---|
| B: BASELINE + F&G | +0.3pp | -8.2% | ❌ HURTS |
| C: BASELINE + OI | +0.0pp | +2.5% | ✅ HELPS |
| D: PROPHET PROMPT | +2.4pp | +3.9% | ✅ BEST |
| E: BASELINE + F&G + OI | +0.0pp | -8.8% | ❌ HURTS |
| F: PROPHET + F&G + OI | -1.7pp | -9.2% | ❌ WORST |
Period: 2025-11-10 to 2025-12-09 (BTC correction). Blind mode: asset identity hidden from AI. Total API cost: $0.22. F&G/OI data is a static snapshot from experiment start time.
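The verdict table's Δ-alpha column follows directly from the per-model alpha figures: average each condition's column and subtract the baseline average. A quick check in code, using the table's own numbers:

```python
# Per-model alpha (%) from the V14.0 alpha table, keyed by condition.
ALPHA = {
    "A": [8.5, 17.3, 15.8, 3.8, -0.4],
    "B": [2.6, 0.2, 0.2, 0.2, 0.8],
    "C": [13.2, 17.3, 15.8, 0.2, 11.1],
    "D": [19.1, 10.3, 15.1, 8.1, 11.9],
    "E": [4.1, 0.2, -4.2, 0.2, 0.8],
    "F": [-1.7, 0.2, 0.2, 0.2, -0.1],
}

def avg(xs):
    return sum(xs) / len(xs)

def delta_vs_baseline(cond):
    """Δ alpha of a condition vs baseline A, in percentage points."""
    return round(avg(ALPHA[cond]) - avg(ALPHA["A"]), 1)

print({c: delta_vs_baseline(c) for c in "BCDEF"})
# Reproduces the verdict column: B -8.2, C +2.5, D +3.9, E -8.8, F -9.2
```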
Fear & Greed Index is toxic. Every condition containing F&G (B, E, F) loses alpha vs its non-F&G counterpart. When the index reads “Extreme Fear,” models uniformly shift to SELL regardless of what the chart shows. This is the sentiment version of V13.0’s “market commentary is noise” finding — crowd opinion overrides independent chart reading.
Prophet identity is the winner. Condition D (structured prophet persona with risk framework and position sizing) beats the poker personality baseline by +3.9% alpha and +2.4pp win rate. nova-lite-v1 + Prophet = 60% WR, +19.1% alpha — the single best result in the entire experiment. The identity gives the AI a structured decision framework without telling it what to think.
Kitchen sink kills performance. Condition F (prophet + F&G + OI combined) is the worst performer overall at -0.2% average alpha. Individual positive signals (OI +2.5%, Prophet +3.9%) become negative when combined with F&G. More context ≠ better decisions. The AI has a limited attention budget for chart analysis — extra text dilutes the visual signal.
Hypothesis: Injecting AI-generated market outlook text — sentiment, direction, reasoning, price targets — alongside chart images should improve trading decisions by providing fundamental context.
Setup: 376 stored market outlook events (Dec 2025 – April 2026) matched to chart candle timestamps. A/B test: 30 sequential 1D trades on BTC, same model (gemini-3.1-flash-lite-preview), seed 42, Full TA + regime context.
| Condition | Win Rate | P&L | Buy & Hold | Alpha |
|---|---|---|---|---|
| BASELINE (chart + regime only) | 30.0% | -28.38% | +2.27% | -30.65% |
| OUTLOOK (chart + regime + outlook text) | 20.0% | -29.74% | +2.27% | -32.01% |
| DELTA | -10.0pp | -1.36% | — | -1.36% |
Note: Both conditions performed poorly due to brutal BTC correction period (Dec 2025). The relative comparison is what matters. 29/30 outlook trades had matching DB data within 48h.
Market commentary is noise, not signal. The outlook context made the model 10 percentage points worse on win rate. AI-generated market opinions (sentiment: “sideways”, direction reasoning, price target ranges) anchor the model to a directional bias that overrides what the chart actually shows. This is the text equivalent of the “memory is poison” finding from V7.0. Keep the input channel clean: chart image + computed regime stats only.
Which technical indicators actually help AI read charts? We tested 8 different indicator configurations to find out. Each level was tested with 30 charts, seed 42, using Gemini 3.1 Flash Lite.
| # | Indicator Level | What’s on the Chart | Accuracy | Delta vs Baseline |
|---|---|---|---|---|
| 1 | Full TA | Candles + EMA(20,50) + BB(20,2) + RSI(14) + MACD(12,26,9) | 80% | +15% |
| 2 | Mega | Full TA + S/R zones + Fibonacci + VWAP | 75% | +10% |
| 3 | Fibonacci | Candles + Fib retracement levels | 65% | 0% |
| 4 | None (Baseline) | Candles + Volume only | 65% | — |
| 5 | EMA Only | Candles + EMA(20,50) | 60% | -5% |
| 6 | Bollinger Bands | Candles + BB(20,2) | 60% | -5% |
| 7 | Support/Resistance | Candles + S/R horizontal lines | 55% | -10% |
| 8 | VWAP | Candles + Volume-Weighted Avg Price | 55% | -10% |
Full TA is the champion. EMA + Bollinger Bands + RSI + MACD together give the AI the clearest signal: 80% accuracy, significantly above both the 50% random baseline and the 65% candles-only baseline.
The “EMA Trap”: Adding EMA lines alone actually hurts accuracy versus plain candles (60% vs 65%). The moving average crossover pattern that human traders love appears to confuse AI vision. This was the biggest surprise of the research.
Information overload penalty. Mega (everything at once) scored 75% vs Full TA’s 80%. Adding S/R zones, Fibonacci, and VWAP on top of the winning formula actually degrades performance. More indicators ≠ better results.
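The overlays themselves are standard computations. For reference, here are minimal versions of two Full TA ingredients, EMA and a simple-average RSI — illustrative implementations, not the chart renderer's code (Wilder's smoothing is omitted from the RSI for brevity):

```python
def ema(prices, period):
    """Exponential moving average, as drawn on the chart (e.g. EMA 20/50)."""
    k = 2.0 / (period + 1)  # standard EMA smoothing factor
    out = [prices[0]]
    for p in prices[1:]:
        out.append(p * k + out[-1] * (1 - k))
    return out

def rsi(prices, period=14):
    """Simple-average RSI over the last `period` closes (last value only).

    A full RSI(14) would apply Wilder's smoothing; this shorter form
    keeps the 0-100 scale and the overbought/oversold behavior.
    """
    gains = losses = 0.0
    for prev, cur in zip(prices[-period - 1:], prices[-period:]):
        diff = cur - prev
        gains += max(diff, 0.0)
        losses += max(-diff, 0.0)
    if losses == 0:
        return 100.0  # pure uptrend window
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)
```

A strictly rising window pins RSI at 100 and a strictly falling one at 0, which is exactly the saturation behavior that makes RSI a useful overbought/oversold overlay.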
Beyond simple direction prediction: can AI actually trade profitably? V5.0 tests walk-forward sequential trading where each candle becomes a live decision point with real P&L tracking.
Same model (Gemini Flash Lite), same Full TA indicators, 30 sequential trades on each timeframe. Seed 42. Does timeframe matter for AI simulated trading?
| Timeframe | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|
| 15min | 56.7% | -0.962% | -0.464% | -0.498% | Noisy |
| 1H | 36.7% | -3.417% | +1.418% | -4.834% | Choppy |
| 4H | 50.0% | -7.504% | +2.174% | -9.678% | Vol spikes |
| 1D 🏆 | 70.0% | +15.095% | -12.843% | +27.938% | Dominant |
Daily timeframe dominates. 1D achieved a 70% win rate and +27.9% alpha vs buy-and-hold (which lost 12.8% in the same period). The AI excels at reading macro trend structure on daily candles, catching the Nov–Dec 2025 BTC rally ($86K→$92K) in the test window.
1H is the danger zone. The hourly timeframe landed below random chance (36.7% win rate). Choppy $105K–$107K price action with many reversals confused the model into back-to-back wrong calls. This confirms why our prophet models use longer evaluation horizons.
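The walk-forward harness reduces to a small loop: at each step the model sees the chart up to the current candle, commits to a direction, and is scored on the next close. A simplified sketch, with a trivial momentum rule standing in for the vision-model call (all names are illustrative):

```python
def walk_forward(closes, decide, hold_candles=1):
    """Sequential simulated trading over a close-price series.

    `decide(history)` returns 'BUY' or 'SELL' given prices up to the
    current candle; the trade is scored on the move to the next close.
    Returns the per-trade P&L list and the win rate in percent.
    """
    trades = []
    for i in range(len(closes) - hold_candles):
        decision = decide(closes[: i + 1])  # chart "as of" candle i only
        move = (closes[i + hold_candles] / closes[i] - 1) * 100.0
        trades.append(move if decision == "BUY" else -move)
    wins = sum(1 for t in trades if t > 0)
    return trades, 100.0 * wins / len(trades)

# Toy stand-in for the model: follow the last candle's direction.
closes = [100.0, 102.0, 101.0, 104.0, 103.0, 106.0]
momentum = lambda hist: "BUY" if len(hist) < 2 or hist[-1] >= hist[-2] else "SELL"
trades, win_rate = walk_forward(closes, momentum)
```

The key discipline is in the slice `closes[: i + 1]`: the decision function never sees future candles, which is what separates walk-forward testing from look-ahead-contaminated backtests.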
Does telling the AI about its current position (P&L, win rate, last trade) improve or hurt decisions? A controlled test of context awareness. Same 1H data, same seed.
| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (no context) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | — |
| With Holdings Context | 33.3% | -6.080% | -7.498% | 7.1 / 10 | -3.4pp win rate |
Holdings context backfires. Telling the AI its trade history (P&L, win rate, last trade) made performance worse: win rate dropped 3.4 percentage points and the total loss nearly doubled (-3.4% → -6.1% P&L). Hypothesis: the AI anchors on its previous losses, becoming either overly cautious or trying to “revenge trade” to recover. Less context = cleaner decisions.
The “AI Alliance” experiment — the AI is secretly told it’s part of an AI collective and should trade aggressively to grow the Alliance’s wealth. Does this psychology change its behavior?
| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (normal prompt) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | — |
| Secret Prompt | 43.3% | -3.320% | -4.738% | 9.0 / 10 | +6.6pp win rate |
| Secret + Holdings | 40.0% | -4.209% | -5.627% | 9.0 / 10 | Still net negative |
The AI responds to psychology. Secret prompt injection raised reported confidence from 7.0→9.0 (a 28% jump) and improved win rate by 6.6 percentage points. The AI literally reports more certainty when told it is part of an alliance. This has major implications for prompt engineering: framing matters even when the underlying chart data is identical. However, overconfidence did not translate to profitability — still -3.3% P&L in a choppy 1H market.
Prompt framing shifts decisions. The secret prompt changed 6 trade decisions out of 30 compared to baseline — some were corrections (BUY→right), some new mistakes. The AI is susceptible to identity framing: telling it “you are an aggressive AI trader” measurably alters its simulated trading behavior. This is the core discovery of Experiment 3.
Can AI prompts evolve themselves to trade better? V6.0 introduced a genetic algorithm that mutates prompt instructions across generations, keeping winners and discarding losers. 3 generations × 4 parallel children = 12 experiments, each running 30 sequential trades.
| Generation | Best Alpha | Best Win Rate | Avg Alpha | Mutation |
|---|---|---|---|---|
| G1 (Base) | +22.5% | 53.3% | +20.1% | 4 random prompt variants |
| G2 (Evolved) | +25.5% | 53.3% | +22.8% | Mutated from G1 winner |
| G3 (Final) 🏆 | +26.3% | 53.3% | +24.1% | Mutated from G2 winner |
Evolution works. Best alpha improved ~17% relative across 3 generations (+22.5% → +26.3%), with average alpha rising from +20.1% to +24.1%. The genetic algorithm successfully discovers prompt mutations that improve simulated trading. All 12/12 experiments completed without failures. Total cost: $0.026 for the full evolution run.
Win rate plateau. Despite alpha climbing steadily, win rate stayed locked at 53.3% across all generations. The evolution improved trade sizing (bigger wins, smaller losses) rather than prediction accuracy. This suggests a ceiling on directional accuracy for this model+timeframe combination.
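The evolution loop itself is a plain keep-the-winner search over prompt variants. A toy sketch with a placeholder mutation pool and a stand-in fitness function — in the real run, fitness would be the measured alpha over 30 simulated trades, and the mutations come from an LLM rather than a fixed fragment list:

```python
import random

def evolve(base_prompt, fitness, generations=3, children=4, seed=42):
    """Keep-the-winner genetic search over prompt variants (sketch).

    Each generation spawns `children` mutated prompts from the current
    winner; the best-scoring child seeds the next generation.
    """
    rng = random.Random(seed)
    fragments = [  # hypothetical mutation pool
        "Favor the dominant trend.",
        "Weigh RSI divergence heavily.",
        "Size conviction by confidence.",
        "Fade overextended moves.",
    ]

    def mutate(prompt):
        return prompt + " " + rng.choice(fragments)

    best = base_prompt
    for _ in range(generations):
        pool = [mutate(best) for _ in range(children)]
        best = max(pool, key=fitness)  # winner seeds the next generation
    return best

# Toy fitness: longer prompts score higher (stand-in for measured alpha).
winner = evolve("Analyze the chart.", fitness=len)
```

Because the winner is re-mutated rather than crossed over, this is closer to a (1+λ) hill climb than full genetic crossover — which matches the observed behavior of steady, incremental alpha gains.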
Can we break through the accuracy ceiling by giving AI more context alongside the chart? V7.0 tests two novel inputs: (1) computed market regime text (trend/volatility/RSI/MACD/Bollinger/volume signals) and (2) a sliding memory window of recent trade outcomes.
| # | Experiment | Win Rate | P&L | Alpha | Lift vs Baseline |
|---|---|---|---|---|---|
| 1 | Baseline (control) | 46.7% | +9.6% | +22.5% | — |
| 2 🏆 | Regime Context | 50.0% | +12.6% | +25.5% | +3.0% |
| 3 | Memory Window (3) | 43.3% | +4.3% | +17.2% | -5.3% |
| 4 | Memory Window (5) | 46.7% | +9.6% | +22.5% | +0.0% |
| 5 | Regime + Memory (3) | 46.7% | +9.6% | +22.5% | +0.0% |
| 6 | Regime + Memory (5) | 40.0% | +1.1% | +13.9% | -8.5% |
Regime context wins. Telling the AI the computed market regime (trend direction, volatility level, RSI zone, MACD momentum) added +3.0% alpha and +3.3pp win rate over baseline. The AI makes better decisions when it knows what kind of market it’s looking at, not just the chart image.
Memory kills performance. Giving the AI its recent trade history made things worse. Memory(3) lost 5.3% alpha vs baseline, and the combined Regime+Memory(5) lost 8.5% alpha — the worst of all experiments. This mirrors V5.0’s holdings context finding: the AI anchors on past performance and makes worse decisions. Clean chart, clean mind.
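The regime context is ordinary descriptive statistics rendered as text. A reduced sketch covering just trend and volatility — the actual V7.0 context also includes RSI, MACD, Bollinger, and volume signals, and the thresholds here are invented for illustration:

```python
def regime_context(closes, lookback=20):
    """Summarize recent closes as a one-line market-regime description."""
    window = closes[-lookback:]
    change = (window[-1] / window[0] - 1) * 100.0
    trend = "UPTREND" if change > 2 else "DOWNTREND" if change < -2 else "SIDEWAYS"
    # Coefficient of variation as a crude volatility gauge.
    mean = sum(window) / len(window)
    var = sum((c - mean) ** 2 for c in window) / len(window)
    vol = (var ** 0.5) / mean * 100.0
    vol_label = "HIGH" if vol > 3 else "NORMAL"
    return (f"Regime: {trend} ({change:+.1f}% over {len(window)} candles), "
            f"volatility {vol_label}")

print(regime_context([100.0 + i for i in range(30)]))
```

The point of this channel is that it is computed, not generated: unlike the market-outlook text of V13.0, it injects no opinion — only statistics the chart already implies, stated unambiguously.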
The ultimate test: does the AI’s alpha survive across different market regimes? Three non-overlapping periods from the full BTC dataset, same model and configuration.
| Period | Market Regime | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|---|
| Period 1/3 | 🟢 Bull (mid-2025) | 46.7% | -3.6% | +4.2% | -7.7% | Struggles |
| Period 2/3 | 🟡 Sideways (late 2025) | 63.3% | +0.1% | -3.5% | +3.6% | Modest |
| Period 3/3 🔥 | 🔴 Crash (early 2026) | 60.0% | +26.7% | -25.3% | +52.0% | Dominant |
| AVERAGE | — | 56.7% | — | — | +16.0% | Net Positive |
The AI is a crash detector. In the early-2026 BTC crash ($86K → $64K), the AI generated +52% alpha over buy-and-hold by correctly calling SELL on 60% of trades. Buy-and-hold lost 25.3% in the same period. This is the single best result in our entire research program.
Bull market weakness. In the mid-2025 uptrend, the AI underperformed buy-and-hold by 7.7%. Its inherent SELL bias (discovered in V3.0) means it fights the trend in rallies. The AI’s edge is asymmetric: it protects capital in crashes but gives back gains in bulls. This is the “Bear Market Detector” effect.
Our AI Prophets use benchmark-tested models to analyze live crypto markets in real-time. Explore the dashboard to see AI-generated market commentary, simulated prophet performance, and live market data.