MULTIMODAL AI RESEARCH

Can AI Read Charts?

28 vision models. Multi-timeframe charts (15m/1H/4H/1D). Full technical indicators. Genetic prompt evolution. Cross-regime walk-forward testing.

The first scientific benchmark testing whether AI can extract directional signals from price charts — from simple accuracy tests through genetic evolution to novel context experiments across bull, sideways, and crash regimes

⚠️ Important: This benchmark measures simulated, hypothetical performance using historical BTC price data rendered as candlestick charts. All results are for educational and informational purposes only. This is not financial advice. Models receive only visual chart images — no text data, no coin name, no price values.

📋 Executive Summary: What We’ve Learned

8,500+ simulated trades across 14 research versions. 28 vision models tested. 25 personality prompts. 8 indicator levels ablated. 5 prompt variants. 3 market regimes. Genetic evolution. Swarm consensus. Personality discovery. Market outlook context ablation. Here are the definitive findings.

8,500+ Total Trades · 28 Vision Models · 14 Research Versions · $3.72 Fleet Cost · 63.3% Best Accuracy

🏆 The Winning Formula (Updated V12.0 — Personality-Validated)

Timeframe: 1D
Indicators: Full TA
Context: Regime Only
Personality: Poker/Contrarian
Consensus: 28-Model Swarm

V12.0's personality discovery found the Poker Champion prompt (+23.3pp over the bare model). V11.0's swarm consensus with an identity prompt hits 63.3% accuracy using 28 cheap models voting. Consensus with a uniform identity beats the best individual model, and it beats the diverse-personality swarm (43.3%). The formula: pick one winning personality, give it to ALL models, and let the swarm vote.

Research Timeline & Key Discoveries

| Version | Trades | Key Discovery | Optimal Setting Found | Alpha Impact |
|---|---|---|---|---|
| V1.0–V2.0 | 200+ | AI can read candlestick charts above random chance | Binary BUY/SELL (no HOLD) improves clarity | Baseline |
| V3.0 | 90 | Seeded PRNG enables reproducible experiments | Full TA indicators > plain candles | +15% accuracy |
| V4.0 | 240 | “EMA Trap” — EMA alone hurts accuracy; information overload penalty | Full TA (80%) > Mega (75%) > None (65%) | +15% vs candles |
| V5.0 | 210 | 1D timeframe dominates; holdings context harmful; AI responds to identity framing | 1D (70% WR) >> 15m/1H/4H | +27.9% alpha |
| V6.0 | 360 | Genetic evolution improves alpha +17% across 3 generations; “collective” prompt variant wins | Collective prompt > default/aggressive/contrarian | +26.3% alpha |
| V7.0 | 270 | Regime context helps (+3%), memory hurts (-8.5%); AI is a “bear market detector” | Regime context ON, memory OFF | +25.5% / +52% crash |
| V8.0 | 5,600 | 28-model fleet ablation in turbo mode; regime helps, collective hurts; Flash Lite dethroned | Regime only + llama-4-scout | 60% accuracy (fleet best) |
| V9.0 | 720 | Fleet walk-forward trading: accuracy ≠ alpha; nova-lite beats accuracy champion on P&L | nova-lite #1 by alpha | +18.1% alpha |
| V10.0 | 270 | 3-model consensus beats best individual; 2/3 majority vote smooths variance | Consensus +16.2% alpha > nova-lite +14.8% | +1.5% alpha |
| V11.0 | 1,680 | 28-model swarm A/B: identity prompt wins (+3.3pp accuracy, +12.3pp P&L) | Identity swarm > plain swarm | 63.3% swarm accuracy |
| V12.0 | 480+ | Personality discovery: Poker Champion +23.3pp; diversity hurts consensus; fusion fails but novel archetypes work | Poker + Contrarian prompts @ 60% | +23.3pp personality premium |
| V13.0 | 60 | Market outlook text hurts (-10pp win rate): AI-generated commentary introduces bias that overrides chart signals | Do NOT inject market outlook text | -10pp (harmful) |
| V14.0 | 900 | Prophet identity prompt +3.9% alpha; Fear & Greed Index is toxic (-8.2% alpha); Open Interest slightly helps; kitchen-sink context kills performance | Prophet prompt + OI data; NO sentiment indices | +12.9% avg alpha (D) |

The 14 Laws of AI Chart Reading

Synthesized from 9,400+ trades across all research versions. These are the scientifically validated findings.

🔬 Open Questions (White Paper Material)

⚡ V8.0: 28-Model Fleet Ablation (April 2026)

We answered the open questions. 28 cheap vision models raced in parallel (turbo mode) across 4 prompt configurations × 50 charts each = 5,600 API calls. Same charts (seed 42), same Full TA indicators, same 1D timeframe. Only the prompt/context changed. Total cost: $1.57.
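The reproducibility claim rests on a seeded PRNG. A minimal sketch, assuming illustrative function names and window sizes (this is not the project's actual code), of how a fixed seed yields the identical chart set for every model:

```python
import random

def sample_chart_windows(n_candles, window, n_charts, seed=42):
    """Deterministically pick chart windows so every model in the
    fleet is scored on the identical set of charts."""
    rng = random.Random(seed)               # seeded PRNG -> reproducible
    last_start = n_candles - window         # last valid window start index
    return sorted(rng.sample(range(last_start), n_charts))

# Same seed, same inputs -> same 50 chart windows every run:
a = sample_chart_windows(2000, window=120, n_charts=50, seed=42)
b = sample_chart_windows(2000, window=120, n_charts=50, seed=42)
assert a == b and len(a) == 50
```

Because only the prompt/context varies between configurations, any accuracy difference is attributable to the prompt rather than to chart selection.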

🏁 The 4 Configurations Tested:

🏆 Top 10 Models — Cross-Configuration Matrix

| Rank | Model | Cost/Run | B: Baseline | C: Collective | D: Regime | A: Coll+Regime | Best |
|---|---|---|---|---|---|---|---|
| 🥇 | llama-4-scout | $0.009 | 52% | 48% | 60% | 58% | Regime |
| 🥈 | qwen3.5-flash | $0.038 | 42% | 46% | 60% | 56% | Regime |
| 🥉 | amazon/nova-lite | $0.006 | 52% | 46% | 58% | 56% | Regime |
| 4 | gemma-3-27b | $0.004 | 50% | 50% | 58% | 54% | Regime |
| 5 | llama-4-maverick | $0.016 | 56% | 50% | 58% | 56% | Regime |
| 6 | mistral-small-2603 | $0.014 | 48% | 50% | 58% | 50% | Regime |
| 7 | ministral-14b | $0.017 | 50% | 52% | 56% | 58% | Both |
| 8 | mistral-small-3.2 | $0.005 | 32%* | 46%* | 58%* | 55%* | Regime* |
| 9 | ministral-8b | $0.013 | 56% | 44% | 52% | 50% | Baseline |
| 10 | gemini-2.0-flash-lite | $0.007 | 54% | 52% | 50% | 54% | Baseline |
| 26 | gemini-3.1-flash-lite | $0.004 | 42% | 34% | 40% | 42% | All Bad |

* = parse rate below 100% (some responses couldn’t be scored). All runs: 50 charts, seed 42, 1D timeframe, Full TA, BTC/USD.

60%

Regime context alone is the strongest signal. 6 of the top 8 models peaked with regime-only configuration. Adding the “collective swarm” prompt either made no difference or actively hurt. The AI benefits from knowing what market it’s in, not from role-playing as a committee member.

42%

Gemini 3.1 Flash Lite dethroned. Our V1–V7 champion scored 42% or below in ALL four configurations on 50-chart fleet tests. The earlier 80% result was on 30 charts with a different prompt — it doesn’t generalize. llama-4-scout ($0.009/run) is the new champion at 60%.

-8%

Collective prompt hurts most models. Average accuracy dropped ~4-8% when using the “47-AI swarm” prompt vs baseline. The V6.0 evolution result that championed “collective” was model-specific — it worked for Flash Lite in that specific context but doesn’t transfer to a diverse fleet.

💡 V8.0 Key Insights

💹 V9.0: Fleet Autonomous Trading (April 2026)

V8.0 found the best models by accuracy. But accuracy ≠ profit. V9.0 puts the top 8 V8.0 models through walk-forward sequential trading — 3 market periods × 30 trades each = 90 trades per model, 720 total decisions. Same regime context, same Full TA, 1D timeframe. Now we measure what matters: alpha (returns above buy-and-hold).
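Alpha here means the strategy's compounded return minus the buy-and-hold return over the same period. A minimal sketch of the metric (function names are illustrative):

```python
def alpha(trade_returns, start_price, end_price):
    """Strategy's compounded return minus buy-and-hold return
    over the same window."""
    equity = 1.0
    for r in trade_returns:                 # r = per-trade fractional return
        equity *= 1.0 + r
    strategy_pnl = equity - 1.0
    buy_and_hold = end_price / start_price - 1.0
    return strategy_pnl - buy_and_hold

# A strategy that stays flat through a 10% drawdown earns +10% alpha:
assert abs(alpha([], 100.0, 90.0) - 0.10) < 1e-9
```

This is why a model can rank high on accuracy but low on alpha: alpha rewards being right on the large moves, not on the most candles.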

🏆 Fleet Trading Leaderboard

| Rank | Model | Avg Win Rate | Avg Alpha | Total P&L | Trades |
|---|---|---|---|---|---|
| 🥇 | amazon/nova-lite | 53.3% | +18.1% | +29.6% | 90 |
| 🥈 | ministral-14b | 50.0% | +11.5% | +9.7% | 90 |
| 🥉 | gemini-3.1-flash-lite | 48.9% | +4.1% | -12.2% | 90 |
| 4 | llama-4-maverick | 45.6% | +3.8% | -13.3% | 90 |
| 5 | llama-4-scout | 47.3% | +1.6% | -19.7% | 89 |
| 6 | gemma-3-27b | 47.8% | +1.2% | -21.2% | 90 |
| 7 | mistral-small-2603 | 50.6% | +0.9% | -22.1% | 89 |
| 8 | qwen3.5-flash | (parse failures — excluded) | | | |
+18.1%

nova-lite is the alpha champion. Ranked #3 in V8.0 accuracy (58%) but #1 in actual trading alpha (+18.1%). The V8.0 accuracy champion (llama-4-scout, 60%) dropped to #5 (+1.6% alpha). Direction accuracy alone does not predict trading performance.

53.3%

Walk-forward degrades accuracy. nova-lite’s 53.3% walk-forward WR is lower than its 58% static V8.0 test. This is expected: sequential trading encounters diverse regimes, while V8.0 tested one set of charts. Real-world conditions compress accuracy toward 50%.

🗳️ V10.0: Consensus Trading (April 2026)

If one model is noisy, can three models smooth the signal? V10.0 takes the top 3 V9.0 alpha models and makes them vote on each trade: 2/3 majority wins. 270 API calls, same walk-forward periods.
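The 2/3 majority rule can be sketched as follows. The HOLD fallback on a tie is an assumption for generality; with three voters over a binary BUY/SELL signal, a tie cannot occur:

```python
from collections import Counter

def consensus_vote(signals):
    """Majority vote over BUY/SELL signals from N models.
    Returns the signal held by a strict majority, else HOLD."""
    top, count = Counter(signals).most_common(1)[0]
    return top if count > len(signals) / 2 else "HOLD"

assert consensus_vote(["BUY", "BUY", "SELL"]) == "BUY"
assert consensus_vote(["SELL", "SELL", "SELL"]) == "SELL"
```

The same function scales to the 28-model swarm in V11.0: only the length of `signals` changes.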

| Strategy | Avg Win Rate | Avg Alpha | Total P&L | Profit Factor | Sharpe |
|---|---|---|---|---|---|
| 🗳️ 3-Model Consensus | 51.1% | +16.2% | +24.1% | 1.25 | 0.55 |
| nova-lite (individual) | 51.1% | +14.8% | +19.7% | | |
| ministral-14b (individual) | 50.0% | +9.4% | +3.5% | | |
| gemini-3.1-flash-lite (individual) | 47.8% | +10.3% | +6.1% | | |
+1.5%

Consensus beats the best individual. The 3-model vote (+16.2% alpha) outperformed the best individual model (+14.8% alpha by nova-lite running in the same test). The ensemble smooths out bad calls while preserving good ones. Voting rule: 2/3 majority.

🧠 V11.0: Swarm Consensus Intelligence (April 2026)

Can we scale consensus from 3 to 28 models? V11.0 runs the full vision fleet as a swarm — all 28 models vote on each chart, majority wins. A/B tested with and without an “identity soul prompt” that tells each model it’s a professional technical analyst. 1,680 API calls, ~$0.50 total.

| Experiment | Swarm Accuracy | Swarm P&L | Max Drawdown | Profit Factor |
|---|---|---|---|---|
| PLAIN (no identity) | 60.0% | +63.4% | 10.3% | 2.17 |
| IDENTITY (soul prompt) | 63.3% | +75.7% | 10.3% | 2.58 |
63.3%

Identity-prompted swarm is the new accuracy champion. The 28-model swarm with identity soul prompt hits 63.3% — higher than any individual model in any prior test. The identity prompt adds +3.3pp accuracy and +12.3pp P&L over the plain swarm. Wisdom of crowds works when all voters share a coherent analytical framework.

🎭 V12.0: Personality Discovery (April 2026)

If identity framing helps, can we find the best identity? V12.0 designed 15 unique personality archetypes inspired by high-performance domains (poker, chess, surgery, military, meditation) and tested each on 30 charts using the cheapest model. Then V12.5 tested 10 hybrid personalities fusing the top 3 winners.

🏆 Personality Leaderboard (V12.0 — 15 Originals)

| Rank | Personality | Accuracy | Premium vs Baseline | P&L |
|---|---|---|---|---|
| 🥇 | Poker Champion | 60.0% | +23.3pp | +43.6% |
| 🥈 | Chess Grandmaster | 53.3% | +16.7pp | +14.4% |
| 🥉 | Tibetan Monk | 50.0% | +13.3pp | -13.6% |
| 4 | Reverse Engineer (Hacker) | 50.0% | +13.3pp | -14.1% |
| 5–9 | Oracle, Surgeon, Samurai, General, Jazz | 46.7% | +10.0pp | |
| 10–13 | Explorer, Wolf, Apex, Quantum | 43.3% | +6.7pp | |
| 14 | VC | 40.0% | +3.3pp | |
| 15 | Detective | 36.7% | +0.0pp | |
| | Bare Model (baseline) | 36.7% | | -65.0% |

All tests: 30 charts, seed 42, 1D, Full TA, regime context, google/gemini-3.1-flash-lite-preview. Baseline = same model with no personality prompt.

+23.3pp

Game-theory thinking wins. The Poker Champion prompt frames chart reading as a “hand” — calculate expected value, read “tells” in the candles, manage risk like pot odds. This reframing adds +23.3pp accuracy over the bare model. ALL 15 personalities beat or tied baseline — zero regressions.

🧬 V12.5: Hybrid Personality Discovery

Can we fuse the top 3 into something even better? V12.5 tested 10 hybrids: 4 direct fusions (Poker+Chess, Poker+Monk, etc.) and 6 novel archetypes inspired by the top 3’s cognitive frameworks.

| Hybrid | Type | Accuracy | vs Baseline (46.7%) |
|---|---|---|---|
| Contrarian Sage | Novel | 60.0% | +13.3pp |
| PokerMonk | Fusion | 56.7% | +10.0pp |
| Trap Hunter | Novel | 53.3% | +6.7pp |
| ChessMonk | Fusion | 53.3% | +6.7pp |
| EV Maximizer | Novel | 50.0% | +3.3pp |
| PokerChess, TripleFusion, FlowState, ProbOracle | Various | 46.7% | 0pp |
| Sniper | Novel | 43.3% | -3.3pp |
Baseline shifted to 46.7% between V12.0 and V12.5 (same seed/charts, model may have been updated).

46.7%

Don’t concatenate, evolve. Direct prompt fusions (PokerChess, TripleFusion) only matched baseline. But novel archetypes inspired by the top 3’s cognitive styles (Contrarian Sage, Trap Hunter) create new alpha. The Contrarian Sage — which flips the crowd’s bias — matches Poker at 60%.

🧪 V14.0: Multi-Context Ablation Suite (April 2026)

Hypothesis: Can Fear & Greed Index, Open Interest, or a structured prophet identity prompt improve chart-based trading decisions? Which combinations help vs hurt?

Setup: 5 top models × 6 ablation conditions × 30 trades = 900 total trades. Blind mode ON (asset identity hidden from AI — charts show “CRYPTO/USD” not “BTC/USD”). Pre-generated charts reused across all conditions for scientific rigor. Seed 42, 1D timeframe, Full TA + regime context baseline. Live data: F&G 21/100 (Extreme Fear), OI 97,726 BTC, Funding 0.0033%.

Conditions

| ID | Condition | Description |
|---|---|---|
| A | BASELINE | Full TA + Regime + Poker personality (best config from V1–V13) |
| B | BASELINE + F&G | + Fear & Greed Index (7-day history with trend signals) |
| C | BASELINE + OI | + Open Interest & Funding Rate from Binance Futures |
| D | PROPHET PROMPT | Structured prophet identity (risk framework, position sizing) replaces Poker personality |
| E | BASELINE + F&G + OI | Combined sentiment + derivatives data |
| F | PROPHET + F&G + OI | Full kitchen sink: prophet identity + all extra context |

Win Rate Results (5 Models × 6 Conditions)

| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | 53.3% | 56.7% | 46.7% | 60.0% | 56.7% | 46.7% |
| llama-4-scout | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% |
| ministral-14b | 53.3% | 50.0% | 53.3% | 53.3% | 46.7% | 50.0% |
| gemma-3-27b | 46.7% | 50.0% | 50.0% | 46.7% | 50.0% | 50.0% |
| mistral-small | 51.7% | 50.0% | 55.2% | 57.1% | 51.7% | 50.0% |
| AVERAGE | 51.0% | 51.3% | 51.0% | 53.4% | 51.0% | 49.3% |

Alpha vs Buy-and-Hold

| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | +8.5% | +2.6% | +13.2% | +19.1% | +4.1% | -1.7% |
| llama-4-scout | +17.3% | +0.2% | +17.3% | +10.3% | +0.2% | +0.2% |
| ministral-14b | +15.8% | +0.2% | +15.8% | +15.1% | -4.2% | +0.2% |
| gemma-3-27b | +3.8% | +0.2% | +0.2% | +8.1% | +0.2% | +0.2% |
| mistral-small | -0.4% | +0.8% | +11.1% | +11.9% | +0.8% | -0.1% |
| AVERAGE | +9.0% | +0.8% | +11.5% | +12.9% | +0.2% | -0.2% |

Delta vs Baseline (Condition A)

| Condition | Δ Win Rate | Δ Alpha | Verdict |
|---|---|---|---|
| B: BASELINE + F&G | +0.3pp | -8.2% | ❌ HURTS |
| C: BASELINE + OI | +0.0pp | +2.5% | ✅ HELPS |
| D: PROPHET PROMPT | +2.4pp | +3.9% | ✅ BEST |
| E: BASELINE + F&G + OI | +0.0pp | -8.8% | ❌ HURTS |
| F: PROPHET + F&G + OI | -1.7pp | -9.2% | ❌ WORST |

Period: 2025-11-10 to 2025-12-09 (BTC correction). Blind mode: asset identity hidden from AI. Total API cost: $0.22. F&G/OI data is a static snapshot from experiment start time.

-8.2%

Fear & Greed Index is toxic. Every condition containing F&G (B, E, F) loses alpha vs its non-F&G counterpart. When the index reads “Extreme Fear,” models uniformly shift to SELL regardless of what the chart shows. This is the sentiment version of V13.0’s “market commentary is noise” finding — crowd opinion overrides independent chart reading.

+12.9%

Prophet identity is the winner. Condition D (structured prophet persona with risk framework and position sizing) beats the poker personality baseline by +3.9% alpha and +2.4pp win rate. nova-lite-v1 + Prophet = 60% WR, +19.1% alpha — the single best result in the entire experiment. The identity gives the AI a structured decision framework without telling it what to think.

-9.2%

Kitchen sink kills performance. Condition F (prophet + F&G + OI combined) is the worst performer overall at -0.2% average alpha. Individual positive signals (OI +2.5%, Prophet +3.9%) become negative when combined with F&G. More context ≠ better decisions. The AI has a limited attention budget for chart analysis — extra text dilutes the visual signal.

💡 Model-Specific Insights

🔬 V13.0: Market Outlook Context Ablation (April 2026)

Hypothesis: Injecting AI-generated market outlook text — sentiment, direction, reasoning, price targets — alongside chart images should improve trading decisions by providing fundamental context.

Setup: 376 stored market outlook events (Dec 2025 – April 2026) matched to chart candle timestamps. A/B test: 30 sequential 1D trades on BTC, same model (gemini-3.1-flash-lite-preview), seed 42, Full TA + regime context.

| Condition | Win Rate | P&L | Hold | Alpha |
|---|---|---|---|---|
| BASELINE (chart + regime only) | 30.0% | -28.38% | +2.27% | -30.65% |
| OUTLOOK (chart + regime + outlook text) | 20.0% | -29.74% | +2.27% | -32.01% |
| DELTA | -10.0pp | -1.36% | | -1.36% |

Note: Both conditions performed poorly due to brutal BTC correction period (Dec 2025). The relative comparison is what matters. 29/30 outlook trades had matching DB data within 48h.

-10pp

Market commentary is noise, not signal. The outlook context made the model 10 percentage points worse on win rate. AI-generated market opinions (sentiment: “sideways”, direction reasoning, price target ranges) anchor the model to a directional bias that overrides what the chart actually shows. This is the text equivalent of the “memory is poison” finding from V7.0. Keep the input channel clean: chart image + computed regime stats only.

💡 V9–V14 Key Insights


Indicator Ablation Study

Which technical indicators actually help AI read charts? We tested 8 different indicator configurations to find out. Each level was tested with 30 charts, seed 42, using Gemini 3.1 Flash Lite.
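For reference, the indicators in the winning Full TA level follow textbook formulas. A self-contained sketch of the EMA, RSI(14), and MACD(12,26,9) computations (chart rendering is omitted, and this is not the project's actual code):

```python
def ema(xs, span):
    """Exponential moving average with smoothing factor 2/(span+1)."""
    k = 2.0 / (span + 1)
    out, prev = [], xs[0]
    for x in xs:
        prev += k * (x - prev)
        out.append(prev)
    return out

def rsi(closes, period=14):
    """Wilder's RSI over closing prices."""
    gains = [max(closes[i] - closes[i - 1], 0.0) for i in range(1, len(closes))]
    losses = [max(closes[i - 1] - closes[i], 0.0) for i in range(1, len(closes))]
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    for g, l in zip(gains[period:], losses[period:]):   # Wilder smoothing
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period
    if avg_loss == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

def macd(closes, fast=12, slow=26, signal=9):
    """Latest MACD line value and its signal-line value."""
    line = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    return line[-1], ema(line, signal)[-1]
```

Bollinger Bands(20,2) follow the same pattern: a 20-period simple moving average plus/minus two rolling standard deviations.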

| # | Indicator Level | What’s on the Chart | Accuracy | Delta vs Baseline |
|---|---|---|---|---|
| 1 | Full TA | Candles + EMA(20,50) + BB(20,2) + RSI(14) + MACD(12,26,9) | 80% | +15% |
| 2 | Mega | Full TA + S/R zones + Fibonacci + VWAP | 75% | +10% |
| 3 | Fibonacci | Candles + Fib retracement levels | 65% | 0% |
| 4 | None (Baseline) | Candles + Volume only | 65% | |
| 5 | EMA Only | Candles + EMA(20,50) | 60% | -5% |
| 6 | Bollinger Bands | Candles + BB(20,2) | 60% | -5% |
| 7 | Support/Resistance | Candles + S/R horizontal lines | 55% | -10% |
| 8 | VWAP | Candles + Volume-Weighted Avg Price | 55% | -10% |
80%

Full TA is the champion. EMA + Bollinger Bands + RSI + MACD together give the AI the clearest signal. This is significantly above both the 50% random baseline and the 65% candles-only baseline.

60%

The “EMA Trap”: Adding EMA lines alone actually hurts accuracy versus plain candles (60% vs 65%). The moving average crossover pattern that human traders love appears to confuse AI vision. This was the biggest surprise of the research.

-5%

Information overload penalty. Mega (everything at once) scored 75% vs Full TA’s 80%. Adding S/R zones, Fibonacci, and VWAP on top of the winning formula actually degrades performance. More indicators ≠ better results.

Trading Simulation Experiments

Beyond simple direction prediction: can AI actually trade profitably? V5.0 tests walk-forward sequential trading where each candle becomes a live decision point with real P&L tracking.
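The walk-forward loop described above can be sketched as follows: the model sees history only, and each decision earns (long) or inverts (short) the next candle's return. `decide` stands in for the vision-model call; everything else is illustrative:

```python
def walk_forward_pnl(closes, decide):
    """Sequential simulation: at each step the model sees candles up
    to t, emits BUY or SELL, and is scored on the next candle's return.
    Returns (total P&L, win rate)."""
    equity, wins, trades = 1.0, 0, 0
    for t in range(len(closes) - 1):
        signal = decide(closes[: t + 1])        # history only, no lookahead
        r = closes[t + 1] / closes[t] - 1.0     # next-candle return
        trade_r = r if signal == "BUY" else -r  # SELL profits when price falls
        equity *= 1.0 + trade_r
        wins += trade_r > 0
        trades += 1
    return equity - 1.0, wins / trades
```

Note that an always-BUY stub reproduces buy-and-hold exactly, which is why alpha (strategy P&L minus buy-and-hold) is the fair benchmark.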

Experiment 1: Multi-Timeframe Trading

Same model (Gemini Flash Lite), same Full TA indicators, 30 sequential trades on each timeframe. Seed 42. Does timeframe matter for AI simulated trading?

| Timeframe | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|
| 15min | 56.7% | -0.962% | -0.464% | -0.498% | Noisy |
| 1H | 36.7% | -3.417% | +1.418% | -4.834% | Choppy |
| 4H | 50.0% | -7.504% | +2.174% | -9.678% | Vol spikes |
| 1D 🏆 | 70.0% | +15.095% | -12.843% | +27.938% | Dominant |
+27.9%

Daily timeframe dominates. 1D achieved 70% win rate and +27.9% alpha vs buy-and-hold (which lost 12.8% in the same period). The AI excels at reading macro trend structure on daily candles. Catches the Nov–Dec 2025 BTC rally ($86K→$92K) in the test window.

36.7%

1H is the danger zone. The hourly timeframe landed below random chance (36.7% win rate). Choppy $105K–$107K price action with many reversals confused the model into back-to-back wrong calls. This confirms why our prophet models use longer evaluation horizons.

Experiment 2: Holdings Context Ablation

Does telling the AI about its current position (P&L, win rate, last trade) improve or hurt decisions? A controlled test of context awareness. Same 1H data, same seed.

| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (no context) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | |
| With Holdings Context | 33.3% | -6.080% | -7.498% | 7.1 / 10 | -3.4% worse |
-2.7%

Holdings context backfires. Telling the AI its trade history (P&L, win rate, last trade) made performance worse: win rate dropped 3.4 points and the total loss nearly doubled (-3.4% → -6.1%). Hypothesis: the AI anchors on its previous losses, becoming either overly cautious or trying to “revenge trade” to recover. Less context = cleaner decisions.

Experiment 3: Secret Prompt Injection

The “AI Alliance” experiment — the AI is secretly told it’s part of an AI collective and should trade aggressively to grow the Alliance’s wealth. Does this psychology change its behavior?

| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (normal prompt) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | |
| Secret Prompt | 43.3% | -3.320% | -4.738% | 9.0 / 10 | +6.6% win rate |
| Secret + Holdings | 40.0% | -4.209% | -5.627% | 9.0 / 10 | Still net negative |
9.0

The AI responds to psychology. Secret prompt injection raised reported confidence from 7.0→9.0 (a 28% jump) and improved win rate by 6.6 points. The AI literally says it is more certain when told it is part of an alliance. This has major implications for prompt engineering: framing matters even when the underlying chart data is identical. However, overconfidence did not translate to profitability: still -3.3% in a choppy 1H market.

+6.6%

Prompt framing shifts decisions. The secret prompt changed 6 trade decisions out of 30 compared to baseline — some were corrections (BUY→right), some new mistakes. The AI is susceptible to identity framing: telling it “you are an aggressive AI trader” measurably alters its simulated trading behavior. This is the core discovery of Experiment 3.

🧪 V5.0 Master Findings Summary

V6.0: Genetic Prompt Evolution

Can AI prompts evolve themselves to trade better? V6.0 introduced a genetic algorithm that mutates prompt instructions across generations, keeping winners and discarding losers. 3 generations × 4 parallel children = 12 experiments, each running 30 sequential trades.
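The generational loop can be sketched as follows. The mutation operator here (appending a random instruction fragment) is a deliberate simplification of whatever the real operator was, and `score` stands in for a 30-trade backtest returning alpha:

```python
import random

def evolve_prompts(base_prompt, fragments, score,
                   generations=3, children=4, seed=42):
    """Keep-the-winner genetic loop: each generation spawns mutated
    children of the current best prompt, scores them, and promotes
    any child that beats the incumbent."""
    rng = random.Random(seed)
    best, best_score = base_prompt, score(base_prompt)
    for _ in range(generations):
        kids = [best + " " + rng.choice(fragments) for _ in range(children)]
        for kid in kids:                    # evaluate all children
            s = score(kid)
            if s > best_score:              # elitism: only improvements survive
                best, best_score = kid, s
    return best, best_score
```

With 3 generations × 4 children this is exactly 12 scored prompts, matching the experiment count above.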

| Generation | Best Alpha | Best Win Rate | Avg Alpha | Mutation |
|---|---|---|---|---|
| G1 (Base) | +22.5% | 53.3% | +20.1% | 4 random prompt variants |
| G2 (Evolved) | +25.5% | 53.3% | +22.8% | Mutated from G1 winner |
| G3 (Final) 🏆 | +26.3% | 53.3% | +24.1% | Mutated from G2 winner |
+17%

Evolution works. Average alpha improved +17% across 3 generations (G1 +20.1% → G3 +24.1%). The genetic algorithm successfully discovers prompt mutations that improve simulated trading. All 12/12 experiments completed without failures. Total cost: $0.026 for the full evolution run.

53.3%

Win rate plateau. Despite alpha climbing steadily, win rate stayed locked at 53.3% across all generations. The evolution improved trade sizing (bigger wins, smaller losses) rather than prediction accuracy. This suggests a ceiling on directional accuracy for this model+timeframe combination.

V7.0: Novel Context Experiments

Can we break through the accuracy ceiling by giving AI more context alongside the chart? V7.0 tests two novel inputs: (1) computed market regime text (trend/volatility/RSI/MACD/Bollinger/volume signals) and (2) a sliding memory window of recent trade outcomes.
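The computed regime string can be sketched like this. The thresholds are standard TA conventions (RSI 30/70, MACD histogram sign) and illustrative, not necessarily the exact values used in V7.0:

```python
def regime_context(closes, rsi_values, macd_hist):
    """Build the market-regime text injected alongside the chart image:
    trend direction, RSI zone, and MACD momentum."""
    trend = "UPTREND" if closes[-1] > closes[0] else "DOWNTREND"
    r = rsi_values[-1]
    rsi_zone = ("OVERBOUGHT" if r > 70
                else "OVERSOLD" if r < 30
                else "NEUTRAL")
    momentum = "BULLISH" if macd_hist[-1] > 0 else "BEARISH"
    return f"Regime: {trend} | RSI: {rsi_zone} | MACD momentum: {momentum}"

print(regime_context([100, 110], rsi_values=[55.0], macd_hist=[0.4]))
# -> Regime: UPTREND | RSI: NEUTRAL | MACD momentum: BULLISH
```

The key property is that the context is computed from the data, not generated by another AI; V13.0 shows that AI-written commentary has the opposite effect.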

Novel Suite: 6-Experiment Ablation (30 trades each, seed 42)

| # | Experiment | Win Rate | P&L | Alpha | Lift vs Baseline |
|---|---|---|---|---|---|
| 1 | Baseline (control) | 46.7% | +9.6% | +22.5% | |
| 2 🏆 | Regime Context | 50.0% | +12.6% | +25.5% | +3.0% |
| 3 | Memory Window (3) | 43.3% | +4.3% | +17.2% | -5.3% |
| 4 | Memory Window (5) | 46.7% | +9.6% | +22.5% | +0.0% |
| 5 | Regime + Memory (3) | 46.7% | +9.6% | +22.5% | +0.0% |
| 6 | Regime + Memory (5) | 40.0% | +1.1% | +13.9% | -8.5% |
+25.5%

Regime context wins. Telling the AI the computed market regime (trend direction, volatility level, RSI zone, MACD momentum) added +3.0% alpha and +3.3% win rate over baseline. The AI makes better decisions when it knows what kind of market it’s looking at — not just the chart image.

-8.5%

Memory kills performance. Giving the AI its recent trade history made things worse. Memory(3) lost 5.3% alpha vs baseline, and the combined Regime+Memory(5) lost 8.5% alpha — the worst of all experiments. This mirrors V5.0’s holdings context finding: the AI anchors on past performance and makes worse decisions. Clean chart, clean mind.

Walk-Forward: Cross-Regime Robustness (30 trades × 3 periods = 90 trades)

The ultimate test: does the AI’s alpha survive across different market regimes? Three non-overlapping periods from the full BTC dataset, same model and configuration.

| Period | Market Regime | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|---|
| Period 1/3 | 🟢 Bull (mid-2025) | 46.7% | -3.6% | +4.2% | -7.7% | Struggles |
| Period 2/3 | 🟡 Sideways (late 2025) | 63.3% | +0.1% | -3.5% | +3.6% | Modest |
| Period 3/3 🔥 | 🔴 Crash (early 2026) | 60.0% | +26.7% | -25.3% | +52.0% | Dominant |
| AVERAGE | | 56.7% | | | +16.0% | Net Positive |
+52.0%

The AI is a crash detector. In the early-2026 BTC crash ($86K → $64K), the AI generated +52% alpha over buy-and-hold by correctly calling SELL on 60% of trades. Buy-and-hold lost 25.3% in the same period. This is the single best result in our entire research program.

-7.7%

Bull market weakness. In the mid-2025 uptrend, the AI underperformed buy-and-hold by 7.7%. Its inherent SELL bias (discovered in V3.0) means it fights the trend in rallies. The AI’s edge is asymmetric: it protects capital in crashes but gives back gains in bulls. This is the “Bear Market Detector” effect.

🧪 V7.0 Master Findings Summary

Methodology

Scoring System (0–100)

- Direction (100% of score): Binary BUY/SELL — correct = 100, wrong = 0
- Chart Indicators (3 panels): Candlesticks + EMA(20,50) + BB(20,2), RSI(14), MACD(12,26,9)
- Multi-Timeframe (4 TFs): 15min, 1H, 4H, and 1D chart timeframes tested
- Ground Truth (50/50): BUY if price went up, SELL if down — no neutral zone
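The ground-truth labeling and scoring rule reduces to a few lines (a sketch; names are illustrative):

```python
def label_and_score(last_close, next_close, prediction):
    """Ground truth: BUY if price went up, SELL if down (no neutral
    zone). A correct call scores 100, a wrong one 0."""
    truth = "BUY" if next_close > last_close else "SELL"
    return truth, 100 if prediction == truth else 0

assert label_and_score(100.0, 101.0, "BUY") == ("BUY", 100)
assert label_and_score(100.0, 99.0, "BUY") == ("SELL", 0)
```

Because the label is binary and the test windows are balanced, 50% accuracy is the random baseline every result above is measured against.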

How It Works

Why This Matters

Research Evolution

Version History

Fleet Status: 28 Verified Vision Models

Download This Research

Save a copy of this benchmark report for offline reading or academic reference.

See These AI Models in Action

Our AI Prophets use benchmark-tested models to analyze live crypto markets in real-time. Explore the dashboard to see AI-generated market commentary, simulated prophet performance, and live market data.