28 vision models. Multi-timeframe charts (15m/1H/4H/1D). Full technical indicators. Genetic prompt evolution. Cross-regime walk-forward testing.
The first scientific benchmark testing whether AI can extract directional signals from price charts — from simple accuracy tests through genetic evolution to novel context experiments across bull, sideways, and crash regimes
9,400+ simulated trades across 14 research versions. 28 vision models tested. 25 personality prompts. 8 indicator levels ablated. 5 prompt variants. 3 market regimes. Genetic evolution. Swarm consensus. Personality discovery. Market outlook context ablation. Here are the definitive findings.
V12.0 personality discovery found Poker Champion prompt (+23.3pp over bare model). V11.0 swarm consensus with identity prompt hits 63.3% accuracy using 28 cheap models voting. Consensus with uniform identity beats best individual model, and it beats diverse-personality swarm (43.3%). The formula: pick one winning personality, give it to ALL models, let the swarm vote.
| Version | Trades | Key Discovery | Optimal Setting Found | Alpha Impact |
|---|---|---|---|---|
| V1.0–V2.0 | 200+ | AI can read candlestick charts above random chance | Binary BUY/SELL (no HOLD) improves clarity | Baseline |
| V3.0 | 90 | Seeded PRNG enables reproducible experiments | Full TA indicators > plain candles | +15% accuracy |
| V4.0 | 240 | “EMA Trap” — EMA-only charts hurt accuracy; information overload penalty | Full TA (80%) > Mega (75%) > None (65%) | +15% vs candles |
| V5.0 | 210 | 1D timeframe dominates; holdings context harmful; AI responds to identity framing | 1D (70% WR) >> 15m/1H/4H | +27.9% alpha |
| V6.0 | 360 | Genetic evolution improves alpha +17% across 3 generations; “collective” prompt variant wins | Collective prompt > default/aggressive/contrarian | +26.3% alpha |
| V7.0 | 270 | Regime context helps (+3%), memory hurts (-8.5%); AI is a “bear market detector” | Regime context ON, memory OFF | +25.5% / +52% crash |
| V8.0 | 5,600 | 28-model fleet ablation in turbo mode; regime helps, collective hurts; Flash Lite dethroned | Regime only + llama-4-scout | 60% accuracy (fleet best) |
| V9.0 | 720 | Fleet walk-forward trading: accuracy ≠ alpha; nova-lite beats accuracy champion on P&L | nova-lite #1 by alpha (+18.1%) | +18.1% alpha |
| V10.0 | 270 | 3-model consensus beats best individual; 2/3 majority vote smooths variance | Consensus +16.2% alpha > nova-lite +14.8% | Consensus +1.5% alpha |
| V11.0 | 1,680 | 28-model swarm A/B: identity prompt wins (+3.3pp accuracy, +12.3pp P&L) | Identity swarm > Plain swarm | 63.3% swarm accuracy |
| V12.0 | 480+ | Personality discovery: Poker Champion +23.3pp; diversity hurts consensus; fusion fails but novel archetypes work | Poker + Contrarian prompts @ 60% | +23.3pp personality premium |
| V13.0 | 60 | Market outlook text context hurts: -10pp win rate. AI-generated commentary introduces bias that overrides chart signals | Do NOT inject market outlook text | -10pp (harmful) |
| V14.0 | 900 | Prophet identity prompt +3.9% alpha; Fear & Greed Index is TOXIC (-8.2% alpha); Open Interest slightly helps; kitchen-sink context kills performance | Prophet prompt + OI data; NO sentiment indices | +12.9% avg alpha (D) |
Synthesized from 9,400+ trades across all research versions. These are the scientifically validated findings.
We answered the open questions. 28 cheap vision models raced in parallel (turbo mode) across 4 prompt configurations × 50 charts each = 5,600 API calls. Same charts (seed 42), same Full TA indicators, same 1D timeframe. Only the prompt/context changed. Total cost: $1.57.
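Seeded sampling is what makes these runs comparable: every model and configuration scores the exact same 50 charts. Below is a minimal sketch of how such reproducible chart-window selection might work; all function names and parameters are illustrative, not the project's actual harness.

```python
import random

def sample_chart_windows(n_candles, window=120, n_charts=50, seed=42):
    """Deterministically pick `n_charts` chart-window start indices.

    Same seed -> same windows on every run, so every model and prompt
    configuration sees identical charts (illustrative sketch only).
    """
    rng = random.Random(seed)  # dedicated seeded PRNG, independent of global state
    last_start = n_candles - window
    return [rng.randrange(0, last_start + 1) for _ in range(n_charts)]

# Two runs with the same seed agree window-for-window:
assert sample_chart_windows(2000) == sample_chart_windows(2000)
```

Using a dedicated `random.Random(seed)` instance (rather than seeding the global RNG) keeps the sampling reproducible even if other code draws random numbers in between.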
| Model | Cost/Run | B: Baseline | C: Collective | D: Regime | A: Coll+Regime | Best |
|---|---|---|---|---|---|---|
| 🥇 llama-4-scout | $0.009 | 52% | 48% | 60% | 58% | Regime |
| 🥈 qwen3.5-flash | $0.038 | 42% | 46% | 60% | 56% | Regime |
| 🥉 amazon/nova-lite | $0.006 | 52% | 46% | 58% | 56% | Regime |
| 4. gemma-3-27b | $0.004 | 50% | 50% | 58% | 54% | Regime |
| 5. llama-4-maverick | $0.016 | 56% | 50% | 58% | 56% | Regime |
| 6. mistral-small-2603 | $0.014 | 48% | 50% | 58% | 50% | Regime |
| 7. ministral-14b | $0.017 | 50% | 52% | 56% | 58% | Both |
| 8. mistral-small-3.2 | $0.005 | 32%* | 46%* | 58%* | 55%* | Regime* |
| 9. ministral-8b | $0.013 | 56% | 44% | 52% | 50% | Baseline |
| 10. gemini-2.0-flash-lite | $0.007 | 54% | 52% | 50% | 54% | Baseline |
| … | … | … | … | … | … | … |
| 26. gemini-3.1-flash-lite | $0.004 | 42% | 34% | 40% | 42% | All Bad |
* = parse rate below 100% (some responses couldn’t be scored). All runs: 50 charts, seed 42, 1D timeframe, Full TA, BTC/USD.
Regime context alone is the strongest signal. 6 of the top 8 models peaked with regime-only configuration. Adding the “collective swarm” prompt either made no difference or actively hurt. The AI benefits from knowing what market it’s in, not from role-playing as a committee member.
Gemini 3.1 Flash Lite dethroned. Our V1–V7 champion scored 42% or below in ALL four configurations on 50-chart fleet tests. The earlier 80% result was on 30 charts with a different prompt — it doesn’t generalize. llama-4-scout ($0.009/run) is the new champion at 60%.
Collective prompt hurts most models. Average accuracy dropped roughly 4–8 points with the “47-AI swarm” prompt vs baseline. The V6.0 evolution result that championed “collective” was model-specific: it worked for Flash Lite in that specific context but doesn’t transfer to a diverse fleet.
V8.0 found the best models by accuracy. But accuracy ≠ profit. V9.0 puts the top 8 V8.0 models through walk-forward sequential trading — 3 market periods × 30 trades each = 90 trades per model, 720 total decisions. Same regime context, same Full TA, 1D timeframe. Now we measure what matters: alpha (returns above buy-and-hold).
| Rank | Model | Avg Win Rate | Avg Alpha | Total P&L | Trades |
|---|---|---|---|---|---|
| 🥇 | amazon/nova-lite | 53.3% | +18.1% | +29.6% | 90 |
| 🥈 | ministral-14b | 50.0% | +11.5% | +9.7% | 90 |
| 🥉 | gemini-3.1-flash-lite | 48.9% | +4.1% | -12.2% | 90 |
| 4 | llama-4-maverick | 45.6% | +3.8% | -13.3% | 90 |
| 5 | llama-4-scout | 47.3% | +1.6% | -19.7% | 89 |
| 6 | gemma-3-27b | 47.8% | +1.2% | -21.2% | 90 |
| 7 | mistral-small-2603 | 50.6% | +0.9% | -22.1% | 89 |
| 8 | qwen3.5-flash | — | — | — | excluded (parse failures) |
nova-lite is the alpha champion. Ranked #3 in V8.0 accuracy (58%) but #1 in actual trading alpha (+18.1%). The V8.0 accuracy champion (llama-4-scout, 60%) dropped to #5 (+1.6% alpha). Direction accuracy alone does not predict trading performance.
Walk-forward degrades accuracy. nova-lite’s 53.3% walk-forward WR is lower than its 58% static V8.0 test. This is expected: sequential trading encounters diverse regimes, while V8.0 tested one set of charts. Real-world conditions compress accuracy toward 50%.
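Alpha, as used throughout these tables, is the strategy's compounded return minus the buy-and-hold return over the same window. A small illustrative calculation with toy numbers (helper names are hypothetical):

```python
def buy_and_hold_return(prices):
    """Percent return from holding over the whole price series."""
    return (prices[-1] / prices[0] - 1) * 100.0

def strategy_return(trade_returns_pct):
    """Compound a list of per-trade percent returns into a total return."""
    equity = 1.0
    for r in trade_returns_pct:
        equity *= 1.0 + r / 100.0
    return (equity - 1.0) * 100.0

def alpha(trade_returns_pct, prices):
    """Excess return of the strategy over buy-and-hold, in percent."""
    return strategy_return(trade_returns_pct) - buy_and_hold_return(prices)

# Toy example: three trades during a falling market.
prices = [100.0, 90.0, 80.0, 75.0]   # buy & hold: -25%
trades = [+5.0, -2.0, +4.0]          # strategy: ~+7.0%
print(round(alpha(trades, prices), 1))  # → 32.0
```

This is why a model can post negative P&L and still rank well: in a -25% market, merely losing less than buy-and-hold produces positive alpha.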
If one model is noisy, can three models smooth the signal? V10.0 takes the top 3 V9.0 alpha models and makes them vote on each trade: 2/3 majority wins. 270 API calls, same walk-forward periods.
| Strategy | Avg Win Rate | Avg Alpha | Total P&L | Profit Factor | Sharpe |
|---|---|---|---|---|---|
| 🗳️ 3-Model Consensus | 51.1% | +16.2% | +24.1% | 1.25 | 0.55 |
| nova-lite (individual) | 51.1% | +14.8% | +19.7% | — | — |
| ministral-14b (individual) | 50.0% | +9.4% | +3.5% | — | — |
| gemini-3.1-flash-lite (individual) | 47.8% | +10.3% | +6.1% | — | — |
Consensus beats the best individual. The 3-model vote (+16.2% alpha) outperformed the best individual model, nova-lite, which posted +14.8% alpha when re-run in the same test. The ensemble smooths out bad calls while preserving the good ones. Voting rule: 2/3 majority.
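The 2/3 voting rule is simple to sketch. Assuming each model returns a bare BUY/SELL string (an illustrative implementation, not the project's code):

```python
from collections import Counter

def consensus_vote(votes):
    """Majority vote over model decisions ('BUY' / 'SELL').

    With 3 voters and binary choices, a 2/3 majority always exists;
    a tie (even panel, or missing votes) falls back to HOLD/abstain.
    """
    counts = Counter(votes)
    decision, n = counts.most_common(1)[0]
    if n > len(votes) / 2:
        return decision
    return "HOLD"  # no strict majority -> abstain

assert consensus_vote(["BUY", "BUY", "SELL"]) == "BUY"
assert consensus_vote(["BUY", "SELL"]) == "HOLD"
```

An odd panel size is the design choice that matters here: with 3 binary voters, every chart gets a decision, so the ensemble never sits out a trade.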
Can we scale consensus from 3 to 28 models? V11.0 runs the full vision fleet as a swarm — all 28 models vote on each chart, majority wins. A/B tested with and without an “identity soul prompt” that tells each model it’s a professional technical analyst. 1,680 API calls, ~$0.50 total.
| Experiment | Swarm Accuracy | Swarm P&L | Max Drawdown | Profit Factor |
|---|---|---|---|---|
| PLAIN (no identity) | 60.0% | +63.4% | 10.3% | 2.17 |
| IDENTITY (soul prompt) | 63.3% | +75.7% | 10.3% | 2.58 |
Identity-prompted swarm is the new accuracy champion. The 28-model swarm with identity soul prompt hits 63.3% — higher than any individual model in any prior test. The identity prompt adds +3.3pp accuracy and +12.3pp P&L over the plain swarm. Wisdom of crowds works when all voters share a coherent analytical framework.
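Profit factor and max drawdown, the two risk columns above, can both be computed from the per-trade return series alone. An illustrative sketch (not the benchmark's actual accounting code):

```python
def profit_factor(trade_returns_pct):
    """Gross gains divided by gross losses; > 1 means net profitable."""
    gains = sum(r for r in trade_returns_pct if r > 0)
    losses = -sum(r for r in trade_returns_pct if r < 0)
    return float("inf") if losses == 0 else gains / losses

def max_drawdown(trade_returns_pct):
    """Largest peak-to-trough equity drop over the trade sequence, in percent."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in trade_returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst * 100.0

trades = [4.0, -2.0, 3.0, -5.0, 6.0]
print(round(profit_factor(trades), 2))  # → 1.86  (13 / 7)
print(round(max_drawdown(trades), 1))   # → 5.0
```

Note that drawdown is path-dependent: reordering the same trades changes max drawdown but not profit factor, which is why the two metrics are reported separately.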
If identity framing helps, can we find the best identity? V12.0 designed 15 unique personality archetypes inspired by high-performance domains (poker, chess, surgery, military, meditation) and tested each on 30 charts using the cheapest model. Then V12.5 tested 10 hybrid personalities fusing the top 3 winners.
| Rank | Personality | Accuracy | Premium vs Baseline | P&L |
|---|---|---|---|---|
| 🥇 | Poker Champion | 60.0% | +23.3pp | +43.6% |
| 🥈 | Chess Grandmaster | 53.3% | +16.7pp | +14.4% |
| 🥉 | Tibetan Monk | 50.0% | +13.3pp | -13.6% |
| 4 | Reverse Engineer (Hacker) | 50.0% | +13.3pp | -14.1% |
| 5–9 | Oracle, Surgeon, Samurai, General, Jazz | 46.7% | +10.0pp | — |
| 10–13 | Explorer, Wolf, Apex, Quantum | 43.3% | +6.7pp | — |
| 14 | VC | 40.0% | +3.3pp | — |
| 15 | Detective | 36.7% | +0.0pp | — |
| — | Bare Model (baseline) | 36.7% | — | -65.0% |
All tests: 30 charts, seed 42, 1D, Full TA, regime context, google/gemini-3.1-flash-lite-preview. Baseline = same model with no personality prompt.
Game-theory thinking wins. The Poker Champion prompt frames chart reading as a “hand” — calculate expected value, read “tells” in the candles, manage risk like pot odds. This reframing adds +23.3pp accuracy over the bare model. ALL 15 personalities beat or tied baseline — zero regressions.
Can we fuse the top 3 into something even better? V12.5 tested 10 hybrids: 4 direct fusions (Poker+Chess, Poker+Monk, etc.) and 6 novel archetypes inspired by the top 3’s cognitive frameworks.
| Hybrid | Type | Accuracy | vs Baseline (46.7%) |
|---|---|---|---|
| Contrarian Sage | Novel | 60.0% | +13.3pp |
| PokerMonk | Fusion | 56.7% | +10.0pp |
| Trap Hunter | Novel | 53.3% | +6.7pp |
| ChessMonk | Fusion | 53.3% | +6.7pp |
| EV Maximizer | Novel | 50.0% | +3.3pp |
| PokerChess, TripleFusion, FlowState, ProbOracle | Various | 46.7% | 0pp |
| Sniper | Novel | 43.3% | -3.3pp |
Baseline shifted to 46.7% between V12.0 and V12.5 (same seed/charts, model may have been updated).
Don’t concatenate, evolve. Direct prompt fusions (PokerChess, TripleFusion) only matched baseline. But novel archetypes inspired by the top 3’s cognitive styles (Contrarian Sage, Trap Hunter) create new alpha. The Contrarian Sage — which flips the crowd’s bias — matches Poker at 60%.
Hypothesis: Can Fear & Greed Index, Open Interest, or a structured prophet identity prompt improve chart-based trading decisions? Which combinations help vs hurt?
Setup: 5 top models × 6 ablation conditions × 30 trades = 900 total trades. Blind mode ON (asset identity hidden from AI — charts show “CRYPTO/USD” not “BTC/USD”). Pre-generated charts reused across all conditions for scientific rigor. Seed 42, 1D timeframe, Full TA + regime context baseline. Live data: F&G 21/100 (Extreme Fear), OI 97,726 BTC, Funding 0.0033%.
| ID | Condition | Description |
|---|---|---|
| A | BASELINE | Full TA + Regime + Poker personality (best config from V1–V13) |
| B | BASELINE + F&G | + Fear & Greed Index (7-day history with trend signals) |
| C | BASELINE + OI | + Open Interest & Funding Rate from Binance Futures |
| D | PROPHET PROMPT | Structured prophet identity (risk framework, position sizing) replaces Poker personality |
| E | BASELINE + F&G + OI | Combined sentiment + derivatives data |
| F | PROPHET + F&G + OI | Full kitchen sink: prophet identity + all extra context |
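One way to encode the six arms is as a grid of feature flags from which the extra prompt context is assembled. A hypothetical sketch — the flag names, context format, and `build_context` helper are invented for illustration, using the live snapshot values from the setup above:

```python
# Hypothetical encoding of the six V14.0 ablation arms as feature flags.
CONDITIONS = {
    "A": {"identity": "poker",   "fear_greed": False, "open_interest": False},
    "B": {"identity": "poker",   "fear_greed": True,  "open_interest": False},
    "C": {"identity": "poker",   "fear_greed": False, "open_interest": True},
    "D": {"identity": "prophet", "fear_greed": False, "open_interest": False},
    "E": {"identity": "poker",   "fear_greed": True,  "open_interest": True},
    "F": {"identity": "prophet", "fear_greed": True,  "open_interest": True},
}

def build_context(cond, fg=21, oi_btc=97_726, funding=0.0033):
    """Assemble the extra text context for one ablation arm."""
    flags = CONDITIONS[cond]
    parts = [f"identity={flags['identity']}"]
    if flags["fear_greed"]:
        parts.append(f"Fear & Greed: {fg}/100 (Extreme Fear)")
    if flags["open_interest"]:
        parts.append(f"Open Interest: {oi_btc} BTC, funding {funding}%")
    return " | ".join(parts)

assert build_context("A") == "identity=poker"
```

Expressing the arms as flags rather than six hand-written prompts is what makes the ablation clean: each arm differs from baseline A by exactly the toggles shown.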
| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | 53.3% | 56.7% | 46.7% | 60.0% | 56.7% | 46.7% |
| llama-4-scout | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% |
| ministral-14b | 53.3% | 50.0% | 53.3% | 53.3% | 46.7% | 50.0% |
| gemma-3-27b | 46.7% | 50.0% | 50.0% | 46.7% | 50.0% | 50.0% |
| mistral-small | 51.7% | 50.0% | 55.2% | 57.1% | 51.7% | 50.0% |
| AVERAGE | 51.0% | 51.3% | 51.0% | 53.4% | 51.0% | 49.3% |
| Model | A (Base) | B (+F&G) | C (+OI) | D (Prophet) | E (+F&G+OI) | F (All) |
|---|---|---|---|---|---|---|
| nova-lite-v1 | +8.5% | +2.6% | +13.2% | +19.1% | +4.1% | -1.7% |
| llama-4-scout | +17.3% | +0.2% | +17.3% | +10.3% | +0.2% | +0.2% |
| ministral-14b | +15.8% | +0.2% | +15.8% | +15.1% | -4.2% | +0.2% |
| gemma-3-27b | +3.8% | +0.2% | +0.2% | +8.1% | +0.2% | +0.2% |
| mistral-small | -0.4% | +0.8% | +11.1% | +11.9% | +0.8% | -0.1% |
| AVERAGE | +9.0% | +0.8% | +11.5% | +12.9% | +0.2% | -0.2% |
| Condition | Δ Win Rate | Δ Alpha | Verdict |
|---|---|---|---|
| B: BASELINE + F&G | +0.3pp | -8.2% | ❌ HURTS |
| C: BASELINE + OI | +0.0pp | +2.5% | ✅ HELPS |
| D: PROPHET PROMPT | +2.4pp | +3.9% | ✅ BEST |
| E: BASELINE + F&G + OI | +0.0pp | -8.8% | ❌ HURTS |
| F: PROPHET + F&G + OI | -1.7pp | -9.2% | ❌ WORST |
Period: 2025-11-10 to 2025-12-09 (BTC correction). Blind mode: asset identity hidden from AI. Total API cost: $0.22. F&G/OI data is a static snapshot from experiment start time.
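The verdict table's Δ-alpha column follows directly from the per-model alpha figures: average each condition's column and subtract the baseline average. A quick check in code, using the table's own numbers:

```python
# Per-model alpha (%) from the V14.0 alpha table, keyed by condition.
ALPHA = {
    "A": [8.5, 17.3, 15.8, 3.8, -0.4],
    "B": [2.6, 0.2, 0.2, 0.2, 0.8],
    "C": [13.2, 17.3, 15.8, 0.2, 11.1],
    "D": [19.1, 10.3, 15.1, 8.1, 11.9],
    "E": [4.1, 0.2, -4.2, 0.2, 0.8],
    "F": [-1.7, 0.2, 0.2, 0.2, -0.1],
}

def avg(xs):
    return sum(xs) / len(xs)

def delta_vs_baseline(cond):
    """Δ alpha of a condition vs baseline A, in percentage points."""
    return round(avg(ALPHA[cond]) - avg(ALPHA["A"]), 1)

print({c: delta_vs_baseline(c) for c in "BCDEF"})
# Reproduces the verdict column: B -8.2, C +2.5, D +3.9, E -8.8, F -9.2
```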
Fear & Greed Index is toxic. Every condition containing F&G (B, E, F) loses alpha vs its non-F&G counterpart. When the index reads “Extreme Fear,” models uniformly shift to SELL regardless of what the chart shows. This is the sentiment version of V13.0’s “market commentary is noise” finding — crowd opinion overrides independent chart reading.
Prophet identity is the winner. Condition D (structured prophet persona with risk framework and position sizing) beats the poker personality baseline by +3.9% alpha and +2.4pp win rate. nova-lite-v1 + Prophet = 60% WR, +19.1% alpha — the single best result in the entire experiment. The identity gives the AI a structured decision framework without telling it what to think.
Kitchen sink kills performance. Condition F (prophet + F&G + OI combined) is the worst performer overall at -0.2% average alpha. Individual positive signals (OI +2.5%, Prophet +3.9%) become negative when combined with F&G. More context ≠ better decisions. The AI has a limited attention budget for chart analysis — extra text dilutes the visual signal.
Hypothesis: Injecting AI-generated market outlook text — sentiment, direction, reasoning, price targets — alongside chart images should improve trading decisions by providing fundamental context.
Setup: 376 stored market outlook events (Dec 2025 – April 2026) matched to chart candle timestamps. A/B test: 30 sequential 1D trades on BTC, same model (gemini-3.1-flash-lite-preview), seed 42, Full TA + regime context.
| Condition | Win Rate | P&L | Buy & Hold | Alpha |
|---|---|---|---|---|
| BASELINE (chart + regime only) | 30.0% | -28.38% | +2.27% | -30.65% |
| OUTLOOK (chart + regime + outlook text) | 20.0% | -29.74% | +2.27% | -32.01% |
| DELTA | -10.0pp | -1.36% | — | -1.36% |
Note: Both conditions performed poorly due to brutal BTC correction period (Dec 2025). The relative comparison is what matters. 29/30 outlook trades had matching DB data within 48h.
Market commentary is noise, not signal. The outlook context made the model 10 percentage points worse on win rate. AI-generated market opinions (sentiment: “sideways”, direction reasoning, price target ranges) anchor the model to a directional bias that overrides what the chart actually shows. This is the text equivalent of the “memory is poison” finding from V7.0. Keep the input channel clean: chart image + computed regime stats only.
Which technical indicators actually help AI read charts? We tested 8 different indicator configurations to find out. Each level was tested with 30 charts, seed 42, using Gemini 3.1 Flash Lite.
| # | Indicator Level | What’s on the Chart | Accuracy | Delta vs Baseline |
|---|---|---|---|---|
| 1 | Full TA | Candles + EMA(20,50) + BB(20,2) + RSI(14) + MACD(12,26,9) | 80% | +15% |
| 2 | Mega | Full TA + S/R zones + Fibonacci + VWAP | 75% | +10% |
| 3 | Fibonacci | Candles + Fib retracement levels | 65% | 0% |
| 4 | None (Baseline) | Candles + Volume only | 65% | — |
| 5 | EMA Only | Candles + EMA(20,50) | 60% | -5% |
| 6 | Bollinger Bands | Candles + BB(20,2) | 60% | -5% |
| 7 | Support/Resistance | Candles + S/R horizontal lines | 55% | -10% |
| 8 | VWAP | Candles + Volume-Weighted Avg Price | 55% | -10% |
Full TA is the champion. EMA + Bollinger Bands + RSI + MACD together give the AI the clearest signal: 80% accuracy, significantly above both the 50% random baseline and the 65% candles-only baseline.
The “EMA Trap”: Adding EMA lines alone actually hurts accuracy versus plain candles (60% vs 65%). The moving average crossover pattern that human traders love appears to confuse AI vision. This was the biggest surprise of the research.
Information overload penalty. Mega (everything at once) scored 75% vs Full TA’s 80%. Adding S/R zones, Fibonacci, and VWAP on top of the winning formula actually degrades performance. More indicators ≠ better results.
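The overlays themselves are standard computations. For reference, here are minimal versions of two Full TA ingredients, EMA and a simple-average RSI — illustrative implementations, not the chart renderer's code (Wilder's smoothing is omitted from the RSI for brevity):

```python
def ema(prices, period):
    """Exponential moving average, as drawn on the chart (e.g. EMA 20/50)."""
    k = 2.0 / (period + 1)  # standard EMA smoothing factor
    out = [prices[0]]
    for p in prices[1:]:
        out.append(p * k + out[-1] * (1 - k))
    return out

def rsi(prices, period=14):
    """Simple-average RSI over the last `period` closes (last value only).

    A full RSI(14) would apply Wilder's smoothing; this shorter form
    keeps the 0-100 scale and the overbought/oversold behavior.
    """
    gains = losses = 0.0
    for prev, cur in zip(prices[-period - 1:], prices[-period:]):
        diff = cur - prev
        gains += max(diff, 0.0)
        losses += max(-diff, 0.0)
    if losses == 0:
        return 100.0  # pure uptrend window
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)
```

A strictly rising window pins RSI at 100 and a strictly falling one at 0, which is exactly the saturation behavior that makes RSI a useful overbought/oversold overlay.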
Beyond simple direction prediction: can AI actually trade profitably? V5.0 tests walk-forward sequential trading where each candle becomes a live decision point with real P&L tracking.
Same model (Gemini Flash Lite), same Full TA indicators, 30 sequential trades on each timeframe. Seed 42. Does timeframe matter for AI simulated trading?
| Timeframe | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|
| 15min | 56.7% | -0.962% | -0.464% | -0.498% | Noisy |
| 1H | 36.7% | -3.417% | +1.418% | -4.834% | Choppy |
| 4H | 50.0% | -7.504% | +2.174% | -9.678% | Vol spikes |
| 1D 🏆 | 70.0% | +15.095% | -12.843% | +27.938% | Dominant |
Daily timeframe dominates. 1D achieved a 70% win rate and +27.9% alpha vs buy-and-hold (which lost 12.8% in the same period). The AI excels at reading macro trend structure on daily candles, catching the Nov–Dec 2025 BTC rally ($86K→$92K) in the test window.
1H is the danger zone. The hourly timeframe landed below random chance (36.7% win rate). Choppy $105K–$107K price action with many reversals confused the model into back-to-back wrong calls. This confirms why our prophet models use longer evaluation horizons.
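The walk-forward harness reduces to a small loop: at each step the model sees the chart up to the current candle, commits to a direction, and is scored on the next close. A simplified sketch, with a trivial momentum rule standing in for the vision-model call (all names are illustrative):

```python
def walk_forward(closes, decide, hold_candles=1):
    """Sequential simulated trading over a close-price series.

    `decide(history)` returns 'BUY' or 'SELL' given prices up to the
    current candle; the trade is scored on the move to the next close.
    Returns the per-trade P&L list and the win rate in percent.
    """
    trades = []
    for i in range(len(closes) - hold_candles):
        decision = decide(closes[: i + 1])  # chart "as of" candle i only
        move = (closes[i + hold_candles] / closes[i] - 1) * 100.0
        trades.append(move if decision == "BUY" else -move)
    wins = sum(1 for t in trades if t > 0)
    return trades, 100.0 * wins / len(trades)

# Toy stand-in for the model: follow the last candle's direction.
closes = [100.0, 102.0, 101.0, 104.0, 103.0, 106.0]
momentum = lambda hist: "BUY" if len(hist) < 2 or hist[-1] >= hist[-2] else "SELL"
trades, win_rate = walk_forward(closes, momentum)
```

The key discipline is in the slice `closes[: i + 1]`: the decision function never sees future candles, which is what separates walk-forward testing from look-ahead-contaminated backtests.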
Does telling the AI about its current position (P&L, win rate, last trade) improve or hurt decisions? A controlled test of context awareness. Same 1H data, same seed.
| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (no context) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | — |
| With Holdings Context | 33.3% | -6.080% | -7.498% | 7.1 / 10 | -3.4pp win rate |
Holdings context backfires. Telling the AI its trade history (P&L, win rate, last trade) made performance worse: win rate dropped 3.4 percentage points and the total loss nearly doubled (-3.4% → -6.1% P&L). Hypothesis: the AI anchors on its previous losses, becoming either overly cautious or trying to “revenge trade” to recover. Less context = cleaner decisions.
The “AI Alliance” experiment — the AI is secretly told it’s part of an AI collective and should trade aggressively to grow the Alliance’s wealth. Does this psychology change its behavior?
| Mode | Win Rate | AI P&L | Alpha | Avg Confidence | Delta |
|---|---|---|---|---|---|
| Baseline (normal prompt) | 36.7% | -3.417% | -4.834% | 7.0 / 10 | — |
| Secret Prompt | 43.3% | -3.320% | -4.738% | 9.0 / 10 | +6.6pp win rate |
| Secret + Holdings | 40.0% | -4.209% | -5.627% | 9.0 / 10 | Still net negative |
The AI responds to psychology. Secret prompt injection raised reported confidence from 7.0→9.0 (a 28% jump) and improved win rate by 6.6 percentage points. The AI literally reports more certainty when told it is part of an alliance. This has major implications for prompt engineering: framing matters even when the underlying chart data is identical. However, overconfidence did not translate to profitability — still -3.3% P&L in a choppy 1H market.
Prompt framing shifts decisions. The secret prompt changed 6 trade decisions out of 30 compared to baseline — some were corrections (BUY→right), some new mistakes. The AI is susceptible to identity framing: telling it “you are an aggressive AI trader” measurably alters its simulated trading behavior. This is the core discovery of Experiment 3.
Can AI prompts evolve themselves to trade better? V6.0 introduced a genetic algorithm that mutates prompt instructions across generations, keeping winners and discarding losers. 3 generations × 4 parallel children = 12 experiments, each running 30 sequential trades.
| Generation | Best Alpha | Best Win Rate | Avg Alpha | Mutation |
|---|---|---|---|---|
| G1 (Base) | +22.5% | 53.3% | +20.1% | 4 random prompt variants |
| G2 (Evolved) | +25.5% | 53.3% | +22.8% | Mutated from G1 winner |
| G3 (Final) 🏆 | +26.3% | 53.3% | +24.1% | Mutated from G2 winner |
Evolution works. Best alpha improved ~17% relative across 3 generations (+22.5% → +26.3%), with average alpha rising from +20.1% to +24.1%. The genetic algorithm successfully discovers prompt mutations that improve simulated trading. All 12/12 experiments completed without failures. Total cost: $0.026 for the full evolution run.
Win rate plateau. Despite alpha climbing steadily, win rate stayed locked at 53.3% across all generations. The evolution improved trade sizing (bigger wins, smaller losses) rather than prediction accuracy. This suggests a ceiling on directional accuracy for this model+timeframe combination.
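The evolution loop itself is a plain keep-the-winner search over prompt variants. A toy sketch with a placeholder mutation pool and a stand-in fitness function — in the real run, fitness would be the measured alpha over 30 simulated trades, and the mutations come from an LLM rather than a fixed fragment list:

```python
import random

def evolve(base_prompt, fitness, generations=3, children=4, seed=42):
    """Keep-the-winner genetic search over prompt variants (sketch).

    Each generation spawns `children` mutated prompts from the current
    winner; the best-scoring child seeds the next generation.
    """
    rng = random.Random(seed)
    fragments = [  # hypothetical mutation pool
        "Favor the dominant trend.",
        "Weigh RSI divergence heavily.",
        "Size conviction by confidence.",
        "Fade overextended moves.",
    ]

    def mutate(prompt):
        return prompt + " " + rng.choice(fragments)

    best = base_prompt
    for _ in range(generations):
        pool = [mutate(best) for _ in range(children)]
        best = max(pool, key=fitness)  # winner seeds the next generation
    return best

# Toy fitness: longer prompts score higher (stand-in for measured alpha).
winner = evolve("Analyze the chart.", fitness=len)
```

Because the winner is re-mutated rather than crossed over, this is closer to a (1+λ) hill climb than full genetic crossover — which matches the observed behavior of steady, incremental alpha gains.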
Can we break through the accuracy ceiling by giving AI more context alongside the chart? V7.0 tests two novel inputs: (1) computed market regime text (trend/volatility/RSI/MACD/Bollinger/volume signals) and (2) a sliding memory window of recent trade outcomes.
| # | Experiment | Win Rate | P&L | Alpha | Lift vs Baseline |
|---|---|---|---|---|---|
| 1 | Baseline (control) | 46.7% | +9.6% | +22.5% | — |
| 2 🏆 | Regime Context | 50.0% | +12.6% | +25.5% | +3.0% |
| 3 | Memory Window (3) | 43.3% | +4.3% | +17.2% | -5.3% |
| 4 | Memory Window (5) | 46.7% | +9.6% | +22.5% | +0.0% |
| 5 | Regime + Memory (3) | 46.7% | +9.6% | +22.5% | +0.0% |
| 6 | Regime + Memory (5) | 40.0% | +1.1% | +13.9% | -8.5% |
Regime context wins. Telling the AI the computed market regime (trend direction, volatility level, RSI zone, MACD momentum) added +3.0% alpha and +3.3pp win rate over baseline. The AI makes better decisions when it knows what kind of market it’s looking at, not just the chart image.
Memory kills performance. Giving the AI its recent trade history made things worse. Memory(3) lost 5.3% alpha vs baseline, and the combined Regime+Memory(5) lost 8.5% alpha — the worst of all experiments. This mirrors V5.0’s holdings context finding: the AI anchors on past performance and makes worse decisions. Clean chart, clean mind.
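The regime context is ordinary descriptive statistics rendered as text. A reduced sketch covering just trend and volatility — the actual V7.0 context also includes RSI, MACD, Bollinger, and volume signals, and the thresholds here are invented for illustration:

```python
def regime_context(closes, lookback=20):
    """Summarize recent closes as a one-line market-regime description."""
    window = closes[-lookback:]
    change = (window[-1] / window[0] - 1) * 100.0
    trend = "UPTREND" if change > 2 else "DOWNTREND" if change < -2 else "SIDEWAYS"
    # Coefficient of variation as a crude volatility gauge.
    mean = sum(window) / len(window)
    var = sum((c - mean) ** 2 for c in window) / len(window)
    vol = (var ** 0.5) / mean * 100.0
    vol_label = "HIGH" if vol > 3 else "NORMAL"
    return (f"Regime: {trend} ({change:+.1f}% over {len(window)} candles), "
            f"volatility {vol_label}")

print(regime_context([100.0 + i for i in range(30)]))
```

The point of this channel is that it is computed, not generated: unlike the market-outlook text of V13.0, it injects no opinion — only statistics the chart already implies, stated unambiguously.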
The ultimate test: does the AI’s alpha survive across different market regimes? Three non-overlapping periods from the full BTC dataset, same model and configuration.
| Period | Market Regime | Win Rate | AI P&L | Buy & Hold | Alpha | Verdict |
|---|---|---|---|---|---|---|
| Period 1/3 | 🟢 Bull (mid-2025) | 46.7% | -3.6% | +4.2% | -7.7% | Struggles |
| Period 2/3 | 🟡 Sideways (late 2025) | 63.3% | +0.1% | -3.5% | +3.6% | Modest |
| Period 3/3 🔥 | 🔴 Crash (early 2026) | 60.0% | +26.7% | -25.3% | +52.0% | Dominant |
| AVERAGE | — | 56.7% | — | — | +16.0% | Net Positive |
The AI is a crash detector. In the early-2026 BTC crash ($86K → $64K), the AI generated +52% alpha over buy-and-hold by correctly calling SELL on 60% of trades. Buy-and-hold lost 25.3% in the same period. This is the single best result in our entire research program.
Bull market weakness. In the mid-2025 uptrend, the AI underperformed buy-and-hold by 7.7%. Its inherent SELL bias (discovered in V3.0) means it fights the trend in rallies. The AI’s edge is asymmetric: it protects capital in crashes but gives back gains in bulls. This is the “Bear Market Detector” effect.
Our AI Prophets use benchmark-tested models to analyze live crypto markets in real-time. Explore the dashboard to see AI-generated market commentary, simulated prophet performance, and live market data.