Methodology May 18, 2026 11 min read

Directional Alpha Is Unlearnable on Calibrated Markets

We trained an inverse-RL agent on a hindsight-optimal policy across 45,000 prediction markets. It picked the winning side 80% of the time — and would have lost money. Here's why, and what to do instead.

The question

If a perfect oracle told you, at decision time, which side of every prediction-market contract would win — and you bet that side at the live price — how much would you make?

The intuitive answer is "a lot." Prediction markets seem inefficient. Headlines say so. Twitter quants say so. The intuition is that the favorite-longshot bias, late-resolution drift, and oracle-lag arbitrage all leave directional money on the table.

We measured it. Across 45,000 capped prediction markets — Kalshi, Polymarket, Overtime — the hindsight-optimal trader's headroom over our actual policy was +$126K on Kalshi, +$2.2K on Overtime, and negative $38K on Polymarket. The first two suggest there's real money in correctly picking sides. The third is louder: Polymarket is already so well-calibrated that our existing fleet beats the capped hindsight-optimal trader net of fees. The mispricing is already harvested.
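As a rough illustration of what "headroom" measures, here is a minimal sketch. It assumes that "capped" means a fixed maximum stake per market and that fees are a flat fraction of the stake; the post specifies neither, and `oracle_pnl`, `headroom`, `stake_cap`, and `fee_rate` are hypothetical names, not part of the actual pipeline.

```python
def oracle_pnl(yes_price, yes_won, stake_cap=100.0, fee_rate=0.0):
    """P&L of buying the side that ends up winning, at the live price.

    Assumptions (not from the post): a fixed stake cap per market and a
    flat fee charged on the amount staked.
    """
    # Price paid for the winning side of a binary contract that pays 1.0.
    entry = yes_price if yes_won else 1.0 - yes_price
    if entry <= 0.0:
        raise ValueError("winning side must have a positive price")
    contracts = stake_cap / entry        # contracts the capped stake buys
    gross = contracts * (1.0 - entry)    # each winning contract pays 1.0
    fees = fee_rate * stake_cap
    return gross - fees


def headroom(markets, actual_pnl, **kwargs):
    """Oracle P&L summed over (yes_price, yes_won) pairs, minus realized P&L."""
    oracle_total = sum(oracle_pnl(price, won, **kwargs) for price, won in markets)
    return oracle_total - actual_pnl
```

Under this definition, headroom goes negative whenever the realized policy earns more than the capped oracle, for example through income the oracle sketch does not model.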

But the more interesting result came from trying to learn the optimal policy from data.

The experiment

We ran inverse reinforcement learning — an averaged-perceptron, max-margin IRL that recovers a linear reward w·φ over strictly decision-time features. Not behavior cloning. The distinction matters: behavior cloning would learn to imitate the optimal trader's actions on features that include future information, which is leakage in a particularly subtle disguise. IRL recovers the objective function the optimal trader is optimizing, using only features available at decision time, and then re-solves the policy from scratch.
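To make the idea concrete, here is a minimal sketch of an averaged-perceptron, max-margin update for a linear reward w·φ. It is not the ~120-line implementation used in the experiment: it assumes a one-step decision per market (so "re-solving the policy" reduces to a greedy argmax), and the array layout, the unit margin, and the function names are illustrative assumptions.

```python
import numpy as np


def averaged_perceptron_irl(phi, expert_action, epochs=10, seed=0):
    """Recover a linear reward w . phi[s, a] from a teacher's chosen actions.

    phi:           (n_markets, n_actions, n_features) decision-time features.
    expert_action: (n_markets,) index of the action the teacher took.
    Returns the average of the weight vector over all update steps.
    """
    rng = np.random.default_rng(seed)
    n, _, d = phi.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    steps = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            scores = phi[i] @ w
            # Margin-augmented scores: every non-expert action gets +1, so
            # the expert action has to win by a margin rather than just tie.
            augmented = scores + 1.0
            augmented[expert_action[i]] -= 1.0
            predicted = int(np.argmax(augmented))
            if predicted != expert_action[i]:
                # Move the reward toward the expert's features and away
                # from the features of the wrongly preferred action.
                w += phi[i, expert_action[i]] - phi[i, predicted]
            w_sum += w
            steps += 1
    return w_sum / steps


def greedy_policy(phi, w):
    """Re-solve the one-step policy: pick the action with the highest reward."""
    return np.argmax(phi @ w, axis=-1)
```

The leak-clean property lives entirely in how `phi` is built: if any column encodes post-decision information, this procedure will fit it just as happily as behavior cloning would.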

Two candidate strategies emerged. We pushed both through the full 26-method validation battery (see the methodology post) — purged combinatorial cross-validation, synthetic null markets, realistic fill simulation, fee stress tests, regime stratification, and forward-paper.
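For readers unfamiliar with purged combinatorial cross-validation, the sketch below shows the general shape of the split generator. It is a simplified stand-in, not the battery's implementation: it assumes samples are ordered in time and purges by a symmetric index-based embargo, whereas a full version would purge on overlapping label windows. The group count, embargo length, and function name are illustrative.

```python
from itertools import combinations

import numpy as np


def purged_combinatorial_splits(n_samples, n_groups=6, n_test_groups=2, embargo=5):
    """Yield (train_idx, test_idx) for every combination of test groups.

    Samples are assumed to be in time order. Any training sample within
    `embargo` positions of a test group is dropped to limit leakage.
    """
    groups = np.array_split(np.arange(n_samples), n_groups)
    for combo in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in combo])
        keep = np.ones(n_samples, dtype=bool)
        for g in combo:
            lo = max(groups[g][0] - embargo, 0)
            hi = min(groups[g][-1] + embargo, n_samples - 1)
            keep[lo:hi + 1] = False  # removes the test group and its embargo zone
        yield np.flatnonzero(keep), test_idx
```

With 6 groups and 2 test groups this yields 15 train/test splits, which is what lets a strategy's out-of-sample performance be read as a distribution rather than a single number.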

Zero candidates survived.

The IRL reward predicted the winning side beautifully out-of-sample:

Universe	IRL side-prediction accuracy (OOS)	Net P&L after fees
Kalshi	0.805	Negative
Overtime	0.692	Negative
Alpaca (equities)	0.498	~Coin flip

Read that table carefully. An 80.5% side-prediction accuracy on Kalshi lost money. On Alpaca (US equities, next-day direction), the IRL agent predicted direction with 0.498 accuracy, indistinguishable from a coin flip: nothing in our decision-time feature set predicted next-day equity direction. That negative result is, itself, a publishable finding.

Why IRL, not behavior cloning? Behavior cloning on a hindsight-optimal teacher is the classic data-leakage trap dressed in ML clothes — you end up modeling features that implicitly encode the outcome. IRL recovers a reward over decision-time-only features and then plans against that reward. Same teacher, leak-clean student. We would not have caught the 80%-accurate-yet-unprofitable result with BC.

Why 80% accuracy still loses

The math is mundane. On a calibrated binary market, betting the favorite side correctly 80% of the time means the average ticket returns roughly 0.8 × (1/p_avg) − 1.0 per dollar staked, where p_avg is the average price paid for the favorite (its market-implied probability). With p_avg ≈ 0.75 (typical for "predictable" markets), the gross expectation is 0.8 × 1.333 − 1.0 ≈ +0.066.
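The gross-expectation formula can be written as a short function. This is a sketch of the arithmetic only; the function name is ours, and it treats every ticket as bought at the same average price.

```python
def gross_edge_per_dollar(accuracy, avg_price):
    """Expected gross return per $1 staked on the favorite.

    A contract bought at `avg_price` pays 1.0 when it wins, so $1 buys
    1 / avg_price contracts and the expected payout is accuracy / avg_price.
    """
    return accuracy / avg_price - 1.0


edge_at_80 = gross_edge_per_dollar(0.80, 0.75)   # the case discussed in the text
edge_at_75 = gross_edge_per_dollar(0.75, 0.75)   # accuracy equal to the price
```

Without intermediate rounding the first value is about 0.067 (the 0.066 in the text comes from rounding 1/0.75 to 1.333). The second is zero: when accuracy merely matches the price, which is what calibration means, there is no gross edge at all.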

Now subtract:

Taker fee: ~1.75% on Kalshi, 0.30% on Polymarket-US, 0% on Limitless (but minimum-size constraints bite).
Slippage: 1–3% on the second-tier markets where the IRL agent thinks it has edge.
Adverse selection: the agent's fills cluster on its wrong trades because the right trades have already moved.
Resolution risk: non-zero probability of a dispute or oracle delay.

Net of those, in the order listed: 0.066 − 0.018 (fee) − 0.022 (slippage) − 0.030 (adverse selection) − 0.005 (resolution risk) ≈ −0.009. An average loss of 0.9% per ticket, even with 80% prediction accuracy. Multiplied across thousands of tickets, you bleed. Calibrated markets are calibrated precisely so that the gross edge from being right is roughly equal to the cost of getting in.
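The cost arithmetic above can be turned around to ask what accuracy would be needed just to break even. The sketch below assumes the four deductions map to the four listed costs in order, and that they are all expressed per dollar staked; the names are illustrative.

```python
# Per-$1 cost figures taken from the arithmetic above; the mapping of each
# number to a cost item is assumed from the order in which they are listed.
COSTS = {
    "taker_fee": 0.018,
    "slippage": 0.022,
    "adverse_selection": 0.030,
    "resolution_risk": 0.005,
}


def net_edge(gross_edge, costs=COSTS):
    """Gross edge per $1 staked minus all per-$1 costs."""
    return gross_edge - sum(costs.values())


def breakeven_accuracy(avg_price, costs=COSTS):
    """Accuracy at which accuracy / avg_price - 1 - total_cost equals zero."""
    return avg_price * (1.0 + sum(costs.values()))


net = net_edge(0.066)                # about -0.009, matching the text
needed = breakeven_accuracy(0.75)    # about 0.806
```

Under these assumed costs the break-even accuracy at an average price of 0.75 is roughly 80.6%, so an 80% or 80.5% predictor sits just on the losing side of the line.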

What does work

The same dataset that produced zero IRL winners produced 43 strategies that did pass purged combinatorial CV. None of them was a directional bet. They were:

Settlement-window convergence: prices must converge to {0, 1} in the final minutes — bet the convergence, not the outcome.
Maker rebate harvest: on Limitless, 100% maker rebate; on Polymarket-US, 0.20% maker rebate. Profitable from the rebate alone.
Oracle-lag arbitrage: the resolution oracle (NWS gridpoint weather, FRED nowcast, on-chain feed) updates on a known schedule; the market hasn't repriced yet.
Cross-asset lead-lag: Kalshi BTC reprices slower than CME and Coinbase — same asset, two markets, milliseconds of structural lag.
Volume-tier optimization: the fee schedule itself is the edge when you cross a tier boundary.

Every one of these harvests a structural quirk, not a prediction. We unpack the full taxonomy in "Winners Harvest Mechanical Structure, Not Predictions."
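As one illustration of how a structural quirk differs from a prediction, a lead-lag relationship can be checked without forecasting any outcome. The sketch below estimates the lag at which a follower series best tracks a leader's past returns. It is a toy diagnostic on evenly spaced samples, not the detection method used for the strategies above, and `best_lag` is a hypothetical helper.

```python
import numpy as np


def best_lag(leader_returns, follower_returns, max_lag=20):
    """Lag (in samples) at which the follower correlates best with the leader's past.

    Returns (lag, correlation). Both series are assumed to be sampled on
    the same evenly spaced clock.
    """
    leader = np.asarray(leader_returns, dtype=float)
    follower = np.asarray(follower_returns, dtype=float)
    best, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        a = leader[: len(leader) - lag]   # leader at time t - lag
        b = follower[lag:]                # follower at time t
        corr = np.corrcoef(a, b)[0, 1]
        if corr > best_corr:
            best, best_corr = lag, corr
    return best, best_corr
```

A stable, non-zero best lag is only a starting point: whether it is tradable still depends on fees and fills, which is what the validation battery is for.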

Implications

1. Stop trying to predict outcomes on calibrated markets.

If the market is calibrated (and most liquid prediction markets are within a few percent), the gross edge from predicting outcomes is about the size of the fees and frictions you pay to capture it. You cannot scale your way out: better features only get you closer to the calibration boundary, which still doesn't beat fees.

2. The "prediction accuracy" KPI is misleading.

Quant teams that report "our model predicts 78% of NFL game outcomes correctly" are reporting a number that has roughly zero correlation with P&L on a calibrated book. Brier score versus a no-skill baseline is closer to honest. P&L under realistic fills is the only number that matters.
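A Brier skill score makes the comparison explicit. The sketch below scores a model against a reference forecast; using the market price as the reference is our assumption here (a base rate would also work), and the function names are illustrative.

```python
import numpy as np


def brier(prob_yes, outcome):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    p = np.asarray(prob_yes, dtype=float)
    y = np.asarray(outcome, dtype=float)
    return float(np.mean((p - y) ** 2))


def brier_skill(model_prob, reference_prob, outcome):
    """1 - BS_model / BS_reference: positive only if the model beats the reference."""
    return 1.0 - brier(model_prob, outcome) / brier(reference_prob, outcome)


# Toy example: the model and the market both pick YES every time and are
# right 3 times out of 4, but the model is more confident than the market.
outcomes = [1, 1, 1, 0]
market = [0.75, 0.75, 0.75, 0.75]
model = [0.90, 0.90, 0.90, 0.90]
skill = brier_skill(model, market, outcomes)
```

In this toy case both forecasters have identical side-prediction accuracy, yet the skill score is negative (about −0.12) because the model is overconfident relative to a calibrated price. Accuracy alone cannot show that.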

3. Inverse RL is the right tool for "is there learnable edge here?"

Standard ML pipelines (train classifier, optimize threshold, deploy) conflate learning the right objective with learning the right policy. IRL forces you to separate them. If the IRL reward predicts correctly but the optimal policy under that reward still loses money, then the objective itself is unprofitable. No amount of better modeling fixes that. Behavior cloning would have shipped an 80%-accurate worthless strategy.

4. The negative result is the alpha.

Knowing that directional alpha is unlearnable on calibrated markets is a useful finding. It tells you not to spend your $5K experimental capital on directional bots, your compute on directional walk-forwards, or your research time on directional feature engineering. We re-allocated the entire fleet around structural, incentive, and oracle-lag classes after this result. Compute reclaimed: 142 directional personas demoted to shadow.

Caveats we're honest about

The 45K-market corpus is capped; adding the long tail of illiquid markets could surface niche directional pockets we missed. Two universes (Limitless, Polymarket-US) are forward-only because the historical corpus is contaminated or absent (see the corpus audit). Behavior cloning vs IRL is a methodological choice; some teams genuinely cannot tell which of the two their pipeline is doing, and that's a real risk.

Reproduce this

The capped 45K-market Kalshi and Polymarket corpora that underlie this experiment are part of the Empire Research Reproducibility Datasets. The averaged-perceptron IRL is ~120 lines of NumPy; we describe the exact feature set, the train/test split, and the embargo policy in the white paper. If you want us to run IRL on your strategy spec to check whether your "predictive edge" survives leak-clean reformulation, that's exactly what Validation-as-a-Service does.

Honest disclosure. These posts come from an internal trading-research program ($5K experimental capital, paper-mode preserved). Results are reported as measured; none of this is investment advice. Where a method or finding has caveats, we name them in-line — that is the whole point of this series.