Methodology May 18, 2026 10 min read

How 5 of 6 Backtest Corpora Were Contaminated

A genuine-corpus audit across six prediction-market and equity universes uncovered the same handful of contamination patterns over and over. They are common, structural, and almost never caught by standard pipelines. This post describes the five patterns and a concrete defense for each.

The trigger

We had a Kalshi corpus showing roughly +$177K of paper P&L across the persona fleet. The numbers were too good. They survived purged cross-validation, they survived Deflated Sharpe, they survived embargo. Every gate said "ship it."

Two months later, after a series of failed forward-paper validations, we rebuilt the corpus from raw tape, and the same fleet showed a $165K loss. That is a swing of roughly $343K, on the same code, against a regenerated "truth."

That delta is what contamination feels like. It does not look like contamination during the backtest. It looks like edge. The whole point of this post is the audit method that finally caught it — applied across all six of our universes, with results.

The six universes, audited

Universe	Original corpus state	Audit result
Kalshi	305K markets, multi-year	Contaminated — rebuilt from Becker 72.1M-trade dataset
Polymarket (intl)	~190K markets	Contaminated — YES-token-only rebuild required
Limitless	~36K markets	Constant-0.5 yes_price bug — rebuilt from genuine tape
Polymarket-US	~empty	Bridge missing — forward-only
Alpaca (equities)	382K minute-bars	Scorer mismatch — equities vs probabilities
Overtime (sports)	127K on-chain tickets	Calibration-bucket mismatch — Thales crypto map

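Five of the six corpora failed the audit, each with a different contamination class; the sixth (Polymarket-US) was effectively empty and can only be validated forward. The patterns repeat across the industry; we list them below in the order we now check.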

The five contamination patterns

L1. Post-resolution price embedded in features

The most common, the most insidious, and the easiest to ship by accident. The corpus stores the full market price history. When the persona derives a feature like "5-minute realized volatility," nothing in the pipeline prevents the rolling window from including bars after resolution — which are, on a binary market, all pinned to {0, 1}. The feature now leaks the outcome with near-perfect SNR.

Detection: clip every per-market feature window at the market's resolution_ts − ε. Then re-run the persona on a "post-resolution price scrubbed" corpus and diff the P&L. Any persona whose P&L collapses by >30% was L1-leaking. This was the dominant pattern in the original Kalshi corpus.
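A minimal sketch of the L1 check, assuming each bar carries a timestamp and a price. The names Bar, clip_pre_resolution, pnl_collapse and the default epsilon are hypothetical and chosen for illustration; only the 30% threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    ts: float     # bar timestamp, seconds since epoch (assumed unit)
    price: float  # YES price in [0, 1]


def clip_pre_resolution(bars, resolution_ts, epsilon=1.0):
    """Keep only bars strictly before resolution_ts - epsilon."""
    cutoff = resolution_ts - epsilon
    return [b for b in bars if b.ts < cutoff]


def pnl_collapse(pnl_original, pnl_scrubbed):
    """Fractional P&L drop after scrubbing; 0.0 if the original P&L was not positive."""
    if pnl_original <= 0:
        return 0.0
    return (pnl_original - pnl_scrubbed) / pnl_original


# Toy market that resolves YES at ts=200: the last two bars are pinned to 1.0.
bars = [Bar(100.0, 0.62), Bar(160.0, 0.64), Bar(220.0, 1.0), Bar(280.0, 1.0)]
clean = clip_pre_resolution(bars, resolution_ts=200.0)

# Illustrative P&L numbers only: a 45% collapse is above the 30% threshold.
flagged = pnl_collapse(1000.0, 550.0) > 0.30
```

Features are then computed on clean rather than bars, and the persona is re-run to obtain the scrubbed P&L.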

L2. Post-resolution price embedded in calibration

Subtler. The persona scores a signal, the system maps that signal to a probability via PAV isotonic. If the isotonic training set includes post-resolution price ticks for the same markets the persona is being scored on — leak. The persona's "calibrated P(YES)" is then trained on the answer.

Detection: fit isotonic on a strict pre-resolution-only slice, then evaluate on a hold-out where every market's full pre/post-resolution split was respected. If the Brier delta between the leaky-fit and clean-fit calibration exceeds 5%, the persona was L2-leaking.
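The comparison can be sketched as follows. This assumes the 5% figure is a relative Brier delta measured against the clean fit, which the text does not specify; the calibrator itself is left out, and pre_resolution_only, brier, and relative_brier_delta are hypothetical helper names.

```python
def pre_resolution_only(samples):
    """Keep calibration samples observed strictly before their market resolved."""
    return [s for s in samples if s["ts"] < s["resolution_ts"]]


def brier(probs, outcomes):
    """Mean squared error between predicted P(YES) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def relative_brier_delta(leaky_probs, clean_probs, outcomes):
    """How much better the leaky calibration looks, relative to the clean one."""
    b_clean = brier(clean_probs, outcomes)
    if b_clean == 0:
        return 0.0
    return (b_clean - brier(leaky_probs, outcomes)) / b_clean


samples = [
    {"ts": 90.0, "resolution_ts": 100.0, "signal": 0.4, "outcome": 1},
    {"ts": 110.0, "resolution_ts": 100.0, "signal": 0.9, "outcome": 1},  # post-resolution tick
]
train = pre_resolution_only(samples)

# Toy probabilities: the leaky fit looks far sharper than the clean fit.
outcomes = [1, 0, 1, 0]
delta = relative_brier_delta([0.9, 0.1, 0.9, 0.1], [0.7, 0.3, 0.6, 0.4], outcomes)
l2_leaking = delta > 0.05
```

In practice the two probability vectors would come from two isotonic fits, one trained on samples and one trained on train, both evaluated on the same clean hold-out.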

L3. Full-sample fits — the "I tuned hyperparameters on the test set" problem with a more academic name

Any model component that touches the entire dataset before train/test partitioning is full-sample-fit. Common offenders: feature normalization (mean/std computed on the full series), regime detector (clusters fit on full series), threshold optimizer (best-by-grid on full series). Each of these looks like preprocessing. Each leaks future information into past trades.

Detection: combinatorial purged cross-validation with embargo (López de Prado, 2018). Any preprocessing that requires a fit must be re-fit inside each CV fold's training partition only. The fleet's CV implementation purges every market whose duration overlaps a test fold and embargoes a configurable buffer on both sides.
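To make the purge and the fold-local fit concrete, here is a sketch of a single fold rather than the full combinatorial scheme. It assumes each market is an (open_ts, close_ts) interval and uses feature normalization as the example fit; purged_train_ids, fit_scaler, and apply_scaler are hypothetical names, not the fleet's actual implementation.

```python
from statistics import mean, pstdev


def purged_train_ids(markets, test_start, test_end, embargo):
    """Drop every market whose lifetime overlaps the test window widened by the embargo."""
    lo, hi = test_start - embargo, test_end + embargo
    return [m for m, (open_ts, close_ts) in markets.items()
            if close_ts < lo or open_ts > hi]


def fit_scaler(values):
    """Fit mean/std on training values only; fall back to 1.0 for zero variance."""
    return mean(values), (pstdev(values) or 1.0)


def apply_scaler(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


markets = {"a": (0, 10), "b": (8, 22), "c": (20, 30), "d": (40, 50), "e": (36, 45)}
feature = {"a": 1.0, "b": 9.0, "c": 3.0, "d": 3.0, "e": 5.0}

# Test fold covers [20, 30]; "b" and "c" overlap it and are purged.
train_ids = purged_train_ids(markets, test_start=20, test_end=30, embargo=5)
scaler = fit_scaler([feature[m] for m in train_ids])   # fit inside the fold
test_scaled = apply_scaler([feature["c"]], scaler)     # applied, never re-fit, on test
```

The same pattern applies to regime detectors and threshold optimizers: anything with a fit step is fitted on train_ids only and merely applied to the test fold.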

L4. Snapshot misalignment

The corpus has a market table (one row per market) and a price-history table (many rows per market). If the market table's "outcome" column was populated at scrape time, but the price-history table is the historical record, then for any market that resolved after the scrape but before the backtest run, the stored outcome reflects the scrape-time state rather than the actual resolution. Backtests built on this corpus look genuine but are silently mislabelled on a sliding window of recent markets.

Detection: every market row must carry a scrape_ts AND a resolution_ts. The backtest must reject any market where resolution_ts > scrape_ts. We caught this in the Polymarket intl corpus — it was responsible for a meaningful slice of the false positives.
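The rejection rule is small enough to show in full. This sketch assumes market rows are dictionaries with scrape_ts and resolution_ts keys; split_by_label_validity is a hypothetical name, and treating a missing timestamp as a rejection is our assumption.

```python
def split_by_label_validity(markets):
    """Separate markets whose outcome label can be trusted from those that cannot."""
    trusted, rejected = [], []
    for m in markets:
        scrape_ts = m.get("scrape_ts")
        resolution_ts = m.get("resolution_ts")
        # Missing provenance, or resolution after the scrape: the label is not reliable.
        if scrape_ts is None or resolution_ts is None or resolution_ts > scrape_ts:
            rejected.append(m)
        else:
            trusted.append(m)
    return trusted, rejected


rows = [
    {"id": "m1", "scrape_ts": 1000.0, "resolution_ts": 900.0},   # resolved before scrape
    {"id": "m2", "scrape_ts": 1000.0, "resolution_ts": 1100.0},  # resolved after scrape
    {"id": "m3", "scrape_ts": 1000.0},                           # no resolution_ts at all
]
trusted, rejected = split_by_label_validity(rows)
```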

L5. Truncated-tape leakage

Documented in zostaff/ai-quant-researcher: if the tape was truncated at the time of scrape, every market that resolved after the truncation point looks "open" in the corpus. Backtests that simulate trading "open" markets get phantom fills against ghost liquidity that, in reality, never existed.

Detection: count the volume of fills against markets whose state is "open" at the latest tape ts but whose resolution timestamp is in the past. Any non-zero count is L5.
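A sketch of the count, assuming resolution timestamps come from an independent, up-to-date source rather than from the truncated tape itself. The field names, the as_of_ts parameter, and phantom_fill_volume are hypothetical.

```python
def phantom_fill_volume(fills, markets, as_of_ts):
    """Total simulated fill volume against markets that look open on the tape
    but had in fact already resolved by as_of_ts."""
    total = 0.0
    for market_id, volume in fills:
        m = markets[market_id]
        resolution_ts = m["resolution_ts"]
        if (m["state_at_tape_end"] == "open"
                and resolution_ts is not None
                and resolution_ts <= as_of_ts):
            total += volume
    return total


markets = {
    "m1": {"state_at_tape_end": "open", "resolution_ts": 150.0},     # resolved after truncation
    "m2": {"state_at_tape_end": "open", "resolution_ts": None},      # genuinely still open
    "m3": {"state_at_tape_end": "resolved", "resolution_ts": 90.0},  # correctly recorded
}
fills = [("m1", 25.0), ("m2", 10.0), ("m3", 5.0), ("m1", 5.0)]

phantom = phantom_fill_volume(fills, markets, as_of_ts=200.0)
l5_detected = phantom > 0
```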

The diagnostic we shipped: F1214

Each of the five patterns above is one detector. The full suite runs every nightly walk-forward and writes a per-persona report. The pass criterion is:

contamination_score = max(L1_leak, L2_leak, L3_leak, L4_leak, L5_leak)
ship_gate = (contamination_score < 0.05)

Where each Lk_leak is the P&L delta between the contaminated and clean corpus, normalized. The gate is hard. A persona that drops >5% on any single contamination test is not eligible for live trading.
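The gate can be written out as a small function. The normalization is not specified above, so this sketch assumes the delta is divided by the absolute contaminated P&L; normalized_leak and contamination_gate are hypothetical names, and the leak values are made up for illustration.

```python
def normalized_leak(pnl_contaminated, pnl_clean):
    """Assumed normalization: P&L delta as a fraction of the contaminated P&L."""
    if pnl_contaminated == 0:
        return 0.0
    return (pnl_contaminated - pnl_clean) / abs(pnl_contaminated)


def contamination_gate(leaks, threshold=0.05):
    """Score a persona by its worst detector and apply the hard ship gate."""
    worst = max(leaks, key=leaks.get)
    score = leaks[worst]
    return {"score": score, "worst_detector": worst, "ship": score < threshold}


leaks = {
    "L1_leak": normalized_leak(100.0, 97.0),  # 3% drop on the scrubbed corpus
    "L2_leak": 0.02,
    "L3_leak": 0.00,
    "L4_leak": 0.07,
    "L5_leak": 0.00,
}
report = contamination_gate(leaks)
```

Taking the maximum rather than an average means one failing detector blocks the persona, which matches the "any single contamination test" wording.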

Of the 43 strategies that survived purged-CV on the rebuilt corpus, none passed the full F1224 26-method battery, although all of them passed the F1214 contamination suite. That distinction matters: F1214 is necessary but not sufficient. A clean corpus is still no guarantee of profitable execution.

Why these are not "junior mistakes"

Every contamination pattern above appears in production code from teams of competent quants. The reason is structural: each pattern emerges from a useful abstraction (cache the price history, normalize features, fit a calibration). The leakage is the cost of the abstraction. Catching it requires a deliberate, mechanical test against a regenerated corpus; there is no "just be careful" version that works.

The rebuild method

For Kalshi specifically, we rebuilt from Jon Becker's 72.1M-trade dataset (publicly available, ~36 GiB compressed). The pipeline:

Strip post-resolution ticks. Each market's price history is truncated at resolution_ts with a 60-second buffer.
Mark provenance. Every row carries scrape_ts, resolution_ts, and a genuine_flag that is set only if both timestamps are present and the price history terminates at-or-before resolution_ts.
Reject heuristic. Drop any market whose orderbook history has a gap longer than the median trade interval × 10 — these are likely partial scrapes.
Spot check. 100 random markets are hand-audited against the original source for outcome correctness.
Lock the corpus. Once accepted, the corpus is read-only and SHA-256-fingerprinted. Walk-forward runs verify the fingerprint before execution.
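Two of the steps above are mechanical enough to sketch: the gap-based reject heuristic and the fingerprint check. This assumes the median interval is computed per market and that the corpus can be read as bytes; has_suspicious_gap, fingerprint, and verify are hypothetical names.

```python
import hashlib
from statistics import median


def has_suspicious_gap(trade_ts, factor=10):
    """True if any gap between consecutive trades exceeds factor x the median gap."""
    gaps = [b - a for a, b in zip(trade_ts, trade_ts[1:])]
    if not gaps:
        return False
    return max(gaps) > factor * median(gaps)


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the locked corpus bytes."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """Run before each walk-forward: refuse to proceed if the corpus changed."""
    return fingerprint(data) == expected


partial_scrape = has_suspicious_gap([0, 1, 2, 3, 50])  # one 47s gap vs a 1s median
healthy = has_suspicious_gap([0, 1, 2, 3, 4])

locked = fingerprint(b"corpus-bytes")
unchanged = verify(b"corpus-bytes", locked)
tampered = verify(b"corpus-bytes-edited", locked)
```

For a multi-gigabyte corpus the digest would be built incrementally with hashlib's update method over file chunks instead of hashing one bytes object.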

Limitless required a different fix entirely — the original corpus had a constant 0.5 yes_price column from a downstream join bug. We pulled fresh from the Limitless API and replayed.

What changed after the audit

The Kalshi fleet's projected go-live P&L on the rebuilt corpus is +$12,814 from 33 strategies that survive purged-CV, versus the +$177K the contaminated corpus had reported. That delta is real money the system would have lost trying to ship the original "winners." The contamination wasn't a 10% haircut; it was a sign flip.

We packaged the rebuilt corpora + the F1214 contamination test suite as the Empire Research Reproducibility Datasets. Buyers get the genuine tape, the contamination test code, and a fingerprint file that lets them verify the corpus is the one we audited.

If you have a backtest, run these five tests

L1: truncate every per-market price window at resolution_ts − ε. Re-run. Diff P&L.
L2: re-fit your calibration on pre-resolution-only data. Re-score. Diff Brier.
L3: combinatorial purged CV with embargo, every preprocessing step inside the fold.
L4: check that resolution_ts < scrape_ts on every market your backtest touches.
L5: count fills against markets whose state is "open" at tape-end but whose resolution is in the past.

If any of those diffs exceed 5%, the backtest is contaminated. The fix is corpus discipline, not feature engineering. If you want us to run the audit on your corpus, that's the entry tier of Validation-as-a-Service.

Honest disclosure. These posts come from an internal trading-research program ($5K experimental capital, paper-mode preserved). Results are reported as measured; none of this is investment advice. Where a method or finding has caveats, we name them in-line — that is the whole point of this series.