How 5 of 6 Backtest Corpora Were Contaminated
A genuine-corpus audit across six prediction-market and equity universes uncovered the same handful of contamination patterns over and over. They are common, structural, and almost never caught by standard pipelines. Five concrete defenses.
The trigger
We had a Kalshi corpus showing roughly +$177K of paper P&L across the persona fleet. The numbers were too good. They survived purged cross-validation, they survived Deflated Sharpe, they survived embargo. Every gate said "ship it."
Two months later, after a series of failed forward-paper validations, we rebuilt the corpus from raw tape — and the same fleet showed a -$165K loss. A swing of roughly $343K, on the same code, against a regenerated "truth."
That delta is what contamination feels like. It does not look like contamination during the backtest. It looks like edge. The whole point of this post is the audit method that finally caught it — applied across all six of our universes, with results.
The six universes, audited
| Universe | Original corpus state | Audit result |
|---|---|---|
| Kalshi | 305K markets, multi-year | Contaminated — rebuilt from Becker 72.1M-trade dataset |
| Polymarket (intl) | ~190K markets | Contaminated — YES-token-only rebuild required |
| Limitless | ~36K markets | Constant-0.5 yes_price bug — rebuilt from genuine tape |
| Polymarket-US | ~empty | Bridge missing — forward-only |
| Alpaca (equities) | 382K minute-bars | Scorer mismatch — equities vs probabilities |
| Overtime (sports) | 127K on-chain tickets | Calibration-bucket mismatch — Thales crypto map |
Five of the six corpora were contaminated, and each of the five failed in a different way. The patterns repeat across the industry; we list them below in the order we now check.
The five contamination patterns
L1. Post-resolution price embedded in features
The most common, the most insidious, and the easiest to ship by accident. The corpus stores the full market price history. When the persona derives a feature like "5-minute realized volatility," nothing in the pipeline prevents the rolling window from including bars after resolution — which are, on a binary market, all pinned to {0, 1}. The feature now leaks the outcome with near-perfect SNR.
Detection: clip every per-market feature window at the market's resolution_ts − ε. Then re-run the persona on a "post-resolution price scrubbed" corpus and diff the P&L. Any persona whose P&L collapses by >30% was L1-leaking. This was the dominant pattern in the original Kalshi corpus.
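The clipping step can be sketched as follows. This is a minimal illustration, not the fleet's code: it assumes bars are (ts, price) tuples with epoch-second timestamps, and the names clip_window, realized_vol, and pnl_collapse are hypothetical. The 60-second default for eps is borrowed from the rebuild buffer described later in this post; the text does not fix ε for this test.

```python
import math

def clip_window(bars, resolution_ts, eps=60.0):
    """Keep only bars strictly before resolution_ts - eps.

    bars: list of (ts, price) tuples, ts in epoch seconds (assumed layout).
    """
    cutoff = resolution_ts - eps
    return [(ts, px) for ts, px in bars if ts < cutoff]

def realized_vol(bars):
    """Sample standard deviation of consecutive price changes (toy feature)."""
    diffs = [b[1] - a[1] for a, b in zip(bars, bars[1:])]
    if len(diffs) < 2:
        return 0.0
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1))

def pnl_collapse(pnl_original, pnl_scrubbed):
    """Fraction of the original P&L lost once post-resolution bars are scrubbed.

    Returns 0.0 when the original P&L was not positive (nothing to collapse).
    """
    if pnl_original <= 0:
        return 0.0
    return (pnl_original - pnl_scrubbed) / pnl_original

# A binary market that resolves YES at ts=180: later bars are pinned to 1.0.
bars = [(0, 0.52), (60, 0.55), (120, 0.53), (180, 1.0), (240, 1.0)]
clean = clip_window(bars, resolution_ts=180, eps=1.0)

print(realized_vol(bars), realized_vol(clean))  # the leaky window is far more "volatile"
print(pnl_collapse(100.0, 60.0) > 0.30)         # would be flagged as L1-leaking
```

The jump to 1.0 at resolution dominates the unclipped feature, which is exactly the signal a persona can learn to trade on.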
L2. Post-resolution price embedded in calibration
Subtler. The persona scores a signal, the system maps that signal to a probability via PAV isotonic. If the isotonic training set includes post-resolution price ticks for the same markets the persona is being scored on — leak. The persona's "calibrated P(YES)" is then trained on the answer.
Detection: fit isotonic on a strict pre-resolution-only slice, then evaluate on a hold-out where every market's full pre/post-resolution split was respected. If the Brier delta between the leaky-fit and clean-fit calibration exceeds 5%, the persona was L2-leaking.
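A rough sketch of that comparison, under stated assumptions: rows are dicts with hypothetical keys signal, outcome, ts, and resolution_ts; the isotonic fit is a small hand-written pool-adjacent-violators routine rather than any particular library; and the "5%" is interpreted as a relative Brier change measured against the clean fit, which the text does not spell out.

```python
def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit. Returns [(upper_score_bound, value), ...]."""
    blocks = []  # each block: [sum_of_outcomes, count, max_score_in_block]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            sum_y, n, s_max = blocks.pop()
            blocks[-1][0] += sum_y
            blocks[-1][1] += n
            blocks[-1][2] = s_max
    return [(s_max, sum_y / n) for sum_y, n, s_max in blocks]

def predict(model, score):
    """Step-function lookup: value of the first block covering `score`."""
    for s_max, value in model:
        if score <= s_max:
            return value
    return model[-1][1]

def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def l2_leak(train_rows, holdout_rows):
    """Relative Brier change between a leaky fit and a pre-resolution-only fit."""
    clean_rows = [r for r in train_rows if r["ts"] < r["resolution_ts"]]
    leaky = fit_isotonic([r["signal"] for r in train_rows],
                         [r["outcome"] for r in train_rows])
    clean = fit_isotonic([r["signal"] for r in clean_rows],
                         [r["outcome"] for r in clean_rows])
    ys = [r["outcome"] for r in holdout_rows]
    b_leaky = brier([predict(leaky, r["signal"]) for r in holdout_rows], ys)
    b_clean = brier([predict(clean, r["signal"]) for r in holdout_rows], ys)
    if b_clean == 0:
        return 0.0 if b_leaky == 0 else float("inf")
    return abs(b_leaky - b_clean) / b_clean
```

A persona would be flagged when l2_leak(train_rows, holdout_rows) exceeds 0.05, with holdout_rows restricted to pre-resolution observations.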
L3. Full-sample fits — the "I tuned hyperparameters on the test set" problem with a more academic name
Any model component that touches the entire dataset before train/test partitioning is full-sample-fit. Common offenders: feature normalization (mean/std computed on the full series), regime detector (clusters fit on full series), threshold optimizer (best-by-grid on full series). Each of these looks like preprocessing. Each leaks future information into past trades.
Detection: combinatorial purged cross-validation with embargo (López de Prado, 2018). Any preprocessing that requires a fit must be re-fit inside each CV fold's training partition only. The fleet's CV implementation purges every market whose duration overlaps a test fold and embargoes a configurable buffer on both sides.
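To make the purge and embargo concrete, here is a minimal sketch of a combinatorial purged split. It is an illustration under assumptions, not the fleet's implementation: markets are (market_id, start_ts, end_ts) tuples, time is cut into equal-width groups, a market belongs to a test fold if it starts inside a test group, and any other market that touches a test group widened by the embargo is dropped. The function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def cpcv_splits(markets, n_groups, k_test, embargo):
    """Yield (train_ids, test_ids) for every choice of k_test groups out of n_groups."""
    t_min = min(start for _, start, _ in markets)
    t_max = max(end for _, _, end in markets)
    width = (t_max - t_min) / n_groups
    windows = [(t_min + i * width, t_min + (i + 1) * width) for i in range(n_groups)]
    for combo in combinations(range(n_groups), k_test):
        train, test = [], []
        for market_id, start, end in markets:
            # Test membership: the market starts inside one of the test windows
            # (the final window also includes its right edge).
            in_test = any(
                windows[i][0] <= start < windows[i][1]
                or (i == n_groups - 1 and start == windows[i][1])
                for i in combo
            )
            # Purge + embargo: the market's lifetime overlaps a widened test window.
            touches_test = any(
                start <= windows[i][1] + embargo and windows[i][0] - embargo <= end
                for i in combo
            )
            if in_test:
                test.append(market_id)
            elif not touches_test:
                train.append(market_id)
            # Otherwise the market is purged from this split entirely.
        yield train, test

def fit_zscore(train_values):
    """Fit normalization on the training partition only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return mu, (sigma if sigma > 0 else 1.0)

def apply_zscore(x, mu, sigma):
    """Apply statistics fitted on the training partition to any value."""
    return (x - mu) / sigma

markets = [("a", 0, 10), ("b", 20, 30), ("c", 28, 45),
           ("d", 50, 60), ("e", 70, 80), ("f", 90, 100)]
splits = list(cpcv_splits(markets, n_groups=5, k_test=1, embargo=5))
```

In the third split the test window is [40, 60): market d is tested, and market c is purged because its lifetime runs into the embargoed window. The point of fit_zscore is that it is called once per split on training values only, never on the full series.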
L4. Snapshot misalignment
The corpus has a market table (one row per market) and a price-history table (many rows per market). If the market table's "outcome" column was populated at scrape time, but the price-history table is the historical record, then for any market that resolved after the scrape but before the backtest run, the outcome was wrong. Backtests built on this corpus look genuine but are silently labelled wrong on a sliding window of recent markets.
Detection: every market row must carry a scrape_ts AND a resolution_ts. The backtest must reject any market where resolution_ts > scrape_ts. We caught this in the Polymarket intl corpus, where it was responsible for a meaningful slice of the false positives.
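The rejection rule is short enough to show in full. A minimal sketch, assuming market rows are dicts; the function name is hypothetical, and treating a missing timestamp as a rejection is our reading of "must carry" rather than something the text states outright.

```python
def split_by_label_validity(markets):
    """Separate markets whose outcome label can be trusted from those that cannot.

    A market is rejected when either timestamp is missing, or when it
    resolved after the scrape (its outcome column predates the resolution).
    """
    accepted, rejected = [], []
    for m in markets:
        scrape_ts = m.get("scrape_ts")
        resolution_ts = m.get("resolution_ts")
        if scrape_ts is None or resolution_ts is None or resolution_ts > scrape_ts:
            rejected.append(m)
        else:
            accepted.append(m)
    return accepted, rejected

markets = [
    {"market_id": "m1", "scrape_ts": 1000, "resolution_ts": 900},   # label is valid
    {"market_id": "m2", "scrape_ts": 1000, "resolution_ts": 1100},  # resolved after scrape
    {"market_id": "m3", "scrape_ts": 1000, "resolution_ts": None},  # provenance missing
]
accepted, rejected = split_by_label_validity(markets)
```

Keeping the rejected list rather than silently dropping rows makes it easy to report how much of the recent tape was affected.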
L5. Truncated-tape leakage
Documented in zostaff/ai-quant-researcher: if the tape was truncated at the time of scrape, every market that resolved after the truncation point looks "open" in the corpus. Backtests that simulate trading "open" markets get phantom fills against ghost liquidity that, in reality, never existed.
Detection: count the volume of fills against markets whose state is "open" at the latest tape ts but whose resolution timestamp is in the past. Any non-zero count is L5.
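A minimal sketch of that count, assuming fills and markets are lists of dicts with the hypothetical keys shown, that resolution_ts comes from a source independent of the truncated tape, and that as_of_ts is the time the audit runs.

```python
def ghost_fills(fills, markets, as_of_ts):
    """Count simulated fills against markets that look open but have already resolved.

    Returns (number_of_fills, total_size). Any non-zero result indicates L5.
    """
    stale_open = {
        m["market_id"]
        for m in markets
        if m["state"] == "open"
        and m.get("resolution_ts") is not None
        and m["resolution_ts"] < as_of_ts
    }
    hits = [f for f in fills if f["market_id"] in stale_open]
    return len(hits), sum(f["size"] for f in hits)

markets = [
    {"market_id": "m1", "state": "open", "resolution_ts": 100},      # stale: resolved already
    {"market_id": "m2", "state": "open", "resolution_ts": None},     # genuinely open
    {"market_id": "m3", "state": "resolved", "resolution_ts": 50},
]
fills = [
    {"market_id": "m1", "size": 10},
    {"market_id": "m1", "size": 5},
    {"market_id": "m2", "size": 7},
    {"market_id": "m3", "size": 3},
]
count, volume = ghost_fills(fills, markets, as_of_ts=200)
```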
The diagnostic we shipped: F1214
Each of the five patterns above is one detector. The full suite runs every nightly walk-forward and writes a per-persona report. The pass criterion is:
contamination_score = max(L1_leak, L2_leak, L3_leak, L4_leak, L5_leak)
ship_gate = (contamination_score < 0.05)
Where each Lk_leak is the P&L delta between the contaminated and clean corpus, normalized.
The gate is hard. A persona that drops >5% on any single contamination test is not eligible for live trading.
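As a worked illustration of the criterion, the following sketch takes the five normalized leak values as given. How each delta is normalized is not specified here, so the inputs below are made-up numbers and the function names are hypothetical.

```python
REQUIRED_TESTS = ("L1", "L2", "L3", "L4", "L5")

def contamination_score(leaks):
    """Worst-case leak across the five detectors."""
    missing = [k for k in REQUIRED_TESTS if k not in leaks]
    if missing:
        raise ValueError(f"missing contamination tests: {missing}")
    return max(leaks[k] for k in REQUIRED_TESTS)

def ship_gate(leaks, threshold=0.05):
    """True only if every detector's normalized leak is under the threshold."""
    return contamination_score(leaks) < threshold

# Illustrative values only: one detector (L4) exceeds the 5% limit.
leaks = {"L1": 0.01, "L2": 0.02, "L3": 0.0, "L4": 0.07, "L5": 0.0}
print(contamination_score(leaks), ship_gate(leaks))
```

Because the score is a max rather than an average, a single failing detector blocks the persona no matter how clean the other four are.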
Of the 43 strategies that survived purged-CV on the rebuilt corpus, 0 passed the full F1224 26-method battery — but they all passed F1214 contamination. That distinction matters: F1214 is necessary but not sufficient. Clean corpus → still no guarantee of profitable execution.
The rebuild method
For Kalshi specifically, we rebuilt from Jon Becker's 72.1M-trade dataset (publicly available, ~36 GiB compressed). The pipeline:
- Strip post-resolution ticks. Each market's price history is truncated at resolution_ts with a 60-second buffer.
- Mark provenance. Every row carries scrape_ts, resolution_ts, and a genuine_flag that is set only if both timestamps are present and the price history terminates at-or-before resolution_ts.
- Reject heuristic. Drop any market whose orderbook history has a gap longer than the median trade interval × 10; these are likely partial scrapes.
- Spot check. 100 random markets are hand-audited against the original source for outcome correctness.
- Lock the corpus. Once accepted, the corpus is read-only and SHA-256-fingerprinted. Walk-forward runs verify the fingerprint before execution.
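Two of the steps above, the gap heuristic and the fingerprint lock, can be sketched with the standard library. This is an illustration under assumptions: timestamps are numeric, the corpus is a single file on disk, and the function names are hypothetical rather than taken from the actual pipeline.

```python
import hashlib
import statistics

def has_suspicious_gap(trade_ts, factor=10):
    """True if the largest gap between trades exceeds factor x the median gap."""
    ts = sorted(trade_ts)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return False  # fewer than two trades: nothing to compare
    return max(gaps) > statistics.median(gaps) * factor

def fingerprint_chunks(chunks):
    """SHA-256 hex digest over an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def fingerprint_file(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks so large corpora fit in memory."""
    with open(path, "rb") as f:
        return fingerprint_chunks(iter(lambda: f.read(chunk_size), b""))

def verify_corpus(path, expected_hex):
    """Raise before a walk-forward run if the corpus no longer matches its fingerprint."""
    actual = fingerprint_file(path)
    if actual != expected_hex:
        raise RuntimeError(f"corpus fingerprint mismatch: {actual} != {expected_hex}")

print(has_suspicious_gap([0, 1, 2, 3, 100]))  # one 97-unit gap against a median of 1
```

Hashing in chunks matters at this scale: a multi-gigabyte corpus should not have to be loaded whole just to be verified.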
Limitless required a different fix entirely — the original corpus had a constant 0.5 yes_price column from a downstream join bug. We pulled fresh from the Limitless API and replayed.
What changed after the audit
The Kalshi fleet's projected go-live P&L on the rebuilt corpus is +$12,814 from 33 strategies that survive purged-CV, versus the +$177K the contaminated corpus had reported. That delta is real money the system would have lost trying to ship the original "winners." The contamination wasn't a 10% haircut; it was a sign flip.
We packaged the rebuilt corpora + the F1214 contamination test suite as the Empire Research Reproducibility Datasets. Buyers get the genuine tape, the contamination test code, and a fingerprint file that lets them verify the corpus is the one we audited.
If you have a backtest, run these five tests
- L1: truncate every per-market price window at resolution_ts − ε. Re-run. Diff P&L.
- L2: re-fit your calibration on pre-resolution-only data. Re-score. Diff Brier.
- L3: combinatorial purged CV with embargo, every preprocessing step inside the fold.
- L4: check that resolution_ts < scrape_ts on every market your backtest touches.
- L5: count fills against markets whose state is "open" at tape-end but whose resolution is in the past.
If any of those diffs exceed 5%, the backtest is contaminated. The fix is corpus discipline, not feature engineering. If you want us to run the audit on your corpus, that's the entry tier of Validation-as-a-Service.
Honest disclosure. These posts come from an internal trading-research program ($5K experimental capital, paper-mode preserved). Results are reported as measured; none of this is investment advice. Where a method or finding has caveats, we name them in-line — that is the whole point of this series.