The 26-Method Validation Battery Most Quants Skip
A five-family validation stack: statistical robustness, look-ahead audits, execution realism, regime stratification, forward-paper. Of 43 strategies that passed our purged-CV gate, zero passed the full battery. The methods, the order, and why each one matters.
The premise
A backtest that "works" is necessary but nowhere near sufficient. Strategies pass naive backtests for dozens of reasons that have nothing to do with edge. Over-fitting. Leak. Capacity mis-estimation. Regime selection. Survivorship. The methods below are an attempt to systematically falsify each of those before live capital ever touches the strategy.
The battery is organized into five families, applied in order. A strategy must pass family k before being eligible for family k+1. This is not a checklist; it's a sieve. Each family kills strategies the previous family let through.
Of the 43 strategies in our proven-positive purged-CV set, zero pass the full battery. Five are clean on the first four families and are currently pending the 14-day forward-paper SPAN gate. That number — zero out of 43 — is the most honest summary of where retail quant research actually is.
Family A — Statistical robustness (5 methods)
This family asks: is the historical P&L distinguishable from random?
A1. Purged combinatorial cross-validation with embargo
The gold standard for time-series CV (López de Prado, 2018). All preprocessing — including normalization, calibration fits, regime detection — is performed inside each training fold only. Markets whose duration overlaps a test fold are purged from training. A configurable embargo buffer is applied on both sides. Run for k folds; require the persona to be net-positive in >⌈k×0.6⌉ folds.
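A minimal sketch of the purge-and-embargo step, assuming each market is a hypothetical `(market_id, open_ts, resolution_ts)` tuple and that folds are contiguous in open time. It shows plain purged k-fold rather than the full combinatorial variant, and the function name and fold layout are illustrative rather than the battery's actual implementation.

```python
from typing import Iterator, List, Tuple

Market = Tuple[str, float, float]  # (market_id, open_ts, resolution_ts), hypothetical layout

def purged_folds(markets: List[Market], k: int,
                 embargo: float) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids) for k contiguous time folds."""
    ordered = sorted(markets, key=lambda m: m[1])
    n = len(ordered)
    for i in range(k):
        test = ordered[i * n // k:(i + 1) * n // k]
        if not test:
            continue
        # Test window, widened by the embargo on both sides.
        lo = min(m[1] for m in test) - embargo
        hi = max(m[2] for m in test) + embargo
        test_ids = {m[0] for m in test}
        # Purge: keep only markets that resolved before the window
        # or opened after it.
        train_ids = [m[0] for m in ordered
                     if m[0] not in test_ids and (m[2] < lo or m[1] > hi)]
        yield train_ids, [m[0] for m in test]
```

Any normalization, calibration fit, or regime detection would then be fitted on `train_ids` only, once per fold.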
A2. Deflated Sharpe ratio
Bailey & López de Prado's correction for multiple-testing inflation. With N tested strategies and a maximum observed Sharpe of S_max, the deflated Sharpe discounts S_max by the Sharpe that the best of N zero-edge trials would be expected to show by chance. We require DSR > 0 with the multiplicity count set to the total number of personas tested in the fleet (~2,400), not just the survivors. This is brutal but correct.
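A minimal sketch of the two quantities behind the deflated Sharpe, following the published Bailey and López de Prado formulas. The first is the Sharpe expected from the best of N zero-edge trials. The second is the probability that the observed Sharpe exceeds it after adjusting for sample length, skew, and kurtosis. Sharpe values are per-period (not annualized), `sharpe_std` is the standard deviation of Sharpe estimates across the tested trials, and the function names are illustrative. Because the probability form is always positive, reading the "DSR > 0" requirement as the simple excess `s_max - sr0` is an assumption here, not something the text states.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Expected maximum Sharpe across n_trials (> 1) independent zero-edge trials."""
    nd = NormalDist()
    a = nd.inv_cdf(1.0 - 1.0 / n_trials)
    b = nd.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sr: float, sr0: float, n_obs: int,
                    skew: float = 0.0, kurt: float = 3.0) -> float:
    """Probability that the true Sharpe exceeds the benchmark sr0.

    kurt is the raw (non-excess) kurtosis, so 3.0 means normal returns.
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)
```

With the full fleet as the multiplicity count, `expected_max_sharpe(2400, sharpe_std)` comes out at roughly 3.5 times `sharpe_std`, which is the bar the best observed Sharpe has to clear.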
A3. White's reality check / SPA test
Bootstraps the Sharpe under a stationary block-bootstrap null. The strategy must outperform the null at p < 0.05. Implemented with 5,000 bootstrap replicates, block length ≈ √n.
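A minimal sketch of a stationary-bootstrap p-value for a single strategy, assuming a list of per-period returns. Blocks have geometrically distributed lengths with mean about √n and wrap around the end of the series, and the null is imposed by re-centering returns to zero mean. The full reality check takes the maximum statistic across all candidate strategies on each resample; that step is omitted here, and the function names are illustrative.

```python
import math
import random
import statistics

def sharpe(xs):
    sd = statistics.pstdev(xs)
    return statistics.fmean(xs) / sd if sd > 0 else 0.0

def stationary_bootstrap_pvalue(returns, n_boot=5000, seed=0):
    """One-sided p-value for Sharpe > 0 under a zero-mean null."""
    rng = random.Random(seed)
    n = len(returns)
    p_restart = 1.0 / max(1.0, math.sqrt(n))  # mean block length ~ sqrt(n)
    mu = statistics.fmean(returns)
    centred = [r - mu for r in returns]       # impose the zero-mean null
    observed = sharpe(returns)
    hits = 0
    for _ in range(n_boot):
        idx = rng.randrange(n)
        sample = []
        for _ in range(n):
            sample.append(centred[idx])
            # Start a new block with probability p_restart, else continue
            # the current block, wrapping at the end of the series.
            if rng.random() < p_restart:
                idx = rng.randrange(n)
            else:
                idx = (idx + 1) % n
        if sharpe(sample) >= observed:
            hits += 1
    return (hits + 1) / (n_boot + 1)
```

The `+ 1` in numerator and denominator keeps the estimate strictly positive, so a strategy can never report p = 0 from a finite number of replicates.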
A4. Probability of Backtest Overfitting (PBO)
The fraction of combinatorial CV folds in which the strategy ranks below median out-of-sample given that it ranked above median in-sample. PBO > 0.5 implies the strategy is more likely to disappoint than persist. We require PBO < 0.3.
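A minimal sketch of the combinatorially symmetric cross-validation tally from Bailey et al., assuming a hypothetical matrix `block_perf[s][b]` holding the performance of strategy `s` in time block `b`. On each split it scores the configuration that ranked best in-sample, which is the published definition; a per-strategy conditional tally like the one described above would need a different loop.

```python
from itertools import combinations
from statistics import fmean

def pbo(block_perf):
    """Fraction of splits where the in-sample winner lands at or below the OOS median."""
    n_strat = len(block_perf)
    n_blocks = len(block_perf[0])
    below = total = 0
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        oos_blocks = [b for b in range(n_blocks) if b not in is_blocks]
        is_score = [fmean([row[b] for b in is_blocks]) for row in block_perf]
        oos_score = [fmean([row[b] for b in oos_blocks]) for row in block_perf]
        best = max(range(n_strat), key=is_score.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, in (0, 1).
        rank = sum(s <= oos_score[best] for s in oos_score) / (n_strat + 1)
        if rank <= 0.5:
            below += 1
        total += 1
    return below / total
```

A strategy that dominates in every block gives a PBO of 0, while a set of strategies that each shine in a single block gives a PBO of 1.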
A5. Synthetic null markets
Generate calibrated random markets that match the real market's price-process moments. Run the strategy on both. If the strategy is profitable on synthetic markets, the "edge" is an artifact of the price process, not of the market's mispricing. We've seen strategies that pass A1–A4 fail A5 — the persona was just betting that markets revert to their starting price, which calibrated random markets also do.
Family B — Look-ahead audits (6 methods)
This family asks: does the strategy use information from the future?
B1. Post-resolution price scrubbing (L1)
Truncate every per-market price window at resolution_ts − ε. Re-run. P&L delta > 30% = fail.
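A minimal sketch of the scrub and the pass/fail comparison, assuming each market's window is a list of hypothetical `(ts, price)` ticks. Whether the cutoff is inclusive, and how the P&L delta is normalized, are assumptions of this sketch.

```python
def scrub_post_resolution(ticks, resolution_ts, eps=1.0):
    """Keep only (ts, price) ticks strictly before resolution_ts - eps."""
    cutoff = resolution_ts - eps
    return [(ts, px) for ts, px in ticks if ts < cutoff]

def pnl_delta_fails(pnl_before, pnl_after, threshold=0.30):
    """True if the relative P&L change after scrubbing exceeds the threshold."""
    if pnl_before == 0:
        return pnl_after != 0
    return abs(pnl_after - pnl_before) / abs(pnl_before) > threshold
```

The backtest is run once on the raw windows and once on the scrubbed ones, and the two P&L figures are passed to `pnl_delta_fails`.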
B2. Calibration leak (L2)
Re-fit calibration on pre-resolution-only data. Re-score. Brier delta > 5% = fail.
B3. Full-sample fit detector (L3)
Walk every preprocessing call. Any fit on data outside the training fold = fail.
B4. Snapshot misalignment (L4)
Require resolution_ts < scrape_ts on every traded market. Any violation = fail.
B5. Truncated-tape leakage (L5)
Count fills against markets whose state is "open" at tape-end but whose resolution is in the past. Non-zero = fail.
B6. Time-shuffle test
Randomly permute the timestamps of incoming features (preserving market identity). Re-run the strategy. If the time-shuffled strategy is profitable, the persona is not using temporal information — it's using something that survives the shuffle, which usually means it's using the outcome label in disguise.
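A minimal sketch of the shuffle itself, assuming features arrive as hypothetical `(market_id, ts, feature)` rows. Timestamps are permuted only among rows of the same market, so market identity and the feature values are preserved while their temporal order is destroyed.

```python
import random
from collections import defaultdict

def time_shuffle(rows, seed=0):
    """Return rows with timestamps permuted within each market."""
    rng = random.Random(seed)
    by_market = defaultdict(list)
    for i, (market_id, _, _) in enumerate(rows):
        by_market[market_id].append(i)
    out = list(rows)
    for idxs in by_market.values():
        stamps = [rows[i][1] for i in idxs]
        rng.shuffle(stamps)
        for i, ts in zip(idxs, stamps):
            out[i] = (rows[i][0], ts, rows[i][2])
    return out
```

The strategy is then re-run on the shuffled rows, ideally over several seeds, and its P&L compared with the unshuffled run.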
Family C — Execution realism (5 methods)
This family asks: would the strategy actually fill at the prices the backtest assumed?
C1. Realistic fill simulator
For every signal, replace the assumed mid-price fill with a fill priced against the actual orderbook depth at the signal timestamp. Apply a market-impact model proportional to size / depth. Compare to the mid-price backtest. Delta > 20% = fail. This single test killed roughly 30% of candidates that passed Family A and Family B.
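A minimal sketch of pricing a buy against visible depth, assuming the ask side is a list of hypothetical `(price, qty)` levels sorted best first. The impact term is a simple linear add-on proportional to size / depth with an illustrative coefficient; the real impact model is not specified in this post.

```python
from typing import List, Optional, Tuple

def realistic_buy_fill(asks: List[Tuple[float, float]], size: float,
                       impact_coeff: float = 0.0) -> Optional[float]:
    """Average fill price for a buy of `size`, or None if depth is insufficient."""
    depth = sum(qty for _, qty in asks)
    if size <= 0 or size > depth:
        return None  # not fillable against the visible book
    remaining, cost = size, 0.0
    for price, qty in asks:
        take = min(remaining, qty)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    vwap = cost / size
    # Linear impact add-on, proportional to size / depth (an assumption).
    return vwap + impact_coeff * size / depth

def fill_delta_fails(mid_pnl: float, realistic_pnl: float,
                     threshold: float = 0.20) -> bool:
    """True if realistic-fill P&L differs from mid-price P&L by more than the threshold."""
    return abs(realistic_pnl - mid_pnl) > threshold * abs(mid_pnl)
```

Returning `None` for orders larger than the visible book makes unfillable signals explicit instead of silently pricing them at the last level.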
C2. Adverse selection
For maker-style strategies, model the asymmetric fill rate: when the market is moving against you, your maker quote fills quickly (you got picked off); when the market is moving with you, your quote does not fill. The backtest must use the asymmetric fill rate, not the symmetric one.
C3. Fee stress test
Add 25%, 50%, 100% to the fee schedule. Re-run. A strategy with a 50% margin to the breakeven fee level is robust; a strategy that goes negative at +25% fees is fee-tier dependent and will die when the venue updates its schedule.
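A worked version of the fee arithmetic, assuming the set of trades does not change when fees rise, so fees scale linearly. Reading "margin to the breakeven fee level" as the fractional fee increase at which net P&L reaches zero is this sketch's interpretation; the function name is illustrative.

```python
def fee_stress(gross_pnl, base_fees, bumps=(0.25, 0.50, 1.00)):
    """Net P&L at each fee bump, plus the bump at which net P&L hits zero.

    Assumes base_fees > 0 and that the trade list is unchanged by the bump.
    """
    net_by_bump = {b: gross_pnl - base_fees * (1.0 + b) for b in bumps}
    breakeven_bump = gross_pnl / base_fees - 1.0
    return net_by_bump, breakeven_bump
```

For example, a strategy with 1,000 of gross P&L and 500 of fees breaks even at a +100% fee bump, while one with 900 of fees is already negative at +25%.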
C4. Slippage stress test
Add 25 bps, 50 bps, 100 bps of slippage per leg. Same threshold: 50% margin to breakeven.
C5. Capacity test
Scale the assumed position size 2×, 5×, 10×, 25×. Re-run with proportional market-impact. Plot P&L vs. size. The strategy's capacity ceiling is where the P&L curve goes flat. A strategy with capacity ≈ $5K is research-grade, not deployable.
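A toy illustration of the P&L-versus-size curve, assuming a fixed edge per unit and a linear impact cost per unit proportional to size / depth. A real capacity test re-runs the backtest at each size. This closed-form stand-in only shows why the curve bends, and all parameter names are hypothetical.

```python
def capacity_curve(base_size, edge_per_unit, depth, impact_coeff,
                   multipliers=(1, 2, 5, 10, 25)):
    """Return ([(size, pnl), ...], size with the highest P&L on the grid)."""
    curve = []
    for m in multipliers:
        size = base_size * m
        impact_per_unit = impact_coeff * size / depth  # linear impact (assumption)
        curve.append((size, size * (edge_per_unit - impact_per_unit)))
    best_size, _ = max(curve, key=lambda point: point[1])
    return curve, best_size
```

On this grid the ceiling is taken as the tested size where P&L stops rising. A finer grid around that point would locate it more precisely.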
Family D — Regime stratification (5 methods)
This family asks: is the edge consistent across market conditions, or does it require a specific regime?
D1. Bull / bear / chop partition
Partition the backtest period into bull, bear, and chop regimes (defined by 60-day rolling Sharpe of the underlying). Require the strategy to be net-positive in all three. Bull-only strategies are common; bear-only strategies are rare and high-value; chop-only strategies usually fail forward-paper because chop is hard to forecast in real time.
D2. Volatility quartile partition
Stratify by realized volatility quartile. The strategy must be net-positive in at least three of the four. Strategies that only work in high-vol regimes are usually over-fit to a specific market period.
D3. Liquidity-regime partition
Stratify by orderbook depth quartile. Maker strategies should be best in low-liquidity regimes (less competition); taker strategies should be best in high-liquidity regimes (better fills). Inverse signals are red flags.
D4. Time-of-day stratification
Partition by UTC hour, day of week, day of month. The strategy must not concentrate >40% of its P&L in any single bucket — that level of concentration usually indicates a calendar artifact rather than an edge.
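A minimal sketch of the concentration check, assuming trades are hypothetical `(unix_ts, pnl)` pairs. The share is measured against net total P&L, which is an assumption of this sketch; the bucketing function is passed in so the same code covers hour, weekday, and day of month.

```python
from collections import defaultdict
from datetime import datetime, timezone

def max_bucket_share(trades, key):
    """Largest share of net P&L in a single bucket, or None if net P&L <= 0."""
    totals = defaultdict(float)
    for ts, pnl in trades:
        bucket = key(datetime.fromtimestamp(ts, tz=timezone.utc))
        totals[bucket] += pnl
    total = sum(totals.values())
    if total <= 0:
        return None  # share is undefined for a non-profitable strategy
    return max(totals.values()) / total

by_hour = lambda d: d.hour
by_weekday = lambda d: d.weekday()
by_day_of_month = lambda d: d.day
```

A result above 0.40 for any of the three bucketings would fail the test as stated.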
D5. Cross-universe out-of-sample transfer
Where applicable, train on universe X (e.g., Kalshi) and test on universe Y (e.g., Polymarket). A strategy that doesn't transfer at all is venue-specific; a strategy that transfers is robust. Most directional strategies do not transfer. Most structural strategies do, with a multiplicative venue-fee adjustment.
Family E — Forward-paper (5 methods)
This family asks: does the strategy work on data the developer has never seen?
Family E is the gold standard. Backtests, no matter how clean, cannot model your own market impact, adverse selection on your specific orderflow, true fill availability, or latency. Forward-paper on live data is the only test that observes all of those. We treat backtests as a triage step; forward-paper is the gate to capital deployment.
E1. Pre-registered hypothesis
Before forward-paper begins, the strategy's expected Sharpe, win rate, and average P&L per trade are committed to a write-once log. After 30 days, we compare actual to pre-registered. Strategies that match are candidates; strategies whose performance is unexpectedly good usually have a leak we missed.
E2. 14-day SPAN gate
The strategy must run for 14 calendar days in paper-mode against the live tape. The moments of the daily P&L distribution are compared to those of the backtest. Strategies whose live moments deviate >1σ from backtest moments are halted for review.
E3. Realistic fill validation
During paper, the system logs both the "assumed fill" (mid-price) and the "realistic fill" (against live book). The strategy must be net-positive against realistic fills, not just assumed. This is where backtest-clean strategies often die in practice.
E4. Champion-challenger A/B
For maker strategies especially, run the challenger persona alongside the incumbent (or against a null persona) in live paper. Statistically compare per-trade P&L. A challenger that doesn't beat the null with p < 0.10 after 30 days isn't shipping.
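A minimal sketch of one way to run the comparison: a one-sided permutation test on the difference in mean per-trade P&L. The choice of test is an assumption, since the text does not name one, and the function name is illustrative.

```python
import random
from statistics import fmean

def permutation_pvalue(challenger, null, n_perm=10000, seed=0):
    """One-sided p-value that the challenger's mean per-trade P&L beats the null's."""
    rng = random.Random(seed)
    observed = fmean(challenger) - fmean(null)
    pooled = list(challenger) + list(null)
    k = len(challenger)
    hits = 0
    for _ in range(n_perm):
        # Reassign trades to the two arms at random.
        rng.shuffle(pooled)
        if fmean(pooled[:k]) - fmean(pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Under the rule above, a challenger would ship only if this value is below 0.10 after the 30-day window.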
E5. Walk-forward in production
Once live, the strategy's rolling 30-day Sharpe is monitored. A degraded rolling Sharpe, particularly one that correlates with a regime shift, triggers a freeze. This is not validation, but it is the last line of defense before realized losses compound.
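A minimal sketch of the monitor, assuming a list of daily P&L values. The Sharpe is left un-annualized, and `floor` is a hypothetical threshold: the post does not say what counts as "degraded".

```python
from statistics import fmean, pstdev

def rolling_sharpe(daily_pnl, window=30):
    """Un-annualized Sharpe over each trailing window of daily P&L."""
    out = []
    for i in range(window, len(daily_pnl) + 1):
        w = daily_pnl[i - window:i]
        sd = pstdev(w)
        out.append(fmean(w) / sd if sd > 0 else 0.0)
    return out

def should_freeze(daily_pnl, floor, window=30):
    """True if the most recent rolling Sharpe is below the chosen floor."""
    rs = rolling_sharpe(daily_pnl, window)
    return bool(rs) and rs[-1] < floor
```

With fewer than `window` days of history the function returns `False`, so the monitor only starts acting once a full window is available.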
The pass funnel
| Gate | Candidates in | Candidates out |
|---|---|---|
| Fleet baseline (all personas tested) | ~2,400 | — |
| Naive backtest positive | 2,400 | ~480 |
| Family A — purged-CV | 480 | 43 |
| Family B — leak audit | 43 | ~38 |
| Family C — execution realism | 38 | ~12 |
| Family D — regime stratification | 12 | ~5 |
| Family E — forward-paper | 5 | 0 (currently) |
Five strategies are battery-clean on A–D and on forward-paper P&L, failing only the calendar gate (the 14-day SPAN window). Re-validation is scheduled for ~2026-05-27. We do not lift the gate until the span is met. No exceptions. The whole point of the discipline is to not deploy on enthusiasm.
Why most quants skip this
The battery is expensive. Family C alone — the realistic fill simulator — requires per-market orderbook reconstruction across the entire backtest period. Family D requires regime-stratified replays. Family E requires weeks of calendar time. A retail quant chasing alpha can't justify the engineering cost. A fund can, but mostly runs proprietary variants without publishing.
Skipping the battery is the equilibrium behavior. The cost of doing so is that ~95% of "validated" strategies in the wild are validated only through Family A — purged-CV — and ship into production with untested execution, regime, and forward-paper exposure. The base rate of those strategies surviving six months in production is, anecdotally, dismal.
The five-method minimum, if you do nothing else
If you can't run all 26, run these five. They catch the largest share of false positives per unit of engineering cost:
- A1 purged-CV with embargo — kills overfitting
- B1 post-resolution price scrubbing — kills the most common leak
- C1 realistic fills against the orderbook — kills capacity-driven false positives
- D1 bull/bear/chop partition — kills regime-dependent edges
- E2 14-day forward-paper — kills everything that survived the first four
Reproduce / use this
The full battery is implemented in our F1106/F1224 stack. We productized it as Validation-as-a-Service — submit a strategy spec or backtest, get back a per-method pass/fail report. The white paper has the full method specifications including pseudocode for the most subtle tests (B6 time-shuffle, A5 synthetic-null, C2 adverse selection); see "Building Honest Quant Validation From First Principles."
Honest disclosure. These posts come from an internal trading-research program ($5K experimental capital, paper-mode preserved). Results are reported as measured; none of this is investment advice. Where a method or finding has caveats, we name them in-line — that is the whole point of this series.