Reproducibility Datasets
Five of the six backtest corpora we audited were contaminated. We rebuilt them from raw tape and locked the result. Each dataset ships with the contamination-test suite that proves the corpus is the one we audited, so you can verify provenance with one command.
Available datasets
Kalshi 305K markets
Multi-year history of Kalshi binary contracts. Rebuilt from the Becker 72.1M-trade dataset. Every market carries scrape_ts, resolution_ts, and a genuine_flag. Post-resolution ticks scrubbed.
- 305,127 markets · ~9.2M price ticks
- Parquet + DuckDB-ready
- SHA-256 fingerprinted
- F1214 contamination test suite included
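As a rough illustration of what "post-resolution ticks scrubbed" means, here is a minimal Python sketch. It is not the pipeline used for the rebuild. The `Tick` record, its `market_id` and `ts` fields, and the choice to drop ticks stamped exactly at `resolution_ts` are assumptions made for the example; only `resolution_ts` and `yes_price` are names taken from this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Tick:
    market_id: str   # hypothetical identifier column
    ts: datetime     # hypothetical tick-timestamp column
    yes_price: float


def scrub_post_resolution(
    ticks: Iterable[Tick],
    resolution_ts: Mapping[str, Optional[datetime]],
) -> list[Tick]:
    """Drop every tick recorded at or after its market's resolution_ts.

    Assumption: a market mapped to None (or absent from the mapping) is
    unresolved, so all of its ticks are kept.
    """
    kept = []
    for tick in ticks:
        resolved_at = resolution_ts.get(tick.market_id)
        if resolved_at is None or tick.ts < resolved_at:
            kept.append(tick)
    return kept
```

A backtest that reads ticks after resolution effectively sees the answer, which is why a filter of this kind belongs at corpus-build time rather than in each strategy.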
Limitless 36K markets
Daily, hourly, and 15-min crypto-binary contracts. The "constant-0.5 yes_price" bug from the original corpus has been fixed by replay from the API.
- 36,184 markets across daily / hourly / 15-min cadences
- Maker-rebate eligibility flags
- Resolution-source audit trail
- F1214 contamination test suite included
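If you hold your own copy of the original corpus, a check along the following lines can flag markets affected by the constant-0.5 `yes_price` bug. This is a hedged sketch, not one of the F1214 tests. The `(market_id, yes_price)` pair layout, the `min_ticks` threshold, and the tolerance are assumptions for illustration.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def constant_half_markets(
    ticks: Iterable[Tuple[str, float]],
    min_ticks: int = 2,
    tol: float = 1e-9,
) -> list[str]:
    """Return ids of markets whose every observed yes_price is 0.5 (within tol).

    Markets with fewer than min_ticks observations are skipped, because a
    single 0.5 quote is not evidence of a stuck price.
    """
    prices = defaultdict(list)
    for market_id, yes_price in ticks:
        prices[market_id].append(yes_price)
    return sorted(
        market_id
        for market_id, observed in prices.items()
        if len(observed) >= min_ticks
        and all(abs(p - 0.5) <= tol for p in observed)
    )
```

A market can legitimately trade at 0.5 for a while, so hits from a detector like this are candidates for replay from the API, not proof of corruption.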
Polymarket YES-token
Polymarket (international) market histories, restricted to the YES-token side, with the snapshot-misalignment bug fixed. Use for calibration research and structural-arb baselines.
- ~190K markets
- UMA oracle resolution trail
- neg-risk grouping metadata
- F1214 contamination test suite included
All three + future updates
All three corpora, plus quarterly refreshes for 12 months. The F1214 test suite ships with every refresh — you can rerun provenance verification on every update.
- All three datasets
- 4 × quarterly updates
- Priority email support
- Access to research changelog
License terms restrict redistribution. The academic tier requires a .edu email address and a research-use statement.
What "verify provenance" means
Each dataset ships with a verify_corpus.py script that:
- Computes the SHA-256 of every parquet shard and matches against the locked fingerprint.
- Runs all five F1214 contamination tests on the corpus and prints the per-test score.
- Spot-checks 100 random markets against the source venue for outcome correctness.
If anyone modifies the dataset (deliberately or accidentally), the fingerprint diverges and the contamination scores shift. Reproducibility is not honor-system — it's mechanical.
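The fingerprint step can be pictured with the short Python sketch below. It is not the contents of verify_corpus.py. The manifest format (a JSON object mapping each shard's relative path to its hex digest) and the function names are hypothetical, chosen only to show the mechanics.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shards(corpus_dir: Path, manifest_path: Path) -> dict[str, list[str]]:
    """Compare every *.parquet shard under corpus_dir with a locked manifest.

    Assumption: the manifest is JSON of the form {"relative/path.parquet": "<hex>"}.
    """
    expected = json.loads(manifest_path.read_text())
    actual = {
        p.relative_to(corpus_dir).as_posix(): sha256_of_file(p)
        for p in sorted(corpus_dir.rglob("*.parquet"))
    }
    shared = expected.keys() & actual.keys()
    return {
        "mismatched": sorted(k for k in shared if expected[k] != actual[k]),
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
    }
```

An empty result in all three lists means the shards on disk are byte-identical to the fingerprinted ones. Reporting missing and unexpected shards separately from mismatches makes it easier to tell an incomplete download from an edited file.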
Why buy a corpus you could scrape yourself?
- Time. The Becker dataset is ~36 GiB compressed. Rebuilding our cleaned variant (strip post-resolution ticks, mark provenance, fingerprint) took roughly 80 engineering hours.
- The test suite. The contamination tests are the real value. Even if you have your own corpus, the F1214 suite is what tells you whether it's contaminated.
- Citability. Academic / commissioned research benefits from a fingerprinted corpus that another team can re-fetch and verify against the same fingerprint.
- Updates. The bundle ships quarterly refreshes that re-run the audit. Doing this in-house every quarter is the maintenance cost most teams underestimate.
If you don't need the test suite or the updates and just want the underlying tape, the Becker dataset is public — we cite it openly and recommend it. Our value is the rebuild, the audit, and the test suite, not the raw bytes.