Stanford releases SEFD: machine-readable SEC filings dataset

In brief

  • The Stanford Advanced Financial Technologies Lab released SEFD, a machine-readable dataset of SEC filings from 1994 to the present
  • Initial snapshot: 152 billion tokens covering Jan 2022–Jun 2025; full dataset estimated at 550 billion tokens
  • SEFD reports structural accuracy above 99% in human evaluations and less than 0.1% overlap with Common Crawl-derived corpora, making it novel training data
  • The dataset could open up access to structured financial data, a market traditionally dominated by premium providers such as Bloomberg and Refinitiv

Preserving Structure Where It Matters

SEFD's core innovation is what it preserves. Past SEC filing extraction efforts routinely destroyed the structural and semantic components that make financial documents useful: they flattened table hierarchies, lost numeric signs, and stripped formatting details. SEFD's MultiMarkdown approach keeps those elements intact, with structural accuracy exceeding 99% based on human evaluations.

That precision matters at this scale. Accuracy above 99% is impressive, but even a sub-1% error rate spread across approximately 18.5 million filings leaves a non-trivial number of potential inaccuracies, so downstream users still need robust validation.
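To put that in rough numbers, here is a minimal Python sketch. It assumes, purely for illustration, that the error rate applies per filing and treats 1% as an upper bound; the article does not say what unit the accuracy figure is measured in, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate using the article's approximate figures.
TOTAL_FILINGS = 18_500_000  # approximate filing count cited in the article
MAX_ERROR_RATE = 0.01       # "sub-1%" treated as an upper bound (assumption: per filing)


def max_affected_filings(total_filings: int, error_rate: float) -> int:
    """Upper bound on filings with at least one structural error (hypothetical helper)."""
    return round(total_filings * error_rate)


upper_bound = max_affected_filings(TOTAL_FILINGS, MAX_ERROR_RATE)
print(f"Up to about {upper_bound:,} filings could contain a structural error")
```

Under those assumptions the bound works out to 185,000 filings. The real figure depends on how structural accuracy is actually measured, which is why validation on the specific fields a project relies on remains worthwhile.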

Unique Training Material for AI Models

The dataset has less than 0.1% overlap with Common Crawl-derived corpora, so most large language models have not seen this material. For model builders, that means SEFD adds genuinely new training data rather than reinforcing what models have already learned.

The Stanford team, led by Nick Bettencourt, who is affiliated with UCLA and collaborates with Stanford, introduced two benchmarks designed to test how well models can work with this kind of data: EDGAR-Forecast for numerical forecasting and EDGAR-OCR for financial table transcription. The project was announced on June 16, 2026.

Breaking Paywalls on Financial Data

The financial data industry is dominated by players like Bloomberg and Refinitiv that charge premium prices for structured financial information. An open, high-quality dataset of SEC filings could democratize access to the raw material that powers financial analysis. For researchers, startups, and institutions without Bloomberg terminals, SEFD removes a significant barrier to rigorous financial data work.

The dataset is live now. Researchers can begin building on it immediately.

Frequently asked questions

What is the SEFD dataset?

SEFD is a machine-readable reconstruction of US SEC EDGAR filings from 1994 to the present in MultiMarkdown format. It preserves structural elements such as table hierarchies and formatting that past extraction efforts destroyed, with structural accuracy above 99% in human evaluations.

How large is the dataset?

The initial public snapshot contains 152 billion tokens covering January 2022 to June 2025. The full dataset, when complete, is estimated to reach roughly 550 billion tokens from approximately 18.5 million filings.
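For a sense of scale, those figures can be combined in a short Python sketch. The results are derived estimates from the article's approximate numbers, not statistics published by the project, and the variable names are illustrative.

```python
# Rough scale estimates derived from the article's approximate figures.
SNAPSHOT_TOKENS = 152_000_000_000  # initial public snapshot
FULL_TOKENS = 550_000_000_000      # estimated size of the full dataset
FULL_FILINGS = 18_500_000          # approximate filing count for the full dataset

# Fraction of the estimated full dataset covered by the initial snapshot.
snapshot_share = SNAPSHOT_TOKENS / FULL_TOKENS

# Average filing length implied by the full-dataset estimates.
avg_tokens_per_filing = FULL_TOKENS / FULL_FILINGS

print(f"Snapshot share of full dataset: {snapshot_share:.1%}")
print(f"Implied average tokens per filing: {avg_tokens_per_filing:,.0f}")
```

On these numbers the snapshot is a little over a quarter of the projected full dataset, and the implied average filing runs to roughly 30,000 tokens.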

Why does SEFD matter for AI training?

SEFD has less than 0.1% overlap with Common Crawl-derived corpora, so most large language models have not seen this material. That makes it useful for training financial AI models on new data rather than reinforcing what they have already learned.