Raw OCR Evidence Bundles

Sources/*/raw/ contains the repo-local evidence backend for the cutover source corpus.

What Lives Here

  • One folder per doc_id extracted from the Mistral OCR corpus.
  • manifest.json uses the portable raw_manifest.v2 schema and is the preferred metadata entrypoint for QA and citations.
  • response.json is the full OCR response for deep inspection.
  • pages/ contains decoded page images when the OCR response included image payloads.
  • Sources/citations/raw_index.json is the generated index used by family QA reports and future citation tooling.
  • Manifest path fields are relative to the Sources root. Machine-local OCR runner paths do not belong in the live corpus.
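
The path rule above can be checked mechanically. The following is a minimal sketch, not part of the repo tooling: it assumes that path-bearing manifest fields have keys ending in "path" (the raw_manifest.v2 field names are not listed here, so that convention and the helper names are hypothetical) and flags any such value that is absolute or home-relative.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_machine_local(value: str) -> bool:
    """True if the string looks like an absolute or home-relative path."""
    return (
        value.startswith("~")
        or PurePosixPath(value).is_absolute()
        or PureWindowsPath(value).is_absolute()
    )


def find_machine_local_paths(node, trail=""):
    """Walk a parsed manifest and return (field trail, value) pairs that break the rule.

    Assumption: path fields are string values whose key ends in "path".
    """
    hits = []
    if isinstance(node, dict):
        for key, child in node.items():
            here = f"{trail}.{key}" if trail else key
            if isinstance(child, str) and key.endswith("path"):
                if is_machine_local(child):
                    hits.append((here, child))
            else:
                hits.extend(find_machine_local_paths(child, here))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            hits.extend(find_machine_local_paths(child, f"{trail}[{i}]"))
    return hits
```

To use it, load a bundle's manifest.json with json.load and pass the result in. An empty list means no machine-local paths were found under the assumed key convention.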

Current Coverage

  • Bundles indexed: 89
  • Raw roots: BMP_2023/raw, BMP_2026/raw, NJAC_2023/raw, NJAC_2026/raw
  • 2023 BMP bundles: 34
  • 2026 BMP bundles: 36
  • 2023 NJAC bundles: 1
  • 2026 NJAC bundles: 18
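
The figures above can be recounted from disk. This is a rough sketch that assumes each doc_id folder holds its manifest.json directly (Sources/<family>/raw/<doc_id>/manifest.json). The function names are hypothetical, and the DOCUMENTED table simply restates the numbers listed in this section.

```python
from collections import Counter
from pathlib import Path

# Counts as stated in this section; the four roots sum to 89 bundles.
DOCUMENTED = {
    "BMP_2023/raw": 34,
    "BMP_2026/raw": 36,
    "NJAC_2023/raw": 1,
    "NJAC_2026/raw": 18,
}


def count_bundles(sources_root: Path) -> Counter:
    """Count bundle folders per raw root by locating their manifest.json files."""
    counts = Counter()
    for manifest in sources_root.glob("*/raw/*/manifest.json"):
        raw_root = manifest.parent.parent.relative_to(sources_root)
        counts[raw_root.as_posix()] += 1
    return counts


def mismatches(found, expected):
    """Return {raw_root: (expected, found)} for every root whose counts differ."""
    roots = sorted(set(found) | set(expected))
    return {
        root: (expected.get(root, 0), found.get(root, 0))
        for root in roots
        if expected.get(root, 0) != found.get(root, 0)
    }
```

Calling mismatches(count_bundles(Path("Sources")), DOCUMENTED) from the repo root would return an empty dict when the on-disk corpus matches this section.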

Utility

  • Supports family-level 0_*QAPASS.md posterity reports.
  • Supports coverage checks between extracted markdown, source registry, and raw OCR evidence.
  • Provides the evidence base that future bmp_crosswalk and report_claims refreshes should use.
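
A coverage check of the kind described above reduces to comparing three sets of doc_id values. The sketch below is illustrative only: how the ids are gathered from the extracted markdown, the source registry, and the raw bundles is left to the caller, and the function and key names are hypothetical.

```python
def coverage_gaps(markdown_ids, registry_ids, raw_ids):
    """Report which doc_ids are absent from each of the three sources.

    A doc_id counts as expected if it appears in at least one source.
    """
    markdown_ids = set(markdown_ids)
    registry_ids = set(registry_ids)
    raw_ids = set(raw_ids)
    everything = markdown_ids | registry_ids | raw_ids
    return {
        "missing_markdown": everything - markdown_ids,
        "missing_registry": everything - registry_ids,
        "missing_raw": everything - raw_ids,
    }
```

Full coverage corresponds to all three returned sets being empty.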

Notes

  • This directory is evidence infrastructure, not authored content.
  • Stale prose elsewhere in Sources/citations/ must not override live raw evidence and manifests stored under Sources/*/raw/.