
Frontmatter Template - Extracted Source Markdown

Canonical 10-field schema for all Sources/*/Markdown/*.md files. Run .agents/scripts/normalize_frontmatter.py after any corpus refresh or manual cutover.


Template

---
doc_id: <identifier>
era: '2023' | '2026'
chapter: '<chapter-number-or-rule>' # '' allowed for appendices and supplemental guidance
source_pdf: ../<paired-source.pdf>
sha256_source: <sha256-hex>
extracted_by: mistral-ocr-latest
extracted_on: <YYYY-MM-DD HH:MM:SS+00:00>
qa_date: '<YYYY-MM-DD>' # 'Pending' if not yet QA'd
qa_delta_from_pdf: '+0 words' # 'Pending' if not yet QA'd
tags: [<deterministic-tag-list>]
---

Review lag is derived as qa_date - extracted_on whenever qa_date holds a date rather than 'Pending'. It is not stored inline.
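A minimal sketch of that derivation, using only the Python standard library. The function name `review_lag` is hypothetical, and comparing at whole-day granularity is an assumption made here because qa_date carries no time component; the normalizer may compute it differently.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def review_lag(extracted_on: str, qa_date: str) -> Optional[timedelta]:
    """Return qa_date - extracted_on in whole days, or None while QA is pending."""
    if qa_date == "Pending":
        return None  # nothing to derive until a dated QA pass exists
    # extracted_on looks like '2026-03-26 21:47:50+00:00'
    extracted = datetime.fromisoformat(extracted_on)
    qa_day = date.fromisoformat(qa_date)
    # Assumption: compare calendar dates only, since qa_date has no time part.
    return qa_day - extracted.date()
```

For example, `review_lag("2026-03-26 21:47:50+00:00", "2026-03-29")` yields a three-day `timedelta`, and any 'Pending' qa_date yields `None`.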


Field Reference

| Field | Required | Values / Format |
| --- | --- | --- |
| doc_id | yes | Matches filename without .md |
| era | yes | Quoted string: '2023' or '2026' |
| chapter | yes | BMP chapter ('5', '10.1'), rule identifier ('7:8'), or '' for appendices / supplemental docs |
| source_pdf | yes | Relative path from the markdown file to its paired PDF, for example ../2026_BMP_5_SWM_Standards_and_Computations.pdf |
| sha256_source | yes | SHA-256 hex digest of the paired source PDF |
| extracted_by | yes | mistral-ocr-latest |
| extracted_on | yes | UTC timestamp with offset: 2026-03-26 21:47:50+00:00 |
| qa_date | yes | Quoted date of last QA pass, or 'Pending' |
| qa_delta_from_pdf | yes | Signed word delta from QA rerun, or 'Pending' |
| tags | yes | Inline YAML list using the taxonomy below |
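The field rules above can be sketched as a small checker. This is an illustration, not the logic of normalize_frontmatter.py: the function name `check_frontmatter` and its messages are hypothetical, and it assumes the frontmatter has already been parsed into a dict by a YAML parser of your choice.

```python
import re
from typing import Any, Dict, List

# The ten fields named in the template, in template order.
REQUIRED_FIELDS = (
    "doc_id", "era", "chapter", "source_pdf", "sha256_source",
    "extracted_by", "extracted_on", "qa_date", "qa_delta_from_pdf", "tags",
)


def check_frontmatter(meta: Dict[str, Any], filename: str) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in meta]
    if problems:
        return problems
    stem = filename[:-3] if filename.endswith(".md") else filename
    if meta["doc_id"] != stem:
        problems.append("doc_id must match the filename without .md")
    if meta["era"] not in ("2023", "2026"):
        problems.append("era must be the quoted string '2023' or '2026'")
    # Assumption: accept upper- or lowercase hex for the digest.
    if not re.fullmatch(r"[0-9a-fA-F]{64}", str(meta["sha256_source"])):
        problems.append("sha256_source must be a 64-character hex digest")
    if meta["extracted_by"] != "mistral-ocr-latest":
        problems.append("extracted_by must be mistral-ocr-latest")
    qa_date = str(meta["qa_date"])
    if qa_date != "Pending" and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", qa_date):
        problems.append("qa_date must be YYYY-MM-DD or 'Pending'")
    if not isinstance(meta["tags"], list):
        problems.append("tags must be a list")
    return problems
```

An unquoted `era: 2026` parses as an integer in YAML, which is why the check compares against strings.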

Source-of-Truth Contract

  • Sources/*/Markdown/*.md is the working source-of-truth text corpus.
  • Sources/BMP_*/*.pdf and Sources/NJAC_*/*.pdf are the paired canonical binary sources.
  • Page markers in extracted markdown must preserve the source PDF page boundaries.
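One way to check that a markdown file still points at the PDF it was extracted from is to recompute the digest and compare it with sha256_source. The sketch below assumes source_pdf is resolved against the directory containing the markdown file; the helper names `sha256_of` and `pdf_matches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pdf_matches(markdown_path: Path, source_pdf: str, sha256_source: str) -> bool:
    """Compare the recorded digest with the paired PDF on disk."""
    # Assumption: source_pdf (for example '../name.pdf') is relative to the
    # directory that holds the markdown file.
    pdf_path = (markdown_path.parent / source_pdf).resolve()
    return sha256_of(pdf_path) == sha256_source.lower()
```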

Tag Taxonomy

Always present

| Tag | Meaning |
| --- | --- |
| extracted | Markdown extraction artifact |
| source-pdf | Has a paired PDF in the same family root |
| stormwater | Stormwater source corpus |
| family-njbmp / family-njac | Source family tag |
| era-2023 / era-2026 | Era tag |
| src-<registry-id> | Stable origin-link tag, for example src-2026-bmp-11-3 or src-njac-7-8-2023 |
| <normalized-topic-tag> | Deterministic title/topic tag derived from the registry title or source heading |

Family and structure tags

| Tag | Used when |
| --- | --- |
| chapter-<n> | BMP chapter-level docs, for example chapter-7 |
| section-<chapter>-<section> | BMP section docs, for example section-10-1 |
| appendix / appendix-<letter> | BMP appendices |
| rule-<normalized-citation> | NJAC rule-structure tag, for example rule-7-8, rule-7-9a, rule-7-1e-1-20-26 |
| checklist / fact-sheet / guidance / dphs | Deterministic supplemental tags when applicable |
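As an illustration of how a structure tag could follow from the chapter field, the sketch below maps '5' to chapter-5, '10.1' to section-10-1, and '7:8' to rule-7-8. The function `structure_tag`, its family argument values, and the slug rule are assumptions inferred from the examples in this document; the normalizer's actual derivation may differ, and appendix and supplemental tags are not handled here.

```python
import re
from typing import Optional


def structure_tag(family: str, chapter: str) -> Optional[str]:
    """Derive a chapter-, section-, or rule- tag from the chapter field."""
    if chapter == "":
        return None  # appendices and supplemental docs need other inputs
    # Assumption: every run of non-alphanumerics collapses to a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", chapter.lower()).strip("-")
    if family == "njac":
        return f"rule-{slug}"
    # Assumption: a BMP value with a separator ('10.1') is a section.
    return f"section-{slug}" if "-" in slug else f"chapter-{slug}"
```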

Tags are lowercase, ASCII, deduplicated, and sorted deterministically.
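A minimal sketch of those four properties, assuming plain lexicographic order counts as the deterministic sort and that non-ASCII characters are folded or dropped rather than rejected. The name `normalize_tags` is hypothetical.

```python
import unicodedata
from typing import Iterable, List


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Lowercase, force ASCII, deduplicate, and sort a tag list."""
    cleaned = set()
    for tag in tags:
        # Decompose accented characters, then drop anything outside ASCII.
        folded = unicodedata.normalize("NFKD", tag)
        ascii_tag = folded.encode("ascii", "ignore").decode("ascii")
        ascii_tag = ascii_tag.strip().lower()
        if ascii_tag:
            cleaned.add(ascii_tag)  # the set removes duplicates
    return sorted(cleaned)  # assumption: lexicographic order
```

Running the function twice on its own output returns the same list, which is the property a normalizer needs to avoid churn between corpus refreshes.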

Deprecated and no longer emitted by the normalizer:

  • source-ocr
  • source-of-truth
  • family-2026-bmp / similar era-combined family tags
  • 2026-era / 2023-era
  • bmp-manual
  • njac
  • regulatory
  • supplemental-source
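When cleaning older files by hand, the deprecated list can be applied as a simple filter. The regular expression for "era-combined family tags" is an assumption (a four-digit year directly after `family-`), and `drop_deprecated` is a hypothetical helper, not part of the normalizer.

```python
import re
from typing import Iterable, List

# Exact tags listed as deprecated above.
DEPRECATED_TAGS = {
    "source-ocr", "source-of-truth", "2026-era", "2023-era",
    "bmp-manual", "njac", "regulatory", "supplemental-source",
}
# Assumption: era-combined family tags look like 'family-<year>-...'.
ERA_COMBINED_FAMILY = re.compile(r"^family-\d{4}-")


def drop_deprecated(tags: Iterable[str]) -> List[str]:
    """Keep only tags that are neither listed nor era-combined family tags."""
    return [
        tag for tag in tags
        if tag not in DEPRECATED_TAGS and not ERA_COMBINED_FAMILY.match(tag)
    ]
```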

What Does Not Belong in Frontmatter

QA audit-trail details belong in family-level 0_*QAPASS.md files, not inline:

| Field | Where it lives instead |
| --- | --- |
| qa_status | 0_*QAPASS.md pass summary |
| qa_checked_on | superseded by qa_date |
| qa_checks_passed | 0_*QAPASS.md summary |
| qa_report | family-level QA summary or corpus QA report |
| patched_on | family-level QA summary if needed |
| patch_reason | family-level QA summary if needed |
| reextract_reason | no longer used in live extracted frontmatter |
| pdf_filename | redundant; use source_pdf |

---

Raw Evidence Utility

  • Sources/*/raw/ is the repo-local OCR evidence backend for the live source corpus.
  • Each Sources/*/raw/<doc_id>/ bundle should contain manifest.json, response.json, and pages/ when page images were emitted.
  • manifest.json uses the portable raw_manifest.v2 schema and is the preferred metadata entrypoint for QA and citations.
  • raw_manifest.v2 stores Sources-root-relative source_pdf_path, markdown_path, bundle_dir, response_path, and optional pages_dir.
  • response.json is the full OCR payload for deep inspection, not the default aggregation source.
  • Sources/citations/raw_index.json is the generated evidence index used by family QA reports and future citation tooling.
  • Raw evidence supports QA and citation maintenance; it is not authored content and should override stale prose elsewhere in Sources/citations/ when conflicts appear.
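Because raw_manifest.v2 stores paths relative to the Sources root, a consumer has to join them onto that root before use. The sketch below shows one way to do that with the five path fields named above; the helper names are hypothetical, and any other keys in the manifest are outside what this document specifies.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Path fields named in this document; pages_dir is optional.
PATH_FIELDS = ("source_pdf_path", "markdown_path", "bundle_dir", "response_path")
OPTIONAL_PATH_FIELDS = ("pages_dir",)


def load_manifest(bundle_dir: Path) -> Dict[str, Any]:
    """Read manifest.json from a Sources/*/raw/<doc_id>/ bundle."""
    return json.loads((bundle_dir / "manifest.json").read_text(encoding="utf-8"))


def resolve_manifest_paths(sources_root: Path, manifest: Dict[str, Any]) -> Dict[str, Path]:
    """Turn Sources-root-relative manifest paths into paths under sources_root."""
    resolved = {name: sources_root / manifest[name] for name in PATH_FIELDS}
    for name in OPTIONAL_PATH_FIELDS:
        if manifest.get(name):  # skip when absent or empty
            resolved[name] = sources_root / manifest[name]
    return resolved
```

Reading manifest.json first and opening response.json only when needed matches the guidance that the manifest is the preferred metadata entrypoint.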

Citations Identity Model

  • Sources/citations/source_document_registry.json is the authoritative identity ledger.
  • registry_id is the stable public identifier for joins and human-facing citation references.
  • doc_id is the canonical extracted markdown / raw bundle identifier.
  • Generated citations artifacts should carry both identifiers when they emit live source references.
  • See Sources/citations/README.md for the full rebuild order and operating contract.
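The "carry both identifiers" rule can be pictured as a small record type. This is only a sketch: the class `SourceRef`, the function `emit_reference`, and the output shape are hypothetical, and the layout of source_document_registry.json is not described here, so nothing below reads it.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class SourceRef:
    """A live source reference that carries both identifiers."""
    registry_id: str  # stable public identifier for joins and citations
    doc_id: str       # extracted markdown / raw bundle identifier


def emit_reference(ref: SourceRef) -> Dict[str, str]:
    """Serialize a reference, refusing to emit one with a missing identifier."""
    if not ref.registry_id or not ref.doc_id:
        raise ValueError("live source references need registry_id and doc_id")
    return asdict(ref)
```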