Frontmatter Template - Extracted Source Markdown
Canonical 10-field schema for all Sources/*/Markdown/*.md files.
Run .agents/scripts/normalize_frontmatter.py after any corpus refresh or manual cutover.
Template
---
doc_id: <identifier>
era: '2023' | '2026'
chapter: '<chapter-number-or-rule>' # '' allowed for appendices and supplemental guidance
source_pdf: ../<paired-source.pdf>
sha256_source: <sha256-hex>
extracted_by: mistral-ocr-latest
extracted_on: <YYYY-MM-DD HH:MM:SS+00:00>
qa_date: '<YYYY-MM-DD>' # 'Pending' if not yet QA'd
qa_delta_from_pdf: '+0 words' # 'Pending' if not yet QA'd
tags: [<deterministic-tag-list>]
---
Review lag is derived as qa_date - extracted_on when qa_date is dated. It is not stored inline.
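A minimal sketch of that derivation, assuming both values arrive as strings in the formats shown in the template and that the lag is counted in whole calendar days against the UTC extraction date. The function name `review_lag_days` is hypothetical and is not part of the normalizer.

```python
from datetime import date, datetime
from typing import Optional


def review_lag_days(qa_date: str, extracted_on: str) -> Optional[int]:
    """Return qa_date - extracted_on in whole days, or None while QA is pending."""
    if qa_date == "Pending":
        return None
    qa = date.fromisoformat(qa_date)
    # extracted_on looks like '2026-03-26 21:47:50+00:00'; only its date part is used.
    extracted = datetime.fromisoformat(extracted_on).date()
    return (qa - extracted).days
```

For example, `review_lag_days('2026-03-28', '2026-03-26 21:47:50+00:00')` returns `2`, and a `'Pending'` QA date returns `None`.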
Field Reference

| Field | Required | Values / Format |
|---|---|---|
| `doc_id` | yes | Matches filename without .md |
| `era` | yes | Quoted string: '2023' or '2026' |
| `chapter` | yes | BMP chapter ('5', '10.1'), rule identifier ('7:8'), or '' for appendices / supplemental docs |
| `source_pdf` | yes | Relative path from the markdown file to its paired PDF, for example ../2026_BMP_5_SWM_Standards_and_Computations.pdf |
| `sha256_source` | yes | SHA-256 hex digest of the paired source PDF |
| `extracted_by` | yes | mistral-ocr-latest |
| `extracted_on` | yes | UTC timestamp with offset: 2026-03-26 21:47:50+00:00 |
| `qa_date` | yes | Quoted date of last QA pass, or 'Pending' |
| `qa_delta_from_pdf` | yes | Signed word delta from QA rerun, or 'Pending' |
| `tags` | yes | Inline YAML list using the taxonomy below |

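The field rules above can be checked mechanically. The sketch below is one possible validator, not the logic of `normalize_frontmatter.py`, which is not shown here. It assumes the frontmatter has already been parsed into a `dict` of strings, that the digest is the usual 64-character hex form, and that word deltas follow the `'+0 words'` shape from the template. `check_frontmatter` and `REQUIRED_FIELDS` are hypothetical names.

```python
import re
from datetime import date, datetime

REQUIRED_FIELDS = (
    "doc_id", "era", "chapter", "source_pdf", "sha256_source",
    "extracted_by", "extracted_on", "qa_date", "qa_delta_from_pdf", "tags",
)


def _parses(parser, value) -> bool:
    """True when parser(value) succeeds, e.g. date.fromisoformat."""
    try:
        parser(value)
        return True
    except (TypeError, ValueError):
        return False


def check_frontmatter(fm: dict, filename: str) -> list:
    """Return human-readable problems; an empty list means no issue was found."""
    missing = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fm]
    if missing:
        return missing
    problems = []
    if filename != f"{fm['doc_id']}.md":
        problems.append("doc_id does not match filename")
    if fm["era"] not in ("2023", "2026"):
        problems.append("era must be the string '2023' or '2026'")
    if not isinstance(fm["chapter"], str):
        problems.append("chapter must be a string ('' is allowed)")
    if not str(fm["source_pdf"]).lower().endswith(".pdf"):
        problems.append("source_pdf should point at a PDF")
    # Assumption: the digest is the usual 64-character hex form of SHA-256.
    if not re.fullmatch(r"[0-9a-fA-F]{64}", str(fm["sha256_source"])):
        problems.append("sha256_source is not a 64-character hex digest")
    if fm["extracted_by"] != "mistral-ocr-latest":
        problems.append("unexpected extracted_by value")
    if not _parses(datetime.fromisoformat, fm["extracted_on"]):
        problems.append("extracted_on is not an ISO timestamp string")
    if fm["qa_date"] != "Pending" and not _parses(date.fromisoformat, fm["qa_date"]):
        problems.append("qa_date must be 'YYYY-MM-DD' or 'Pending'")
    # Assumption: deltas look like '+0 words' or '-12 words'.
    delta = str(fm["qa_delta_from_pdf"])
    if delta != "Pending" and not re.fullmatch(r"[+-]\d+ words", delta):
        problems.append("qa_delta_from_pdf must be a signed word delta or 'Pending'")
    if not isinstance(fm["tags"], list):
        problems.append("tags must be a list")
    return problems
```

An unquoted `era: 2026` would typically load as an integer and be reported, which is why the table asks for a quoted string.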
Source-of-Truth Contract

- `Sources/*/Markdown/*.md` is the working source-of-truth text corpus.
- `Sources/BMP_*/*.pdf` and `Sources/NJAC_*/*.pdf` are the paired canonical binary sources.
- Page markers in extracted markdown must preserve the source PDF page boundaries.

Tag Taxonomy

Always present

| Tag | Meaning |
|---|---|
| `extracted` | Markdown extraction artifact |
| `source-pdf` | Has a paired PDF in the same family root |
| `stormwater` | Stormwater source corpus |
| `family-njbmp` / `family-njac` | Source family tag |
| `era-2023` / `era-2026` | Era tag |
| `src-<registry-id>` | Stable origin-link tag, for example src-2026-bmp-11-3 or src-njac-7-8-2023 |
| `<normalized-topic-tag>` | Deterministic title/topic tag derived from the registry title or source heading |

Family and structure tags

| Tag | Used when |
|---|---|
| `chapter-<n>` | BMP chapter-level docs, for example chapter-7 |
| `section-<chapter>-<section>` | BMP section docs, for example section-10-1 |
| `appendix` / `appendix-<letter>` | BMP appendices |
| `rule-<normalized-citation>` | NJAC rule-structure tag, for example rule-7-8, rule-7-9a, rule-7-1e-1-20-26 |
| `checklist` / `fact-sheet` / `guidance` / `dphs` | Deterministic supplemental tags when applicable |

Tags are lowercase, ASCII, deduplicated, and sorted deterministically.
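One way to satisfy that rule is sketched below. The normalizer's actual rules are not shown in this document, so the hyphen-collapsing step and the lexicographic sort order are assumptions, and `normalize_tag` / `normalize_tags` are hypothetical names.

```python
import re
import unicodedata


def normalize_tag(raw: str) -> str:
    """Lowercase one tag, fold it to ASCII, and hyphenate everything else."""
    # Fold accented characters to their ASCII base; drop what cannot be mapped.
    ascii_text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    # Collapse each run of characters outside [a-z0-9] into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def normalize_tags(raw_tags) -> list:
    """Normalize every tag, drop duplicates and empties, and sort the result."""
    # A set removes duplicates; sorted() yields a repeatable lexicographic order.
    return sorted({normalize_tag(tag) for tag in raw_tags} - {""})
```

Under these assumptions, `normalize_tags(["Stormwater", "era-2026", "stormwater", "Chapter 7"])` returns `['chapter-7', 'era-2026', 'stormwater']`, and `"rule-" + normalize_tag("7:8")` gives `rule-7-8`.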
Deprecated and no longer emitted by the normalizer:

- `source-ocr`
- `source-of-truth`
- `family-2026-bmp` / similar era-combined family tags
- `2026-era` / `2023-era`
- `bmp-manual`
- `njac`
- `regulatory`
- `supplemental-source`

What Does Not Belong in Frontmatter

QA audit-trail details belong in family-level `0_*QAPASS.md` files, not inline:

| Field | Where it lives instead |
|---|---|
| `qa_status` | `0_*QAPASS.md` pass summary |
| `qa_checked_on` | superseded by `qa_date` |
| `qa_checks_passed` | `0_*QAPASS.md` summary |
| `qa_report` | family-level QA summary or corpus QA report |
| `patched_on` | family-level QA summary if needed |
| `patch_reason` | family-level QA summary if needed |
| `reextract_reason` | no longer used in live extracted frontmatter |
| `pdf_filename` | redundant; use `source_pdf` |

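A small lint for these retired keys might look like the following. It is a sketch only: it assumes the frontmatter is already parsed into a `dict`, and `RETIRED_FIELDS` and `find_retired_fields` are hypothetical names rather than part of the normalizer.

```python
# Retired frontmatter keys mapped to where the information lives instead.
RETIRED_FIELDS = {
    "qa_status": "0_*QAPASS.md pass summary",
    "qa_checked_on": "superseded by qa_date",
    "qa_checks_passed": "0_*QAPASS.md summary",
    "qa_report": "family-level QA summary or corpus QA report",
    "patched_on": "family-level QA summary if needed",
    "patch_reason": "family-level QA summary if needed",
    "reextract_reason": "no longer used in live extracted frontmatter",
    "pdf_filename": "redundant; use source_pdf",
}


def find_retired_fields(fm: dict) -> dict:
    """Return each retired key present in fm with a pointer to its new home."""
    return {key: RETIRED_FIELDS[key] for key in fm if key in RETIRED_FIELDS}
```

An empty result means the frontmatter carries none of the retired keys.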
Raw Evidence Utility

- `Sources/*/raw/` is the repo-local OCR evidence backend for the live source corpus.
- Each `Sources/*/raw/<doc_id>/` bundle should contain `manifest.json`, `response.json`, and `pages/` when page images were emitted.
- `manifest.json` uses the portable `raw_manifest.v2` schema and is the preferred metadata entrypoint for QA and citations.
- `raw_manifest.v2` stores `Sources`-root-relative `source_pdf_path`, `markdown_path`, `bundle_dir`, `response_path`, and optional `pages_dir`.
- `response.json` is the full OCR payload for deep inspection, not the default aggregation source.
- `Sources/citations/raw_index.json` is the generated evidence index used by family QA reports and future citation tooling.
- Raw evidence supports QA and citation maintenance; it is not authored content and should override stale prose elsewhere in `Sources/citations/` when conflicts appear.

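The bundle layout above can be sanity-checked with a few lines of `pathlib`. This is a sketch under assumptions: the `raw_manifest.v2` schema is not reproduced here, so the code assumes the five path fields are flat top-level keys in `manifest.json`, and both function names are hypothetical.

```python
import json
from pathlib import Path

MANIFEST_PATH_KEYS = (
    "source_pdf_path", "markdown_path", "bundle_dir", "response_path", "pages_dir",
)


def check_bundle(bundle_dir: Path) -> list:
    """Report required files missing from one Sources/*/raw/<doc_id>/ bundle."""
    problems = []
    # pages/ is only present when page images were emitted, so it is not required.
    for name in ("manifest.json", "response.json"):
        if not (bundle_dir / name).is_file():
            problems.append(f"missing {name}")
    return problems


def resolve_manifest_paths(manifest: dict, sources_root: Path) -> dict:
    """Resolve the Sources-root-relative manifest paths against sources_root."""
    # Assumption: the path fields are flat top-level keys; absent ones are skipped.
    return {
        key: sources_root / manifest[key]
        for key in MANIFEST_PATH_KEYS
        if manifest.get(key)
    }


def load_manifest(bundle_dir: Path) -> dict:
    """Read manifest.json, the preferred metadata entrypoint for a bundle."""
    return json.loads((bundle_dir / "manifest.json").read_text(encoding="utf-8"))
```

A QA script could call `check_bundle` first and then read `manifest.json` only for bundles that report no problems, leaving `response.json` for deep inspection.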
Citations Identity Model

- `Sources/citations/source_document_registry.json` is the authoritative identity ledger.
- `registry_id` is the stable public identifier for joins and human-facing citation references.
- `doc_id` is the canonical extracted markdown / raw bundle identifier.
- Generated citations artifacts should carry both identifiers when they emit live source references.
- See `Sources/citations/README.md` for the full rebuild order and operating contract.