Skip to content

docs+scripts: data provenance map + build scripts for the 6 unscripted derived parquet #264

Open
rdhyee wants to merge 1 commit into isamplesorg:main from rdhyee:docs/data-provenance

Contributor

rdhyee commented Jun 3, 2026

Closes the biggest provenance gap surfaced while scoping #256/#263.

(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar merge → 7 frontend derived files → R2/Worker), with the script/command + file:line per stage. Documents the load-bearing constraint: the iSamples export is frozen (Central Solr API offline since Aug 2025), so new per-source data (e.g. concept URIs for #263) must arrive via the pid sidecar merge (precedent: enrich_wide_with_oc_thumbnails.py), not a re-export. Folds the sidecar pattern into the repo (was Obsidian-only).
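A minimal sketch of the pid sidecar merge idea described above, in plain Python rather than the DuckDB SQL the real pipeline presumably uses. The function name `merge_sidecar`, the `concept_uri` column, and the sample pids are hypothetical and chosen for illustration only; the one thing taken from the description is that new columns are joined onto the frozen base by `pid` and the base rows themselves are never re-exported or dropped.

```python
def merge_sidecar(base_rows, sidecar_rows, key="pid"):
    """Left-join sidecar columns onto base rows by `key`.

    Every base row is kept. Rows with no sidecar match get None
    for the new columns. Base values win on a column-name clash,
    so the frozen export is never overwritten.
    """
    sidecar_by_key = {row[key]: row for row in sidecar_rows}
    # Columns the sidecar contributes, excluding the join key itself.
    new_cols = sorted({col for row in sidecar_rows for col in row} - {key})

    merged = []
    for row in base_rows:
        extra = sidecar_by_key.get(row[key], {})
        out = dict(row)
        for col in new_cols:
            out.setdefault(col, extra.get(col))
        merged.append(out)
    return merged


# Hypothetical sample data: two base records, one of which has sidecar data.
base = [
    {"pid": "ark:/1", "source": "OPENCONTEXT"},
    {"pid": "ark:/2", "source": "SESAR"},
]
sidecar = [{"pid": "ark:/1", "concept_uri": "https://example.org/concept/a"}]

merged = merge_sidecar(base, sidecar)
```

The left-join shape matters here: a sidecar that covers only one source must not shrink the base row count.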

(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that previously had no checked-in build (notebook SQL only): sample_facets_v2, samples_map_lite, wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one wide input (DuckDB + h3 + spatial), with --validate-against.

Validation (built from 202604 wide vs published 202601):

| file | built | published | result |
|---|---|---|---|
| sample_facets_v2 | 5,980,282 | 5,980,282 | ✅ exact |
| samples_map_lite | 5,980,282 | 5,980,282 | ✅ exact |
| h3_summary_res4/6/8 | 38,406 / 111,681 / 175,653 | same | ✅ exact |
| facet_summaries | 59 | 56 | schema ✅; +3 (version delta) |
| facet_cross_filter | 612 | 526 | schema ✅; superset (self-pairs) |

All schemas match. Docs-only + a new build script — no runtime/site change.
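For readers unfamiliar with what a schema-plus-count comparison involves, here is a rough sketch of the kind of check `--validate-against` is described as doing. It is not the script's actual code: the `compare` function, the dictionary layout, and the column names are assumptions, and only the row counts are taken from the validation table.

```python
def compare(built, published):
    """Summarise how a built file compares with its published counterpart.

    Each argument is a dict with a 'schema' (ordered list of column names)
    and a 'rows' count. Schema is checked first, then the row delta.
    """
    if built["schema"] != published["schema"]:
        return "schema mismatch"
    delta = built["rows"] - published["rows"]
    if delta == 0:
        return "exact"
    return f"schema ok, {delta:+,} rows"


# Column names below are placeholders, not the real schemas.
facets_schema = ["pid", "source", "material"]
summary_schema = ["facet", "value", "count"]

exact = compare(
    {"schema": facets_schema, "rows": 5_980_282},
    {"schema": facets_schema, "rows": 5_980_282},
)
drifted = compare(
    {"schema": summary_schema, "rows": 59},
    {"schema": summary_schema, "rows": 56},
)
```

Checking the schema before the count keeps a row-count match from masking a changed column list.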

🤖 Generated with Claude Code

… derived parquet

(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.

(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.

Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.
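
One possible way to reconcile the cross-filter delta is sketched below. It assumes, without confirmation from the source, that a "self-pair" is a row pairing a facet dimension with itself and that each row can be keyed by a `(facet_a, facet_b)` tuple. The function name and sample pairs are hypothetical.

```python
def split_extras(built_pairs, published_pairs):
    """Classify the difference between built and published pair sets.

    Returns (missing, self_pairs, other_extras):
      missing      - pairs published but absent from the build
      self_pairs   - extra pairs where both sides are the same facet
      other_extras - extra pairs that pruning self-pairs would not explain
    """
    built, published = set(built_pairs), set(published_pairs)
    missing = published - built
    extras = built - published
    self_pairs = {pair for pair in extras if pair[0] == pair[1]}
    return missing, self_pairs, extras - self_pairs


# Hypothetical pairs: the build adds one self-pair on top of the published set.
published = [("source", "material"), ("material", "source")]
built = published + [("source", "source")]

missing, self_pairs, other_extras = split_extras(built, published)
```

If `missing` and `other_extras` both come back empty, dropping the self-pairs from the build would restore exact parity. A non-empty `other_extras` would point to the version gap instead.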

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>