docs+scripts: data provenance map + build scripts for the 6 unscripted derived parquet #264
Open
rdhyee wants to merge 1 commit into
… derived parquet
(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.
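
The pid sidecar merge described in (a) amounts to a left join keyed on pid. The following is a minimal stdlib-only sketch of that idea, not the repo's implementation: the function name `merge_sidecar`, the column `thumbnail_url`, and the sample pids are all hypothetical.

```python
def merge_sidecar(base_rows, sidecar_rows, key="pid"):
    """Left-join sidecar columns onto base rows by `key` (hypothetical helper).

    Every base row is kept. Base rows with no sidecar match get None for
    each sidecar column, mirroring NULLs in a left join.
    """
    sidecar_by_key = {}
    sidecar_columns = []
    for row in sidecar_rows:
        k = row[key]
        if k in sidecar_by_key:
            # A duplicated key would silently fan out rows in a real join.
            raise ValueError(f"duplicate {key} in sidecar: {k!r}")
        sidecar_by_key[k] = row
        for col in row:
            if col != key and col not in sidecar_columns:
                sidecar_columns.append(col)

    merged = []
    for row in base_rows:
        match = sidecar_by_key.get(row[key], {})
        extra = {col: match.get(col) for col in sidecar_columns}
        merged.append({**row, **extra})
    return merged


# Illustrative rows only; the pids and URL are made up.
base = [
    {"pid": "ark:/00000/a1", "source": "OPENCONTEXT"},
    {"pid": "ark:/00000/b2", "source": "SESAR"},
]
sidecar = [{"pid": "ark:/00000/a1", "thumbnail_url": "https://example.org/a1.jpg"}]
merged = merge_sidecar(base, sidecar)
```

The point of the pattern is that the frozen export is never rewritten: new per-source columns live in a small pid-keyed file and are joined on at build time.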
(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.
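
A schema+counts diff of the kind --validate-against performs could look roughly like the sketch below. It assumes each file has already been summarised as a column list plus a row count; the function name `diff_against_published`, the summary layout, and the table names and numbers in the example are hypothetical, and the real script may report differently.

```python
def diff_against_published(built, published):
    """Compare per-table schema and row count (hypothetical helper).

    Both arguments map table name -> {"schema": [(column, type), ...], "rows": int}.
    Returns table name -> report dict.
    """
    report = {}
    for name in sorted(set(built) | set(published)):
        b = built.get(name)
        p = published.get(name)
        if b is None or p is None:
            # Present on one side only: nothing to diff.
            report[name] = {"status": "missing_in_built" if b is None else "missing_in_published"}
            continue
        schema_match = b["schema"] == p["schema"]
        row_delta = b["rows"] - p["rows"]
        report[name] = {
            "status": "exact" if schema_match and row_delta == 0 else "differs",
            "schema_match": schema_match,
            "row_delta": row_delta,
        }
    return report


# Made-up summaries for illustration; not the real files or counts.
built = {
    "table_a": {"schema": [("pid", "VARCHAR"), ("n", "BIGINT")], "rows": 10},
    "table_b": {"schema": [("facet", "VARCHAR")], "rows": 7},
}
published = {
    "table_a": {"schema": [("pid", "VARCHAR"), ("n", "BIGINT")], "rows": 10},
    "table_b": {"schema": [("facet", "VARCHAR")], "rows": 5},
}
report = diff_against_published(built, published)
```

Reporting the schema match and the row delta separately makes a result like "schema-correct, +3 rows" distinguishable from a genuine schema drift.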
Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct; their small count deltas come from
the 202604-vs-202601 version gap and from the original cross-filter build
pruning self-pairs (this build is an exhaustive superset). They can be
reconciled if exact parity is needed.
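
To illustrate why an exhaustive cross-filter is a superset of one that prunes self-pairs, here is a small sketch. It assumes "self-pair" means a facet dimension crossed with itself, which is one reading of the note above; the facet names are placeholders and not taken from the build script.

```python
from itertools import permutations, product

# Placeholder facet dimensions (assumption, for illustration only).
FACETS = ["source", "material", "context", "object_type"]

# Exhaustive: every ordered (filter_facet, target_facet) pair, including
# a facet paired with itself.
exhaustive = set(product(FACETS, repeat=2))

# Pruned: the same ordered pairs with self-pairs dropped.
pruned = set(permutations(FACETS, 2))

# The difference is exactly the self-pairs, one per facet.
self_pairs = exhaustive - pruned
```

Under this reading, reconciling to exact parity would mean filtering the self-pair rows out of the exhaustive output, in addition to accounting for the version gap.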
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Closes the biggest provenance gap surfaced while scoping #256/#263.
(a) `DATA_PROVENANCE.md`: end-to-end build chain (export → base PQG → sidecar merge → 7 frontend derived files → R2/Worker), with the script/command + file:line per stage. Documents the load-bearing constraint: the iSamples export is frozen (Central Solr API offline since Aug 2025), so new per-source data (e.g. concept URIs for #263) must arrive via the pid sidecar merge (precedent: `enrich_wide_with_oc_thumbnails.py`), not a re-export. Folds the sidecar pattern into the repo (was Obsidian-only).

(c) `scripts/build_frontend_derived.py`: reproduces the 6 derived files that previously had no checked-in build (notebook SQL only): `sample_facets_v2`, `samples_map_lite`, `wide_h3`, `h3_summary_res{4,6,8}`, `facet_summaries`, `facet_cross_filter`, all from one `wide` input (DuckDB + h3 + spatial), with `--validate-against`.

Validation (built from 202604 wide vs published 202601):
All schemas match. Docs-only + a new build script; no runtime/site change.
🤖 Generated with Claude Code