[codex] Add benchmark suite, model onboarding pipeline, and recent CTR models#556
Draft
shenweichen wants to merge 15 commits into
Add a benchmarks/ package with single-task, multitask, and sequence track runners, dataset loaders, metrics, and a model registry, plus a fast smoke test in tests/benchmark_test.py. Ignore downloaded datasets and generated leaderboards via .gitignore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark split correctness:
- Census multitask now uses the dataset's official train/test partition (census-income.data / .test) when present, instead of randomly reshuffling the two together; encoders are fit on the union so test-only categories don't crash the run. Falls back to a random split for the bundled sample.
- Single-task gains --temporal-split / --time-col: a chronological hold-out (most recent test_size as test) for time-ordered logs, to avoid look-ahead leakage. Default stays random (Criteo_x1/DAC is anonymised and shuffled with no timestamp, so random is the standard leakage-free split).
- Add tests covering both guarantees; document the flags in benchmarks/README.
Also: benchmarks/RESULTS.md captures the real-data leaderboards (Criteo 500k single-task, official-split Census multitask), and a project skill .claude/skills/deepctr-benchmark records how to re-run the suite quickly (CPU-only on this host, data locations, split semantics).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
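The chronological hold-out described for --temporal-split can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the function name `temporal_split`, the row format (a list of dicts), and the rounding rule for the test size are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def temporal_split(
    rows: Sequence[Row], time_col: str, test_size: float = 0.2
) -> Tuple[List[Row], List[Row]]:
    """Hold out the most recent `test_size` fraction of rows as the test set."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be in (0, 1)")
    # sorted() is stable, so rows with equal timestamps keep their input order.
    ordered = sorted(rows, key=lambda r: r[time_col])
    # Assumption: round to the nearest row count and keep at least one test row.
    n_test = max(1, int(round(len(ordered) * test_size)))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical log with an unordered timestamp column "ts".
rows = [{"ts": t, "label": t % 2} for t in (5, 1, 4, 2, 3, 9, 7, 8, 6, 10)]
train, test = temporal_split(rows, time_col="ts", test_size=0.2)

# No training row is newer than any test row, so there is no look-ahead.
assert max(r["ts"] for r in train) <= min(r["ts"] for r in test)
```

A random split would instead mix future rows into the training set, which is harmless for shuffled data without timestamps but leaks information for time-ordered logs.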
DIN/BST/DSIN reach 0.79-0.80 AUC on 2.28M real behavior samples (vs ~0.5 synthetic). Document the ml25m_seq_2m.csv build steps and the CPU-only caveat in the skill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- FinalMLP (AAAI 2023): two feature-gated MLP streams fused by a new
InteractionAggregation bilinear head (deepctr/layers/finalmlp.py).
- MaskNet (DLP-KDD 2021): parallel instance-guided MaskBlocks.
- OneTrans: finish wiring (custom_objects, deepctr.models export, __all__)
and fix serialization bugs surfaced by save/load round-trip:
* positional add_weight -> keyword name=
* tensorflow.python.keras -> tensorflow.keras imports
* unserializable Lambda slice -> TokenSlice layer
* PositionEncoding class-name collision -> SinusoidalPositionEncoding
- Register OneTransLayer/TokenSlice/SinusoidalPositionEncoding/
InteractionAggregation in layers.custom_objects; add unit tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Standalone CLI standardizing new-model adoption across four stages: discover -> scaffold -> verify -> docs (+ audit, onboard).
- scaffold: codegen model/test/layer skeletons and auto-wire all 6 registration points (models __init__/__all__, sub-package __init__, layers custom_objects, benchmarks/registry.py) idempotently.
- audit: validate every discovered model is fully wired; catches half-integrated models (e.g. the original OneTrans state).
- verify: correctness (unit test + audit) + effectiveness (benchmark AUC vs in-track baseline, compared to paper metric); writes reports/.
- discover: candidate knowledge base (candidates.json) + web-research prompt; reconciles status against live deepctr.models.
- Register FinalMLP/MaskNet/OneTrans builders in benchmarks/registry.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
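The two properties claimed for scaffold and audit (idempotent wiring, and detection of half-wired models) can be illustrated with a small sketch. The helper names `wire_once` and `audit`, the in-memory dict of registration points, and the placeholder registration line are hypothetical; the real CLI presumably edits files and checks them more precisely than a substring match.

```python
from typing import Dict, List


def wire_once(text: str, line: str) -> str:
    """Append `line` to `text` unless an identical line is already present."""
    if line in text.splitlines():
        return text  # already wired: a second run changes nothing
    sep = "" if text == "" or text.endswith("\n") else "\n"
    return text + sep + line + "\n"


def audit(model: str, points: Dict[str, str]) -> List[str]:
    """Return the registration points whose text never mentions `model`.

    Assumption: a substring check stands in for real parsing, so a name
    that is a prefix of another model name would be a false negative.
    """
    return [name for name, text in points.items() if model not in text]


# Hypothetical, simplified contents of three registration points.
points = {
    "models __init__": "from .wukong import WuKong\n",
    "layers custom_objects": "",
    "benchmarks/registry.py": "",
}

missing = audit("WuKong", points)  # the two empty points are reported
for name in missing:
    points[name] = wire_once(points[name], "# WuKong registered (placeholder)")

assert audit("WuKong", points) == []
```

Running the wiring loop a second time leaves `points` unchanged, which is the behaviour "idempotently" refers to.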
Add model-table rows, Features.md sections, autodoc rst stubs + toctree entries, History changelog lines, and RESULTS entries (generated via benchmarks.onboard docs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- AGENTS.md: env setup (tf-keras + TF_USE_LEGACY_KERAS=1 + CUDA off), test commands, onboarding flow, and gotchas for any coding agent.
- discover/README: replace Claude-specific wording (deep-research skill, WebSearch) with tool-agnostic phrasing so Codex etc. can drive it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
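For readers who script the environment rather than export variables in a shell, the setup named above might look like the following. TF_USE_LEGACY_KERAS=1 comes from the commit message; using CUDA_VISIBLE_DEVICES=-1 for "CUDA off" is an assumption about how the repository disables the GPU, and it is only one common way to do so.

```python
import os

# Set both variables before TensorFlow is imported.
# With the tf-keras package installed, this routes tf.keras to Keras 2.
os.environ["TF_USE_LEGACY_KERAS"] = "1"
# Assumption: hide all CUDA devices so the run is CPU-only.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # import only after the variables above are set
```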
Move the design plan into the repo so it survives across machines; link it from the onboard README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the FuxiCTR/BARS model zoo and recent arXiv (2022-2025):
- single: FinalNet (SIGIR'23), WuKong (ICML'24), FCN (2024), QNN (KDD'25), APG (NeurIPS'22)
- sequence: TransAct (KDD'23), TWIN (KDD'23)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- WuKongLayer: stackable FMB (X X^T -> MLP) + LCB (linear embedding recombination) with residual; registered in custom_objects.
- WuKong model: stacks N WuKong layers over field embeddings + DNN head.
- Wire registry + tests + docs; mark implemented in candidate KB.
- audit: 33/33 models fully wired.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
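One way to read the WuKongLayer description as equations is given below. This is a sketch inferred from the commit message only: the symbols n, d, n_F, n_L and W_L are introduced here for illustration, the reshape step and the shape constraint that makes the residual well-defined are assumptions, and any normalization or projection the actual layer applies is not shown. Here X is the n x d matrix of field embeddings.

```latex
\begin{aligned}
\mathrm{FMB}(X) &= \mathrm{reshape}\big(\mathrm{MLP}(\mathrm{flatten}(X X^{\top}))\big) \in \mathbb{R}^{n_F \times d},\\
\mathrm{LCB}(X) &= W_L X \in \mathbb{R}^{n_L \times d}, \qquad W_L \in \mathbb{R}^{n_L \times n},\\
X' &= \mathrm{concat}\big(\mathrm{FMB}(X),\, \mathrm{LCB}(X)\big) + X, \qquad n_F + n_L = n.
\end{aligned}
```

Because the output X' has the same shape as X under this assumption, N such layers can be stacked before the DNN head.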
The criteo-labs S3 mirror (and go.criteo.net) now return 404. Stop defaulting --source download to a dead URL: DEFAULT_DAC_URL is now None, and the loader prints the current sources (HF Criteo_x1 / CriteoClickLogs / Criteo AI Lab) and points users to --data-path, while still honoring an explicit --download-url mirror. Falls back to the bundled sample as before.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
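The resolution order described in this commit can be sketched as follows. Only DEFAULT_DAC_URL, --data-path, --download-url, and the three source names come from the commit message; the function `resolve_source`, its return convention, and the choice to check the local path first are assumptions made for illustration.

```python
import os
from typing import Optional, Tuple

DEFAULT_DAC_URL: Optional[str] = None  # no default mirror any more

CURRENT_SOURCES = ("HF Criteo_x1", "CriteoClickLogs", "Criteo AI Lab")


def resolve_source(
    data_path: Optional[str], download_url: Optional[str], sample_path: str
) -> Tuple[str, str]:
    """Return (kind, location), where kind is 'local', 'download' or 'sample'."""
    # Assumption: an existing --data-path takes precedence over downloading.
    if data_path and os.path.exists(data_path):
        return "local", data_path
    url = download_url or DEFAULT_DAC_URL
    if url:
        return "download", url  # an explicit --download-url is still honored
    print("No default download URL. Current sources:", ", ".join(CURRENT_SOURCES))
    print("Download the data manually and pass its location with --data-path.")
    return "sample", sample_path  # fall back to the bundled sample
```

With neither a usable path nor a mirror, the call prints the hint and returns the sample location instead of attempting a request against a dead URL.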
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| Security | 12 high, 6 medium, 1 minor |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 586 |
| Duplication | 14 |
What changed
- benchmarks.onboard workflow for discovery, scaffolding, wiring audits, verification, and documentation
- scaffold --with-layer generates a correctness-test skeleton, and verify runs it
Why
Adding a model previously required manually updating several scattered registration, testing, benchmark, serialization, and documentation points. Existing smoke tests also proved that models could run, but not that paper equations, gradients, invariants, or serialization semantics were correct. This change makes onboarding repeatable and adds layered correctness evidence.
Impact
Contributors can scaffold and verify new CTR models through one CLI workflow. Users gain four recent models, reproducible benchmark tooling, and stronger regression protection across all existing models.
Validation
- git diff --check passes
Large-sample effectiveness checks
Criteo_x1 2M, CPU, 1 epoch, batch 1024:
MovieLens-25M, 2.28M behavior samples, CPU, 1 epoch, batch 1024:
These benchmark numbers validate end-to-end effectiveness under a fixed protocol; equation-level correctness is covered separately by the semantic contracts above.