[codex] Add benchmark suite, model onboarding pipeline, and recent CTR models by shenweichen · Pull Request #556 · shenweichen/DeepCTR

shenweichen · 2026-06-28T03:59:57Z

What changed

- add a reusable benchmark suite for single-task, multitask, and sequence models
- add the `benchmarks.onboard` workflow for discovery, scaffolding, wiring audits, verification, and documentation
- integrate OneTrans, FinalMLP, MaskNet, and WuKong, including serializable custom layers
- add a reusable C0-C6 correctness-testing contract:
  - deterministic fixtures
  - independent NumPy reference equations
  - finite and numerical gradients
  - semantic invariants such as masking and causality
  - prediction-equivalent weight and full-model serialization
- make `scaffold --with-layer` generate a correctness-test skeleton, and make `verify` run it
- add model tests, benchmark tests, verification reports, docs, and contributor guidance
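
As an illustration of the "finite and numerical gradients" item in the contract, here is a minimal sketch of such a check. It is not the PR's test code: the toy model (a sigmoid over a dot product), the helper names, and the tolerance are all assumptions, and it uses only the standard library where the PR uses NumPy references.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x):
    """Toy scalar model: sigmoid(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def analytic_grad(w, x):
    """Closed-form gradient of forward() w.r.t. w: p * (1 - p) * x."""
    p = forward(w, x)
    return [p * (1.0 - p) * xi for xi in x]


def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        plus = list(w)
        minus = list(w)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


def check_gradients(w, x, tol=1e-6):
    """True if the analytic gradient is finite and matches the numerical one."""
    ana = analytic_grad(w, x)
    num = numerical_grad(lambda v: forward(v, x), w)
    finite = all(math.isfinite(g) for g in ana)
    close = all(abs(a - n) <= tol for a, n in zip(ana, num))
    return finite and close


# Deterministic fixture (hypothetical values): no random inputs, so a failure
# is reproducible.
W = [0.2, -0.5, 0.1]
X = [1.0, 2.0, -3.0]
GRADS_OK = check_gradients(W, X)
```

Fixed inputs keep the check reproducible, which is the role the deterministic fixtures play in the contract.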

Why

Adding a model previously required manually updating several scattered registration, testing, benchmark, serialization, and documentation points. Existing smoke tests also proved that models could run, but not that paper equations, gradients, invariants, or serialization semantics were correct. This change makes onboarding repeatable and adds layered correctness evidence.

Impact

Contributors can scaffold and verify new CTR models through one CLI workflow. Users gain four recent models, reproducible benchmark tooling, and stronger regression protection across all existing models.

Validation

- dynamic onboarding audit: 33/33 models fully wired
- semantic correctness examples: 8 passed across reference equations, gradients, exact masks, causal invariance, and EDCN deterministic cases
- four new models plus multitask serialization contracts: 14 passed
- full suite first pass: 171 passed, with the stricter finite-output contract exposing one random-input EDCN NaN; a deterministic fixture was added and all four EDCN bridge configurations then passed
- scaffold templates compile and `git diff --check` passes
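
The "exact masks" and "causal invariance" checks can be read as black-box properties of a sequence operator. The sketch below is an assumption-laden illustration, not the PR's tests: `causal_masked_mean`, `is_causal`, and `ignores_masked` are hypothetical names, and the toy operator stands in for a real sequence layer.

```python
def causal_masked_mean(seq, mask):
    """Output t is the mean of valid (mask == 1) inputs at positions <= t.

    Positions with no valid input so far yield 0.0.
    """
    out = []
    total, count = 0.0, 0
    for value, keep in zip(seq, mask):
        if keep:
            total += value
            count += 1
        out.append(total / count if count else 0.0)
    return out


def full_mean(seq, mask):
    """Non-causal control: every position sees the mean of ALL valid inputs."""
    valid = [v for v, keep in zip(seq, mask) if keep]
    mean = sum(valid) / len(valid)
    return [mean] * len(seq)


def is_causal(fn, seq, mask, t, new_value=1e3):
    """Changing inputs after position t must leave outputs[0..t] unchanged."""
    perturbed = list(seq)
    for i in range(t + 1, len(seq)):
        perturbed[i] = new_value
    return fn(seq, mask)[: t + 1] == fn(perturbed, mask)[: t + 1]


def ignores_masked(fn, seq, mask, new_value=-1e3):
    """Changing values at masked-out positions must not change any output."""
    perturbed = [v if keep else new_value for v, keep in zip(seq, mask)]
    return fn(seq, mask) == fn(perturbed, mask)


SEQ = [1.0, 2.0, 3.0, 4.0]
MASK = [1, 1, 0, 1]  # position 2 is padding
```

Including a deliberately non-causal control such as `full_mean` shows that the property test can fail, which a pure smoke test cannot demonstrate.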

Large-sample effectiveness checks

Criteo_x1 2M, CPU, 1 epoch, batch 1024:

Model	AUC	LogLoss	Train
MaskNet	0.7908	0.4544	41.9s
DeepFM	0.7892	0.4559	32.6s
FinalMLP	0.7889	0.4562	32.6s
WuKong	0.7842	0.4614	61.5s

MovieLens-25M, 2.28M behavior samples, CPU, 1 epoch, batch 1024:

Model	AUC	LogLoss	Train
OneTrans	0.8056	0.5347	114.5s
DIN	0.8030	0.5375	12.4s

These benchmark numbers validate end-to-end effectiveness under a fixed protocol; equation-level correctness is covered separately by the semantic contracts above.
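
For readers unfamiliar with the two reported metrics, the following sketch shows how AUC and LogLoss are defined. It is a dependency-free reference under stated assumptions, not the metrics code in the benchmarks package: the function names and the clipping constant are illustrative, and the pairwise AUC is quadratic in the sample count, so it is only suitable for small checks.

```python
import math


def log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. O(P * N) pairwise form, for small inputs only.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Tiny hand-checkable example (hypothetical labels and scores).
LABELS = [0, 0, 1, 1]
SCORES = [0.1, 0.4, 0.35, 0.8]
```

Higher AUC and lower LogLoss are better, which is the ordering the tables above follow.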

Add a benchmarks/ package with single-task, multitask, and sequence track runners, dataset loaders, metrics, and a model registry, plus a fast smoke test in tests/benchmark_test.py. Ignore downloaded datasets and generated leaderboards via .gitignore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Benchmark split correctness:
- Census multitask now uses the dataset's official train/test partition (census-income.data / .test) when present, instead of randomly reshuffling the two together; encoders are fit on the union so test-only categories don't crash the run. Falls back to a random split for the bundled sample.
- Single-task gains --temporal-split / --time-col: a chronological hold-out (most recent test_size as test) for time-ordered logs, to avoid look-ahead leakage. Default stays random (Criteo_x1/DAC is anonymised and shuffled with no timestamp, so random is the standard leakage-free split).
- Add tests covering both guarantees; document the flags in benchmarks/README.

Also: benchmarks/RESULTS.md captures the real-data leaderboards (Criteo 500k single-task, official-split Census multitask), and a project skill .claude/skills/deepctr-benchmark records how to re-run the suite quickly (CPU-only on this host, data locations, split semantics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
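
The chronological hold-out described in this commit can be sketched as follows. This is a minimal illustration, not the loader's actual code: `temporal_split` and its row-of-dicts input are assumptions, and only the `test_size` and time-column concepts come from the commit message.

```python
def temporal_split(rows, time_col, test_size=0.2):
    """Chronological hold-out: the most recent `test_size` fraction is the test set.

    `rows` is a list of dicts; `time_col` names the timestamp field.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be in (0, 1)")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    # sorted() is stable, so rows with equal timestamps keep their input order.
    ordered = sorted(rows, key=lambda r: r[time_col])
    n_test = max(1, int(round(len(ordered) * test_size)))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical click log with shuffled timestamps.
ROWS = [{"ts": t, "click": t % 2} for t in [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]]
TRAIN, TEST = temporal_split(ROWS, "ts", test_size=0.2)
```

Because every training timestamp is no later than every test timestamp, the model is never evaluated on events that precede its training data, which is the look-ahead leakage the flag is meant to prevent.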

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DIN/BST/DSIN reach 0.79-0.80 AUC on 2.28M real behavior samples (vs ~0.5 on synthetic data). Document the ml25m_seq_2m.csv build steps and the CPU-only caveat in the skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- FinalMLP (AAAI 2023): two feature-gated MLP streams fused by a new InteractionAggregation bilinear head (deepctr/layers/finalmlp.py).
- MaskNet (DLP-KDD 2021): parallel instance-guided MaskBlocks.
- OneTrans: finish wiring (custom_objects, deepctr.models export, __all__) and fix serialization bugs surfaced by save/load round-trip:
  * positional add_weight -> keyword name=
  * tensorflow.python.keras -> tensorflow.keras imports
  * unserializable Lambda slice -> TokenSlice layer
  * PositionEncoding class-name collision -> SinusoidalPositionEncoding
- Register OneTransLayer/TokenSlice/SinusoidalPositionEncoding/InteractionAggregation in layers.custom_objects; add unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
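
To make the "bilinear head" concrete, here is a single-head reference of the bilinear fusion form described in the FinalMLP paper, logit = b + w1·o1 + w2·o2 + o1ᵀ W3 o2. It is a sketch under assumptions: the function name and plain-list inputs are illustrative, and it does not reproduce the InteractionAggregation layer in deepctr/layers/finalmlp.py, which may differ (for example by using multiple heads).

```python
def bilinear_fusion(o1, o2, w1, w2, W3, b=0.0):
    """Single-head bilinear fusion of two stream outputs into one logit.

    o1, w1: length-d1 lists; o2, w2: length-d2 lists; W3: d1 x d2 nested list.
    """
    linear = sum(a * x for a, x in zip(w1, o1)) + sum(a * y for a, y in zip(w2, o2))
    bilinear = sum(
        o1[i] * W3[i][j] * o2[j]
        for i in range(len(o1))
        for j in range(len(o2))
    )
    return b + linear + bilinear


# Hand-checkable fixture (hypothetical values).
O1, O2 = [1.0, 2.0], [3.0]
W1, W2 = [0.5, 0.5], [1.0]
W3 = [[1.0], [2.0]]
LOGIT = bilinear_fusion(O1, O2, W1, W2, W3, b=0.1)
```

With `W3` set to zeros the head reduces to a sum of two linear terms, so the bilinear term is what captures interactions between the two streams.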

Standalone CLI standardizing new-model adoption across four stages: discover -> scaffold -> verify -> docs (+ audit, onboard).
- scaffold: codegen model/test/layer skeletons and auto-wire all 6 registration points (models __init__/__all__, sub-package __init__, layers custom_objects, benchmarks/registry.py) idempotently.
- audit: validate every discovered model is fully wired; catches half-integrated models (e.g. the original OneTrans state).
- verify: correctness (unit test + audit) + effectiveness (benchmark AUC vs in-track baseline, compared to paper metric); writes reports/.
- discover: candidate knowledge base (candidates.json) + web-research prompt; reconciles status against live deepctr.models.
- Register FinalMLP/MaskNet/OneTrans builders in benchmarks/registry.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
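
The relationship between idempotent wiring and the audit can be sketched with an in-memory model. Everything here is hypothetical: the registration-point names, `wire`, and `audit` are illustrative stand-ins, and the real CLI edits source files rather than a dictionary.

```python
# Hypothetical registration points; the real pipeline wires six of them.
REGISTRATION_POINTS = ("models_export", "models_all", "benchmark_registry")


def wire(state, point, name):
    """Idempotently register `name` at `point`; repeated calls add nothing."""
    entries = state.setdefault(point, [])
    if name not in entries:
        entries.append(name)
    return state


def audit(state, models):
    """Return {model: [missing registration points]} for half-wired models."""
    report = {}
    for model in models:
        missing = [p for p in REGISTRATION_POINTS if model not in state.get(p, [])]
        if missing:
            report[model] = missing
    return report


# A half-integrated model: exported, but absent from the other two points.
STATE = {"models_export": ["OneTrans"]}
BEFORE = audit(STATE, ["OneTrans"])

for point in REGISTRATION_POINTS:
    wire(STATE, point, "OneTrans")
    wire(STATE, point, "OneTrans")  # second call is a no-op

AFTER = audit(STATE, ["OneTrans"])
```

Idempotence matters because scaffolding may be re-run on a partially wired model; a second run should repair what is missing without duplicating what is already present.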

Add model-table rows, Features.md sections, autodoc rst stubs + toctree entries, History changelog lines, and RESULTS entries (generated via benchmarks.onboard docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- AGENTS.md: env setup (tf-keras + TF_USE_LEGACY_KERAS=1 + CUDA off), test commands, onboarding flow, and gotchas for any coding agent.
- discover/README: replace Claude-specific wording (deep-research skill, WebSearch) with tool-agnostic phrasing so Codex etc. can drive it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the design plan into the repo so it survives across machines; link it from the onboard README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

From the FuxiCTR/BARS model zoo and recent arXiv (2022-2025):
- single: FinalNet (SIGIR'23), WuKong (ICML'24), FCN (2024), QNN (KDD'25), APG (NeurIPS'22)
- sequence: TransAct (KDD'23), TWIN (KDD'23)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- WuKongLayer: stackable FMB (X X^T -> MLP) + LCB (linear embedding recombination) with residual; registered in custom_objects.
- WuKong model: stacks N WuKong layers over field embeddings + DNN head.
- Wire registry + tests + docs; mark implemented in candidate KB.
- audit: 33/33 models fully wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The criteo-labs S3 mirror (and go.criteo.net) now return 404. Stop defaulting --source download to a dead URL: DEFAULT_DAC_URL is None, and the loader prints current sources (HF Criteo_x1 / CriteoClickLogs / Criteo AI Lab) and points users to --data-path, while still honoring an explicit --download-url mirror. Falls back to the bundled sample as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
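
The fallback order described in this commit might look like the sketch below. Only `DEFAULT_DAC_URL` and the `--data-path` / `--download-url` options come from the commit message; `resolve_source`, its return shape, the sample path, and the precedence of a local path over a mirror URL are assumptions made for illustration.

```python
DEFAULT_DAC_URL = None  # no default mirror, per the commit message

# Hypothetical hint text; the real loader lists current sources.
SOURCE_HINT = (
    "No default download URL is configured. Obtain the dataset from a current "
    "source and pass --data-path, or pass --download-url for a mirror."
)


def resolve_source(data_path=None, download_url=None, sample_path="sample.txt"):
    """Pick a data source: explicit local path, then explicit or default URL,
    then the bundled sample (with a hint about where to get the full data)."""
    if data_path:
        return ("local", data_path)
    url = download_url or DEFAULT_DAC_URL
    if url:
        return ("download", url)
    print(SOURCE_HINT)
    return ("sample", sample_path)
```

For example, `resolve_source(download_url="https://example.invalid/dac.tar.gz")` would select the mirror, while a call with no arguments prints the hint and falls back to the sample.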

codacy-production · 2026-06-28T04:04:28Z

Not up to standards ⛔

🔴 Issues 12 high · 6 medium · 1 minor

Alerts:
⚠ 19 issues (≤ 0 issues of at least minor severity)

Results:
19 new issues

Category	Results
Security	12 high · 6 medium · 1 minor

🟢 Metrics 586 complexity · 14 duplication

Metric	Results
Complexity	586
Duplication	14


shenweichen and others added 14 commits June 4, 2026 06:27

Translate deepctr-benchmark skill to Chinese

0c3f88b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Document full Criteo_x1 download + streaming slice in skill

9f2fb16

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add 2M Criteo_x1 benchmark leaderboard and cross-scale comparison

624794f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: document FinalMLP, MaskNet, OneTrans

0d39bb4

docs: add onboarding pipeline design doc (DESIGN.md)

fd357fa

test: add reusable correctness contracts

2389e67

shenweichen changed the title ~~[codex] Add benchmark suite, model onboarding pipeline, and four CTR models~~ [codex] Add benchmark suite, model onboarding pipeline, and recent CTR models Jun 28, 2026

shenweichen force-pushed the benchmarks branch from 1b653d1 to 2389e67 on June 28, 2026 14:57

shenweichen wants to merge 15 commits into master from benchmarks
