
[codex] Add benchmark suite, model onboarding pipeline, and recent CTR models #556

Draft
shenweichen wants to merge 15 commits into master from benchmarks



shenweichen (Owner) commented Jun 28, 2026

What changed

  • add a reusable benchmark suite for single-task, multitask, and sequence models
  • add the benchmarks.onboard workflow for discovery, scaffolding, wiring audits, verification, and documentation
  • integrate OneTrans, FinalMLP, MaskNet, and WuKong, including serializable custom layers
  • add a reusable C0-C6 correctness-testing contract:
    • deterministic fixtures
    • independent NumPy reference equations
    • finite and numerical gradients
    • semantic invariants such as masking and causality
    • prediction-equivalent weight and full-model serialization
  • make scaffold --with-layer generate a correctness-test skeleton and make verify run it
  • add model tests, benchmark tests, verification reports, docs, and contributor guidance
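As a rough illustration of the "independent reference equation" and "numerical gradient" levels of the contract, here is a minimal standard-library sketch. It is not the test code from this PR: the function names (`fm_fast`, `fm_reference`, `fm_grad`, `numerical_grad`) are hypothetical, and the FM pairwise-interaction term is used only because its closed form is well known.

```python
import random

def fm_fast(v):
    """'Implementation' under test: FM pairwise term via the sum-square identity."""
    total = 0.0
    for k in range(len(v[0])):
        s = sum(row[k] for row in v)
        sq = sum(row[k] ** 2 for row in v)
        total += 0.5 * (s * s - sq)
    return total

def fm_reference(v):
    """Independent reference: explicit sum of dot products over all field pairs."""
    n = len(v)
    return sum(
        sum(a * b for a, b in zip(v[i], v[j]))
        for i in range(n) for j in range(i + 1, n)
    )

def fm_grad(v):
    """Analytic gradient: d f / d v[i][k] = sum_j v[j][k] - v[i][k]."""
    dim = len(v[0])
    col = [sum(row[k] for row in v) for k in range(dim)]
    return [[col[k] - row[k] for k in range(dim)] for row in v]

def numerical_grad(f, v, eps=1e-5):
    """Central finite differences, perturbing one coordinate at a time."""
    grad = [[0.0] * len(v[0]) for _ in v]
    for i in range(len(v)):
        for k in range(len(v[0])):
            orig = v[i][k]
            v[i][k] = orig + eps
            up = f(v)
            v[i][k] = orig - eps
            down = f(v)
            v[i][k] = orig  # restore the fixture
            grad[i][k] = (up - down) / (2 * eps)
    return grad

rng = random.Random(0)  # deterministic fixture: fixed seed, small shape
emb = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

# Reference-equation check: both formulas must agree.
assert abs(fm_fast(emb) - fm_reference(emb)) < 1e-9

# Gradient check: analytic vs numerical.
analytic, numeric = fm_grad(emb), numerical_grad(fm_fast, emb)
max_err = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))
assert max_err < 1e-6
```

The point of keeping the reference separate is that it shares no code path with the implementation, so an algebra mistake in one is unlikely to be mirrored in the other.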

Why

Adding a model previously required manually updating several scattered registration, testing, benchmark, serialization, and documentation points. Existing smoke tests also proved that models could run, but not that paper equations, gradients, invariants, or serialization semantics were correct. This change makes onboarding repeatable and adds layered correctness evidence.

Impact

Contributors can scaffold and verify new CTR models through one CLI workflow. Users gain four recent models, reproducible benchmark tooling, and stronger regression protection across all existing models.

Validation

  • dynamic onboarding audit: 33/33 models fully wired
  • semantic correctness examples: 8 passed across reference equations, gradients, exact masks, causal invariance, and EDCN deterministic cases
  • four new models plus multitask serialization contracts: 14 passed
  • full suite, first pass: 171 passed; the stricter finite-output contract exposed one EDCN NaN on random input. After a deterministic fixture was added, all four EDCN bridge configurations passed
  • scaffold templates compile and git diff --check passes
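For readers unfamiliar with the "causal invariance" check mentioned above, the idea can be sketched as follows. This is an assumption-laden toy, not the PR's test: `causal_attention_pool` is a hypothetical scalar-token layer written only to show the shape of the assertion (perturbing future positions must leave earlier outputs bit-for-bit unchanged).

```python
import math
import random

def causal_attention_pool(seq):
    """Toy causal layer: position t attends only to positions <= t.

    Scores are products of scalar tokens; a softmax over the visible
    prefix weights the average that becomes the output at t.
    """
    out = []
    for t in range(len(seq)):
        scores = [seq[t] * seq[s] for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in scores]
        z = sum(weights)
        out.append(sum(w * seq[s] for s, w in enumerate(weights)) / z)
    return out

rng = random.Random(0)
seq = [rng.uniform(-1, 1) for _ in range(8)]

cut = 5  # positions >= cut are "the future" for everything before them
perturbed = seq[:cut] + [x + 10.0 for x in seq[cut:]]

base = causal_attention_pool(seq)
changed = causal_attention_pool(perturbed)

# Causality: outputs before the cut must be exactly equal, not merely close.
assert base[:cut] == changed[:cut]
# Sanity: the perturbation is visible at and after the cut.
assert base[cut:] != changed[cut:]
```

Exact equality is deliberate here: a tolerance-based comparison could hide a small leak from masked positions.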

Large-sample effectiveness checks

Criteo_x1 2M, CPU, 1 epoch, batch 1024:

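| Model    | AUC    | LogLoss | Train time |
|----------|--------|---------|------------|
| MaskNet  | 0.7908 | 0.4544  | 41.9s      |
| DeepFM   | 0.7892 | 0.4559  | 32.6s      |
| FinalMLP | 0.7889 | 0.4562  | 32.6s      |
| WuKong   | 0.7842 | 0.4614  | 61.5s      |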

MovieLens-25M, 2.28M behavior samples, CPU, 1 epoch, batch 1024:

| Model    | AUC    | LogLoss | Train time |
|----------|--------|---------|------------|
| OneTrans | 0.8056 | 0.5347  | 114.5s     |
| DIN      | 0.8030 | 0.5375  | 12.4s      |

These benchmark numbers validate end-to-end effectiveness under a fixed protocol; equation-level correctness is covered separately by the semantic contracts above.
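For reference, the two reported metrics can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the metrics module added in this PR; the function names and the tiny example data are illustrative only.

```python
import math

def log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This O(P*N) form is for clarity, not speed.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(auc(labels, probs))       # 3 of the 4 positive/negative pairs are ordered correctly
print(log_loss(labels, probs))
```

AUC depends only on ranking, while LogLoss also penalizes miscalibrated probabilities, which is why the two columns need not order models identically.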

shenweichen and others added 14 commits June 4, 2026 06:27
Add a benchmarks/ package with single-task, multitask, and sequence
track runners, dataset loaders, metrics, and a model registry, plus a
fast smoke test in tests/benchmark_test.py. Ignore downloaded datasets
and generated leaderboards via .gitignore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark split correctness:
- Census multitask now uses the dataset's official train/test partition
  (census-income.data / .test) when present, instead of randomly reshuffling
  the two together; encoders are fit on the union so test-only categories don't
  crash the run. Falls back to a random split for the bundled sample.
- Single-task gains --temporal-split / --time-col: a chronological hold-out
  (most recent test_size as test) for time-ordered logs, to avoid look-ahead
  leakage. Default stays random (Criteo_x1/DAC is anonymised and shuffled with
  no timestamp, so random is the standard leakage-free split).
- Add tests covering both guarantees; document the flags in benchmarks/README.
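A minimal sketch of the chronological hold-out described above, assuming rows are dicts and the timestamp column sorts naturally. `temporal_split` and its parameters are hypothetical names chosen for illustration; the actual flag handling in the benchmark runner may differ.

```python
def temporal_split(rows, time_col, test_size=0.2):
    """Hold out the most recent `test_size` fraction of rows as the test set."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    ordered = sorted(rows, key=lambda r: r[time_col])  # oldest first
    n_test = max(1, int(round(len(ordered) * test_size)))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

# Ten log rows with shuffled timestamps.
logs = [{"ts": t, "click": t % 2} for t in [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]]
train, test = temporal_split(logs, "ts", test_size=0.2)

# No look-ahead: every training row precedes every test row.
assert max(r["ts"] for r in train) < min(r["ts"] for r in test)
```

A random split on the same data would typically place some later rows in train and earlier rows in test, which is the look-ahead leakage the flag is meant to avoid.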

Also: benchmarks/RESULTS.md captures the real-data leaderboards (Criteo 500k
single-task, official-split Census multitask), and a project skill
.claude/skills/deepctr-benchmark records how to re-run the suite quickly
(CPU-only on this host, data locations, split semantics).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DIN/BST/DSIN reach 0.79-0.80 AUC on 2.28M real behavior samples (vs ~0.5
synthetic). Document the ml25m_seq_2m.csv build steps and the CPU-only
caveat in the skill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- FinalMLP (AAAI 2023): two feature-gated MLP streams fused by a new
  InteractionAggregation bilinear head (deepctr/layers/finalmlp.py).
- MaskNet (DLP-KDD 2021): parallel instance-guided MaskBlocks.
- OneTrans: finish wiring (custom_objects, deepctr.models export, __all__)
  and fix serialization bugs surfaced by save/load round-trip:
    * positional add_weight -> keyword name=
    * tensorflow.python.keras -> tensorflow.keras imports
    * unserializable Lambda slice -> TokenSlice layer
    * PositionEncoding class-name collision -> SinusoidalPositionEncoding
- Register OneTransLayer/TokenSlice/SinusoidalPositionEncoding/
  InteractionAggregation in layers.custom_objects; add unit tests.
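The serialization fixes above follow a common pattern: a custom layer exposes its constructor arguments through `get_config`, is rebuilt through `from_config`, and is looked up by class name in a `custom_objects` mapping. The sketch below mimics that round trip in plain Python so it runs without TensorFlow. `TokenSliceSketch`, `save`, and `load` are hypothetical stand-ins, not the PR's `TokenSlice` layer or any Keras function.

```python
import json

class TokenSliceSketch:
    """Hypothetical stand-in for a slicing layer with a serializable config."""

    def __init__(self, start, end, name="token_slice"):
        self.start, self.end, self.name = start, end, name

    def __call__(self, tokens):
        return tokens[self.start:self.end]

    def get_config(self):
        # Everything needed to rebuild the layer, as JSON-friendly values.
        return {"start": self.start, "end": self.end, "name": self.name}

    @classmethod
    def from_config(cls, config):
        return cls(**config)

# Analogue of a custom_objects registry: class name -> class.
CUSTOM_OBJECTS = {"TokenSliceSketch": TokenSliceSketch}

def save(layer):
    return json.dumps({"class_name": type(layer).__name__, "config": layer.get_config()})

def load(blob, custom_objects):
    spec = json.loads(blob)
    return custom_objects[spec["class_name"]].from_config(spec["config"])

layer = TokenSliceSketch(1, 3)
restored = load(save(layer), CUSTOM_OBJECTS)
tokens = ["a", "b", "c", "d"]
assert restored(tokens) == layer(tokens) == ["b", "c"]

# By analogy, an anonymous function carries no such config and cannot be dumped.
try:
    json.dumps({"fn": lambda t: t[1:3]})
    lambda_serialized = True
except TypeError:
    lambda_serialized = False
assert lambda_serialized is False
```

A save/load round trip that compares outputs, as in the last assertions, is what surfaces this class of bug; a test that only builds and calls the model would not.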

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Standalone CLI standardizing new-model adoption across four stages:
  discover -> scaffold -> verify -> docs (+ audit, onboard).

- scaffold: codegen model/test/layer skeletons and auto-wire all 6
  registration points (models __init__/__all__, sub-package __init__,
  layers custom_objects, benchmarks/registry.py) idempotently.
- audit: validate every discovered model is fully wired; catches
  half-integrated models (e.g. the original OneTrans state).
- verify: correctness (unit test + audit) + effectiveness (benchmark
  AUC vs in-track baseline, compared to paper metric); writes reports/.
- discover: candidate knowledge base (candidates.json) + web-research
  prompt; reconciles status against live deepctr.models.
- Register FinalMLP/MaskNet/OneTrans builders in benchmarks/registry.py.
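The "idempotent wiring" and "audit" ideas can be sketched in a few lines. This is an illustrative model only: the registration-point names, `wire`, and `audit` below are hypothetical, and real registration points are source files rather than in-memory lists.

```python
def wire(entries, name):
    """Register `name` unless it is already present. Returns True if a change was made."""
    if name in entries:
        return False
    entries.append(name)
    return True

def audit(models, points):
    """Map each model to the registration points it is missing from."""
    report = {}
    for model in models:
        missing = [point for point, entries in points.items() if model not in entries]
        if missing:
            report[model] = missing
    return report

# Hypothetical state: NewModel is imported but otherwise half-integrated.
points = {
    "models/__init__ import": ["DeepFM", "NewModel"],
    "models/__all__": ["DeepFM"],
    "benchmarks/registry": ["DeepFM"],
}
models = ["DeepFM", "NewModel"]

before = audit(models, points)
for point in before.get("NewModel", []):
    wire(points[point], "NewModel")
after = audit(models, points)

# Idempotency: running the wiring again changes nothing.
second_pass_changed = any(wire(entries, "NewModel") for entries in points.values())

assert before == {"NewModel": ["models/__all__", "benchmarks/registry"]}
assert after == {}
assert second_pass_changed is False
```

Returning a "changed" flag from each wiring step makes it easy to assert idempotency in a test, since a second run should report no changes at all.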

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add model-table rows, Features.md sections, autodoc rst stubs + toctree
entries, History changelog lines, and RESULTS entries (generated via
benchmarks.onboard docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- AGENTS.md: env setup (tf-keras + TF_USE_LEGACY_KERAS=1 + CUDA off),
  test commands, onboarding flow, and gotchas for any coding agent.
- discover/README: replace Claude-specific wording (deep-research skill,
  WebSearch) with tool-agnostic phrasing so Codex etc. can drive it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the design plan into the repo so it survives across machines; link
it from the onboard README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the FuxiCTR/BARS model zoo and recent arXiv (2022-2025):
- single: FinalNet (SIGIR'23), WuKong (ICML'24), FCN (2024), QNN (KDD'25), APG (NeurIPS'22)
- sequence: TransAct (KDD'23), TWIN (KDD'23)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- WuKongLayer: stackable FMB (X X^T -> MLP) + LCB (linear embedding
  recombination) with residual; registered in custom_objects.
- WuKong model: stacks N WuKong layers over field embeddings + DNN head.
- Wire registry + tests + docs; mark implemented in candidate KB.
- audit: 33/33 models fully wired.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The criteo-labs S3 mirror (and go.criteo.net) now 404. Stop defaulting
--source download to a dead URL; DEFAULT_DAC_URL is None and the loader
prints current sources (HF Criteo_x1 / CriteoClickLogs / Criteo AI Lab)
pointing users to --data-path, while still honoring an explicit
--download-url mirror. Falls back to the bundled sample as before.
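The resolution order described in this commit can be sketched as below. Only `DEFAULT_DAC_URL = None` and the flag names come from the commit message; `resolve_source`, the sample path, the example URL, and the precedence of a local path over an explicit URL are assumptions made for illustration.

```python
DEFAULT_DAC_URL = None  # per the commit message: no default download URL

def resolve_source(data_path=None, download_url=None, sample_path="bundled_sample.txt"):
    """Decide where to load data from. Returns a (kind, location) pair.

    Assumed precedence: explicit local path, then explicit mirror URL,
    then the bundled sample. `sample_path` is a hypothetical placeholder.
    """
    if data_path:
        return ("local", data_path)
    url = download_url or DEFAULT_DAC_URL
    if url:
        return ("download", url)
    # No usable URL: tell the user what to do instead of hitting a dead link.
    print("No default download URL is configured; "
          "pass --data-path or --download-url. Using the bundled sample.")
    return ("sample", sample_path)

assert resolve_source(data_path="/data/criteo.csv") == ("local", "/data/criteo.csv")
assert resolve_source(download_url="https://example.org/dac.tar.gz") == (
    "download", "https://example.org/dac.tar.gz")
assert resolve_source() == ("sample", "bundled_sample.txt")
```

Failing over to the sample with a printed hint keeps smoke runs working offline while making it obvious that the numbers do not come from the full dataset.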

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codacy-production (Bot) commented Jun 28, 2026

Not up to standards ⛔

🔴 Issues 12 high · 6 medium · 1 minor

Alerts:
⚠ 19 issues (≤ 0 issues of at least minor severity)

Results:
19 new issues

Category Results
Security 6 medium
1 minor
12 high


🟢 Metrics 586 complexity · 14 duplication

Metric Results
Complexity 586
Duplication 14



shenweichen changed the title from "[codex] Add benchmark suite, model onboarding pipeline, and four CTR models" to "[codex] Add benchmark suite, model onboarding pipeline, and recent CTR models" on Jun 28, 2026