ci: consolidate self-hosted build; wire pytest-xdist (off by default)#901
ci: consolidate self-hosted build; wire pytest-xdist (off by default)#901thomasrockhu-codecov wants to merge 3 commits intomainfrom
Conversation
…ted build
Two ci-router runtime improvements (items 1 and 3+4 from the speedup plan):
1. Parallelize API and Shared test jobs with pytest-xdist (-n auto)
The API test job is the longest single job on the critical path
(~6 minutes). Both API and Shared tests are pytest-django suites whose
per-worker test DB suffixing is supported out of the box, so we can opt
into pytest-xdist with minimal risk. Worker stays serial because its
SQLAlchemy DB setup in apps/worker/conftest.py uses the deprecated
`slaveinput` API and mutates `engine.url` (immutable in modern
SQLAlchemy), which would race on `test_postgres_sqlalchemy` under xdist.
Wiring:
- Add pytest-xdist to the dev dependency group.
- New `PYTEST_XDIST` variable in docker/Makefile.ci-tests; when set, the
test recipes append `-n <value>`. Empty (default) keeps tests serial.
- New `pytest_xdist` input on _run-tests.yml that forwards to the make
target.
- api-ci.yml and shared-ci.yml set `pytest_xdist: auto`; worker-ci.yml
keeps it empty with an inline note explaining why.
2. Consolidate self-hosted build/push so we only build once per run
Previously `api-self-hosted` (and `worker-self-hosted`) ran on push to
main and called _self-hosted.yml a second time, just to push a rolling
tag whose image had already been built moments earlier by
`api-build-self-hosted`. The second call would cache-hit but still
spun up a fresh runner, restored cache, and `docker load`d the tar
(~50-90s of pure overhead).
Replace the two-job pattern with a single `*-build-self-hosted` job
that always builds and conditionally sets `push_rolling` based on the
same push-to-main condition. The inner _self-hosted.yml already gates
its push job on `inputs.push_rolling == true`.
Net effect: removes ~50-90s from push-to-main critical path and one
redundant runner per app per main push.
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (50.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #901 +/- ##
==========================================
- Coverage 91.91% 91.91% -0.01%
==========================================
Files 1316 1316
Lines 50191 50236 +45
Branches 1625 1625
==========================================
+ Hits 46133 46174 +41
- Misses 3752 3756 +4
Partials 306 306
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…xdist
CI on the parent commit revealed two distinct xdist failures:
1. API: `test_when_repository_has_null_head_has_parent_report` and a few
sibling bundle-analysis tests in `test_pull.py` / `test_commit.py`
failed with `sqlite3.OperationalError: no such table: bundles`. The
tests `os.system("rm -rf /tmp/bundle_analysis_*")` and assert via
`os.listdir("/tmp")` that no `bundle_analysis_*` files remain. Under
xdist those operations clobber other workers' temp SQLite files
created by `tempfile.mkstemp(prefix="bundle_analysis_")` in
`shared/bundle_analysis/report.py`.
Fix: in apps/codecov-api/conftest.py, on `pytest_configure`, point
each xdist worker at its own `TMPDIR` (`<tmp>/pytest_<worker_id>`)
and update the four affected tests to use `tempfile.gettempdir()`
instead of hard-coded `/tmp`. Behaviorally a no-op when running
serially (PYTEST_XDIST_WORKER unset).
2. Shared: enabling xdist exposed an unrelated race in shared's custom
`django_setup_test_db` where multiple workers collide on
`test_postgres_gw0` ("already exists" / "is being accessed by other
users"). That's a separate plumbing change in libs/shared; back the
shared workflow down to `pytest_xdist: ""` for now with an inline
note.
API stays on `pytest_xdist: auto` and is the headline win
(measured ~360s -> 218s on the failing run before this fix).
… the win The API xdist run uncovered a second, unrelated test isolation issue in `upload/tests/views/test_uploads.py`: 7 parametrized variants of `test_uploads_post_tokenless` and `test_uploads_post_token_required_auth_check` fail with `assert list(upload.flags.all()) == [flag1, flag2]` returning only `[flag2]` even though both `UploadFlagMembership` rows are present in the DB for that upload. That looks like Django M2M-related caching or a parametrize-ordering effect that's worth fixing in a focused follow-up rather than gating this CI optimization PR on it. Setting `pytest_xdist: ""` for API turns it back to serial. The wiring (input + Makefile flag + pytest-xdist dependency + bundle_analysis tmpdir isolation in the API conftest) all stays in the PR; flipping API back on xdist is a one-line change once the upload-tests issue is fixed. The headline win in this PR is now the self-hosted job consolidation (items 3+4): one build per app per run, push only on main.
Summary
Two CI improvements:
1. Consolidate self-hosted build/push so we build once per run (items 3+4 from the speedup plan)
On push-to-main, both
api-build-self-hostedandapi-self-hostedran. The second job called_self-hosted.ymlagain purely to push the rolling tag — it cache-hit on the build but still spun up a fresh runner, restored cache, anddocker loadd the tar (~50-90s of pure overhead per app).This PR collapses each pair into a single
*-build-self-hostedjob that always builds and conditionally pushes by computingpush_rollingfrom the same push-to-main condition. The inner_self-hosted.ymlalready gates its push job oninputs.push_rolling == true, so no changes were needed there.api-self-hostedandworker-self-hostedtop-level jobs inapi-ci.ymlandworker-ci.yml.API CI / Build Self Hosted (API) / Build Self Hosted App, etc.) are preserved.2. Wire pytest-xdist plumbing (off by default for now)
The plumbing for opting individual test jobs into pytest-xdist is in place but every app keeps
pytest_xdist: ""(serial). This unblocks future enablement as a one-line flip per app:pytest-xdist>=3.6.1added to thedevdep group;uv.lockupdated.PYTEST_XDISTMake variable indocker/Makefile.ci-tests. When non-empty, the test recipes append-n <value>. Empty default = serial (no behavior change).pytest_xdistinput on_run-tests.ymlforwarded to the make recipe.tempfile.gettempdir()per xdist worker so the bundle-analysis tests ingraphql_api/tests/test_pull.pyandtest_commit.py(which clean upbundle_analysis_*files in the system tmpdir) won't race once xdist is enabled. Those tests now usetempfile.gettempdir()instead of hard-coded/tmp— small hygiene improvement under serial mode too.Why xdist is staying off in this PR
A run with
pytest_xdist: autoon this branch hit two separate test-isolation regressions:upload/tests/views/test_uploads.py::test_uploads_post_tokenlessandtest_uploads_post_token_required_auth_check(7 parametrized variants) failed under xdist withassert list(upload.flags.all()) == [flag1, flag2]returning only[flag2], even though bothUploadFlagMembershiprows were demonstrably present in the DB for that upload. Looks like Django M2M-related caching or a parametrize-ordering effect — worth fixing in a focused follow-up rather than gating this PR.test_postgres_gw0"already exists" / "is being accessed by other users") inside shared's customdjango_setup_test_dbinlibs/shared/shared/testutils.py. Separate plumbing change.Both are tracked for follow-ups. The wiring in this PR means re-enabling later is a one-line change per workflow.
For the record, the API xdist run (before backing off) measured 3:38 vs the previous ~6:00 serial baseline, so the underlying speedup is real and worth pursuing in a follow-up.
Test plan
Push Self Hosted Image (API/Worker)check name being gone is intentional w.r.t. branch protection.Notes / follow-ups
apps/worker/conftest.pyuses the deprecated xdistslaveinputAPI and mutatesengine.url).django_setup_test_dbworker-safe.upload.flags.all()discrepancy intest_uploads.py.