ci: consolidate self-hosted build; wire pytest-xdist (off by default) #901

Open
thomasrockhu-codecov wants to merge 3 commits into main from
tomhu/ci-shard-and-selfhosted-consolidate

@thomasrockhu-codecov thomasrockhu-codecov commented Apr 30, 2026

Summary

Two CI improvements:

1. Consolidate self-hosted build/push so we build once per run (items 3+4 from the speedup plan)

On push-to-main, both api-build-self-hosted and api-self-hosted ran. The second job called _self-hosted.yml again purely to push the rolling tag. It got a cache hit on the build but still spun up a fresh runner, restored the cache, and ran docker load on the tar (~50-90s of pure overhead per app).

This PR collapses each pair into a single *-build-self-hosted job that always builds and conditionally pushes by computing push_rolling from the same push-to-main condition. The inner _self-hosted.yml already gates its push job on inputs.push_rolling == true, so no changes were needed there.

  • Removes the api-self-hosted and worker-self-hosted top-level jobs in api-ci.yml and worker-ci.yml.
  • Existing PR check names (API CI / Build Self Hosted (API) / Build Self Hosted App, etc.) are preserved.
  • Expected impact: ~50-90s off push-to-main critical path per app and one redundant runner per app per main push.
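
The consolidated job's decision logic can be modelled in a few lines of Python. This is only an illustrative sketch: the real condition lives in the workflow YAML, and the exact expression is not shown in this PR description, so the `event_name == "push"` and `refs/heads/main` check below is an assumption, as are the function names.

```python
def compute_push_rolling(event_name, ref):
    """Return True only for a push to main.

    Assumed condition: the PR only says "the same push-to-main condition".
    """
    return event_name == "push" and ref == "refs/heads/main"


def plan_self_hosted_job(event_name, ref):
    """Model the single *-build-self-hosted job: always build, sometimes push."""
    return {
        "build": True,  # the image is built on every run
        "push_rolling": compute_push_rolling(event_name, ref),
    }


if __name__ == "__main__":
    print(plan_self_hosted_job("pull_request", "refs/pull/901/merge"))
    print(plan_self_hosted_job("push", "refs/heads/main"))
```

The point of the model is that the build step no longer depends on the event type; only the push flag does.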

2. Wire pytest-xdist plumbing (off by default for now)

The plumbing for opting individual test jobs into pytest-xdist is in place but every app keeps pytest_xdist: "" (serial). This unblocks future enablement as a one-line flip per app:

  • pytest-xdist>=3.6.1 added to the dev dep group; uv.lock updated.
  • New PYTEST_XDIST Make variable in docker/Makefile.ci-tests. When non-empty, the test recipes append -n <value>. Empty default = serial (no behavior change).
  • New pytest_xdist input on _run-tests.yml forwarded to the make recipe.
  • API conftest defensively isolates tempfile.gettempdir() per xdist worker so the bundle-analysis tests in graphql_api/tests/test_pull.py and test_commit.py (which clean up bundle_analysis_* files in the system tmpdir) won't race once xdist is enabled. Those tests now use tempfile.gettempdir() instead of hard-coded /tmp — small hygiene improvement under serial mode too.
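
A minimal sketch of the per-worker tmpdir isolation, assuming it keys off the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets in each worker and uses a `<tmp>/pytest_<worker_id>` directory. The helper name `isolate_tmpdir_for_xdist_worker` is hypothetical, and the actual conftest in this PR may differ in detail.

```python
import os
import tempfile


def isolate_tmpdir_for_xdist_worker():
    """Give the current xdist worker its own temp directory.

    Returns the per-worker directory, or None when running serially.
    """
    worker_id = os.environ.get("PYTEST_XDIST_WORKER")  # e.g. "gw0"; unset when serial
    if not worker_id:
        return None  # serial run: leave the temp dir untouched

    base = tempfile.gettempdir()
    dirname = f"pytest_{worker_id}"
    if os.path.basename(base) == dirname:
        return base  # already isolated; avoid nesting on repeated calls

    worker_tmp = os.path.join(base, dirname)
    os.makedirs(worker_tmp, exist_ok=True)

    os.environ["TMPDIR"] = worker_tmp  # inherited by any subprocesses
    tempfile.tempdir = worker_tmp      # overrides the value cached by gettempdir()
    return worker_tmp


def pytest_configure(config):
    # pytest calls this hook once per process, including each xdist worker.
    isolate_tmpdir_for_xdist_worker()
```

Setting both `TMPDIR` and `tempfile.tempdir` is a deliberate choice in this sketch: the environment variable covers child processes, while the module attribute covers code in the current process that has already called `tempfile.gettempdir()`.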

Why xdist is staying off in this PR

A run with pytest_xdist: auto on this branch hit two separate test-isolation regressions:

  • API: upload/tests/views/test_uploads.py::test_uploads_post_tokenless and test_uploads_post_token_required_auth_check (7 parametrized variants) failed under xdist with assert list(upload.flags.all()) == [flag1, flag2] returning only [flag2], even though both UploadFlagMembership rows were demonstrably present in the DB for that upload. Looks like Django M2M-related caching or a parametrize-ordering effect — worth fixing in a focused follow-up rather than gating this PR.
  • Shared: workers raced creating the per-worker postgres test DB (test_postgres_gw0 "already exists" / "is being accessed by other users") inside shared's custom django_setup_test_db in libs/shared/shared/testutils.py. Separate plumbing change.

Both are tracked for follow-ups. The wiring in this PR means re-enabling later is a one-line change per workflow.
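
The "empty means serial" behavior of the `PYTEST_XDIST` variable can be sketched in Python as follows. The real logic is a Make conditional in docker/Makefile.ci-tests; the function name and the bare `pytest` base command here are illustrative assumptions.

```python
def build_pytest_command(pytest_xdist="", base=("pytest",)):
    """Append `-n <value>` only when a non-empty xdist value is given."""
    cmd = list(base)
    value = pytest_xdist.strip()
    if value:
        cmd += ["-n", value]  # e.g. "-n auto" or "-n 4"
    return cmd


if __name__ == "__main__":
    print(build_pytest_command(""))      # serial
    print(build_pytest_command("auto"))  # parallel across available CPUs
```

Under this model, re-enabling xdist for an app amounts to changing the value passed in from `""` to `auto`.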

For the record, the API xdist run (before backing off) measured 3:38 vs the previous ~6:00 serial baseline, so the underlying speedup is real and worth pursuing in a follow-up.
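
For scale, the two timings quoted above work out as follows (a quick arithmetic check using only the 3:38 and ~6:00 figures; the serial baseline is approximate, so the results are too):

```python
def to_seconds(mmss):
    """Convert an 'M:SS' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


serial = to_seconds("6:00")  # approximate serial baseline
xdist = to_seconds("3:38")   # measured xdist run

reduction = (serial - xdist) / serial  # fraction of wall time saved
speedup = serial / xdist               # how many times faster

print(f"{serial}s -> {xdist}s: {reduction:.1%} less wall time, {speedup:.2f}x speedup")
```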

Test plan

  • CI passes on this branch (validates the self-hosted consolidation on a PR; tests stay serial so no behavior change for those).
  • After merge, watch one push-to-main run to confirm that self-hosted rolling images are still pushed (single build, single push) and that the removal of the previous Push Self Hosted Image (API/Worker) check name is intentional with respect to branch protection.

Notes / follow-ups

  • Worker xdist support is left as a follow-up (its apps/worker/conftest.py uses the deprecated xdist slaveinput API and mutates engine.url).
  • Shared xdist support: make django_setup_test_db worker-safe.
  • API xdist support: investigate the upload.flags.all() discrepancy in test_uploads.py.
  • Other items from the speedup plan (collapsing the 4-way Codecov upload matrix, ubuntu-large queueing, etc.) are not in this PR.

…ted build

Two ci-router runtime improvements (items 1 and 3+4 from the speedup plan):

1. Parallelize API and Shared test jobs with pytest-xdist (-n auto)

   The API test job is the longest single job on the critical path
   (~6 minutes). Both API and Shared tests are pytest-django suites whose
   per-worker test DB suffixing is supported out of the box, so we can opt
   into pytest-xdist with minimal risk. Worker stays serial because its
   SQLAlchemy DB setup in apps/worker/conftest.py uses the deprecated
   `slaveinput` API and mutates `engine.url` (immutable in modern
   SQLAlchemy), which would race on `test_postgres_sqlalchemy` under xdist.

   Wiring:
   - Add pytest-xdist to the dev dependency group.
   - New `PYTEST_XDIST` variable in docker/Makefile.ci-tests; when set, the
     test recipes append `-n <value>`. Empty (default) keeps tests serial.
   - New `pytest_xdist` input on _run-tests.yml that forwards to the make
     target.
   - api-ci.yml and shared-ci.yml set `pytest_xdist: auto`; worker-ci.yml
     keeps it empty with an inline note explaining why.

2. Consolidate self-hosted build/push so we only build once per run

   Previously `api-self-hosted` (and `worker-self-hosted`) ran on push to
   main and called _self-hosted.yml a second time, just to push a rolling
   tag whose image had already been built moments earlier by
   `api-build-self-hosted`. The second call would cache-hit but still
   spin up a fresh runner, restore the cache, and `docker load` the tar
   (~50-90s of pure overhead).

   Replace the two-job pattern with a single `*-build-self-hosted` job
   that always builds and conditionally sets `push_rolling` based on the
   same push-to-main condition. The inner _self-hosted.yml already gates
   its push job on `inputs.push_rolling == true`.

   Net effect: removes ~50-90s from push-to-main critical path and one
   redundant runner per app per main push.

codspeed-hq Bot commented Apr 30, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing tomhu/ci-shard-and-selfhosted-consolidate (f388d94) with main (3d2b291) [1]

Footnotes

  1. No successful run was found on main (94de4d3) during the generation of this report, so 3d2b291 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.


sentry Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.91%. Comparing base (429842c) to head (f388d94).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/codecov-api/conftest.py | 50.00% | 4 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (50.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #901      +/-   ##
==========================================
- Coverage   91.91%   91.91%   -0.01%     
==========================================
  Files        1316     1316              
  Lines       50191    50236      +45     
  Branches     1625     1625              
==========================================
+ Hits        46133    46174      +41     
- Misses       3752     3756       +4     
  Partials      306      306              
| Flag | Coverage Δ |
| --- | --- |
| apiunit | 95.01% <50.00%> (-0.01%) ⬇️ |
| sharedintegration | 36.97% <ø> (ø) |
| sharedunit | 84.89% <ø> (ø) |
| workerintegration | 58.54% <ø> (ø) |
| workerunit | 90.41% <ø> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.



codecov-notifications Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| apps/codecov-api/conftest.py | 50.00% | 4 Missing ⚠️ |


…xdist

CI on the parent commit revealed two distinct xdist failures:

1. API: `test_when_repository_has_null_head_has_parent_report` and a few
   sibling bundle-analysis tests in `test_pull.py` / `test_commit.py`
   failed with `sqlite3.OperationalError: no such table: bundles`. The
   tests `os.system("rm -rf /tmp/bundle_analysis_*")` and assert via
   `os.listdir("/tmp")` that no `bundle_analysis_*` files remain. Under
   xdist those operations clobber other workers' temp SQLite files
   created by `tempfile.mkstemp(prefix="bundle_analysis_")` in
   `shared/bundle_analysis/report.py`.

   Fix: in apps/codecov-api/conftest.py, on `pytest_configure`, point
   each xdist worker at its own `TMPDIR` (`<tmp>/pytest_<worker_id>`)
   and update the four affected tests to use `tempfile.gettempdir()`
   instead of hard-coded `/tmp`. Behaviorally a no-op when running
   serially (PYTEST_XDIST_WORKER unset).

2. Shared: enabling xdist exposed an unrelated race in shared's custom
   `django_setup_test_db` where multiple workers collide on
   `test_postgres_gw0` ("already exists" / "is being accessed by other
   users"). That's a separate plumbing change in libs/shared; back the
   shared workflow down to `pytest_xdist: ""` for now with an inline
   note.
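
The test-side half of the fix in item 1 (replacing hard-coded `/tmp` with `tempfile.gettempdir()`) might look like the following sketch. The helper names are hypothetical, and the actual tests shell out to `rm -rf` rather than using these functions; the point is only that cleanup and the leftover check both resolve the directory the same way.

```python
import glob
import os
import shutil
import tempfile

PREFIX = "bundle_analysis_"


def remove_bundle_analysis_files(tmpdir=None):
    """Delete bundle_analysis_* entries from the (per-worker) temp dir."""
    tmpdir = tmpdir or tempfile.gettempdir()
    removed = []
    for path in glob.glob(os.path.join(tmpdir, PREFIX + "*")):
        if os.path.isdir(path):
            shutil.rmtree(path)  # mirrors the recursive part of `rm -rf`
        else:
            os.remove(path)
        removed.append(path)
    return removed


def leftover_bundle_analysis_files(tmpdir=None):
    """List bundle_analysis_* entries still present in the temp dir."""
    tmpdir = tmpdir or tempfile.gettempdir()
    return [name for name in os.listdir(tmpdir) if name.startswith(PREFIX)]
```

A test would call `remove_bundle_analysis_files()` during setup and assert `leftover_bundle_analysis_files() == []` afterwards. With each worker pointed at its own temp directory, neither call can touch another worker's SQLite files.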

API stays on `pytest_xdist: auto` and is the headline win
(measured ~360s -> 218s on the failing run before this fix).
… the win

The API xdist run uncovered a second, unrelated test isolation issue in
`upload/tests/views/test_uploads.py`: 7 parametrized variants of
`test_uploads_post_tokenless` and `test_uploads_post_token_required_auth_check`
fail with `assert list(upload.flags.all()) == [flag1, flag2]` returning
only `[flag2]` even though both `UploadFlagMembership` rows are present in
the DB for that upload. That looks like Django M2M-related caching or a
parametrize-ordering effect that's worth fixing in a focused follow-up
rather than gating this CI optimization PR on it.

Setting `pytest_xdist: ""` for API turns it back to serial. The wiring
(input + Makefile flag + pytest-xdist dependency + bundle_analysis tmpdir
isolation in the API conftest) all stays in the PR; flipping API back
on xdist is a one-line change once the upload-tests issue is fixed.

The headline win in this PR is now the self-hosted job consolidation
(items 3+4): one build per app per run, push only on main.
@thomasrockhu-codecov changed the title from "ci: shard API/shared tests with pytest-xdist and consolidate self-hosted build" to "ci: consolidate self-hosted build; wire pytest-xdist (off by default)" on Apr 30, 2026