Cache CLI extractor paths across Actions steps by mario-campos · Pull Request #3950 · github/codeql-action

mario-campos · 2026-06-04T13:53:53Z

Similar to #3943, this PR caches the output of codeql resolve languages, which contains the paths to the various extractors, so that repeated calls to resolveLanguages() reuse the cached result instead of re-running the CLI. Additionally, it re-implements resolveExtractor() as a wrapper over resolveLanguages() (to reuse the cached output) rather than shelling out to codeql resolve extractor.

In one experiment, I counted seven instances of shelling out to codeql resolve extractor. When you dig into the code, you can see why: resolveExtractor() is not called often or from many places, but one caller is isTracedLanguage(), which is wrapped by isScannedLanguage(). These functions are often used in a loop/map over some or all languages, which explains why we see consecutive executions of codeql resolve extractor.
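
As an illustration of the idea described above, here is a minimal sketch and not the PR's actual code. It assumes that the output of codeql resolve languages maps each language to a list of extractor roots. The factory `createResolver` and its `runCli` parameter are hypothetical stand-ins for shelling out to the CLI, and the sketch is synchronous for brevity.

```typescript
// Assumed output shape: each language maps to a list of extractor roots.
type ResolveLanguagesOutput = Record<string, string[]>;

// `runCli` is a hypothetical stand-in for running `codeql resolve languages`.
function createResolver(runCli: () => ResolveLanguagesOutput) {
  let cached: ResolveLanguagesOutput | undefined;

  // Runs the CLI at most once; later calls reuse the stored output.
  function resolveLanguages(): ResolveLanguagesOutput {
    if (cached === undefined) {
      cached = runCli();
    }
    return cached;
  }

  // Looks up the extractor root in the cached output instead of invoking
  // `codeql resolve extractor` once per language.
  function resolveExtractor(language: string): string {
    const candidates: string[] | undefined = resolveLanguages()[language];
    const first = candidates === undefined ? undefined : candidates[0];
    if (first === undefined) {
      throw new Error(`No extractor found for language '${language}'`);
    }
    return first;
  }

  return { resolveLanguages, resolveExtractor };
}

// Usage: a map over several languages triggers a single CLI invocation.
let cliInvocations = 0;
const resolver = createResolver(() => {
  cliInvocations++;
  return {
    javascript: ["/opt/codeql/javascript"],
    python: ["/opt/codeql/python"],
  };
});
const extractorRoots = ["javascript", "python", "javascript"].map((language) =>
  resolver.resolveExtractor(language),
);
```

The paths in the usage example are made up; only the call pattern matters.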

In support of the above goals, this PR also adds some additional functions to the json module, to enable validation of the codeql version output.

Risk assessment

For internal use only. Please select the risk level of this change:

Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Which use cases does this change impact?

Workflow types:

Advanced setup - Impacts users who have custom CodeQL workflows.
Managed - Impacts users with dynamic workflows (Default Setup, Code Quality, ...).

Products:

Code Scanning - The changes impact analyses when analysis-kinds: code-scanning.
Code Quality - The changes impact analyses when analysis-kinds: code-quality.
Other first-party - The changes impact other first-party analyses.
Third-party analyses - The changes affect the upload-sarif action.

Environments:

Dotcom - Impacts CodeQL workflows on github.com and/or GitHub Enterprise Cloud with Data Residency.
GHES - Impacts CodeQL workflows on GitHub Enterprise Server.
Testing/None - This change does not impact any CodeQL workflows in production.

How did/will you validate this change?

Unit tests - I am depending on unit test coverage (i.e. tests in .test.ts files).
End-to-end tests - I am depending on PR checks (i.e. tests in pr-checks).
Other - Manual/local testing

If something goes wrong after this change is released, what are the mitigation and rollback strategies?

Feature flags - All new or changed code paths can be fully disabled with corresponding feature flags.
Rollback - Change can only be disabled by rolling back the release or releasing a new version with a fix.
Development/testing only - This change cannot cause any failures in production.
Other - Please provide details.

How will you know if something goes wrong after this change is released?

Telemetry - I rely on existing telemetry or have made changes to the telemetry.
- Dashboards - I will watch relevant dashboards for issues after the release. Consider whether this requires this change to be released at a particular time rather than as part of a regular release.
- Alerts - New or existing monitors will trip if something goes wrong with this change.
Other - Please provide details.

Are there any special considerations for merging or releasing this change?

No special considerations - This change can be merged at any time.
Special considerations - This change should only be merged once certain preconditions are met. Please provide details of those or link to this PR from an internal issue.

Merge / deployment checklist

Confirm this change is backwards compatible with existing workflows.
Consider adding a changelog entry for this change.
Confirm the readme and docs have been updated if necessary.

henrymercer

Caching these invocations makes a lot of sense! I have a high level comment and a couple of lower level comments.

The main point is that now that we're caching multiple invocations, it might be a good opportunity to generalise the design. For instance, you could imagine something like:

const versionCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_VERSION_INFO, validate: isVersionInfo });
const resolveLanguagesCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_RESOLVE_LANGUAGES, validate: isResolveLanguagesOutput });

where createPersistedCliCache handles memoising in the Action and persisting between Actions steps with an environment variable.
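
One possible shape for such a helper is sketched below. This is a guess at the design, not code from the PR: the `getOrCompute` method, the injected `env` object, and the string `"CODEQL_VERSION_INFO"` are assumptions made so that the example is self-contained. In a real Action, the write would presumably go through something like `core.exportVariable` so that later steps can see it.

```typescript
// Hypothetical options: `envVar` and `validate` mirror the names in the
// comment above; `env` is injected so the sketch does not touch process state.
interface PersistedCliCacheOptions<T> {
  envVar: string;
  validate: (value: unknown) => value is T;
  env: Record<string, string | undefined>;
}

function createPersistedCliCache<T>(options: PersistedCliCacheOptions<T>) {
  const { envVar, validate, env } = options;
  let memo: T | undefined;

  return {
    // Returns the memoised value, else the persisted one, else runs `compute`.
    getOrCompute(compute: () => T): T {
      if (memo !== undefined) {
        return memo;
      }
      const persisted = env[envVar];
      if (persisted !== undefined) {
        try {
          const parsed: unknown = JSON.parse(persisted);
          if (validate(parsed)) {
            memo = parsed;
            return parsed;
          }
        } catch {
          // Malformed JSON: fall through and recompute.
        }
      }
      const fresh = compute();
      memo = fresh;
      env[envVar] = JSON.stringify(fresh);
      return fresh;
    },
  };
}

// Usage: two cache instances sharing one environment simulate two steps.
interface VersionInfo {
  version: string;
}
function isVersionInfo(value: unknown): value is VersionInfo {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { version?: unknown }).version === "string"
  );
}

const fakeEnv: Record<string, string | undefined> = {};
const cacheOptions = {
  envVar: "CODEQL_VERSION_INFO",
  validate: isVersionInfo,
  env: fakeEnv,
};
let computeRuns = 0;
const computeVersion = (): VersionInfo => {
  computeRuns++;
  return { version: "2.23.0" };
};

createPersistedCliCache(cacheOptions).getOrCompute(computeVersion);
const restored = createPersistedCliCache(cacheOptions).getOrCompute(computeVersion);
```

The second instance never calls `computeVersion`, because the validated value is restored from the shared environment.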

Some smaller things:

Ideally the cache entry would also depend on getExtraOptionsFromEnv(["resolve","languages"])
We should remove the cache in testing-utils.ts like we do for the CodeQL version cache
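
To illustrate the first point, a cache key could incorporate the extra options so that a change in them invalidates the entry. The helper `commandCacheKey` below is hypothetical, and `--example-option` is a placeholder rather than a real CLI flag.

```typescript
// Hypothetical helper: derives a cache key from the subcommand and any
// extra options, so that different options never share a cache entry.
function commandCacheKey(command: string[], extraOptions: string[]): string {
  return JSON.stringify({ command, extraOptions });
}

const plainKey = commandCacheKey(["resolve", "languages"], []);
const keyWithOptions = commandCacheKey(
  ["resolve", "languages"],
  ["--example-option"],
);
```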

mbg

I agree with @henrymercer's comments regarding a more generalised design for this. I am wondering about the use of environment variables here vs using a file on disk. I don't know if you have already considered this, but we store e.g. the Action configuration on disk as a file. Perhaps that would make sense for these cached CLI results as well.

A general point: could we also make sure to add doc comments for new top-level definitions before merging?

mario-campos · 2026-06-18T15:44:33Z

I've taken your comments into consideration and overhauled the design to be more comprehensive and unified. The cache is now backed by a temporary file instead of the environment. I also identified a few opportunities to refactor some duplicated code into helper functions.

I kept the use of cmd as a key in the cache, but I question whether it's really necessary. I think it's safe to assume that, in most cases, there will only be one instance of codeql in use per job. And, even in the event that there's more than one instance, how likely is it that init would use a different version than autobuild or analyze? If it's not necessary, I would opt to delete it to simplify the code a bit.

Copilot

Warning

Copilot's review of this pull request may be incomplete because some of the changed files are excluded by your Copilot content exclusion settings. See Excluding content from Copilot for details.

Pull request overview

This PR introduces a cross-step cache for selected CodeQL CLI command outputs (notably codeql version and codeql resolve languages) to reduce repeated JVM startups and improve performance across GitHub Actions steps. It also refactors extractor resolution to derive extractor roots from resolve languages (reusing the cached output) and extends the internal JSON validation helpers to support stronger runtime validation of CLI JSON output.

Changes:

Add a new 2-tier command-output cache (in-memory + temp-file) and wire it into codeql.ts for version and resolve languages.
Refactor resolveExtractor() to use resolveLanguages() rather than invoking codeql resolve extractor.
Extend src/json validation helpers (number/object validators and undefinable) and add unit tests; remove now-obsolete util-based version cache.
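
The two-tier lookup summarised in the first item can be sketched roughly as follows. This is a simplified, synchronous approximation rather than the PR's implementation: the `CacheStore` interface is a hypothetical stand-in for the temp file, and the signature of `getCachedOrRun` is assumed.

```typescript
interface CacheEntry {
  cmd: string;
  output: unknown;
}
type CacheFile = Record<string, CacheEntry>;

// Hypothetical storage interface standing in for the temporary file.
interface CacheStore {
  read(): CacheFile;
  write(contents: CacheFile): void;
}

function createCommandCache(store: CacheStore) {
  const inMemory = new Map<string, CacheEntry>();

  return function getCachedOrRun<T>(
    key: string,
    cmd: string,
    run: () => T,
    validate: (output: unknown) => output is T,
  ): T {
    // Tier 1: the in-memory map for this process.
    const memoized = inMemory.get(key);
    if (memoized !== undefined && memoized.cmd === cmd) {
      return memoized.output as T;
    }

    // Tier 2: the file persisted by an earlier step, if any.
    const persisted: CacheEntry | undefined = store.read()[key];
    if (persisted !== undefined && persisted.cmd === cmd) {
      const candidate: unknown = persisted.output;
      if (validate(candidate)) {
        inMemory.set(key, persisted);
        return candidate;
      }
    }

    // Miss: run the command, then populate both tiers.
    const output = run();
    const entry: CacheEntry = { cmd, output };
    inMemory.set(key, entry);
    store.write({ ...store.read(), [key]: entry });
    return output;
  };
}

// Usage: an in-memory "file" shared by two cache instances simulates two steps.
let fileContents: CacheFile = {};
const store: CacheStore = {
  read: () => fileContents,
  write: (contents) => {
    fileContents = contents;
  },
};
let cliRuns = 0;
const runVersion = (): string => {
  cliRuns++;
  return "2.23.0";
};
const isString = (output: unknown): output is string => typeof output === "string";

const stepOne = createCommandCache(store);
stepOne("version", "/path/to/codeql", runVersion, isString);
stepOne("version", "/path/to/codeql", runVersion, isString); // tier 1 hit
const stepTwo = createCommandCache(store);
const fromFile = stepTwo("version", "/path/to/codeql", runVersion, isString); // tier 2 hit
```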

| File | Description |
| --- | --- |
| src/util.ts | Removes the prior in-process/env-var version cache helpers. |
| src/util.test.ts | Removes tests for the old version-caching behavior. |
| src/testing-utils.ts | Updates test setup to reset the new command-output cache between tests. |
| src/status-report.ts | Switches telemetry version lookup to the new cache + `isVersionInfo` guard. |
| src/json/index.ts | Adds `number`, `object`, and `undefinable` validators to support schema checks. |
| src/json/index.test.ts | Adds tests for `undefinable` semantics (rejecting `null`). |
| src/environment.ts | Removes the env var used for the old persisted version cache. |
| src/codeql.ts | Adds caching wrappers/type guards and refactors extractor resolution and JSON parsing. |
| src/cache.ts | New: implements the command-output cache (memo + temp file). |
| src/cache.test.ts | New: tests cache persistence/memo behavior and validation. |
| lib/entry-points.js | Generated output (content excluded by policy; not reviewed). |

Copilot's findings

Files excluded by content exclusion policy (1)

lib/entry-points.js

Files reviewed: 10/11 changed files
Comments generated: 3

mario-campos · 2026-06-18T19:46:02Z

+      return getCachedOrRun(
+        CommandCacheKey.ResolveLanguages,
+        cmd,
+        () =>
+          runCliJson<ResolveLanguagesOutput>(cmd, [


Good catch!

It looks like none of the callers of resolveLanguages actually use or rely on these filtered extractors (html, csv, xml). So, while we could just always include the flag --filter-to-languages-with-queries in the command, not all versions of the CLI would support it. Alas, we would have to raise the CODEQL_MINIMUM_VERSION to v2.23.0 (2025-09-04)—which I'm open to doing, but perhaps unnecessary.

Instead, I opted to include the flag if it's supported. Otherwise, don't. This new implementation is spiritually similar to the current implementation, so I don't expect any surprises from it.

…e support

The output of `resolveLanguages()` can vary based on whether the flag `--filter-to-languages-with-queries` is included, but not all versions of the CLI support that. This makes caching a single execution problematic, so I opted to cache it based on whether it's supported. If it's supported, it's used; otherwise, it's not.
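
The "include the flag only if supported" approach might look roughly like the sketch below. The helper name `buildResolveLanguagesArgs` and the boolean `supportsFilterFlag` parameter are hypothetical, as is the use of `--format=json` here; how the Action actually detects support and which other arguments it passes is not shown in this thread.

```typescript
// Hypothetical helper: builds the argument list for `codeql resolve languages`,
// adding the filter flag only when the CLI is known to support it.
function buildResolveLanguagesArgs(supportsFilterFlag: boolean): string[] {
  const args = ["resolve", "languages", "--format=json"];
  if (supportsFilterFlag) {
    args.push("--filter-to-languages-with-queries");
  }
  return args;
}

const argsForNewerCli = buildResolveLanguagesArgs(true);
const argsForOlderCli = buildResolveLanguagesArgs(false);
```

Because the decision depends only on the CLI's capabilities, every call within a job produces the same arguments, which is what makes a single cached result safe to reuse.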

henrymercer

Thanks for implementing the generic cache mechanism, that's a nice simplification! To make this PR easier to review and get merged, what do you think about splitting this PR into two: the first would implement the generic cache mechanism just for codeql version, and the second would use it to cache codeql resolve languages?

mbg

I haven't finished reviewing everything (particularly the new tests as well as codeql.ts), but I already have a bunch of comments that I think should be addressed. So to not end up with too much noise in one go, I am submitting my review without reviewing those last bits.

The main thing to look at is that the on-disk cache is repeatedly written to and read from, which obviously has unnecessary overheads. It would be better if the on-disk cache was only read once (if it exists) to initialise the in-memory cache. It should then only be written to when the Action is about to exit.

Secondly, it would be nice (but is not essential) if we could avoid having the in-memory cache as global state and instead propagate it between the relevant use sites. Since we already have a CodeQL object, we could probably make the command cache part of it. That would make the data flow more explicit and simplify the tests, which then don't have to clear the global cache and can additionally run in parallel.
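
A rough sketch of the design suggested in the two paragraphs above: the cache is an explicit object, it is seeded once from whatever was on disk, and it is written back once at exit. The class name `CliCommandCache` and the injected `write` callback are hypothetical; file I/O is left to the caller so that the example stays self-contained.

```typescript
class CliCommandCache {
  private readonly entries = new Map<string, unknown>();
  private dirty = false;

  // Restores the in-memory cache once, from contents read at startup.
  constructor(initial: Record<string, unknown> = {}) {
    for (const key of Object.keys(initial)) {
      this.entries.set(key, initial[key]);
    }
  }

  // Serves from memory when possible; otherwise runs and remembers the result.
  getOrRun<T>(key: string, run: () => T): T {
    if (this.entries.has(key)) {
      return this.entries.get(key) as T;
    }
    const output = run();
    this.entries.set(key, output);
    this.dirty = true;
    return output;
  }

  // Intended to be called once, when the Action is about to exit.
  flush(write: (contents: Record<string, unknown>) => void): void {
    if (!this.dirty) {
      return;
    }
    const contents: Record<string, unknown> = {};
    this.entries.forEach((value, key) => {
      contents[key] = value;
    });
    write(contents);
    this.dirty = false;
  }
}

// Usage: `disk` stands in for the on-disk file shared between steps.
let disk: Record<string, unknown> = {};
let commandRuns = 0;
let diskWrites = 0;
const runCommand = (): string => {
  commandRuns++;
  return "2.23.0";
};
const writeToDisk = (contents: Record<string, unknown>): void => {
  disk = contents;
  diskWrites++;
};

const firstStep = new CliCommandCache(disk);
firstStep.getOrRun("version", runCommand);
firstStep.getOrRun("version", runCommand);
firstStep.flush(writeToDisk);

const secondStep = new CliCommandCache(disk);
const versionInSecondStep = secondStep.getOrRun("version", runCommand);
secondStep.flush(writeToDisk); // nothing changed, so nothing is written
```

Since each test can construct its own instance, there is no global state to reset between tests.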

mbg · 2026-06-19T15:38:35Z

+ * Transforms a validator to be optional, accepting `undefined` for an absent
+ * value but, unlike `optional`, rejecting `null`.
+ */
+export function undefinable<T>(validator: Validator<T>) {


Minor: I think this would be better named optional and the existing optional would be better named as optionalOrNull (or something like that). I probably didn't think about this scenario when I named the existing optional.
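
For readers unfamiliar with the distinction, the difference between the two validators can be sketched as follows. The `Validator` type here is a simplified assumption (a plain type guard) and may not match the json module's real definition; only the accept/reject behaviour for `null` is taken from the doc comment above.

```typescript
// Simplified assumption: a validator is just a type guard.
type Validator<T> = (value: unknown) => value is T;

const isString: Validator<string> = (value): value is string =>
  typeof value === "string";

// Accepts `undefined` or a valid T, but rejects `null`.
function undefinable<T>(validator: Validator<T>): Validator<T | undefined> {
  return (value): value is T | undefined =>
    value === undefined || validator(value);
}

// Accepts `undefined`, `null`, or a valid T (the behaviour the doc comment
// attributes to the existing `optional`).
function optional<T>(validator: Validator<T>): Validator<T | null | undefined> {
  return (value): value is T | null | undefined =>
    value === undefined || value === null || validator(value);
}

const undefinableString = undefinable(isString);
const optionalString = optional(isString);
```

Here `undefinableString(null)` is false while `optionalString(null)` is true; both accept `undefined` and any string.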

mbg · 2026-06-19T15:41:00Z

We have several different types of caches that are handled by the CodeQL Action, so the name cache.ts is a bit too generic for this since it suggests to me that cache.ts contains implementations that are shared between all the different types of caches / caching implementations.

How about cli-cache.ts or cli-config-cache.ts? Or, alternatively, we could add a new cli folder and then move cache.ts into it (or call it config-cache.ts).

Fair enough. If it's all the same, I think I'll go with either cli-cache.ts or cli/cache.ts.

mbg · 2026-06-19T15:42:15Z

+};
+
+test("validateSchema - undefinable properties are optional but reject null", async (t) => {
+  // Optional fields may be absent or explicitly undefined


Minor: This comment still refers to "optional" (which is fine if you follow the renaming suggestion in my other comment).

mbg · 2026-06-19T15:43:36Z

-    // a built-in language.
+    // TODO: Delete this `if` condition once CODEQL_MINIMUM_VERSION
+    //       is at least v2.23.0 — the first version to support the
+    //       BuiltinExtractorsSpecifyDefaultQueries feature.


TODO may not be the right word. Really, this comment is meant to serve as a reminder (to the future maintainers) that this part of the code is only necessary for now; so as not to forget to "clean up" these workarounds/fallbacks.

mbg · 2026-06-19T15:44:22Z

-  const resolveResult = await codeql.resolveLanguages({
-    filterToLanguagesWithQueries: resolveSupportedLanguagesUsingCli,
-  });
+  const resolveResult = await codeql.resolveLanguages();


This change seems unrelated to the PR?

It's actually a result of this finding.

mbg · 2026-06-19T15:59:24Z

+  }
+
+  // Tier 2: the temporary file persisted by an earlier step, if any.
+  const entry = readCommandCacheFile()[key] as unknown;


Similar comment as above. I think we should use this to (attempt to) restore the in-memory cache once on "startup" and not call it repeatedly.

mbg · 2026-06-19T16:00:26Z

+  // Tier 1: the in-memory variable.
+  const memoized = inMemoryCache.get(key);
+  if (memoized !== undefined) {
+    return memoized.output as T;
+  }


This seems correct? At least with respect to the cmd aspect?

mbg · 2026-06-19T16:00:51Z

+    (cmd !== undefined && entry.cmd !== cmd) ||
+    (validate !== undefined && !validate(entry.output))
+  ) {
+    return undefined;


Should this be an error or logged?

mbg · 2026-06-19T16:01:42Z

+  }
+
+  // Memoize so subsequent lookups in this process hit tier 1.
+  inMemoryCache.set(key, { cmd: entry.cmd, output: entry.output });


With the current approach where the on-disk file is read/written to repeatedly, shouldn't this be cacheCommandOutput? (Not that I suggest we stick with this approach.)

mbg · 2026-06-19T16:03:13Z

+/**
+ * Clears the in-process memo (tier 1). Only for use in tests, which exercise
+ * multiple "steps" within a single process.
+ */
+export function resetCachedCommandOutputs(): void {
+  inMemoryCache.clear();
+}


In some other files, we export these test-only functions in a nested object to make it clearer that they are for test-use only.

That said, my general preference for global state in the CodeQL Action is to try and make it explicit in the way I have done recently in e.g. #3963

github-actions Bot added the size/S Should be easy to review label Jun 4, 2026

henrymercer reviewed Jun 5, 2026

Comment thread src/codeql.ts Outdated

mbg reviewed Jun 10, 2026

Comment thread src/environment.ts Outdated

Comment thread src/util.ts Outdated

mario-campos added 8 commits June 18, 2026 09:58

Cache the output of codeql resolve languages

311292c

Repeated calls to `resolveLanguages()` will only pay the performance penalty of executing `codeql resolve languages` once.

Reimplement resolveExtractor() as wrapper over resolveLanguages()

6010f85

By wrapping `resolveLanguages()`, which is memoized, we can avoid executing `codeql resolve extractor` several times over the course of an analysis.

Validate numbers, objects, and undefinables in the json module

445107e

This commit adds a `number` validator, an `object` validator, an `isNumber` predicate, and `undefinable()` to test optional-but-not-null properties.

Refactor isVersionInfo() to use `json` module

587fcb3

Refactor CLI executions into helper functions

889ae42

This provides a separation of concerns between the memoization and the execution.

Refactor CLI JSON handling into a dedicated runCliJson function

dc8e1e9

Refactor CLI caching with in-memory and file storage

a602287

Rebased onto main; fixups were needed

b18df17

mario-campos force-pushed the mario-campos/cache-cli-resolve-langs branch from c218fd6 to b18df17 Compare June 18, 2026 15:25

github-actions Bot added size/XL May be very hard to review and removed size/S Should be easy to review labels Jun 18, 2026

Add error handling for undefined extractors in language resolution

553eef0

mario-campos marked this pull request as ready for review June 18, 2026 15:44

mario-campos requested a review from a team as a code owner June 18, 2026 15:44

Copilot AI review requested due to automatic review settings June 18, 2026 15:44

Copilot started reviewing on behalf of mario-campos June 18, 2026 15:45

Copilot AI reviewed Jun 18, 2026

mario-campos force-pushed the mario-campos/cache-cli-resolve-langs branch from 2f4495a to c8e32e4 Compare June 18, 2026 19:46

henrymercer reviewed Jun 19, 2026

mbg requested changes Jun 19, 2026

4 participants
