Cache CLI extractor paths across Actions steps#3950
Conversation
henrymercer
left a comment
There was a problem hiding this comment.
Caching these invocations makes a lot of sense! I have a high level comment and a couple of lower level comments.
The main point is that now that we're caching multiple invocations, it might be a good opportunity to generalise the design. For instance, you could imagine something like:
const versionCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_VERSION_INFO, validate: isVersionInfo });
const resolveLanguagesCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_RESOLVE_LANGUAGES, validate: isResolveLanguagesOutput });where createPersistedCliCache handles memoising in the Action and persisting between Actions steps with an environment variable.
Some smaller things:
- Ideally the cache entry would also depend on
getExtraOptionsFromEnv(["resolve","languages"]) - We should remove the cache in
testing-utils.tslike we do for the CodeQL version cache
mbg
left a comment
There was a problem hiding this comment.
I agree with @henrymercer's comments regarding a more generalised design for this. I am wondering about the use of environment variables here vs using a file on disk. I don't know if you have already considered this, but we store e.g. the Action configuration on disk as a file. Perhaps that would make sense for these cached CLI results as well.
A general point: could we also make sure to add doc comments for new top-level definitions before merging?
Repeated calls to `resolveLanguages()` will only pay the performance penalty of executing `codeql resolve languages` once.
By wrapping `resolveLanguages()`, which is memoized, we can avoid executing `codeql resolve extractor` several times over the course of an analysis.
This commit adds a `number` validator`, an `object` validator, an `isNumber` predicate, and `undefinable()` to test optional-but-not-null properties.
This provides a separation of concerns between the memoization and the execution.
c218fd6 to
b18df17
Compare
|
I've taken your comments into consideration and overhauled the design to be more comprehensive and unified. The design now backs to a temporary file instead of the environment. I also identified a few opportunities to refactor some duplicated code into helper functions. I kept the use of |
There was a problem hiding this comment.
Warning
- Copilot's review of this pull request may be incomplete because some of the changed files are excluded by your Copilot content exclusion settings. See Excluding content from Copilot for details.
Pull request overview
This PR introduces a cross-step cache for selected CodeQL CLI command outputs (notably codeql version and codeql resolve languages) to reduce repeated JVM startups and improve performance across GitHub Actions steps. It also refactors extractor resolution to derive extractor roots from resolve languages (reusing the cached output) and extends the internal JSON validation helpers to support stronger runtime validation of CLI JSON output.
Changes:
- Add a new 2-tier command-output cache (in-memory + temp-file) and wire it into
codeql.tsforversionandresolve languages. - Refactor
resolveExtractor()to useresolveLanguages()rather than invokingcodeql resolve extractor. - Extend
src/jsonvalidation helpers (number/object validators andundefinable) and add unit tests; remove now-obsoleteutil-based version cache.
Show a summary per file
| File | Description |
|---|---|
| src/util.ts | Removes the prior in-process/env-var version cache helpers. |
| src/util.test.ts | Removes tests for the old version-caching behavior. |
| src/testing-utils.ts | Updates test setup to reset the new command-output cache between tests. |
| src/status-report.ts | Switches telemetry version lookup to the new cache + isVersionInfo guard. |
| src/json/index.ts | Adds number, object, and undefinable validators to support schema checks. |
| src/json/index.test.ts | Adds tests for undefinable semantics (rejecting null). |
| src/environment.ts | Removes the env var used for the old persisted version cache. |
| src/codeql.ts | Adds caching wrappers/type guards and refactors extractor resolution and JSON parsing. |
| src/cache.ts | New: implements the command-output cache (memo + temp file). |
| src/cache.test.ts | New: tests cache persistence/memo behavior and validation. |
| lib/entry-points.js | Generated output (content excluded by policy; not reviewed). |
Copilot's findings
Files excluded by content exclusion policy (1)
- lib/entry-points.js
- Files reviewed: 10/11 changed files
- Comments generated: 3
| // Tier 1: the in-memory variable. | ||
| const memoized = inMemoryCache.get(key); | ||
| if (memoized !== undefined) { | ||
| return memoized.output as T; | ||
| } |
There was a problem hiding this comment.
This seems correct? At least with respect to the cmd aspect?
| return getCachedOrRun( | ||
| CommandCacheKey.ResolveLanguages, | ||
| cmd, | ||
| () => | ||
| runCliJson<ResolveLanguagesOutput>(cmd, [ |
There was a problem hiding this comment.
Good catch!
It looks like none of the callers of resolveLanguages actually use or rely on these filtered extractors (html, csv, xml). So, while we could just always include the flag --filter-to-languages-with-queries in the command, not all versions of the CLI would support it. Alas, we would have to raise the CODEQL_MINIMUM_VERSION to v2.23.0 (2025-09-04)—which I'm open to doing, but perhaps unnecessary.
Instead, I opted to include the flag if it's supported. Otherwise, don't. This new implementation is spiritually similar to the current implementation, so I don't expect any surprises from it.
…e support The output of `resolveLanguages()` can vary based on whether the flag `--filter-to-languages-with-queries` is included, but not all versions of the CLI support that. This makes caching a single execution problematic, so I opted to cache it based on whether it's supported. If it's supported, it's used; otherwise, it's not.
2f4495a to
c8e32e4
Compare
henrymercer
left a comment
There was a problem hiding this comment.
Thanks for implementing the generic cache mechanism, that's a nice simplification! To make this PR easier to review and get merged, what do you think about splitting this PR into two: the first would implement the generic cache mechanism just for codeql version, and the second would use it to cache codeql resolve languages?
mbg
left a comment
There was a problem hiding this comment.
I haven't finished reviewing everything (particularly the new tests as well as codeql.ts), but I already have a bunch of comments that I think should be addressed. So to not end up with too much noise in one go, I am submitting my review without reviewing those last bits.
The main thing to look at is that the on-disk cache is repeatedly written to and read from, which obviously has unnecessary overheads. It would be better if the on-disk cache was only read once (if it exists) to initialise the in-memory cache. It should then only be written to when the Action is about to exit.
Secondly, it would be nice (but is not essential) if we could avoid having the in-memory cache as global state and instead propagate it between the relevant use sites. Since we already have a CodeQL object, we could probably make the command cache part of it. That would make the data flow more explicit and simplify the tests, which then don't have to clear the global cache and can additionally run in parallel.
| * Transforms a validator to be optional, accepting `undefined` for an absent | ||
| * value but, unlike `optional`, rejecting `null`. | ||
| */ | ||
| export function undefinable<T>(validator: Validator<T>) { |
There was a problem hiding this comment.
Minor: I think this would be better named optional and the existing optional would be better named as optionalOrNull (or something like that). I probably didn't think about this scenario when I named the existing optional.
There was a problem hiding this comment.
We have several different types of caches that are handled by the CodeQL Action, so the name cache.ts is a bit too generic for this since it suggests to me that cache.ts contains implementations that are shared between all the different types of caches / caching implementations.
How about cli-cache.ts or cli-config-cache.ts? Or, alternatively, we could add a new cli folder and then move cache.ts into it (or call it config-cache.ts).
There was a problem hiding this comment.
Fair enough. If it's all the same, I think I'll go with either cli-cache.ts or cli/cache.ts.
| }; | ||
|
|
||
| test("validateSchema - undefinable properties are optional but reject null", async (t) => { | ||
| // Optional fields may be absent or explicitly undefined |
There was a problem hiding this comment.
Minor: This comment still refers to "optional" (which is fine if you follow the renaming suggestion in my other comment).
| // a built-in language. | ||
| // TODO: Delete this `if` condition once CODEQL_MINIMUM_VERSION | ||
| // is at least v2.23.0 — the first version to support the | ||
| // BuiltinExtractorsSpecifyDefaultQueries feature. |
There was a problem hiding this comment.
TODO may not be the right word. Really, this comment is meant to serve as a reminder (to the future maintainers) that this part of the code is only necessary for now; so as not to forget to "clean up" these workarounds/fallbacks.
| const resolveResult = await codeql.resolveLanguages({ | ||
| filterToLanguagesWithQueries: resolveSupportedLanguagesUsingCli, | ||
| }); | ||
| const resolveResult = await codeql.resolveLanguages(); |
There was a problem hiding this comment.
This change seems unrelated to the PR?
| } | ||
|
|
||
| // Tier 2: the temporary file persisted by an earlier step, if any. | ||
| const entry = readCommandCacheFile()[key] as unknown; |
There was a problem hiding this comment.
Similar comment as above. I think we should use this to (attempt to) restore the in-memory cache once on "startup" and not call it repeatedly.
| // Tier 1: the in-memory variable. | ||
| const memoized = inMemoryCache.get(key); | ||
| if (memoized !== undefined) { | ||
| return memoized.output as T; | ||
| } |
There was a problem hiding this comment.
This seems correct? At least with respect to the cmd aspect?
| (cmd !== undefined && entry.cmd !== cmd) || | ||
| (validate !== undefined && !validate(entry.output)) | ||
| ) { | ||
| return undefined; |
| } | ||
|
|
||
| // Memoize so subsequent lookups in this process hit tier 1. | ||
| inMemoryCache.set(key, { cmd: entry.cmd, output: entry.output }); |
There was a problem hiding this comment.
With the current approach where the on-disk file is read/written to repeatedly, shouldn't this be cacheCommandOutput? (Not that I suggest we stick with this approach.)
| /** | ||
| * Clears the in-process memo (tier 1). Only for use in tests, which exercise | ||
| * multiple "steps" within a single process. | ||
| */ | ||
| export function resetCachedCommandOutputs(): void { | ||
| inMemoryCache.clear(); | ||
| } |
There was a problem hiding this comment.
In some other files, we export these test-only functions in a nested object to make it clearer that they are for test-use only.
That said, my general preference for global state in the CodeQL Action is to try and make it explicit in the way I have done recently in e.g. #3963
Similar to #3943, this PR caches the output of
codeql resolve languages, which contains the paths to the various extractors so that repeated calls toresolveLanguages()are idempotent. Additionally, re-implementresolveExtractor()as a wrapper overresolveLanguages()(to re-use the cached output) rather than shell out tocodeql resolve extractor.In one experiment, I counted seven instances of shelling out to
codeql resolve extractor. When you dig into the code, you can see why:resolveExtractor()is not called often or from many places; But one caller isisTracedLanguage(), which is wrapped byisScannedLanguage(). And these functions are often used in a loop/map over all/some languages. This can explain why we see consecutive executions ofcodeql resolve extractor.In support of the above goals, this PR also adds some additional functions to the
jsonmodule, to enable validation of thecodeql versionoutput.Risk assessment
For internal use only. Please select the risk level of this change:
Which use cases does this change impact?
Workflow types:
dynamicworkflows (Default Setup, Code Quality, ...).Products:
analysis-kinds: code-scanning.analysis-kinds: code-quality.upload-sarifaction.Environments:
github.comand/or GitHub Enterprise Cloud with Data Residency.How did/will you validate this change?
.test.tsfiles).pr-checks).If something goes wrong after this change is released, what are the mitigation and rollback strategies?
How will you know if something goes wrong after this change is released?
Are there any special considerations for merging or releasing this change?
Merge / deployment checklist