Skip to content

fix(codegen): Use setSafe for fixed-width writes into nested collection children whose element count is data-dependent#4549

Merged
mbutrovich merged 3 commits into
apache:mainfrom
mbutrovich:fix_4539
Jun 2, 2026
Merged

fix(codegen): Use setSafe for fixed-width writes into nested collection children whose element count is data-dependent#4549
mbutrovich merged 3 commits into
apache:mainfrom
mbutrovich:fix_4539

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich commented Jun 1, 2026

Which issue does this PR close?

Closes #4539.

Rationale for this change

Routing a map- or array-typed expression through the JVM codegen dispatcher could corrupt the output. For example, map_concat(map(1,'a',2,'b'), map(3,'c')) produced {1->a, 2->b, 0->c} (last key corrupted to 0). The query runs natively with no fallback, so this is a wrong-answer bug.

Root cause: fixed-width scalar writes into nested collection children used Arrow's set, which does not grow the buffer. allocateOutput sizes a vector from numRows, but a collection child's element count is the data-dependent sum of per-row collection sizes, not bounded by numRows. The total is also unknown until the per-row ev.code has run inside the write loop, so the child cannot be presized. Once a row held more entries than the presized child, the write ran off the end of the buffer.

Comet enables Arrow unsafe memory access (arrow.enable_unsafe_memory_access=true, set by NativeBase), so that out-of-bounds set is not checked at runtime: it writes past the child and the value reads back as 0. With bounds checking on (e.g. the kernel unit test), the same write throws IndexOutOfBoundsException. Same bug, two symptoms.

The value side (String) already used setSafe, which is why values survived and keys did not.

What changes are included in this PR?

  • emitWrite takes a nested flag. Fixed-width Boolean/numeric/temporal writes use setSafe when nested and the set fast path otherwise.
  • Call sites:
    • array elements and map key/value pass nested = true (always written at a cumulative index).
    • struct fields propagate the struct's own nested (fields are co-indexed with the struct, so a field is nested exactly when the struct is).
    • the root output stays set (presized to numRows, one scalar per row).
  • Variable-width (String/Binary) and Decimal writes already used setSafe and are unchanged.
  • Test-only: the runKernel kernel harness is lifted into the shared CometCodegenAssertions trait so the new fuzz and the executing kernel tests both use it.

How are these changes tested?

  • Executing kernel tests in CometCodegenSuite that compile a kernel, run a batch, and read the output back via CometVector, covering each vulnerable path:
    • constant-folded map_concat (map),
    • constant-folded array(...) with more elements than the presized child (array),
    • Array<Struct<Int, String>> (struct nested in a collection).
  • Deterministic end-to-end coverage in CometCodegenSuite: 10 cases, each a scalar-seed UDF returning a dynamically sized Array / Map / Struct (varying element type, nulls, empty and large rows), checked against Spark.
  • Randomized output-writer fuzz in CometCodegenFuzzSuite: generates a random nested output type and value, drives it through the writer via runKernel, and asserts the Arrow readback round-trips (CatalystTypeConverters.convertToScala as the oracle), with a guard that fails the test if canHandle rejects every generated shape.
  • Source assertions in CometCodegenSourceSuite: nested fixed-width children emit setSafe, and a top-level scalar output keeps the set fast path.
  • All the executing tests fail (overflow, or a JVM abort under NMT) if the fix is reverted; the fuzz reproduces the same corruption across random nested shapes.

…and must grow dynamically. Add regression tests.
@mbutrovich mbutrovich requested a review from andygrove June 1, 2026 18:23
mbutrovich and others added 2 commits June 1, 2026 18:15
…andomized output-writer fuzz in CometCodegenFuzzSuite, and

  runKernel/runKernelRow0 lifted into the shared CometCodegenAssertions trait. Test-only, stacked on apache#4549.
Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @mbutrovich

@mbutrovich mbutrovich merged commit a0ae768 into apache:main Jun 2, 2026
68 checks passed
@mbutrovich mbutrovich deleted the fix_4539 branch June 2, 2026 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JVM codegen dispatcher miscompiles map-typed (MapType) output

2 participants