fix(codegen): Use setSafe for fixed-width writes into nested collection children whose element count is data-dependent #4549
Merged
…and must grow dynamically. Add regression tests.
…andomized output-writer fuzz in CometCodegenFuzzSuite, and runKernel/runKernelRow0 lifted into the shared CometCodegenAssertions trait. Test-only, stacked on apache#4549.
Which issue does this PR close?
Closes #4539.
Rationale for this change
Routing a map- or array-typed expression through the JVM codegen dispatcher could corrupt the output. For example, `map_concat(map(1,'a',2,'b'), map(3,'c'))` produced `{1->a, 2->b, 0->c}` (last key corrupted to 0). The query runs natively with no fallback, so this is a wrong-answer bug.

Root cause: fixed-width scalar writes into nested collection children used Arrow's `set`, which does not grow the buffer. `allocateOutput` sizes a vector from `numRows`, but a collection child's element count is the data-dependent sum of per-row collection sizes, not bounded by `numRows`. The total is also unknown until the per-row `ev.code` has run inside the write loop, so the child cannot be presized. Once a row held more entries than the presized child, the write ran off the end of the buffer.

Comet enables Arrow unsafe memory access (`arrow.enable_unsafe_memory_access=true`, set by `NativeBase`), so that out-of-bounds `set` is not checked at runtime: it writes past the child and the value reads back as 0. With bounds checking on (e.g. the kernel unit test), the same write throws `IndexOutOfBoundsException`. Same bug, two symptoms.

The value side (String) already used `setSafe`, which is why values survived and keys did not.

What changes are included in this PR?
- `emitWrite` takes a `nested` flag. Fixed-width Boolean/numeric/temporal writes use `setSafe` when nested and the `set` fast path otherwise.
- … `nested = true` (always written at a cumulative index).
- … `nested` (fields are co-indexed with the struct, so a field is nested exactly when the struct is).
- … `set` (presized to `numRows`, one scalar per row).
- … `setSafe` and are unchanged.
- … `runKernel` kernel harness is lifted into the shared `CometCodegenAssertions` trait so the new fuzz and the executing kernel tests both use it.

How are these changes tested?

- … `CometCodegenSuite` that compile a kernel, run a batch, and read the output back via `CometVector`, covering each vulnerable path: `map_concat` (map), `array(...)` with more elements than the presized child (array), `Array<Struct<Int, String>>` (struct nested in a collection).
- `CometCodegenSuite`: 10 cases, each a scalar-seed UDF returning a dynamically sized `Array`/`Map`/`Struct` (varying element type, nulls, empty and large rows), checked against Spark.
- `CometCodegenFuzzSuite`: generates a random nested output type and value, drives it through the writer via `runKernel`, and asserts the Arrow readback round-trips (`CatalystTypeConverters.convertToScala` as the oracle), with a guard that fails the test if `canHandle` rejects every generated shape.
- `CometCodegenSourceSuite`: nested fixed-width children emit `setSafe`, and a top-level scalar output keeps the `set` fast path.
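
For readers who want to see the failure mode in isolation, here is a minimal sketch in plain Java. It does not use Arrow or any Comet code: `ToyIntVector`, `writeKeys`, and `NestedWriteSketch` are hypothetical names invented for illustration, and the toy `set` always bounds-checks. That corresponds to the bounds-checked symptom (`IndexOutOfBoundsException`), not the unchecked write that reads back as 0. The sketch assumes a child vector presized from the row count and written at a cumulative index across rows, as the rationale describes.

```java
import java.util.Arrays;

final class NestedWriteSketch {

    /** Toy stand-in for a fixed-width vector. Hypothetical; not the Arrow API. */
    static final class ToyIntVector {
        private int[] data;

        ToyIntVector(int capacity) {
            data = new int[capacity];
        }

        /** Models a non-growing write: an index past the capacity is an error. */
        void set(int index, int value) {
            if (index >= data.length) {
                throw new IndexOutOfBoundsException(
                    "index " + index + " >= capacity " + data.length);
            }
            data[index] = value;
        }

        /** Models a growing write: enlarge the buffer first, then write. */
        void setSafe(int index, int value) {
            if (index >= data.length) {
                data = Arrays.copyOf(data, Math.max(index + 1, data.length * 2));
            }
            data[index] = value;
        }

        int get(int index) {
            return data[index];
        }
    }

    /**
     * Writes every key of every row into one child vector at a cumulative index.
     * The child is presized from the row count, which does not bound the total
     * number of keys. The nested flag selects the growing write.
     */
    static int[] writeKeys(int[][] rows, boolean nested) {
        ToyIntVector child = new ToyIntVector(rows.length); // presized from numRows
        int cursor = 0; // cumulative element index across all rows
        for (int[] row : rows) {
            for (int key : row) {
                if (nested) {
                    child.setSafe(cursor, key);
                } else {
                    child.set(cursor, key);
                }
                cursor++;
            }
        }
        int[] out = new int[cursor];
        for (int i = 0; i < cursor; i++) {
            out[i] = child.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] oneRow = { {1, 2, 3} }; // one row holding three map keys
        System.out.println(Arrays.toString(writeKeys(oneRow, true))); // [1, 2, 3]
        try {
            writeKeys(oneRow, false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("non-growing write overflowed: " + e.getMessage());
        }
    }
}
```

With one row and three keys the child is presized to capacity 1, so the second key already overflows on the non-growing path, while the growing path keeps all three keys.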