GH-48701: [C++][Parquet] Add ALP encoding #48345
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? See also:
I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations
In this case I would recommend https://github.com/apache/parquet-testing
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
https://github.com/apache/arrow/blob/main/cpp/src/parquet/parquet.thrift#L631 needs to be updated here and in parquet-format.
For parquet-format we have this PR : apache/parquet-format#557
Thanks @prtkgaur -- it is super exciting to see this movement. Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review. I started the CI checks on this PR and had some comments about the testing.
std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100
// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {
Using this over resize gave us around 2-3% performance improvement
Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>
Also ensure that no line exceeds 90 characters
Add float32 (FLOAT) coverage to the ALP encoding test pipeline:
- Generator: WriteAlpParquetFloat() and WriteExpectCsvFloat() produce float32 parquet files by casting double source data to float
- Tests: ReadFloatsFromCSV() and AssertTableMatchesCSVFloat() with bit-exact uint32 comparison for 4 new test cases (C++ and Java generated float parquets for spotify1 and arade datasets)
- Fix missing #include <optional> in alp.h
Add static_assert(ARROW_LITTLE_ENDIAN) in alp.cc and alp_wrapper.cc to guard memcpy-based integer serialization, as requested by emkornfield. Strengthen DecodeAlp buffer validation: before reading each vector, verify that enough buffer remains for both metadata and data sections. Previously only the offset itself was bounds-checked, which could allow out-of-bounds reads from crafted compressed input.
- Rename AlpEncodingPreset -> AlpEncodingParameters across all files
- Rename AlpWrapper -> AlpCodec across all files
- Rename CreateEncodingPreset -> CreateEncodingParameters
- Replace std::ceil(.../8.0) with bit_util::BytesForBits()
- Simplify BitUnpackIntegers: single unpack() call instead of manual batch splitting
- Replace std::memcpy with SafeCopy in DecodeVector for consistency
- Move out parameters to last position in Decode() and DecodeAlp()
- Change Decode num_elements from uint32_t to int32_t
- Change GetMaxCompressedSize to int64_t, rename param to uncompressed_size
- Change CompressionProgress/DecompressionProgress fields to int64_t
- Change AlpEncodingParameters::best_compressed_size to int64_t
- Spell out ALP (Adaptive Lossless floating-Point) in alp_constants.h with spec link
- Add ALP paper citation for sampling constants
- Remove ALP from Compression::type enum (it's an encoding, not a compressor)
- Add validation in AlpDecoder::SetData for num_values>0 with len<=0
- Document GetMaxCompressedSize() in Encode/EncodeWithPreset comp param
- Fix narrowing warnings in alp_wrapper.cc (uint64_t -> int64_t casts for CompressionProgress/DecompressionProgress fields)
- Fix narrowing in alp.cc (best_compressed_size_bytes uint64_t -> int64_t)
- Fix remaining Decode() call sites in alp_test.cc to use new parameter order (num_elements, comp, comp_size, output)
@@ -565,6 +565,12 @@ if(ARROW_WITH_ZSTD)
  list(APPEND ARROW_UTIL_SRCS util/compression_zstd.cc)
endif()
For the reviewer, rough stats about this PR:

| Category | Lines | Share |
| --- | --- | --- |
| Production code | ~2,449 | ~39% |
| Test code | ~1,151+ | ~18% |
| Benchmark | ~1,824 | ~29% |
| Documentation | ~897 | ~14% |
min_encoded_value = std::min(encoded_value, min_encoded_value);
  continue;
}
num_exceptions++;
Removed the num_non_exceptions counter from the loop and derive it as input_vector.size() - num_exceptions after the loop instead
const ExactType delta = (static_cast<ExactType>(max_encoded_value) -
                         static_cast<ExactType>(min_encoded_value));

const uint32_t estimated_bits_per_value =
Done — replaced both std::ceil(std::log2(delta + 1)) in EstimateCompressedSize and the hand-rolled __builtin_clz/__builtin_clzll in BitPackIntegers with bit_util::NumRequiredBits(). The existing uint64_t overload handles both float (uint32_t delta) and double (uint64_t delta) correctly via implicit widening, so an int32 version isn't strictly needed here, but happy to add one if you think it's worthwhile for the broader codebase.
// N of appearances is irrelevant at this phase; we search for best compression.
AlpCombination best_combination{best_encoding_options, 0, best_total_bits};
// Try all combinations to find the one which minimizes compression size.
for (uint8_t exp_idx = 0; exp_idx <= Constants::kMaxExponent; exp_idx++) {
Done — extracted the inner loops into a find_best_for_vector lambda. The outer loop is now flat: auto [best, size_bits] = find_best_for_vector(sampled_vector)
std::vector<AlpCombination> best_k_combinations;
best_k_combinations.reserve(
    std::min(best_k_combinations_hash.size(), kMaxCombinationCount));
for (const auto& combination : best_k_combinations_hash) {
Done — applied structured bindings. This is actually C++17 (which Arrow already uses), so no version bump needed.
// (SIMD128/NEON, SIMD256/AVX2, SIMD512/AVX512) have identical batch sizes:
// - uint32_t (float): Simd*UnpackerForWidth::kValuesUnpacked = 32
// - uint64_t (double): Simd*UnpackerForWidth::kValuesUnpacked = 64
// These constants are in anonymous namespaces (internal implementation detail),
Simplified — Arrow's unpack() already handles arbitrary batch sizes internally (runs SIMD for complete batches, then unpack_exact for the remainder), so we no longer need to manually split or hardcode batch sizes. The function is now a single unpack() call.
const ExactType* data = input_vector.data();
const ExactType frame_of_ref = for_info.frame_of_reference;

#pragma GCC unroll AlpConstants::kLoopUnrolls
The #pragma GCC unroll with ivdep already generates efficient SIMD code — the compiler auto-vectorizes these loops. Hand-unrolling with a float/double-specific factor would couple the code to specific SIMD widths and add complexity. Happy to benchmark and revisit if there's measurable room for improvement.
const ExactType unfored_value = data[i] + frame_of_ref;
// 2. Reinterpret as signed integer for decode
SignedExactType signed_value;
std::memcpy(&signed_value, &unfored_value, sizeof(SignedExactType));
Done — replaced with SafeCopy for consistency with EncodeVector.
arrow::util::span<const uint16_t> exception_positions) {
// Exceptions Patching.
uint64_t exception_idx = 0;
#pragma GCC unroll AlpConstants::kLoopUnrolls
The trade-off is instruction-level parallelism vs. code size/register pressure. kLoopUnrolls = 4 is a reasonable middle ground for both x86 (wide OoO) and ARM (narrower). Lower values (1-2) under-utilize the pipeline; higher values (8+) increase I-cache pressure. We profiled with 2, 4, and 8 during development — 4 was consistently best across platforms. The constant is in alp_constants.h so it's easy to tune per-platform if needed.
// Exceptions Patching.
uint64_t exception_idx = 0;
#pragma GCC unroll AlpConstants::kLoopUnrolls
#pragma GCC ivdep
Without ivdep, the compiler may assume output_vector[i] and data[i] alias (both are pointer-accessed arrays of same-sized types), which prevents vectorization of the fused unFOR+decode loop. In practice they never alias: data comes from a local StaticVector and output_vector is the caller's output buffer. ivdep tells the compiler this is safe to vectorize.
Break the fused encode→decode→compare loop into three separate passes over batches of 8 elements. The encode and decode passes are now independent loops that the compiler can vectorize (FastRound uses the magic number trick, not std::lround, so it is SIMD-friendly). A scalar tail handles the remainder.
emkornfield left a comment
Tried to take another pass through; it seems like some of my old comments might have been unaddressed. In particular, do you plan on hooking up config so the encoder/decoder can be used on real parquet files in this PR, or in a separate one?
const ExactType* data = input_vector.data();
const ExactType frame_of_ref = for_info.frame_of_reference;

#pragma GCC unroll AlpConstants::kLoopUnrolls
Yeah, I think my main issue here is code consistency/maintainability. I think xsimd might be the preferred maintenance route. I will yield to @pitrou for his guidance on how we should structure these optimizations.
elseif(ARROW_CPU_FLAG STREQUAL "aarch64")
  # Arm64 compiler flags, gcc/clang only
  set(ARROW_ARMV8_MARCH "armv8-a")
  set(ARROW_ARMV8_MARCH "native")
I'm not clear on why this change?
I was trying to get an idea of performance (iterate). Reverted
if(NOT MSVC)
  set(C_RELEASE_FLAGS "")
  if(CMAKE_C_FLAGS_RELEASE MATCHES "-O3")
    string(APPEND C_RELEASE_FLAGS " -O2")
It seems these are downgraded in more cases below. I think we should probably leave changing these flags to a separate PR, so someone with more knowledge on why -O2 is used by default can chime in (and we can better track changes here).
Reverted. I might want to add this to the README at some point.
uint16_t num_exceptions = 0;

/// Size of the serialized portion (4 bytes, fixed)
static constexpr uint64_t kStoredSize = 4;
nit: use sizeof(exponent) + sizeof(factor) + sizeof(num_exceptions) to make this clear (and then assert that it equals 4).
} else if (vector_index == num_full_vectors && remainder > 0) {
  this_vector_elements = static_cast<uint16_t>(remainder);
} else {
  this_vector_elements = 0;
should this exit early? or be marked as unreachable?
const char* ptr = vector_start + AlpEncodedVectorInfo::kStoredSize;

// Decode based on integer encoding type
switch (integer_encoding) {
instead of switch here for now consider refactoring to guards that return early on invalid data (avoids a nesting level).
const uint64_t data_remaining = comp_size - static_cast<uint64_t>(ptr - comp);
const uint64_t data_size =
    for_info.GetDataStoredSize(this_vector_elements, alp_info.num_exceptions);
if (data_size > data_remaining) {
Is there a reason to not push these guards into the alp.cc code? so the abstraction could be logically consistent?
void Put(const T* buffer, int num_values) override {
  if (num_values > 0) {
    PARQUET_THROW_NOT_OK(
This is encoding, not decoding I think. Reasons:
- Reduce peak memory
- Give better estimates for encoded size (i.e. the return value of EstimatedDataEncodedSize).
I left a comment on a previous review about decoding directly, which it seems this comment refers to. The default page size is currently 1MB on writes; this isn't trivial. One could argue this should be lowered to something more reasonable, but in general I don't think we want to place a hard assumption on the 10s of KBs. Did you quantify how marginal? I understand the FP arithmetic is the primary bottleneck, but in most scenarios I've seen, extra mem-copies have a reasonably high impact on perf (e.g. ~5%).
std::shared_ptr<Buffer> buf = encoder->FlushValues();

for (auto _ : state) {
  auto decoder = MakeTypedDecoder<DoubleType>(Encoding::ALP);
nit: move this out of the benchmark loop?
… byte spans
- Change constants and size methods from uint64_t to int64_t per Arrow style
- Add clarifying comment that kAlpVectorSize is implementation default (format supports other power-of-2 sizes via log_vector_size)
- Spell out "Adaptive Lossless floating-Point (ALP)" in AlpConstants class doc
- Use sizeof(fields) + static_assert for AlpEncodedVectorInfo::kStoredSize
- Fix diagram comment: "Interleaved" → "Concatenated"
- Change Store/Load span<char> APIs to span<uint8_t> for byte representation
Per review feedback, avoid repeated allocation of the decoder object inside the benchmark timing loop.
Per review feedback, the file name should match the class name (AlpCodec). Updated all includes and CMakeLists.txt references.
Per review feedback, these changes should be in a separate PR where someone with more knowledge on why -O2 is the default can weigh in.
Address reviewer safety concerns: all Load methods and decode-path checks that operate on untrusted data now return Result<T> or Status::Invalid instead of aborting via ARROW_CHECK. Changes:
- AlpEncodedVectorInfo::Load -> Result<AlpEncodedVectorInfo>
- AlpEncodedForVectorInfo::Load -> Result<AlpEncodedForVectorInfo>
- AlpEncodedVector::Load -> Result<AlpEncodedVector>
- AlpEncodedVectorView::LoadView -> Result<AlpEncodedVectorView>
- AlpEncodedVectorView::LoadViewDataOnly -> Result<AlpEncodedVectorView>
- AlpMetadataCache::Load -> Result<AlpMetadataCache>
- AlpHeader::GetVectorNumElements -> Result<uint16_t>
- Add OOM guard before vector allocation in DecodeAlp
- Convert unreachable vector index case to Status::Invalid
- Refactor switch to early-return guard in DecodeAlp
- Keep ARROW_CHECK on encode paths (internal invariants)
Document the buffer size precondition in the header as requested by reviewer
Convert ALP code from unsigned types (uint32_t, uint64_t, uint16_t) to signed types (int32_t, int64_t, int16_t) following Arrow codebase conventions where int64_t is used for sizes/counts and int32_t at Parquet page level. Unsigned types are retained only where semantically required: bit patterns (ExactType/FloatingToExact), bit widths (uint8_t), and wire format offsets (OffsetType = uint32_t). static_cast is used only at system boundaries (span construction from signed sizes, container .size() comparisons).
- DecodeVector: move output_vector to last parameter
- PatchExceptions: move output to last parameter
- EncodeWithPreset: move preset (input) before comp/comp_size (output)
- Remove unused enforce_mode parameter from Encode()
- Remove <optional> include no longer needed
@prtkgaur is this ready for another round of reviews?
Add ALP = 10 to the Encoding enum to match the parquet-format spec update (apache/parquet-format PR pending).
…crash paths
- Rename decomp/comp/decomp_size/comp_size to input/output/input_size/output_size in AlpCodec public and private APIs per reviewer naming feedback.
- Remove ARROW_CHECK(false) default branches in integer encoding switches. Only kForBitPack is supported; validation happens at the API boundary in AlpCodec::Decode (returns Status::Invalid). Internal functions now execute the kForBitPack path directly without switching.
The previous code used memcpy(&header.compression_mode, ..., 3) to read/write the three uint8_t header fields in one shot, which relies on struct layout having no padding between them. Copy each field individually to be safe. No perf impact — this runs once per page.
Remove the hardcoded kAlpVectorSize=1024 constraint from the decode path so the implementation can decompress pages written with any valid power-of-2 vector size (up to 2^kMaxLogVectorSize = 32768). The encode path still writes with vector_size=1024.
- Replace all StaticVector<T, kAlpVectorSize> with std::vector<T> in structs (AlpEncodedVector, AlpEncodedVectorView, EncodingResult, BitPackingResult) and local variables
- Update validation bounds from kAlpVectorSize to (1 << kMaxLogVectorSize)
- Remove vector_size != kAlpVectorSize rejection in AlpCodec::Decode
- Guard EncodeVector against empty input (pre-existing UB now exposed)
- Drop small_vector.h includes (no longer needed)
Following the DeltaBitPackEncoder pattern, the AlpEncoder constructor
now accepts an optional vector_size parameter (default 1024). AlpCodec
public APIs (Encode, EncodeWithPreset, GetMaxCompressedSize) and the
private EncodeAlp all accept and plumb through the vector_size. The
chosen size is stored in the existing AlpHeader.log_vector_size field.
Adds round-trip tests at vector sizes {64, 512, 1024, 2048, 4096}
with sub-vector, exact, multiple, and remainder data sizes, plus
validation death tests for invalid inputs.
Encodes float and double data with AlpCodec at vector sizes {64, 512,
2048, 4096} across 4 data-size categories, then decodes through the
parquet AlpDecoder to verify it correctly reads log_vector_size from
the ALP header.
…n up naming
- Replace size_t with int64_t in AlpCodec public API per Arrow convention
- Move output params (output, output_size) to end of Encode/EncodeWithPreset
- Split Encode into explicit vector_size overload + convenience default
- Rename DecodeAlp param output_element_count → num_elements
- Add \pre precondition docs to public API methods
- Update all callers in tests, encoder, and reference blob generator
Co-authored-by: Dhirhan Kanesalingam <dhirhan17@gmail.com>
@Reviewer: Suggested order in which to look at the code while reviewing (outdated, will update shortly).
Rationale for this change
ALP significantly improves the compression ratio and decompression speed of float/double columns over other encoding/compression techniques.
Spec
This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.
Parquet Format PR
Dataset PR (parquet-testing)
apache/parquet-testing#100
What changes are included in this PR?
This PR:
Introduces ALP (pseudo-decimal) encoding into the C++ Arrow code.
We also provide benchmarks and a dataset to demonstrate the effectiveness of the above algorithm.
Adding the above required us to add the following classes.
Integration of the above code was done in
Are these changes tested?
Unit tests
Benchmark tests
Are there any user-facing changes?
DuckDB