Skip to content

fix: preserve zero-length buffers in binary copy compaction#6992

Merged
yanghua merged 2 commits into
lance-format:mainfrom
zhangyue19921010:fix-binary-copy-zero-buffer
Jun 3, 2026
Merged

fix: preserve zero-length buffers in binary copy compaction#6992
yanghua merged 2 commits into
lance-format:mainfrom
zhangyue19921010:fix-binary-copy-zero-buffer

Conversation

@zhangyue19921010

Copy link
Copy Markdown
Contributor

closes: #6991

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 29, 2026
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 37.50000% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/optimize/binary_copy.rs 37.50% 18 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@yanghua

yanghua commented May 31, 2026

Copy link
Copy Markdown
Collaborator

@claude review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — targeted fix with a regression test.

Extended reasoning...

Overview

This PR fixes a bug in binary copy compaction where zero-length buffers (e.g., a column of all empty strings) caused failures during compaction and subsequent scalar index creation. The change touches rust/lance/src/dataset/optimize/binary_copy.rs in two analogous places (page-level buffer loop and column-level buffer loop) and adds a regression test in tests/binary_copy.rs.

Security risks

None apparent. This is internal data-file rewriting logic with no external input boundary, no auth/crypto/permissions, and no new I/O surface.

Level of scrutiny

Moderate — this is on the compaction write path (not user-facing API), but the change is small (~48 lines) and the logic is symmetric in both spots: filter zero-size entries before submit_request so the scheduler isn't asked for 0-byte ranges, then on the write side iterate the original buffer_offsets_and_sizes and emit (start, 0) placeholders for zero-size entries without consuming from bytes_iter. This preserves buffer count and ordering in the resulting page/column metadata.

Other factors

A regression test was added that exercises the original failure mode (4000 empty strings → ForceBinaryCopy compaction → create_index(Scalar)), run across all non-legacy file versions. Codecov flagged low patch coverage (32%) but the missing lines appear to be the defensive Error::execution branches for the bytes_iter.next() mismatch — those are unreachable in normal operation by construction. The PR has already been merged as 6bd78b5.

@yanghua yanghua left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments—otherwise, LGTM.

}
}

async fn do_test_binary_copy_empty_string_scalar_index(version: LanceFileVersion) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also scan the data and assert the data correctness?

} else {
let bytes = bytes_iter.next().ok_or_else(|| {
Error::execution(
"binary copy: missing page buffer bytes while rewriting data file",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add more context in the error msg? e.g. col_idx , page index, buffer...

@zhangyue19921010

Copy link
Copy Markdown
Contributor Author

Hi @yanghua Thanks a lot for your review. All comments are addressed. PTAL~

@yanghua yanghua left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@yanghua yanghua merged commit bf51273 into lance-format:main Jun 3, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary Copy Compaction Drops Zero-Length Buffers In V2.0

2 participants