perf(search): reuse IVF_RQ query scratch#6982

Merged: Xuanwo merged 1 commit into main from yang/ivfrq-query-scratch on Jun 1, 2026

BubbleCal (Contributor) commented May 28, 2026

Performance Improvement

IVF_RQ queries currently allocate several temporary buffers on the hot path, including rotated query vectors, distance tables, and residual query arrays. These allocations happen repeatedly during partition search and show up in latency-sensitive vector search workloads.

This PR adds runtime query scratch reuse for the IVF_RQ flat search path:

  • Create a runtime QueryScratchPool when an IVFIndex is loaded or reconstructed; it is not part of serialized/cached IvfIndexState.
  • Keep scratch lifetime tied to the loaded index, so different indexes have independent scratch pools.
  • Pre-size num_compute_intensive_cpus() scratch entries from per-index dimensions and partition sizes.
  • Use a non-blocking pool strategy: if the pool is empty, allocate a same-sized temporary scratch that is not returned to the pool.
  • Avoid residual query array allocation for Float32 RQ by passing the centroid into the RQ distance calculator and rotating q - c into scratch directly.
  • Preserve non-Float32 residual semantics by falling back to typed Arrow residual preprocessing instead of doing f32-before-subtract residual math.
  • Reuse RQ rotated query, distance table, quantized table, and distance output buffers through QueryScratch.
  • Include scratch pool capacity in DeepSizeOf, including checked-out scratch slots.
  • Avoid clearing the RQ distance output buffer before the batch distance kernel overwrites it.
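The non-blocking pool strategy in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Scratch`, `ScratchPool`, `ScratchGuard`, and their fields are hypothetical stand-ins for `QueryScratch`, `QueryScratchPool`, and `QueryScratchGuard`, and the short `Mutex` critical section is an assumption about how slots might be guarded.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::Mutex;

/// Reusable per-query buffers (hypothetical fields, for illustration only).
#[derive(Debug, Default)]
pub struct Scratch {
    pub rotated_query: Vec<f32>,
    pub dist_table: Vec<f32>,
}

impl Scratch {
    fn with_capacity(dim: usize, table_len: usize) -> Self {
        Self {
            rotated_query: Vec::with_capacity(dim),
            dist_table: Vec::with_capacity(table_len),
        }
    }
}

/// A fixed set of pre-sized scratch slots owned by one loaded index.
pub struct ScratchPool {
    slots: Mutex<Vec<Scratch>>,
    dim: usize,
    table_len: usize,
}

impl ScratchPool {
    pub fn new(entries: usize, dim: usize, table_len: usize) -> Self {
        let slots = (0..entries)
            .map(|_| Scratch::with_capacity(dim, table_len))
            .collect();
        Self { slots: Mutex::new(slots), dim, table_len }
    }

    /// Never waits for a slot to be returned: if the pool is empty, the
    /// caller gets a same-sized temporary scratch that is dropped afterwards.
    pub fn checkout(&self) -> ScratchGuard<'_> {
        let pooled = self.slots.lock().unwrap().pop();
        match pooled {
            Some(scratch) => ScratchGuard { pool: self, scratch: Some(scratch), pooled: true },
            None => ScratchGuard {
                pool: self,
                scratch: Some(Scratch::with_capacity(self.dim, self.table_len)),
                pooled: false,
            },
        }
    }

    /// Number of slots currently sitting in the pool.
    pub fn available(&self) -> usize {
        self.slots.lock().unwrap().len()
    }
}

/// Returns a pooled scratch to its pool on drop; temporaries are freed.
pub struct ScratchGuard<'a> {
    pool: &'a ScratchPool,
    scratch: Option<Scratch>,
    pooled: bool,
}

impl ScratchGuard<'_> {
    pub fn is_pooled(&self) -> bool {
        self.pooled
    }
}

impl Deref for ScratchGuard<'_> {
    type Target = Scratch;
    fn deref(&self) -> &Scratch {
        self.scratch.as_ref().expect("scratch is present until drop")
    }
}

impl DerefMut for ScratchGuard<'_> {
    fn deref_mut(&mut self) -> &mut Scratch {
        self.scratch.as_mut().expect("scratch is present until drop")
    }
}

impl Drop for ScratchGuard<'_> {
    fn drop(&mut self) {
        if let Some(scratch) = self.scratch.take() {
            if self.pooled {
                self.pool.slots.lock().unwrap().push(scratch);
            }
        }
    }
}
```

In this sketch a returned scratch keeps its old contents and capacity, so a caller would clear or overwrite each buffer before reading from it. Because temporaries are never pushed back, the pool cannot grow beyond its initial entry count.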

Benchmarks

GCP VM: yang-agent-ae1f-ivfrq-rebase-20260528 (c4-standard-16)

Workload: lance_ivfrq, sift1m, num_bits=1, target_partition_size=4096, k=10. IVF_PQ was not included.

Baseline: main at 5cf70b27b3ad38ecdcd1547b7af385e05f67598a
Current PR head: 8476110dd4dd018bba05b8320ff999920a88288e

| max_threads | query | avg latency | p99 latency | QPS |
| --- | --- | --- | --- | --- |
| 1 | nprobes=8 | 0.803 ms -> 0.772 ms (-3.80%) | 0.977 ms -> 0.893 ms (-8.59%) | 1222.64 -> 1269.49 (+3.83%) |
| 1 | nprobes=24, refine_factor=null | 1.040 ms -> 0.973 ms (-6.44%) | 1.199 ms -> 1.094 ms (-8.83%) | 948.73 -> 1014.18 (+6.90%) |
| 16 | nprobes=8 | 7.014 ms -> 6.835 ms (-2.55%) | 11.411 ms -> 11.347 ms (-0.57%) | 2228.36 -> 2285.86 (+2.58%) |
| 16 | nprobes=24, refine_factor=null | 7.268 ms -> 7.009 ms (-3.57%) | 11.791 ms -> 11.421 ms (-3.13%) | 2140.48 -> 2217.49 (+3.60%) |

Validation

  • cargo fmt --all -- --check
  • cargo clippy --all --tests --benches -- -D warnings
  • cargo test -p lance-index vector::storage::tests::test_query_scratch_pool -- --nocapture
  • cargo test -p lance-index vector::bq::storage::tests::test_dist_calculator_with_scratch -- --nocapture
  • cargo test -p lance index::vector::ivf::v2::tests:: -- --nocapture

BubbleCal force-pushed the yang/ivfrq-query-scratch branch from 9788130 to 8476110 on May 28, 2026 16:14
BubbleCal changed the title from "perf(search): reuse ivfrq query scratch" to "perf(search): reuse IVF_RQ query scratch" on Jun 1, 2026
BubbleCal marked this pull request as ready for review on June 1, 2026 05:37
Xuanwo (Collaborator) commented Jun 1, 2026

@claude review once

claude Bot left a comment

Substantial perf refactor — let me know if you'd like me to take a closer look at any particular area.

Overview

This PR refactors the IVF_RQ flat search hot path to reuse per-query scratch buffers via a new QueryScratchPool, avoiding several per-query allocations (rotated query vectors, distance tables, residual arrays). It touches 7 files and ~700+ lines of code, including: new trait methods on VectorStore (dist_calculator_with_scratch) and IvfSubIndex (search_with_scratch, accumulate_topk_with_scratch), new types in storage.rs (QueryScratch, QueryScratchPool, QueryScratchCapacity, QueryScratchGuard, QueryResidual), residual-aware rotation paths in bq/storage.rs, and integration in IVFIndex (v2.rs). The benchmarks show modest but real wins (3–9% latency reductions on SIFT1M).

Security risks

No security-sensitive surface here — this is internal indexing code with no auth/crypto/input parsing changes. The main correctness risks are around memory safety: the PR introduces several new unsafe { Vec::set_len } calls in the distance-all path and a SIMD store_unaligned in copy_subtract_f32. The SAFETY comments are reasonable (the scalar fallback in sum_4bit_dist_table_scalar was explicitly updated to fill(0) before saturating_add so the overwrite contract holds on all paths), but these invariants are non-trivial and worth a second look.
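To make the overwrite contract behind those `set_len` calls concrete, here is a small sketch of the general pattern: reserve capacity, write every slot, and only then extend the length. It is not taken from the PR. The function name `distances_into` and the plain L2 kernel are assumptions, and the sketch writes through `Vec::spare_capacity_mut` so that no uninitialized `f32` is ever read.

```rust
/// Fills `out` with one squared L2 distance per row without zeroing it first.
/// `rows` is a flat row-major buffer whose length is a multiple of `query.len()`.
pub fn distances_into(out: &mut Vec<f32>, query: &[f32], rows: &[f32]) {
    let dim = query.len();
    assert!(dim > 0 && rows.len() % dim == 0, "rows must hold whole vectors");
    let n = rows.len() / dim;

    // Drop old contents but keep the allocation, then make room for `n` values.
    out.clear();
    out.reserve(n);

    // The spare capacity is uninitialized memory; each slot is written once.
    let spare = out.spare_capacity_mut();
    for (slot, row) in spare.iter_mut().zip(rows.chunks_exact(dim)) {
        let d: f32 = row.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
        slot.write(d);
    }

    // SAFETY: `reserve(n)` guarantees capacity for `n` elements beyond the
    // current length of 0, and the loop above wrote exactly the first `n`
    // slots because `chunks_exact(dim)` yields `n` rows.
    unsafe { out.set_len(n) };
}
```

The safety argument only holds if every code path, including any scalar fallback, writes all `n` slots before `set_len` runs; that is the invariant the review asks a human to double-check.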

Level of scrutiny

This warrants careful human review. It's a substantial refactor of performance-critical code, not a mechanical change. Specific things a human should weigh: (1) the new unsafe set_len invariants in RabitDistCalculator::distance_all_with_scratch and quantize_dist_table_into; (2) the lifetime/borrow design around Cow<'a, [f32]> in RabitDistCalculator and the f32_scratch: &'a mut Vec<f32> borrowed inside the calculator; (3) the new trait surface (default impls forwarding to the non-scratch variants look correct, but this is a public-ish trait API expansion); (4) the use_residual_scratch gating, which only enables the residual fast path when centroids are Float32 — the Float64/Float16 residual path is preserved via the typed Arrow subtract.
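Point (3) above, default implementations that forward to the non-scratch variants, might look roughly like the sketch below. The trait `Store`, its methods, and both implementors are hypothetical and far simpler than `VectorStore` or `IvfSubIndex`; the sketch only illustrates why adding a defaulted method leaves existing implementors compiling unchanged.

```rust
/// Hypothetical scratch type holding one reusable output buffer.
#[derive(Default)]
pub struct Scratch {
    pub distances: Vec<f32>,
}

fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

pub trait Store {
    fn rows(&self) -> &[Vec<f32>];

    /// Original entry point: allocates a fresh output vector per call.
    fn distances(&self, query: &[f32]) -> Vec<f32> {
        self.rows().iter().map(|r| l2(r, query)).collect()
    }

    /// Newer entry point. The default forwards to the allocating variant,
    /// so implementors that ignore scratch keep their old behavior.
    fn distances_with_scratch<'a>(&self, query: &[f32], scratch: &'a mut Scratch) -> &'a [f32] {
        scratch.distances = self.distances(query);
        &scratch.distances
    }
}

/// Relies on the default implementation.
pub struct LegacyStore(pub Vec<Vec<f32>>);

impl Store for LegacyStore {
    fn rows(&self) -> &[Vec<f32>] {
        &self.0
    }
}

/// Overrides the scratch variant to write into the reused buffer.
pub struct ReusingStore(pub Vec<Vec<f32>>);

impl Store for ReusingStore {
    fn rows(&self) -> &[Vec<f32>] {
        &self.0
    }

    fn distances_with_scratch<'a>(&self, query: &[f32], scratch: &'a mut Scratch) -> &'a [f32] {
        scratch.distances.clear();
        scratch.distances.extend(self.0.iter().map(|r| l2(r, query)));
        &scratch.distances
    }
}
```

Both stores return the same distances through either method; only `ReusingStore` avoids the per-call allocation once the scratch buffer has enough capacity.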

Other factors

Coverage on the patch is 92% with 62 uncovered lines (mostly in v2.rs glue and error paths in bq/storage.rs). The PR adds focused unit tests for the scratch pool, scratch reuse, residual centroid path, and a Float64 residual-before-cast precision test, plus a regression test for the scalar SIMD overwrite contract. No bugs were surfaced by the automated review. The author requested a review explicitly, and given the scope and the touch on hot-path memory safety, a maintainer should sign off rather than relying on a shadow approval.
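The Float64 residual-before-cast concern can be illustrated with a tiny numeric sketch. The helper names below are hypothetical and operate on plain slices rather than Arrow arrays; the point is only that subtracting in `f64` and then casting can differ from casting to `f32` first.

```rust
/// Computes `q - c` in f64, then casts each component to f32.
pub fn residual_then_cast(query: &[f64], centroid: &[f64], out: &mut Vec<f32>) {
    assert_eq!(query.len(), centroid.len());
    out.clear();
    out.extend(query.iter().zip(centroid).map(|(q, c)| (q - c) as f32));
}

/// Casts both operands to f32 first, then subtracts. Precision lost in the
/// casts cannot be recovered by the subtraction.
pub fn cast_then_residual(query: &[f64], centroid: &[f64], out: &mut Vec<f32>) {
    assert_eq!(query.len(), centroid.len());
    out.clear();
    out.extend(query.iter().zip(centroid).map(|(q, c)| (*q as f32) - (*c as f32)));
}
```

With `q = 16777217.0` and `c = 16777216.0`, the first function yields `1.0`, while the second yields `0.0` because `16777217.0` is not representable in `f32` and rounds to `16777216.0`. Inputs that are already `f32` incur no such cast, which is consistent with restricting the residual fast path to Float32.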

Xuanwo merged commit 6df946e into main on Jun 1, 2026 (34 checks passed)
Xuanwo deleted the yang/ivfrq-query-scratch branch on June 1, 2026 06:42