@lowener
Contributor

lowener commented Jan 7, 2026

Currently the Capacity template parameter takes every power of 2 from 1 to 256.
By restricting it to powers of 4 from 1 to 256, we can reduce the size of libcuvs from 157 MB to 146 MB (an 11 MB, or 7%, reduction).
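To make the instantiation count concrete, here is a minimal sketch that lists the capacity values each scheme would cover. `capacity_values` is a hypothetical helper written for illustration only; it is not part of cuVS or RAFT.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper (not from cuVS): lists the powers of `base`
// from 1 up to and including `max_capacity`.
// Intended for small bounds such as 256; no overflow handling.
inline std::vector<int> capacity_values(int base, int max_capacity)
{
  if (base < 2) { throw std::invalid_argument("base must be at least 2"); }
  std::vector<int> values;
  for (int c = 1; c <= max_capacity; c *= base) {
    values.push_back(c);
  }
  return values;
}
```

With a bound of 256, base 2 yields nine values (1, 2, 4, ..., 256) and base 4 yields five (1, 4, 16, 64, 256), so roughly half as many Capacity instantiations need to be compiled.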

After some tests on mnist-784-euclidean, across multiple top-k values and n_probes of 1 or 5, throughput drops by about 4% on average. The measurements are noisy: the power-of-4 version is sometimes faster than the base version. The benchmarks can be reproduced by running the script included in the first commit of the PR.

| Top-k | N-Probes | QPS base | QPS power of 4 | Power of 4 over base |
|-------|----------|----------|----------------|----------------------|
| 1 | 1 | 341,646 | 300,844 | 88% |
| 1 | 5 | 269,131 | 257,179 | 96% |
| 2 | 1 | 328,880 | 293,591 | 89% |
| 2 | 5 | 224,674 | 264,695 | 118% |
| 4 | 1 | 308,350 | 296,900 | 96% |
| 4 | 5 | 227,393 | 220,282 | 97% |
| 5 | 1 | 340,225 | 296,276 | 87% |
| 5 | 5 | 296,486 | 278,676 | 94% |
| 10 | 1 | 301,967 | 308,025 | 102% |
| 10 | 5 | 234,487 | 286,652 | 122% |
| 20 | 1 | 335,355 | 311,835 | 93% |
| 20 | 5 | 231,498 | 256,806 | 111% |
| 50 | 1 | 336,700 | 310,101 | 92% |
| 50 | 5 | 293,545 | 241,445 | 82% |
| 100 | 1 | 337,883 | 277,521 | 82% |
| 100 | 5 | 227,633 | 223,234 | 98% |
| Average | | | | 96% |
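The summary figure can be cross-checked from the table itself. The sketch below recomputes the per-row ratios and takes their arithmetic mean; the PR does not say how the average was computed, so the choice of an arithmetic mean is an assumption, and the `BenchRow` type and `mean_ratio` function are illustrative names only.

```cpp
#include <array>
#include <cstddef>

// One row of the benchmark table above (QPS values copied verbatim).
struct BenchRow {
  int topk;
  int n_probes;
  double qps_base;
  double qps_pow4;
};

constexpr std::array<BenchRow, 16> kRows = {{
  {1, 1, 341646, 300844},   {1, 5, 269131, 257179},
  {2, 1, 328880, 293591},   {2, 5, 224674, 264695},
  {4, 1, 308350, 296900},   {4, 5, 227393, 220282},
  {5, 1, 340225, 296276},   {5, 5, 296486, 278676},
  {10, 1, 301967, 308025},  {10, 5, 234487, 286652},
  {20, 1, 335355, 311835},  {20, 5, 231498, 256806},
  {50, 1, 336700, 310101},  {50, 5, 293545, 241445},
  {100, 1, 337883, 277521}, {100, 5, 227633, 223234},
}};

// Arithmetic mean of (QPS power of 4) / (QPS base) over all rows.
// Assumption: the PR may have used a different averaging method.
inline double mean_ratio()
{
  double sum = 0.0;
  for (const BenchRow& row : kRows) {
    sum += row.qps_pow4 / row.qps_base;
  }
  return sum / static_cast<double>(kRows.size());
}
```

Under that assumption the mean falls between 96% and 97%, in line with the roughly 4% throughput impact quoted above.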

   rmm::cuda_stream_view stream)
 {
-  const int capacity = raft::bound_by_power_of_two(k);
+  const int capacity = bound_by_power_of_four(k);
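The body of `bound_by_power_of_four` is not shown in this excerpt. As a rough sketch of what such a helper might do, assuming it mirrors `raft::bound_by_power_of_two` and returns the smallest power of 4 that is at least `k`:

```cpp
// Sketch only: the PR's actual implementation is not shown here.
// Returns the smallest power of 4 that is >= k (returns 1 for k <= 1).
// Assumes k is small enough that the result fits in an int.
constexpr int bound_by_power_of_four(int k)
{
  int capacity = 1;
  while (capacity < k) {
    capacity *= 4;
  }
  return capacity;
}

static_assert(bound_by_power_of_four(1) == 1, "k = 1 keeps capacity 1");
static_assert(bound_by_power_of_four(5) == 16, "k = 5 rounds up to 16");
static_assert(bound_by_power_of_four(256) == 256, "exact powers are kept");
```

The trade-off is visible in the rounding: a request for k = 5 now reserves a capacity of 16 rather than 8, which is one plausible source of the throughput differences measured above.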

Have you compared the binary size reduction of this approach vs converting capacity to a run-time parameter?

As a first step, we could just estimate the potential size reduction without worrying too much about performance or even correctness.

In the kernel function (https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L810), Capacity is mainly used in two places.

https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L828
=> Just replacing constexpr with const should be sufficient for an initial best-case estimate.

https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_flat/ivf_flat_interleaved_scan.cuh#L852
=> This looks more involved (we would need to dig into the internals of block_sort_t), but for the initial estimate, we could just set Capacity here to an arbitrary value (e.g. 4) to quickly get an idea of the upper limit on the binary size reduction.

If the size you get with this approach is significantly smaller, then it might be worth further investigation. If the size reduction is comparable or even smaller, then it is probably not worth the bother.
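To illustrate the difference between the two approaches, here is a simplified host-side sketch. It is not the cuVS kernel: `scan_templated`, `dispatch_templated`, and `scan_runtime` are hypothetical stand-ins, and the real code would also have to deal with block_sort_t, which this sketch ignores.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the templated kernel. Capacity is a compile-time
// constant, so every distinct value produces a separate instantiation in the
// binary.
template <int Capacity>
int scan_templated(int k)
{
  static_assert(Capacity >= 1, "Capacity must be positive");
  int queue[Capacity] = {};  // buffer size fixed at compile time
  for (int i = 0; i < k && i < Capacity; ++i) {
    queue[i] = i;
  }
  (void)queue;
  return Capacity;  // report the capacity this instantiation was built for
}

// Dispatch over the power-of-4 capacities only: five instantiations.
inline int dispatch_templated(int k)
{
  if (k <= 1) { return scan_templated<1>(k); }
  if (k <= 4) { return scan_templated<4>(k); }
  if (k <= 16) { return scan_templated<16>(k); }
  if (k <= 64) { return scan_templated<64>(k); }
  if (k <= 256) { return scan_templated<256>(k); }
  throw std::invalid_argument("k exceeds the maximum capacity of 256");
}

// Run-time alternative: a single compiled function, with the capacity passed
// as an ordinary argument and the buffer sized dynamically.
inline int scan_runtime(int k, int capacity)
{
  if (capacity < 1) { throw std::invalid_argument("capacity must be positive"); }
  std::vector<int> queue(static_cast<std::size_t>(capacity), 0);
  for (int i = 0; i < k && i < capacity; ++i) {
    queue[static_cast<std::size_t>(i)] = i;
  }
  return static_cast<int>(queue.size());
}
```

The templated path pays in binary size for every capacity it supports, while the run-time path compiles once but gives up compile-time knowledge of the buffer size, which is why its performance impact would need to be measured separately.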

@cjnolet
Member

cjnolet commented Jan 14, 2026

> we can reduce the size of libcuvs from 157 MB to 146 MB (an 11 MB, or 7%, reduction).

@lowener is this per architecture? Any idea what the savings are for the binary when all architectures are compiled?

4 participants