Skip to content

[WIP] IP Vector Normalization to avoid all vectors clumped into single cluster in IVF-PQ #1892

Open
HowardHuang1 wants to merge 17 commits intorapidsai:mainfrom
HowardHuang1:HH-Vector-Normalization
Open

[WIP] IP Vector Normalization to avoid all vectors clumped into single cluster in IVF-PQ #1892
HowardHuang1 wants to merge 17 commits intorapidsai:mainfrom
HowardHuang1:HH-Vector-Normalization

Conversation

@HowardHuang1
Copy link

@HowardHuang1 HowardHuang1 commented Mar 7, 2026

Addresses #1875.

image

… cases due to reduced precision 8-bit values stored in LUT
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aamijar aamijar added non-breaking Introduces a non-breaking change bug Something isn't working labels Mar 9, 2026
@aamijar
Copy link
Member

aamijar commented Mar 9, 2026

/ok to test 78e47f1

@aamijar
Copy link
Member

aamijar commented Mar 9, 2026

Hi @HowardHuang1, are we able to document an example in the description of the PR where vectors were clumped in a single cluster, and then verify that the fix (this PR) solves that case?

Also you can run pre-commit to fix the CI style check.

@HowardHuang1
Copy link
Author

Yes will do. I'll upload a before and after example that illustrates the fixed cluster distributions and timing improvements for search.

@aamijar
Copy link
Member

aamijar commented Mar 10, 2026

/ok to test e822460

@aamijar
Copy link
Member

aamijar commented Mar 14, 2026

/ok to test da32073

@HowardHuang1 HowardHuang1 requested a review from a team as a code owner March 18, 2026 16:40
@aamijar aamijar removed the request for review from a team March 18, 2026 22:32
@aamijar
Copy link
Member

aamijar commented Mar 18, 2026

Hi @HowardHuang1, is this targeting release/26.04? If so, please retarget from main to release/26.04

@HowardHuang1 HowardHuang1 changed the base branch from main to release/26.04 March 18, 2026 22:41
@HowardHuang1 HowardHuang1 requested review from a team as code owners March 18, 2026 22:41
@HowardHuang1 HowardHuang1 requested review from a team as code owners March 18, 2026 22:41
@HowardHuang1 HowardHuang1 requested a review from AyodeAwe March 18, 2026 22:41
@HowardHuang1 HowardHuang1 changed the base branch from release/26.04 to main March 18, 2026 23:41
@jinsolp
Copy link
Contributor

jinsolp commented Mar 19, 2026

Hi @HowardHuang1 you'll have to rebase on release/26.04 if you want to target that branch. 🙂

@jinsolp jinsolp removed request for a team and AyodeAwe March 19, 2026 02:51
Copy link
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HowardHuang1 ! I have a few questions as I try to understand the fix in this code.

  1. So we normalize the data, and train kmeans on this normalized data. However, instead of saving the resulting centroids (which come from the normalized space), we re-compute the proper centroids on the raw vectors (what calc_centers_and_sizes do)?
  2. what are the changes in the cuvs bench for? I'm not so familiar with this part of the code so this is out of curiosity.
  3. If I'm reading things right in the screenshot, we get a lower recall with the normalized vectors compared to using the raw vectors? And the search also takes longer?

handle, kmeans_params, trainset_kmeans_const_view, centers_const_view, labels_view);
// Recompute centers in original space (mean of unnormalized trainset per cluster), overwrites centers_view
rmm::device_uvector<uint32_t> cluster_sizes(impl->n_lists(), stream, device_memory);
cuvs::cluster::kmeans::detail::calc_centers_and_sizes<float, float, internal_extents_t, uint32_t, uint32_t>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the calc_centers_and_sizes in the public namespace if possible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jinsolp !

  1. Yes exactly. If we were to keep the normalized centroids that would cause the computation to no longer be Inner Product and instead it will degenerate to Cosine which is not what we want. To prevent this, we use calc_centers_and_sizes to re-compute the proper centroids on the raw vectors. In essence, we only want normalization to happen in the kmeans cluster assignment step rather than the whole pipeline (normalization of the whole pipeline changes metric to Cosine).
  2. The changes in cuvs bench are to resolve a linker issue. Will remove these when merging. They can be ignored for now.
  3. The screenshot is a bit outdated, will replace soon with a new one. Recall should be same search should be faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

Development

Successfully merging this pull request may close these issues.

3 participants