[vector]support lumina #7330
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| ## Paimon Lumina | ||
|
|
||
| This module integrates [Lumina](https://github.com/alibaba/paimon-cpp/tree/main/third_party/lumina) | ||
| as a vector index for Apache Paimon's global index framework. | ||
|
|
||
| The Lumina vector search library is derived from an internal repository maintained by | ||
| the Alibaba Storage Service Team. It is accessed via JNI through the `lumina-jni` artifact. | ||
|
|
||
| ### Supported Index Types | ||
|
|
||
| | Index Type | Description | | ||
| |------------|-------------| | ||
| | **DISKANN** | DiskANN graph-based index (default) | | ||
|
|
||
|
|
||
| ### Supported Vector Metrics | ||
|
|
||
| | Metric | Description | | ||
| |--------|-------------| | ||
| | **L2** | Euclidean distance (default) | | ||
| | **COSINE** | Cosine distance | | ||
| | **INNER_PRODUCT** | Dot product | | ||
|
|
||
| ### Configuration Options | ||
|
|
||
| | Option | Type | Default | Description | | ||
| |--------|------|---------|-------------| | ||
| | `vector.dim` | int | 128 | Vector dimension | | ||
| | `vector.metric` | enum | L2 | Distance metric | | ||
| | `vector.index-type` | enum | DISKANN | Index type | | ||
| | `vector.encoding-type` | string | rawf32 | Encoding type (rawf32, sq8, pq) | | ||
| | `vector.size-per-index` | int | 2,000,000 | Max vectors per index file | | ||
| | `vector.training-size` | int | 500,000 | Vectors used for pretraining | | ||
| | `vector.search-factor` | int | 10 | Multiplier for search limit when filtering | | ||
| | `vector.normalize` | boolean | false | L2-normalize vectors before indexing/searching | | ||
| | `vector.diskann.search-list-size` | int | 100 | DiskANN search list size | | ||
| | `vector.pretrain-sample-ratio` | double | 1.0 | Pretrain sample ratio | | ||
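The metric names and the `vector.normalize` option in the README above can be illustrated with a small plain-Java sketch. This is not Lumina code: the class and method names are hypothetical, and the exact conventions Lumina uses (for example, whether L2 is reported squared, or whether cosine distance is defined as one minus cosine similarity) are assumptions not confirmed by the source.

```java
/** Illustrative only: plain-Java versions of the metrics named in the README. */
final class VectorMetricSketch {

    /** Dot product of two equal-length vectors (INNER_PRODUCT). */
    static double innerProduct(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    /** Euclidean distance (L2). Assumption: rooted, not squared. */
    static double l2Distance(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = (double) a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Cosine distance. Assumption: defined as 1 - cosine similarity. */
    static double cosineDistance(float[] a, float[] b) {
        double denom = Math.sqrt(innerProduct(a, a)) * Math.sqrt(innerProduct(b, b));
        if (denom == 0.0) {
            throw new IllegalArgumentException("cosine is undefined for a zero vector");
        }
        return 1.0 - innerProduct(a, b) / denom;
    }

    /** Returns a copy scaled to unit L2 norm; a zero vector is returned unchanged. */
    static float[] l2Normalize(float[] v) {
        double norm = Math.sqrt(innerProduct(v, v));
        float[] out = v.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
    }

    public static void main(String[] args) {
        float[] a = {3f, 4f};
        float[] b = {4f, 3f};
        System.out.println("L2 distance:     " + l2Distance(a, b));
        System.out.println("Inner product:   " + innerProduct(a, b));
        System.out.println("Cosine distance: " + cosineDistance(a, b));
        // After L2 normalization, the inner product equals the cosine similarity
        // (up to float rounding).
        System.out.println("IP (normalized): " + innerProduct(l2Normalize(a), l2Normalize(b)));
    }
}
```

The last line of `main` shows why a normalization switch is useful: on unit-length vectors, ranking by inner product and ranking by cosine similarity agree.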
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,100 @@ | ||||||||||||||||||||||||||||||||||||||||||
| <?xml version="1.0" encoding="UTF-8"?> | ||||||||||||||||||||||||||||||||||||||||||
| <!-- | ||||||||||||||||||||||||||||||||||||||||||
| Licensed to the Apache Software Foundation (ASF) under one | ||||||||||||||||||||||||||||||||||||||||||
| or more contributor license agreements. See the NOTICE file | ||||||||||||||||||||||||||||||||||||||||||
| distributed with this work for additional information | ||||||||||||||||||||||||||||||||||||||||||
| regarding copyright ownership. The ASF licenses this file | ||||||||||||||||||||||||||||||||||||||||||
| to you under the Apache License, Version 2.0 (the | ||||||||||||||||||||||||||||||||||||||||||
| "License"); you may not use this file except in compliance | ||||||||||||||||||||||||||||||||||||||||||
| with the License. You may obtain a copy of the License at | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| http://www.apache.org/licenses/LICENSE-2.0 | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| Unless required by applicable law or agreed to in writing, | ||||||||||||||||||||||||||||||||||||||||||
| software distributed under the License is distributed on an | ||||||||||||||||||||||||||||||||||||||||||
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||||||||||||||||||||||||||||||||||||||||||
| KIND, either express or implied. See the License for the | ||||||||||||||||||||||||||||||||||||||||||
| specific language governing permissions and limitations | ||||||||||||||||||||||||||||||||||||||||||
| under the License. | ||||||||||||||||||||||||||||||||||||||||||
| --> | ||||||||||||||||||||||||||||||||||||||||||
| <project xmlns="http://maven.apache.org/POM/4.0.0" | ||||||||||||||||||||||||||||||||||||||||||
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||||||||||||||||||||||||||||||||||||||||||
| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> | ||||||||||||||||||||||||||||||||||||||||||
| <modelVersion>4.0.0</modelVersion> | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| <parent> | ||||||||||||||||||||||||||||||||||||||||||
| <artifactId>paimon-parent</artifactId> | ||||||||||||||||||||||||||||||||||||||||||
| <groupId>org.apache.paimon</groupId> | ||||||||||||||||||||||||||||||||||||||||||
| <version>1.4-SNAPSHOT</version> | ||||||||||||||||||||||||||||||||||||||||||
| </parent> | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| <artifactId>paimon-lumina</artifactId> | ||||||||||||||||||||||||||||||||||||||||||
|
Contributor
Just one paimon-lumina is OK, no need to have index and e2e.
Contributor
Please add a README.md to this module that explains what Lumina is. |
||||||||||||||||||||||||||||||||||||||||||
| <name>Paimon : Lumina Index</name> | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| <repositories> | ||||||||||||||||||||||||||||||||||||||||||
| <repository> | ||||||||||||||||||||||||||||||||||||||||||
| <id>lumina</id> | ||||||||||||||||||||||||||||||||||||||||||
| <url>https://lumina-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/</url> | ||||||||||||||||||||||||||||||||||||||||||
| </repository> | ||||||||||||||||||||||||||||||||||||||||||
| </repositories> | ||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines +34 to +39
|
||||||||||||||||||||||||||||||||||||||||||
| <repositories> | |
| <repository> | |
| <id>lumina</id> | |
| <url>https://lumina-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/</url> | |
| </repository> | |
| </repositories> | |
| <profiles> | |
| <profile> | |
| <id>lumina-repo</id> | |
| <activation> | |
| <activeByDefault>false</activeByDefault> | |
| </activation> | |
| <repositories> | |
| <repository> | |
| <id>lumina</id> | |
| <url>https://lumina-binary.oss-cn-shanghai.aliyuncs.com/mvn-repo/</url> | |
| </repository> | |
| </repositories> | |
| </profile> | |
| </profiles> |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,223 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.paimon.lumina.index; | ||
|
|
||
| import org.aliyun.lumina.LuminaBuilder; | ||
| import org.aliyun.lumina.LuminaFileInput; | ||
| import org.aliyun.lumina.LuminaFileOutput; | ||
| import org.aliyun.lumina.LuminaSearcher; | ||
| import org.aliyun.lumina.MetricType; | ||
|
|
||
| import java.io.Closeable; | ||
| import java.nio.ByteBuffer; | ||
| import java.nio.ByteOrder; | ||
| import java.util.LinkedHashMap; | ||
| import java.util.Map; | ||
|
|
||
| /** | ||
| * A high-level wrapper for Lumina index operations (build and search). | ||
| * | ||
| * <p>This class provides a safe Java API for building and searching Lumina vector indices. It | ||
| * manages the lifecycle of native LuminaBuilder and LuminaSearcher objects. | ||
| */ | ||
| public class LuminaIndex implements Closeable { | ||
|
|
||
| private LuminaBuilder builder; | ||
| private LuminaSearcher searcher; | ||
| private final int dimension; | ||
| private final LuminaVectorMetric metric; | ||
| private final LuminaIndexType indexType; | ||
| private volatile boolean closed = false; | ||
|
|
||
| private LuminaIndex(int dimension, LuminaVectorMetric metric, LuminaIndexType indexType) { | ||
| this.dimension = dimension; | ||
| this.metric = metric; | ||
| this.indexType = indexType; | ||
| } | ||
|
|
||
| /** Create a new index for building. */ | ||
| public static LuminaIndex createForBuild( | ||
| int dimension, | ||
| LuminaVectorMetric metric, | ||
| LuminaIndexType indexType, | ||
| Map<String, String> extraOptions) { | ||
| LuminaIndex index = new LuminaIndex(dimension, metric, indexType); | ||
|
|
||
| Map<String, String> opts = new LinkedHashMap<>(extraOptions); | ||
| index.builder = | ||
| LuminaBuilder.create( | ||
| indexType.getLuminaName(), dimension, toMetricType(metric), opts); | ||
| return index; | ||
| } | ||
|
|
||
| /** | ||
| * Open an existing index from a streaming file input for searching. | ||
| * | ||
| * <p>The native searcher reads on-demand from the provided input. The caller must keep the | ||
| * underlying stream open until this index is closed. | ||
| */ | ||
| public static LuminaIndex fromStream( | ||
| LuminaFileInput fileInput, | ||
| long fileSize, | ||
| int dimension, | ||
| LuminaVectorMetric metric, | ||
| LuminaIndexType indexType, | ||
| Map<String, String> extraOptions) { | ||
| LuminaIndex index = new LuminaIndex(dimension, metric, indexType); | ||
|
|
||
| Map<String, String> searcherOpts = new LinkedHashMap<>(); | ||
| for (Map.Entry<String, String> entry : extraOptions.entrySet()) { | ||
| String key = entry.getKey(); | ||
| if (key.startsWith("diskann.search.")) { | ||
| searcherOpts.put(key, entry.getValue()); | ||
| } | ||
| } | ||
| index.searcher = | ||
| LuminaSearcher.create( | ||
| indexType.getLuminaName(), dimension, toMetricType(metric), searcherOpts); | ||
| index.searcher.open(fileInput, fileSize); | ||
| return index; | ||
| } | ||
|
|
||
| /** Pretrain the index with sample vectors before insertion. */ | ||
| public void pretrain(ByteBuffer vectorBuffer, int n) { | ||
| ensureOpen(); | ||
| ensureBuilder(); | ||
| builder.pretrain(vectorBuffer, n); | ||
| } | ||
|
|
||
| /** Insert a batch of vectors with IDs (zero-copy). */ | ||
| public void insertBatch(ByteBuffer vectorBuffer, ByteBuffer idBuffer, int n) { | ||
| ensureOpen(); | ||
| ensureBuilder(); | ||
| builder.insertBatch(vectorBuffer, idBuffer, n); | ||
| } | ||
|
|
||
| /** Dump (serialize) the built index to a streaming file output. */ | ||
| public void dump(LuminaFileOutput fileOutput) { | ||
| ensureOpen(); | ||
| ensureBuilder(); | ||
| builder.dump(fileOutput); | ||
| } | ||
|
|
||
| /** Search for k nearest neighbors. */ | ||
| public void search( | ||
| float[] queryVectors, | ||
| int n, | ||
| int k, | ||
| float[] distances, | ||
| long[] labels, | ||
| Map<String, String> searchOptions) { | ||
| ensureOpen(); | ||
| ensureSearcher(); | ||
| searcher.search(n, queryVectors, k, distances, labels, searchOptions); | ||
| } | ||
|
|
||
| /** Search for k nearest neighbors with native pre-filtering on vector IDs. */ | ||
| public void searchWithFilter( | ||
| float[] queryVectors, | ||
| int n, | ||
| int k, | ||
| float[] distances, | ||
| long[] labels, | ||
| long[] filterIds, | ||
| Map<String, String> searchOptions) { | ||
| ensureOpen(); | ||
| ensureSearcher(); | ||
| searcher.searchWithFilter(n, queryVectors, k, distances, labels, filterIds, searchOptions); | ||
| } | ||
|
|
||
| /** Get the number of vectors (searcher mode). */ | ||
| public long size() { | ||
| ensureOpen(); | ||
| ensureSearcher(); | ||
| return searcher.getCount(); | ||
| } | ||
|
|
||
| public int dimension() { | ||
| return dimension; | ||
| } | ||
|
|
||
| public LuminaVectorMetric metric() { | ||
| return metric; | ||
| } | ||
|
|
||
| public LuminaIndexType indexType() { | ||
| return indexType; | ||
| } | ||
|
|
||
| public static ByteBuffer allocateVectorBuffer(int numVectors, int dimension) { | ||
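| // Math.multiplyExact throws ArithmeticException instead of silently overflowing int. | ||
| return ByteBuffer.allocateDirect( | ||
| Math.multiplyExact(Math.multiplyExact(numVectors, dimension), Float.BYTES)) | ||
| .order(ByteOrder.nativeOrder()); | ||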
| } | ||
|
|
||
| public static ByteBuffer allocateIdBuffer(int numIds) { | ||
| return ByteBuffer.allocateDirect(Math.multiplyExact(numIds, Long.BYTES)).order(ByteOrder.nativeOrder()); | ||
| } | ||
|
|
||
| private void ensureOpen() { | ||
| if (closed) { | ||
| throw new IllegalStateException("Index has been closed"); | ||
| } | ||
| } | ||
|
|
||
| private void ensureBuilder() { | ||
| if (builder == null) { | ||
| throw new IllegalStateException("Index was not opened for building"); | ||
| } | ||
| } | ||
|
|
||
| private void ensureSearcher() { | ||
| if (searcher == null) { | ||
| throw new IllegalStateException("Index was not opened for searching"); | ||
| } | ||
| } | ||
|
|
||
| private static MetricType toMetricType(LuminaVectorMetric metric) { | ||
| switch (metric) { | ||
| case L2: | ||
| return MetricType.L2; | ||
| case COSINE: | ||
| return MetricType.COSINE; | ||
| case INNER_PRODUCT: | ||
| return MetricType.INNER_PRODUCT; | ||
| default: | ||
| throw new IllegalArgumentException("Unknown metric: " + metric); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public void close() { | ||
| if (!closed) { | ||
| synchronized (this) { | ||
| if (!closed) { | ||
| if (builder != null) { | ||
| builder.close(); | ||
| builder = null; | ||
| } | ||
| if (searcher != null) { | ||
| searcher.close(); | ||
| searcher = null; | ||
| } | ||
| closed = true; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } |
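A caller of `insertBatch` or `pretrain` has to hand over direct buffers in native byte order, which is what `allocateVectorBuffer` and `allocateIdBuffer` produce. The sketch below shows one way such buffers might be filled using only `java.nio`. The class and method names are hypothetical, and the row-major layout (vector 0, then vector 1, and so on) is an assumption; the source does not state the layout the native side expects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: packing vectors and ids into direct, native-order buffers. */
final class BufferPackingSketch {

    /** Packs vectors contiguously, one after another (assumed row-major layout). */
    static ByteBuffer packVectors(float[][] vectors, int dimension) {
        // multiplyExact guards against int overflow for large batches.
        int bytes = Math.multiplyExact(
                Math.multiplyExact(vectors.length, dimension), Float.BYTES);
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        for (int row = 0; row < vectors.length; row++) {
            if (vectors[row].length != dimension) {
                throw new IllegalArgumentException("vector " + row + " has wrong dimension");
            }
            for (int col = 0; col < dimension; col++) {
                // Absolute puts leave position at 0, so the buffer can be passed on as-is.
                buf.putFloat((row * dimension + col) * Float.BYTES, vectors[row][col]);
            }
        }
        return buf;
    }

    /** Packs 64-bit ids in the same order as the vectors they label. */
    static ByteBuffer packIds(long[] ids) {
        int bytes = Math.multiplyExact(ids.length, Long.BYTES);
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder());
        for (int i = 0; i < ids.length; i++) {
            buf.putLong(i * Long.BYTES, ids[i]);
        }
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer vectors = packVectors(new float[][] {{1f, 2f}, {3f, 4f}}, 2);
        ByteBuffer ids = packIds(new long[] {100L, 101L});
        System.out.println("vector bytes: " + vectors.capacity() + ", id bytes: " + ids.capacity());
    }
}
```

With buffers like these, a build would presumably proceed as `createForBuild(...)`, `pretrain(...)`, `insertBatch(vectors, ids, n)`, and `dump(...)`, in that order, based on the method Javadoc above.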
`containsRange` is incorrect for negative `minId` (and for `minId == Long.MIN_VALUE` it also avoids evaluating `minId - 1`). This can produce false positives/negatives when the range starts below 0. Consider computing `countBeforeMin` via `rankLong(minId - 1)` for all `minId` except `Long.MIN_VALUE` (where it must be 0), instead of the current `minId > 0` shortcut.
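The rank-based computation suggested in this comment can be sketched in isolation. The code under review is not shown in this excerpt, so everything below is a stand-in: `rankLong` is implemented over a sorted array, ids are assumed to be ordered as signed longs, and because it is not visible whether `containsRange` means "any id in range" or "every id in range", the sketch exposes the count and both checks under hypothetical names.

```java
import java.util.Arrays;

/** Illustrative only: rank-based range counting that also works for negative ids. */
final class RangeRankSketch {

    private final long[] sortedIds; // distinct, ascending in signed order

    RangeRankSketch(long... ids) {
        this.sortedIds = Arrays.stream(ids).distinct().sorted().toArray();
    }

    /** Number of stored ids that are <= value (stand-in for a bitmap's rank). */
    long rankLong(long value) {
        int pos = Arrays.binarySearch(sortedIds, value);
        // Found: pos + 1 elements are <= value. Not found: insertion point elements are.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Number of stored ids in the closed interval [minId, maxId]. */
    long countInRange(long minId, long maxId) {
        if (minId > maxId) {
            throw new IllegalArgumentException("minId must not exceed maxId");
        }
        // minId - 1 would wrap around for Long.MIN_VALUE, and nothing can be below it.
        long countBeforeMin = minId == Long.MIN_VALUE ? 0 : rankLong(minId - 1);
        return rankLong(maxId) - countBeforeMin;
    }

    /** True if at least one stored id lies in [minId, maxId]. */
    boolean intersectsRange(long minId, long maxId) {
        return countInRange(minId, maxId) > 0;
    }

    /** True if every id in [minId, maxId] is stored. */
    boolean coversRange(long minId, long maxId) {
        long count = countInRange(minId, maxId);
        long width = maxId - minId; // negative if the span overflows a long
        return width >= 0 && count - 1 == width;
    }

    public static void main(String[] args) {
        RangeRankSketch ids = new RangeRankSketch(-3L, -2L, -1L, 0L, 5L);
        System.out.println(ids.countInRange(-3L, 0L));
        System.out.println(ids.coversRange(-3L, 0L));
        System.out.println(ids.intersectsRange(1L, 4L));
    }
}
```

The only special case is `Long.MIN_VALUE`; every other `minId`, negative or not, goes through the same `rankLong(minId - 1)` path, which is the point of the review comment.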