90 changes: 90 additions & 0 deletions docs/benchmarking.md
@@ -0,0 +1,90 @@
# How to run benchmarks

JVector comes with a built-in benchmarking system in `jvector-examples/.../BenchYAML.java`.

To run a benchmark:
- Decide which dataset(s) you want to benchmark.
- Configure the parameter combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters, and search parameters.

To describe a dataset you will need to specify the base vectors used to construct the index, the query vectors, and the "ground truth" results which will be used to compute accuracy metrics.

JVector supports two types of datasets:
- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec`.
- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.

General procedure for running benchmarks:
- Specify the dataset names to benchmark in `datasets.yml`.
- Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and available (refer to the section on [using datasets](#using-datasets)).
- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<dataset-name>.yml` in the same folder.

You can run the configured benchmark with Maven:
```sh
mvn clean compile exec:exec@bench -pl jvector-examples -am
```

## Using Datasets

### Using Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically when you run the benchmark.

To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
- Ensure that you have three files:
- `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
- `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
- `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `MultiFileDatasource.java` to configure a new dataset and its associated files:
```java
put("cust-ds", new MultiFileDatasource("cust-ds",
"/cust-ds/base.fvec",
"/cust-ds/query.fvec",
"/cust-ds/neighbors.ivec"));
```
The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you don't want to benchmark.
```yaml
custom:
- cust-ds
```
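
If you need to produce the three files yourself, note that `.fvec`/`.ivec` commonly follow the classic TEXMEX layout: each record is a 4-byte little-endian integer `d` (the vector length) followed by `d` little-endian 4-byte values (floats for `.fvec`, ints for `.ivec`). The sketch below is a standalone round-trip under that assumption; the class and method names are illustrative, and you should verify the layout against the reader in `MultiFileDatasource.java` before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Round-trip for the assumed TEXMEX .fvec layout: each record is a 4-byte
// little-endian integer d (the vector length) followed by d little-endian floats.
// (.ivec is identical but stores 4-byte ints instead of floats.)
public class FvecDemo {
    static byte[] writeFvecs(List<float[]> vectors) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (float[] v : vectors) {
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 * v.length).order(ByteOrder.LITTLE_ENDIAN);
            buf.putInt(v.length);
            for (float x : v) buf.putFloat(x);
            out.writeBytes(buf.array());
        }
        return out.toByteArray();
    }

    static List<float[]> readFvecs(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<float[]> vectors = new ArrayList<>();
        while (buf.hasRemaining()) {
            float[] v = new float[buf.getInt()];
            for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<float[]> base = List.of(new float[]{0.1f, 0.2f, 0.3f}, new float[]{-0.2f, 0.1f, 0.35f});
        List<float[]> back = readFvecs(writeFvecs(base));
        System.out.println("round-tripped " + back.size() + " vectors of dimension " + back.get(0).length);
    }
}
```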

### Using HDF5 datasets

HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file: `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.

To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
```yaml
category:
- dataset-name.hdf5
```

BenchYAML looks for HDF5 datasets named `dataset-name.hdf5` in the `hdf5` folder at the root of this repo. If the file doesn't exist, BenchYAML will attempt to download the dataset automatically from ann-benchmarks.com. To use a custom dataset, simply ensure that the file is available in the `hdf5` folder and edit `datasets.yml` accordingly.

## Setting benchmark parameters

Benchmark configurations are defined in `jvector-examples/yaml-configs`. There are three types of files:
- `datasets.yml` which controls which datasets will be used for running the benchmark.
- `default.yml` which defines the default parameter sets to be used for all datasets.
- `dataset-name.yml` which specifies the parameter sets for a single dataset.

### datasets.yml

This file specifies the datasets to be used when running `BenchYAML`. Datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.

### default.yml / \<dataset-name\>.yml

These files define the parameters to be used by `BenchYAML`. The settings in the `default.yml` file apply to all datasets, except ones which have a custom configuration defined in `<dataset-name>.yml`.

See `default.yml` for a list of all options.

Most parameters can be specified as an array. For these parameters, a separate benchmark is run for each value of the parameter. If multiple parameters are specified as arrays, a benchmark is run for each combination (i.e. taking the Cartesian product). For example:
```yaml
construction:
M: [32, 64]
ef: [100, 200]
```
will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is useful when running a grid search to identify the best-performing parameters.
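
The expansion above can be sketched in plain Java. This is illustrative only (BenchYAML performs the equivalent expansion internally); the class and method names are not part of JVector.

```java
import java.util.ArrayList;
import java.util.List;

public class GridDemo {
    // Expand two parameter arrays into their Cartesian product, one entry per benchmark run.
    static List<String> expand(List<Integer> ms, List<Integer> efs) {
        List<String> runs = new ArrayList<>();
        for (int m : ms)
            for (int ef : efs)
                runs.add("M=" + m + ", ef=" + ef);
        return runs;
    }

    public static void main(String[] args) {
        // Mirrors the yaml example: M: [32, 64] crossed with ef: [100, 200] -> 4 runs
        for (String run : expand(List.of(32, 64), List.of(100, 200)))
            System.out.println("benchmark with " + run);
    }
}
```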


<!-- TODO Bench args -->
150 changes: 150 additions & 0 deletions docs/draft-hello.md
@@ -0,0 +1,150 @@
# JVector Tutorial

JVector provides a graph index for ANN search which is a hybrid of DiskANN and HNSW. You can think of it as a Vamana index with an HNSW-style hierarchy. The rest of this tutorial assumes a basic understanding of vector search, but no prior knowledge of HNSW or DiskANN is required.

JVector provides a `VectorFloat` datatype for representing vectors, as an abstraction over the physical vector type. Therefore, the first step to using JVector is to understand how to create a `VectorFloat`:

```java
// `VectorizationProvider` is automatically picked based on the system, language version and runtime flags
// and determines the actual type of the vector data, and provides implementations for common operations
// like the inner product.
VectorTypeSupport vts = VectorizationProvider.getInstance().getVectorTypeSupport();

int dimension = 3;

// Create a `VectorFloat` from a `float[]`.
// The types that can be converted to a VectorFloat are technically dependent on which VectorizationProvider is picked,
// but `float[]` is generally a safe bet.
float[] vector0Array = new float[]{0.1f, 0.2f, 0.3f};
VectorFloat<?> vector0 = vts.createFloatVector(vector0Array);
```

> [!TIP]
> For other ways to create vectors, refer to the javadoc for `VectorTypeSupport`.

Before creating the vector index, we will group all of our base vectors into a container which implements the `RandomAccessVectorValues` interface. Many APIs in JVector accept an instance of `RandomAccessVectorValues` as input. In this case, we'll use it to specify the vectors to be used to build the index.

```java
// This toy example uses only three vectors; in practice you might have millions or more.
List<VectorFloat<?>> baseVectors = List.of(
vector0,
vts.createFloatVector(new float[]{0.01f, 0.15f, -0.3f}),
vts.createFloatVector(new float[]{-0.2f, 0.1f, 0.35f})
);

// RAVV or `ravv` is convenient shorthand for a RandomAccessVectorValues instance
RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, dimension /* 3 */);
```

> [!TIP]
> In this example, all vectors are loaded in-memory, but RAVVs are quite versatile. For example, you might have a RAVV backed by disk (check out `MMapRandomAccessVectorValues.java`) or write your own custom RAVV that transfers data over a network interface.

> [!NOTE]
> A note on terminology:
> - "Base" vectors are the vectors used to build the index. Each vector becomes a node in the graph. May also be referred to as the "train" set.
> - "Query" vectors are vectors used as queries for ANN search after the index has been built. In some cases you may want to use some base vectors as queries. Also referred to as the "test" set.

We're now ready to create a graph-based vector index. We'll do this using a `GraphIndexBuilder` as an intermediate. Let's take a look at the signature of one of its constructors:

```java
public GraphIndexBuilder(BuildScoreProvider scoreProvider,
int dimension,
int M,
int beamWidth,
float neighborOverflow,
float alpha,
boolean addHierarchy,
boolean refineFinalGraph);
```

This constructor asks for something called a `BuildScoreProvider`, the vector dimension, and a set of graph parameters.

The `BuildScoreProvider` is used by the graph builder to compute the similarity scores between any two vectors at build time. We'll use the RAVV we created earlier to generate a BuildScoreProvider:

```java
// The type of similarity score to use. JVector supports EUCLIDEAN (L2 distance), DOT_PRODUCT and COSINE.
VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.EUCLIDEAN;

// A simple score provider which can compute exact similarity scores by holding a reference to all the base vectors.
BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, similarityFunction);
```
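
As an aside, it helps to see concretely what a "higher is better" similarity score looks like for a distance-based metric. The standalone sketch below uses the `1 / (1 + d²)` mapping from squared L2 distance that Lucene's `EUCLIDEAN` uses; treat the exact formula JVector applies as an assumption and check `VectorSimilarityFunction` if the precise value matters.

```java
public class SimilarityDemo {
    // Convert squared L2 distance into a similarity in (0, 1]:
    // identical vectors score 1.0, and the score shrinks as distance grows.
    static float euclideanSimilarity(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return 1f / (1f + sum);
    }

    public static void main(String[] args) {
        float[] v0 = {0.1f, 0.2f, 0.3f};
        float[] q  = {0.2f, 0.3f, 0.4f};
        System.out.println(euclideanSimilarity(v0, v0)); // identical vectors -> 1.0
        System.out.println(euclideanSimilarity(v0, q));  // nearby vectors -> slightly below 1.0
    }
}
```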

Let's also initialize the graph parameters. For now we won't worry about the exact function of each parameter, except to note that the values below are a reasonable set of defaults. Refer to the DiskANN and HNSW papers for more details.

<!-- TODO describe graph parameters in a separate doc -->

```java
// Graph construction parameters
int M = 32; // maximum degree of each node
int efConstruction = 100; // search depth during construction
float neighborOverflow = 1.2f;
float alpha = 1.2f; // note: not the best setting for 3D vectors, but good in the general case
boolean addHierarchy = true; // use an HNSW-style hierarchy
boolean refineFinalGraph = true;
```

Now we can create the graph index:

```java
// Build the graph index using a Builder
// Remember to close the builder using builder.close() or a try-with-resources block
GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
dimension,
M,
efConstruction,
neighborOverflow,
alpha,
addHierarchy,
refineFinalGraph);
ImmutableGraphIndex graph = builder.build(ravv);
```

> [!NOTE]
> You may notice that we supplied the same `ravv` to `builder.build`, even though we'd already passed the RAVV in while creating the `BuildScoreProvider`. This is necessary because, generally speaking, the `BuildScoreProvider` won't keep a reference to the actual base vectors; it just so happens that the "exact" score provider we're using does.

At this point, you have a completed graph index that lives entirely in memory.

To perform a search operation, you need to first create a `GraphSearcher`.

> [!IMPORTANT]
> The graph index itself can be shared between threads, but `GraphSearcher`s maintain internal state and are therefore NOT thread-safe.

```java
// Remember to close the searcher using searcher.close() or a try-with-resources block
var searcher = new GraphSearcher(graph);
```
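
One common way to respect this constraint in a multi-threaded service is to hand each thread its own searcher over the shared graph. The sketch below demonstrates the pattern with a stand-in class so it runs without JVector on the classpath; in real code the initializer would be something like `() -> new GraphSearcher(graph)`, and each searcher would still need to be closed when its thread is done.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadDemo {
    // Stand-in for a non-thread-safe searcher; in real code this would be GraphSearcher.
    static class Searcher {}

    // ThreadLocal lazily creates one Searcher per thread on first access.
    static final ThreadLocal<Searcher> SEARCHERS = ThreadLocal.withInitial(Searcher::new);

    // Run nThreads tasks and count how many distinct Searcher instances were handed out.
    static int distinctInstances(int nThreads) {
        Set<Searcher> distinct = ConcurrentHashMap.newKeySet();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> distinct.add(SEARCHERS.get()));
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // One searcher per thread: the shared graph stays read-only, the mutable state is per-thread.
        System.out.println("distinct searchers across 4 threads: " + distinctInstances(4));
    }
}
```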

Generally speaking, you can't pass in a `VectorFloat<?>` directly to the `GraphSearcher`. You need to wrap the query vector with a `SearchScoreProvider`, similar in spirit to the `BuildScoreProvider` we created earlier.

```java
VectorFloat<?> queryVector = vts.createFloatVector(new float[]{0.2f, 0.3f, 0.4f}); // for example
// The in-memory graph index doesn't own the actual vectors used to construct it.
// To compute exact scores at search time, you need to pass in the base RAVV again,
// in addition to the actual query vector
SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(queryVector, similarityFunction, ravv);
```

Now we can run a search:

```java
int topK = 10; // number of approximate nearest neighbors to fetch
// You can provide a filter to the query as a bit mask.
// In this case we want the actual topK neighbors without filtering,
// so we pass in a virtual bit mask representing all ones.
SearchResult result = searcher.search(ssp, topK, Bits.ALL);

for (NodeScore ns : result.getNodes()) {
int id = ns.node; // you can look up this ID in the RAVV
float score = ns.score; // the similarity score between this vector and the query vector (higher -> more similar)
System.out.println("ID: " + id + ", Score: " + score + ", Vector: " + ravv.getVector(id));
}
```

For the full example, refer to `jvector-examples/../VectorIntro.java`.

Next steps:
- Understand index construction parameters
- Overquerying to improve search accuracy
- Quantization for space efficiency
- Building indexes for larger-than-memory datasets on disk
- VectorizationProviders
58 changes: 58 additions & 0 deletions jvector-examples/pom.xml
@@ -48,6 +48,27 @@
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<configuration>
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
@@ -156,6 +177,17 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>
@@ -204,6 +236,18 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>
@@ -296,6 +340,20 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--enable-native-access=ALL-UNNAMED</argument>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>-Djvector.experimental.enable_native_vectorization=true</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>