90 changes: 90 additions & 0 deletions docs/benchmarking.md
@@ -0,0 +1,90 @@
# How to run benchmarks

JVector comes with a built-in benchmarking system in `jvector-examples/.../BenchYAML.java`.

To run a benchmark:
- Decide which dataset(s) you want to benchmark.
- Configure the parameter combinations for which you want to run the benchmark. This includes graph index parameters, quantization parameters, and search parameters.

To describe a dataset you will need to specify the base vectors used to construct the index, the query vectors, and the "ground truth" results which will be used to compute accuracy metrics.

JVector supports two types of datasets:
- **Fvec/Ivec**: The dataset consists of three files, for example `base.fvec`, `queries.fvec` and `neighbors.ivec`.
- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.

General procedure for running benchmarks:
- Specify the dataset names to benchmark in `datasets.yml`.
- Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and available (refer to the section on [using datasets](#using-datasets)).
- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<dataset-name>.yml` in the same folder.

You can run the configured benchmark with Maven:
```sh
mvn clean compile exec:exec@bench -pl jvector-examples -am
```

## Using Datasets

### Using Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically when you run the benchmark.

To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
- Ensure that you have three files:
- `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
- `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
- `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `MultiFileDatasource.java` to configure a new dataset and its associated files:
```java
put("cust-ds", new MultiFileDatasource("cust-ds",
"/cust-ds/base.fvec",
"/cust-ds/query.fvec",
"/cust-ds/neighbors.ivec"));
```
The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you don't want to benchmark.
```yaml
custom:
- cust-ds
```
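
If you need to produce the three files yourself, note that `.fvec`/`.ivec` commonly follow the classic TEXMEX layout: each record is a 4-byte little-endian integer `d` (the vector length) followed by `d` little-endian 4-byte values (floats for `.fvec`, ints for `.ivec`). The sketch below is a standalone round-trip under that assumption; the class and method names are illustrative, and you should verify the layout against the reader in `MultiFileDatasource.java` before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Round-trip for the assumed TEXMEX .fvec layout: each record is a 4-byte
// little-endian integer d (the vector length) followed by d little-endian floats.
// (.ivec is identical but stores 4-byte ints instead of floats.)
public class FvecDemo {
    static byte[] writeFvecs(List<float[]> vectors) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (float[] v : vectors) {
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 * v.length).order(ByteOrder.LITTLE_ENDIAN);
            buf.putInt(v.length);
            for (float x : v) buf.putFloat(x);
            out.writeBytes(buf.array());
        }
        return out.toByteArray();
    }

    static List<float[]> readFvecs(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<float[]> vectors = new ArrayList<>();
        while (buf.hasRemaining()) {
            float[] v = new float[buf.getInt()];
            for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<float[]> base = List.of(new float[]{0.1f, 0.2f, 0.3f}, new float[]{-0.2f, 0.1f, 0.35f});
        List<float[]> back = readFvecs(writeFvecs(base));
        System.out.println("round-tripped " + back.size() + " vectors of dimension " + back.get(0).length);
    }
}
```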

### Using HDF5 datasets

HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file: `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.

To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
```yaml
category:
- dataset-name.hdf5
```

BenchYAML looks for HDF5 datasets named `dataset-name.hdf5` in the `hdf5` folder at the root of this repo. If the file doesn't exist, BenchYAML will attempt to download the dataset automatically from ann-benchmarks.com. To use a custom dataset, simply ensure that the file is available in the `hdf5` folder and edit `datasets.yml` accordingly.

## Setting benchmark parameters

Benchmark configurations are defined in `jvector-examples/yaml-configs`. There are three types of files:
- `datasets.yml` which controls which datasets will be used for running the benchmark.
- `default.yml` which defines the default parameter sets to be used for all datasets.
- `dataset-name.yml` which specifies the parameter sets for a single dataset.

### datasets.yml

This file specifies the datasets to be used when running `BenchYAML`. Datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.

### default.yml / \<dataset-name\>.yml

These files define the parameters to be used by `BenchYAML`. The settings in the `default.yml` file apply to all datasets, except ones which have a custom configuration defined in `<dataset-name>.yml`.

See `default.yml` for a list of all options.

Most parameters can be specified as an array. For these parameters, a separate benchmark is run for each value of the parameter. If multiple parameters are specified as arrays, a benchmark is run for each combination (i.e. taking the Cartesian product). For example:
```yaml
construction:
M: [32, 64]
ef: [100, 200]
```
will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is useful when running a grid search to identify the best-performing parameters.
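
The expansion above can be sketched in plain Java. This is illustrative only (BenchYAML performs the equivalent expansion internally); the class and method names are not part of JVector.

```java
import java.util.ArrayList;
import java.util.List;

public class GridDemo {
    // Expand two parameter arrays into their Cartesian product, one entry per benchmark run.
    static List<String> expand(List<Integer> ms, List<Integer> efs) {
        List<String> runs = new ArrayList<>();
        for (int m : ms)
            for (int ef : efs)
                runs.add("M=" + m + ", ef=" + ef);
        return runs;
    }

    public static void main(String[] args) {
        // Mirrors the yaml example: M: [32, 64] crossed with ef: [100, 200] -> 4 runs
        for (String run : expand(List.of(32, 64), List.of(100, 200)))
            System.out.println("benchmark with " + run);
    }
}
```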


<!-- TODO Bench args -->
150 changes: 150 additions & 0 deletions docs/draft-hello.md
@@ -0,0 +1,150 @@
# JVector Tutorial

JVector provides a graph index for ANN search which is a hybrid of DiskANN and HNSW. You can think of it as a Vamana index with an HNSW-style hierarchy. The rest of this tutorial assumes a basic understanding of vector search, but no prior knowledge of HNSW or DiskANN is required.

JVector provides a `VectorFloat` datatype for representing vectors, as an abstraction over the physical vector type. Therefore, the first step to using JVector is to understand how to create a `VectorFloat`:

```java
// `VectorizationProvider` is automatically picked based on the system, language version and runtime flags
// and determines the actual type of the vector data, and provides implementations for common operations
// like the inner product.
VectorTypeSupport vts = VectorizationProvider.getInstance().getVectorTypeSupport();

int dimension = 3;

// Create a `VectorFloat` from a `float[]`.
// The types that can be converted to a VectorFloat are technically dependent on which VectorizationProvider is picked,
// but `float[]` is generally a safe bet.
float[] vector0Array = new float[]{0.1f, 0.2f, 0.3f};
VectorFloat<?> vector0 = vts.createFloatVector(vector0Array);
```

> [!TIP]
> For other ways to create vectors, refer to the javadoc for `VectorTypeSupport`.

Before creating the vector index, we will group all of our base vectors into a container which implements the `RandomAccessVectorValues` interface. Many APIs in JVector accept an instance of `RandomAccessVectorValues` as input. In this case, we'll use it to specify the vectors to be used to build the index.

```java
// This toy example uses only three vectors; in practice you might have millions or more.
List<VectorFloat<?>> baseVectors = List.of(
vector0,
vts.createFloatVector(new float[]{0.01f, 0.15f, -0.3f}),
vts.createFloatVector(new float[]{-0.2f, 0.1f, 0.35f})
);

// RAVV or `ravv` is convenient shorthand for a RandomAccessVectorValues instance
RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, dimension /* 3 */);
```

> [!TIP]
> In this example, all vectors are loaded in-memory, but RAVVs are quite versatile. For example, you might have a RAVV backed by disk (check out `MMapRandomAccessVectorValues.java`) or write your own custom RAVV that transfers data over a network interface.

> [!NOTE]
> A note on terminology:
> - "Base" vectors are the vectors used to build the index. Each vector becomes a node in the graph. May also be referred to as the "train" set.
> - "Query" vectors are vectors used as queries for ANN search after the index has been built. In some cases you may want to use some base vectors as queries. Also referred to as the "test" set.

We're now ready to create a graph-based vector index. We'll do this using a `GraphIndexBuilder` as an intermediate. Let's take a look at the signature of one of its constructors:

```java
public GraphIndexBuilder(BuildScoreProvider scoreProvider,
int dimension,
int M,
int beamWidth,
float neighborOverflow,
float alpha,
boolean addHierarchy,
boolean refineFinalGraph);
```

This constructor asks for something called a `BuildScoreProvider`, the vector dimension, and a set of graph parameters.

The `BuildScoreProvider` is used by the graph builder to compute the similarity scores between any two vectors at build time. We'll use the RAVV we created earlier to generate a BuildScoreProvider:

```java
// The type of similarity score to use. JVector supports EUCLIDEAN (L2 distance), DOT_PRODUCT and COSINE.
VectorSimilarityFunction similarityFunction = VectorSimilarityFunction.EUCLIDEAN;

// A simple score provider which can compute exact similarity scores by holding a reference to all the base vectors.
BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, similarityFunction);
```
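
As an aside, it helps to see concretely what a "higher is better" similarity score looks like for a distance-based metric. The standalone sketch below uses the `1 / (1 + d²)` mapping from squared L2 distance that Lucene's `EUCLIDEAN` uses; treat the exact formula JVector applies as an assumption and check `VectorSimilarityFunction` if the precise value matters.

```java
public class SimilarityDemo {
    // Convert squared L2 distance into a similarity in (0, 1]:
    // identical vectors score 1.0, and the score shrinks as distance grows.
    static float euclideanSimilarity(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return 1f / (1f + sum);
    }

    public static void main(String[] args) {
        float[] v0 = {0.1f, 0.2f, 0.3f};
        float[] q  = {0.2f, 0.3f, 0.4f};
        System.out.println(euclideanSimilarity(v0, v0)); // identical vectors -> 1.0
        System.out.println(euclideanSimilarity(v0, q));  // nearby vectors -> slightly below 1.0
    }
}
```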

Let's also initialize the graph parameters. For now we won't worry about the exact function of each parameter, except to note that the values below are a reasonable set of defaults. Refer to the DiskANN and HNSW papers for more details.

<!-- TODO describe graph parameters in a separate doc -->

```java
// Graph construction parameters
int M = 32; // maximum degree of each node
int efConstruction = 100; // search depth during construction
float neighborOverflow = 1.2f;
float alpha = 1.2f; // note: not the best setting for 3D vectors, but good in the general case
boolean addHierarchy = true; // use an HNSW-style hierarchy
boolean refineFinalGraph = true;
```

Now we can create the graph index:

```java
// Build the graph index using a Builder
// Remember to close the builder using builder.close() or a try-with-resources block
GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
dimension,
M,
efConstruction,
neighborOverflow,
alpha,
addHierarchy,
refineFinalGraph);
ImmutableGraphIndex graph = builder.build(ravv);
```

> [!NOTE]
> You may notice that we supplied the same `ravv` to `builder.build`, even though we'd already passed the RAVV in while creating the `BuildScoreProvider`. This is necessary because, generally speaking, the `BuildScoreProvider` won't keep a reference to the actual base vectors; it just so happens that the "exact" score provider we're using does.

At this point, you have a completed graph index that lives entirely in memory.

To perform a search operation, you need to first create a `GraphSearcher`.

> [!IMPORTANT]
> The graph index itself can be shared between threads, but `GraphSearcher`s maintain internal state and are therefore NOT thread-safe.

```java
// Remember to close the searcher using searcher.close() or a try-with-resources block
var searcher = new GraphSearcher(graph);
```
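
One common way to respect this constraint in a multi-threaded service is to hand each thread its own searcher over the shared graph. The sketch below demonstrates the pattern with a stand-in class so it runs without JVector on the classpath; in real code the initializer would be something like `() -> new GraphSearcher(graph)`, and each searcher would still need to be closed when its thread is done.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PerThreadDemo {
    // Stand-in for a non-thread-safe searcher; in real code this would be GraphSearcher.
    static class Searcher {}

    // ThreadLocal lazily creates one Searcher per thread on first access.
    static final ThreadLocal<Searcher> SEARCHERS = ThreadLocal.withInitial(Searcher::new);

    // Run nThreads tasks and count how many distinct Searcher instances were handed out.
    static int distinctInstances(int nThreads) {
        Set<Searcher> distinct = ConcurrentHashMap.newKeySet();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(() -> distinct.add(SEARCHERS.get()));
            threads[i].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        // One searcher per thread: the shared graph stays read-only, the mutable state is per-thread.
        System.out.println("distinct searchers across 4 threads: " + distinctInstances(4));
    }
}
```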

Generally speaking, you can't pass in a `VectorFloat<?>` directly to the `GraphSearcher`. You need to wrap the query vector with a `SearchScoreProvider`, similar in spirit to the `BuildScoreProvider` we created earlier.

```java
VectorFloat<?> queryVector = vts.createFloatVector(new float[]{0.2f, 0.3f, 0.4f}); // for example
// The in-memory graph index doesn't own the actual vectors used to construct it.
// To compute exact scores at search time, you need to pass in the base RAVV again,
// in addition to the actual query vector
SearchScoreProvider ssp = DefaultSearchScoreProvider.exact(queryVector, similarityFunction, ravv);
```

Now we can run a search:

```java
int topK = 10; // number of approximate nearest neighbors to fetch
// You can provide a filter to the query as a bit mask.
// In this case we want the actual topK neighbors without filtering,
// so we pass in a virtual bit mask representing all ones.
SearchResult result = searcher.search(ssp, topK, Bits.ALL);

for (NodeScore ns : result.getNodes()) {
int id = ns.node; // you can look up this ID in the RAVV
float score = ns.score; // the similarity score between this vector and the query vector (higher -> more similar)
System.out.println("ID: " + id + ", Score: " + score + ", Vector: " + ravv.getVector(id));
}
```

For the full example, refer to `jvector-examples/../VectorIntro.java`.

Next steps:
- Understand index construction parameters
- Overquerying to improve search accuracy
- Quantization for space efficiency
- Building indexes for larger-than-memory datasets on disk
- VectorizationProviders
58 changes: 58 additions & 0 deletions jvector-examples/pom.xml
@@ -48,6 +48,27 @@
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<configuration>
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<dependencies>
@@ -156,6 +177,17 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>
@@ -204,6 +236,18 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>
@@ -296,6 +340,20 @@
<skip>false</skip>
</configuration>
<executions>
<execution>
<id>intro</id>
<configuration>
<arguments>
<argument>-classpath</argument>
<classpath/>
<argument>--enable-native-access=ALL-UNNAMED</argument>
<argument>--add-modules=jdk.incubator.vector</argument>
<argument>-ea</argument>
<argument>-Djvector.experimental.enable_native_vectorization=true</argument>
<argument>io.github.jbellis.jvector.example.VectorIntro</argument>
</arguments>
</configuration>
</execution>
<execution>
<id>sift</id>
<configuration>