1 change: 1 addition & 0 deletions docs/best-practices/using_data_skipping_indices.md
@@ -27,6 +27,7 @@ There are several types of data skipping indexes, each suited to different types
* **set(N)**: Tracks a set of values up to a specified size N for each block. Effective on columns with low cardinality per blocks.
* **bloom_filter**: Probabilistically determines if a value exists in a block, allowing fast approximate filtering for set membership. Effective for optimizing queries looking for the “needle in a haystack”, where a positive match is needed.
* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings — particularly useful for log data or text search use cases.
* **text**: Builds an inverted index over tokenized string data, enabling efficient and deterministic full-text search. Recommended for natural language or large free-form text columns that require precise token lookup and scalable multi-term search, in preference to approximate Bloom filter–based approaches.

While powerful, skip indexes must be used with care. They only provide benefit when they eliminate a meaningful number of data blocks, and can actually introduce overhead if the query or data structure doesn't align. If even a single matching value exists in a block, that entire block must still be read.
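One way to verify that a skip index is actually eliminating blocks is `EXPLAIN indexes = 1` (the table, column, and index below are illustrative, not part of this guide's examples):

```sql
-- Reports, per index, how many granules were selected out of the total.
-- If the selected/total ratio stays close to 1, the index is adding
-- maintenance overhead without skipping much data.
EXPLAIN indexes = 1
SELECT count()
FROM logs
WHERE hasToken(Body, 'error');
```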

30 changes: 29 additions & 1 deletion docs/guides/best-practices/skipping-indexes-examples.md
@@ -18,12 +18,13 @@
INDEX name expr TYPE type(...) [GRANULARITY N]
```

ClickHouse supports five skip index types:
ClickHouse supports six skip index types:

| Index Type | Description |
|------------|-------------|
| **minmax** | Tracks minimum and maximum values in each granule |
| **set(N)** | Stores up to N distinct values per granule |
| **text** | Inverted index over tokenized string data for full text search |
| **bloom_filter([false_positive_rate])** | Probabilistic filter for existence checks |
| **ngrambf_v1** | N-gram bloom filter for substring searches |
| **tokenbf_v1** | Token-based bloom filter for full-text searches |
@@ -76,6 +77,25 @@

A creation/materialization workflow and the before/after effect are shown in the [basic operation guide](/optimize/skipping-indexes#basic-operation).

## Text index (text) for full text search {#textindex-for-full-text-search}

The `text` index is an inverted index over tokenized text data.
It is designed specifically for full-text search workloads, enabling efficient and deterministic token and term lookup.
It is recommended for natural language or large-scale text search use cases.

See [Full-text Search with Text Indexes](/engines/table-engines/mergetree-family/textindexes) for more details and examples.

```sql
ALTER TABLE logs ADD INDEX msg_text msg TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE logs MATERIALIZE INDEX msg_text;

SELECT count() FROM logs WHERE hasAllTokens(msg, 'exception');
```

See the [observability schema design documentation](/use-cases/observability/schema-design#text-index-for-full-text-search) for a more complete example.

The text index is fully deterministic and tunable in terms of tokenization and text processing, at the cost of somewhat higher storage consumption compared with Bloom filter–based indexes.
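To inspect which tokens the index would store for a given string (and therefore what `hasAllTokens` can match), the `tokens` function can be used; the sample string below is illustrative:

```sql
-- With the default tokenizer, non-alphanumeric characters act as separators,
-- so only whole tokens can be matched at query time.
SELECT tokens('Connection accepted from 10.0.0.1');
```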

## Generic Bloom filter (scalar) {#generic-bloom-filter-scalar}

The `bloom_filter` index is good for "needle in a haystack" equality/IN membership. It accepts an optional parameter which is the false-positive rate (default 0.025).
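As a minimal sketch (the table, column, and index names are assumptions, not part of this guide's dataset), a tighter false-positive rate trades index size for fewer unnecessary block reads:

```sql
-- 0.01 false-positive rate: larger index than the 0.025 default,
-- but fewer blocks read unnecessarily on a miss.
ALTER TABLE logs ADD INDEX idx_trace trace_id TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE logs MATERIALIZE INDEX idx_trace;

SELECT count() FROM logs WHERE trace_id = 'abc123';
```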
@@ -92,6 +112,10 @@

## N-gram Bloom filter (ngrambf\_v1) for substring search {#n-gram-bloom-filter-ngrambf-v1-for-substring-search}

> Note: With text indexes generally available (GA) starting from ClickHouse version 26.2, Bloom filter–based indexes are no longer recommended for full-text search.
> Although they're more compact, they're probabilistic, tend to produce false positives, and offer limited configurability.

The `ngrambf_v1` index splits strings into n-grams. It works well for `LIKE '%...%'` queries. It supports String/FixedString/Map (via mapKeys/mapValues), as well as tunable size, hash count, and seed. See the documentation for [N-gram bloom filter](/engines/table-engines/mergetree-family/mergetree#n-gram-bloom-filter) for further details.

```sql
@@ -128,6 +152,10 @@

## Token Bloom filter (tokenbf\_v1) for word-based search {#token-bloom-filter-tokenbf-v1-for-word-based-search}

> Note: With text indexes generally available (GA) starting from ClickHouse version 26.2, Bloom filter–based indexes are no longer recommended for full-text search.
> Although they're more compact, they're probabilistic, tend to produce false positives, and offer limited configurability.

`tokenbf_v1` indexes tokens separated by non-alphanumeric characters. You should use it with [`hasToken`](/sql-reference/functions/string-search-functions#hasToken), `LIKE` word patterns or equals/IN. It supports `String`/`FixedString`/`Map` types.
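A sketch of a `tokenbf_v1` definition and lookup (table and column names are illustrative; the three parameters are the filter size in bytes, the number of hash functions, and a seed):

```sql
-- Illustrative sizing only: tune filter size and hash count per dataset.
ALTER TABLE logs ADD INDEX idx_msg_tok msg TYPE tokenbf_v1(10000, 3, 7) GRANULARITY 4;
ALTER TABLE logs MATERIALIZE INDEX idx_msg_tok;

-- hasToken matches whole tokens only (split on non-alphanumeric characters).
SELECT count() FROM logs WHERE hasToken(msg, 'timeout');
```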

See [Token bloom filter](/engines/table-engines/mergetree-family/mergetree#token-bloom-filter) and [Bloom filter types](/optimize/skipping-indexes#skip-index-types) pages for more details.
9 changes: 9 additions & 0 deletions docs/guides/best-practices/skipping-indexes.md
@@ -136,6 +136,15 @@ an unlimited number of discrete values). This set contains all values in the bl

The cost, performance, and effectiveness of this index is dependent on the cardinality within blocks. If each block contains a large number of unique values, either evaluating the query condition against a large index set will be very expensive, or the index won't be applied because the index is empty due to exceeding max_size.
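As a sketch under these constraints (table and column names are illustrative), a `set` index fits a column whose distinct values per block stay below the cap:

```sql
-- At most 100 distinct values are tracked per indexed block; a block
-- that exceeds the cap stores no set and can never be skipped.
ALTER TABLE events ADD INDEX idx_code error_code TYPE set(100) GRANULARITY 4;

SELECT count() FROM events WHERE error_code = 500;
```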

<!-- vale off -->
### text {#text}
<!-- vale on -->

For workloads that involve natural language or free-form text search (for example, searching words or phrases in large text columns), ClickHouse provides a **text index** (a true inverted index).
The text index supports efficient full-text search semantics and tokenized lookups. It is the recommended choice for full-text search queries because it provides deterministic token indexing and better performance for search functions such as `hasAnyTokens` and `hasAllTokens`, and it also optimizes the common text search functions.

See the [text index documentation](/engines/table-engines/mergetree-family/textindexes) for details.

### Bloom filter types {#bloom-filter-types}

A *Bloom filter* is a data structure that allows space-efficient testing of set membership at the cost of a slight chance of false positives. A false positive isn't a significant concern in the case of skip indexes because the only disadvantage is reading a few unnecessary blocks. However, the potential for false positives does mean that the indexed expression should be expected to be true, otherwise valid data may be skipped.
192 changes: 185 additions & 7 deletions docs/use-cases/observability/build-your-own/schema-design.md
@@ -1449,9 +1449,191 @@ You should read and understand the [guide to secondary indices](/optimize/skippi

**In general, they're effective when a strong correlation exists between the primary key and the targeted, non-primary column/expression and users are looking up rare values i.e. those which don't occur in many granules.**

### Bloom filters for text search {#bloom-filters-for-text-search}
### Text index for full text search {#text-index-for-full-text-search}

For Observability queries, secondary indices can be useful when you need to perform text searches. Specifically, the ngram and token-based bloom filter indexes [`ngrambf_v1`](/optimize/skipping-indexes#bloom-filter-types) and [`tokenbf_v1`](/optimize/skipping-indexes#bloom-filter-types) can be used to accelerate searches over String columns with the operators `LIKE`, `IN`, and hasToken. Importantly, the token-based index generates tokens using non-alphanumeric characters as a separator. This means only tokens (or whole words) can be matched at query time. For more granular matching, the [N-gram bloom filter](/optimize/skipping-indexes#bloom-filter-types) can be used. This splits strings into ngrams of a specified size, thus allowing sub-word matching.
For production-grade full text search, ClickHouse provides a specialized [text index](/engines/table-engines/mergetree-family/textindexes).
This index builds an inverted index over tokenized text data, enabling fast token-based search queries.

Text indexes are generally available (GA) starting from ClickHouse version 26.2.

They can be defined on the following column types in MergeTree tables: [String](/sql-reference/data-types/string.md), [FixedString](/sql-reference/data-types/fixedstring.md), [Array(String)](/sql-reference/data-types/array.md), [Array(FixedString)](/sql-reference/data-types/array.md), and [Map](/sql-reference/data-types/map.md) (via the [mapKeys](/sql-reference/functions/tuple-map-functions.md/#mapKeys) and [mapValues](/sql-reference/functions/tuple-map-functions.md/#mapValues) functions).
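For example, a hypothetical `Map` column (the `LogAttributes` name below is an assumption for illustration) could be indexed over its values:

```sql
-- Index the values of a Map column so token searches over map values
-- can use the inverted index.
ALTER TABLE otel_logs
    ADD INDEX idx_attr_values mapValues(LogAttributes) TYPE text(tokenizer = splitByNonAlpha);
```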

A text index requires a `tokenizer` argument in its definition. Optionally, a preprocessor function can be specified to transform the input string before tokenization.

The recommended functions to search in the index are: `hasAnyTokens` and `hasAllTokens`.
Some traditional string search functions are also automatically optimized when a text index is present. See the documentation for details and supported functions [here](/engines/table-engines/mergetree-family/textindexes#using-a-text-index) and [here](/engines/table-engines/mergetree-family/textindexes#functions-example-hasanytokens-hasalltokens).

In the examples below, we use a structured logs dataset.

```sql
CREATE TABLE otel_logs
(
`Body` String,
`Timestamp` DateTime,
`ServiceName` LowCardinality(String),
`Status` UInt16,
`RequestProtocol` LowCardinality(String),
`RunTime` UInt32,
`Size` UInt32,
`UserAgent` String,
`Referer` String,
`RemoteUser` String,
`RequestType` LowCardinality(String),
`RequestPath` String,
`RemoteAddress` IPv4,
`RefererDomain` String,
`RequestPage` String,
`SeverityText` LowCardinality(String),
`SeverityNumber` UInt8
)
ENGINE = MergeTree
ORDER BY Timestamp
SETTINGS index_granularity = 8192
```

Without an index, we can still use the same search functions.

```sql
SELECT count()
FROM otel_logs
WHERE hasAllTokens(Body, ['Connection', 'accepted'])

Query id: ff0b866c-6df7-47be-9e36-795ef3888169

┌─count()─┐
1. │ 27281 │
└─────────┘

1 row in set. Elapsed: 0.584 sec. Processed 19.95 million rows, 3.08 GB (34.15 million rows/s., 5.27 GB/s.)
```

This query performs a full scan of the Body column.

#### Adding a text index {#adding-a-text-index}

A text index can be added during table creation:

```sql
CREATE TABLE otel_logs_index_body
(
`Body` String,
`Timestamp` DateTime,
`ServiceName` LowCardinality(String),
`Status` UInt16,
`RequestProtocol` LowCardinality(String),
`RunTime` UInt32,
`Size` UInt32,
`UserAgent` String,
`Referer` String,
`RemoteUser` String,
`RequestType` LowCardinality(String),
`RequestPath` String,
`RemoteAddress` IPv4,
`RefererDomain` String,
`RequestPage` String,
`SeverityText` LowCardinality(String),
`SeverityNumber` UInt8,
INDEX idx_body Body TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000
)
ENGINE = MergeTree
ORDER BY Timestamp
SETTINGS index_granularity = 8192
```

Or added later using `ALTER TABLE`:

```sql
ALTER TABLE otel_logs ADD INDEX idx_body Body TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000;
ALTER TABLE otel_logs MATERIALIZE INDEX idx_body;
```

This creates an inverted index for the `Body` column using the `splitByNonAlpha` tokenizer.

> Note: A partially materialized index can already be used by queries, but maximum performance improvement is achieved after full materialization.
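Materialization runs as a mutation, so its progress can be followed in `system.mutations`:

```sql
-- The MATERIALIZE INDEX command appears here until it has been applied
-- to all existing parts.
SELECT command, is_done, parts_to_do
FROM system.mutations
WHERE `table` = 'otel_logs' AND is_done = 0;
```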

```sql
SELECT count()
FROM otel_logs_index_body
WHERE hasAllTokens(Body, ['Connection', 'accepted'])

Query id: ebc31a94-92b3-48aa-860a-939d7e788ef4

┌─count()─┐
1. │ 27281 │
└─────────┘

1 row in set. Elapsed: 0.013 sec. Processed 20.41 million rows, 20.41 MB (1.59 billion rows/s., 1.59 GB/s.)
Peak memory usage: 15.23 MiB.
```

The index reduces the scanned data from gigabytes to megabytes and improves query time by approximately 45×.

#### Using a preprocessor {#using-a-preprocessor}

In this dataset, the Body column contains a JSON-formatted string with multiple key-value pairs (e.g., `msg`, `id`, `ctx`, `attr`, etc.).

Assume we are only interested in searching within the `msg` field.
Instead of indexing the entire JSON string, we can define a preprocessor to extract only the `msg` value before tokenization.

For example:

```sql
INDEX idx_text Body TYPE text(tokenizer = splitByNonAlpha, preprocessor = JSONExtract(Body, 'msg', 'String')) GRANULARITY 100000000
```

In this example the preprocessor:

- Reduces the amount of text that is tokenized and indexed
- Decreases index size
- Reduces the probability of false positives
- Improves query performance

```sql
SELECT count()
FROM otel_logs_text_body_preprocessed
WHERE hasAllTokens(Body, ['Connection', 'accepted'])

Query id: f6a5cd9c-665f-4e4f-82f2-d6a4408a68a8

┌─count()─┐
1. │ 27281 │
└─────────┘

1 row in set. Elapsed: 0.006 sec. Processed 13.54 million rows, 13.54 MB (2.45 billion rows/s., 2.45 GB/s.)
Peak memory usage: 1.95 MiB.
```

Compared to the non-preprocessed index, performance improves by approximately 2×.

#### Comparing index sizes {#comparing-index-sizes}

```sql
SELECT
`table`,
formatReadableSize(data_compressed_bytes) AS compressed_size,
formatReadableSize(data_uncompressed_bytes) AS uncompressed_size
FROM system.data_skipping_indices
WHERE startsWith(`table`, 'otel_logs')

Query id: 730e4b77-e697-40b3-a24d-67219ec42075

┌─table────────────────────────────┬─compressed_size─┬─uncompressed_size─┐
1. │ otel_logs_text_body_preprocessed │ 423.98 KiB      │ 424.29 KiB        │
2. │ otel_logs_index_body             │ 2.76 GiB        │ 2.78 GiB          │
   └──────────────────────────────────┴─────────────────┴───────────────────┘
```

Using a preprocessor reduces index size from gigabytes to a few hundred kilobytes — approximately 0.01% of the original size — while also improving query performance.

#### Other indexes for text search {#other-indexes-for-text-search}

Further details on secondary skip indices can be found [here](/optimize/skipping-indexes#skip-index-functions).

<details markdown="1">

<summary>Bloom filters for text search</summary>

The ngram and token-based bloom filter indexes [`ngrambf_v1`](/optimize/skipping-indexes#bloom-filter-types) and [`tokenbf_v1`](/optimize/skipping-indexes#bloom-filter-types) can be used to accelerate searches over String columns with the operators `LIKE`, `IN`, and `hasToken`. Importantly, the token-based index generates tokens using non-alphanumeric characters as separators. This means only tokens (or whole words) can be matched at query time. For more granular matching, the [N-gram bloom filter](/optimize/skipping-indexes#bloom-filter-types) can be used. This splits strings into n-grams of a specified size, thus allowing sub-word matching.

To evaluate the tokens that will be produced and therefore, matched, the `tokens` function can be used:

@@ -1477,10 +1659,6 @@ SELECT ngrams('https://www.zanbil.ir/m/filter/b113', 3)
1 row in set. Elapsed: 0.008 sec.
```

:::note Inverted indices
ClickHouse also has experimental support for inverted indices as a secondary index. We don't currently recommend these for logging datasets but anticipate they will replace token-based bloom filters when they're production-ready.
:::

For the purposes of this example we use the structured logs dataset. Suppose we wish to count logs where the `Referer` column contains `ultra`.

```sql
Expand Down Expand Up @@ -1629,7 +1807,7 @@ In the examples above, we can see the secondary bloom filter index is 12MB - alm

Bloom filters can require significant tuning. We recommend following the notes [here](/engines/table-engines/mergetree-family/mergetree#bloom-filter) which can be useful in identifying optimal settings. Bloom filters can also be expensive at insert and merge time. You should evaluate the impact on insert performance prior to adding bloom filters to production.

Further details on secondary skip indices can be found [here](/optimize/skipping-indexes#skip-index-functions).
</details>

### Extracting from maps {#extracting-from-maps}

1 change: 1 addition & 0 deletions scripts/aspell-ignore/en/aspell-dict.txt
@@ -2065,6 +2065,7 @@ cond
conf
config
configs
configurability
conformant
congruential
conjuctive