80 changes: 80 additions & 0 deletions docs/ai/text-search/search-function.md
@@ -35,6 +35,29 @@ Usage

When `default_field` is provided, Doris expands bare terms or functions to that field. For example, `SEARCH('foo bar', 'tags', 'and')` behaves like `SEARCH('tags:ALL(foo bar)')`, while `SEARCH('foo bark', 'tags')` expands to `tags:ANY(foo bark)`. Explicit boolean operators inside the DSL always take precedence over the default operator.
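
The expansion described above can be modeled in a few lines of Python. This is only an illustrative sketch, not Doris's parser: `expand_bare_terms` is a hypothetical helper, and it handles only the simplest case where the expression consists entirely of bare terms with no explicit fields or boolean operators.

```python
def expand_bare_terms(expression: str, default_field: str,
                      default_operator: str = "or") -> str:
    """Model how a DSL made only of bare terms is qualified with a field.

    Hypothetical helper for illustration; real expressions may contain
    explicit fields and operators, which this sketch does not handle.
    """
    op = default_operator.lower()  # the operator is case-insensitive
    if op not in ("and", "or"):
        raise ValueError("default_operator must be 'and' or 'or'")
    wrapper = "ALL" if op == "and" else "ANY"
    terms = expression.split()
    return f"{default_field}:{wrapper}({' '.join(terms)})"


print(expand_bare_terms("foo bar", "tags", "and"))  # tags:ALL(foo bar)
print(expand_bare_terms("foo bark", "tags"))        # tags:ANY(foo bark)
```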

### Options Parameter (JSON format)

The second parameter can also be a JSON string for advanced configuration:

```sql
SEARCH('<search_expression>', '<options_json>')
```

**Supported options:**

| Option | Type | Description |
|--------|------|-------------|
| `default_field` | string | Column name for terms without explicit field |
| `default_operator` | string | `and` or `or` for multi-term expressions |
| `mode` | string | `standard` (default) or `lucene` |
| `minimum_should_match` | integer | Minimum SHOULD clauses to match (Lucene mode only) |

**Example:**
```sql
SELECT * FROM docs WHERE search('apple banana',
'{"default_field":"title","default_operator":"and","mode":"lucene"}');
```
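
When the options are assembled in application code, generating the JSON with a serializer avoids hand-written quoting mistakes. The sketch below is a client-side convenience, not part of Doris: `search_predicate` is a hypothetical helper, and its quoting assumes MySQL-style string literals (backslash as escape character, single quotes doubled). A driver's parameter binding would normally handle the quoting instead.

```python
import json


def search_predicate(expression: str, options: dict) -> str:
    """Build a search(...) predicate string with a JSON options argument."""
    # Compact separators reproduce the spacing used in the documented example.
    options_json = json.dumps(options, separators=(",", ":"))

    def quote(text: str) -> str:
        # Assumption: MySQL-style literal quoting.
        return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"

    return f"search({quote(expression)}, {quote(options_json)})"


predicate = search_predicate(
    "apple banana",
    {"default_field": "title", "default_operator": "and", "mode": "lucene"},
)
print("SELECT * FROM docs WHERE " + predicate + ";")
```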

`SEARCH()` follows SQL three-valued logic. Rows where all referenced fields are NULL evaluate to UNKNOWN (filtered out in the `WHERE` clause) unless other predicates short-circuit the expression (`TRUE OR NULL = TRUE`, `FALSE OR NULL = NULL`, `NOT NULL = NULL`), matching the behavior of dedicated text search operators.
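
The three-valued rules quoted above can be checked with a small Python model in which `None` stands for SQL NULL/UNKNOWN. The function names are illustrative only.

```python
from typing import Optional

Tri = Optional[bool]  # True, False, or None for NULL/UNKNOWN


def tri_or(a: Tri, b: Tri) -> Tri:
    if a is True or b is True:
        return True          # TRUE OR anything = TRUE
    if a is None or b is None:
        return None          # FALSE OR NULL = NULL
    return False


def tri_and(a: Tri, b: Tri) -> Tri:
    if a is False or b is False:
        return False         # FALSE AND anything = FALSE
    if a is None or b is None:
        return None
    return True


def tri_not(a: Tri) -> Tri:
    return None if a is None else (not a)  # NOT NULL = NULL


def where_keeps(value: Tri) -> bool:
    """A WHERE clause keeps a row only when the predicate is TRUE."""
    return value is True


print(tri_or(True, None))    # True
print(tri_or(False, None))   # None
print(where_keeps(None))     # False
```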

### Current Supported Queries
@@ -100,6 +123,38 @@ SELECT id, title FROM search_test_basic
WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)');
```

#### Lucene Boolean Mode

Lucene mode mimics the `query_string` behavior of Elasticsearch/Lucene: boolean operators act as left-to-right modifiers on the terms around them rather than as operators of traditional boolean algebra.

**Key differences from standard mode:**
- AND/OR/NOT are modifiers that affect adjacent terms
- Operators are processed left to right
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)

**Enable Lucene mode:**
```sql
-- Basic Lucene mode
SELECT * FROM docs WHERE search('apple AND banana',
'{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry',
'{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | -a (MUST_NOT) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

**Note:** In Lucene mode, `a AND b OR c` is parsed left to right: the OR operator changes `b` from MUST to SHOULD, so only `a` remains required. Set `minimum_should_match` to require that at least that many SHOULD clauses match.
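
One way to picture the left-to-right modifier behavior is the sketch below. It is a simplified model written to reproduce the rows of the comparison table, not Doris's actual parser; `assign_occurs` is a hypothetical name, and the input is assumed to be whitespace-separated bare terms and uppercase operators with no parentheses or field prefixes.

```python
def assign_occurs(query: str) -> str:
    """Assign '+' (MUST), '' (SHOULD) or '-' (MUST_NOT) to each term."""
    clauses = []      # list of [occur, term]
    conj = None       # pending AND / OR seen before the next term
    negate = False    # pending NOT seen before the next term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
        elif token == "NOT":
            negate = True
        else:
            # The conjunction also rewrites the previous clause,
            # unless that clause is prohibited (MUST_NOT).
            if clauses and conj is not None and clauses[-1][0] != "-":
                clauses[-1][0] = "+" if conj == "AND" else ""
            if negate:
                occur = "-"
            elif conj == "AND":
                occur = "+"
            else:
                occur = ""
            clauses.append([occur, token])
            conj, negate = None, False
    return " ".join(occur + term for occur, term in clauses)


for q in ("a AND b", "a OR b", "NOT a", "a AND NOT b", "a AND b OR c"):
    print(f"{q!r:16} -> {assign_occurs(q)}")
```

In the last query, `b` is first promoted to MUST by AND and then demoted to SHOULD by the following OR, which is why only `a` ends up required.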

#### Phrase query
- Syntax: `column:"quoted phrase"`
- Semantics: matches contiguous tokens in order using the column's analyzer; quotes must wrap the entire phrase.
@@ -253,6 +308,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta')
ORDER BY id;
```

#### Escape Characters

Use backslash (`\`) to escape special characters in DSL:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space (joins terms) | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |

**Example:**
```sql
-- Search for value containing space as single term
SELECT * FROM docs WHERE search('title:First\\ Value');

-- Search for value with parentheses
SELECT * FROM docs WHERE search('title:hello\\(world\\)');

-- Search for value with colon
SELECT * FROM docs WHERE search('title:key\\:value');
```

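**Note:** Inside a SQL string literal the backslash is itself an escape character, so each DSL backslash must be doubled: write `\\` in SQL to pass a single `\` to the DSL. A literal backslash in the DSL (`\\`) is therefore written as `\\\\` in SQL.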
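
The two escaping layers can be kept apart in client code. The following sketch uses hypothetical helpers (`escape_dsl`, `to_sql_literal`), covers only the four escapes listed in the table above, and assumes MySQL-style string literal quoting; other characters that are special in the DSL are not handled.

```python
# Characters covered by the escape table: space, parentheses, colon, backslash.
DSL_SPECIAL = set(" ():\\")


def escape_dsl(value: str) -> str:
    """Layer 1: prefix DSL-special characters with a backslash."""
    return "".join("\\" + ch if ch in DSL_SPECIAL else ch for ch in value)


def to_sql_literal(dsl: str) -> str:
    """Layer 2: double backslashes and single quotes for a SQL string literal."""
    return "'" + dsl.replace("\\", "\\\\").replace("'", "''") + "'"


dsl = "title:" + escape_dsl("First Value")
print(dsl)                  # title:First\ Value
print(to_sql_literal(dsl))  # 'title:First\\ Value'
```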

### Current Limitations

- Range and list clauses (`field:[a TO b]`, `field:IN(...)`) still degrade to term lookups; rely on regular SQL predicates for numeric/date ranges or explicit `IN` filters.
@@ -35,6 +35,29 @@ SEARCH('<search_expression>', '<default_field>', '<default_operator>')

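When `default_field` is provided, Doris automatically expands bare terms or functions to that field. For example, `SEARCH('foo bar', 'tags', 'and')` is equivalent to `SEARCH('tags:ALL(foo bar)')`, while `SEARCH('foo bark', 'tags')` expands to `tags:ANY(foo bark)`. Explicit boolean operators in the DSL have the highest precedence and override the default operator.

### Options Parameter (JSON format)

The second parameter can also be a JSON string for advanced configuration: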

```sql
SEARCH('<search_expression>', '<options_json>')
```

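**Supported options:**

| Option | Type | Description |
|--------|------|-------------|
| `default_field` | string | Default column name for terms without an explicit field |
| `default_operator` | string | Default operator for multi-term expressions (`and` or `or`) |
| `mode` | string | `standard` (default) or `lucene` |
| `minimum_should_match` | integer | Minimum number of SHOULD clauses to match (Lucene mode only) |

**Example:**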
```sql
SELECT * FROM docs WHERE search('apple banana',
'{"default_field":"title","default_operator":"and","mode":"lucene"}');
```

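`SEARCH()` follows SQL three-valued logic. When all columns involved in the match are NULL, the result is UNKNOWN (filtered out in `WHERE`). When combined with other sub-expressions, boolean short-circuit rules can still yield TRUE or keep the result NULL (for example `TRUE OR NULL = TRUE`, `FALSE OR NULL = NULL`, `NOT NULL = NULL`), consistent with the text search operators.

### Currently Supported Syntax
@@ -100,6 +123,38 @@ SELECT id, title FROM search_test_basic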
WHERE SEARCH('tags:ANY(python javascript) AND (category:Technology OR category:Programming)');
```

#### Lucene Boolean Mode

Lucene mode mimics the `query_string` behavior of Elasticsearch/Lucene: boolean operators act as left-to-right modifiers rather than as operators of traditional boolean algebra.

**Key differences from standard mode:**
- AND/OR/NOT are modifiers that affect adjacent terms
- Operators are processed left to right
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)

**Enable Lucene mode:**
```sql
-- Basic Lucene mode
SELECT * FROM docs WHERE search('apple AND banana',
'{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry',
'{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | -a (MUST_NOT) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

**Note:** In Lucene mode, `a AND b OR c` is parsed left to right: the OR operator changes `b` from MUST to SHOULD. Use `minimum_should_match` to require SHOULD clauses to match.
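
The effect of `minimum_should_match` on a clause list such as `+a b c` can be illustrated with a toy matcher. This sketch is modeled on Lucene-style boolean semantics and on the "min=1" entry in the table above; it is an assumption about the behavior, not Doris's implementation, and `matches` is a hypothetical helper operating on a set of already-tokenized document terms.

```python
def matches(doc_terms: set, clauses: str, minimum_should_match: int = 0) -> bool:
    """Evaluate clauses written as '+term' (MUST), '-term' (MUST_NOT), 'term' (SHOULD)."""
    parts = clauses.split()
    must = [c[1:] for c in parts if c.startswith("+")]
    must_not = [c[1:] for c in parts if c.startswith("-")]
    should = [c for c in parts if c[0] not in "+-"]

    if any(term in doc_terms for term in must_not):
        return False
    if not all(term in doc_terms for term in must):
        return False

    required = minimum_should_match
    # With no MUST clause, at least one SHOULD clause has to match ("min=1").
    if required == 0 and not must and should:
        required = 1
    hits = sum(term in doc_terms for term in should)
    return hits >= required


print(matches({"a"}, "+a b c"))          # True: b and c are optional
print(matches({"a"}, "+a b c", 1))       # False: no SHOULD clause matched
print(matches({"a", "c"}, "+a b c", 1))  # True
```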

#### Phrase query
- Syntax: `column:"quoted phrase"`
- Semantics: matches contiguous tokens in order using the column's analyzer; double quotes must wrap the entire phrase.
@@ -253,6 +308,31 @@ WHERE SEARCH('properties.message:hello OR properties.category:beta')
ORDER BY id;
```

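#### Escape Characters

Use a backslash (`\`) to escape special characters in the DSL:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space (joins terms) | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |

**Example:**
```sql
-- Search for a value containing a space as a single term
SELECT * FROM docs WHERE search('title:First\\ Value');

-- Search for a value with parentheses
SELECT * FROM docs WHERE search('title:hello\\(world\\)');

-- Search for a value with a colon
SELECT * FROM docs WHERE search('title:key\\:value');
```

**Note:** Inside a SQL string literal, backslashes must be doubled. Write `\\` in SQL to produce a single `\` in the DSL.

### Current Limitations

- Range and list clauses (such as `field:[a TO b]`, `field:IN(...)`) still degrade to plain term matching; use regular SQL range or `IN` filters instead.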
@@ -22,11 +22,19 @@ SEARCH is a predicate function that returns a boolean value and can appear as a filter condition in
SEARCH('<search_expression>')
SEARCH('<search_expression>', '<default_field>')
SEARCH('<search_expression>', '<default_field>', '<default_operator>')
SEARCH('<search_expression>', '<options_json>')
```

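- `<search_expression>`: SEARCH DSL query expression (string literal).
- `<default_field>` (optional): column name automatically applied to terms in the DSL that do not specify a field.
- `<default_operator>` (optional): default boolean operator for multi-term expressions; accepts only `and` or `or` (case-insensitive). Defaults to `or`.
- `<options_json>` (optional): JSON string containing advanced search options (such as multi-field search). Supported options:
  - `default_field`: same as the second parameter
  - `default_operator`: same as the third parameter (`and` or `or`)
  - `fields`: array of field names for multi-field search (mutually exclusive with `default_field`)
  - `type`: multi-field search mode, either `best_fields` (default) or `cross_fields`
  - `mode`: parsing mode, either `standard` (default) or `lucene`
  - `minimum_should_match`: minimum number of matches in lucene mode (default: 0)

Usage

@@ -121,6 +129,46 @@ SELECT id, title FROM search_test_basic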
WHERE SEARCH('tags:ALL(tutorial) AND category:Technology');
```

#### Multi-field search (Elasticsearch-style)
- Syntax: use JSON options containing a `fields` array
- Semantics: searches the same terms across multiple fields with automatic expansion; supports two modes:
  - `best_fields` (default): all terms must appear in the same field; per-field results are combined with OR
  - `cross_fields`: terms may be spread across different fields (treated as one combined field)
- Indexing tip: create an inverted index on every field in the `fields` array

**best_fields mode** (default): each field must contain all terms, then the per-field results are combined with OR.

```sql
-- Search "machine learning" in the title and content fields
-- Expands to: (title:machine AND title:learning) OR (content:machine AND content:learning)
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and"}');

-- With an explicit type parameter
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"best_fields"}');
```

**cross_fields mode**: terms can match in different fields; all fields are treated as one combined field.

```sql
-- Search "machine learning" across title and content
-- Expands to: (title:machine OR content:machine) AND (title:learning OR content:learning)
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');

-- Useful for searching person names across firstname/lastname fields
SELECT id, name FROM people
WHERE SEARCH('John Smith', '{"fields":["firstname","lastname"],"default_operator":"and","type":"cross_fields"}');
```

**Comparison of modes**:

| Mode | Behavior | Use Case |
|------|----------|----------|
| `best_fields` | All terms must match within the same field | Document search where relevance is field-specific |
| `cross_fields` | Terms can match across any field | Entity search (e.g., a person name split across fields) |

#### Wildcard query
- Syntax: `column:prefix*`, `column:*mid*`, `column:?ingle`
- Semantics: `*` matches a string of any length and `?` matches a single character.
48 changes: 48 additions & 0 deletions versioned_docs/version-4.x/ai/text-search/search-function.md
@@ -22,11 +22,19 @@ Syntax
SEARCH('<search_expression>')
SEARCH('<search_expression>', '<default_field>')
SEARCH('<search_expression>', '<default_field>', '<default_operator>')
SEARCH('<search_expression>', '<options_json>')
```

- `<search_expression>` — string literal containing the SEARCH DSL expression.
- `<default_field>` *(optional)* — column name automatically applied to terms that do not specify a field.
- `<default_operator>` *(optional)* — default boolean operator for multi-term expressions; accepts `and` or `or` (case-insensitive). Defaults to `or`.
- `<options_json>` *(optional)* — JSON string containing search options for advanced features like multi-field search. Supported options:
  - `default_field`: same as the second parameter
  - `default_operator`: same as the third parameter (`and` or `or`)
  - `fields`: array of field names for multi-field search (mutually exclusive with `default_field`)
  - `type`: multi-field search mode, either `best_fields` (default) or `cross_fields`
  - `mode`: parsing mode, either `standard` (default) or `lucene`
  - `minimum_should_match`: minimum number of SHOULD clauses that must match; integer, Lucene mode only (default: 0)
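
The option rules listed above lend themselves to a small client-side check before a query is sent. The sketch below is a hypothetical lint (`check_options` is not a Doris API), and how Doris itself reacts to unknown or conflicting options is not specified here; the checks simply mirror the list.

```python
import json

ALLOWED_KEYS = {"default_field", "default_operator", "fields",
                "type", "mode", "minimum_should_match"}


def check_options(options_json: str) -> list:
    """Return a list of human-readable problems found in the options JSON."""
    opts = json.loads(options_json)
    problems = []

    unknown = set(opts) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unknown options: {sorted(unknown)}")
    if "fields" in opts and "default_field" in opts:
        problems.append("fields and default_field are mutually exclusive")
    if str(opts.get("default_operator", "or")).lower() not in ("and", "or"):
        problems.append("default_operator must be 'and' or 'or'")
    if opts.get("type", "best_fields") not in ("best_fields", "cross_fields"):
        problems.append("type must be 'best_fields' or 'cross_fields'")
    if opts.get("mode", "standard") not in ("standard", "lucene"):
        problems.append("mode must be 'standard' or 'lucene'")
    msm = opts.get("minimum_should_match", 0)
    if isinstance(msm, bool) or not isinstance(msm, int):
        problems.append("minimum_should_match must be an integer")
    return problems


print(check_options('{"fields":["title","content"],"type":"cross_fields"}'))
print(check_options('{"fields":["title"],"default_field":"title"}'))
```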

Usage

@@ -121,6 +129,46 @@ SELECT id, title FROM search_test_basic
WHERE SEARCH('tags:ALL(tutorial) AND category:Technology');
```

#### Multi-field search (Elasticsearch-style)
- Syntax: pass JSON options containing a `fields` array
- Semantics: searches the same terms across multiple fields with automatic expansion; supports two modes:
  - `best_fields` (default): matches when all terms appear in the same field; per-field results are ORed together
  - `cross_fields`: matches when the terms appear across different fields (as if they were a single combined field)
- Indexing tip: add an inverted index on each field in the `fields` array

**best_fields mode** (default): Each field must contain all terms, then results from all fields are combined with OR.

```sql
-- Search "machine learning" in both title and content fields
-- Expands to: (title:machine AND title:learning) OR (content:machine AND content:learning)
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and"}');

-- With explicit type parameter
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"best_fields"}');
```

**cross_fields mode**: Terms can match across different fields, treating all fields as one combined field.

```sql
-- Search "machine learning" across title and content
-- Expands to: (title:machine OR content:machine) AND (title:learning OR content:learning)
SELECT id, title FROM articles
WHERE SEARCH('machine learning', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');

-- Useful for searching person names across firstname/lastname fields
SELECT id, name FROM people
WHERE SEARCH('John Smith', '{"fields":["firstname","lastname"],"default_operator":"and","type":"cross_fields"}');
```

**Comparison of modes**:

| Mode | Behavior | Use Case |
|------|----------|----------|
| `best_fields` | All terms must match within the same field | Document search where relevance is field-specific |
| `cross_fields` | Terms can match across any field | Entity search (e.g., person name split across fields) |
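
The difference between the two modes comes down to which loop is on the outside: fields for `best_fields`, terms for `cross_fields`. The sketch below reproduces the expansions shown in the SQL comments above for the `"default_operator":"and"` case only. `expand_multi_field` is a hypothetical helper, not Doris's implementation, and the expansion for other operators is deliberately left out.

```python
def expand_multi_field(query: str, fields: list, type_: str = "best_fields") -> str:
    """Expand bare terms over several fields, assuming default_operator = and."""
    terms = query.split()
    if type_ == "best_fields":
        # One AND-group per field, groups combined with OR.
        groups = [" AND ".join(f"{field}:{term}" for term in terms)
                  for field in fields]
        return " OR ".join(f"({group})" for group in groups)
    if type_ == "cross_fields":
        # One OR-group per term, groups combined with AND.
        groups = [" OR ".join(f"{field}:{term}" for field in fields)
                  for term in terms]
        return " AND ".join(f"({group})" for group in groups)
    raise ValueError("type_ must be 'best_fields' or 'cross_fields'")


print(expand_multi_field("machine learning", ["title", "content"]))
print(expand_multi_field("machine learning", ["title", "content"], "cross_fields"))
```

With `cross_fields`, a row whose `firstname` holds "John" and whose `lastname` holds "Smith" satisfies both term groups, whereas `best_fields` would require both terms in a single field.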

#### Wildcard query
- Syntax: `column:prefix*`, `column:*mid*`, `column:?ingle`
- Semantics: performs pattern matching with `*` (multi-character) and `?` (single-character) wildcards.