116 changes: 78 additions & 38 deletions docs/user/ppl/cmd/ad.md
@@ -1,37 +1,77 @@
# ad (deprecated by ml command)
# ad (Deprecated)

## Description
The `ad` command is deprecated in favor of the [`ml` command](./ml.md).
{: .warning}

The `ad` command applies Random Cut Forest (RCF) algorithm in the ml-commons plugin on the search result returned by a PPL command. Based on the input, the command uses two types of RCF algorithms: fixed-in-time RCF for processing time-series data, batch RCF for processing non-time-series data.
## Syntax
The `ad` command applies the Random Cut Forest (RCF) algorithm in the ML Commons plugin to the search results returned by a PPL command. The command provides two anomaly detection approaches:

## Fixed In Time RCF For Time-series Data
- [Anomaly detection for time-series data](#anomaly-detection-for-time-series-data) using the fixed-in-time RCF algorithm
- [Anomaly detection for non-time-series data](#anomaly-detection-for-non-time-series-data) using the batch RCF algorithm

ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] \<time_field\> [date_format] [time_zone] [category_field]
* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. **Default:** 8.
* sample_size: optional. The sample size used by stream samplers in this forest. **Default:** 256.
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
* time_decay: optional. The decay factor used by stream samplers in this forest. **Default:** 0.0001.
* anomaly_rate: optional. The anomaly rate. **Default:** 0.005.
* time_field: mandatory. Specifies the time field for RCF to use as time-series data.
* date_format: optional. Used for formatting time_field. **Default:** "yyyy-MM-dd HH:mm:ss".
* time_zone: optional. Used for setting time zone for time_field. **Default:** "UTC".
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
To use the `ad` command, `plugins.calcite.enabled` must be set to `false`.
{: .note}

## Syntax

The `ad` command has two different syntax variants, depending on the algorithm type.

### Anomaly detection for time-series data

Use this syntax to detect anomalies in time-series data. This method uses the fixed-in-time RCF algorithm, which is optimized for sequential data patterns.

The fixed-in-time RCF `ad` command has the following syntax:

```sql
ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] <time_field> [date_format] [time_zone] [category_field]
```

### Parameters

The fixed-in-time RCF algorithm supports the following parameters.

| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `time_field` | Required | The time field for RCF to use as time-series data. |
| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
| `shingle_size` | Optional | The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is `8`. |
| `sample_size` | Optional | The sample size used by the stream samplers in this forest. Default is `256`. |
| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
| `time_decay` | Optional | The decay factor used by the stream samplers in this forest. Default is `0.0001`. |
| `anomaly_rate` | Optional | The anomaly rate. Default is `0.005`. |
| `date_format` | Optional | The format used for the `time_field` field. Default is `yyyy-MM-dd HH:mm:ss`. |
| `time_zone` | Optional | The time zone for the `time_field` field. Default is `UTC`. |
| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
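
To make the `shingle_size` parameter more concrete, the following Python sketch builds shingles from a sequence of values. It only illustrates the "consecutive sequence of the most recent records" idea from the table. The function name `make_shingles` and the sample values are hypothetical, and the ML Commons plugin may construct shingles differently internally.

```python
from typing import List, Sequence


def make_shingles(values: Sequence[float], shingle_size: int = 8) -> List[List[float]]:
    """Return every window of `shingle_size` consecutive values (hypothetical helper)."""
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    # No complete shingle exists until `shingle_size` records have arrived,
    # so a short input yields an empty list.
    return [
        list(values[i : i + shingle_size])
        for i in range(len(values) - shingle_size + 1)
    ]


# Made-up ride counts, not taken from the nyc_taxi dataset.
rides = [10.0, 12.0, 11.0, 13.0, 40.0]
shingles = make_shingles(rides, shingle_size=3)
# shingles == [[10.0, 12.0, 11.0], [12.0, 11.0, 13.0], [11.0, 13.0, 40.0]]
```

Each shingle groups a record with its immediate predecessors, so a larger `shingle_size` gives each point more surrounding context.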

## Batch RCF For Non-time-series Data

### Anomaly detection for non-time-series data

Use this syntax to detect anomalies in data where the order doesn't matter. This method uses the batch RCF algorithm, which is optimized for independent data points.

The batch RCF `ad` command has the following syntax:

```sql
ad [number_of_trees] [sample_size] [output_after] [training_data_size] [anomaly_score_threshold] [category_field]
* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
* sample_size: optional. Number of random samples given to each tree from the training data set. **Default:** 256.
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
* training_data_size: optional. **Default:** size of your training data set.
* anomaly_score_threshold: optional. The threshold of anomaly score. **Default:** 1.0.
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
```

### Parameters

The batch RCF algorithm supports the following parameters.

| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
| `sample_size` | Optional | The number of random samples provided to each tree from the training dataset. Default is `256`. |
| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
| `training_data_size` | Optional | The size of the training dataset. Default is the full dataset size. |
| `anomaly_score_threshold` | Optional | The anomaly score threshold. Default is `1.0`. |
| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |

## Example 1: Detecting events in New York City from taxi ridership data with time-series data

This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
## Example 1: Detecting events in New York City taxi ridership time-series data

The following examples use the `nyc_taxi` dataset, which contains New York City taxi ridership data with fields including `value` (number of rides), `timestamp` (time of measurement), and `category` (time period classifications such as 'day' and 'night').

This example trains an RCF model and uses it to detect anomalies in time-series ridership data:

```ppl ignore
source=nyc_taxi
@@ -40,7 +80,7 @@ source=nyc_taxi
| where value=10844.0
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 1/1
@@ -51,9 +91,10 @@ fetched rows / total rows = 1/1
+---------+---------------------+-------+---------------+
```

## Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category

This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.
## Example 2: Detecting events in New York City taxi ridership time-series data by category

This example trains an RCF model and uses it to detect anomalies in time-series ridership data across multiple category values:

```ppl ignore
source=nyc_taxi
@@ -62,7 +103,7 @@ source=nyc_taxi
| where value=10844.0 or value=6526.0
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 2/2
@@ -74,9 +115,10 @@ fetched rows / total rows = 2/2
+----------+---------+---------------------+-------+---------------+
```

## Example 3: Detecting events in New York City from taxi ridership data with non-time-series data

This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.
## Example 3: Detecting events in New York City taxi ridership non-time-series data

This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data:

```ppl ignore
source=nyc_taxi
@@ -85,7 +127,7 @@ source=nyc_taxi
| where value=10844.0
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 1/1
@@ -96,9 +138,10 @@ fetched rows / total rows = 1/1
+---------+-------+-----------+
```

## Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category

This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.
## Example 4: Detecting events in New York City taxi ridership non-time-series data by category

This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data across multiple category values:

```ppl ignore
source=nyc_taxi
@@ -107,7 +150,7 @@ source=nyc_taxi
| where value=10844.0 or value=6526.0
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 2/2
@@ -118,7 +161,4 @@ fetched rows / total rows = 2/2
| day | 6526.0 | 0.0 | False |
+----------+---------+-------+-----------+
```

## Limitations

The `ad` command can only work with `plugins.calcite.enabled=false`.
46 changes: 28 additions & 18 deletions docs/user/ppl/cmd/addcoltotals.md
@@ -1,21 +1,31 @@
# AddColTotals

# addcoltotals

# Description
The `addcoltotals` command computes the sum of each column and adds a summary row showing the total for each column. This command is equivalent to using `addtotals` with `row=false` and `col=true`, making it useful for creating summary reports with column totals.

The `addcoltotals` command computes the sum of each column and add a summary event at the end to show the total of each column. This command works the same way `addtotals` command works with row=false and col=true option. This is useful for creating summary reports with subtotals or grand totals. The `addcoltotals` command only sums numeric fields (integers, floats, doubles). Non-numeric fields in the field list are ignored even if its specified in field-list or in the case of no field-list specified.
The command only processes numeric fields (integers, floats, doubles). Non-numeric fields are ignored regardless of whether they are explicitly specified in the field list.

# Syntax

`addcoltotals [field-list] [label=<string>] [labelfield=<field>]`
## Syntax

- `field-list`: Optional. Comma-separated list of numeric fields to sum. If not specified, all numeric fields are summed.
- `labelfield=<field>`: Optional. Field name to place the label. If it specifies a non-existing field, adds the field and shows label at the summary event row at this field.
- `label=<string>`: Optional. Custom text for the totals row labelfield\'s label. Default is \"Total\".
The `addcoltotals` command has the following syntax:

# Example 1: Basic Example
```sql
addcoltotals [field-list] [label=<string>] [labelfield=<field>]
```

## Parameters

The `addcoltotals` command supports the following parameters.

| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `<field-list>` | Optional | A comma-separated list of numeric fields to add. By default, all numeric fields are added. |
| `labelfield` | Optional | The field in which the label is placed. If the field does not exist, it is created and the label is shown in the summary row (last row) of the new field. |
| `label` | Optional | The text that appears in the summary row (last row) to identify the computed totals. When used with `labelfield`, this text is placed in the specified field in the summary row. Default is `Total`. |
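
As a rough illustration of the parameters above, the following Python sketch mimics appending a column-totals summary row. It is not the plugin's implementation: the function `add_col_totals` and the sample rows are hypothetical, and edge cases (such as a `labelfield` that names a numeric column) are not modeled.

```python
from numbers import Number
from typing import Any, Dict, List, Optional


def add_col_totals(
    rows: List[Dict[str, Any]],
    fields: Optional[List[str]] = None,
    label: str = "Total",
    labelfield: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Append a summary row holding the sum of each numeric column (illustrative only)."""
    columns = list(rows[0].keys()) if rows else []
    candidates = fields if fields is not None else columns

    def is_numeric(value: Any) -> bool:
        # bool is a Number subclass in Python, so exclude it explicitly.
        return isinstance(value, Number) and not isinstance(value, bool)

    # Non-numeric fields are ignored even when they are listed explicitly.
    numeric = [
        c for c in candidates
        if any(r.get(c) is not None for r in rows)
        and all(is_numeric(r[c]) for r in rows if r.get(c) is not None)
    ]

    summary: Dict[str, Any] = {c: None for c in columns}
    for c in numeric:
        summary[c] = sum(r[c] for r in rows if r.get(c) is not None)

    result = [dict(r) for r in rows]
    if labelfield is not None:
        if labelfield not in columns:
            # A missing label field is created and left empty on the data rows.
            for r in result:
                r[labelfield] = None
        summary[labelfield] = label
    return result + [summary]


# Made-up rows, not taken from the accounts index.
rows = [{"firstname": "Ann", "balance": 100}, {"firstname": "Bob", "balance": 250}]
existing = add_col_totals(rows, labelfield="firstname")
created = add_col_totals(rows, fields=["balance"], label="Sum", labelfield="Total")
```

In this sketch, `existing[-1]` is `{"firstname": "Total", "balance": 350}`, while `created[-1]` carries the label `Sum` in a newly created `Total` field.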

## Example 1: Basic example

The example shows placing the label in an existing field.
The following query places the label in an existing field:

```ppl
source=accounts
@@ -24,7 +34,7 @@ source=accounts
| addcoltotals labelfield='firstname'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -38,17 +48,17 @@ fetched rows / total rows = 4/4
+-----------+---------+
```

# Example 2: Adding column totals and adding a summary event with label specified.
## Example 2: Adding column totals with a custom summary label

The example shows adding totals after a stats command where final summary event label is \'Sum\' and row=true value was used by default when not specified. It also added new field specified by labelfield as it did not match existing field.
The following query adds totals after a `stats` command where the final summary event label is `Sum`. It also creates a new field specified by `labelfield` because this field does not exist in the data:

```ppl
source=accounts
| stats count() by gender
| addcoltotals `count()` label='Sum' labelfield='Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 3/3
@@ -61,9 +71,9 @@ fetched rows / total rows = 3/3
+---------+--------+-------+
```

# Example 3: With all options
## Example 3: Using all options

The example shows using addcoltotals with all options set.
The following query uses the `addcoltotals` command with all options set:

```ppl
source=accounts
@@ -73,7 +83,7 @@ source=accounts
| addcoltotals avg_balance, count label='Sum' labelfield='Column Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 4/4
53 changes: 32 additions & 21 deletions docs/user/ppl/cmd/addtotals.md
Original file line number Diff line number Diff line change
@@ -1,24 +1,34 @@
# AddTotals
# addtotals

The `addtotals` command computes the sum of numeric fields and can create both column totals (summary row) and row totals (new field). This command is useful for creating summary reports with subtotals or grand totals.

## Description
The command only processes numeric fields (integers, floats, doubles). Non-numeric fields are ignored regardless of whether they are explicitly specified in the field list.

The `addtotals` command computes the sum of numeric fields and appends a row with the totals to the result. The command can also add row totals and add a field to store row totals. This is useful for creating summary reports with subtotals or grand totals. The `addtotals` command only sums numeric fields (integers, floats, doubles). Non-numeric fields in the field list are ignored even if it\'s specified in field-list or in the case of no field-list specified.

## Syntax

`addtotals [field-list] [label=<string>] [labelfield=<field>] [row=<boolean>] [col=<boolean>] [fieldname=<field>]`
The `addtotals` command has the following syntax:

- `field-list`: Optional. Comma-separated list of numeric fields to sum. If not specified, all numeric fields are summed.
- `row=<boolean>`: Optional. Calculates total of each row and add a new field with the total. Default is true.
- `col=<boolean>`: Optional. Calculates total of each column and add a new event at the end of all events with the total. Default is false.
- `labelfield=<field>`: Optional. Field name to place the label. If it specifies a non-existing field, adds the field and shows label at the summary event row at this field. This is applicable when col=true.
- `label=<string>`: Optional. Custom text for the totals row labelfield\'s label. Default is \"Total\". This is applicable when col=true. This does not have any effect when labelfield and fieldname parameter both have same value.
- `fieldname=<field>`: Optional. Calculates total of each row and add a new field to store this total. This is applicable when row=true.
```sql
addtotals [field-list] [label=<string>] [labelfield=<field>] [row=<boolean>] [col=<boolean>] [fieldname=<field>]
```

## Parameters

The `addtotals` command supports the following parameters.

## Example 1: Basic Example
| Parameter | Required/Optional | Description |
| --- | --- | --- |
| `<field-list>` | Optional | A comma-separated list of numeric fields to add. By default, all numeric fields are added. |
| `row` | Optional | Calculates the total of each row and adds a new field to store the row total. Default is `true`. |
| `col` | Optional | Calculates the total of each column and adds a summary event at the end with the column totals. Default is `false`. |
| `labelfield` | Optional | The field in which the label is placed. If the field does not exist, it is created and the label is shown in the summary row (last row) of the new field. Applicable when `col=true`. |
| `label` | Optional | The text that appears in the summary row (last row) to identify the computed totals. When used with `labelfield`, this text is placed in the specified field in the summary row. Default is `Total`. Applicable when `col=true`. This parameter has no effect when the `labelfield` and `fieldname` parameters specify the same field name. |
| `fieldname` | Optional | The field used to store row totals. Applicable when `row=true`. |
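
The `row` and `fieldname` parameters can be pictured with a small Python sketch that adds a per-row total. This is only an approximation under stated assumptions: the function `add_row_totals` and the sample rows are hypothetical, and the default field name `Total` is inferred from the examples on this page rather than taken from the parameter table.

```python
from numbers import Number
from typing import Any, Dict, List, Optional


def add_row_totals(
    rows: List[Dict[str, Any]],
    fields: Optional[List[str]] = None,
    fieldname: str = "Total",  # assumed default, inferred from the examples
) -> List[Dict[str, Any]]:
    """Add a field holding the sum of the numeric values in each row (illustrative only)."""
    result = []
    for r in rows:
        names = fields if fields is not None else list(r.keys())
        # Only numeric values contribute; strings and other types are skipped.
        total = sum(
            r[n] for n in names
            if isinstance(r.get(n), Number) and not isinstance(r.get(n), bool)
        )
        result.append({**r, fieldname: total})
    return result


# Made-up rows, not taken from the accounts index.
rows = [
    {"firstname": "Ann", "balance": 100, "age": 30},
    {"firstname": "Bob", "balance": 250, "age": 41},
]
all_numeric = add_row_totals(rows)
balance_only = add_row_totals(rows, fields=["balance"], fieldname="Row Total")
```

Here `all_numeric[0]["Total"]` is `130` (balance plus age), and `balance_only[1]["Row Total"]` is `250` because only the listed field is summed.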

The example shows placing the label in an existing field.
## Example 1: Basic example

The following query places the label in an existing field:

```ppl
source=accounts
@@ -27,7 +37,7 @@ source=accounts
| addtotals col=true labelfield='firstname' label='Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 4/4
@@ -41,17 +51,17 @@ fetched rows / total rows = 4/4
+-----------+---------+-------+
```

## Example 2: Adding column totals and adding a summary event with label specified.
## Example 2: Adding column totals with a custom summary label

The example shows adding totals after a stats command where final summary event label is \'Sum\'. It also added new field specified by labelfield as it did not match existing field.
The following query adds column totals after a `fields` command, with the final summary event labeled `Sum`. It also creates a new field specified by `labelfield` because the field does not exist in the data:

```ppl
source=accounts
| fields account_number, firstname , balance , age
| addtotals col=true row=false label='Sum' labelfield='Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 5/5
Expand All @@ -66,15 +76,16 @@ fetched rows / total rows = 5/5
+----------------+-----------+---------+-----+-------+
```

if row=true in above example, there will be conflict between column added for column totals and column added for row totals being same field \'Total\', in that case the output will have final event row label null instead of \'Sum\' because the column is number type and it cannot output String in number type column.
If you set `row=true` in the preceding example, both row totals and column totals try to use the same field name (`Total`), creating a conflict. When this happens, the summary row label displays as `null` instead of `Sum` because the field becomes numeric (for row totals) and cannot display string values:


```ppl
source=accounts
| fields account_number, firstname , balance , age
| addtotals col=true row=true label='Sum' labelfield='Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 5/5
@@ -89,9 +100,9 @@ fetched rows / total rows = 5/5
+----------------+-----------+---------+-----+-------+
```

## Example 3: With all options
## Example 3: Using all options

The example shows using addtotals with all options set.
The following query uses the `addtotals` command with all options set:

```ppl
source=accounts
@@ -101,7 +112,7 @@ source=accounts
| addtotals avg_balance, count row=true col=true fieldname='Row Total' label='Sum' labelfield='Column Total'
```

Expected output:
The query returns the following results:

```text
fetched rows / total rows = 4/4