Merged
12 changes: 12 additions & 0 deletions dataset-audit/README.md
@@ -0,0 +1,12 @@
This plugin provides a recipe that takes an SQL-based or HDFS-based dataset as input and outputs an audit of the data in that dataset.

The output is a dataset with one row per column of the input dataset.
For each column, the recipe outputs:

- Type.
- Cardinality (number of distinct values).
- Number of missing/empty values.
- Most frequent value and most frequent value count.
- For numerical columns: minimum, maximum, and average.

The recipe uses in-database (SQL) processing or in-Hadoop processing, as appropriate for the input dataset.
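
To make the per-column metrics listed above concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not the plugin's implementation (the recipe runs in the database or in Hadoop). The function name `audit_columns`, the output field names, and the simplified "numeric"/"string" type inference are all assumptions made for this example.

```python
from collections import Counter


def audit_columns(rows, columns):
    """Return one audit record per column of `rows` (a list of dicts).

    Hypothetical helper for illustration; field names are assumptions.
    """
    report = []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and the empty string as missing/empty values.
        present = [v for v in values if v is not None and v != ""]
        counts = Counter(present)
        top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
        # Simplified type inference: numeric only if every present value is a number.
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        report.append({
            "column": col,
            "type": "numeric" if numeric else "string",
            "cardinality": len(counts),              # number of distinct values
            "missing": len(values) - len(present),   # missing/empty values
            "most_frequent_value": top_value,
            "most_frequent_count": top_count,
            "min": min(present) if numeric else None,
            "max": max(present) if numeric else None,
            "avg": sum(present) / len(present) if numeric else None,
        })
    return report


# Example input: four rows, two columns, one empty and one missing value.
rows = [
    {"city": "Paris", "age": 30},
    {"city": "Paris", "age": 40},
    {"city": "", "age": None},
    {"city": "Lyon", "age": 30},
]
report = audit_columns(rows, ["city", "age"])
for record in report:
    print(record)
```

The result has one record per input column, mirroring the shape of the recipe's output dataset; min, max, and avg are filled in only for columns inferred as numeric.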