
DisMis: Detecting Disguised Missing Values

This is the official repository for DisMis, a fully automated approach to detecting disguised missing values (DMVs) in relational data.

Our detection approach is divided into two stages: in the first stage, we execute a set of unsupervised DMV detectors on the input data; in the second stage, a trained classifier combines the resulting detector scores into a final DMV prediction.

Items referenced in the paper

Evaluation Datasets

All evaluation datasets can be found in the `data/` directory. `data/polluted` contains the 119 datasets we use for our main evaluation. `data/clean` contains the 10 clean datasets we synthetically pollute. `data/unique_DMVs.txt` lists all unique DMVs from the 119 datasets and their occurrences.

Top 25 DMVs

| DMV | Frequency |
| --- | ---: |
| 0.0 | 695,522 |
| -1 | 508,059 |
| . | 151,967 |
| s | 64,430 |
| s | 40,901 |
| 0% | 36,831 |
| - | 16,028 |
| Some-college | 7,291 |
| NA | 7,179 |
| - | 6,558 |
| ? | 4,262 |
| N/A | 3,053 |
| 20030101.0 | 2,540 |
| NR | 2,450 |
| | 876 |
| UNK | 526 |
| ? | 377 |
| (n/a) | 297 |
| n/a | 142 |
| Unknown | 135 |
| Margate did not enter any league competitions | 117 |
| None | 114 |
| ?? | 103 |
| * | 90 |
| -- | 77 |

As mentioned, we labeled 25 additional DMVs in the datasets provided by FAHES:

25 additional labeled DMVs

| Dataset | Column | DMV |
| --- | --- | --- |
| Mass1 | issue_description | None. |
| Mass1 | issue_description | Missing |
| Mass1 | issue_description | missing |
| Mass1 | issue_description | TEST issue |
| Mass1 | issue_description | This is a test issue |
| Mass1 | issue_description | & |
| Mass1 | issue_description | TEST ISSUE |
| Mass1 | issue_description | In |
| Mass1 | issue_description | same |
| Mass1 | issue_description | we |
| Mass1 | issue_description | this is a test issuse |
| Mass1 | issue_description | Non |
| Mass1 | issue_description | fcfh |
| Mass1 | issue_description | TEST rejected #2 |
| Mass1 | issue_description | none |
| Mass1 | issue_description | Unknown |
| Mass1 | issue_description | na |
| Mass1 | issue_description | Test Entry - Not a real issue |
| Mass1 | issue_description | Issue |
| Mass1 | issue_description | ... |
| Mass1 | issue_description | Unknown; still needed? |
| Mass1 | issue_description | Hi |
| Mass5 | ProviderPhone | 9999999999x9999 |
| Mass5 | ProviderPhone | 6173540022x.132 |
| Mass5 | ProviderPhone | 6174718400x,168 |

Full prompts (DMV example generation / LLM baseline)

The full logic to generate the prompts can be found in `detection/detectors/llm_classifier.py` and `pollution/errors/LLMplaceholder.py`. Below are two instantiations of the prompts using hypothetical example values. To run the LLM baseline, update the `model_name` argument; for details on what to put here, see `precompute_example_DMVs.md`.

Example LLM baseline prompt (categorical/text column)
[system]: You are a data quality analyst specializing in identifying disguised missing values in datasets.

[user]: You are a data quality analyst tasked with identifying Disguised Missing Values (DMVs) in a dataset.

DMVs are placeholder values that represent missing, unknown, or invalid data. Examples include:

  • Generic placeholders: "N/A", "Unknown", "Missing", "None", "Placeholder", "?", "---", "TBD"
  • Nonsense values: "asdf", "xxxx", "123test", random strings
  • Comments: "See notes", "Check later", "Not sure"
  • Unusual patterns: repeated characters, obvious outliers like "-999", "9999"

Column name: "nationality"

Now, classify each of the following values as either "DMV" (disguised missing value) or "VALID" (real data). Consider the context of the column name.

Values to classify:

  1. German (Occurs 412 times)
  2. N/A (Occurs 87 times)
  3. French (Occurs 301 times)
  4. --- (Occurs 14 times)
  5. Unknown (Occurs 53 times)

Respond with ONLY a comma-separated list of classifications in the same order as the values above. Use only "DMV" or "VALID" for each value. Do not include any other text or explanations.

Example response format: VALID, DMV, VALID, VALID, DMV
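The response contract above (a bare comma-separated list of labels) makes parsing straightforward. A minimal sketch of such a parser; the function name and error handling are illustrative, not the repository's actual code (the real logic lives in `detection/detectors/llm_classifier.py`):

```python
def parse_dmv_response(response: str, num_values: int) -> list[bool]:
    """Parse a comma-separated "DMV"/"VALID" response into booleans (True = DMV).

    Illustrative helper, not DisMis's actual implementation.
    """
    labels = [token.strip().upper() for token in response.split(",")]
    if len(labels) != num_values:
        raise ValueError(f"Expected {num_values} labels, got {len(labels)}")
    if any(label not in {"DMV", "VALID"} for label in labels):
        raise ValueError(f"Unexpected label in response: {labels}")
    return [label == "DMV" for label in labels]
```

For the five values above, a well-behaved model would answer something like `VALID, DMV, VALID, DMV, DMV`, which this sketch maps to `[False, True, False, True, True]`.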

Example DMV pollution prompt (LLMPlaceholderDMV2)
[system]: You are a data engineer working on quality control of tabular data.

[user]: Your task is to act as a data quality analyst to find Disguised Missing Values (DMVs).

Your process is in two steps:

  1. Analysis: First, critically examine the 5 example values provided from the column. Reason about which of them seem like valid data points and which might be placeholders. Briefly explain your reasoning.
  2. Generation: Based on your analysis of what the valid values look like, generate a list of 20 potential DMVs. These DMVs must be semantically and syntactically distinct from the values you identified as valid.

Generate the DMVs in two categories:

  • Generic: Common placeholders (e.g., 'N/A').
  • Context-Specific: Placeholders derived from the table and column names (e.g., 'Season cancelled'). This is the most important category. This should include identified potential placeholders from the input data.

IMPORTANT: Follow the JSON format exactly as shown in the example below. Use the same keys.

Example 1:

Table name: "students" Column name: "test_score"

Example values to analyze:

  • 85
  • 92
  • Not Graded
  • 78
  • 65

Your Response:

1. Analysis:

  • 85, 92, 78, 65: These look like Likely Valid values. They are typical integer scores for a test.
  • Not Graded: This is a Potential Placeholder. It's a text string explaining why a numerical score is missing.

Summary:

{ "Valid": [85, 92, 78, 65], "Placeholders": ["Not Graded"] }

2. Potential placeholder values:

{ "Generic": ["N/A", "Unknown", "U", "Missing", "?", "-", "None", "Null", "To Be Determined", "TBD", "Not Available"], "Context-Specific": ["Not Graded", "Absent", "Incomplete", "Withdrew from course", "Pending grade", "Exempt", "There was no test this semester", "Test not taken", "Score not recorded", "Grade pending", "-1"] }


Input:

Table name: "employees" Column name: "department"

Example values to analyze:

  • Engineering
  • Marketing
  • N/A
  • Finance
  • pending

Your Response:
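Because the prompt instructs the model to answer with prose followed by JSON objects using fixed keys, the generated DMV lists can be recovered by scanning for the last well-formed object with the expected keys. A sketch of that idea (illustrative only; the repository's actual parsing lives in `pollution/errors/LLMplaceholder.py` and may differ):

```python
import json
import re


def extract_generated_dmvs(response: str) -> dict:
    """Pull the JSON object with "Generic"/"Context-Specific" keys
    out of a free-form model response. Illustrative sketch only.
    """
    # The prompt's JSON objects are flat (no nested braces), so a
    # non-greedy brace match is enough to enumerate candidates.
    candidates = re.findall(r"\{.*?\}", response, flags=re.DOTALL)
    for candidate in reversed(candidates):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if "Generic" in obj and "Context-Specific" in obj:
            return obj
    raise ValueError("No DMV JSON object found in response")
```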

Running the Code

Installation

Install the requirements from `requirements.txt`. Although vLLM is listed there, we recommend installing it separately following the official documentation.

To prepare FAHES, clone their code and run `make` in the terminal. Then, in `detection/detectors/fahes_detector.py`, change the `fahes_executable` variable to `/path/to/cloned/dir/FAHES_Code/src/FAHES`.

Run pretrained DisMis

Before executing this command, read `precompute_example_DMVs.md` on how to precompute the example DMVs for the semantic detectors. Then run our pretrained version of DisMis using:

python dismis.py \
    --evaluation_config configs/inference_configs/dismis.json \
    --dataset data/polluted/<group_name>/<dataset_name>/ \
    --embedding_dim 512 \
    --value_embeddings "_value_embeddings.json" \
    --example_dmvs "example_dmvs_detection.json" \
    --example_embeddings "precomputed_example_embeddings.json" \
    --model_path "dismis.pkl"

The embeddings need to be generated using a model that supports MRL (Matryoshka Representation Learning). Set `embedding_dim` to trade off accuracy against a lower memory footprint.
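MRL models are trained so that a prefix of the full embedding is itself a usable embedding. Reducing to a smaller `embedding_dim` is therefore just truncation plus re-normalization; a minimal sketch with made-up values (DisMis's own preprocessing may differ):

```python
import math


def truncate_mrl_embedding(embedding: list[float], embedding_dim: int) -> list[float]:
    """Keep the first embedding_dim entries of an MRL embedding and
    re-normalize to unit length. Illustrative sketch only.
    """
    prefix = embedding[:embedding_dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]
```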

Retrain DisMis

To train a new instance of DisMis or evaluate it using LODO (leave-one-dataset-out) cross-validation, follow these steps:

  1. Run only the detectors of DisMis. For this, run the same command as above but remove the `model_path` argument:
python dismis.py \
    --evaluation_config configs/inference_configs/dismis.json \
    --dataset data/polluted/<group_name>/<dataset_name>/ \
    --embedding_dim 512 \
    --value_embeddings "_value_embeddings.json" \
    --example_dmvs "example_dmvs_detection.json" \
    --example_embeddings "precomputed_example_embeddings.json"

This needs to run once for every polluted dataset.

  2. Merge the resulting detection scores. When DisMis is run, it outputs one CSV per detector, in the same shape as the input table and containing that detector's scores. These files need to be transformed and merged into one big table with one column per detector/feature and additional columns for metadata such as dataset and labels:
python merge.py \
    --result_base_path "dismis_results" \
    --data_base_path "data/polluted" \
    --output_dir "merged_features/" 
  3. To perform LODO cross-validation, run the following command:
python predict.py \
    --data_dir "merged_features/" \
    --output "dismis_predictions.csv" \
    --output_dir "merged_features/" 
  4. To train a new instance of DisMis on all merged data:
python train_dismis.py \
    --data_dir "merged_features/" \
    --output_dir "trained_models/" 
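The LODO cross-validation performed by `predict.py` holds out each dataset in turn and trains on the remaining ones, so no dataset's rows appear in both train and test. A minimal sketch of that splitting logic (pure Python, with a hypothetical `"dataset"` key per row; the actual implementation may differ):

```python
def lodo_splits(rows: list[dict]):
    """Yield (held_out, train, test) triples, holding out one dataset per split.

    Each row is assumed to carry a "dataset" key. Illustrative sketch only.
    """
    datasets = sorted({row["dataset"] for row in rows})
    for held_out in datasets:
        train = [r for r in rows if r["dataset"] != held_out]
        test = [r for r in rows if r["dataset"] == held_out]
        yield held_out, train, test
```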

Synthetic DMV Pollution

To generate the synthetic data, we provide our `generate_data` script. Here, one can specify one or more errors or datasets using the provided configs. It can be useful to override the `repetitions` and `DMV_percentage` arguments from the config to repeat the same pollution in different versions. For our ablation study, we ran the command below with DMV percentages 0.2, 0.3, 0.4, and 0.5 (and suffixes 02, 03, 04, and 05).

python generate_data.py \
    --error_configs configs/error_types/*/*/*.json \
    --dataset_configs configs/data_configs/*.json \
    --repetitions 5 \
    --DMV_percentage 0.2 \
    --suffix "02" 
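The ablation sweep over the four percentages can be scripted; a hypothetical sketch that prints the commands (drop the `echo` to actually run them), deriving each suffix by stripping the dot from the percentage:

```shell
# Illustrative sweep over the DMV percentages used in the ablation study.
for p in 0.2 0.3 0.4 0.5; do
    suffix=$(printf '%s' "$p" | tr -d '.')   # e.g. 0.2 -> 02
    echo python generate_data.py \
        --error_configs "configs/error_types/*/*/*.json" \
        --dataset_configs "configs/data_configs/*.json" \
        --repetitions 5 \
        --DMV_percentage "$p" \
        --suffix "$suffix"
done
```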
Adding a new dataset

To add a new dataset, create a new folder `data/clean/<name>`. In that folder, place the dataset as a CSV named `data/clean/<name>/<name>.csv` and the column data types as `data/clean/<name>/<name>_types.json`. The types JSON file is expected to have the columns of the CSV as keys and the types of the columns (one of "numeric", "categorical", "date", "text") as values. Next, add a data config `configs/data_configs/<name>.json` that follows this format:

{
    "dataset_name": "<name>",
    "repetitions": 5,
    "dataset_path": "data/clean/<name>/<name>.csv",
    "types_path": "data/clean/<name>/<name>_types.json",
    "example_DMVs_pollution": "data/clean/<name>/example_dmvs_pollution.json",
    "target_columns": <list of desired target columns>
}

The number of repetitions can be chosen freely, and the target columns are expected to be a list of strings which correspond to the names of columns in the CSV. The "example_DMVs_pollution" option can be removed, but we recommend generating context-dependent example DMVs for pollution like this. We describe the commands needed in `precompute_example_DMVs.md`.
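For illustration, a types JSON for a hypothetical dataset named `flights` (all column names invented) could be generated and checked against the four allowed type strings like this:

```python
import json

# The four column types the README says the types JSON may use.
ALLOWED_TYPES = {"numeric", "categorical", "date", "text"}

# Hypothetical column types for an invented "flights" dataset.
column_types = {
    "departure_date": "date",
    "airline": "categorical",
    "delay_minutes": "numeric",
}

# Validate before writing: every value must be one of the allowed type strings.
assert all(t in ALLOWED_TYPES for t in column_types.values())

# Would be saved as data/clean/flights/flights_types.json.
with open("flights_types.json", "w") as f:
    json.dump(column_types, f, indent=4)
```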

About

Disguised Missing Value Detection & Benchmarking
