
DisMis: Detecting Disguised Missing Values

This is the official repository for DisMis, a fully automated approach to detecting disguised missing values (DMVs) in relational data.

Our detection approach is divided into two stages: in the first stage, we execute a set of unsupervised DMV detectors on the input data; in the second stage, a trained classifier combines the resulting detector scores into a final DMV prediction.

Items referenced in the paper

Evaluation Datasets

All evaluation datasets can be found in the `data/` directory. `data/polluted` contains the 119 datasets we use for our main evaluation. `data/clean` contains the 10 clean datasets we synthetically pollute. `data/unique_DMVs.txt` lists all unique DMVs from the 119 datasets and their occurrences.

Top 25 DMVs

| DMV | Frequency |
| --- | ---: |
| 0.0 | 695,522 |
| -1 | 508,059 |
| . | 151,967 |
| s | 64,430 |
| s | 40,901 |
| 0% | 36,831 |
| - | 16,028 |
| Some-college | 7,291 |
| NA | 7,179 |
| - | 6,558 |
| ? | 4,262 |
| N/A | 3,053 |
| 20030101.0 | 2,540 |
| NR | 2,450 |
| | 876 |
| UNK | 526 |
| ? | 377 |
| (n/a) | 297 |
| n/a | 142 |
| Unknown | 135 |
| Margate did not enter any league competitions | 117 |
| None | 114 |
| ?? | 103 |
| * | 90 |
| -- | 77 |

As mentioned, we labeled 25 additional DMVs in the datasets provided by FAHES:

25 additional labeled DMVs

| Dataset | Column | DMV |
| --- | --- | --- |
| Mass1 | issue_description | None. |
| Mass1 | issue_description | Missing |
| Mass1 | issue_description | missing |
| Mass1 | issue_description | TEST issue |
| Mass1 | issue_description | This is a test issue |
| Mass1 | issue_description | & |
| Mass1 | issue_description | TEST ISSUE |
| Mass1 | issue_description | In |
| Mass1 | issue_description | same |
| Mass1 | issue_description | we |
| Mass1 | issue_description | this is a test issuse |
| Mass1 | issue_description | Non |
| Mass1 | issue_description | fcfh |
| Mass1 | issue_description | TEST rejected #2 |
| Mass1 | issue_description | none |
| Mass1 | issue_description | Unknown |
| Mass1 | issue_description | na |
| Mass1 | issue_description | Test Entry - Not a real issue |
| Mass1 | issue_description | Issue |
| Mass1 | issue_description | ... |
| Mass1 | issue_description | Unknown; still needed? |
| Mass1 | issue_description | Hi |
| Mass5 | ProviderPhone | 9999999999x9999 |
| Mass5 | ProviderPhone | 6173540022x.132 |
| Mass5 | ProviderPhone | 6174718400x,168 |

Full prompts (DMV example generation / LLM baseline)

The full logic to generate the prompts can be found in `detection/detectors/llm_classifier.py` and `pollution/errors/LLMplaceholder.py`. Below are two instantiations of the prompts using hypothetical example values. To run the LLM baseline, update the `model_name` argument; for details on what to put here, see `precompute_example_DMVs.md`.

Example LLM baseline prompt (categorical/text column)
[system]: You are a data quality analyst specializing in identifying disguised missing values in datasets.

[user]: You are a data quality analyst tasked with identifying Disguised Missing Values (DMVs) in a dataset.

DMVs are placeholder values that represent missing, unknown, or invalid data. Examples include:

  • Generic placeholders: "N/A", "Unknown", "Missing", "None", "Placeholder", "?", "---", "TBD"
  • Nonsense values: "asdf", "xxxx", "123test", random strings
  • Comments: "See notes", "Check later", "Not sure"
  • Unusual patterns: repeated characters, obvious outliers like "-999", "9999"

Column name: "nationality"

Now, classify each of the following values as either "DMV" (disguised missing value) or "VALID" (real data). Consider the context of the column name.

Values to classify:

  1. German (Occurs 412 times)
  2. N/A (Occurs 87 times)
  3. French (Occurs 301 times)
  4. --- (Occurs 14 times)
  5. Unknown (Occurs 53 times)

Respond with ONLY a comma-separated list of classifications in the same order as the values above. Use only "DMV" or "VALID" for each value. Do not include any other text or explanations.

Example response format: VALID, DMV, VALID, VALID, DMV
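The response contract above (a bare comma-separated list of labels) makes parsing straightforward. A minimal sketch of such a parser; the function name and error handling are illustrative, not the repository's actual code (the real logic lives in `detection/detectors/llm_classifier.py`):

```python
def parse_dmv_response(response: str, num_values: int) -> list[bool]:
    """Parse a comma-separated "DMV"/"VALID" response into booleans (True = DMV).

    Illustrative helper, not DisMis's actual implementation.
    """
    labels = [token.strip().upper() for token in response.split(",")]
    if len(labels) != num_values:
        raise ValueError(f"Expected {num_values} labels, got {len(labels)}")
    if any(label not in {"DMV", "VALID"} for label in labels):
        raise ValueError(f"Unexpected label in response: {labels}")
    return [label == "DMV" for label in labels]
```

For the five values above, a well-behaved model would answer something like `VALID, DMV, VALID, DMV, DMV`, which this sketch maps to `[False, True, False, True, True]`.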

Example DMV pollution prompt (LLMPlaceholderDMV2)
[system]: You are a data engineer working on quality control of tabular data.

[user]: Your task is to act as a data quality analyst to find Disguised Missing Values (DMVs).

Your process is in two steps:

  1. Analysis: First, critically examine the 5 example values provided from the column. Reason about which of them seem like valid data points and which might be placeholders. Briefly explain your reasoning.
  2. Generation: Based on your analysis of what the valid values look like, generate a list of 20 potential DMVs. These DMVs must be semantically and syntactically distinct from the values you identified as valid.

Generate the DMVs in two categories:

  • Generic: Common placeholders (e.g., 'N/A').
  • Context-Specific: Placeholders derived from the table and column names (e.g., 'Season cancelled'). This is the most important category. This should include identified potential placeholders from the input data.

IMPORTANT: Follow the JSON format exactly as shown in the example below. Use the same keys.

Example 1:

Table name: "students" Column name: "test_score"

Example values to analyze:

  • 85
  • 92
  • Not Graded
  • 78
  • 65

Your Response:

1. Analysis:

  • 85, 92, 78, 65: These look like Likely Valid values. They are typical integer scores for a test.
  • Not Graded: This is a Potential Placeholder. It's a text string explaining why a numerical score is missing.

Summary:

{ "Valid": [85, 92, 78, 65], "Placeholders": ["Not Graded"] }

2. Potential placeholder values:

{ "Generic": ["N/A", "Unknown", "U", "Missing", "?", "-", "None", "Null", "To Be Determined", "TBD", "Not Available"], "Context-Specific": ["Not Graded", "Absent", "Incomplete", "Withdrew from course", "Pending grade", "Exempt", "There was no test this semester", "Test not taken", "Score not recorded", "Grade pending", "-1"] }


Input:

Table name: "employees" Column name: "department"

Example values to analyze:

  • Engineering
  • Marketing
  • N/A
  • Finance
  • pending

Your Response:
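Because the prompt instructs the model to answer with prose followed by JSON objects using fixed keys, the generated DMV lists can be recovered by scanning for the last well-formed object with the expected keys. A sketch of that idea (illustrative only; the repository's actual parsing lives in `pollution/errors/LLMplaceholder.py` and may differ):

```python
import json
import re


def extract_generated_dmvs(response: str) -> dict:
    """Pull the JSON object with "Generic"/"Context-Specific" keys
    out of a free-form model response. Illustrative sketch only.
    """
    # The prompt's JSON objects are flat (no nested braces), so a
    # non-greedy brace match is enough to enumerate candidates.
    candidates = re.findall(r"\{.*?\}", response, flags=re.DOTALL)
    for candidate in reversed(candidates):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if "Generic" in obj and "Context-Specific" in obj:
            return obj
    raise ValueError("No DMV JSON object found in response")
```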

Running the Code

Installation

Install the requirements from `requirements.txt`. Although vLLM is listed there, we recommend installing it separately following the official documentation.

To prepare FAHES, clone their code and run `make` in the terminal. Then, in `detection/detectors/fahes_detector.py`, change the `fahes_executable` variable to `/path/to/cloned/dir/FAHES_Code/src/FAHES`.

Run pretrained DisMis

Before executing this command, read `precompute_example_DMVs.md` on how to precompute the example DMVs for the semantic detectors. Then run our pretrained version of DisMis using:

python dismis.py \
    --evaluation_config configs/inference_configs/dismis.json \
    --dataset data/polluted/<group_name>/<dataset_name>/ \
    --embedding_dim 512 \
    --value_embeddings "_value_embeddings.json" \
    --example_dmvs "example_dmvs_detection.json" \
    --example_embeddings "precomputed_example_embeddings.json" \
    --model_path "dismis.pkl"

The embeddings need to be generated using a model that supports MRL (Matryoshka Representation Learning). Set `embedding_dim` to trade off accuracy against a lower memory footprint.
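MRL models are trained so that a prefix of the full embedding is itself a usable embedding. Reducing to a smaller `embedding_dim` is therefore just truncation plus re-normalization; a minimal sketch with made-up values (DisMis's own preprocessing may differ):

```python
import math


def truncate_mrl_embedding(embedding: list[float], embedding_dim: int) -> list[float]:
    """Keep the first embedding_dim entries of an MRL embedding and
    re-normalize to unit length. Illustrative sketch only.
    """
    prefix = embedding[:embedding_dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]
```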

Retrain DisMis

To train a new instance of DisMis or evaluate it using LODO (leave-one-dataset-out) cross-validation, follow these steps:

  1. Run only the detectors of DisMis. For this, run the same command as above but remove the `model_path` argument:
python dismis.py \
    --evaluation_config configs/inference_configs/dismis.json \
    --dataset data/polluted/<group_name>/<dataset_name>/ \
    --embedding_dim 512 \
    --value_embeddings "_value_embeddings.json" \
    --example_dmvs "example_dmvs_detection.json" \
    --example_embeddings "precomputed_example_embeddings.json"

This needs to run once for every polluted dataset.

  2. Merge the resulting detection scores. When DisMis is run, it outputs one CSV per detector, in the same shape as the input table and containing that detector's scores. These files need to be transformed and merged into one big table with one column per detector/feature and additional columns for metadata such as dataset and labels:
python merge.py \
    --result_base_path "dismis_results" \
    --data_base_path "data/polluted" \
    --output_dir "merged_features/" 
  3. To perform LODO cross-validation, run the following command:
python predict.py \
    --data_dir "merged_features/" \
    --output "dismis_predictions.csv" \
    --output_dir "merged_features/" 
  4. To train a new instance of DisMis on all merged data:
python train_dismis.py \
    --data_dir "merged_features/" \
    --output_dir "trained_models/" 
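The LODO cross-validation performed by `predict.py` holds out each dataset in turn and trains on the remaining ones, so no dataset's rows appear in both train and test. A minimal sketch of that splitting logic (pure Python, with a hypothetical `"dataset"` key per row; the actual implementation may differ):

```python
def lodo_splits(rows: list[dict]):
    """Yield (held_out, train, test) triples, holding out one dataset per split.

    Each row is assumed to carry a "dataset" key. Illustrative sketch only.
    """
    datasets = sorted({row["dataset"] for row in rows})
    for held_out in datasets:
        train = [r for r in rows if r["dataset"] != held_out]
        test = [r for r in rows if r["dataset"] == held_out]
        yield held_out, train, test
```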

Synthetic DMV Pollution

To generate the synthetic data, we provide our `generate_data` script. Here, one can specify one or more errors or datasets using the provided configs. It can be useful to override the `repetitions` and `DMV_percentage` arguments from the config to repeat the same pollution in different versions. For our ablation study, we ran the command below with DMV percentages 0.2, 0.3, 0.4, and 0.5 (and suffixes 02, 03, 04, and 05).

python generate_data.py \
    --error_configs configs/error_types/*/*/*.json \
    --dataset_configs configs/data_configs/*.json \
    --repetitions 5 \
    --DMV_percentage 0.2 \
    --suffix "02" 
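The ablation sweep over the four percentages can be scripted; a hypothetical sketch that prints the commands (drop the `echo` to actually run them), deriving each suffix by stripping the dot from the percentage:

```shell
# Illustrative sweep over the DMV percentages used in the ablation study.
for p in 0.2 0.3 0.4 0.5; do
    suffix=$(printf '%s' "$p" | tr -d '.')   # e.g. 0.2 -> 02
    echo python generate_data.py \
        --error_configs "configs/error_types/*/*/*.json" \
        --dataset_configs "configs/data_configs/*.json" \
        --repetitions 5 \
        --DMV_percentage "$p" \
        --suffix "$suffix"
done
```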
Adding a new dataset

To add a new dataset, create a new folder `data/clean/<name>`. In that folder, place the dataset as a CSV named `data/clean/<name>/<name>.csv` and the column data types as `data/clean/<name>/<name>_types.json`. The types JSON file is expected to have the columns of the CSV as keys and the types of the columns (one of "numeric", "categorical", "date", "text") as values. Next, add a data config `configs/data_configs/<name>.json` that follows this format:

{
    "dataset_name": "<name>",
    "repetitions": 5,
    "dataset_path": "data/clean/<name>/<name>.csv",
    "types_path": "data/clean/<name>/<name>_types.json",
    "example_DMVs_pollution": "data/clean/<name>/example_dmvs_pollution.json",
    "target_columns": <list of desired target columns>
}

The number of repetitions can be chosen freely, and the target columns are expected to be a list of strings which correspond to the names of columns in the CSV. The "example_DMVs_pollution" option can be removed, but we recommend generating context-dependent example DMVs for pollution like this. We describe the commands needed in `precompute_example_DMVs.md`.
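For illustration, a types JSON for a hypothetical dataset named `flights` (all column names invented) could be generated and checked against the four allowed type strings like this:

```python
import json

# The four column types the README says the types JSON may use.
ALLOWED_TYPES = {"numeric", "categorical", "date", "text"}

# Hypothetical column types for an invented "flights" dataset.
column_types = {
    "departure_date": "date",
    "airline": "categorical",
    "delay_minutes": "numeric",
}

# Validate before writing: every value must be one of the allowed type strings.
assert all(t in ALLOWED_TYPES for t in column_types.values())

# Would be saved as data/clean/flights/flights_types.json.
with open("flights_types.json", "w") as f:
    json.dump(column_types, f, indent=4)
```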

About

Disguised Missing Value Detection & Benchmarking
