- DataSynthesizer version: 0.1.13
- Python version: 3.9
- Operating System: Windows 11
Description
Trying to create a synthetic dataset from the Kaggle adult census dataset (with the fnlwgt column removed) in the correlated attribute mode results in the generator failing to parse the description file.
The reason for this seems to be in L281 of PrivBayes.py:
parents_key = str([parents_instance]) if len(parents) == 1 else str(list(parents_instance))
This renders integer values as np.int64(0) instead of plain 0 when there is more than one parent. This in turn causes L99 of DataGenerator.py to fail, because that module does not import numpy:
parents_instance = list(eval(parents_instance))
I could fix it locally by simply adding import numpy as np to DataGenerator.py, but it would probably be cleaner to write plain int values into the description file in the first place.
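A minimal sketch of that cleaner fix, under a few assumptions: make_parents_key is a hypothetical helper and not part of DataSynthesizer, and Int64Like is only a stand-in that imitates how NumPy 2 prints integer scalars, so the sketch runs without NumPy installed. The idea is to convert every value to a built-in int before calling str(), so the key no longer depends on the scalar type's repr.

```python
class Int64Like(int):
    """Stand-in (assumption) that mimics the NumPy 2 scalar repr."""

    def __repr__(self):
        return f"np.int64({int(self)})"


def make_parents_key(parents_instance, num_parents):
    """Hypothetical helper: build the description-file key from plain ints."""
    if num_parents == 1:
        values = [parents_instance]
    else:
        values = list(parents_instance)
    # int() drops the wrapper type, so str() produces "[0, 1]"
    # no matter how the original scalar type prints itself.
    return str([int(v) for v in values])


# What the current expression produces for two parents:
broken_key = str([Int64Like(0), Int64Like(1)])
# What the helper produces for the same values:
pair_key = make_parents_key((Int64Like(0), Int64Like(1)), 2)
single_key = make_parents_key(Int64Like(0), 1)

print(broken_key)  # [np.int64(0), np.int64(1)]
print(pair_key)    # [0, 1]
print(single_key)  # [0]
```

With keys written this way, the existing eval call on the generator side would only ever see plain integer literals.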
The relevant section of the description file:
"conditional_probabilities": {
    "income": [
        0.6269945618560558,
        0.37300543814394416
    ],
    "relationship": {
        "[0]": [
            0.31958572087575393,
            0.26864155111683646,
            0.062246949021475276,
            0.17143605132431283,
            0.1260161383716099,
            0.05207358929001161
        ],
        "[1]": [
            0.4276133198945046,
            0.16299606959384128,
            0.027753228447322167,
            0.17927266942607956,
            0.12404103847621967,
            0.07832367416203281
        ]
    },
    "sex": {
        "[np.int64(0), np.int64(0)]": [
            0.11899038829847323,
            0.8810096117015268
        ],
        "[np.int64(0), np.int64(1)]": [
            0.1370384306577154,
            0.8629615693422846
        ],
What I Did
Python script:
import os.path

import pandas as pd
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

from generators.generator import Generator


class PrivBayesGenerator(Generator):
    def generate(self, rows: int = None):
        input_data = str(self.real_data_path)
        description_file = str(self.real_data_path.parent / 'description.json')
        synthetic_data = self.synthetic_data_path
        epsilon = 0.1
        if rows is None:
            rows = pd.read_csv(input_data).shape[0]
        threshold_value = 50
        num_tuples_to_generate = rows

        # Describe Dataset
        if not os.path.exists(description_file):
            describer = DataDescriber(category_threshold=threshold_value)
            describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon=epsilon)
            describer.save_dataset_description_to_file(description_file)

        # Generate Synthetic Data
        generator = DataGenerator()
        generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
        generator.save_synthetic_data(synthetic_data)
Traceback:
Traceback (most recent call last):
  File "D:\...\helpers\generate_main.py", line 27, in <module>
    main()
  File "D:\...\helpers\generate_main.py", line 21, in main
    generator.generate(rows)
  File "D:\...\generators\priv_bayes_generator.py", line 35, in generate
    generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 66, in generate_dataset_in_correlated_attribute_mode
    self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 100, in generate_encoded_dataset
    parents_instance = list(eval(parents_instance))
  File "<string>", line 1, in <module>
NameError: name 'np' is not defined
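The reading side could also be made independent of NumPy, so that description files already written with np.int64(...) keys still load. This is a rough sketch, assuming the keys contain only integers or NumPy-style integer wrappers like those in the excerpt above; parse_parents_key is a hypothetical helper, not part of DataSynthesizer. It strips the wrappers with a regular expression and then uses ast.literal_eval, which accepts only Python literals, in place of eval.

```python
import ast
import re

# Matches NumPy-style integer scalar reprs such as "np.int64(3)" or "np.int32(-3)".
_NP_INT = re.compile(r"np\.u?int\d*\((-?\d+)\)")


def parse_parents_key(key):
    """Hypothetical helper: turn a description-file key into a list of ints."""
    # "[np.int64(0), np.int64(1)]" becomes "[0, 1]"; plain keys are unchanged.
    plain = _NP_INT.sub(r"\1", key)
    # literal_eval evaluates literals only, so no names like np are needed.
    return list(ast.literal_eval(plain))


old_style = parse_parents_key("[0]")
new_style = parse_parents_key("[np.int64(0), np.int64(1)]")
print(old_style)  # [0]
print(new_style)  # [0, 1]
```

Compared with adding import numpy as np to the module, this avoids evaluating arbitrary expressions from the description file, at the cost of supporting only integer keys.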
jfraj