Numpy Datatypes in Conditional Distributions of Description File #44

@co-de-pot

  • DataSynthesizer version: 0.1.13
  • Python version: 3.9
  • Operating System: Windows 11

Description

Creating a synthetic dataset from the Kaggle Adult census dataset (with the fnlwgt column removed) in correlated attribute mode fails: the generator cannot parse the description file.

The reason for this seems to be in L281 of PrivBayes.py:

parents_key = str([parents_instance]) if len(parents) == 1 else str(list(parents_instance))

When an attribute has more than one parent, this writes integer values as np.int64(0) instead of plain 0. That in turn makes L99 of DataGenerator.py fail, because that module does not import numpy:

parents_instance = list(eval(parents_instance))
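
The failure can be reproduced without DataSynthesizer, using only a key string copied from the description file below. The sketch also shows one possible reader-side workaround that avoids eval entirely. The helper name parse_parents_key and the regular expression are assumptions made for illustration, not part of DataSynthesizer, and the pattern only targets integer-style keys like the ones in this issue. The np.int64(0) form matches how NumPy 2.x prints scalars, which is presumably where these keys come from.

```python
import ast
import re

# Key exactly as it appears in the description file from this issue.
key = "[np.int64(0), np.int64(1)]"

# Matches NumPy-style scalar reprs such as np.int64(3) and captures the inner value.
_NP_SCALAR = re.compile(r"np\.\w+\(([^()]*)\)")


def parse_parents_key(text):
    """Hypothetical helper: turn a parents key into a list of plain values."""
    unwrapped = _NP_SCALAR.sub(r"\1", text)   # "np.int64(0)" becomes "0"
    return list(ast.literal_eval(unwrapped))  # literal_eval accepts literals only


# eval() with no `np` in scope raises the same NameError as in the traceback.
try:
    eval(key, {})
    failed = False
except NameError:
    failed = True

print(failed)
print(parse_parents_key(key))
print(parse_parents_key("[0]"))
```

Parsing with ast.literal_eval also means the generator would no longer execute arbitrary expressions found in a description file.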

I was able to fix it locally by adding import numpy as np to DataGenerator.py, but it would probably be cleaner to write plain Python ints into the description file in the first place.
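
A writer-side fix along those lines might look like the sketch below. It assumes the parent values are always integer codes, which is what the description file in this issue suggests. make_parents_key is a hypothetical name, not the actual PrivBayes.py code, and FakeInt64 is only a stand-in that prints like a NumPy 2.x scalar so the example runs without NumPy installed.

```python
class FakeInt64(int):
    """Stand-in for a NumPy integer scalar: prints like np.int64(...)."""

    def __repr__(self):
        return f"np.int64({int(self)})"


def make_parents_key(parents_instance, n_parents):
    """Hypothetical replacement for the key-building expression in PrivBayes.py."""
    values = [parents_instance] if n_parents == 1 else list(parents_instance)
    # int() converts integer scalars (including this stand-in) to built-in int,
    # so str() prints "0" rather than a wrapped repr.
    return str([int(v) for v in values])


# Behaviour described in the issue: str() on a list uses each element's repr.
print(str([FakeInt64(0), FakeInt64(1)]))

# With the conversion, the key is readable without numpy in scope.
print(make_parents_key((FakeInt64(0), FakeInt64(1)), 2))
print(make_parents_key(FakeInt64(1), 1))
```

Keys written this way have the same shape as the single-parent keys ("[0]", "[1]") that the generator already handles.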

The relevant section of the description file:

"conditional_probabilities": {
        "income": [
            0.6269945618560558,
            0.37300543814394416
        ],
        "relationship": {
            "[0]": [
                0.31958572087575393,
                0.26864155111683646,
                0.062246949021475276,
                0.17143605132431283,
                0.1260161383716099,
                0.05207358929001161
            ],
            "[1]": [
                0.4276133198945046,
                0.16299606959384128,
                0.027753228447322167,
                0.17927266942607956,
                0.12404103847621967,
                0.07832367416203281
            ]
        },
        "sex": {
            "[np.int64(0), np.int64(0)]": [
                0.11899038829847323,
                0.8810096117015268
            ],
            "[np.int64(0), np.int64(1)]": [
                0.1370384306577154,
                0.8629615693422846
            ],

What I Did

Python script:

import os.path

import pandas as pd
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

from generators.generator import Generator

class PrivBayesGenerator(Generator):
    def generate(self, rows: int=None):
        input_data = str(self.real_data_path)
        description_file = str(self.real_data_path.parent / 'description.json')
        synthetic_data = self.synthetic_data_path

        epsilon = 0.1
        if rows is None:
            rows = pd.read_csv(input_data).shape[0]
        threshold_value = 50
        num_tuples_to_generate = rows

        # Describe Dataset
        if not os.path.exists(description_file):
            describer = DataDescriber(category_threshold=threshold_value)
            describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon=epsilon)
            describer.save_dataset_description_to_file(description_file)

        # Generate Synthetic Data
        generator = DataGenerator()
        generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
        generator.save_synthetic_data(synthetic_data)

Traceback:

Traceback (most recent call last):
  File "D:\...\helpers\generate_main.py", line 27, in <module>
    main()
  File "D:\...\helpers\generate_main.py", line 21, in main
    generator.generate(rows)
  File "D:\...\generators\priv_bayes_generator.py", line 35, in generate
    generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 66, in generate_dataset_in_correlated_attribute_mode
    self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 100, in generate_encoded_dataset
    parents_instance = list(eval(parents_instance))
  File "<string>", line 1, in <module>
NameError: name 'np' is not defined
