Thoughts on respecting missing_rate in DataGenerator #25

Hi! First, I haven't said it so far, but thanks for creating and maintaining DataSynthesizer. It's a useful tool.

`DataDescriber` records a `missing_rate` value in each attribute description. What are your thoughts on having `DataGenerator` use these values, alongside the distribution bins it already uses?

My use case is pretty simple: I want to create a synthesised data set for non-production use that is as representative of the original data set as possible. Here are two extremes of the problem I'm having:

- Columns that are mostly populated but contain some null records end up with no null values at all in the synthesised data set.
- Columns that are mostly null, with only a very small number of records populated with a value, end up with every record set to that value and no null values in the synthesised data set.

Where it matters most to me, I have handled this myself in pre- and post-processing steps. However, since `DataDescriber` already collects this metric, I was wondering whether it would be reasonable to implement this in DataSynthesizer itself, perhaps as an option passed to the relevant generator methods.
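To make the idea concrete, a post-processing step along these lines could be sketched as below. This is only a rough illustration, not how DataSynthesizer works internally. It assumes the description file is a JSON object whose `attribute_description` entry maps each column name to a dict containing `missing_rate`; that layout is an assumption on my part. `missing_rates_from_description` and `apply_missing_rates` are hypothetical helper names, not part of the library.

```python
import numpy as np
import pandas as pd


def missing_rates_from_description(description):
    """Collect per-column missing rates from a description dict.

    Assumed layout (not confirmed against the library):
    {"attribute_description": {column_name: {"missing_rate": float, ...}}}
    """
    attributes = description.get("attribute_description", {})
    return {name: attr.get("missing_rate", 0.0) for name, attr in attributes.items()}


def apply_missing_rates(df, missing_rates, seed=None):
    """Return a copy of df where each listed column is nulled at its given rate."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for column, rate in missing_rates.items():
        # Skip columns absent from the data and columns with no missing values.
        if column not in out.columns or not rate:
            continue
        # Each row is independently set to null with probability `rate`.
        null_mask = rng.random(len(out)) < rate
        out[column] = out[column].mask(null_mask)
    return out
```

Usage would look something like `apply_missing_rates(synthetic_df, missing_rates_from_description(description), seed=0)`, where `description` is the parsed description JSON. Nulls are drawn independently per row, so the observed rate only approximates `missing_rate`, and any correlation between missingness and other columns is not reproduced.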

Cheers!
