Performance Tips

Large Datasets

For datasets with >100k rows:

# Batch-clean several columns in one call; show_progress reports progress on long runs
cleaner.clean_columns(columns, show_progress=True)

# Cache column statistics so repeated operations don't recompute them
cleaner.add_zscore_columns(columns, cache_stats=True)
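
If you want to confirm that caching pays off on your data, the sketch below times add_zscore_columns on a synthetic 500k-row frame with and without cache_stats=True. The import path for StatClean, the constructor defaults, and the default value of cache_stats are assumptions rather than anything documented on this page.

import time
import numpy as np
import pandas as pd
from statclean import StatClean  # assumed import path

# Synthetic frame: 500k rows x 10 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500_000, 10)),
                  columns=[f"x{i}" for i in range(10)])
columns = list(df.columns)

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Fresh cleaner per run so the added z-score columns don't collide.
timed("default", lambda: StatClean(df.copy()).add_zscore_columns(columns))
timed("cached", lambda: StatClean(df.copy()).add_zscore_columns(columns, cache_stats=True))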

Memory Optimization

# Process columns individually for memory efficiency
for col in large_columns:
    cleaner.remove_outliers_zscore(col)
    
# Use in-place operations when possible
cleaner = StatClean(df, preserve_index=False)
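
Shrinking the frame itself before constructing the cleaner also helps. The sketch below is plain pandas rather than anything StatClean provides: downcasting float64 columns to float32 roughly halves the memory footprint of a large numeric frame.

import numpy as np
import pandas as pd

# General pandas technique, not a StatClean feature: casting float64
# columns to float32 halves their memory at the cost of precision
# beyond ~7 significant digits.
def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    float_cols = out.select_dtypes(include="float64").columns
    out[float_cols] = out[float_cols].astype(np.float32)
    return out

df = pd.DataFrame(np.random.default_rng(0).normal(size=(1_000_000, 8)))
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # ~64 MB as float64
df = downcast_floats(df)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # ~32 MB as float32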

Multivariate Performance

# For many variables, consider dimensionality reduction first
from sklearn.decomposition import PCA
pca_data = PCA(n_components=5).fit_transform(df)
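
As a slightly fuller sketch of the same idea, standardizing the variables before PCA keeps high-variance columns from dominating the components, and wrapping the result in a DataFrame keeps it convenient for further outlier handling. The frame and column names below are synthetic placeholders, not part of the StatClean API.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 10k rows, 20 numeric columns.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(10_000, 20)),
                  columns=[f"x{i}" for i in range(20)])

# Scale, then project onto the first 5 principal components.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=5)
reduced = pd.DataFrame(pca.fit_transform(scaled),
                       columns=[f"pc{i + 1}" for i in range(5)],
                       index=df.index)

print(pca.explained_variance_ratio_.round(3))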
