Performance Tips

Large Datasets

For datasets with >100k rows:

# Batch-clean several columns in one call; show_progress reports progress on long runs
cleaner.clean_columns(columns, show_progress=True)

# Cache column statistics so repeated operations don't recompute them
cleaner.add_zscore_columns(columns, cache_stats=True)
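
If you want to confirm that caching pays off on your data, the sketch below times add_zscore_columns on a synthetic 500k-row frame with and without cache_stats=True. The import path for StatClean, the constructor defaults, and the default value of cache_stats are assumptions rather than anything documented on this page.

import time
import numpy as np
import pandas as pd
from statclean import StatClean  # assumed import path

# Synthetic frame: 500k rows x 10 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500_000, 10)),
                  columns=[f"x{i}" for i in range(10)])
columns = list(df.columns)

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Fresh cleaner per run so the added z-score columns don't collide.
timed("default", lambda: StatClean(df.copy()).add_zscore_columns(columns))
timed("cached", lambda: StatClean(df.copy()).add_zscore_columns(columns, cache_stats=True))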

Memory Optimization

# Process columns individually for memory efficiency
for col in large_columns:
    cleaner.remove_outliers_zscore(col)
    
# Use in-place operations when possible
cleaner = StatClean(df, preserve_index=False)
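
Shrinking the frame itself before constructing the cleaner also helps. The sketch below is plain pandas rather than anything StatClean provides: downcasting float64 columns to float32 roughly halves the memory footprint of a large numeric frame.

import numpy as np
import pandas as pd

# General pandas technique, not a StatClean feature: casting float64
# columns to float32 halves their memory at the cost of precision
# beyond ~7 significant digits.
def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    float_cols = out.select_dtypes(include="float64").columns
    out[float_cols] = out[float_cols].astype(np.float32)
    return out

df = pd.DataFrame(np.random.default_rng(0).normal(size=(1_000_000, 8)))
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # ~64 MB as float64
df = downcast_floats(df)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # ~32 MB as float32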

Multivariate Performance

# For many variables, consider dimensionality reduction first
from sklearn.decomposition import PCA
pca_data = PCA(n_components=5).fit_transform(df)
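
As a slightly fuller sketch of the same idea, standardizing the variables before PCA keeps high-variance columns from dominating the components, and wrapping the result in a DataFrame keeps it convenient for further outlier handling. The frame and column names below are synthetic placeholders, not part of the StatClean API.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 10k rows, 20 numeric columns.
df = pd.DataFrame(np.random.default_rng(0).normal(size=(10_000, 20)),
                  columns=[f"x{i}" for i in range(20)])

# Scale, then project onto the first 5 principal components.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=5)
reduced = pd.DataFrame(pca.fit_transform(scaled),
                       columns=[f"pc{i + 1}" for i in range(5)],
                       index=df.index)

print(pca.explained_variance_ratio_.round(3))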
