Problem
DsKit currently does not provide a high-level text data cleaning pipeline.
Although the full version installs NLTK, there is no built-in NLTK-based preprocessing utility for text/NLP workflows. As a result, users must repeatedly implement custom cleaning logic outside the library.
Proposed Solution
Introduce a flexible, high-level text cleaning function named apply_nltk that enables NLP/text preprocessing directly within DsKit.
The function:
- Uses NLTK internally
- Offers fine-grained control to users (case handling, stopwords, token processing, etc.)
- Is configurable and reusable across NLP pipelines
- Reduces boilerplate code for common text-cleaning tasks
This would significantly improve DsKit’s usability for NLP and text-heavy datasets.
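To make the kind of control described above concrete, here is a minimal sketch of a configurable NLTK-based cleaner. It is not the `apply_nltk` implementation from the Pull Request, whose actual signature is not reproduced in this issue. The function name `clean_text_sketch`, its parameters, and the small fallback stopword list are all hypothetical and exist only for illustration. Only NLTK components that need no downloaded data are used (`RegexpTokenizer`, `PorterStemmer`); the NLTK stopwords corpus is used if it happens to be installed.

```python
from typing import Iterable, List, Optional

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical fallback, used only when the NLTK stopwords corpus is not downloaded.
_FALLBACK_STOPWORDS = frozenset({"a", "an", "and", "are", "is", "of", "the", "to"})


def _default_stopwords() -> frozenset:
    """Return NLTK's English stopwords, or the fallback set if the corpus is missing."""
    try:
        from nltk.corpus import stopwords

        return frozenset(stopwords.words("english"))
    except LookupError:
        return _FALLBACK_STOPWORDS


def clean_text_sketch(
    text: str,
    lowercase: bool = True,
    remove_stopwords: bool = True,
    stem: bool = False,
    extra_stopwords: Optional[Iterable[str]] = None,
) -> List[str]:
    """Illustrative cleaner (assumed API, not DsKit's): returns a list of tokens."""
    if lowercase:
        text = text.lower()

    # Keep runs of word characters; punctuation is dropped by the tokenizer.
    tokens = RegexpTokenizer(r"\w+").tokenize(text)

    if remove_stopwords:
        stop = set(_default_stopwords())
        stop.update(w.lower() for w in (extra_stopwords or ()))
        tokens = [t for t in tokens if t.lower() not in stop]

    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]

    return tokens
```

Under these assumptions, `clean_text_sketch("The cats are running!", stem=True)` returns `["cat", "run"]`: each flag toggles one step, so the same function can be reused across pipelines without rewriting the cleaning logic.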
Current Progress
✅ Feature apply_nltk has already been implemented
✅ A Pull Request is open
🔄 Open to feedback and ready to revise:
- Code structure
- API design
- Contribution-guideline compliance
- Complexity or performance concerns
Why This Matters
- Makes DsKit more NLP-friendly out of the box
- Encourages standardized text preprocessing
- Reduces repetitive user-side implementations
- Aligns with DsKit’s goal of simplifying data preparation workflows
Related
Pull Request: #2 (comment)