New Feature: Apply NLTK #6

@ItsRikan

Problem

DsKit currently does not provide a high-level text data cleaning pipeline.
Although the full version installs NLTK, DsKit has no built-in NLTK-based preprocessing utility for text/NLP workflows, so users must repeatedly implement custom cleaning logic outside the library.

Proposed Solution

Introduce a flexible, high-level text cleaning function named apply_nltk that enables NLP/text preprocessing directly within DsKit.

The function:

  • Uses NLTK internally

  • Offers fine-grained control to users (case handling, stopwords, token processing, etc.)

  • Is configurable and reusable across NLP pipelines

  • Reduces boilerplate code for common text-cleaning tasks

This would significantly improve DsKit’s usability for NLP and text-heavy datasets.
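To make the proposal concrete, here is a minimal sketch of what such a function could look like. It is not the code from the pull request: the signature and the parameter names (`lowercase`, `stopwords`, `stem`) are assumptions chosen for illustration, and it only uses NLTK components that need no downloaded corpora (`RegexpTokenizer`, `PorterStemmer`).

```python
from typing import Iterable, Optional

from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer


# Hypothetical signature: these parameter names are illustrative only
# and are not taken from the actual DsKit pull request.
def apply_nltk(
    text: str,
    lowercase: bool = True,
    stopwords: Optional[Iterable[str]] = None,
    stem: bool = False,
) -> str:
    """Clean one string: lowercase, tokenize, drop stopwords, stem."""
    if lowercase:
        text = text.lower()

    # \w+ keeps runs of word characters, which drops punctuation.
    tokens = RegexpTokenizer(r"\w+").tokenize(text)

    if stopwords is not None:
        # The caller supplies the stopword list, so no corpus download is needed.
        blocked = {word.lower() for word in stopwords}
        tokens = [tok for tok in tokens if tok.lower() not in blocked]

    if stem:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(tok) for tok in tokens]

    return " ".join(tokens)


cleaned = apply_nltk("The cats are running!", stopwords={"the", "are"}, stem=True)
print(cleaned)
```

Each cleaning step sits behind its own flag, which is one way to give users the fine-grained control described above. A real implementation would likely also accept a pandas Series or DataFrame column rather than a single string.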

Current Progress

✅ Feature apply_nltk has already been implemented

✅ A Pull Request is open

🔄 Open to feedback and ready to revise:

  • Code structure

  • API design

  • Contribution-guideline compliance

  • Complexity or performance concerns

Why This Matters

  • Makes DsKit more NLP-friendly out of the box

  • Encourages standardized text preprocessing

  • Reduces repetitive user-side implementations

  • Aligns with DsKit’s goal of simplifying data preparation workflows

Related

Pull Request: #2 (comment)
