Problem
DsKit lacks a simple utility to generate a vocabulary from text columns, which is a common preprocessing step in NLP workflows.
Proposed Solution
- Introduce `generate_vocabulary(df: pd.DataFrame, text_col: str, case: Optional[Literal['lower', 'upper']] = None)`
- Returns a list of unique words present in the specified text column.
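A minimal sketch of what such a helper could look like, not necessarily the code in the PR. It assumes whitespace tokenization, that `case` normalizes the text before splitting, and that the result is sorted for reproducibility; none of these details are confirmed by the issue.

```python
from typing import List, Literal, Optional

import pandas as pd


def generate_vocabulary(
    df: pd.DataFrame,
    text_col: str,
    case: Optional[Literal["lower", "upper"]] = None,
) -> List[str]:
    """Return the unique words found in df[text_col] (illustrative sketch)."""
    # Skip missing values and coerce everything else to str.
    texts = df[text_col].dropna().astype(str)

    # Assumption: `case` normalizes the text before tokenization.
    if case == "lower":
        texts = texts.str.lower()
    elif case == "upper":
        texts = texts.str.upper()
    elif case is not None:
        raise ValueError("case must be 'lower', 'upper', or None")

    # Assumption: words are separated by whitespace; punctuation is kept.
    vocab = set()
    for text in texts:
        vocab.update(text.split())

    # Sorted so the output is deterministic across runs.
    return sorted(vocab)
```

With `case=None`, "The" and "the" are treated as distinct words; passing `case="lower"` merges them.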
Progress
✅ Both features are already implemented
🔄 Open to feedback and ready to refactor or split PRs if needed
Use-Case
- Useful when building Bag-of-Words-style ML models
- Provides insight into the contents of a specific text field
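To illustrate the Bag of Words use case, here is a small sketch of turning a document into a count vector over a fixed vocabulary. The `bag_of_words` helper and the sample vocabulary are hypothetical and not part of DsKit; lowercasing and whitespace splitting are assumptions.

```python
from collections import Counter
from typing import List


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    # Assumption: lowercase the text and split on whitespace.
    counts = Counter(text.lower().split())
    # Words outside the vocabulary are ignored; absent words count as 0.
    return [counts[word] for word in vocabulary]


# A vocabulary such as one produced from a text column.
vocabulary = ["cat", "dog", "the"]
vector = bag_of_words("The cat chased the other cat", vocabulary)
```

Each position in `vector` corresponds to the word at the same index in `vocabulary`, which is why a stable vocabulary ordering matters.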
Pull Request
#2 (comment)