Welcome to your first text pre-processing assignment! In this project, you’ll apply your knowledge of tokenization to process a short story using Python and the Natural Language Toolkit (NLTK).
Read a short story from a `.txt` file and:

- Tokenize it into sentences using `sent_tokenize()`
- Tokenize it into words using `word_tokenize()`
- (Optional) Clean the story using `re.sub()` to remove unwanted characters
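To get a quick feel for the difference between the two tokenizers, here's a minimal sketch (the example string and printed output are just illustrations; exact splits can vary slightly between NLTK versions):

```python
import nltk
nltk.download('punkt')  # tokenizer models used by both functions below

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello, world! Coding is fun."
print(sent_tokenize(text))  # ['Hello, world!', 'Coding is fun.']
print(word_tokenize(text))  # ['Hello', ',', 'world', '!', 'Coding', 'is', 'fun', '.']
```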
This project includes:

- `story.txt`: A short story about a beginner coder.
- `main.py`: A starter Python script where you will write your code.
- `README.md`: You’re reading it now!
1. Install NLTK if you haven’t already:

   ```bash
   pip install nltk
   ```

2. Download NLTK resources in `main.py`: the script already includes the `nltk.download('punkt')` command, which fetches the tokenizer models.

3. Read the file: use Python’s file-reading tools to load `story.txt`.

4. Clean the text (optional): use `re.sub()` to remove any characters you don’t want included in your analysis.

5. Tokenize the story:
   - Use `sent_tokenize()` to break the story into sentences.
   - Use `word_tokenize()` to break the story into individual words.

6. Print the results to compare and see the difference between sentence and word tokenization.
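Putting the steps together, here is one possible sketch of what your `main.py` could look like (the cleaning pattern and the printed slices are assumptions for illustration; adapt them to your story):

```python
import re

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Step 2: tokenizer models (newer NLTK versions may also need 'punkt_tab')
nltk.download('punkt')

# Step 3: read the story from disk
with open('story.txt', encoding='utf-8') as f:
    story = f.read()

# Step 4 (optional): keep letters, digits, whitespace, and basic
# punctuation -- this pattern is only an example; adjust it as needed
story = re.sub(r"[^\w\s.,!?'\"-]", "", story)

# Step 5: tokenize into sentences and into words
sentences = sent_tokenize(story)
words = word_tokenize(story)

# Step 6: compare the two views of the same text
print(f"{len(sentences)} sentences, {len(words)} word tokens")
print("First sentences:", sentences[:2])
print("First ten tokens:", words[:10])
```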
A few questions to reflect on:

- How does `sent_tokenize()` determine where a sentence ends?
- How does punctuation affect `word_tokenize()`?
- How could this process help a chatbot better understand user input?
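As a concrete nudge on the punctuation question: contractions and terminal punctuation become separate tokens. A small sketch (exact splits may vary by NLTK version):

```python
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

print(word_tokenize("Don't panic!"))
# ['Do', "n't", 'panic', '!'] -- the contraction splits and '!' stands alone
```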
For an extra challenge:

- Count how many sentences and how many words are in the story.
- Print the most frequent word (ignoring common stopwords like "the", "and", etc.).
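One possible sketch for the bonus tasks (assumes `story.txt` is in the working directory; the stopword list needs one extra download):

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')
nltk.download('stopwords')  # common English filler words to ignore

with open('story.txt', encoding='utf-8') as f:
    story = f.read()

sentences = sent_tokenize(story)
words = word_tokenize(story)
print(f"Sentences: {len(sentences)}  Words: {len(words)}")

# Keep alphabetic tokens that are not stopwords, then count them
stop = set(stopwords.words('english'))
content = [w.lower() for w in words if w.isalpha() and w.lower() not in stop]
word, count = Counter(content).most_common(1)[0]
print(f"Most frequent word: {word!r} ({count} times)")
```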
Happy tokenizing! 🧠💬