A Python package for cleaning Project Gutenberg books and datasets.
Prerequisites:
- nltk package
Installation:

    [sudo] pip install gutenberg-cleaner

The package provides two main functions: "simple_cleaner" and "super_cleaner".

    from gutenberg_cleaner import simple_cleaner, super_cleaner

"simple_cleaner" removes lines that are part of the Project Gutenberg header or footer without altering the main text.

    simple_cleaner(book: str) -> str

"super_cleaner" performs a thorough cleaning of the book by removing titles, footnotes, images, book information, and other non-content elements. In rare cases it may also remove valid content.

    super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str

- min_token: The minimum number of tokens required for a non-dialog/non-quote paragraph to be retained. Set to -1 to skip tokenization (faster, but the cleaning is less effective).
- max_token: The maximum number of tokens allowed for any paragraph.
Deleted paragraphs will be marked with: [deleted]
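
If the "[deleted]" markers are not wanted in the final text, they can be filtered out after cleaning. The following is a minimal sketch, not part of the package: the helper name `drop_deleted` is hypothetical, and it assumes each marker sits on a line of its own, which may not match the exact output layout.

```python
DELETED_MARK = "[deleted]"  # marker that super_cleaner is documented to leave behind


def drop_deleted(cleaned: str, mark: str = DELETED_MARK) -> str:
    """Remove marker-only lines and collapse the blank runs they leave.

    Hypothetical helper; assumes every marker occupies its own line.
    """
    kept = []
    for line in cleaned.splitlines():
        if line.strip() == mark:
            continue  # skip the marker itself
        # Do not stack consecutive blank lines once a marker is gone.
        if not line.strip() and kept and not kept[-1].strip():
            continue
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = "First paragraph.\n\n[deleted]\n\nSecond paragraph."
    print(drop_deleted(sample))
```

With the package installed, this could be chained as `drop_deleted(super_cleaner(book))`, where `book` is the raw text of a Project Gutenberg file.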
- Peyman Mohseni kiasari
This project is licensed under the MIT License; see the LICENSE.md file for details.
