gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

Prerequisites

nltk package

Installing

[sudo] pip install gutenberg-cleaner

How to use it?

The package provides two main functions: "simple_cleaner" and "super_cleaner".

from gutenberg_cleaner import simple_cleaner, super_cleaner

simple_cleaner:

Removes lines that are part of the Project Gutenberg header or footer without altering the main text.

simple_cleaner(book: str) -> str
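As a minimal sketch of how simple_cleaner might be applied to a book stored on disk: the helper name clean_book_file, the file paths, and the UTF-8 encoding are assumptions made for illustration and are not part of the package. The optional cleaner argument exists only so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Callable, Optional


def clean_book_file(
    src: str,
    dst: str,
    cleaner: Optional[Callable[[str], str]] = None,
) -> str:
    """Read a book from src, clean it, write the result to dst.

    Hypothetical helper, not part of gutenberg-cleaner.
    """
    if cleaner is None:
        # Imported lazily so the helper can be defined without the package.
        from gutenberg_cleaner import simple_cleaner
        cleaner = simple_cleaner

    # Assumption: the book is stored as UTF-8 text.
    text = Path(src).read_text(encoding="utf-8")
    cleaned = cleaner(text)
    Path(dst).write_text(cleaned, encoding="utf-8")
    return cleaned
```

With the package installed, clean_book_file("book.txt", "book_clean.txt") would strip the header and footer lines; both file names here are placeholders.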

super_cleaner:

Performs a thorough cleaning of the book by removing titles, footnotes, images, book information, and other non-content elements. In rare cases it may also remove valid content.

super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str

min_token: The minimum number of tokens required for a non-dialog/non-quote paragraph to be retained. Set to -1 to skip tokenization (faster but less effective cleaning).
max_token: The maximum number of tokens allowed for any paragraph.

Deleted paragraphs will be marked with: [deleted]
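If the [deleted] markers are not wanted in the final text, they can be filtered out afterwards. The sketch below assumes that each marker stands alone as its own paragraph and that paragraphs are separated by blank lines; that layout is an assumption, so check it against the actual output. The names drop_deleted_paragraphs and super_clean_without_markers are hypothetical helpers, not part of the package.

```python
import re

DELETED_MARKER = "[deleted]"


def drop_deleted_paragraphs(cleaned: str) -> str:
    """Remove paragraphs that consist only of the [deleted] marker.

    Assumes paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", cleaned)
    kept = [p for p in paragraphs if p.strip() and p.strip() != DELETED_MARKER]
    return "\n\n".join(kept)


def super_clean_without_markers(
    book: str, min_token: int = 5, max_token: int = 600
) -> str:
    """Run super_cleaner, then strip the [deleted] markers (hypothetical wrapper)."""
    # Imported lazily so drop_deleted_paragraphs works without the package.
    from gutenberg_cleaner import super_cleaner

    cleaned = super_cleaner(book, min_token=min_token, max_token=max_token)
    return drop_deleted_paragraphs(cleaned)
```

Passing min_token=-1 to the wrapper forwards the documented option that skips tokenization.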

Author

Peyman Mohseni kiasari

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
