Description
Is your feature request related to a problem? Please describe.
Over time, systems accumulate duplicate files across different directories (e.g., backups, downloads, synced folders, datasets). These files may have different names or locations but are identical in content, leading to wasted disk space and unnecessary data management overhead.
Currently, this repository contains many useful automation scripts, but there is no tool that safely identifies and manages duplicate files based on file content rather than filename or size alone. Relying only on filenames or file sizes is unreliable: two files of the same size (e.g., CSVs or binaries) can have completely different contents.
Describe the solution you'd like.
I’d like to add a File Deduplication Python script that accurately detects duplicate files by hashing their contents in a memory-efficient and safe manner.
Key features:
- Scan one or more directories for duplicate files
- Use file size as a fast pre-filter, followed by full content-based hashing (SHA-256); see the sketch after this list
- Read files in chunks to support very large files without high memory usage
- Optional concurrency (threading or multiprocessing) to improve performance
- Provide safe actions:
  - Report duplicates
  - Move duplicates to a separate folder
  - Optionally delete duplicates with explicit confirmation
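
To make this concrete, here is a minimal sketch of how the core detection could work, assuming a standalone script with no third-party dependencies. The names `find_duplicates` and `_hash_file` are placeholders for illustration, not a proposed final API:

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def _hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directories):
    """Return {sha256: [paths]} for every hash shared by two or more files."""
    # Pass 1: group files by size (cheap pre-filter, no file contents read).
    by_size = defaultdict(list)
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = Path(root) / name
                try:
                    by_size[path.stat().st_size].append(path)
                except OSError:
                    continue  # unreadable file; skip rather than crash

    # Pass 2: hash only files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_hash[_hash_file(path)].append(path)
            except OSError:
                continue

    return {h: ps for h, ps in by_hash.items() if len(ps) > 1}


if __name__ == "__main__":
    import sys

    for file_hash, paths in find_duplicates(sys.argv[1:]).items():
        print(f"{file_hash}:")
        for p in paths:
            print(f"  {p}")
```

The safe actions (report, move, delete with explicit confirmation) would then operate on the returned groups, always keeping at least one copy per group, and optional concurrency could be layered onto the hashing pass.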
Let me know your thoughts and whether you think this would be valuable to have; I can then work on a PR.