SEFS (Semantic Entropy File System) is a file organization system that groups files by what they contain rather than by their extensions. It uses deep learning embedding models and LLMs (Large Language Models) to capture the meaning of your files and automatically organize them into semantically relevant folders.
- 📂 Semantic Organization: Files are grouped by content, not just extension. (e.g., An invoice PDF and an invoice TXT will live together).
- 🤖 AI-Powered Naming: Uses the Google Gemini API to analyze clusters and name folders descriptively (e.g., `Financial_Invoices`, `Project_Documentation`).
- 👀 Real-time Monitoring: Automatically detects new, modified, or moved files and re-organizes them instantly.
- 📊 2D Semantic Map: A graphical interface that plots each file as a point, with its high-dimensional embedding reduced to 2D using UMAP.
- 🔍 Multi-format Support: Extracts text from PDFs (OCR-style), Word documents, Markdown, and plain text.
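
As a rough illustration of the multi-format extraction step, the sketch below dispatches on the file extension. The function name `extract_text` is hypothetical and not taken from the SEFS code base. It assumes the `PyMuPDF` and `python-docx` packages from the dependency list, and it reads only the text layer of a PDF (no OCR), so the real extractor may well do more.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return the text of a supported file, or '' for unsupported types.

    Hypothetical helper for illustration; not the project's actual extractor.
    """
    p = Path(path)
    suffix = p.suffix.lower()

    if suffix in {".txt", ".md"}:
        # Plain text and Markdown are read directly.
        return p.read_text(encoding="utf-8", errors="ignore")

    if suffix == ".pdf":
        # PyMuPDF is imported lazily so text-only use needs no extra packages.
        import fitz  # PyMuPDF

        with fitz.open(str(p)) as doc:
            return "\n".join(page.get_text() for page in doc)

    if suffix == ".docx":
        import docx  # python-docx

        return "\n".join(par.text for par in docx.Document(str(p)).paragraphs)

    # Unknown extension: nothing to embed.
    return ""
```

Returning an empty string for unknown types lets the caller decide whether to skip the file or route it elsewhere.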
The system is built on a modular "Engine" architecture:
- File Monitor Engine: Watchdog-based listener that triggers the pipeline on OS file events.
- Embedding Engine: Uses `sentence-transformers` (MPNet/MiniLM) to convert text into dense vectors (768-dimensional for `all-mpnet-base-v2`, 384-dimensional for `all-MiniLM-L6-v2`).
- Clustering Engine: Performs density-based clustering using HDBSCAN on top of UMAP dimensionality reduction.
- AI Namer Service: Probes the clusters using Gemini-1.5-Flash to distill cluster contents into a 2-3 word folder name.
- Folder Manager: Handles the safe movement and atomic updates of the file system structure.
- Database Layer: SQLite-backed persistent storage for embeddings, hashes (to prevent redundant processing), and metadata.
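
To make the Folder Manager's "safe movement" more concrete, here is a minimal sketch using only the standard library. The name `safe_move` and the `_1`, `_2` suffix scheme are assumptions made for illustration; the actual SEFS implementation may resolve name clashes differently.

```python
import shutil
from pathlib import Path


def safe_move(src: str, dest_dir: str) -> Path:
    """Move src into dest_dir without overwriting an existing file.

    Hypothetical helper: if the target name is taken, a numeric suffix
    is appended (report.txt -> report_1.txt -> report_2.txt ...).
    """
    src_path = Path(src)
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / src_path.name
    counter = 1
    while target.exists():
        target = target_dir / f"{src_path.stem}_{counter}{src_path.suffix}"
        counter += 1

    # On the same filesystem shutil.move is a rename; across filesystems
    # it falls back to copy-then-delete.
    shutil.move(str(src_path), str(target))
    return target
```

For example, `safe_move("inbox/report.txt", "Financial_Invoices")` would create the folder if needed and return the final path of the moved file.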
graph TD
A[File System Change] -->|Detected by| B(File Monitor)
B -->|New/Modified| C{Hash Check}
C -->|New Hash| D[Text Extraction]
C -->|Existing| Z[Skip]
D -->|Text| E[Embedding Engine]
E -->|Vector| F[(SQLite DB)]
F -->|All Vectors| G[Clustering Engine]
G -->|UMAP + HDBSCAN| H[Semantic Clusters]
H -->|Content Samples| I[Gemini AI Namer]
I -->|Generated Name| J[Folder Manager]
J -->|Move File| K[Organized Hierarchy]
H -->|Coordinates| L[PyQt6 Visualization]
subgraph Organization Structure
K --> Folder1[Invoices_2024]
K --> Folder2[Research_Papers]
K --> Folder3[Miscellaneous_Files]
end
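
The hash check at the start of the pipeline above could look roughly like the following. This is a sketch under stated assumptions: the table name `files`, its columns, and the helper names are invented for illustration, and SHA-256 is an assumed choice of hash rather than something the project documents.

```python
import hashlib
import sqlite3


def file_hash(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that remembers each file's hash."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )


def needs_processing(conn: sqlite3.Connection, path: str) -> bool:
    """Return True for new or changed files and record the new hash."""
    current = file_hash(path)
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == current:
        return False  # "Existing" branch: skip extraction and embedding
    conn.execute(
        "INSERT OR REPLACE INTO files (path, content_hash) VALUES (?, ?)",
        (path, current),
    )
    conn.commit()
    return True  # "New Hash" branch: continue to text extraction
```

Only files for which `needs_processing` returns `True` would go on to text extraction and embedding, which is what keeps repeated file events cheap.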
- GUI Framework: PyQt6 (for the interactive dashboard)
- Embedding Model: HuggingFace `all-mpnet-base-v2` or `all-MiniLM-L6-v2`
- Clustering: HDBSCAN (Density-based) & UMAP (Manifold Learning)
- AI/LLM: Google Gemini API (Generative AI)
- Storage: SQLite3 with BLOB support for NumPy arrays
- Dependencies: `watchdog`, `sentence-transformers`, `numpy`, `scikit-learn`, `PyMuPDF`, `python-docx`
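
The storage entry above mentions keeping vectors in SQLite BLOB columns. The sketch below shows the round trip with the standard-library `array` module standing in for NumPy so that it runs without extra packages; with NumPy, `arr.astype(np.float32).tobytes()` and `np.frombuffer(blob, dtype=np.float32)` would play the same roles. The table and function names are hypothetical.

```python
import sqlite3
from array import array


def to_blob(vector) -> bytes:
    """Pack a sequence of floats into raw float32 bytes."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> list:
    """Unpack raw float32 bytes back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def save_embedding(conn: sqlite3.Connection, path: str, vector) -> None:
    """Insert or overwrite the stored vector for a file path."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "path TEXT PRIMARY KEY, vector BLOB NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (path, vector) VALUES (?, ?)",
        (path, to_blob(vector)),
    )
    conn.commit()


def load_embedding(conn: sqlite3.Connection, path: str):
    """Return the stored vector for a path, or None if absent."""
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else from_blob(row[0])
```

Storing float32 halves the size compared with Python's native doubles, at the cost of some precision, which is usually acceptable for embeddings.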
- Clone the repository:

      git clone https://github.com/your-username/Semantic-Entropy-File-System-Project.git
      cd Semantic-Entropy-File-System-Project/sefs

- Set up Virtual Environment:

      python -m venv .venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install Core Dependencies:

      pip install -r requirements.txt
      pip install umap-learn hdbscan  # Required for clustering engine

- Configuration: Create a `.env` file in the `sefs/` directory:

      GEMINI_API_KEY=your_google_gemini_api_key_here
      ROOT_DIR=C:/Path/To/Your/Documents
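
For reference, a `.env` file in this `KEY=value` form can be read with a few lines of standard-library Python. This is only a sketch: the helper name `load_env` is hypothetical, and the project itself may rely on a dedicated package such as `python-dotenv` instead.

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A caller could then read `load_env()["GEMINI_API_KEY"]` and `load_env()["ROOT_DIR"]` at startup and fail early with a clear message if either key is missing.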
Run the application:

    python run.py

- Select Folder: Use the UI to pick any folder you want to organize.
- Watch: Files will start appearing as dots on the graph.
- Check File Explorer: Watch as SEFS creates folders and moves your files into semantically correct categories.
- Miscellaneous Files: Files that don't fit into any dense cluster are automatically grouped into a `Miscellaneous_Files` folder (noise handling via HDBSCAN).
- Collision Prevention: The system checks file hashes before moving to ensure no data is lost or duplicated.
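
HDBSCAN marks points that belong to no cluster with the label `-1`. A minimal sketch of how such labels might be turned into folder assignments follows; the function name, the `Cluster_<n>` fallback, and the sample inputs in the usage note are assumptions for illustration only.

```python
from collections import defaultdict

NOISE_LABEL = -1  # label HDBSCAN assigns to points outside every cluster


def group_by_cluster(paths, labels, names):
    """Map folder name -> list of file paths.

    paths  : file paths, in the same order as labels
    labels : cluster labels as produced by HDBSCAN
    names  : dict of cluster label -> folder name (e.g. from the AI namer)
    """
    folders = defaultdict(list)
    for path, label in zip(paths, labels):
        if label == NOISE_LABEL:
            folder = "Miscellaneous_Files"
        else:
            # Fall back to a generic name if no name was generated yet.
            folder = names.get(label, f"Cluster_{label}")
        folders[folder].append(path)
    return dict(folders)
```

With paths `["a.pdf", "b.txt", "c.md"]`, labels `[0, 0, -1]`, and names `{0: "Financial_Invoices"}`, the first two files land in `Financial_Invoices` and the third in `Miscellaneous_Files`.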
Supreeth Gollapally
Semantic Entropy File System Project