ResuMariner

We are looking for a backend Python developer for a parsing project with unknown bounds and requirements. By the way, have you worked with RAGs and vector databases?

— Undisclosed CTO during what was supposed to be a casual coffee chat

This project started after one of those meetings where "simple resume parsing" evolved into "candidate matching with AI predictions." Instead of building another one-off script, I built a proper architecture that handles changing requirements: Neo4j for graph data, Qdrant for vector search, Django for the backend. It can scale from basic text extraction to complex queries without rewrites.

What it does

The system handles CV uploads (PDF, DOCX, JPG formats), parses them into structured data, and makes them searchable. You can throw natural language queries at it ("find someone who's built payment systems") or search by specific criteria (5+ years Python, knows Docker, lives in Berlin). There's also a hybrid mode that combines both approaches when you need the flexibility of semantic search with the precision of structured filters.

The frontend is a React + TypeScript app (Vite) with pages for CV upload, job status tracking, search across all three modes, health monitoring, and admin cleanup.

Architecture

V1 was microservices. I spent more time coordinating services than building features, and the "single source of truth" problem for shared domain objects (Resume, AIReview) had no good solution: shared libraries and copy-pasting were both bad. V2 is a Django monolith with clean separation through apps.

The processor app handles uploads and LLM parsing. CVs get queued in Redis, processed by background workers, and stored in two places. Neo4j holds the graph - candidates connecting to companies, skills, locations, education. Structured searches are just relationship traversals. Qdrant stores vector embeddings, with each resume chunked into multiple vectors (summary, skills, work bullets, projects) to avoid semantic soup. Searches match against all chunks independently, then group by candidate.
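To make the "match chunks independently, then group by candidate" step concrete, here is a minimal sketch. The ChunkHit type, its field names, and the choice to rank each candidate by its best-scoring chunk are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkHit:
    candidate_id: str  # hypothetical ID linking a chunk back to its resume
    section: str       # e.g. "summary", "skills", "work_bullet", "project"
    score: float       # similarity score returned by the vector search


def group_by_candidate(hits: list[ChunkHit]) -> list[tuple[str, float, list[ChunkHit]]]:
    """Group chunk-level hits per candidate, ranked by each candidate's best chunk."""
    grouped: dict[str, list[ChunkHit]] = defaultdict(list)
    for hit in hits:
        grouped[hit.candidate_id].append(hit)

    ranked = []
    for candidate_id, chunks in grouped.items():
        # Best chunk first, so callers can show why the candidate matched.
        chunks.sort(key=lambda h: h.score, reverse=True)
        ranked.append((candidate_id, chunks[0].score, chunks))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


hits = [
    ChunkHit("cand-1", "skills", 0.71),
    ChunkHit("cand-2", "work_bullet", 0.88),
    ChunkHit("cand-1", "project", 0.93),
]
results = group_by_candidate(hits)
```

Ranking by the best chunk is only one option. Averaging the top few chunks would favor candidates who match the query in several sections.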

Traefik is the reverse proxy. Redis handles queuing, caching, and worker coordination. Workers run Django Q for async processing.

Getting started

You need Docker, Docker Compose, and at least 4GB of RAM. Grab an API key from OpenAI, Gemini, or whatever LLM provider you prefer - just one is enough. Copy backend/.env.example to backend/.env and drop your key in there.

Then run:

docker-compose up --build

The backend API will be at http://localhost:8000 and the frontend at http://localhost:5173.

How the API works

Full OpenAPI spec at /api/schema/.

POST /api/v1/upload/ returns a job ID immediately. Processing happens in the background, because parsing PDFs, calling LLM APIs, and generating embeddings all take time. Check status with GET /api/v1/jobs/{job_id}/, then get results from GET /api/v1/jobs/{job_id}/result/ once the job is done.
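A client for this flow is essentially a polling loop. The sketch below keeps HTTP out of the picture by taking two callables, so it runs without a server. The status strings "completed" and "failed" are assumptions; check the OpenAPI spec for the real values.

```python
import time
from typing import Callable


def wait_for_result(
    fetch_status: Callable[[], str],
    fetch_result: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    done: str = "completed",  # assumed status value
    failed: str = "failed",   # assumed status value
) -> dict:
    """Poll a job until it finishes, then return its result."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == done:
            return fetch_result()
        if status == failed:
            raise RuntimeError("job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        time.sleep(interval_s)


# Stand-ins for the two GET requests, so the loop can be exercised offline.
statuses = iter(["pending", "processing", "completed"])
result = wait_for_result(
    fetch_status=lambda: next(statuses),
    fetch_result=lambda: {"name": "example"},
    interval_s=0,
)
```

In a real client, fetch_status would call GET /api/v1/jobs/{job_id}/ and read the status out of the response body, and fetch_result would call GET /api/v1/jobs/{job_id}/result/.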

Search

Three endpoints, different approaches.

POST /search/semantic/ takes natural language queries, converts them to embeddings, searches Qdrant's vector space. Ranked by semantic similarity. "Built payment systems in Python" matches "developed financial transaction services using Python" even without exact keywords.
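For intuition about "ranked by semantic similarity", here is a toy ranking using cosine similarity over hand-made two-dimensional vectors. Real embeddings come from a model and have hundreds of dimensions, and whether this deployment's Qdrant collection uses cosine distance is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: list[float], chunks: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every chunk against the query and sort from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in chunks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up vectors: the first axis loosely stands for "payments", the second for "gardening".
query_vec = [1.0, 0.0]
chunk_vecs = {
    "developed financial transaction services": [0.9, 0.1],
    "maintained a community garden": [0.0, 1.0],
}
ranked = rank_chunks(query_vec, chunk_vecs)
```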

POST /search/structured/ takes exact criteria - years of experience, specific skills, locations. Queries Neo4j directly, traverses the graph. Fast and precise.
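A structured search like "5+ years, knows Python and Docker, lives in Berlin" could translate into a parameterized Cypher query along these lines. The node labels, relationship types, and property names (Candidate, HAS_SKILL, LOCATED_IN, years_experience, and so on) are a hypothetical schema for illustration. The real graph model may differ, and the query has not been run against the project's database.

```python
from typing import Any, Optional


def build_candidate_query(
    min_years: Optional[int] = None,
    skills: Optional[list[str]] = None,
    city: Optional[str] = None,
) -> tuple[str, dict[str, Any]]:
    """Build a Cypher query and its parameters from optional filters."""
    clauses: list[str] = []
    params: dict[str, Any] = {}

    if min_years is not None:
        clauses.append("c.years_experience >= $min_years")
        params["min_years"] = min_years
    if skills:
        # Every requested skill must be linked to the candidate.
        clauses.append(
            "all(skill IN $skills WHERE "
            "EXISTS { MATCH (c)-[:HAS_SKILL]->(:Skill {name: skill}) })"
        )
        params["skills"] = skills
    if city:
        clauses.append("EXISTS { MATCH (c)-[:LOCATED_IN]->(:Location {city: $city}) }")
        params["city"] = city

    where = "WHERE " + "\n  AND ".join(clauses) + "\n" if clauses else ""
    query = "MATCH (c:Candidate)\n" + where + "RETURN c.id AS candidate_id"
    return query, params


query, params = build_candidate_query(min_years=5, skills=["Python", "Docker"], city="Berlin")
```

Values travel in params and never get interpolated into the query string, which avoids Cypher injection and lets Neo4j reuse the query plan.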

POST /search/hybrid/ runs both in parallel, merges results with configurable weights. Useful for queries like "senior engineers" (semantic) who know "React, TypeScript, AWS" (structured).
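One simple way to merge the two result sets is a weighted sum of per-candidate scores. This sketch assumes both searches return scores already normalized to the 0 to 1 range and that a candidate missing from one side scores 0 there. The function name and default weights are made up, and the real merge strategy may differ.

```python
def merge_hybrid(
    semantic: dict[str, float],
    structured: dict[str, float],
    semantic_weight: float = 0.5,
    structured_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Combine two score maps into one ranking using a weighted sum."""
    candidates = set(semantic) | set(structured)
    combined = {
        cid: semantic_weight * semantic.get(cid, 0.0)
        + structured_weight * structured.get(cid, 0.0)
        for cid in candidates
    }
    # Highest score first; ties are broken by candidate ID for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


semantic_scores = {"cand-a": 0.9, "cand-b": 0.5}
structured_scores = {"cand-b": 1.0, "cand-c": 1.0}
merged = merge_hybrid(semantic_scores, structured_scores)
```

With equal weights, cand-b ranks first because it appears in both result sets. Raising semantic_weight relative to structured_weight moves cand-a up.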

GET /filters/ returns searchable values - skills, locations, experience ranges from your actual data.

Administrative endpoints

POST /api/v1/jobs/cleanup/ purges old processing jobs. By default the system keeps results for 30 days, but this lets you manually trigger cleanup if needed. It also clears temporary files from uploads.
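The retention rule boils down to a cutoff comparison. Here is a minimal sketch, assuming each job records a timezone-aware completion timestamp. The function name and the dict-based job store are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_job_ids(
    finished_at_by_job: dict[str, datetime],
    now: datetime,
    retention_days: int = 30,
) -> list[str]:
    """Return IDs of jobs that finished before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(
        job_id
        for job_id, finished_at in finished_at_by_job.items()
        if finished_at < cutoff
    )


now = datetime(2025, 3, 1, tzinfo=timezone.utc)
jobs = {
    "job-old": datetime(2025, 1, 15, tzinfo=timezone.utc),  # 45 days before now
    "job-new": datetime(2025, 2, 20, tzinfo=timezone.utc),  # 9 days before now
}
to_purge = expired_job_ids(jobs, now)
```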

GET /api/v1/health/ checks if the service is alive. It pings Redis, Neo4j, and Qdrant to make sure everything's actually reachable, not just that the web server is running.

Testing

There's a test script that uploads a CV and runs all the different search types against it:

python ./backend/test_script.py path/to/resume.pdf

If you don't pass a path, it uses the example resume in backend/test_inputs/Max_Azatian.pdf. Check test_script.py for additional flags.

License

MIT License - see LICENSE for details.
