JobScraperAI is a Python web application that scrapes job postings from tech companies and matches them to candidates' resumes using large language models. It uses AI-based parsing to convert raw HTML job postings into structured data that can be compared directly against a candidate's profile.
- Utilizes Large Language Models (LLMs) like GPT-4 and Anthropic's Claude to parse job postings.
- Converts HTML job postings into structured JSON with fields such as `company_name`, `title`, `description`, `location`, `requirements`, `benefits`, and `salary`.
- Matches parsed job descriptions to a candidate's resume using NLP-based techniques.
- Identifies the best-fitting jobs by analyzing skill sets, experience, and keywords from resumes.
- Uses SQLModel (based on Pydantic) and SQLAlchemy to store parsed job postings and matched candidate information in a relational database.
- Includes functionality to migrate, store, and manage job and candidate data seamlessly.
- Uploads and stores processed job postings in Supabase.
- Supports authentication and session management for user interaction logging.
- Contains `docker-compose.yml` for easy setup of the development environment.
- Streamlines containerized deployment and testing.
- Organized into API modules for scraping, authentication, database interaction, and AI integration.
- FastAPI is used to serve the web API, providing efficient and scalable endpoints for interaction.
- (In-progress) Aims to include user authentication via token-based or session-based systems.
- Allows tracking user interactions and logging their actions for analytics.
- Contains a variety of test scripts to verify the functionality of individual components.
- Includes both unit and integration tests to ensure the stability and accuracy of the scraping and parsing logic.
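
To make the structured fields and resume matching described above more concrete, here is a minimal sketch. It is not the project's implementation: the project uses Pydantic/SQLModel and LLM-based parsing, while this sketch uses a standard-library dataclass and a plain keyword-overlap score to stay dependency-free. The names `JobPosting`, `keywords`, `match_score`, and `rank_postings` are hypothetical; only the field names come from the list above.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class JobPosting:
    """Structured form of one posting (field names taken from the README)."""
    company_name: str
    title: str
    description: str
    location: str = ""
    requirements: List[str] = field(default_factory=list)
    benefits: List[str] = field(default_factory=list)
    salary: Optional[str] = None


def keywords(text: str) -> set:
    """Lowercased tokens; '+' and '#' are kept so 'c++' and 'c#' survive."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def match_score(posting: JobPosting, resume_text: str) -> float:
    """Fraction of the posting's requirement keywords found in the resume."""
    wanted = keywords(" ".join(posting.requirements))
    if not wanted:
        return 0.0
    return len(wanted & keywords(resume_text)) / len(wanted)


def rank_postings(postings, resume_text):
    """Return postings sorted from best to worst keyword overlap."""
    return sorted(postings, key=lambda p: match_score(p, resume_text), reverse=True)


# Illustrative data only; not taken from a real posting.
posting = JobPosting(
    company_name="Example Corp",
    title="Backend Engineer",
    description="Build and maintain APIs.",
    location="Remote",
    requirements=["Python", "FastAPI", "SQL"],
)
print(json.dumps(asdict(posting), indent=2))
```

Keyword overlap is a deliberately crude stand-in for the NLP-based matching; an embedding-based similarity would be a natural replacement for `match_score`.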
- ai/: Contains AI models and parsers, including methods to parse job postings from HTML using GPT-4 and Claude.
- api/: Modular API subdirectories for scraping, authentication, Supabase integration, etc.
- auth/: Utilities for handling user authentication.
- testingStuff/: Test scripts for parsing, AI model interaction, and data processing.
- models/: Likely contains Pydantic/SQLModel database models for job postings and resumes.
- migrations/: Alembic migrations for database schema updates.
- main.py: Entry point for the web application.
- db.py: Handles database initialization and connection setup.
- docker-compose.yml: Configuration file for containerizing the app in a Docker environment.
- requirements.txt: List of Python dependencies.
- ai/htmlParsers.py: Contains functions such as `parseHTMLgpt4` and `parseHTMLClaud3` for converting HTML job postings to structured JSON using AI models.
- testingStuff/tempCodeRunnerFile.py: A test script that includes `parse_html`, which filters text content from HTML files.
- testingStuff/aiRefac.py: Defines the `RoleBase` class using Pydantic, representing the structure of job roles.
- supaBasetest.py: Demonstrates how to upload parsed job postings to Supabase storage.
- testingStuff/stg_role.py: Script responsible for processing HTML job postings and interacting with AI models.
- testingStuff/sampleChain.py: Script for loading job data from CSV files and interacting with OpenAI's API.
- testingStuff/tutorial.py: Demonstrates LangChain integration with OpenAI for document embedding and similarity analysis.
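
The text-filtering step mentioned for `parse_html` is not shown in this README, so the following is only a sketch of one plausible approach: strip scripts and styles and keep the visible text before handing it to an AI model. It uses the standard library's `html.parser`; the class `_TextExtractor` and the function `extract_visible_text` are hypothetical names, not the repository's code.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of non-visible tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Illustrative input only.
sample = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><h1>Backend Engineer</h1><p>Remote &amp; flexible</p></body></html>"
)
print(extract_visible_text(sample))
```

Reducing a page to its visible text before sending it to an LLM generally cuts token usage, which is presumably why a filtering step exists in the test scripts.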
# Workflow
- Change the model or code.
- Commit to the branch.
- Generate a migration file: `alembic revision --autogenerate -m "Git Branch Name"`
- Review the revision generated in `versions`.
- Run the upgrade on the test database.
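
The migration steps above could be wrapped in a small helper, sketched here under a few assumptions: the upgrade target is `head`, the test database is selected through an alternate Alembic config file passed with `-c`, and the file name `alembic.test.ini` is purely illustrative. The function names are hypothetical. The helper defaults to a dry run so the generated revision can still be reviewed by hand between the two commands.

```python
import shlex
import subprocess
from typing import List, Optional


def _base(config_file: Optional[str]) -> List[str]:
    """Start of every command; '-c' points Alembic at a specific ini file."""
    return ["alembic"] + (["-c", config_file] if config_file else [])


def revision_command(branch_name: str, config_file: Optional[str] = None) -> List[str]:
    """Command that autogenerates a migration named after the git branch."""
    return _base(config_file) + ["revision", "--autogenerate", "-m", branch_name]


def upgrade_command(config_file: Optional[str] = None, target: str = "head") -> List[str]:
    """Command that applies migrations up to `target` (assumed: 'head')."""
    return _base(config_file) + ["upgrade", target]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Return the shell form of `cmd`; execute it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


# Dry run: shows the commands without touching any database.
print(run(revision_command("feature/add-auth")))
# ...review the new file under versions/ here...
print(run(upgrade_command(config_file="alembic.test.ini")))  # hypothetical file
```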
fastUi: https://github.com/pydantic/FastUI
- fastui and fastapi: https://www.youtube.com/watch?v=eBWrnSyN2iw
pydantic: https://docs.pydantic.dev/2.7/api/validate_call/
sqlModel: https://sqlmodel.tiangolo.com/features/?h=validation#based-on-pydantic
fastAPI: https://fastapi.tiangolo.com/
alembic: https://alembic.sqlalchemy.org/en/latest/
- alembic best practices: https://thinhdanggroup.github.io/alembic-python/
- fastapi and alembic: https://www.youtube.com/watch?v=zTSmvUVbk8M
- Add in authentication
- Tie to user session
- Log actions in session
- Figure out poetry
- setup correct docker compose for app deployment
- understand session tracking
- create authentication
- create fast ui for barebones
- create cli tool for easy dev
- figure out testing files
- migrate over code from Jobbr repo