MapleScrape

About The Project

A simple web scraping tool that retrieves graphics card information from sites such as Newegg, Best Buy, and Canada Computers.

The scraped data is stored locally in a JSON file named after the time at which the scrape ran. The data is also saved to SQLite for use by the site.

Using Celery for scheduled scraping is recommended; the default schedule runs every midnight. Individual one-off runs are also possible.

NOTE: Don't let your computer sleep, otherwise the scraping process may stop abruptly.
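
The storage step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the helper name `save_results`, the `gpus` table and its columns, and the file-name pattern are all assumptions made for the example, and the project may well go through Django models instead of raw `sqlite3`.

```python
import json
import sqlite3
from datetime import datetime
from pathlib import Path


def save_results(rows, out_dir, db_path, now=None):
    """Write rows to a timestamped JSON file and to a SQLite table.

    Hypothetical helper: the file-name pattern and the `gpus` schema
    are assumptions, not the project's real layout.
    """
    now = now or datetime.now()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Name the JSON file after the time the scrape ran.
    json_path = out_dir / f"gpus_{now:%Y-%m-%d_%H-%M-%S}.json"
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")

    # Store the same rows in SQLite so the site can query them.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS gpus ("
                "name TEXT, price REAL, store TEXT, scraped_at TEXT)"
            )
            conn.executemany(
                "INSERT INTO gpus (name, price, store, scraped_at) "
                "VALUES (:name, :price, :store, :scraped_at)",
                [{**row, "scraped_at": now.isoformat()} for row in rows],
            )
    finally:
        conn.close()
    return json_path
```

Each row is expected to be a dict with `name`, `price`, and `store` keys. The function returns the path of the JSON file it wrote.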
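
For reference, a midnight schedule in Celery is usually expressed with a `crontab` entry in the beat schedule. The fragment below is only a sketch of what such a configuration might look like. The app name `gpu-web-scraper` and the task path `scraping.tasks.scrape` are inferred from the commands in this README, and the entry name `scrape-every-midnight` is made up, so the project's real configuration may differ.

```python
from celery import Celery
from celery.schedules import crontab

# Assumed app name, inferred from the `celery -A gpu-web-scraper` commands.
app = Celery("gpu-web-scraper")

app.conf.beat_schedule = {
    # Hypothetical entry name; any unique string works here.
    "scrape-every-midnight": {
        # Assumed task path, based on `from scraping.tasks import scrape`.
        "task": "scraping.tasks.scrape",
        # Fire at 00:00 every day.
        "schedule": crontab(minute=0, hour=0),
    },
}
```

Celery evaluates `crontab` schedules in the app's configured timezone, which is UTC unless set otherwise, so "midnight" may not match local time by default.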

Built With

Selenium
BeautifulSoup4
Pandas
lxml
SQLite3
Django
Chromedriver
Celery

Getting Started

Install the required packages using:

pip install -r requirements.txt

Set your `.env` variables

SECRET_KEY -> Django secret key
CHROME_PATH -> Absolute path to Google Chrome
CHROMEDRIVER_PATH -> Absolute path to Chromedriver
NEWEGG_URL -> URL for Newegg GPUs
BEST_BUY_URL -> URL for Best Buy GPUs
CANADA_COMPUTERS_URL -> URL for Canada Computers GPUs

Ensure the URLs you set are for the GPU pages of the respective sites.
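
A quick way to catch a missing setting before starting a scrape is to check the environment up front. The snippet below is a small sketch, not part of the project: the helper name `missing_env_vars` is hypothetical, and it assumes the `.env` values have already been loaded into the process environment by whatever loader the project uses.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "SECRET_KEY",
    "CHROME_PATH",
    "CHROMEDRIVER_PATH",
    "NEWEGG_URL",
    "BEST_BUY_URL",
    "CANADA_COMPUTERS_URL",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing .env variables: " + ", ".join(missing))
    print("All required variables are set.")
```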

For example:

Run the scraping process

There are three ways of running the scraping process:

With Celery and RabbitMQ on a schedule

1. Make sure RabbitMQ is installed
2. Start RabbitMQ by running `sudo rabbitmq-server` in a terminal
   - Wait for the message Starting broker... completed with x plugins
3. Start the Celery worker with: `celery -A gpu-web-scraper worker -B -l INFO`
   - `-A gpu-web-scraper` specifies the project
   - `-B` also runs the beat scheduler, so tasks fire on the configured schedule
   - `-l INFO` sets the log level to INFO

With Celery and RabbitMQ without a schedule

1. Follow the first two steps above
2. Start the Celery worker with: `celery -A gpu-web-scraper worker -l INFO`
   - Note that `-B` is omitted because we don't want the task to run on a schedule
3. Open a third terminal
4. Set the Django settings module with `export DJANGO_SETTINGS_MODULE=gpu-web-scraper.settings`

Enter the Python console and run the following:

import django
django.setup()
from scraping.tasks import scrape
scrape.apply_async()

Note: You can stop the RabbitMQ service with `rabbitmqctl stop`

Without Celery and RabbitMQ

Set the Django settings module with `export DJANGO_SETTINGS_MODULE=gpu-web-scraper.settings`

Enter the Python console and run the following:

import django
django.setup()
from scraping.tasks import scrape
scrape()

Collect the static files and run the site

Run the following commands:

python3 manage.py collectstatic (optional)
python3 manage.py makemigrations (only if you changed the models)
python3 manage.py migrate
python3 manage.py runserver

Future Goals

Fix drivers to run concurrently again!
Make use of Docker to always have the server running
Save backups of data online

JoshFung/MapleScrape

Folders and files

Latest commit

History

Repository files navigation

MapleScrape

About The Project

Built With

Getting Started

Install the required packages using:

Set your .env variables

Run the scraping process

With Celery and RabbitMQTT on a schedule

With Celery and RabbitMQTT without a schedule

Without Celery and RabbitMQTT

Collect the static files and run the site

Future Goals

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages

Set your `.env` variables