A simple web scraping function that retrieves graphics card information from sites such as Newegg, Best Buy, and Canada Computers.
The scraped data is stored locally in a JSON file named after the time at which the scrape ran, and is also saved to an SQLite database for use on the site.
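As a rough sketch of what that storage step could look like (the `GraphicsCard` model and the shape of `results` are assumptions for illustration, not this project's actual code):

```python
import json
from datetime import datetime

from scraping.models import GraphicsCard  # hypothetical Django model


def save_results(results):
    """Persist scraped GPU listings to a timestamped JSON file and to SQLite."""
    # JSON snapshot named after the time the scrape ran
    filename = datetime.now().strftime("scrape_%Y-%m-%d_%H-%M-%S.json")
    with open(filename, "w") as f:
        json.dump(results, f, indent=2)

    # Mirror the same records into the SQLite database via the Django ORM
    for item in results:
        GraphicsCard.objects.create(
            name=item["name"],
            price=item["price"],
            retailer=item["retailer"],
        )
```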
It is suggested to use Celery for scheduled scraping, with the default schedule being every midnight. Individual runs are also possible.
NOTE: Don't let your computer sleep, otherwise the scraping process may stop abruptly.
- Selenium
- BeautifulSoup4
- Pandas
- lxml
- SQLite3
- Django
- Chromedriver
- Celery
`pip install -r requirements.txt`

Set the following environment variables:

SECRET_KEY -> Django secret key
CHROME_PATH -> Absolute path to Google Chrome
CHROMEDRIVER_PATH -> Absolute path to Chromedriver
NEWEGG_URL -> URL for Newegg GPUs
BEST_BUY_URL -> URL for Best Buy GPUs
CANADA_COMPUTERS_URL -> URL for Canada Computers GPUs
Ensure the URLs you set point to the GPU listing pages of the respective sites.
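As an illustration of how these variables might be consumed when launching the browser (assuming Selenium 4; the exact option names used in this project may differ):

```python
import os

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Paths and target URLs come from the environment variables above
chrome_path = os.environ["CHROME_PATH"]
chromedriver_path = os.environ["CHROMEDRIVER_PATH"]
newegg_url = os.environ["NEWEGG_URL"]

options = webdriver.ChromeOptions()
options.binary_location = chrome_path  # use the Chrome binary at CHROME_PATH

driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)
driver.get(newegg_url)  # open the Newegg GPU listing page
driver.quit()
```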
There are three ways of running the scraping process: scheduled via Celery, as an individual run via Celery, or directly without Celery.
- Make sure you install RabbitMQ
- Start up RabbitMQ using `sudo rabbitmq-server` in a terminal
- Wait until it completes with the message `Starting broker... completed with x plugins`
- Start up the actual service with the Celery command: `celery -A gpu-web-scraper worker -B -l INFO`
  - `-A gpu-web-scraper` specifies which project
  - `-B` tells it to run on the given beat schedule (a sketch of such a schedule follows below)
  - `-l INFO` tells it to log at the INFO level
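The midnight default mentioned earlier would typically come from a beat schedule along these lines in the project's Celery configuration; this is only a sketch, and the exact module layout may differ:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("gpu-web-scraper")

# Queue the scrape task every night at midnight
app.conf.beat_schedule = {
    "nightly-gpu-scrape": {
        "task": "scraping.tasks.scrape",
        "schedule": crontab(hour=0, minute=0),
    },
}
```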
- Do the same first two steps as above
- Start up the Celery service using `celery -A gpu-web-scraper worker -l INFO`
  - Note that we don't use `-B` as we don't want it on a schedule
- Open up a third terminal
- Import the Django settings module using `export DJANGO_SETTINGS_MODULE=gpu-web-scraper.settings`
- Enter the Python console and run the following:

      import django
      django.setup()
      from scraping.tasks import scrape
      scrape.apply_async()
Note: You can stop the RabbitMQ service using `rabbitmqctl stop`
- Import the Django settings module using `export DJANGO_SETTINGS_MODULE=gpu-web-scraper.settings`
- Enter the Python console and run the following:

      import django
      django.setup()
      from scraping.tasks import scrape
      scrape()
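Both styles assume `scrape` is a Celery task: calling `scrape()` runs it synchronously in the current process, while `scrape.apply_async()` queues it for a worker. A minimal sketch of such a task (the actual scraping logic is omitted):

```python
from celery import shared_task


@shared_task
def scrape():
    """Scrape the configured retailers and save the results."""
    # The real implementation drives Selenium/BeautifulSoup here and
    # persists the data to JSON and SQLite; it is omitted in this sketch.
```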
To run the web server, run the following commands:
- `python3 manage.py collectstatic` (not required)
- `python3 manage.py makemigrations` (if you made changes to a model)
- `python3 manage.py migrate`
- `python3 manage.py runserver`
Planned improvements:
- Fix drivers to run concurrently again!
- Make use of Docker to always have the server running
- Save backups of data online
