The script extracts Arabic text from PDFs (in a specific format), translates it using googletrans, and adds the content of each PDF as a row to a CSV file.
- Create a local copy of this repo
- Run `pip install -r requirements.txt` to install packages. This has been tested on macOS
- Add PDFs to the `pdf_input` folder
- Run `main.py`
- All information will be appended to `output.csv`
This script uses poppler-utils to convert a PDF into an image and then uses tesseract to extract the text via OCR. We first extract the Latin characters and numbers and then run OCR a second time for the Arabic. Finally, we use Python's csv module to append the data to a CSV file and also write the names of the processed PDFs to a text file. At startup, the script checks this text file to see which PDFs have already been processed.
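
The bookkeeping described above (append one row per PDF, record the file name, skip files already recorded) could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in `main.py`: the function names, the log file name, and the `extract_row` callable (a stand-in for the OCR and translation steps) are all assumptions.

```python
import csv
from pathlib import Path


def load_processed(log_path):
    """Return the set of PDF file names already recorded in the log file."""
    path = Path(log_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def process_new_pdfs(input_dir, csv_path, log_path, extract_row):
    """Append one CSV row per unprocessed PDF and record its name in the log.

    extract_row is a hypothetical callable that takes a PDF path and returns
    a list of fields; it stands in for the OCR and translation steps.
    """
    done = load_processed(log_path)
    new_names = []
    # Both files are opened in append mode so earlier runs are preserved.
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file, \
            open(log_path, "a", encoding="utf-8") as log_file:
        writer = csv.writer(csv_file)
        for pdf in sorted(Path(input_dir).glob("*.pdf")):
            if pdf.name in done:
                continue  # already handled in a previous run
            writer.writerow(extract_row(pdf))
            log_file.write(pdf.name + "\n")
            new_names.append(pdf.name)
    return new_names
```

A call might look like `process_new_pdfs("pdf_input", "output.csv", "processed.txt", my_extractor)`, where `processed.txt` and `my_extractor` are placeholder names. Passing the extraction step in as a callable keeps the file handling easy to try out without poppler-utils or tesseract installed.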