- Python 3.8+
- pandas
- pyarrow or fastparquet
- lxml
Run sequentially:
- Fix errors in the XML files with `correct.sh`. Note that the XML files must be placed in the same directory as the Bash script.
- Parse the XML file(s) by running `xml_parser.py` with the name(s) of the file(s) as mandatory parameters. JSONL file(s) will be created as a result.
- Create `*.parquet` files by running `jsonl_parser.py`. Use the `--active` and `--address` parameters to parse only active-state entities and to split address strings into their components.
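
The three steps above could be chained from a single Python script. The following is only a sketch under stated assumptions: it assumes `correct.sh` takes no arguments, that `jsonl_parser.py` needs nothing beyond the optional `--active` and `--address` flags, and that everything runs from the directory holding the scripts and the XML files. The helper names `build_pipeline` and `run_pipeline` are hypothetical and not part of this repository.

```python
import subprocess
import sys
from typing import List, Sequence


def build_pipeline(
    xml_files: Sequence[str],
    active: bool = False,
    address: bool = False,
) -> List[List[str]]:
    """Return the command lines for the three steps, in the order they should run.

    Assumption: correct.sh takes no arguments and jsonl_parser.py takes
    only the optional --active / --address flags.
    """
    if not xml_files:
        raise ValueError("xml_parser.py needs at least one XML file name")

    jsonl_cmd = [sys.executable, "jsonl_parser.py"]
    if active:
        jsonl_cmd.append("--active")
    if address:
        jsonl_cmd.append("--address")

    return [
        ["bash", "correct.sh"],                         # step 1: fix XML errors
        [sys.executable, "xml_parser.py", *xml_files],  # step 2: XML -> JSONL
        jsonl_cmd,                                      # step 3: JSONL -> parquet
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_pipeline(["uo.xml", "fop.xml"], active=True, address=True))` would then run the steps in order; the two file names are placeholders for whatever XML files you actually have.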
- You need a lot of free disk space (approx. 40 GB to parse both the UO and FOP files as of January 2022).
- Parsing addresses into components is sometimes inaccurate. The same applies to managers' posts.
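
Given the disk space note above, it may be worth checking free space before starting. The snippet below is a minimal sketch using only the standard library; the 40 GB threshold is the approximate figure quoted above, and the function names are hypothetical.

```python
import shutil

# Approximate requirement quoted above (UO + FOP files, January 2022).
REQUIRED_GB = 40


def free_space_gb(path: str = ".") -> float:
    """Return the free space, in GiB, on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the drive holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb
```

For example, `has_enough_space(".")` checks the drive that holds the current working directory, which is where the intermediate files are assumed to be written.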