Here we address a simple use case of applying a single transform to a
set of parquet files.
We'll use the docling2parquet transform as an example, but in general, this process
will work for any of the transforms contained in Data Prep Kit.
Additionally, what follows uses the python runtime, but the examples below should also work for the ray or spark runtimes.
The latest version of the Data Prep Kit is available on PyPI for Python 3.10, 3.11, or 3.12. It can be installed using:

pip install 'data-prep-toolkit-transforms[ray,all]'

The above installs all available transforms and both the python and Ray runtimes.
NOTE: As of this writing, on Linux systems there is an issue installing fasttext for the lang_id transform. A workaround is to install using conda.
Alternatively, you may choose to install only the transform(s) of interest (see below).
When installing select transforms, users can specify the name of the transform in the pip command, rather than [all]. For example, use the following command to install only the docling2parquet transform:
pip install 'data-prep-toolkit-transforms[docling2parquet]'

As an alternative, instructions for installing in a conda environment can be found here.
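Before running a transform, it can help to confirm that the interpreter is one of the supported Python versions and that the transform's module can be found. The following is a minimal sketch using only the standard library. The helper names `is_supported_python` and `is_installed` are hypothetical and not part of Data Prep Kit, and the module name `dpk_docling2parquet` is assumed from the command used to run the transform in this document.

```python
import importlib.util
import sys

# Minor versions of Python 3 named in the installation notes (3.10, 3.11, 3.12).
SUPPORTED_MINORS = (10, 11, 12)


def is_supported_python(version=None):
    """Return True if the given (major, minor, ...) version is 3.10, 3.11 or 3.12.

    Defaults to the running interpreter when no version is passed.
    """
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and minor in SUPPORTED_MINORS


def is_installed(module_name):
    """Return True if a top-level module with this name can be located.

    Pass a top-level name only: find_spec raises for a dotted name
    whose parent package is missing.
    """
    return importlib.util.find_spec(module_name) is not None


# Report the state of the current environment without raising.
print("supported Python:", is_supported_python())
print("docling2parquet found:", is_installed("dpk_docling2parquet"))
```

If the second line reports `False`, the transform has most likely not been installed into the active environment.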
Here we run the docling2parquet transform on its input data to
import PDF content into rows of a parquet file.
First, we download some sample data for the transform to run on, using the following Python code:
import os
import urllib.request

# Create the input folder if it does not already exist.
os.makedirs("input", exist_ok=True)
# Download the sample archive and PDF into the input folder.
urllib.request.urlretrieve("https://raw.githubusercontent.com/data-prep-kit/data-prep-kit/dev/transforms/language/docling2parquet/test-data/input/archive1.zip", "input/archive1.zip")
urllib.request.urlretrieve("https://raw.githubusercontent.com/data-prep-kit/data-prep-kit/dev/transforms/language/docling2parquet/test-data/input/redp5110-ch1.pdf", "input/redp5110-ch1.pdf")

% ls input
archive1.zip redp5110-ch1.pdf

Next we run docling2parquet on the data in the input folder.
python -m dpk_docling2parquet.runtime \
--data_local_config "{ 'input_folder': 'input', 'output_folder': 'output'}" \
--data_files_to_use "['.pdf', '.zip']"

Parquet files are generated in the designated output folder:
% ls output
archive1.parquet metadata.json redp5110-ch1.parquet

All transforms are runnable from the command line in the manner above.
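The command-line invocation above can also be driven from a Python script, which is convenient when the folders or file extensions vary between runs. The sketch below only assembles and launches the same command with the same two flags. The helpers `build_command`, `run_transform` and `list_outputs` are hypothetical names introduced for illustration, not part of Data Prep Kit, and it is an assumption that the flags accept the single-quoted literal strings that Python's `str()` produces for a dict and a list, as they appear in the command above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(module, input_folder, output_folder, extensions):
    """Assemble the argument list for `python -m <module>` with the two flags shown above."""
    # str() of a dict or list yields the single-quoted literal form used on the command line.
    local_config = {"input_folder": input_folder, "output_folder": output_folder}
    return [
        sys.executable, "-m", module,
        "--data_local_config", str(local_config),
        "--data_files_to_use", str(list(extensions)),
    ]


def run_transform(module, input_folder, output_folder, extensions):
    """Run the transform in a subprocess; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(module, input_folder, output_folder, extensions), check=True)


def list_outputs(output_folder):
    """Return the sorted names of the parquet files found in output_folder."""
    return sorted(p.name for p in Path(output_folder).glob("*.parquet"))
```

Under these assumptions, calling `run_transform("dpk_docling2parquet.runtime", "input", "output", [".pdf", ".zip"])` corresponds to the command shown earlier, and `list_outputs("output")` then returns the names of the generated parquet files. Passing the arguments as a list avoids shell quoting problems with the nested quotes.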