SafeAgree-OPP115Loader is a data processing pipeline designed to parse, format, and upload the OPP-115 (Online Privacy Policies) dataset into a structured Hugging Face dataset.
This repository was used to generate the official dataset available at: exiort/SafeAgree-OPP115.
The tool processes raw HTML privacy policies, aligns them with their respective CSV annotations, and formats the output into a JSON-string target suitable for training and fine-tuning Large Language Models (LLMs).
## Features

- **Automated Parsing:** Uses `BeautifulSoup` to extract clean text from sanitized HTML policy segments.
- **Data Alignment:** Matches sanitized policies with their `annotations` and `pretty_print` files to ensure accurate feature mapping.
- **HF Dataset Conversion:** Converts the processed segments into a Hugging Face `Dataset` object containing `input_text` and `target_json_string` features, with optional `policy_id` and `segment_id` metadata.
- **Automated Splits:** Automatically splits the dataset into `train` (80%), `validation` (10%), and `test` (10%) sets using a fixed seed (31) before uploading.
- **Environment Validation:** Includes a robust validation script (`environment_validation_check.py`) to verify your local LLM training environment (checking PyTorch, CUDA, Unsloth, bitsandbytes, and xformers).
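As a rough illustration of the parsing and conversion steps, one sanitized HTML segment plus its annotation can be reduced to a dataset row. This is a sketch, not the repository's actual code — it uses only the standard library's `html.parser` to mimic what `BeautifulSoup.get_text()` does, and the function name, annotation contents, and ID values are illustrative:

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content while ignoring tags (roughly what get_text() does)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def segment_to_record(segment_html, annotation, policy_id, segment_id):
    """Turn one sanitized HTML segment and its annotation into a dataset row."""
    parser = _TextExtractor()
    parser.feed(segment_html)
    # Collapse whitespace so the input text is a single clean line.
    input_text = " ".join("".join(parser.chunks).split())
    return {
        "input_text": input_text,
        # The annotation is serialized to a JSON string, matching the
        # target_json_string feature described above.
        "target_json_string": json.dumps(annotation, ensure_ascii=False),
        "policy_id": policy_id,
        "segment_id": segment_id,
    }


record = segment_to_record(
    "<p>We may share your <b>email address</b> with partners.</p>",
    {"category": "First Party Collection/Use"},
    policy_id=7, segment_id=0,
)
# record["input_text"] == "We may share your email address with partners."
```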
## Installation

Ensure you have the required Python packages installed:

```bash
pip install -r requirements.txt
```

For the `load_opp115` command to work, your base data directory must contain the following three subdirectories:

```
annotations/
sanitized_policies/
pretty_print/
```
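A quick way to confirm the layout before running the loader — an illustrative helper, not part of the repository:

```python
from pathlib import Path

REQUIRED_SUBDIRS = ("annotations", "sanitized_policies", "pretty_print")


def check_base_path(base_path):
    """Return the names of required subdirectories missing under base_path."""
    base = Path(base_path)
    return [name for name in REQUIRED_SUBDIRS if not (base / name).is_dir()]
```

An empty return value means the base directory is ready to be passed to `load_opp115`.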
## Usage

The `main.py` script acts as the entry point and accepts two primary commands: `load_opp115` and `upload_opp115`.
### 1. Load the Dataset

Run the loader command to parse the raw files and save them as a Hugging Face dataset to your local disk:

```bash
python main.py load_opp115
```

Prompts:

- `BasePath`: The path to the folder containing the required subdirectories.
- `SavePath`: The destination path where the Hugging Face dataset will be saved.
- `IncludeMetadata (Y/N)`: Choose 'Y' to include `policy_id` and `segment_id` in the output features.
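Each row of the saved dataset pairs a cleaned segment with its serialized annotation. A sketch of what one row looks like when metadata is included — the field values here are invented for illustration; only the field names come from the feature list above:

```python
import json

row = {
    "input_text": "We may share your email address with partners.",
    # The label structure is serialized as a JSON string target.
    "target_json_string": json.dumps(
        {"category": "Third Party Sharing/Collection"}, ensure_ascii=False
    ),
    "policy_id": 20,
    "segment_id": 3,
}

# The target can be parsed back into structured labels during training.
labels = json.loads(row["target_json_string"])
```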
### 2. Upload the Dataset

Run the upload command to apply train/validation/test splits and push the dataset to a Hugging Face repository:

```bash
python main.py upload_opp115
```

Prompts:

- `DatasetPath`: The local path where you saved the dataset in step 1.
- `RepoID`: Your Hugging Face repository ID (e.g., `your-username/SafeAgree-OPP115`).
- `Token`: Your Hugging Face write access token.
- `CommitMessage`: A message for the commit.
- `IsPrivate (Y/N)`: Whether the repository should be private.
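The 80/10/10 split with the fixed seed (31) can be sketched without the `datasets` API. The tool itself presumably uses the library's own splitting utilities; this stdlib version only demonstrates the proportions and the determinism that a fixed seed buys:

```python
import random


def split_indices(n, seed=31):
    """Shuffle row indices deterministically, then carve out 80/10/10 splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed => reproducible shuffle
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return {
        "train": idx[:n_train],
        "validation": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }


splits = split_indices(1000)
# len(splits["train"]) == 800, len(splits["validation"]) == 100, len(splits["test"]) == 100
```

Because the seed is fixed, re-running the split on the same data always yields the same partition, which keeps uploaded dataset versions comparable.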
## Environment Validation

If you plan to use the resulting dataset to fine-tune an LLM locally using Unsloth or PEFT, run the validation script to ensure your CUDA environment and dependencies are correctly configured.
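The kind of dependency probing such a script performs can be sketched with the standard library alone — this is illustrative and not the actual `environment_validation_check.py`, which additionally inspects CUDA availability through PyTorch:

```python
import importlib.util


def probe(modules=("torch", "unsloth", "bitsandbytes", "xformers")):
    """Report which training-stack modules are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, ok in probe().items():
    print(f"{name}: {'OK' if ok else 'MISSING'}")
```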
```bash
python environment_validation_check.py
```

## License

This project is licensed under the MIT License. Copyright (c) 2026 Buğrahan İmal. You are free to use, copy, modify, merge, publish, and distribute this software in accordance with the license conditions.