Error in grafting ocr'd text back into PDF #336

@duckduckgrayduck

What I did:
I'm adding the ability to optionally force OCR on documents after uploading them via URL in the Python wrapper.
urls = [
    "https://www.chicago.gov/content/dam/city/depts/dcd/tif/24reports/T_063_CanalCongressAR24.pdf",
    "https://www.chicago.gov/content/dam/city/depts/dcd/tif/24reports/T_072_24thMichiganAR24.pdf",
]
uploaded_docs = client.documents.upload_urls(urls, force_ocr=True, ocr_engine="textract")

One document succeeded just fine; the other encountered this error:
https://muckrock.sentry.io/issues/6895388692/?alert_rule_id=1010155&alert_timestamp=1758569111535&alert_type=email&notification_uuid=b002be3c-4651-4b4e-98b1-3e929498be0d&project=2873549

https://www.documentcloud.org/documents/26105901-t_063_canalcongressar24/

This left the document in a failed state, which could probably be handled better, for example by presenting a working version of the PDF even when grafting the OCR'd text fails.
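
A rough sketch of what that fallback might look like, not the actual processing code. `graft_with_fallback` and the `graft` callable are hypothetical names used only for illustration; `graft` stands in for whatever step embeds the OCR text layer into the PDF. The idea is simply that a grafting error should return the untouched upload instead of leaving the document with no usable PDF.

```python
from typing import Callable, Tuple


def graft_with_fallback(
    original_pdf: bytes,
    graft: Callable[[bytes], bytes],
) -> Tuple[bytes, bool]:
    """Return (pdf_bytes, grafted_ok).

    `graft` is a hypothetical stand-in for the step that embeds OCR text
    into the PDF; it is not a real DocumentCloud function.
    """
    try:
        # Happy path: the OCR text layer was embedded successfully.
        return graft(original_pdf), True
    except Exception:
        # Grafting failed: fall back to the original upload so the
        # document still has a working PDF, and report the failure.
        return original_pdf, False
```

The boolean lets the caller still log the error (for example to Sentry) and mark the OCR step as failed without marking the whole document as failed.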
