Scrape PDFs for Bill Text #2081

**Summary**  
There are 236 bills in the current session that do not have the `DocumentText` field available through the MA Legislature Document API. That does not mean these bills have no text; it means the text is only available in the PDF that the legislature provides for each bill.

For these bills, we should:
* Explore the data (for bills where `content.DocumentText` is null)
  * As best we can, categorize and document the cases where `DocumentText` is null
* Try to scrape the bill text from the PDF (if `DocumentText` is not available through `api.getDocument`)
* Re-run the LLM summarizer/tagger on these bills once `DocumentText` is available
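
As a rough illustration of the fallback described in the list above, here is a minimal TypeScript sketch. Only `content.DocumentText` comes from this issue; the `Bill` shape, the `PdfTextExtractor` type, and the function names are hypothetical. The PDF extractor is injected so the sketch does not commit to any particular library, and treating an empty string the same as null is an assumption we would want to confirm during exploration.

```typescript
// Hypothetical bill shape: only `content.DocumentText` is taken from the issue.
interface Bill {
  id: string;
  content: { DocumentText?: string | null };
}

// Hypothetical extractor signature, injected so no PDF library is assumed here.
type PdfTextExtractor = (billId: string) => Promise<string | null>;

type TextSource = "api" | "pdf" | "none";

// True when the API did not give us usable text.
// Assumption: blank strings count as missing, same as null/undefined.
function needsPdfText(bill: Bill): boolean {
  const text = bill.content.DocumentText;
  return text == null || text.trim() === "";
}

// Prefer the API text; fall back to the PDF only when the API text is missing.
async function resolveBillText(
  bill: Bill,
  extractFromPdf: PdfTextExtractor
): Promise<{ text: string | null; source: TextSource }> {
  if (!needsPdfText(bill)) {
    return { text: bill.content.DocumentText as string, source: "api" };
  }
  const pdfText = await extractFromPdf(bill.id);
  if (pdfText != null && pdfText.trim() !== "") {
    return { text: pdfText, source: "pdf" };
  }
  return { text: null, source: "none" };
}
```

Recording the `source` alongside the text would let us re-run the summarizer/tagger on only the bills whose text came from a PDF.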

Some exploratory work is needed to learn how well we can extract text and which PDF formats are in play, but I think our chances are good here. I suspect there are at least two types of PDFs and likely more (I've seen omnibus spending bills and Ballot Initiative bills in cursory exploration).
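
One way to put numbers on "how well we can extract text" during exploration is a crude quality check on whatever text an extractor returns. The sketch below is only a suggestion: the function name, the report fields, and both thresholds are hypothetical and would need tuning against real bills. It counts ASCII letters only, which seems reasonable for English bill text but is also an assumption.

```typescript
// Hypothetical report shape for comparing extraction results across PDF types.
interface ExtractionReport {
  charCount: number;   // non-whitespace characters extracted
  letterRatio: number; // share of non-whitespace characters that are ASCII letters
  looksUsable: boolean;
}

// Flags extractions that are suspiciously short or mostly non-letters
// (for example an image-only PDF that yields little or no text).
// Both default thresholds are placeholders, not measured values.
function assessExtraction(
  text: string,
  minChars: number = 200,
  minLetterRatio: number = 0.6
): ExtractionReport {
  const nonWhitespace = text.replace(/\s+/g, "");
  const letters = (nonWhitespace.match(/[A-Za-z]/g) || []).length;
  const letterRatio =
    nonWhitespace.length === 0 ? 0 : letters / nonWhitespace.length;
  return {
    charCount: nonWhitespace.length,
    letterRatio,
    looksUsable:
      nonWhitespace.length >= minChars && letterRatio >= minLetterRatio,
  };
}
```

Logging this report per bill would give us a quick table for categorizing the PDF formats we run into.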

## Additional Resources
- Nathan pulled together a Python script that handles this for Ballot Initiative bills using pdfplumber; we can likely get something similar in TypeScript with a suitable library: https://github.com/nesanders/ma_ballot_bill_text_extraction
- You can query Firestore in `generalCourts/194/bills` to find all of the bills in question, but here are a few bill ids to get the exploration started: `H1`, `H18`, `H4787`, `H5008`, and `S2539`. Note that `H5008` no longer has a null `DocumentText` in Firestore: I manually overrode it because it is one of the Ballot Initiative bills, but the API will still reflect the null `DocumentText`.
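
For the Firestore lookup mentioned above, a query along these lines might work. This is a sketch under assumptions: the `*Like` interfaces are hypothetical stand-ins that cover only the `collection().where().get()` chain, so the example type-checks without importing the SDK, and a real Firestore instance should be structurally compatible with them. It also assumes that document ids are the bill ids and that the field is stored as an explicit null; an equality filter on null does not match documents where the field is absent altogether.

```typescript
// Hypothetical minimal types covering only the Firestore calls used below.
interface DocSnapshotLike {
  id: string;
}
interface QuerySnapshotLike {
  docs: DocSnapshotLike[];
}
interface QueryLike {
  get(): Promise<QuerySnapshotLike>;
}
interface CollectionLike {
  where(field: string, op: "==", value: unknown): QueryLike;
}
interface FirestoreLike {
  collection(path: string): CollectionLike;
}

// Collection path taken from the issue.
const BILLS_PATH = "generalCourts/194/bills";

// Returns the sorted ids of bills whose `content.DocumentText` is explicitly null.
// Assumption: the Firestore document id is the bill id (e.g. "H1").
async function findBillIdsMissingText(db: FirestoreLike): Promise<string[]> {
  const snapshot = await db
    .collection(BILLS_PATH)
    .where("content.DocumentText", "==", null)
    .get();
  return snapshot.docs.map((doc) => doc.id).sort();
}
```

If some bills turn out to be missing the field entirely rather than holding a null, we would need to fetch the collection and filter client-side instead.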
