Scrape PDFs for Bill Text #2081

@Mephistic

Description

Summary
There are 236 bills in the current session that do not have the DocumentText field available through the MA Legislature Document API. That does not mean these bills have no text; it means the text is only available through the PDF of the bill that the legislature provides.

For these bills, we should:

  • Explore the data (for bills where content.DocumentText is null)
    • As best we can, categorize and document the cases where DocumentText is null
  • Try to scrape the bill text from the PDF (if DocumentText is not available through api.getDocument)
  • Re-run the LLM summarizer/tagger on these bills once DocumentText is available
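
A minimal sketch of the decision behind the second and third bullets, assuming a simplified bill shape. `BillRecord`, `TextPlan`, `hasDocumentText`, and `planTextSource` are hypothetical names for illustration, not existing code in the repo, and treating a whitespace-only DocumentText as missing is an assumption that the exploration step should confirm.

```typescript
// Hypothetical, simplified bill shape: only the fields this sketch needs.
interface BillRecord {
  id: string;
  content: { DocumentText?: string | null };
}

// Either we already have text from the API, or the bill needs a PDF scrape.
type TextPlan =
  | { billId: string; source: "api"; text: string }
  | { billId: string; source: "pdf" };

// Assumption: null, undefined, and whitespace-only text all count as missing.
function hasDocumentText(bill: BillRecord): boolean {
  const text = bill.content.DocumentText;
  return typeof text === "string" && text.trim().length > 0;
}

// Prefer the API text; fall back to the PDF only when the API has nothing.
function planTextSource(bill: BillRecord): TextPlan {
  if (hasDocumentText(bill)) {
    return { billId: bill.id, source: "api", text: bill.content.DocumentText as string };
  }
  return { billId: bill.id, source: "pdf" };
}
```

Bills planned as `"pdf"` are the ones that would go through PDF extraction and then be queued for the summarizer/tagger again. Keeping the decision separate from the extraction itself makes it easy to test without downloading any PDFs.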

There is a little exploratory work to do on how well we can extract text and what formats are in play, but I think our chances are good here. I suspect there are at least two types of PDFs and likely more (I've seen omnibus spending bills and Ballot Initiative bills in cursory exploration).

Additional Resources

  • Nathan pulled together a Python script that handles this for Ballot Initiative bills using PdfPlumber. We can likely get something similar in TypeScript with a suitable library: https://github.com/nesanders/ma_ballot_bill_text_extraction
  • You can query Firestore in generalCourts/194/bills to find all of the bills in question, but here are a few bill ids to get the exploration started: H1, H18, H4787, H5008 (no longer affected: I manually overrode this one because it is one of the Ballot Initiative bills, but the API will still reflect the null DocumentText), and S2539.
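
As a possible starting point for the exploration, here is a sketch that takes bill records already loaded from generalCourts/194/bills and groups the ones with a null DocumentText by the letter prefix of their id (H for House, S for Senate). `StoredBill` and `groupMissingTextByPrefix` are hypothetical names, and grouping by prefix is only one assumed first cut at categorizing the cases; document type (omnibus spending, Ballot Initiative, and so on) would need a closer look at the PDFs themselves.

```typescript
// Hypothetical, simplified shape of a bill document read from Firestore.
interface StoredBill {
  id: string;
  content: { DocumentText?: string | null };
}

// Collect ids of bills with no DocumentText, keyed by the id's letter prefix.
function groupMissingTextByPrefix(bills: StoredBill[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const bill of bills) {
    // `!= null` is false for both null and undefined, so only those are kept.
    if (bill.content.DocumentText != null) continue;
    const match = bill.id.match(/^[A-Za-z]+/);
    const prefix = match ? match[0] : "unknown";
    if (!groups[prefix]) groups[prefix] = [];
    groups[prefix].push(bill.id);
  }
  return groups;
}
```

Called on the sample ids above, this would put H1, H18, and H4787 under `H` and S2539 under `S`, assuming each still has a null DocumentText in Firestore.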

Metadata

Labels

backend (Backend Development), scraper (Backend work related to content scraping)
