Summary
There are 236 bills in the current session that do not have the DocumentText field available through the MA Legislature Document API. That does not mean these bills have no text; it means the text is only available in the PDF of the bill that the legislature provides.
For these bills, we should:
- Explore the data (for bills where content.DocumentText is null)
- As best we can, categorize and document the cases where DocumentText is null
- Try to scrape the bill text from the PDF (if DocumentText is not available through api.getDocument)
- Re-run the LLM summarizer/tagger on these bills once DocumentText is available
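As a starting point for the exploration step, here is a minimal sketch of how the affected bills could be picked out and tallied. The Bill shape, the helper names, and the sample records are assumptions for illustration; only the content.DocumentText field and the bill ids come from this issue.

```typescript
// Hypothetical minimal shape of a stored bill. Only `content.DocumentText`
// is taken from the issue; everything else is assumed for illustration.
interface Bill {
  id: string;
  content: { DocumentText?: string | null };
}

// True when there is no usable text: null, missing, or whitespace only.
function isMissingText(bill: Bill): boolean {
  const text = bill.content.DocumentText;
  return text == null || text.trim() === "";
}

// Count the affected bills by the leading letters of the bill id (e.g. "H", "S").
function tallyByPrefix(bills: Bill[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const bill of bills.filter(isMissingText)) {
    const prefix = bill.id.match(/^[A-Za-z]+/)?.[0] ?? "unknown";
    counts[prefix] = (counts[prefix] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample records, not real API output.
const sample: Bill[] = [
  { id: "H1", content: { DocumentText: null } },
  { id: "H18", content: {} },
  { id: "S2539", content: { DocumentText: "  " } },
  { id: "H100", content: { DocumentText: "Some bill text" } },
];
const tally = tallyByPrefix(sample);
console.log(tally);
```

A tally by id prefix is only a first cut; the real categories (omnibus spending bills, Ballot Initiative bills, and so on) would need to come out of looking at the PDFs themselves.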
There is a little exploratory work in how well we can extract text and what formats are in play, but I think our chances are good here. I suspect there are at least two types of PDFs, and likely more: in cursory exploration I've seen omnibus spending bills and Ballot Initiative bills.
Additional Resources
- Nathan pulled together a script to handle this for Ballot Initiative bills in Python using pdfplumber - we can likely get something similar in TypeScript with a suitable library: https://github.com/nesanders/ma_ballot_bill_text_extraction
- You can query Firestore in generalCourts/194/bills to find all of the bills in question, but here are a few bill ids to get the exploration started: H1, H18, H4787, H5008 (no longer null in Firestore - I manually overrode this one because it is one of the Ballot Initiative bills - but the API will still reflect the null DocumentText), and S2539.
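The Firestore lookup could be sketched roughly as below. This assumes each document in generalCourts/194/bills is keyed by bill id and stores the API response under content; the DocLike and DbLike interfaces and both function names are hypothetical stand-ins so the sketch runs without the SDK, though a firebase-admin Firestore instance should fit the same shape.

```typescript
// Structural stand-ins for the parts of a Firestore snapshot used here.
// These interfaces are assumptions, not types from the project or the SDK.
interface DocLike {
  id: string;
  data(): { content?: { DocumentText?: string | null } };
}
interface DbLike {
  collection(path: string): { get(): Promise<{ docs: DocLike[] }> };
}

// Ids of documents whose content.DocumentText is null, missing, or empty.
function idsMissingText(docs: DocLike[]): string[] {
  return docs.filter((d) => !d.data().content?.DocumentText).map((d) => d.id);
}

// Reads the whole collection and filters client-side, so bills where the
// field is absent are caught as well as bills with an explicit null.
async function findBillsMissingText(db: DbLike, court = 194): Promise<string[]> {
  const snapshot = await db.collection(`generalCourts/${court}/bills`).get();
  return idsMissingText(snapshot.docs);
}
```

Filtering client-side is a deliberate choice here: an equality query on null would skip documents where the field was never written, and it is not yet clear which of the two cases the stored bills fall into.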