feat: add article parsing and handling to timeline and scraper modules #195

Open
fibonacci998 wants to merge 2 commits into the-convocation:main from fibonacci998:feat/article-parsing

  • feat: add article parsing and handling to timeline and scraper modules
  • test: add article parsing coverage to tweets.test.ts

fibonacci998 and others added 2 commits May 25, 2026 09:48
Ports the article-extraction work from PR the-convocation#146 (LiamVDB1) onto current
main. That PR has been drifting behind main since 2025-07-11 and never received
its requested tests. This commit applies the same diff cleanly; follow-up
commits add the tests and drop the unrelated `prepare` script change that
broke CI for downstream consumers.

Adds support for X "Articles" (long-form posts) inside the timeline
data structure:

* `ArticleRaw`, `ArticleResultRaw`, `ArticleContentStateRaw` interfaces
  in src/timeline-v1.ts representing the raw article payload, including
  metadata, media, and content state.
* `parseArticleToMarkdown` and `parseArticle` in src/timeline-v2.ts that
  walk `content_state.blocks` and produce markdown (handling text,
  links, bold/italic, headers, lists, and inline media).
* `parseResult` now detects `result.article.article_results.result` and,
  when present, sets `tweet.isArticle = true`, populates `tweet.article`,
  and overwrites `tweet.text` with the rendered markdown (since
  `legacy.full_text` for an Article tweet is just the t.co URL stub).
* `Tweet` interface gains optional `isArticle` and `article` fields.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
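
The detection and rendering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `*Sketch` types, `blockToMarkdown`, `articleToMarkdown`, and `applyArticle` are hypothetical names, the block type strings (`header-one`, `ordered-list-item`, and so on) are assumed to follow Draft.js conventions, and prepending the title as an H1 is inferred from the test description. Inline styles, links, and media are left out.

```typescript
// Hypothetical, simplified shapes. The PR's ArticleRaw / ArticleResultRaw /
// ArticleContentStateRaw interfaces carry more fields (metadata, media, entities).
interface BlockSketch {
  type: string; // assumed Draft.js-style names, e.g. "header-one"
  text: string;
}

interface ArticleRawSketch {
  rest_id: string;
  title?: string;
  content_state?: { blocks: BlockSketch[] };
}

interface ResultSketch {
  article?: { article_results?: { result?: ArticleRawSketch } };
}

interface TweetSketch {
  id: string;
  text: string;
  isArticle?: boolean;
  article?: { id: string; title?: string; content_state?: { blocks: BlockSketch[] } };
}

// Map a single block to one markdown line.
function blockToMarkdown(block: BlockSketch, orderedIndex: number): string {
  switch (block.type) {
    case "header-one":
      return `# ${block.text}`;
    case "header-two":
      return `## ${block.text}`;
    case "unordered-list-item":
      return `- ${block.text}`;
    case "ordered-list-item":
      return `${orderedIndex}. ${block.text}`;
    default:
      return block.text; // plain paragraph
  }
}

// Render the whole article: title as H1 (assumption), then every block in order.
function articleToMarkdown(article: ArticleRawSketch): string {
  const lines: string[] = [];
  if (article.title) lines.push(`# ${article.title}`);
  let ordered = 0; // running counter, reset whenever an ordered list ends
  for (const block of article.content_state?.blocks ?? []) {
    ordered = block.type === "ordered-list-item" ? ordered + 1 : 0;
    lines.push(blockToMarkdown(block, ordered));
  }
  return lines.join("\n\n");
}

// If the raw result carries an article, flag the tweet and replace its text;
// otherwise return the tweet unchanged.
function applyArticle(tweet: TweetSketch, result: ResultSketch): TweetSketch {
  const raw = result.article?.article_results?.result;
  if (!raw) return tweet;
  return {
    ...tweet,
    isArticle: true,
    article: { id: raw.rest_id, title: raw.title, content_state: raw.content_state },
    text: articleToMarkdown(raw),
  };
}
```

Returning the tweet untouched when no article is present keeps ordinary tweets on their existing `legacy.full_text` path, which matches the "when present" wording in the commit message.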

Addresses karashiiro's review request on PR the-convocation#146. Adds two tests
against the public article tweet 2053808119709659225 (subnetamplify):

* isArticle flag is set, article.id matches the article rest_id (not the
  tweet id — they are distinct), and content_state is populated.
* tweet.text is replaced with the rendered markdown body, far larger
  than the t.co URL stub and starting with an H1 of the article title.

Co-authored-by: LiamVDB1 <liam.van.den.berge@hotmail.com>
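
The expectations in the two tests above can be expressed as a small checker. This is a sketch only: `checkArticleTweet`, `ArticleTweetSketch`, and the `STUB_MAX_LENGTH` threshold are hypothetical, and the actual tests in tweets.test.ts fetch the live tweet through the scraper and may assert differently.

```typescript
// Hypothetical subset of the Tweet fields the two tests look at.
interface ArticleTweetSketch {
  id?: string;
  text?: string;
  isArticle?: boolean;
  article?: { id?: string; content_state?: { blocks?: unknown[] } };
}

// Assumed cut-off: anything this short is treated as an unrendered t.co stub.
const STUB_MAX_LENGTH = 100;

// Returns the list of violated expectations; an empty list means the tweet passes.
function checkArticleTweet(
  tweet: ArticleTweetSketch,
  expectedArticleId: string,
  expectedTitle: string,
): string[] {
  const problems: string[] = [];
  if (tweet.isArticle !== true) problems.push("isArticle is not set");
  if (tweet.article?.id !== expectedArticleId) {
    problems.push("article.id does not match the article rest_id");
  }
  if (tweet.article?.id !== undefined && tweet.article.id === tweet.id) {
    problems.push("article.id equals the tweet id");
  }
  if (!tweet.article?.content_state?.blocks?.length) {
    problems.push("content_state is empty");
  }
  if (!tweet.text?.startsWith(`# ${expectedTitle}`)) {
    problems.push("text does not start with an H1 of the title");
  }
  if ((tweet.text?.length ?? 0) <= STUB_MAX_LENGTH) {
    problems.push("text is no longer than a URL stub");
  }
  return problems;
}
```

Collecting every violated expectation, instead of stopping at the first, makes a failing run show at once whether the article was missed entirely or only the text replacement went wrong.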