domain-skills/facebook: add page-archival.md (full-preservation scraping) by thetimechain · Pull Request #385 · browser-use/browser-harness

thetimechain · 2026-05-22T12:24:21Z

Summary

Adds a third Facebook domain skill, page-archival.md, alongside the existing groups.md and pages.md. The existing two are tuned for monitoring (top-N recent posts + outbound URLs). This new one is tuned for full preservation of a Page: walk every reachable post permalink, visit each in permalink view, recursively expand every comment and reply thread, download every image, and emit one wiki-compatible markdown file per post.

Why this is a separate skill

pages.md and groups.md quite reasonably stop at feed-view harvesting since their workflows don't need comments + replies + images. But the moment the goal is preservation (admin's content might disappear), feed-view falls apart:

- FB virtualizes the feed; scrolled-past posts unmount from the DOM, taking their comment subtrees with them
- Feed-view "View N more comments" is a no-op stub; comment expansion only works on the dedicated permalink view
- Image <img src> in feed-view returns thumbnails; full-res is in the wrapping <a href>

So a Page archive needs a two-phase architecture: phase 1 walks the feed and captures only permalinks, phase 2 visits each permalink and does the deep extraction. Keeping that as its own skill avoids muddling the simpler pages.md flow.
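A minimal sketch of the two-phase shape, assuming the feed walk can be modelled as an iterable of per-scroll permalink batches. The names `collect_manifest`, `scrape_manifest`, and `scrape_one` are hypothetical and not taken from page-archival.md; the stop rule reuses the "8 consecutive empty scrolls" heuristic from the patterns list.

```python
from typing import Callable, Dict, Iterable, List


def collect_manifest(scroll_batches: Iterable[List[str]], max_empty: int = 8) -> List[str]:
    """Phase 1: gather unique permalinks in first-seen order.

    Stops once `max_empty` consecutive scrolls contribute no new permalink.
    """
    seen: Dict[str, None] = {}  # dict keeps insertion order and dedupes
    empty_streak = 0
    for batch in scroll_batches:
        fresh = [url for url in batch if url not in seen]
        for url in fresh:
            seen[url] = None
        empty_streak = 0 if fresh else empty_streak + 1
        if empty_streak >= max_empty:
            break
    return list(seen)


def scrape_manifest(manifest: List[str], scrape_one: Callable[[str], object]) -> Dict[str, object]:
    """Phase 2: visit each permalink independently of the feed."""
    return {url: scrape_one(url) for url in manifest}
```

Because phase 2 only depends on the saved manifest, a failed permalink can be retried without re-walking the feed.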

Patterns captured

All field-tested against a long-running community Page (~63 posts, ~268 comments, ~108 images preserved, zero account checkpoints through a 581-URL run):

- Two-phase manifest-then-scrape: explained as a table with its own "why"
- Vanity-scoped permalink filter (a[href*="/{vanity}/posts/pfbid"]): unscoped pfbid matches leak in ~30% notification/recommendation pollution
- The "Chats" <h1> misdirection: FB's chat widget renders its own <h1>, so a global h1 query returns "Chats" instead of the Page name; filter inside [role="main"]
- Recursive expand_all_once() loop: tight regex covering "See more / View N comments / View N replies / Show more replies"; idempotent passes until zero clicks, capped at 30
- Comment depth from DOM nesting: counts ancestor div[role="article"] elements between the comment and the post article (indentation pixels are unreliable)
- Image filter: exclude emoji.php, drop width < 60 to avoid avatars, reaction icons, and tracking pixels
- Subprocess-per-post stale-daemon recovery: long sessions stress the daemon; one harness invocation per permalink is more robust, with ensure_real_tab() between ops as the lighter-weight option
- Pacing: 30s pause every 50 scrolls in the manifest phase; 60s every 25 permalinks in the scrape phase
- "/posts → /" 302 quirk documented; use the bare Page URL
- End-of-feed via 8 consecutive empty scrolls: FB rarely renders an explicit "no more posts" sentinel
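The expansion loop from the list above could look roughly like this. It is a sketch under assumptions: the regex is a guess at the label wording quoted in the list, not the pattern shipped in the skill, and `expand_all_once` is passed in as a stand-in callable that returns how many expanders it clicked on one pass.

```python
import re
from typing import Callable

# Assumed label shapes: "See more", "View 12 comments", "View 3 more comments",
# "View 1 reply", "Show more replies". Anchored so unrelated links do not match.
EXPAND_LABEL = re.compile(
    r"^(?:see more|view \d+ (?:more )?(?:comments?|repl(?:y|ies))|show more replies)$",
    re.IGNORECASE,
)


def expand_until_stable(expand_all_once: Callable[[], int], max_passes: int = 30) -> int:
    """Repeat single expansion passes until one clicks nothing, or the cap is hit.

    Returns the total number of clicks across all passes.
    """
    total = 0
    for _ in range(max_passes):
        clicked = expand_all_once()
        if clicked == 0:
            break
        total += clicked
    return total
```

The cap matters because a label that never disappears after a click would otherwise keep the loop running forever.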

What I deliberately didn't include

- Pixel coordinates (break on layout)
- Run narration (the specific Page archived)
- Secrets, cookies, user-specific state
- A monolithic single-harness-call architecture (subprocess-per-post is what survived stale-daemon failures during the run)
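For illustration, subprocess-per-post can be sketched as below. The real harness command line is not shown in this PR, so `build_cmd` is a hypothetical callback that returns the argv for one permalink; the 60s-every-25 pacing is taken from the patterns list, and `sleep` is injectable so the pacing can be exercised without waiting.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, List


def run_per_post(
    permalinks: List[str],
    build_cmd: Callable[[str], List[str]],
    timeout_s: float = 300.0,
    pause_every: int = 25,
    pause_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Run one fresh process per permalink so a wedged run cannot poison the next."""
    results: Dict[str, bool] = {}
    for index, url in enumerate(permalinks, start=1):
        try:
            proc = subprocess.run(
                build_cmd(url), capture_output=True, text=True, timeout=timeout_s
            )
            results[url] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[url] = False  # record the failure and move on
        # Long pause after every `pause_every` permalinks, except after the last one.
        if index % pause_every == 0 and index < len(permalinks):
            sleep(pause_s)
    return results
```

Permalinks that map to `False` can be collected into a retry list once the main run finishes.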

Verification

Run `rg --files agent-workspace/domain-skills/facebook` and confirm the three skills coexist. The new file mirrors the style + section ordering of pages.md (URL patterns table, DOM anchors table, scrolling pattern, decoder helpers, rate-limit discipline, self-inspection block, full Python example, gotchas log).

Happy to iterate on tone or trim sections if the maintainers prefer a leaner skill.

Summary by cubic

Adds the Facebook domain skill page-archival.md to fully preserve a Page by building a permalink manifest, scraping each post in permalink view, expanding all comments/replies, downloading images, and writing one markdown file per post. Complements pages.md and groups.md, which focus on monitoring.
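The image-download step relies on the filter described in the PR body (skip emoji.php, drop anything narrower than 60px). A minimal sketch of that predicate, assuming each candidate image has already been reduced to its `src` string and rendered width; the function name `keep_image` is hypothetical:

```python
def keep_image(src: str, width: int, min_width: int = 60) -> bool:
    """Return True for images worth archiving.

    Skips emoji sprites (served via emoji.php) and anything narrower than
    `min_width`, which per the PR covers avatars, reaction icons, and
    tracking pixels.
    """
    if "emoji.php" in src:
        return False
    return width >= min_width
```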

New Features
- Adds agent-workspace/domain-skills/facebook/page-archival.md documenting end-to-end Page archival.
- Two-phase flow: manifest → scrape, with vanity-scoped permalink filtering to avoid off-Page noise.
- Recursive comment/reply expansion with capped passes; depth computed from DOM nesting.
- Image extraction with filters (emoji.php, width ≥ 60) and per-post markdown output.
- Practical pacing and reliability guidance (scroll/load pauses, long breaks, subprocess-per-post or ensure_real_tab()), plus a self-inspection block and a full Python example.
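To make "depth computed from DOM nesting" concrete, here is a small model of the idea using a stand-in `Node` type rather than a real DOM. The zero-based depth (top-level comment = 0) is an assumption; the skill may number depths differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """Stand-in for a DOM element: only its role and parent pointer matter here."""
    role: Optional[str] = None
    parent: Optional["Node"] = None


def comment_depth(comment: Node, post_article: Node) -> int:
    """Count role="article" ancestors strictly between the comment and the post."""
    depth = 0
    node = comment.parent
    while node is not None and node is not post_article:
        if node.role == "article":
            depth += 1
        node = node.parent
    return depth
```

Wrapper elements without the article role are walked through without affecting the count, which is why this holds up better than measuring indentation.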

Written for commit 303ae0f. Summary will update on new commits.

…raping

Third Facebook domain skill alongside groups.md and pages.md. Where those two are tuned for monitoring (top-N recent posts + outbound URLs), this one is tuned for full preservation of a Page:

- walk every reachable post permalink
- visit each one in permalink view (not feed view)
- recursively expand every comment and reply thread
- download every image
- emit one wiki-compatible markdown file per post

Captures patterns from a live archival session against a long-running community Page (~63 posts, every comment + reply + image preserved, zero account checkpoints). Most operationally important findings:

- Two-phase manifest-then-scrape: feed virtualization makes comment expansion impossible in feed view; permalink view is required.
- Vanity-scoped permalink filter (a[href*="/{vanity}/posts/pfbid"]): unscoped pfbid matches leak in ~30% notification/recommendation pollution.
- Comment depth from DOM nesting: each comment/reply is its own div[role="article"]; indentation pixels are unreliable.
- 30s pause every 50 scrolls: pacing floor that kept the test account un-checkpointed through a 581-URL run.

No pixel coordinates, no user-specific narration, no secrets.
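The vanity-scoped filter can also be applied to already-harvested hrefs on the Python side. This is a sketch, not the skill's code: the function names are hypothetical, the prefix check is stricter than the CSS substring selector, and dropping the query string for de-duplication assumes the pfbid token in the path is enough to identify a post.

```python
from urllib.parse import urlsplit, urlunsplit


def is_page_permalink(href: str, vanity: str) -> bool:
    """True only for post permalinks that live under the given Page vanity."""
    return urlsplit(href).path.startswith(f"/{vanity}/posts/pfbid")


def canonical_permalink(href: str) -> str:
    """Drop query string, fragment, and trailing slash so duplicates collapse."""
    parts = urlsplit(href)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
```

Usage: keep `canonical_permalink(h)` for every harvested `h` where `is_page_permalink(h, vanity)` holds, and store the results in an ordered set as the manifest.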

cubic-dev-ai

1 issue found across 1 file

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="agent-workspace/domain-skills/facebook/page-archival.md">

<violation number="1" location="agent-workspace/domain-skills/facebook/page-archival.md:526">
P2: Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback `(blocks[0] && blocks.length === 1 ? blocks[0] : null)`, but the full example only uses `blocks.slice(1).join('\\n') || null`, which drops comment text when a comment renders with only one `div[dir="auto"]` block. Mirror the fallback from the standalone extraction snippet to prevent data loss.</violation>
</file>


cubic-dev-ai · 2026-05-22T12:37:51Z

+          const t = el.querySelector('a[href*="comment_id="]');
+          comments.push({
+            depth, author: blocks[0] || null, author_url: a?.href || null,
+            text: blocks.slice(1).join('\\n') || null,


P2: Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback (blocks[0] && blocks.length === 1 ? blocks[0] : null), but the full example only uses blocks.slice(1).join('\\n') || null, which drops comment text when a comment renders with only one div[dir="auto"] block. Mirror the fallback from the standalone extraction snippet to prevent data loss.
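The fallback the reviewer asks for, restated in Python for clarity. This mirrors only the two JavaScript expressions quoted above; the function name `split_comment_blocks` is hypothetical and the rest of the standalone snippet is not shown here, so treat it as a sketch of the intended behaviour.

```python
from typing import List, Optional, Tuple


def split_comment_blocks(blocks: List[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (author, text) from the div[dir="auto"] text blocks of one comment.

    Normal case: first block is the author, remaining blocks are the body.
    Single-block case: fall back to that lone block as the body instead of
    returning no text, which is the data loss flagged in the review.
    """
    author = blocks[0] if blocks and blocks[0] else None
    text = "\n".join(blocks[1:]) or None
    if text is None and len(blocks) == 1 and blocks[0]:
        text = blocks[0]
    return author, text
```

Note that in the single-block case the quoted expressions leave `author` and `text` pointing at the same string, so a consumer may want to flag such comments for review.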


cubic-dev-ai Bot reviewed May 22, 2026

thetimechain wants to merge 1 commit into browser-use:main from thetimechain:add-facebook-page-archival-skill
