domain-skills/facebook: add page-archival.md (full-preservation scraping)#385

Open
thetimechain wants to merge 1 commit into browser-use:main from thetimechain:add-facebook-page-archival-skill
Conversation


@thetimechain thetimechain commented May 22, 2026

Summary

Adds a third Facebook domain skill, page-archival.md, alongside the existing groups.md and pages.md. The existing two are tuned for monitoring (top-N recent posts + outbound URLs). This new one is tuned for full preservation of a Page: walk every reachable post permalink, visit each in permalink view, recursively expand every comment and reply thread, download every image, and emit one wiki-compatible markdown file per post.

Why this is a separate skill

pages.md and groups.md quite reasonably stop at feed-view harvesting since their workflows don't need comments + replies + images. But the moment the goal is preservation (admin's content might disappear), feed-view falls apart:

  • FB virtualizes the feed; scrolled-past posts unmount from the DOM, taking their comment subtrees with them
  • Feed-view "View N more comments" is a no-op stub; comment expansion only works on the dedicated permalink view
  • Image <img src> in feed-view returns thumbnails; full-res is in the wrapping <a href>

So a Page archive needs a two-phase architecture: phase 1 walks the feed and captures only permalinks, phase 2 visits each permalink and does the deep extraction. Keeping that as its own skill avoids muddling the simpler pages.md flow.
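A minimal sketch of phase 1 in Python, assuming the harness can be wrapped in a callable that scrolls once and returns the `href` values currently in the DOM. `scroll_once`, `scoped_permalinks`, and `build_manifest` are hypothetical names, not functions from the skill, and stripping the query string before de-duplicating is an assumption. The thresholds (stop after 8 consecutive empty scrolls, 30 s pause every 50 scrolls) are the ones listed under "Patterns captured".

```python
import time
from typing import Callable, Iterable
from urllib.parse import urlsplit, urlunsplit


def scoped_permalinks(hrefs: Iterable[str], vanity: str) -> list[str]:
    """Keep only this Page's post permalinks, de-duplicated in first-seen order."""
    needle = f"/{vanity}/posts/pfbid"
    seen: dict[str, None] = {}
    for href in hrefs:
        if needle not in href:
            continue  # link belongs to another Page (notification, recommendation)
        parts = urlsplit(href)
        # Assumption: query string and fragment are tracking noise, so drop them.
        clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        seen.setdefault(clean, None)
    return list(seen)


def build_manifest(
    scroll_once: Callable[[], Iterable[str]],  # hypothetical: scroll, return visible hrefs
    vanity: str,
    max_empty: int = 8,
    pause_every: int = 50,
    pause_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Scroll until `max_empty` consecutive scrolls add no new permalink."""
    manifest: dict[str, None] = {}
    empty_streak = 0
    scrolls = 0
    while empty_streak < max_empty:
        before = len(manifest)
        for url in scoped_permalinks(scroll_once(), vanity):
            manifest.setdefault(url, None)
        scrolls += 1
        empty_streak = 0 if len(manifest) > before else empty_streak + 1
        if scrolls % pause_every == 0:
            sleep(pause_seconds)  # pacing pause between scroll batches
    return list(manifest)
```

Phase 2 then iterates over the returned list and opens each URL in permalink view. Passing `sleep` as a parameter keeps the loop testable without real waits.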

Patterns captured

All field-tested against a long-running community Page (~63 posts, ~268 comments, ~108 images preserved, zero account checkpoints through a 581-URL run):

  1. Two-phase manifest-then-scrape — explained as a table with its own "why"
  2. Vanity-scoped permalink filter (a[href*="/{vanity}/posts/pfbid"]) — unscoped pfbid matches leak ~30% notification/recommendation pollution
  3. The "Chats" <h1> misdirection — FB's chat widget renders its own <h1>, so global h1 returns "Chats" instead of the Page name; filter inside [role="main"]
  4. Recursive expand_all_once() loop — tight regex covering "See more / View N comments / View N replies / Show more replies"; idempotent passes until zero clicks, capped at 30
  5. Comment depth from DOM nesting — counts ancestor div[role="article"] between comment and post article (indentation pixels are unreliable)
  6. Image filter — exclude emoji.php, drop width < 60 to avoid avatars + reaction icons + tracking pixels
  7. Subprocess-per-post stale-daemon recovery — long sessions stress the daemon; one harness invocation per permalink is more robust, with ensure_real_tab() between ops as the lighter-weight option
  8. 30s pause every 50 scrolls in the manifest phase; 60s every 25 permalinks in the scrape phase
  9. "/posts/" 302 quirk documented; use the bare Page URL
  10. End-of-feed via 8 consecutive empty scrolls — FB rarely renders an explicit "no more posts" sentinel
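
The expand loop in item 4 can be sketched as follows. This is an illustration under assumptions, not the skill's code: the regex is a guess at a "tight" pattern for the four label families named above (the skill's actual pattern is not shown here), and `visible_labels` / `click` are hypothetical callables standing in for the harness.

```python
import re
from typing import Callable, Iterable

# Assumed pattern: anchored on the whole label so ordinary text is not clicked.
EXPAND_RE = re.compile(
    r"see more|view \d+ (?:more )?(?:comments?|repl(?:y|ies))|show more replies",
    re.IGNORECASE,
)


def is_expander(label: str) -> bool:
    """True if the whole label, with whitespace normalised, is an expander."""
    return EXPAND_RE.fullmatch(" ".join(label.split())) is not None


def expand_until_stable(
    visible_labels: Callable[[], Iterable[str]],  # hypothetical: labels clickable now
    click: Callable[[str], None],                 # hypothetical: click by label
    max_passes: int = 30,
) -> int:
    """Repeat passes until one finds nothing to click, or the cap is hit."""
    total = 0
    for _ in range(max_passes):
        targets = [label for label in visible_labels() if is_expander(label)]
        if not targets:
            break  # a pass with zero clicks means the thread is fully expanded
        for label in targets:
            click(label)
        total += len(targets)
    return total
```

Matching the full label rather than a substring is the design choice that keeps the loop from clicking ordinary text such as "See more posts like this". The pass cap bounds the run if an expander never disappears after being clicked.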

What I deliberately didn't include

  • Pixel coordinates (break on layout)
  • Run narration (the specific Page archived)
  • Secrets, cookies, user-specific state
  • A monolithic single-harness-call architecture (subprocess-per-post is what survived stale-daemon failures during the run)

Verification

Run rg --files agent-workspace/domain-skills/facebook and confirm the three skills coexist. The new file mirrors the style + section ordering of pages.md (URL patterns table, DOM anchors table, scrolling pattern, decoder helpers, rate-limit discipline, self-inspection block, full Python example, gotchas log).

Happy to iterate on tone or trim sections if the maintainers prefer a leaner skill.


Summary by cubic

Adds the Facebook domain skill page-archival.md to fully preserve a Page by building a permalink manifest, scraping each post in permalink view, expanding all comments/replies, downloading images, and writing one markdown file per post. Complements pages.md and groups.md, which focus on monitoring.

  • New Features
    • Adds agent-workspace/domain-skills/facebook/page-archival.md documenting end-to-end Page archival.
    • Two-phase flow: manifest → scrape, with vanity-scoped permalink filtering to avoid off-Page noise.
    • Recursive comment/reply expansion with capped passes; depth computed from DOM nesting.
    • Image extraction with filters (emoji.php, width ≥ 60) and per-post markdown output.
    • Practical pacing and reliability guidance (scroll/load pauses, long breaks, subprocess-per-post or ensure_real_tab()), plus a self-inspection block and a full Python example.

Written for commit 303ae0f. Summary will update on new commits.

…raping

Third Facebook domain skill alongside groups.md and pages.md. Where
those two are tuned for monitoring (top-N recent posts + outbound URLs),
this one is tuned for full preservation of a Page:

- walk every reachable post permalink
- visit each one in permalink view (not feed view)
- recursively expand every comment and reply thread
- download every image
- emit one wiki-compatible markdown file per post

Captures patterns from a live archival session against a long-running
community Page (~63 posts, every comment + reply + image preserved,
zero account checkpoints). Most operationally important findings:

- Two-phase manifest-then-scrape — feed virtualization makes comment
  expansion impossible in feed view; permalink view is required.
- Vanity-scoped permalink filter (a[href*="/{vanity}/posts/pfbid"]) —
  unscoped pfbid matches leak in ~30% notification/recommendation pollution.
- Comment depth from DOM nesting — each comment/reply is its own
  div[role="article"]; indentation pixels are unreliable.
- 30s pause every 50 scrolls — pacing floor that kept the test account
  un-checkpointed through a 581-URL run.

No pixel coordinates, no user-specific narration, no secrets.
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="agent-workspace/domain-skills/facebook/page-archival.md">

<violation number="1" location="agent-workspace/domain-skills/facebook/page-archival.md:526">
P2: Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback `(blocks[0] && blocks.length === 1 ? blocks[0] : null)`, but the full example only uses `blocks.slice(1).join('\\n') || null`, which drops comment text when a comment renders with only one `div[dir="auto"]` block. Mirror the fallback from the standalone extraction snippet to prevent data loss.</violation>
</file>

const t = el.querySelector('a[href*="comment_id="]');
comments.push({
depth, author: blocks[0] || null, author_url: a?.href || null,
text: blocks.slice(1).join('\\n') || null,
Contributor

@cubic-dev-ai cubic-dev-ai Bot May 22, 2026

P2: Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback (blocks[0] && blocks.length === 1 ? blocks[0] : null), but the full example only uses blocks.slice(1).join('\\n') || null, which drops comment text when a comment renders with only one div[dir="auto"] block. Mirror the fallback from the standalone extraction snippet to prevent data loss.
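
The requested fallback, restated in Python so the behaviour can be checked in isolation. In the skill itself the logic lives in the injected JavaScript; `comment_text` is a hypothetical helper name, and `blocks` stands for the text of each `div[dir="auto"]` inside one comment.

```python
from typing import Optional, Sequence


def comment_text(blocks: Sequence[str]) -> Optional[str]:
    """Mirror of: blocks.slice(1).join('\\n') || (single block ? blocks[0] : null)."""
    body = "\n".join(blocks[1:])
    if body:
        return body
    # Single-block comment: keep the lone block instead of dropping the text.
    if len(blocks) == 1 and blocks[0]:
        return blocks[0]
    return None


print(comment_text(["Alice", "Nice post", "Thanks"]))
print(comment_text(["Only block"]))
```

Note that `author` in the quoted snippet is still `blocks[0]`, so for a single-block comment the same string ends up in both fields; whether that is acceptable is a decision for the skill's author.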

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At agent-workspace/domain-skills/facebook/page-archival.md, line 526:

<comment>Single-block comments silently lose body text in the full end-to-end example. The standalone section earlier in the same file correctly handles this with a fallback `(blocks[0] && blocks.length === 1 ? blocks[0] : null)`, but the full example only uses `blocks.slice(1).join('\\n') || null`, which drops comment text when a comment renders with only one `div[dir="auto"]` block. Mirror the fallback from the standalone extraction snippet to prevent data loss.</comment>

<file context>
@@ -0,0 +1,630 @@
+          const t = el.querySelector('a[href*="comment_id="]');
+          comments.push({
+            depth, author: blocks[0] || null, author_url: a?.href || null,
+            text: blocks.slice(1).join('\\n') || null,
+            time_hint: t?.innerText || null,
+          });
</file context>
