ArchiveBox-compatible plugin suite (hooks and config schemas).
This package contains only the plugins; to run them, use abx-dl or ArchiveBox.
Tools like abx-dl and ArchiveBox can discover plugins from this package
without symlinks or environment-variable tricks.
Each plugin lives under `plugins/<name>/` and may include:

- `config.json` - config schema
- `on_Crawl__...` - per-crawl hook scripts (optional): install dependencies / set up shared resources
- `on_Snapshot__...` - per-snapshot hooks: for each URL, do xyz...
Hooks run with:

- `SNAP_DIR` - base snapshot directory (default: `.`)
- `CRAWL_DIR` - base crawl directory (default: `.`)
- `LIB_DIR` - binaries/tools root (default: `~/.config/abx/lib`)
- `PERSONAS_DIR` - persona profiles root (default: `~/.config/abx/personas`)
- `ACTIVE_PERSONA` - persona name (default: `Default`)
- Snapshot hook output goes to `SNAP_DIR/<plugin>/...`
- Crawl hook output goes to `CRAWL_DIR/<plugin>/...`
- Other plugin outputs can be read via `../<other-plugin>/...` from your own output dir
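In a Python hook, these can be read with plain `os.environ` lookups and the documented defaults (a sketch; real hooks use the `get_env*` helpers from `base/utils.py`, and the `wget` plugin name is just an example):

```python
import os
from pathlib import Path

# Defaults mirror the documented ones above.
SNAP_DIR = Path(os.environ.get("SNAP_DIR", "."))
CRAWL_DIR = Path(os.environ.get("CRAWL_DIR", "."))
LIB_DIR = Path(os.environ.get("LIB_DIR", "~/.config/abx/lib")).expanduser()
PERSONAS_DIR = Path(os.environ.get("PERSONAS_DIR", "~/.config/abx/personas")).expanduser()
ACTIVE_PERSONA = os.environ.get("ACTIVE_PERSONA", "Default")

# A snapshot hook for a hypothetical "wget" plugin writes under SNAP_DIR/wget/
plugin_output_dir = SNAP_DIR / "wget"
```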
Lifecycle:

- `on_Crawl__*install*` declares crawl dependencies.
- `on_Binary__*install*` resolves/installs one binary with one provider.
on_Crawl output (dependency declaration):

    {"type": "Binary", "name": "yt-dlp", "binproviders": "pip,brew,apt,env", "overrides": {"pip": {"install_args": ["yt-dlp[default]"]}}, "machine_id": "<optional>"}

on_Binary input/output:

- CLI input should accept `--binary-id`, `--machine-id`, `--name` (plus optional provider args).
- Output should emit installed facts like:

      {"type": "Binary", "name": "yt-dlp", "abspath": "/abs/path", "version": "2025.01.01", "sha256": "<optional>", "binprovider": "pip", "machine_id": "<recommended>", "binary_id": "<recommended>"}

- Optional machine patch record:

      {"type": "Machine", "config": {"PATH": "...", "NODE_MODULES_DIR": "...", "CHROME_BINARY": "..."}}

Semantics:

- stdout: JSONL records only
- stderr: human logs/debug
- exit 0: success or intentional skip
- exit non-zero: hard failure
State/OS:

- working dir: `CRAWL_DIR/<plugin>/`
- durable install root: `LIB_DIR` (e.g. npm prefix, pip venv, puppeteer cache)
- providers: `apt` (Debian/Ubuntu), `brew` (macOS/Linux); many hooks currently assume POSIX paths
Lifecycle:

- runs once per snapshot, typically after crawl setup
- common Chrome flow: crawl browser/session -> `chrome_tab` -> `chrome_navigate` -> downstream extractors
State:

- output cwd is usually `SNAP_DIR/<plugin>/`
- hooks may read sibling outputs via `../<plugin>/...`
Output records:

- terminal record is usually: `{"type": "ArchiveResult", "status": "succeeded|noresults|skipped|failed", "output_str": "path-or-message"}`
- discovery hooks may also emit `Snapshot` and `Tag` records before `ArchiveResult`
- search indexing hooks are a known exception and may use exit code + stderr without `ArchiveResult`

Semantics:

- stdout: JSONL records
- stderr: diagnostics/logging
- exit 0: succeeded, noresults, or skipped
- exit non-zero: failed
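Putting those status rules together, a minimal on_Snapshot hook body might look like this (a sketch; the `chrome_navigate/output.html` sibling path and `EXAMPLE_ENABLED` flag are hypothetical):

```python
import json
from pathlib import Path

def emit_archive_result(status: str, output_str: str) -> dict:
    """Print the terminal ArchiveResult record as a JSONL line on stdout."""
    record = {"type": "ArchiveResult", "status": status, "output_str": output_str}
    print(json.dumps(record), flush=True)
    return record

def run_extractor(snap_dir: Path, enabled: bool = True) -> dict:
    if not enabled:
        # config disabled the hook entirely -> skipped
        return emit_archive_result("skipped", "EXAMPLE_ENABLED=False")
    html = snap_dir / "chrome_navigate" / "output.html"  # hypothetical sibling output
    if not html.exists():
        # ran fine but found nothing to work on -> noresults
        return emit_archive_result("noresults", "no HTML source found")
    # real extraction would happen here; report the primary output path on success
    return emit_archive_result("succeeded", str(html.relative_to(snap_dir)))
```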
The base/ plugin provides shared Python and JS helpers that all other plugins import:
Python (`base/utils.py`):

    sys.path.append(str(Path(__file__).resolve().parent.parent))
    from base.utils import load_config, emit_archive_result, get_env

- `load_config()` → load `config.json` via Pydantic Settings with env var + alias resolution
- `emit_archive_result(status, output_str)` → print `{"type": "ArchiveResult", ...}` JSONL to stdout
- `output_binary(name, abspath, version, ...)` → emit a `Binary` JSONL record
- `output_machine_config(config_dict)` → emit a `Machine` config patch
- `write_text_atomic(path, content)` → write a file atomically (temp + rename)
- `find_html_source(snap_dir, ...)` → locate HTML from sibling plugins
- `has_staticfile_output(snap_dir, path)` → check if a sibling plugin produced a file
- `get_env(name, default)`, `get_env_bool`, `get_env_int`, `get_env_array` → typed env helpers
- `enforce_lib_permissions()` → lock down `LIB_DIR` so snapshot hooks can read/execute but not write
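For reference, the temp-plus-rename pattern behind `write_text_atomic` can be sketched like this (a plausible implementation, not the actual `base/utils.py` code):

```python
import os
import tempfile
from pathlib import Path

def write_text_atomic(path, content: str) -> None:
    """Write `content` to `path` atomically: write a temp file in the same
    directory, then rename it over the target (os.replace is atomic on POSIX).
    A crash mid-write leaves the previous file intact."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp, path)  # atomic overwrite of any existing file
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise
```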
JS (`base/utils.js`):

    const { getEnv, getEnvBool, getEnvInt, getEnvArray, emitArchiveResult } = require('../base/utils.js');

Test helpers (`base/test_utils.py`):

    from base.test_utils import parse_jsonl_output, run_hook, get_hook_script

- `parse_jsonl_output(stdout)` → extract the first matching JSONL record from hook stdout
- `run_hook(hook_script, url, snapshot_id)` → run a hook subprocess with standard args
- `get_hook_script(plugin_dir, pattern)` → find a hook script by glob pattern
Note: Use `sys.path.append()` (not `insert(0, ...)`) because the `ssl/` plugin directory would otherwise shadow Python's stdlib `ssl` module.
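The difference is visible in resolution order (a sketch; the plugins path is illustrative):

```python
import sys

# Appending keeps the stdlib ahead of the plugins dir in resolution order,
# so `import ssl` still finds the stdlib module even though a plugins/ssl/
# package exists. insert(0, ...) would put plugins/ first and shadow it,
# breaking everything that depends on stdlib ssl (urllib, pip, requests, ...).
sys.path.append("/path/to/plugins")  # illustrative path

import ssl  # resolves to the stdlib, because append() kept it first

assert hasattr(ssl, "SSLContext")  # stdlib ssl, not a plugin package
```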
- all plugins should:
  - overwrite existing files cleanly if re-run in the same dir; do not skip just because files are already present, and do not delete before downloading (if a process fails partway, we want the previous output left intact)
  - the one exception to always overwriting is Chrome state (`chrome.pid`, `target_id.txt`, `navigation.json`, etc.), which gets reused if it is not stale; staleness should be detected during Chrome launch and tab creation, and all of the state cleared together if any of it is stale, to prevent subtle drift errors / reuse of stale values
  - status `succeeded` if they ran and produced output
  - status `noresults` if they ran successfully but produced no meaningful output (e.g. git on a non-github URL, yt-dlp on a site with no media, paperdl on a site with no PDFs, etc.)
  - status `skipped` only if config caused them not to run (e.g. `YTDLP_ENABLED=False`)
  - status `failed` if any hard dependencies are missing/invalid (e.g. Chrome) or if the process exited non-zero / raised an exception
  - return a short, meaningful `output_str`, e.g. the page title, mimetype, return status code, or the relative path of the primary output file produced, like `output.pdf` or `0 modals closed` or `The Page Title Verbatim` or `favicon.ico` or `Not a git URL`
  - define execution order solely via the lexicographic sort order of hook filenames
  - use bg hooks either for short-lived tasks that can run in parallel, or for long-lived daemons that run for the whole duration of the snapshot and get killed for cleanup/final output at the end
  - bg hooks that depend on other bg hook outputs must implement their own waiters internally and check that inputs are truly ready, not just that the files are present, because they may be spawned in parallel with (or before) the producing hook's outputs are actually complete, and race; e.g. HTML/artifact generation should usually be fg so that later bg parsing hooks can safely depend on it being finished rather than only partially written
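A bg-hook waiter along those lines treats "parses as complete JSON" as the readiness signal, not mere file existence (a sketch; the timeout values are arbitrary):

```python
import json
import time
from pathlib import Path

def wait_for_json(path, timeout: float = 30.0, poll: float = 0.5) -> dict:
    """Wait until `path` both exists AND parses as complete JSON.
    A bg hook can see the file appear before the writer has finished,
    so existence alone is not readiness (for object/array payloads,
    a partial write never parses, which makes parsing a safe check)."""
    deadline = time.monotonic() + timeout
    p = Path(path)
    while time.monotonic() < deadline:
        if p.exists():
            try:
                return json.loads(p.read_text())  # fully written and valid
            except json.JSONDecodeError:
                pass  # partial write: keep waiting
        time.sleep(poll)
    raise TimeoutError(f"{path} never became ready")
```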
  - use rich_click for CLI arg parsing, with a uv script header, when hooks are written in Python; do not depend on archivebox or django, and try to depend only on Chrome or the output files of other plugins instead of importing code from them (the one exception: always use `chrome_utils.js` as the interface for anything involving Chrome)
Hooks emit JSONL events to stdout. They do not need to import bbus.
The event envelope matches the bbus style so higher layers can stream/replay.
Minimal envelope:
{
"event_id": "uuidv7",
"event_type": "SnapshotCreated",
"event_created_at": "2026-02-01T20:10:22Z",
"event_parent_id": "uuidv7-or-null",
"event_schema": "abx.events.v1",
"event_path": "abx-plugins",
"data": { "...": "event-specific fields" }
}

Conventions:
- Active-verb names are requests (e.g. `BinaryInstall`, `ProcessLaunch`).
- Past-tense names are facts (e.g. `BinaryInstalled`, `ProcessExited`).
- Plugins can emit additional fields inside `data` without coordination.

Common event types emitted by hooks:

- `ArchiveResultCreated` (status + output files)
- `Binary` records (dependency detection/install)
- `ProcessStarted` / `ProcessExited`
Higher-level tools (abx-dl / ArchiveBox) can:
- Parse these events from stdout
- Persist or project them (SQLite/JSONL/Django) without plugins knowing
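For example, a consumer can collect a hook's JSONL facts while tolerating log noise on stdout (a sketch of the consuming side, not abx-dl's actual code):

```python
import json
import subprocess
import sys

def run_hook_and_collect(cmd: list[str]) -> list[dict]:
    """Run a hook subprocess and parse every well-formed JSONL record from
    its stdout, skipping non-JSON lines and ignoring stderr (diagnostics)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    records = []
    for line in proc.stdout.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSONL record
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate malformed lines rather than crash the consumer
    return records
```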
