Skip to content

feat(scripts): add cross-reference tag tooling#39662

Draft
kim-em wants to merge 1 commit into
leanprover-community:masterfrom
kim-em:crossref-tooling-scripts
Draft

feat(scripts): add cross-reference tag tooling#39662
kim-em wants to merge 1 commit into
leanprover-community:masterfrom
kim-em:crossref-tooling-scripts

Conversation

@kim-em
Copy link
Copy Markdown
Contributor

@kim-em kim-em commented May 21, 2026

This PR adds two standalone scripts in scripts/ to support review of PRs that add @[stacks], @[kerodon], or @[wikidata] attributes. Neither needs a lakefile.lean entry — both run via lake env lean --run and depend only on Lean core (plus curl on PATH for the fetcher).

  • scripts/crossref-snippet.lean fetches a one-line label/description for one or more tags from the upstream database. Wikidata uses the wbgetentities API (with recovery when one bad QID poisons a batch); Stacks and Kerodon use the documented Gerby /data/tag/<TAG>/content/statement endpoint with a tolerant HTML→text strip that survives < inside math. Exit codes distinguish all-resolved (0) from missing (2) from transient network failure (3). When CROSSREF_CACHE_DIR is set, responses are memoised per (database, tag) so repeat lookups are instant.

  • scripts/extract-crossref-tags.lean walks a .lean file (or the added lines of a git diff range) and emits TSV of every cross-reference attribute it finds, paired with the declaration it decorates. A byte-level scanner correctly skips string literals and line/block comments, handles multi-attribute blocks (@[simp, stacks 01AB]), multi-line attribute blocks, doc comments between attribute and declaration, and modifier keywords (private, noncomputable, …). Signatures are collapsed to one line for downstream consumption.

Sample usage:

lake env lean --run scripts/crossref-snippet.lean wikidata Q42 Q937
# Q42  Douglas Adams  British science fiction writer and humorist (1952–2001)
# Q937 Albert Einstein german-born theoretical physicist (1879–1955)

lake env lean --run scripts/crossref-snippet.lean stacks 01AB
# 01AB Lemma 14.32.3 Lemma 14.32.3. Let \$A\$, \$B\$ be sets ...

lake env lean --run scripts/extract-crossref-tags.lean --file Mathlib/Algebra/CharP/Basic.lean
# stacks  Mathlib/Algebra/CharP/Basic.lean  195  09FS  First part. ...  instance  charP  instance charP (n : ℕ) [NeZero n] : CharP (Fin n) n

lake env lean --run scripts/extract-crossref-tags.lean --diff origin/master..HEAD --db wikidata

The scripts are deliberately text-only — they do not elaborate the files they inspect. That lets the companion CI workflow (a follow-up PR) operate on PR files as data without executing PR code under a privileged GitHub token.

Two stacked follow-up PRs (not yet open):

  1. An LSP widget that surfaces these snippets in the editor when the cursor sits on a cross-reference attribute, reusing the fetch logic via a new Mathlib/Tactic/CrossRef/Fetch.lean module.
  2. A GitHub Actions workflow that posts a once-per-PR comment with tag / signature / snippet tables and fails CI on unresolved tags.

🤖 Prepared with Claude Code

@github-actions github-actions Bot added the CI Modifies the continuous integration setup or other automation label May 21, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

PR summary 61c1da021f

Import changes for modified files

No significant changes to the import graph

Import changes for all files
Files Import difference

Declarations diff

+ AttrBlock
+ Clause
+ Database
+ Database.fetch
+ Database.gerbyBase?
+ Database.name
+ Database.ofString?
+ DeclHeader
+ DiffHunks
+ Mode
+ Pos
+ Record
+ Record.emit
+ Result
+ Result.emit
+ WikidataParse
+ afterFirst?
+ byteSlice
+ cacheFile?
+ cacheLoad
+ cacheStore
+ collapseSignature
+ databases
+ declKeywords
+ extractFromFile
+ fetchGerby
+ fetchGerbyAll
+ fetchUrl
+ fetchWikidata
+ findDecl
+ flattenWhitespace
+ gitDiffLeanFiles
+ jsonStrPath?
+ modifierKeywords
+ parseArgs
+ parseClause
+ parseDiff
+ parseGerbyTitle
+ parseWikidataResponse
+ readIdent
+ readWord
+ recordInHunks
+ splitAttrInstances
+ stepPos
+ stripHtml
+ takeIdent
+ takeString
+ takeTag
+ takeUntilChar
+ tsvEscape
+ userAgent
+ utf8Slice
+ wikidataBatchSize
+ wikidataResultOf
++ main
++ usage

You can run this locally as follows
## from your `mathlib4` directory:
git clone https://github.com/leanprover-community/mathlib-ci.git ../mathlib-ci

## summary with just the declaration names:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh <optional_commit>

## more verbose report:
../mathlib-ci/scripts/pr_summary/declarations_diff.sh long <optional_commit>

The doc-module for scripts/pr_summary/declarations_diff.sh in the mathlib-ci repository contains some details about this script.


No changes to strong technical debt.
No changes to weak technical debt.

⚠️ Scripts folder reminder

This PR adds files under scripts/.
Please consider whether each added script belongs in this repository or in leanprover-community/mathlib-ci.

A script belongs in mathlib-ci if it is a CI automation script that interacts with GitHub (e.g. managing labels, posting comments, triggering bots), runs from a trusted external checkout in CI, or requires access to secrets.

A script belongs in this repository (scripts/) if it is a developer or maintainer tool to be run locally, a code maintenance or analysis utility, a style linting tool, or a data file used by the library's own linters.

See the mathlib-ci README for more details.

Added scripts files:

  • scripts/crossref-snippet.lean
  • scripts/extract-crossref-tags.lean

Add two standalone scripts to support review of PRs that add `@[stacks]`,
`@[kerodon]`, or `@[wikidata]` attributes:

* `scripts/crossref-snippet.lean` fetches a one-line label/description for
  one or more tags from the upstream database. Wikidata uses the
  `wbgetentities` API (with recovery when one bad QID poisons a batch);
  Stacks and Kerodon use the documented Gerby `/data/tag/<TAG>/content/statement`
  endpoint with a tolerant HTML→text strip that survives `<` inside math.
  Exit codes distinguish all-resolved (0) from missing (2) from transient
  network failure (3). When `CROSSREF_CACHE_DIR` is set, responses are
  memoised per `(database, tag)` so repeat lookups are instant.

* `scripts/extract-crossref-tags.lean` walks a `.lean` file (or the added
  lines of a git diff range) and emits TSV of every cross-reference
  attribute it finds, paired with the declaration it decorates. A
  byte-level scanner correctly skips string literals and line/block
  comments, handles multi-attribute blocks (`@[simp, stacks 01AB]`),
  multi-line attribute blocks, doc comments between attribute and
  declaration, and modifier keywords (`private`, `noncomputable`, …).
  Signatures are collapsed to one line for downstream consumption.

Both scripts run via `lake env lean --run` and depend only on Lean core
(plus `curl` on `PATH` for the fetcher), so no Mathlib build is required
and a future CI workflow can drive them on untrusted PR files without
elaborating PR code.

Companion follow-ups (separate PRs) will add an LSP widget that surfaces
these snippets in the editor and a GitHub Actions workflow that posts a
once-per-PR comment with tag / signature / snippet tables and fails CI
on unresolved tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kim-em kim-em force-pushed the crossref-tooling-scripts branch from 4ba4247 to 61c1da0 Compare May 21, 2026 18:59
kim-em added a commit to kim-em/mathlib4 that referenced this pull request May 21, 2026
When a Mathlib declaration carries `@[stacks ...]`, `@[kerodon ...]`, or
`@[wikidata ...]`, the info view now shows the upstream label / description
when the cursor sits on the attribute. The fetch is on-demand via an RPC
call from the widget to the Lean server, which shells out to `curl`; we
deliberately don't fetch at attribute-elaboration time so offline builds
stay clean and the only network access is when an editing session opens
the panel.

Two new modules and a small refactor to the existing one:

* `Mathlib/Tactic/CrossRef/Fetch.lean` (new) holds the `Database` enum and
  the snippet-fetch logic. The companion `scripts/crossref-snippet.lean`
  in the previous PR reimplements the same logic standalone for CI usage;
  the two are kept in sync deliberately so the CI script doesn't need a
  Mathlib build.
* `Mathlib/Tactic/Widget/CrossRefHover.lean` (new) defines the widget
  itself: an `RpcEncodable` props record, a `RequestM` RPC method that
  calls `fetchSnippet`, and a `mk_rpc_widget%` Component that renders
  the result as Html.
* `Mathlib/Tactic/CrossRefAttribute.lean` (modified) imports the two new
  modules, removes the now-duplicated `Database` / `databaseURL` /
  `databaseLabel` definitions, and calls `Widget.savePanelWidgetInfo` in
  each attribute's `add` handler. Tag storage, docstring rewriting, and
  the trace commands are untouched.

If `curl` is missing or the upstream site is unreachable, the widget
renders an inline error instead of blocking the LSP. The existing
`MathlibTest/CrossRefAttribute.lean` suite continues to pass — none of
the behaviour visible to it changes.

Stacked on leanprover-community#39662
(scripts only). A third follow-up will add a GitHub Actions workflow
that posts a once-per-PR comment with tag / signature / snippet tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kim-em added a commit to kim-em/mathlib4 that referenced this pull request May 21, 2026
When a PR touches `Mathlib/**/*.lean` and adds `@[stacks ...]`,
`@[kerodon ...]`, or `@[wikidata ...]` attributes, this workflow posts (or
updates, or deletes) a single bot comment with one row per added tag,
showing the upstream label / description fetched from the source database,
the Mathlib declaration's first-line signature, and the author's optional
attribute comment. CI fails when any tag cannot be resolved upstream and
emits a warning (not a failure) when only transient network errors occur.

Three new pieces:

* `scripts/crossref-pr-comment.py` — pure-stdlib Python orchestrator that
  drives `extract-crossref-tags.lean` then `crossref-snippet.lean` per
  database, formats the Markdown comment body with a hidden HTML marker
  for in-place updates, and returns a 0/1/2/3 exit code so the workflow
  can distinguish "nothing to post" from "post but pass" from
  "post and fail" from "post but treat as a network blip."

* `.github/workflows/crossref_review.yml` — `pull_request_target` workflow
  with `pull-requests: write` (and only that). Installs elan, runs the
  orchestrator on the PR's diff range, finds the existing bot comment via
  `gh api … --jq` matching the marker, and `POST`s / `PATCH`es / `DELETE`s
  through the GitHub API. A startup gate exits cleanly when the branch
  pre-dates the companion scripts (so the workflow doesn't fail on PRs
  to an older base).

* `docs/workflows.md` — new entry in the workflow inventory, matching
  the existing row format.

Stacked on leanprover-community#39664
which adds the LSP widget side, which itself depends on
leanprover-community#39662 (the scripts).

Security note: the workflow runs the PR's own version of the scripts in
the privileged `pull_request_target` context. The exposed token is
GITHUB_TOKEN with only `pull-requests: write`, and the scripts only read
.lean files and shell out to `curl` against three fixed upstream APIs, so
the blast radius of a malicious script edit is limited to posting / editing
PR comments. If maintainers prefer the trusted-only execution pattern used
by `PR_summary.yml`, the orchestrator can move to mathlib-ci in a
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI Modifies the continuous integration setup or other automation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant