ENH: Add ExternalDataUpload skill for local developer and AI agent testing content-link upload workflow #6111
Conversation
Force-pushed from 826ac24 to 1c13696.
Force-pushed from da60d98 to 1e287d1.
@greptileai, please review so this can be taken out of draft mode.
@thewtex: FYI, I can't get the upload access to work for either of the recommended services. I am using this skill and resources to help me configure the upload mechanisms, but I seem to be running into roadblocks: Pinata, Filebase.
Adds Utilities/Maintenance/ExternalDataUpload/ with a Claude Code skill that uploads test data to IPFS under the UnixFS v1 2025 profile, pins on the redundant itk-pinata and itk-filebase remote services, optionally mirrors bytes into an ITKTestingData clone at CID/<cid> (with a 50 MB guard for GitHub's per-file push limit), maintains a new Testing/Data/content-links.manifest index, batch-pins every manifest CID, and normalizes existing .md5 / .sha256 / .cid links by fetching through the gateway templates parsed directly from CMake/ITKExternalData.cmake and re-uploading under the current UnixFS profile. Documents the one-time Kubo + IPFS Desktop setup and references the skill from Testing/Data/README.md.
Add `--background` to both `ipfs-upload.sh` and `content-link-normalize.sh` to submit remote pin requests asynchronously via `ipfs pin remote add --background`. The default remains synchronous (surfaces failures immediately, safer for one-off uploads); `--background` is intended for batch runs where waiting for each remote to reach `pinned` (minutes per file on Filebase) would be impractical.

Also dedup remote-pin submission: before calling `ipfs pin remote add`, query `ipfs pin remote ls --status=queued,pinning,pinned` for the CID and skip the add if a pin already exists on that service. This avoids Pinata's `DUPLICATE_OBJECT` (400) error on re-runs of previously uploaded content, and prevents Filebase from accumulating duplicate queue entries.

README.md and SKILL.md document the new flag, the synchronous vs. asynchronous tradeoff, and the post-run verification command (`ipfs pin remote ls --status=...`).
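For illustration, the dedup-and-pin logic condenses to roughly the following Python sketch (the shipped implementation is bash in `ipfs-upload.sh`; `remote_pin_exists` and `pin_remote` are illustrative names):

```python
import subprocess
import sys


def remote_pin_exists(service: str, cid: str) -> bool:
    # `ipfs pin remote ls` prints one line per matching pin; empty output
    # means the CID has no pin in the queued/pinning/pinned states.
    result = subprocess.run(
        ["ipfs", "pin", "remote", "ls", f"--service={service}",
         f"--cid={cid}", "--status=queued,pinning,pinned"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def pin_remote(service: str, cid: str, background: bool = False) -> None:
    if remote_pin_exists(service, cid):
        # Re-run on previously uploaded content: skip to avoid Pinata's
        # DUPLICATE_OBJECT (400) and duplicate Filebase queue entries.
        print(f"{cid}: already pinned on {service}, skipping", file=sys.stderr)
        return
    cmd = ["ipfs", "pin", "remote", "add", f"--service={service}",
           f"--name={cid}"]
    if background:
        cmd.append("--background")  # submit async; don't wait for 'pinned'
    cmd.append(cid)
    subprocess.run(cmd, check=True)
```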
Convert the 24 `.md5` content links in
Modules/Filtering/AnisotropicDiffusionLBR/test/{Input,Baseline}/ to
`.cid` links under the UnixFS v1 2025 profile, produced by
`Utilities/Maintenance/ExternalDataUpload/content-link-normalize.sh
--hash-only --background`. Bytes were fetched through the gateway
templates in CMake/ITKExternalData.cmake, verified against each
declared MD5 hash, and re-uploaded; all new CIDs are pinned locally
plus on `itk-pinata` and `itk-filebase`.
Record the 24 new CIDs in Testing/Data/content-links.manifest along
with two additional entries picked up as a `--cid-only` sampling run
(CurvatureAnisotropicDiffusionImageFilter.2.png and warp3D.nii.gz),
both of which re-hashed to identical CIDs — confirming existing `.cid`
links in the tree are already compatible with the 2025 profile.
No test semantics change: `CMake/ITKExternalData.cmake` resolves
`DATA{...}` references by whichever `.md5` / `.sha256` / `.cid` link
sits next to the referenced path, so the filter tests continue to
fetch the same bytes.
In content-link-normalize.sh, the prerequisite warning pre-check was
iterating every sha variant (sha1/224/256/384/512) and requiring GNU
coreutils `*sum` binaries. Two issues:
1. ITK content links in practice are only .md5 (legacy) and .sha512
(current), so warning about missing sha224/sha384 tools was noise.
Narrow the pre-check to md5 and sha512.
2. macOS ships BSD `md5` and `shasum`, not coreutils `md5sum` /
`sha512sum`. Warning on their absence was a false positive for
macOS contributors, and the verification path invoked them by
name ("$tool" "$file") so it would actually fail.
Replace `hash_tool_for_ext` (name-only) with `hash_cmd_for_ext` that
returns a full command line — preferring GNU `md5sum` / `shaNsum`
when present, falling back to `md5 -r` (BSD md5 with md5sum-compatible
output) and `shasum -a NNN` (BSD shasum). `verify_bytes` uses
intentional word-splitting so the multi-word fallback
(e.g. "shasum -a 256") expands to distinct argv entries.
Addresses review at
https://github.com/InsightSoftwareConsortium/ITK/pull/6111/files#r3132434963
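The selection logic translates roughly to the following Python sketch (the script itself is bash; returning argv as a list is the analogue of the script's intentional word-splitting, and the function name mirrors the script's):

```python
import shutil


def hash_cmd_for_ext(ext: str) -> list[str]:
    """Return a full hashing command line for a content-link extension,
    preferring GNU coreutils and falling back to the BSD tools on macOS."""
    if ext == "md5":
        if shutil.which("md5sum"):
            return ["md5sum"]
        if shutil.which("md5"):
            return ["md5", "-r"]  # BSD md5 with md5sum-compatible output
    else:  # sha512 (current), and legacy shaNNN variants
        bits = ext[len("sha"):]
        if shutil.which(f"sha{bits}sum"):
            return [f"sha{bits}sum"]
        if shutil.which("shasum"):
            return ["shasum", "-a", bits]  # BSD shasum, e.g. shasum -a 512
    raise RuntimeError(f"no hash tool available for .{ext} links")
```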
Force-pushed from 1e287d1 to 65b04cd.
Matt should take a look now.
Rewrite Documentation/docs/contributing/upload_binary_data.md and data.md to describe the new Kubo + pinning-service workflow driven by Utilities/Maintenance/ExternalDataUpload/ipfs-upload.sh, replacing the obsolete web3.storage / w3cli and content-link-upload.itk.org instructions. Document the one-time Kubo + itk-pinata / itk-filebase setup, the upload script's behavior (CIDv1 under the UnixFS v1 2025 profile, synchronous vs. --background pinning, manifest update), the optional --testing-data-repo mirror step with the 50 MB GitHub limit, and the content-link-normalize.sh conversion workflow for legacy .md5 / .sha256 / .sha512 links. Refresh the storage-location list and testing-data figure caption to match the gateways enumerated in CMake/ITKExternalData.cmake, and remove the now-orphaned content-link-upload.png screenshot of the retired web app.
Force-pushed from 65b04cd to 65f7c84.
Pinata's `pin remote add` endpoint (the IPFS Pinning Service API) is gated to paid plans: the free plan rejects pin-by-CID with PAID_FEATURE_ONLY (HTTP 403), as reported by @hjmjohnson while exercising the new ExternalDataUpload skill. Filebase's free tier still accepts PSA pin-by-CID, so it remains the baseline pin provider for contributors who don't have a paid Pinata account.

ipfs-upload.sh now splits its remote-pinning configuration into a required list (`itk-filebase`) and an optional list (`itk-pinata`): the script aborts if Filebase isn't registered, but logs an informational notice and continues if Pinata isn't. The remote-pin loop walks the merged ACTIVE_SERVICES list, so content is still pinned to Pinata whenever it is configured. The reorder also surfaces Filebase first in every user-facing list (storage locations, log lines, manifest-skipped warnings, README setup section, contributor docs) to match the new "required first, optional second" hierarchy.

Documentation in README.md, SKILL.md, Documentation/docs/contributing/upload_binary_data.md, and Documentation/docs/contributing/data.md is updated to reorder Filebase ahead of Pinata, mark Pinata as optional, and explain the paid-plan restriction. README.md gains a troubleshooting entry for the PAID_FEATURE_ONLY error pointing at `ipfs pin remote service rm itk-pinata` as the cleanest fix when no paid plan is available.

Agent-Session-Id: 40f8eba4-dc94-4d4f-94bd-ff3d2fccf04f
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
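A condensed Python sketch of the required/optional split (the shipped logic is bash in `ipfs-upload.sh`; only `ACTIVE_SERVICES` is a name from the script, the rest are illustrative):

```python
import subprocess
import sys

REQUIRED_SERVICES = ["itk-filebase"]  # free tier accepts PSA pin-by-CID
OPTIONAL_SERVICES = ["itk-pinata"]    # PSA endpoint is paid-plan only


def registered_services() -> set[str]:
    # First column of `ipfs pin remote service ls` is the service name.
    out = subprocess.run(["ipfs", "pin", "remote", "service", "ls"],
                         capture_output=True, text=True, check=True).stdout
    return {line.split()[0] for line in out.splitlines() if line.strip()}


available = registered_services()
for service in REQUIRED_SERVICES:
    if service not in available:
        sys.exit(f"error: required pinning service '{service}' is not registered")
for service in OPTIONAL_SERVICES:
    if service not in available:
        print(f"note: optional service '{service}' not configured; continuing",
              file=sys.stderr)

# The remote-pin loop walks the merged list, required services first.
ACTIVE_SERVICES = REQUIRED_SERVICES + [s for s in OPTIONAL_SERVICES if s in available]
```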
@hjmjohnson @dzenanz: addressed the Pinata issue in 0e9dd0b. So contributors without a paid Pinata plan can now configure only `itk-filebase`.
FYI: I can not get the pinning services to work. On both Pinata and Filebase I get API restrictions requiring a paid account. I was trying to mirror the recent ITKTestingData additions to these external services for redundancy.

```
┌─[johnsonhj@ENGR-ECE-M030] - [~/src/XXX/ITK_REMOTE_MODULES_STABLE] - [2026-05-01 07:55:29]
└─[0] find . -name "*.md5" cd /Users/johnsonhj/src/XXX/ITK_REMOTE_MODULES_STABLE
find: cd: unknown primary or operator
# Smoke (no --background to surface auth errors immediately):
ls ~/src/XXX/ITKTestingData/CID | head -5 | while read cid; do
  ipfs pin remote add --service=itk-filebase --name="$cid" "$cid"
done
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
```
Drops the local Kubo / IPFS-Desktop daemon, the `ipfs config profile
apply unixfs-v1-2025` setup step, the `ipfs pin remote service add`
PSA registrations (`itk-filebase`, `itk-pinata`), and the bash upload
trio (`ipfs-upload.sh`, `content-link-normalize.sh`, `ipfs-pin-all.sh`)
that drove them. The new contributor flow is pure Python on top of a
small pixi environment (sketched in code after this list):
1. `npx ipfs-car pack <file> --no-wrap` builds a CARv1 locally.
ipfs-car v1+ defaults (1 MiB chunks, 1024 children, raw leaves,
CIDv1) match the unixfs-v1-2025 / IPIP-0499 profile, so no extra
flags are needed to produce a reproducible CID.
2. `boto3` PUTs the CAR to a Filebase IPFS bucket through Filebase's
S3-compatible REST API with `x-amz-meta-import: car`. Filebase
imports the CAR server-side and exposes the resulting CID via
`head_object` metadata.
3. The local CID and the CID Filebase reports are compared, and on
success the file is replaced with `<file>.cid`, the manifest at
`Testing/Data/content-links.manifest` is updated, and the optional
`--testing-data-repo` mirror step still copies the bytes into a
local ITKTestingData clone (subject to the same 50 MB GitHub push
limit as before).
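Condensed, the three steps look roughly like the sketch below. This is a minimal illustration, not `upload.py` itself: the helper names are made up, credentials follow the `FILEBASE_*` env-var contract described later, and `"cid"` is the assumed metadata key under which Filebase reports the imported CID.

```python
import os
import subprocess
from pathlib import Path

import boto3

FILEBASE_ENDPOINT = "https://s3.filebase.com"


def pack_car(path: Path) -> tuple[Path, str]:
    # Step 1: build a CARv1 locally. ipfs-car v1+ defaults (1 MiB chunks,
    # 1024 children, raw leaves, CIDv1) match the unixfs-v1-2025 profile.
    car = path.with_name(path.name + ".car")
    root_cid = subprocess.run(
        ["npx", "ipfs-car", "pack", str(path), "--no-wrap",
         "--output", str(car)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()  # ipfs-car prints the root CID
    return car, root_cid


def upload_car(car: Path, key: str, bucket: str) -> str:
    # Step 2: PUT the CAR through Filebase's S3-compatible API. The Metadata
    # entry becomes an `x-amz-meta-import: car` header, telling Filebase to
    # import the CAR server-side.
    s3 = boto3.client(
        "s3", endpoint_url=FILEBASE_ENDPOINT,
        aws_access_key_id=os.environ["FILEBASE_ACCESS_KEY"],
        aws_secret_access_key=os.environ["FILEBASE_SECRET_KEY"])
    s3.put_object(Bucket=bucket, Key=key, Body=car.read_bytes(),
                  Metadata={"import": "car"})
    # Step 3: Filebase exposes the resulting CID via head_object metadata.
    return s3.head_object(Bucket=bucket, Key=key)["Metadata"]["cid"]


def upload_and_verify(path: Path, bucket: str) -> str:
    car, local_cid = pack_car(path)
    remote_cid = upload_car(car, path.name, bucket)
    if local_cid != remote_cid:
        raise RuntimeError(f"CID mismatch: {local_cid} vs {remote_cid}")
    return remote_cid
```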
Concretely:
- Add `boto3`, `nodejs`, and `requests` to a new
`[tool.pixi.feature.external-data-upload]` feature plus an
`external-data-upload` environment in `pyproject.toml`. Run
`pixi install -e external-data-upload` once, then
`pixi run -e external-data-upload python ...` for every upload.
- New `Utilities/Maintenance/ExternalDataUpload/upload.py` is the
single-file uploader: input validation (in-repo, no whitespace, not
already a content link), CAR build, boto3 put_object with the
`import: car` metadata header, head_object CID round-trip, manifest
update, optional ITKTestingData mirror, and the same `git rm` /
`git add` instructions as before.
- New `Utilities/Maintenance/ExternalDataUpload/normalize.py` parses
`ExternalData_URL_TEMPLATES` from `CMake/ITKExternalData.cmake` with a
paren-aware scanner (the `%(hash)` / `%(algo)` substrings break naive
`re.DOTALL` lazy matching; see the sketch after this list), fetches each `.md5` / `.shaNNN` / `.cid`
link via the gateway templates, verifies the bytes
algorithmically (or via the `/ipfs/` server-side guarantee for
CID links), and re-uploads through `upload.upload_file_to_filebase`.
- `Utilities/Maintenance/ExternalDataUpload/README.md` is rewritten end
to end: pixi setup, Filebase S3-key creation, `FILEBASE_ACCESS_KEY` /
`FILEBASE_SECRET_KEY` / `FILEBASE_BUCKET` env-var contract, new
troubleshooting section (missing npx, missing credentials, Filebase
did not return a CID, CID mismatch).
- `Utilities/Maintenance/ExternalDataUpload/SKILL.md` updated to
describe the same flow for the AI agent: pixi env + Filebase
credentials prerequisites; no Kubo, no PSA service registration.
- `Documentation/docs/contributing/upload_binary_data.md` and
`Documentation/docs/contributing/data.md` rewrite the
one-time-setup, upload-a-file, mirror, and normalize sections
around the pixi + Filebase workflow. The storage-locations list and
testing-data-figure caption are reworded so Filebase appears as the
upload destination and Kubo / Pinata only show up as build-time read
paths (gateways, not pinning targets).
- `Testing/Data/content-links.manifest` header rewritten to credit
`upload.py` as the maintainer (previously named
`ipfs-upload.sh`).
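The paren-aware scan can be sketched as follows (illustrative, not the parser in `normalize.py`; for brevity it assumes a single `set(ExternalData_URL_TEMPLATES ...)` block, while the real file may spread templates across several calls):

```python
import re


def parse_url_templates(cmake_text: str) -> list[str]:
    """Extract quoted URL templates from set(ExternalData_URL_TEMPLATES ...).
    A lazy regex with re.DOTALL would stop at the ')' inside '%(hash)' or
    '%(algo)', so nesting depth is tracked instead."""
    start = cmake_text.index("set(ExternalData_URL_TEMPLATES")
    depth = 0
    end = None
    for i in range(start, len(cmake_text)):
        ch = cmake_text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                end = i
                break
    if end is None:
        raise ValueError("unbalanced parentheses in ExternalData_URL_TEMPLATES")
    # Templates are quoted strings such as "https://.../%(algo)/%(hash)".
    return re.findall(r'"([^"]+)"', cmake_text[start:end])
```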
The Filebase free tier supports the S3 import-as-CAR path used here,
so the workflow needs no paid subscription — addressing the original
Pinata \`PAID_FEATURE_ONLY\` blocker reported by @hjmjohnson — and CI
runners can use the same env-var contract via GitHub Actions secrets.
Force-pushed from 7c8594a to 05be188.
@hjmjohnson thanks for the note and testing 🥇. I have paid accounts for Pinata and Filebase and did not know that these features are unavailable on free accounts 🤦. I pushed an update that uses only Filebase, via its S3 API, which is available on the free tier. It also uses dependencies that are made available via pixi, and removes the Kubo / IPFS Desktop installation, setup, and run requirement.
@thewtex I pushed many objects manually to ITKTestingData so that I could move forward on the remote module conversions. Would you mind uploading the cache of ITKTestingData blobs to the various resources that you have available for the purposes of replication?
@hjmjohnson I'll add a script that syncs the ITKTestingData Git repo to the content-link manifest here, and I'll run the pinning script.