OAI-PMH record harvester.
Mise is recommended for version management.
mise trust
mise install
mise run gems

The harvester requires a Postgres connection. The default connection config is in .env but can be overridden by .env.local. To set up:
docker compose up -d postgres postgrest

This initializes:
- harvester database owned by admin
- harvester_reader role for PostgREST with read access on public tables (including future tables)
Using cargo for harvesting:
cargo run -- harvest -m oai_ead -r fixtures/rules.txt https://test.archivesspace.org/oai

Using cargo for indexing (ArcLight):
cargo run -- index arclight \
allen-doe-research-center \
"https://test.archivesspace.org/oai" \
"Allen Doe Research Center"

This relies on a range of default values, so it will only work if your setup matches them.
For all options run: cargo run -- index arclight --help.
Retry failed index operations for a specific endpoint/repository pair:
cargo run -- index arclight \
allen-doe-research-center \
"https://test.archivesspace.org/oai" \
"Allen Doe Research Center" \
--retry \
--message-filter "timed out" \
--max-attempts 5

Requeue all parsed/deleted records for a specific endpoint/repository pair:
cargo run -- index arclight \
allen-doe-research-center \
"https://test.archivesspace.org/oai" \
"Allen Doe Research Center" \
--reindex

The rules file is optional (though required for indexing). Omit the -r arg to bypass it.
A rules file looks like:
title,unittitle,required
unit_id,unitid,required
repository,repository/corpname,required

- col 1 is used as a JSON attribute key for grouping values
- col 2 identifies a path in the OAI XML to scan for values
- col 3 can be empty or "required", with the latter raising an error if a value is not found
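
The three-column layout described above can be illustrated with a minimal sketch. This is not the harvester's actual parser: the `Rule` struct and `parse_rules` function are hypothetical names, and skipping blank lines and rejecting unknown flags in col 3 are assumptions made for the example.

```rust
// Hypothetical sketch of reading a rules file; the real harvester may differ.
#[derive(Debug, PartialEq)]
struct Rule {
    key: String,    // col 1: JSON attribute key used to group values
    path: String,   // col 2: path in the OAI XML to scan for values
    required: bool, // col 3: true when the column reads "required"
}

fn parse_rules(text: &str) -> Result<Vec<Rule>, String> {
    let mut rules = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        let line = line.trim();
        // Assumption: blank lines are ignored.
        if line.is_empty() {
            continue;
        }
        let cols: Vec<&str> = line.split(',').map(str::trim).collect();
        if cols.len() < 2 || cols.len() > 3 {
            return Err(format!(
                "line {}: expected 2 or 3 columns, found {}",
                idx + 1,
                cols.len()
            ));
        }
        if cols[0].is_empty() || cols[1].is_empty() {
            return Err(format!("line {}: key and path must not be empty", idx + 1));
        }
        // col 3 may be missing or empty (optional) or "required".
        let required = match cols.get(2).copied().unwrap_or("") {
            "" => false,
            "required" => true,
            other => return Err(format!("line {}: unknown flag '{}'", idx + 1, other)),
        };
        rules.push(Rule {
            key: cols[0].to_string(),
            path: cols[1].to_string(),
            required,
        });
    }
    Ok(rules)
}

fn main() {
    let text = "title,unittitle,required\n\
                unit_id,unitid,required\n\
                repository,repository/corpname,required\n";
    match parse_rules(text) {
        Ok(rules) => {
            for rule in &rules {
                println!("{:?}", rule);
            }
        }
        Err(err) => eprintln!("invalid rules file: {}", err),
    }
}
```

A rule with an empty third column (for example `note,odd,`) would parse as optional under these assumptions, so a missing value for it would not be treated as an error.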
To drop and re-initialize the database:
# adjust envvar values as appropriate
PGHOST=localhost PGUSER=admin PGPASSWORD=admin psql \
-c "DROP DATABASE harvester;"
./scripts/init_db.sh

Resetting failed index records to pending via the database:
UPDATE oai_records
SET index_status = 'pending', index_attempts = 0, index_message = ''
WHERE index_status = 'index_failed'
AND endpoint = 'https://example.com/oai';
# start postgres + postgrest
docker compose up -d postgres postgrest
# build harvester image via compose
docker compose build harvester
# run harvest (uses defaults from .env)
docker compose run --rm harvester harvest https://test.archivesspace.org/oai
# run index (override SOLR_URL as needed)
docker compose run --rm \
-e SOLR_URL=http://host.docker.internal:8983/solr/arclight \
harvester index arclight \
"allen-doe-research-center" \
"https://test.archivesspace.org/oai" \
"Allen Doe Research Center"

Override any default with -e KEY=value on docker compose run.
For rootless Docker, if bind mount permissions fail, add --user root to docker compose run commands.
If you get an error like:
{"code":"PGRST205","details":null,"hint":null,"message":"Could not find the table 'public.oai_records' in the schema cache"}

Run docker compose restart postgrest.