Install dependencies using uv (recommended):
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -e .
# For development dependencies (including testing and linting tools)
uv pip install -e ".[dev]"

To generate or update the lockfile:

uv pip compile pyproject.toml -o uv.lock

To install dependencies from the lockfile:

uv pip install -r uv.lock

The default settings should work for local development. If you need to tweak the environment variables, copy the .env.example file to .env and make your changes:

cp .env.example .env

Run the Docker containers:
The application uses several services:
- ParadeDB (PostgreSQL-compatible database)
  - Port: 2345
  - Default credentials: postgres/postgres
  - Database: btaa_geospatial_api
- Elasticsearch (Search engine)
  - Port: 9200
  - Single-node configuration
  - Security disabled for development
  - 2GB memory allocation
  - Index: btaa_geospatial_api
- Redis (Caching and message broker)
  - Port: 6379
  - Persistence enabled
  - Used for API caching and Celery tasks
- DuckDB (Embedded analytical database)
  - Runs in-process with the Python application
  - No separate service or port required
  - Database file: data/duckdb/btaa_geospatial_api.duckdb
  - Used for analytical queries and data processing
  - Access via the Python duckdb package (see the sketch after this list)
- Celery Worker (Background task processor)
  - Processes asynchronous tasks
  - Connected to Redis and ParadeDB
  - Logs available in ./logs directory
- Flower (Celery monitoring)
  - Port: 5555
  - Web interface for monitoring Celery tasks
  - Access at http://localhost:5555
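Because DuckDB runs in-process, it can be opened directly from Python. A minimal sketch, assuming the database file path listed above (the table name in the commented query is purely illustrative):

```python
import duckdb

# Open the embedded analytical database file noted above
con = duckdb.connect("data/duckdb/btaa_geospatial_api.duckdb")

# Run an analytical query; the table name below is illustrative, not the actual schema
# print(con.execute("SELECT COUNT(*) FROM resources").fetchone())

con.close()
```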
Start all services:
docker compose up -d

Import a flat file of GeoBlacklight OpenGeoMetadata Aardvark test fixture data:
cd data
psql -h localhost -p 2345 -U postgres -d btaa_geospatial_api -f btaa_geospatial_api.txt

Run the API server:
uvicorn main:app --reload

This script will create all the database tables needed for the application:

.venv/bin/python run_migrations.py

This script will populate the item_relationships triplestore:

.venv/bin/python scripts/populate_relationships.py

This script will create and populate the application ES index:

.venv/bin/python run_index.py

This script will download and import all the gazetteer data:

.venv/bin/python run_gazetteers.py

The application is also available as a Docker image on Docker Hub. You can pull and run the image using the following commands:
docker pull ewlarson/btaa-geospatial-api:latest
docker run -d -p 8000:8000 ewlarson/btaa-geospatial-api:latest

This will start the API server on port 8000.
Returns the API documentation.
The API supports aggressive Redis-based caching to improve performance. Caching can be controlled through environment variables:
# Enable/disable caching
ENDPOINT_CACHE=true
# Redis connection settings
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=optional_password
REDIS_DB=0
# Cache TTL settings (in seconds)
DOCUMENT_CACHE_TTL=86400 # 24 hours
SEARCH_CACHE_TTL=3600 # 1 hour
SUGGEST_CACHE_TTL=7200 # 2 hours
LIST_CACHE_TTL=43200 # 12 hours
CACHE_TTL=43200 # Default TTL (12 hours)
# Rate Limiting settings
RATE_LIMIT_ENABLED=true # Enable/disable rate limiting
RATE_LIMIT_REDIS_DB=2 # Redis database number for rate limiting (uses same Redis instance)
# API Usage Analytics Enrichment (User Agent Parsing)
# Note: Geocoding has been removed due to licensing complexity
When caching is enabled:
- API responses are cached in Redis based on the endpoint and its parameters
- Search results are cached for faster repeated queries
- Resource details are cached to reduce database load
- Suggestions are cached to improve autocomplete performance
The cache is automatically invalidated when:
- Resources are created, updated, or deleted
- The Elasticsearch index is rebuilt
You can manually clear the cache using:
GET /api/v1/cache/clear?cache_type=search|resource|suggest|all
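For example, to clear just the search cache from Python (a sketch assuming the API is running locally on port 8000):

```python
import requests

# Clear the search cache; use cache_type=all to clear everything
resp = requests.get(
    "http://localhost:8000/api/v1/cache/clear",
    params={"cache_type": "search"},
)
print(resp.status_code, resp.json())
```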
The API automatically logs all requests to the api_usage_logs table for analytics purposes. This includes:
- Request metadata (endpoint, method, status code, response time)
- API key and tier information
- IP address and user agent
- Referrer and UTM parameters
- Query parameters (stored in JSON properties field)
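A quick look at recent traffic can be pulled straight from ParadeDB. This is a sketch: the connection details come from the Docker service notes above, but the driver (psycopg2) and the column names (endpoint, method, status_code) are assumptions based on the fields listed here, not a documented schema:

```python
import psycopg2

# Connection details from the ParadeDB service described earlier
conn = psycopg2.connect(
    host="localhost", port=2345,
    user="postgres", password="postgres",
    dbname="btaa_geospatial_api",
)
with conn, conn.cursor() as cur:
    # Column names are assumed from the logged fields above
    cur.execute(
        "SELECT endpoint, method, status_code, COUNT(*) "
        "FROM api_usage_logs GROUP BY 1, 2, 3 ORDER BY 4 DESC LIMIT 10"
    )
    for row in cur.fetchall():
        print(row)
```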
The public API supports service tiers and API key–based rate limiting.
- Service tiers are defined in the api_service_tiers table and seeded by the migrations into tiers such as:
  - btaa_primary / btaa_secondary – internal BTAA applications with unlimited access
  - btaa_member_primary / btaa_member_affiliated – member applications with higher limits
  - general_registered – registered external users
  - anonymous – unauthenticated access with the lowest limits
- API keys are stored (hashed) in the api_keys table and associated with a tier.
- Rate limits are enforced per tier, per identifier (API key hash or IP address) using Redis.
Clients can authenticate with an API key in one of three ways (in order of precedence):
- X-API-Key header: X-API-Key: your-api-key-here
- Authorization header with Bearer token: Authorization: Bearer your-api-key-here
- api_key query parameter: GET /api/v1/search?q=roads&api_key=your-api-key-here
If no valid API key is provided, the request is treated as anonymous and uses the anonymous tier’s rate limit.
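For example, the same search request with each authentication style, using Python's requests library (a sketch assuming a local server on port 8000):

```python
import requests

SEARCH_URL = "http://localhost:8000/api/v1/search"
API_KEY = "your-api-key-here"

# 1. X-API-Key header (highest precedence)
requests.get(SEARCH_URL, params={"q": "roads"}, headers={"X-API-Key": API_KEY})

# 2. Authorization header with a Bearer token
requests.get(SEARCH_URL, params={"q": "roads"}, headers={"Authorization": f"Bearer {API_KEY}"})

# 3. api_key query parameter (lowest precedence)
requests.get(SEARCH_URL, params={"q": "roads", "api_key": API_KEY})
```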
Admin users (protected by HTTP Basic auth with ADMIN_USERNAME / ADMIN_PASSWORD) can manage keys and inspect tiers:
- POST /api/v1/admin/api-keys – create a new API key for a given tier_name.
  - Request body: { "tier_name": "anonymous", "name": "optional friendly name" }
  - Response includes the plaintext api_key once, plus key_id and tier_name.
- GET /api/v1/admin/api-keys – list existing keys and their tiers.
- PATCH /api/v1/admin/api-keys/{key_id} – update tier_name, is_active, or name.
- DELETE /api/v1/admin/api-keys/{key_id} – revoke (deactivate) a key.
- GET /api/v1/admin/api-tiers – list all tiers, limits, and descriptions.
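For example, creating a key from Python (a sketch; the admin credentials and local base URL are placeholders for illustration):

```python
import requests

# HTTP Basic auth uses the ADMIN_USERNAME / ADMIN_PASSWORD values from the environment
resp = requests.post(
    "http://localhost:8000/api/v1/admin/api-keys",
    json={"tier_name": "general_registered", "name": "example client"},
    auth=("admin-username", "admin-password"),
)
data = resp.json()
# The plaintext api_key is returned only once -- store it securely
print(data["api_key"], data["key_id"], data["tier_name"])
```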
The admin endpoints are intended for trusted operators only; do not expose them directly to the public internet without appropriate protections (e.g., network restrictions, stronger auth).
Rate limiting is enforced by middleware in front of all non-admin API routes:
- Configuration is controlled via environment variables:
  RATE_LIMIT_ENABLED=true   # Enable/disable rate limiting middleware
  RATE_LIMIT_REDIS_DB=2     # Redis database used for rate limiting
  REDIS_HOST=redis          # Redis host
  REDIS_PORT=6379           # Redis port
  REDIS_PASSWORD=optional_password
- For each request, the middleware:
  - Resolves the caller’s tier from the API key (if provided) or falls back to the anonymous tier.
  - Uses Redis to track the number of requests per minute per (tier_name, identifier), where identifier is the API key hash or client IP (via X-Forwarded-For or socket address).
  - Enforces the tier’s requests_per_minute limit.
When rate limiting is enabled, responses include:
- X-RateLimit-Limit – the allowed number of requests per minute for the current tier (or unlimited).
- X-RateLimit-Remaining – remaining requests in the current window (or unlimited).
- X-RateLimit-Reset – UNIX timestamp when the window resets.
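A sketch of inspecting these headers from Python (assuming a local server on port 8000):

```python
import requests

resp = requests.get(
    "http://localhost:8000/api/v1/search",
    params={"q": "roads"},
    headers={"X-API-Key": "your-api-key-here"},
)
print(resp.headers.get("X-RateLimit-Limit"))
print(resp.headers.get("X-RateLimit-Remaining"))
print(resp.headers.get("X-RateLimit-Reset"))
```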
If a client exceeds its rate limit:
- The API returns HTTP 429 Too Many Requests with a JSON body describing the error.
- The response includes Retry-After and X-RateLimit-* headers indicating when to retry.
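One way a client might honor that backoff (a sketch with error handling trimmed):

```python
import time
import requests

resp = requests.get("http://localhost:8000/api/v1/search", params={"q": "roads"})
if resp.status_code == 429:
    # Wait for the server-suggested interval, then retry once
    time.sleep(int(resp.headers.get("Retry-After", "60")))
    resp = requests.get("http://localhost:8000/api/v1/search", params={"q": "roads"})
```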
API usage logs are automatically enriched in the background with:
- User agent parsing: Browser, operating system, and device type
This enrichment happens asynchronously via Celery tasks to avoid blocking API requests.
Note: IP geocoding (country, region, city, latitude, longitude) has been removed due to licensing complexity with geocoding databases.
To enrich existing API usage logs that were created before enrichment was enabled, you can use the batch enrichment task:
from app.tasks.api_usage_enrichment import enrich_api_usage_logs_batch
# Enrich 100 logs at a time
enrich_api_usage_logs_batch.delay(batch_size=100)

This can be run repeatedly until all logs are enriched.
The API uses OpenAI's ChatGPT API to generate summaries of historical maps and geographic datasets and to identify geographic named entities within them. To use this feature:
- Set your OpenAI API key in the .env file:
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-3.5-turbo
- The summarization service will automatically use this API key to generate summaries.
- The geo_entities service will use the same API to identify and extract geographic named entities from the content.
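Conceptually, the services wrap a chat completion call like the one below. This is a sketch using the openai Python package, not the project's actual service code, and the prompt text is purely illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt only; the real summarization and geo_entities prompts live in the services
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Summarize this historical map record in two sentences."},
        {"role": "user", "content": "Title: Plat book of Dane County, Wisconsin (1890) ..."},
    ],
)
print(response.choices[0].message.content)
```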
The API can process various types of assets to enhance summaries:
- IIIF Images: Extracts metadata and visual content from IIIF image services
- IIIF Manifests: Processes IIIF manifests to extract metadata, labels, and descriptions
- Cloud Optimized GeoTIFFs (COG): Extracts geospatial metadata from COG files
- PMTiles: Processes PMTiles assets to extract tile information
- Downloadable Files: Processes various file types (Shapefiles, Geodatabases, etc.)
To generate a summary for a resource:
POST /api/v1/resources/{id}/summarize
This will trigger an asynchronous task to generate a summary. You can retrieve the summary using:
GET /api/v1/resources/{id}/summaries
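For example, from Python (a sketch; the resource identifier and local base URL are placeholders):

```python
import requests

BASE = "http://localhost:8000/api/v1"
resource_id = "example-resource-id"  # placeholder identifier

# Trigger the asynchronous summarization task
requests.post(f"{BASE}/resources/{resource_id}/summarize")

# Later, fetch any summaries that have been generated
print(requests.get(f"{BASE}/resources/{resource_id}/summaries").json())
```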
The API can identify and extract geographic named entities from resources. This includes:
- Place names
- Geographic coordinates
- Administrative boundaries
- Natural features
- Historical place names
To extract geographic entities:
POST /api/v1/resources/{id}/extract_entities
The response will include:
- Extracted entities with confidence scores
- Geographic coordinates when available
- Links to gazetteer entries
- Historical context when relevant
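For example (a sketch; the resource identifier and base URL are placeholders, and the commented response shape is a guess based on the fields listed above, not a documented schema):

```python
import requests

BASE = "http://localhost:8000/api/v1"
resource_id = "example-resource-id"  # placeholder identifier

resp = requests.post(f"{BASE}/resources/{resource_id}/extract_entities")

# Hypothetical response shape, based on the fields described above:
# {
#   "entities": [
#     {"name": "Dane County", "confidence": 0.93,
#      "coordinates": [-89.4, 43.1], "gazetteer_url": "https://..."}
#   ]
# }
print(resp.json())
```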
The following AI features are planned:
- Metadata summaries
- Imagery summaries
- Tabular data summaries
- OCR text extraction
- Subject enhancements
Data from BTAA GIN. @TODO add license.
Data from GeoNames. License - CC BY 4.0
Data from FAST (Faceted Application of Subject Terminology), which is made available by OCLC Online Computer Library Center, Inc. under the ODC Attribution License.
Data from Who's On First. License
- Docker Image - Published on Docker Hub
- Search - basic search across all text fields
- Search - autocomplete
- Search - spelling suggestions
- Search - more complex search with filters
- Search - pagination
- Search - sorting
- Search - basic faceting
- Performance - Redis caching
- Search - facet include/exclude
- Search - facet alpha and numerical pagination, and search within facets
- Search - advanced/fielded search
- Search - spatial search
- Search Results - thumbnail images (needs improvements)
- Search Results - bookmarked resources
- Item View - citations
- Item View - downloads
- Item View - relations (triplestore)
- Item View - exports (Shapefile, CSV, GeoJSON)
- Item View - export conversions (Shapefile to: GeoJSON, CSV, TSV, etc)
- Item View - code previews (Py, R, Leaflet)
- Item View - embeds
- Item View - allmaps integration (via embeds)
- Item View - data dictionaries
- Item View - web services
- Item View - metadata
- Item View - related resources (vector metadata search)
- Item View - similar images (vector imagery search)
- Collection View
- Place View
- Gazetteer - BTAA Spatial
- Gazetteer - Geonames
- Gazetteer - OCLC Fast (Geographic)
- Gazetteer - Who's on First
- Gazetteer - USGS Geographic Names Information System (GNIS), needed?
- GeoJSONs
- AI - Metadata summaries
- AI - Geographic entity extraction
- AI - Subject enhancements
- AI - Imagery - Summary
- AI - Imagery - OCR'd text
- AI - Tabular data summaries
- API - Analytics (PostHog?)
- API - Authentication/Authorization for "Admin" endpoints
- API - Throttling
- Hierarchical Faceting > Spatial, e.g., https://geo.btaa.org/catalog/p16022coll230:1750
