
S3 / Cloud Storage

RSS-Lance can store all its data on S3 instead of local disk. Both the Python fetcher and Go server use LanceDB's native S3 support (via the Rust object_store crate) — no boto3, no AWS SDK, no extra dependencies.

Prerequisites

  1. An S3 bucket, or an S3-compatible service such as MinIO or Cloudflare R2 (GCS works via its S3-interoperability mode; Azure Blob needs an S3 gateway in front of it)
  2. AWS credentials configured via the standard AWS credential chain

AWS Credentials

RSS-Lance reads credentials the same way the AWS CLI does. If aws s3 ls s3://your-bucket/ works from your terminal, RSS-Lance will work too.

| Method                  | How                                           | Best for                     |
|-------------------------|-----------------------------------------------|------------------------------|
| Shared credentials file | ~/.aws/credentials via aws configure          | Local dev, personal machines |
| Named profile           | AWS_PROFILE=myprofile                         | Multiple AWS accounts        |
| Environment variables   | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY     | Docker, Lambda, CI           |
| EC2 instance role       | Automatic via IMDS                            | EC2 instances                |
| ECS task role           | Automatic via task metadata                   | ECS / Fargate                |

Do not put access keys in config.toml. Use ~/.aws/credentials, environment variables, or IAM roles. This keeps secrets out of your repo and lets you rotate credentials without touching the app.
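To make the resolution order concrete, here is a stdlib-only sketch of the first two steps of the chain (environment variables, then the shared credentials file). It is an illustration of the lookup order, not the actual resolver LanceDB uses, and it omits the IMDS/ECS metadata steps:

```python
import configparser
import os
from pathlib import Path

def resolve_credentials(profile: str = "default"):
    """Sketch of the standard AWS credential chain: environment
    variables first, then the shared credentials file.
    (Real resolvers also try IMDS/ECS metadata; omitted here.)"""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"access_key_id": key, "source": "env"}

    # Fall back to the file aws configure writes
    creds_path = Path(os.environ.get("AWS_SHARED_CREDENTIALS_FILE",
                                     Path.home() / ".aws" / "credentials"))
    if creds_path.exists():
        parser = configparser.ConfigParser()
        parser.read(creds_path)
        profile = os.environ.get("AWS_PROFILE", profile)
        if parser.has_section(profile):
            return {
                "access_key_id": parser[profile]["aws_access_key_id"],
                "source": "shared_credentials_file",
            }
    return None  # chain exhausted; a real resolver would try IMDS next
```

Environment variables winning over the file is why the Docker and CI rows in the table above work without any file on disk.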

Setting up credentials (if you haven't already)

# Install AWS CLI (one-time)
# https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

# Configure credentials - creates ~/.aws/credentials and ~/.aws/config
aws configure
#   AWS Access Key ID: AKIA...
#   AWS Secret Access Key: ****
#   Default region name: us-east-1
#   Default output format: json

# Verify it works
aws s3 ls s3://your-bucket/

On EC2 or ECS, skip aws configure entirely — attach an IAM role to your instance/task and credentials are provided automatically.

Configuration

Edit config.toml:

[storage]
type = "s3"
path = "s3://my-rss-bucket/rss-lance"

# Only needed if not set in ~/.aws/config or AWS_DEFAULT_REGION
# s3_region = "us-east-1"

That's it. Start the fetcher and server as usual:

./run.sh fetch-once
./run.sh server

Example: Full S3 Setup from Scratch

# 1. Create an S3 bucket
aws s3 mb s3://my-rss-feeds --region us-east-1

# 2. Clone and build
git clone https://github.com/sysadminmike/rss-lance rss-lance && cd rss-lance
./build.sh all

# 3. Configure for S3
cat > config.toml << 'EOF'
[storage]
type = "s3"
path = "s3://my-rss-feeds/data"

[fetcher]
interval_minutes = 30
max_concurrent = 5

[server]
host = "127.0.0.1"
port = 8080
EOF

# 4. Add feeds and start
./run.sh demo-data
./run.sh fetch-once
./run.sh server

Open http://127.0.0.1:8080 — your data is now on S3.

Example: Split Fetcher and Server Across Machines

Because the data lives on S3, the fetcher and server don't need to be on the same machine:

  Machine A (Linux server)            S3 bucket             Machine B (laptop)
  ┌───────────────────┐          ┌──────────────┐          ┌──────────────────┐
  │  Feed Fetcher     │─writes──►│ s3://my-rss/ │◄──reads──│  Go Server       │
  │  (cron / daemon)  │          │              │          │  (serves UI)     │
  └───────────────────┘          └──────────────┘          └──────────────────┘

Both machines just need:

  • The same config.toml pointing to s3://my-rss/data
  • Valid AWS credentials (each machine can use its own IAM role)

# Machine A — fetcher only (daemon or cron)
./run.sh fetcher

# Machine B — server only
./run.sh server

Example: AWS Lambda Fetcher + Local Server

Run the fetcher as a scheduled Lambda (zero infrastructure when not fetching), and the server on your local machine:

# Lambda fetcher uses an IAM execution role — no credentials to manage
# Local server uses ~/.aws/credentials
./run.sh server
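A scheduled Lambda fetcher needs only a tiny handler. This is a sketch: the fetch-once entry point is an assumption (the project's real module and function names may differ), and the `fetch` parameter exists so the handler can be exercised without the project installed:

```python
def handler(event, context, fetch=None):
    """Scheduled Lambda entry point (sketch). Credentials come from
    the Lambda execution role, so no AWS_* variables are needed.
    `fetch` stands in for the project's fetch-once routine; the
    import below is a hypothetical name, adapt it to the real one."""
    if fetch is None:
        from fetcher import fetch_once  # hypothetical project import
        fetch = fetch_once
    count = fetch()
    return {"statusCode": 200, "articles_fetched": count}
```

Wire it to an EventBridge schedule (e.g. rate(30 minutes)) to match interval_minutes in config.toml.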

MinIO (S3-Compatible)

For self-hosted S3-compatible storage:

[storage]
type = "s3"
path = "s3://my-bucket/rss-lance"
s3_endpoint = "http://localhost:9000"   # MinIO endpoint
s3_region = "us-east-1"                 # required even for MinIO

Set MinIO credentials via environment variables:

export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
./run.sh server

Cloudflare R2

R2 uses S3-compatible API tokens:

[storage]
type = "s3"
path = "s3://my-rss-bucket/data"
s3_endpoint = "https://<account-id>.r2.cloudflarestorage.com"
s3_region = "auto"
export AWS_ACCESS_KEY_ID=<r2-access-key-id>
export AWS_SECRET_ACCESS_KEY=<r2-secret-access-key>
./run.sh server

Note: R2 does not support conditional PUT (PUT-IF-NONE-MATCH). With a single fetcher + single server (the normal setup), this is fine — Lance's optimistic concurrency handles it. For multi-writer setups, see database.md for DynamoDB as an external manifest store.
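To see why conditional PUT matters for multi-writer safety, here is an in-memory sketch of the If-None-Match semantics (not the real object_store API): when two writers race to commit the same manifest key, exactly one PUT succeeds and the loser gets a 412-style conflict and must retry against the next version.

```python
class ObjectStore:
    """In-memory sketch of S3 conditional PUT (If-None-Match: *)."""
    def __init__(self):
        self._objects = {}

    def put_if_none_match(self, key: str, data: bytes) -> bool:
        if key in self._objects:
            return False  # precondition failed (HTTP 412): another writer won
        self._objects[key] = data
        return True

store = ObjectStore()
manifest = "_versions/5.manifest"
writer_a = store.put_if_none_match(manifest, b"commit A")
writer_b = store.put_if_none_match(manifest, b"commit B")
print(writer_a, writer_b)  # True False (writer B must retry at version 6)
```

On a backend without this primitive (like R2), both PUTs would "succeed" and one commit would silently overwrite the other, which is why multi-writer setups there need an external manifest store.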

Security Model

With S3, your security perimeter is your AWS IAM policy — not your application:

  • Bucket policy controls who can read/write
  • IAM roles grant access without long-lived keys
  • Server-side encryption (SSE-S3 or SSE-KMS) encrypts data at rest
  • VPC endpoints keep traffic off the public internet
  • CloudTrail logs every access

No firewalls, no nginx, no reverse proxy, no application-level auth needed.

Minimal IAM Policy

The fetcher and server need these S3 permissions on the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-rss-bucket",
        "arn:aws:s3:::my-rss-bucket/*"
      ]
    }
  ]
}
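When debugging 403s, it can help to check a policy document against the four required actions programmatically. A small stdlib sketch (the required-actions set comes from the policy above; statement handling here is simplified and ignores Deny statements and resource matching):

```python
import json

REQUIRED_ACTIONS = {"s3:GetObject", "s3:PutObject",
                    "s3:DeleteObject", "s3:ListBucket"}

def missing_actions(policy_json: str) -> set:
    """Return required S3 actions not granted by any Allow statement."""
    policy = json.loads(policy_json)
    granted = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") == "Allow":
            actions = stmt.get("Action", [])
            granted.update([actions] if isinstance(actions, str) else actions)
    return REQUIRED_ACTIONS - granted

policy = """{
  "Version": "2012-10-17",
  "Statement": [{"Effect": "Allow",
                 "Action": ["s3:GetObject", "s3:ListBucket"],
                 "Resource": ["arn:aws:s3:::my-rss-bucket",
                              "arn:aws:s3:::my-rss-bucket/*"]}]
}"""
print(sorted(missing_actions(policy)))  # ['s3:DeleteObject', 's3:PutObject']
```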

For read-only server access (if the server never needs to mark articles as read/starred from this machine):

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::my-rss-bucket",
    "arn:aws:s3:::my-rss-bucket/*"
  ]
}

How It Works Under the Hood

Both the Python lancedb library and the Go lancedb-go SDK use the Rust object_store crate for storage. This is the same storage layer used by Apache Arrow, Delta Lake, and other data infrastructure tools.

When you pass s3://bucket/path as the data path:

  1. LanceDB resolves AWS credentials via the standard chain (env vars → ~/.aws/credentials → IMDS)
  2. Reads use S3 GET with range requests (only fetches the bytes needed)
  3. Writes create new immutable files via S3 PUT
  4. MVCC commits use PUT-IF-NONE-MATCH (conditional PUT) to ensure exactly one writer wins a race
  5. DuckDB's Lance extension also reads from S3 via httpfs for SQL queries

No boto3 dependency. No AWS SDK. The Rust crate reads the same credential files that boto3/AWS CLI use, but it's a completely independent implementation.
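Step 2 (range requests) is what keeps reads cheap: instead of downloading a whole Lance file, the client asks for a byte range and S3 returns just that slice. A sketch of the server-side semantics of a "bytes=start-end" header (inclusive end, per the HTTP Range spec):

```python
def serve_range(obj: bytes, range_header: str) -> bytes:
    """Sketch of an S3-style ranged GET: 'bytes=start-end' with an
    inclusive end, which is how LanceDB can fetch only a column chunk
    or file footer instead of the whole object."""
    unit, _, spec = range_header.partition("=")
    assert unit == "bytes"
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else len(obj) - 1  # open-ended: to EOF
    return obj[start:end + 1]

data = bytes(range(100))
footer = serve_range(data, "bytes=96-99")  # e.g. read only a file footer
print(list(footer))  # [96, 97, 98, 99]
```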

Costs

S3 costs for a typical single-user RSS reader are minimal:

| Operation                       | Approximate cost |
|---------------------------------|------------------|
| Storage (1000 articles)         | ~$0.01/month     |
| GET requests (browsing)         | ~$0.01/month     |
| PUT requests (each fetch cycle) | ~$0.01/month     |
| Total                           | < $0.05/month    |

The fetcher batches all writes into a single Lance append per cycle to minimise PUT costs.
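The batching pattern is the standard buffer-then-flush idiom. A sketch under stated assumptions: the class and the `append` callback below are illustrative stand-ins for the real fetcher's Lance table write, not the project's actual code.

```python
class BatchedWriter:
    """Sketch of buffer-then-flush batching: collect articles during a
    fetch cycle, then write them as a single append. One append per
    cycle means one set of S3 PUTs instead of one per article."""
    def __init__(self, append):
        self._append = append  # stands in for the Lance table append
        self._buffer = []

    def add(self, article: dict):
        self._buffer.append(article)

    def flush(self) -> int:
        if self._buffer:
            self._append(self._buffer)
        n = len(self._buffer)
        self._buffer = []
        return n

writes = []  # records each append call
w = BatchedWriter(writes.append)
for i in range(250):
    w.add({"id": i, "title": f"article {i}"})
w.flush()
print(len(writes))  # 1 (a single append, regardless of article count)
```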

Troubleshooting

"NoCredentialProviders" or "could not find credential"

AWS credentials are not configured. Run aws configure or set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY environment variables.

"Access Denied" or "403 Forbidden"

The IAM user/role doesn't have permission to access the bucket. Check the bucket policy and IAM policy.

"NoSuchBucket"

The bucket doesn't exist or is in a different region. Verify with aws s3 ls s3://your-bucket/.

DuckDB reads fail but Lance writes succeed

DuckDB uses its own httpfs extension for S3 access. It reads credentials from the same environment variables. If the Go server was started in a different shell without AWS_* variables, DuckDB won't have them. Make sure credentials are available in the server's environment.

High S3 costs

If costs are higher than expected, check compaction settings. Frequent small writes create many small files. The fetcher's batching and auto-compaction are designed to minimise this — make sure compaction thresholds are reasonable (see configuration.md).