bucket-scrapper

Search through compressed S3 objects at scale. Stream-decompress, filter lines by regex, and forward results to local files or an HTTP API -- without ever buffering a full object in memory.

Installation

cargo build --release
# Binary: target/release/bucket-scrapper

How it works

Three decisions drive every run:

  1. Which files to read from S3? -- Buckets, prefix paths, date ranges, key filters
  2. Which lines to keep? -- All lines, or only those matching a regex
  3. Where to send them? -- Local zstd files, or an HTTP endpoint

Everything streams: downloads decompress on the fly, lines are filtered as they appear, and results are written continuously.
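
As a rough sketch of that pipeline in Rust (illustrative only; this assumes the async-compression, tokio, and regex crates and is not the project's actual code):

use async_compression::tokio::bufread::ZstdDecoder;
use regex::Regex;
use tokio::io::{AsyncBufRead, AsyncBufReadExt, BufReader};

// Decompress a zstd byte stream line-by-line and keep only matching lines,
// so a full object is never held in memory.
async fn filter_stream<R: AsyncBufRead + Unpin>(
    body: R,                 // e.g. an S3 GetObject body
    pattern: &Regex,
    mut sink: impl FnMut(String),
) -> std::io::Result<()> {
    let mut lines = BufReader::new(ZstdDecoder::new(body)).lines();
    while let Some(line) = lines.next_line().await? {
        if pattern.is_match(&line) {
            sink(line); // forward immediately; nothing accumulates
        }
    }
    Ok(())
}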

1. Selecting S3 files

Config file

A config file (config-scrapper.yml by default) defines which buckets and prefix paths to scan:

buckets:
  - bucket: my-log-bucket
    path:
      - static_path: "log-archives/"
      - datefmt: "dt=%Y%m%d/hour=%H"
    only_prefix_patterns:          # optional: only scan prefixes matching these regexes
      - "service-a"

region: eu-west-3
output_dir: ./scrapper-output

Path components are either literal strings (static_path) or date patterns (datefmt). Supported date formats: dt=%Y%m%d/hour=%H (strftime) or 2006/01/02/15 (Go reference time). Each bucket must have at least one datefmt component to avoid listing the entire bucket.

Date range (required)

Every run needs at least a start time. The tool generates S3 prefixes for each hour in the range:

bucket-scrapper -s 2024-01-15T10:00:00Z -e 2024-01-15T12:00:00Z
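
Combined with the config above, this run expands to one prefix per hour, along these lines (exact end-boundary handling may differ):

log-archives/dt=20240115/hour=10
log-archives/dt=20240115/hour=11
log-archives/dt=20240115/hour=12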

Key filter

Further narrow which S3 objects to process with a regex on the object key:

bucket-scrapper -s 2024-01-15T10:00:00Z -f "service-a.*\.json\.zst$"

2. Filtering lines

By default, all lines from matching objects are forwarded. Add a regex to keep only what you need:

# Lines containing "ERROR" followed by "timeout"
bucket-scrapper -s 2024-01-15T10:00:00Z --line-pattern-regex "ERROR.*timeout"

# Case insensitive
bucket-scrapper -s 2024-01-15T10:00:00Z --line-pattern-regex "failed" -i

Omit --line-pattern-regex to extract everything (useful for bulk re-export).

3. Choosing output

File output (default)

Results are written as zstd-compressed files in output_dir, grouped by date/hour prefix:

bucket-scrapper -s 2024-01-15T10:00:00Z --line-pattern-regex "ERROR"
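
A hypothetical layout of the results under output_dir (actual file naming may differ):

scrapper-output/
  dt=20240115/hour=10/<results>.zst
  dt=20240115/hour=11/<results>.zst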

HTTP output

Sends matched lines as zstd-compressed NDJSON batches to an HTTP endpoint, with adaptive (AIMD) throttling and 429 back-off:

bucket-scrapper -s 2024-01-15T10:00:00Z \
  --line-pattern-regex "ERROR" \
  --http-output \
  --http-url "https://logs.example.com/api/v1/logs" \
  --http-bearer-auth "your-token"

URL and token can also come from environment variables (HTTP_URL, HTTP_BEARER_AUTH).
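
For example:

export HTTP_URL="https://logs.example.com/api/v1/logs"
export HTTP_BEARER_AUTH="your-token"
bucket-scrapper -s 2024-01-15T10:00:00Z --http-output --line-pattern-regex "ERROR"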

HTTP defaults can be set in the config file:

http_output:
  url: https://logs.example.com/api/v1/logs
  api_key: your-api-key
  timeout_secs: 30

AWS Authentication

Standard AWS SDK credential chain: environment variables, ~/.aws/credentials, IAM role, or aws sso login. Custom CA bundles via AWS_CA_BUNDLE.
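
For example, to trust a corporate CA (the path is a placeholder):

AWS_CA_BUNDLE=/path/to/corporate-ca.pem bucket-scrapper -s 2024-01-15T10:00:00Z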

CLI Reference

Usage: bucket-scrapper [OPTIONS] --start <START>

General

Flag                           Default               Description
-s, --start <START>            (required)            Start date (ISO 8601)
-e, --end <END>                now                   End date (ISO 8601)
--config <CONFIG>              config-scrapper.yml   Config file path
-r, --region <REGION>          eu-west-3             AWS region
-v, --log-level <LOG_LEVEL>    info                  trace, debug, info, warn, error
--log-format <LOG_FORMAT>      text                  text or json

File selection & filtering

Flag                            Default       Description
-f, --filter <FILTER>           (none)        Regex on S3 object keys
--line-pattern-regex <REGEX>    (all lines)   Regex to filter lines
-i, --ignore-case               false         Case-insensitive matching

HTTP output

Flag                  Default                    Description
--http-output         false                      Enable HTTP output mode
--http-url            HTTP_URL env var           Endpoint URL
--http-bearer-auth    HTTP_BEARER_AUTH env var   Bearer token
--http-batch-max-mb   2                          Max batch size (MB)
--http-timeout        30                         Request timeout (seconds)

AIMD throttle

Flag                          Default   Description
--max-submission-time         3.0       Batch time threshold in seconds (0 = disable)
--max-upload-rate             0         Rate limit in MB/s (0 = unlimited)
--http-aimd-decrease-factor   0.15      Multiplicative decrease on congestion
--http-aimd-increase          1.0       Additive increase per healthy batch (MB/s)
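
A minimal sketch of how these parameters interact (illustrative Rust, not the project's actual code):

// Additive-increase / multiplicative-decrease rate control, using the
// defaults above; the real implementation may differ in detail.
struct AimdThrottle {
    rate_mb_s: f64,       // current target upload rate
    increase: f64,        // --http-aimd-increase (MB/s per healthy batch)
    decrease_factor: f64, // --http-aimd-decrease-factor
    max_rate: f64,        // --max-upload-rate (0 = unlimited)
}

impl AimdThrottle {
    // Called after each batch: grow additively while uploads are healthy,
    // back off multiplicatively on congestion (slow batch or HTTP 429).
    fn on_batch(&mut self, congested: bool) {
        if congested {
            self.rate_mb_s *= 1.0 - self.decrease_factor; // e.g. -15%
        } else {
            self.rate_mb_s += self.increase; // e.g. +1.0 MB/s
        }
        if self.max_rate > 0.0 {
            self.rate_mb_s = self.rate_mb_s.min(self.max_rate);
        }
    }
}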

Performance tuning

Flag                         Default          Description
--max-parallel               32               Concurrent S3 downloads
--filter-tasks               cpu/2            Regex filter workers
--line-buffer-size           1000             Line channel capacity
--max-retries                10               Download retry attempts
--retry-delay                2                Initial retry delay (seconds)
--progress-interval          3                Progress report interval (seconds)
--compression-level          3                Zstd level (1-22)
--memory-limit-gb            0                Memory limit via setrlimit (0 = none)
--client-max-age             60               S3 client max age (minutes)
--http-line-channel-size     1000             Line channel before compressors
--http-compressor-tasks      cpu/8            Zstd compressor tasks
--http-upload-tasks          4x compressors   Concurrent upload tasks
--http-upload-channel-size   4                Batch channel size
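
Roughly, these knobs map onto successive pipeline stages (inferred from the flag names; the exact stage layout is illustrative):

S3 downloads (--max-parallel, --max-retries)
  -> line channel (--line-buffer-size)
  -> regex filter workers (--filter-tasks)
  -> line channel (--http-line-channel-size, HTTP mode)
  -> zstd compressors (--http-compressor-tasks, --compression-level)
  -> batch channel (--http-upload-channel-size)
  -> upload tasks (--http-upload-tasks)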

Profiling

CPU: Use samply with the profiling build profile.
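
For example (paths assume cargo's standard layout for a profile named profiling):

cargo build --profile profiling
samply record ./target/profiling/bucket-scrapper -s 2024-01-15T10:00:00Z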

Memory: cargo build --profile profiling --features dhat-heap, then load the generated profiler.json in dhat's dh_view viewer.
