Skip to content

feat: add opt-in file persistence for DuckDB#300

Open
robertjbass wants to merge 2 commits intoPostHog:mainfrom
robertjbass:feat/file-persistence
Open

feat: add opt-in file persistence for DuckDB#300
robertjbass wants to merge 2 commits intoPostHog:mainfrom
robertjbass:feat/file-persistence

Conversation

@robertjbass
Copy link

@robertjbass robertjbass commented Mar 10, 2026

Summary

Adds a file_persistence config option that stores DuckDB data in <data_dir>/<username>.duckdb instead of using per-connection in-memory databases.

When enabled, DuckDB memory-maps the file and serves from RAM, so performance stays close to in-memory while data persists across client disconnects and duckgres restarts. The default behavior is unchanged, file_persistence defaults to false.

Motivation

I'm using duckgres to expose DuckDB over the PostgreSQL wire protocol in a managed database hosting service. Users create a DuckDB instance and get a postgresql:// connection string they can use from any PG client.

The current :memory: behavior means every connection starts with an empty DuckDB and all data is lost on disconnect. With file_persistence: true, data persists exactly as users expect from a hosted database, similar to how Redis uses memory-mapped files for in-memory performance with disk durability.

This is fully opt-in and backward-compatible. Anyone using duckgres without this flag sees no change in behavior.

Changes

  • server/server.go: openBaseDB() uses filepath.Join(cfg.DataDir, username+".duckdb") when FilePersistence is true (falls back to :memory: if DataDir or username is empty)
  • config_resolution.go: Wire FilePersistence through YAML → env → CLI resolution (same pattern as ProcessIsolation)
  • main.go: Add file_persistence YAML field, --file-persistence CLI flag, help text for DUCKGRES_FILE_PERSISTENCE env var

Config

# YAML
file_persistence: true
data_dir: "/var/lib/duckgres/data"
# Environment variable
DUCKGRES_FILE_PERSISTENCE=true

# CLI flag
duckgres --file-persistence --data-dir /var/lib/duckgres/data

Follows the existing YAML → env → CLI precedence chain.

Tests

  • 5 config resolution tests: YAML, env, env-overrides-file, CLI-overrides-env, default-false
  • 4 server unit tests: in-memory default, file persistence with data survival across reopen, fallback to :memory: when DataDir is empty, fallback when username is empty

All existing tests continue to pass.

Add a `file_persistence` config option (YAML, env var, CLI flag) that
stores DuckDB data in `<data_dir>/<username>.duckdb` instead of using
per-connection in-memory databases.

When enabled, DuckDB memory-maps the file and serves from RAM — giving
Redis-like performance with disk durability. Data persists across
client disconnects and duckgres restarts.

The default behavior is unchanged: `file_persistence` defaults to false,
so every connection still gets an isolated ephemeral in-memory DuckDB.

Config resolution follows the existing YAML → env → CLI precedence:
- YAML: `file_persistence: true`
- Env: `DUCKGRES_FILE_PERSISTENCE=true`
- CLI: `--file-persistence`
@fuziontech fuziontech self-requested a review March 11, 2026 22:46
Copy link
Member

@fuziontech fuziontech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The implementation is clean and follows project patterns well. A few issues to address:

Critical

1. Path traversal via username (Security)

The username comes directly from the PG client startup message and is used unsanitized in filepath.Join. While the username must match an entry in cfg.Users to authenticate, filepath.Join does not prevent traversal — filepath.Join("/data", "../etc/cron.d/evil.duckdb") resolves to /etc/cron.d/evil.duckdb. If an admin configures a user like ../../tmp/pwned, the DB file escapes DataDir.

Recommend rejecting usernames containing /, \, or .. when FilePersistence is enabled, or validating the resolved path stays under DataDir.

2. Concurrent connections to the same file (Correctness)

Each PG connection calls openBaseDBsql.Open("duckdb", dsn) independently. With file persistence, two connections from the same user will both try to open the same .duckdb file. DuckDB uses a write-ahead log with single-writer file locking — the second connection will get a lock error.

This is a fundamental design issue for file-backed mode. The PR mentions a hosted database service use case, which implies multiple client connections per user. Need to share a single *sql.DB per user across connections (requires a connection pool/registry at the Server level), or document the limitation clearly.

Minor

3. Silent fallback to :memory: is surprising

When FilePersistence is true but DataDir is empty, the code silently falls back to in-memory. A user who sets file_persistence: true but forgets data_dir would unknowingly lose all data on disconnect. Should return a validation error at startup, or at minimum log a warning.

4. Missing DataDir directory creation

openBaseDB doesn't ensure DataDir exists before opening the file. A file-backed DB open will fail immediately if the directory doesn't exist. Should add os.MkdirAll or validate at startup.

- Reject usernames with path traversal characters (/, \, ..) in file-persistence mode
- Add per-user shared DB pool so concurrent connections reuse the same DuckDB file handle
- Warn and disable file_persistence when data_dir is empty instead of silent fallback
- Create data directory with os.MkdirAll before opening file-backed DBs
@robertjbass robertjbass force-pushed the feat/file-persistence branch from 7eec448 to 45aa772 Compare March 12, 2026 03:41
@robertjbass
Copy link
Author

Thanks for the review. All points are completely valid and are addressed as follows (45aa772):

1. Path traversal (Critical):

openBaseDB now rejects usernames containing /, \, or .. before building the file path. Went with the reject approach rather than resolved-path validation since it's simpler and the characters have no legitimate use in usernames.

2. Concurrent connections (Critical):

Added a per-user shared *sql.DB pool on Server (acquireFileDB/releaseFileDB with ref counting). Multiple PG connections for the same user share one *sql.DB and each gets a pinned *sql.Conn from Go's built-in connection pool for transaction isolation. Credential refresh runs once per pool entry rather than per connection. The pool entry is closed when the last connection disconnects.

3. Silent fallback (Minor):

resolveEffectiveConfig now warns and disables file_persistence when data_dir is empty, so misconfigurations surface at startup instead of silently losing data.

4. DataDir creation (Minor):

openBaseDB calls os.MkdirAll before opening the file. main.go already creates DataDir at startup, but openBaseDB is a public function so it should be self-sufficient.


All four have test coverage:

  • path traversal (4 cases)
  • pool ref counting
  • config validation
  • nested directory creation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants