feat: add opt-in file persistence for DuckDB#300
feat: add opt-in file persistence for DuckDB#300robertjbass wants to merge 2 commits intoPostHog:mainfrom
Conversation
Add a `file_persistence` config option (YAML, env var, CLI flag) that stores DuckDB data in `<data_dir>/<username>.duckdb` instead of using per-connection in-memory databases. When enabled, DuckDB memory-maps the file and serves from RAM — giving Redis-like performance with disk durability. Data persists across client disconnects and duckgres restarts. The default behavior is unchanged: `file_persistence` defaults to false, so every connection still gets an isolated ephemeral in-memory DuckDB. Config resolution follows the existing YAML → env → CLI precedence: - YAML: `file_persistence: true` - Env: `DUCKGRES_FILE_PERSISTENCE=true` - CLI: `--file-persistence`
fuziontech
left a comment
There was a problem hiding this comment.
Code Review
The implementation is clean and follows project patterns well. A few issues to address:
Critical
1. Path traversal via username (Security)
The username comes directly from the PG client startup message and is used unsanitized in filepath.Join. While the username must match an entry in cfg.Users to authenticate, filepath.Join does not prevent traversal — filepath.Join("/data", "../etc/cron.d/evil.duckdb") resolves to /etc/cron.d/evil.duckdb. If an admin configures a user like ../../tmp/pwned, the DB file escapes DataDir.
Recommend rejecting usernames containing /, \, or .. when FilePersistence is enabled, or validating the resolved path stays under DataDir.
2. Concurrent connections to the same file (Correctness)
Each PG connection calls openBaseDB → sql.Open("duckdb", dsn) independently. With file persistence, two connections from the same user will both try to open the same .duckdb file. DuckDB uses a write-ahead log with single-writer file locking — the second connection will get a lock error.
This is a fundamental design issue for file-backed mode. The PR mentions a hosted database service use case, which implies multiple client connections per user. Need to share a single *sql.DB per user across connections (requires a connection pool/registry at the Server level), or document the limitation clearly.
Minor
3. Silent fallback to :memory: is surprising
When FilePersistence is true but DataDir is empty, the code silently falls back to in-memory. A user who sets file_persistence: true but forgets data_dir would unknowingly lose all data on disconnect. Should return a validation error at startup, or at minimum log a warning.
4. Missing DataDir directory creation
openBaseDB doesn't ensure DataDir exists before opening the file. A file-backed DB open will fail immediately if the directory doesn't exist. Should add os.MkdirAll or validate at startup.
- Reject usernames with path traversal characters (/, \, ..) in file-persistence mode - Add per-user shared DB pool so concurrent connections reuse the same DuckDB file handle - Warn and disable file_persistence when data_dir is empty instead of silent fallback - Create data directory with os.MkdirAll before opening file-backed DBs
7eec448 to
45aa772
Compare
|
Thanks for the review. All points are completely valid and are addressed as follows (45aa772): 1. Path traversal (Critical): openBaseDB now rejects usernames containing 2. Concurrent connections (Critical): Added a per-user shared 3. Silent fallback (Minor):
4. DataDir creation (Minor):
All four have test coverage:
|
Summary
Adds a
file_persistenceconfig option that stores DuckDB data in<data_dir>/<username>.duckdbinstead of using per-connection in-memory databases.When enabled, DuckDB memory-maps the file and serves from RAM, so performance stays close to in-memory while data persists across client disconnects and duckgres restarts. The default behavior is unchanged,
file_persistencedefaults tofalse.Motivation
I'm using duckgres to expose DuckDB over the PostgreSQL wire protocol in a managed database hosting service. Users create a DuckDB instance and get a
postgresql://connection string they can use from any PG client.The current
:memory:behavior means every connection starts with an empty DuckDB and all data is lost on disconnect. Withfile_persistence: true, data persists exactly as users expect from a hosted database, similar to how Redis uses memory-mapped files for in-memory performance with disk durability.This is fully opt-in and backward-compatible. Anyone using duckgres without this flag sees no change in behavior.
Changes
server/server.go:openBaseDB()usesfilepath.Join(cfg.DataDir, username+".duckdb")whenFilePersistenceis true (falls back to:memory:ifDataDirorusernameis empty)config_resolution.go: WireFilePersistencethrough YAML → env → CLI resolution (same pattern asProcessIsolation)main.go: Addfile_persistenceYAML field,--file-persistenceCLI flag, help text forDUCKGRES_FILE_PERSISTENCEenv varConfig
Follows the existing YAML → env → CLI precedence chain.
Tests
:memory:whenDataDiris empty, fallback whenusernameis emptyAll existing tests continue to pass.