Skip to content

## Fix SQL query validation: replace strictValidation with SqlValidation enum#1881

Open
zaleslaw wants to merge 4 commits into
masterfrom
issue-1671
Open

## Fix SQL query validation: replace strictValidation with SqlValidation enum#1881
zaleslaw wants to merge 4 commits into
masterfrom
issue-1671

Conversation

@zaleslaw
Copy link
Copy Markdown
Collaborator

Closes #1671


Problem

The existing SQL validation rejected valid queries that any database would happily accept:

-- Rejected: semicolon inside a string literal
SELECT regexp_split_to_array(col, ';') FROM t

-- Rejected: CTE (even though WITH was in the allowlist)
WITH recent AS (SELECT * FROM events) SELECT * FROM recent

-- Rejected: comments — treated as injection attempt
SELECT * FROM sale WHERE amount > 100 -- filter by amount

Root cause: validation ran regex patterns against the raw SQL string, so a semicolon inside 'a;b' triggered the same rule as SELECT 1; DROP TABLE t. Comments were blocked entirely, preventing legitimate annotated queries.

Additionally, strictValidation: Boolean = true was the wrong abstraction — a boolean gives no hint about what is being validated or why, and the default-on behaviour meant every CTE or function call with a semicolon failed immediately.


Solution

1. SqlValidation enum replaces strictValidation: Boolean

enum class SqlValidation {
    ReadOnly,  // blocks DDL/DML; allows SELECT, WITH, VALUES, TABLE, EXPLAIN
    None,      // no validation — database handles rejection
}

Default changed to SqlValidation.None. readSqlQuery is a developer API; the database is the authoritative SQL validator. ReadOnly is opt-in for LLM-generated SQL or other less-trusted query sources.

2. Tokenizer-based validation in ReadOnly mode

stripSqlNoise() does a character-by-character pass stripping single-quoted strings ('...' with '' escape), double-quoted identifiers, line comments (--), and block comments (/* */) before any keyword matching. No external parser dependency.

Three checks on the stripped SQL:

  1. First token must be in {SELECT, WITH, VALUES, TABLE, EXPLAIN}
  2. No ; followed by non-whitespace (multiple statements)
  3. No DDL/DML keyword (DROP, DELETE, INSERT, UPDATE, CREATE, ALTER, TRUNCATE, GRANT, REVOKE, MERGE, EXEC, EXECUTE) as a whole word

3. Table name validation stays always strict

readSqlTable generates SQL from the table name (SELECT * FROM <name>). That makes it a different threat model from user-provided SQL. The validation parameter was removed from all readSqlTable overloads — table names are always validated by regex (TABLE_NAME_VALID_PATTERN). hasForbiddenPatterns() removed entirely.


Code changes

File Change
validationUtil.kt Added SqlValidation enum; stripSqlNoise() tokenizer; rewrote isValidSqlQuery() on top of tokenizer; simplified isValidTableName() to regex-only; removed FORBIDDEN_PATTERNS_REGEX / hasForbiddenPatterns()
readJdbc.kt All 9 public functions: strictValidation: Boolean = truevalidation: SqlValidation = SqlValidation.None; readSqlTable overloads: validation parameter removed, table name always validated; fixed private readTableAsDataFrame leftover call
readDataFrameSchema.kt Removed 2 hardcoded strictValidation = true arguments

Test changes

  • readSqlQuery should reject malicious SQL queries → split into:
    • readSqlQuery with ReadOnly validation should block multi-statement and DDL queries — only queries that should actually be blocked
    • readSqlQuery with ReadOnly validation should allow comments and semicolons inside literals — explicitly verifies the false positives are gone
    • readSqlQuery with default SqlValidation None does not throw for any query — verifies default passes everything through
  • readFromTable should work with non-standard table names when strictValidation is disabled → replaced by readSqlTable should always reject non-identifier table names (table validation is now unconditional)
  • readSqlQuery should execute DROP TABLE when validation is disabled → renamed to reflect that SqlValidation.None is simply the default, not a special mode
  • read from table with name from reserved SQL keywords — removed false shouldThrow block (double-quoted identifiers like "ALTER" are stripped before matching, so SELECT * FROM "ALTER" correctly passes ReadOnly validation)
  • Added: readSqlTable should reject pre-quoted table names and suggest unquoted form
  • Added: readSqlTable should accept unquoted schema-qualified names — verifies PUBLIC.Customer works, quoting added by DbType.quoteIdentifier()
  • Added "DROP TABLE users" and "users; DROP TABLE users" to the existing table name injection test (confirms regex protection is sufficient without hasForbiddenPatterns)

Tradeoffs and decisions

None as default, not ReadOnly
readSqlQuery is called by developers writing known SQL, not handling user input. Blocking their CTEs and comments by default is hostile. If someone is passing LLM-generated or user-controlled SQL, they can explicitly opt into ReadOnly. The database always provides the real last line of defence.

Table name validation always strict, no bypass
The library builds SQL strings from the table name. Providing a bypass would make the library itself the source of injection. Users with non-standard names have readSqlQuery as the explicit escape hatch. Pre-quoted names like "schema"."table" are rejected with an error message directing users to pass unquoted schema.table — the library adds DB-specific quoting via DbType.quoteIdentifier() automatically.

No SQL parser dependency
JSqlParser or similar would be more accurate but brings dialect coverage risk and a heavy Java dependency. The tokenizer approach is dialect-neutral: stripping literals and comments is the same in every SQL dialect, so the subsequent keyword checks are reliable enough for the ReadOnly guard use case.

ReadOnly allowlist is conservative
{SELECT, WITH, VALUES, TABLE, EXPLAIN} — if a dialect has other read-only statement types not covered here, the user can use SqlValidation.None and rely on DB-level permissions instead.

Refactored SQL query/table validation to replace the `strictValidation` parameter with a more flexible `SqlValidation` enum introducing `ReadOnly` and `None` modes. Enhanced validation to block multi-statement queries and DDL/DML keywords in `ReadOnly` mode while allowing comments and literals inside queries. Updated tests and documentation accordingly.
@zaleslaw zaleslaw requested review from Jolanrensen and koperagen May 29, 2026 14:28
@Jolanrensen
Copy link
Copy Markdown
Collaborator

@zaleslaw please be a bit more brief in describing all changes in the PR description (like in "Test changes"). We can see the changes when we review the files, you know ;P
Some things are helpful to mention, but now it's so much text that it's hard to pick out the important stuff

@Jolanrensen Jolanrensen requested review from Jolanrensen and removed request for Jolanrensen June 1, 2026 10:10
* [SqlValidation.ReadOnly] blocks DDL/DML; [SqlValidation.None] (default) skips validation.
* Table name validation is always strict regardless of this parameter.
* @param [configureStatement] optional lambda to configure the [PreparedStatement] before execution.
* This allows for custom tuning of fetch size, query timeout, and other JDBC parameters.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inconsistent aligning of param text. You use 3 styles in each KDoc:

@param [limit] |the maximum...
               |`null` (def...
               |or positive...
|@param [dbType] the type of...
|in that case the [dbType] w...
@param [configureStatement]| optional lambda 
                           |This allows for c

In the rest of the library we mostly just use 2 spaces:

@param [dbType] the type of...
  in that case the [dbType] w...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot leave comments on this file... Something may have gone wrong creating/saving it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it is saved as CRLF (Windows) instead of LF (Linux). I'll see if I can fix it for now, but check your settings :)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways. I cannot see the descriptions of ReadOnly and None when selecting them in the IDE. Please add the KDoc to these enum entries itself.

@Jolanrensen
Copy link
Copy Markdown
Collaborator

I think the rest looks good :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make Dataframe-JDBC less restrictive in what SQL it accepts

2 participants