Add taint analysis (analysis level 3) via CodeQL #29

Open
sinha108 wants to merge 5 commits into main from codeql-taint-analysis

sinha108 commented May 20, 2026

Motivation and Context

This PR adds analysis level 3: inter-procedural taint analysis that tracks data flow from user-controlled sources to security-sensitive sinks and reports each path as a typed, severity-rated vulnerability.

Detection is delegated to CodeQL's codeql/python-all built-in security models rather than manually enumerating APIs, giving broad coverage across 20 vulnerability classes (SQL injection, command injection, path traversal, XSS, SSRF, SSTI, unsafe deserialization, LDAP injection, XXE, NoSQL injection, ReDoS, and more) without maintaining fragile pattern lists.

A set of focused taint APIs allows follow-up queries pinned to specific call-site locations, either obtained from a prior analyze_taint_flows() run or provided directly as TaintNodeRef instances from any external tool (Joern, grep, symbol-table scan). All pinned locations are OR-combined into a single CodeQL query, so N locations cost one query run rather than N sequential runs.

How Has This Been Tested?

Unit tests (no CodeQL required):

  • Schema validation, query generation, configuration loading/merging, the three config modes, disabled_builtin_sinks filtering, and validate_config — all run in CI without any external dependency.
  • TaintNodeRef construction and field defaults; TaintSourceConfig / TaintSinkConfig / TaintSanitizerConfig model validators for pattern vs locations; PyTaintFlowStep and PyTaintAnalysisResult optional-field coverage.
  • All pattern-helper code paths in TaintQueryGenerator (including the non-.getACall() else-branches, none() empty-predicate guards, config-sig isBarrier toggle, location-clause generation with and without column precision, and special-character escaping in file paths).
  • Focused API error paths (empty list guards, missing config), mixed PyTaintSource / TaintNodeRef input shapes, and singular-wrapper delegation.
  • Config-loader completeness: duplicate sink/sanitizer names, blank patterns, no-sinks warning, name-collision overrides, include_remote_flow_source carry-through, _filter_disabled for sanitizers, all _load_from_file error paths, save_config round-trips for YAML and JSON, .yml extension, scalar-only config files.

Integration tests (require CodeQL CLI):

  • 9 purpose-built vulnerable fixture applications (sql_injection_app, command_injection_app, path_traversal_app, xss_app, flask_app, sanitizer_app, ssti_app, deserialization_app, ssrf_app) each with known-vulnerable code. Tests assert expected vulnerability types, flow counts, severity values, and that sanitized paths are not reported.
  • Focused taint API integration tests: analyze_taint_flows_from_sources([source]), analyze_taint_flows_to_sinks([sink]), and analyze_taint_flow_paths([source], [sink]) are exercised end-to-end against the flask_app fixture. Both PyTaintSource (bootstrapped from a prior full analysis) and raw TaintNodeRef (no bootstrapping) inputs are tested.

Total: 133 tests (109 unit, 24 integration), all passing.

Breaking Changes

None. All new flags have safe defaults (--taint-defaults on, no --taint-config required) and existing invocations at levels 1 and 2 are unaffected. The AnalysisOptions dataclass has two new optional fields (taint_config, taint_use_defaults) both with defaults. The focused taint APIs are additive; no existing method signatures changed.
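
The backward-compatibility claim above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the real AnalysisOptions: the analysis_level field, the field types, and the default values are assumptions inferred from the flag defaults described here (--taint-defaults on, no --taint-config required).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisOptionsSketch:
    """Hypothetical stand-in for AnalysisOptions (real fields are not shown in this PR)."""

    # Assumed pre-existing field, included only so the example has something to set.
    analysis_level: int = 1
    # The two fields named in this PR; types and defaults are assumptions.
    taint_config: Optional[str] = None   # path to a YAML/JSON taint config, if any
    taint_use_defaults: bool = True      # mirrors --taint-defaults being on


# A caller written before this PR never mentions the taint fields and still works.
legacy_options = AnalysisOptionsSketch(analysis_level=2)
```

Because both new fields carry defaults, existing constructor calls need no changes.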

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the Codellm-Devkit Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Architecture. The implementation has three layers:

  1. Config layer (taint_config_defaults.py, taint_config_loader.py): default sources/sanitizers, YAML/JSON file loading, name-based merge, disabled_builtin_sinks suppression, and validate_config warnings surfaced at load time. Each source, sink, and sanitizer entry accepts a CodeQL API-graph pattern, an explicit locations list (List[TaintNodeRef]), or both, so call-site data from any tool can be incorporated without writing CodeQL expressions.

  2. Query generation layer (taint_query_generator.py): dynamically generates a DataFlow::ConfigSig / TaintTracking::Global<Config> query from the active configuration. Built-in sinks are driven by the BUILTIN_SINKS table; user-defined entries are appended as additional predicates, with location-based entries emitting getAbsolutePath() + getStartLine() [+ getStartColumn()] constraints for optional sub-line precision.

  3. Analysis layer (codeql_analysis.py): executes the generated query, parses results into PyTaintFlow / PyTaintAnalysisResult Pydantic models, and resolves source/sink locations against the symbol table when available. Exposes three focused APIs — analyze_taint_flows_from_sources, analyze_taint_flows_to_sinks, analyze_taint_flow_paths — each accepting List[PyTaintSource | TaintNodeRef] and generating a single OR-combined query regardless of list size.
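
As a rough illustration of the location-based clause generation in the query generation layer, the sketch below builds one OR-combined QL condition from a list of locations. It is not the code in taint_query_generator.py: NodeRef, location_clause, and _ql_string are hypothetical names, and the getLocation().getFile() member chain in the emitted text is an assumption about how the path, line, and column constraints are reached.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeRef:
    """Hypothetical stand-in for TaintNodeRef."""

    file_path: str
    start_line: int
    start_column: int = -1  # -1 means "no column constraint"


def _ql_string(value: str) -> str:
    """Quote a Python string as a QL string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def location_clause(refs: List[NodeRef], var: str = "node") -> str:
    """Return a single QL condition matching `var` against any of `refs`."""
    if not refs:
        # Empty input yields a predicate body that never holds.
        return "none()"
    alternatives = []
    for ref in refs:
        conditions = [
            f"{var}.getLocation().getFile().getAbsolutePath() = {_ql_string(ref.file_path)}",
            f"{var}.getLocation().getStartLine() = {ref.start_line}",
        ]
        if ref.start_column >= 0:
            # Column precision is optional and only emitted when provided.
            conditions.append(f"{var}.getLocation().getStartColumn() = {ref.start_column}")
        alternatives.append("(" + " and ".join(conditions) + ")")
    # One disjunction, so the query count does not grow with the list size.
    return "\n  or ".join(alternatives)


refs = [NodeRef("/app/a.py", 10), NodeRef('/app/b"x.py', 3, 7)]
clause = location_clause(refs)
```

The resulting string would be placed inside an isSource or isSink predicate body; a single query then covers every pinned location.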

Config modes. Three modes, selected by combining --taint-config with --taint-defaults / --no-taint-defaults:

  • Defaults only (no --taint-config): covers most projects out of the box.
  • Union (--taint-config file.yaml): extends the defaults with project-specific sources/sinks.
  • Custom only (--taint-config file.yaml --no-taint-defaults): replaces all defaults, useful for scoped audits.
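
The three modes can be sketched as a small resolution function. This is an assumption-laden illustration rather than the loader's actual code: the entry shape, merge_by_name, and resolve_entries are hypothetical, and the override-on-name-collision rule is inferred from the "name-based merge" and "name-collision overrides" mentioned in this PR.

```python
from typing import Dict, List, Optional

# Hypothetical entry shape: {"name": ..., "pattern": ...}
Entry = Dict[str, str]


def merge_by_name(base: List[Entry], custom: List[Entry]) -> List[Entry]:
    """Union two entry lists; a custom entry replaces a base entry with the same name."""
    merged = {entry["name"]: entry for entry in base}
    for entry in custom:
        merged[entry["name"]] = entry
    return list(merged.values())


def resolve_entries(
    defaults: List[Entry],
    custom: Optional[List[Entry]],
    use_defaults: bool,
) -> List[Entry]:
    """Pick the active entries for defaults-only, union, or custom-only mode."""
    base = list(defaults) if use_defaults else []
    if custom is None:
        # No --taint-config given. With defaults disabled this sketch returns
        # an empty list; how the PR handles that combination is not described.
        return base
    return merge_by_name(base, custom)


defaults = [{"name": "a", "pattern": "A"}, {"name": "b", "pattern": "B"}]
custom = [{"name": "b", "pattern": "B2"}, {"name": "c", "pattern": "C"}]

defaults_only = resolve_entries(defaults, None, use_defaults=True)
union = resolve_entries(defaults, custom, use_defaults=True)
custom_only = resolve_entries(defaults, custom, use_defaults=False)
```

In the union case, entry "b" takes the custom pattern while "a" and "c" pass through unchanged.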

TaintNodeRef. A minimal (file_path, start_line, start_column=-1) value accepted everywhere PyTaintSource / PyTaintSink are used in the focused APIs, and also as the locations field in config entries. This removes the need to run a full analysis before calling the focused query APIs: call-site data from Joern, grep, or any other tool can be passed directly.
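
One way the mixed-input handling could look is sketched below. The real models are Pydantic classes whose full fields are not listed in this PR, so both classes here are hypothetical dataclass stand-ins, the PyTaintSourceSketch fields are assumed, and raising ValueError on an empty list is an assumption about the "empty list guards" mentioned under testing.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class TaintNodeRefSketch:
    """Hypothetical stand-in for TaintNodeRef."""

    file_path: str
    start_line: int
    start_column: int = -1  # default from the PR text: column left unspecified


@dataclass(frozen=True)
class PyTaintSourceSketch:
    """Hypothetical stand-in for PyTaintSource; field names are assumed."""

    file_path: str
    start_line: int
    start_column: int
    source_type: str


def to_node_refs(
    items: List[Union[PyTaintSourceSketch, TaintNodeRefSketch]],
) -> List[TaintNodeRefSketch]:
    """Normalize mixed inputs to plain location references."""
    if not items:
        raise ValueError("at least one location is required")
    refs = []
    for item in items:
        if isinstance(item, TaintNodeRefSketch):
            refs.append(item)
        else:
            # Keep only the location; the richer result fields are not needed to pin a query.
            refs.append(TaintNodeRefSketch(item.file_path, item.start_line, item.start_column))
    return refs


refs = to_node_refs(
    [
        TaintNodeRefSketch("app.py", 12),                # e.g. a line found with grep
        PyTaintSourceSketch("app.py", 30, 4, "remote"),  # e.g. from a prior full analysis
    ]
)
```

After normalization, both kinds of input carry the same three location fields, so a single query generator can serve either.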

sinha108 added 4 commits May 15, 2026 12:40
…xtures.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
…models; add

related test fixtures and unit tests.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
…iltin_sinks, three-mode config control, and validation

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
sinha108 requested a review from rahlk May 20, 2026 21:25
…pecification

of sources/sinks (in addition to pattern-based spec), expand test coverage,
optimize fixtures.

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>