Add taint analysis (analysis level 3) via CodeQL by sinha108 · Pull Request #29 · codellm-devkit/codeanalyzer-python

sinha108 · 2026-05-20T21:25:33Z

Motivation and Context

This PR adds analysis level 3: inter-procedural taint analysis that tracks data flow from user-controlled sources to security-sensitive sinks and reports each path as a typed, severity-rated vulnerability.

Detection is delegated to CodeQL's codeql/python-all built-in security models rather than manually enumerating APIs, giving broad coverage across 20 vulnerability classes (SQL injection, command injection, path traversal, XSS, SSRF, SSTI, unsafe deserialization, LDAP injection, XXE, NoSQL injection, ReDoS, and more) without maintaining fragile pattern lists.

A set of focused taint APIs supports follow-up queries pinned to specific call-site locations, obtained either from a prior analyze_taint_flows() run or supplied directly as TaintNodeRef instances from any external tool (Joern, grep, symbol-table scan). All pinned locations are OR-combined into a single CodeQL query, so N locations cost one query run instead of N sequential runs.

How Has This Been Tested?

Unit tests (no CodeQL required):

Schema validation, query generation, configuration loading/merging, the three config modes, disabled_builtin_sinks filtering, and validate_config — all run in CI without any external dependency.
TaintNodeRef construction and field defaults; TaintSourceConfig / TaintSinkConfig / TaintSanitizerConfig model validators for pattern vs locations; PyTaintFlowStep and PyTaintAnalysisResult optional-field coverage.
All pattern-helper code paths in TaintQueryGenerator (including the non-.getACall() else-branches, none() empty-predicate guards, config-sig isBarrier toggle, location-clause generation with and without column precision, and special-character escaping in file paths).
Focused API error paths (empty list guards, missing config), mixed PyTaintSource / TaintNodeRef input shapes, and singular-wrapper delegation.
Config-loader completeness: duplicate sink/sanitizer names, blank patterns, no-sinks warning, name-collision overrides, include_remote_flow_source carry-through, _filter_disabled for sanitizers, all _load_from_file error paths, save_config round-trips for YAML and JSON, .yml extension, scalar-only config files.

Integration tests (require CodeQL CLI):

Nine purpose-built fixture applications (sql_injection_app, command_injection_app, path_traversal_app, xss_app, flask_app, sanitizer_app, ssti_app, deserialization_app, ssrf_app), each containing known-vulnerable code. Tests assert expected vulnerability types, flow counts, and severity values, and verify that sanitized paths are not reported.
Focused taint API integration tests: analyze_taint_flows_from_sources([source]), analyze_taint_flows_to_sinks([sink]), and analyze_taint_flow_paths([source], [sink]) are exercised end-to-end against the flask_app fixture. Both PyTaintSource (bootstrapped from a prior full analysis) and raw TaintNodeRef (no bootstrapping) inputs are tested.

Total: 133 tests (109 unit, 24 integration), all passing.

Breaking Changes

None. All new flags have safe defaults (--taint-defaults on, no --taint-config required) and existing invocations at levels 1 and 2 are unaffected. The AnalysisOptions dataclass has two new optional fields (taint_config, taint_use_defaults) both with defaults. The focused taint APIs are additive; no existing method signatures changed.
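As a rough illustration of why defaulted fields keep existing callers working, here is a minimal sketch using a stand-in dataclass. Only the two field names taint_config and taint_use_defaults come from this PR; the class name AnalysisOptionsSketch, the analysis_level field, and the field types are assumptions made for the example, not the project's actual definition.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class AnalysisOptionsSketch:
    """Hypothetical stand-in for AnalysisOptions; not the project's real class."""

    analysis_level: int = 1  # assumed pre-existing field, for illustration only
    # The two fields named in this PR (their types are assumed here):
    taint_config: Optional[Path] = None  # mirrors "no --taint-config required"
    taint_use_defaults: bool = True      # mirrors "--taint-defaults on"


# A call written before the new fields existed still constructs cleanly,
# because both additions carry defaults.
legacy = AnalysisOptionsSketch(analysis_level=2)
```

Because both new fields have defaults and are declared after the existing ones, positional and keyword construction used by level 1 and level 2 callers would be unaffected under these assumptions.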

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update

Checklist

I have read the Codellm-Devkit Documentation
My code follows the repository's style guidelines
New and existing tests pass locally
I have added appropriate error handling
I have added or updated documentation as needed

Additional context

Architecture. The implementation has three layers:

Config layer (taint_config_defaults.py, taint_config_loader.py): default sources/sanitizers, YAML/JSON file loading, name-based merge, disabled_builtin_sinks suppression, and validate_config warnings surfaced at load time. Each source, sink, and sanitizer entry accepts a CodeQL API-graph pattern, an explicit locations list (List[TaintNodeRef]), or both, so call-site data from any tool can be incorporated without writing CodeQL expressions.
Query generation layer (taint_query_generator.py): dynamically generates a DataFlow::ConfigSig / TaintTracking::Global<Config> query from the active configuration. Built-in sinks are driven by the BUILTIN_SINKS table; user-defined entries are appended as additional predicates, with location-based entries emitting getAbsolutePath() + getStartLine() [+ getStartColumn()] constraints for optional sub-line precision.
Analysis layer (codeql_analysis.py): executes the generated query, parses results into PyTaintFlow / PyTaintAnalysisResult Pydantic models, and resolves source/sink locations against the symbol table when available. Exposes three focused APIs — analyze_taint_flows_from_sources, analyze_taint_flows_to_sinks, analyze_taint_flow_paths — each accepting List[PyTaintSource | TaintNodeRef] and generating a single OR-combined query regardless of list size.
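The location-based constraints and the single OR-combined query described above can be sketched roughly as follows. This is not the code in taint_query_generator.py: the Loc tuple, the function names, and the exact QL accessor chain (getLocation().getFile().getAbsolutePath()) are assumptions for illustration; only the use of getAbsolutePath(), getStartLine(), an optional getStartColumn(), the none() guard, and path escaping are taken from the description.

```python
from typing import List, NamedTuple


class Loc(NamedTuple):
    """Hypothetical stand-in for TaintNodeRef."""

    file_path: str
    start_line: int
    start_column: int = -1  # assumed: negative means "no column constraint"


def escape_ql_string(value: str) -> str:
    """Escape backslashes and double quotes for use inside a QL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def location_clause(var: str, locs: List[Loc]) -> str:
    """Build one OR-combined QL condition covering every pinned location."""
    if not locs:
        return "none()"  # empty-predicate guard: match nothing
    parts = []
    for loc in locs:
        conds = [
            f'{var}.getLocation().getFile().getAbsolutePath() = '
            f'"{escape_ql_string(loc.file_path)}"',
            f"{var}.getLocation().getStartLine() = {loc.start_line}",
        ]
        if loc.start_column >= 0:
            # Optional sub-line precision, emitted only when a column is given.
            conds.append(f"{var}.getLocation().getStartColumn() = {loc.start_column}")
        parts.append("(" + " and ".join(conds) + ")")
    return " or ".join(parts)


clause = location_clause("node", [Loc("/app/a.py", 10), Loc("/app/b.py", 20, 4)])
```

With this shape, the number of pinned locations changes only the length of the generated condition, not the number of CodeQL runs.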

Config modes. Three modes, selected by combining --taint-config with --taint-defaults / --no-taint-defaults:

Defaults only (no --taint-config): covers most projects out of the box.
Union (--taint-config file.yaml): extends the defaults with project-specific sources/sinks.
Custom only (--taint-config file.yaml --no-taint-defaults): replaces all defaults, useful for scoped audits.
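The three modes can be pictured with a small sketch of a name-based merge. Everything here is hypothetical: the DEFAULT_SINKS table, the resolve_sinks function, and the placeholder pattern strings are invented for illustration, and the rule that a user entry overrides a default with the same name is an assumption about how name collisions are resolved.

```python
from typing import Dict, Optional

# Hypothetical defaults keyed by entry name; values are placeholder patterns.
DEFAULT_SINKS: Dict[str, str] = {
    "sql": "default-sql-pattern",
    "cmd": "default-cmd-pattern",
}


def resolve_sinks(
    user: Optional[Dict[str, str]], use_defaults: bool = True
) -> Dict[str, str]:
    """Return the active entries for the three config modes."""
    if user is None:
        # Defaults only: no config file was supplied.
        return dict(DEFAULT_SINKS) if use_defaults else {}
    if use_defaults:
        # Union: start from the defaults, then apply user entries by name.
        merged = dict(DEFAULT_SINKS)
        merged.update(user)  # assumed: user entry wins on a name collision
        return merged
    # Custom only: defaults are dropped entirely.
    return dict(user)


defaults_only = resolve_sinks(None)
union = resolve_sinks({"cmd": "project-cmd-pattern", "custom": "project-pattern"})
custom_only = resolve_sinks({"custom": "project-pattern"}, use_defaults=False)
```

Keying the merge on entry names is what lets a project file both add new entries and replace individual defaults without restating the rest.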

TaintNodeRef. A minimal (file_path, start_line, start_column=-1) value accepted everywhere PyTaintSource / PyTaintSink are used in the focused APIs, and also as the locations field in config entries. This decouples the focused query APIs from the requirement to run a prior full analysis — call-site data from Joern, grep, or any other tool can be passed directly.
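A minimal sketch of such a value, and of building one from grep output, might look like the following. The class below is a stand-in written with a standard-library dataclass, not the project's TaintNodeRef model; the has_column helper and the from_grep_line function are hypothetical, and reading start_column=-1 as "no column constraint" is an inference from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaintNodeRefSketch:
    """Hypothetical stand-in mirroring (file_path, start_line, start_column=-1)."""

    file_path: str
    start_line: int
    start_column: int = -1  # assumed: -1 means match anywhere on the line

    @property
    def has_column(self) -> bool:
        """True when the reference pins a specific column."""
        return self.start_column >= 0


def from_grep_line(line: str) -> TaintNodeRefSketch:
    """Parse one 'path:line:text' record as printed by grep -n over several files."""
    path, lineno, _text = line.split(":", 2)
    return TaintNodeRefSketch(path, int(lineno))


ref = from_grep_line("app/views.py:42:    cursor.execute(query)")
```

A reference built this way carries no column, so under the assumptions above it would match any node starting on that line.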


sinha108 added 4 commits May 15, 2026 12:40

Implementation of taint analysis with CodeQL, along with tests and fixtures.

7e03cfc

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>

Expand taint analysis to use all applicable CodeQL built-in security models; add related test fixtures and unit tests.

08ee3c9

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>

Improve taint analysis extensibility: fix merge bugs, add disabled_builtin_sinks, three-mode config control, and validation

509a541

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>

Add test case with taint config in json format; add user guide

d0d1568

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>

sinha108 requested a review from rahlk May 20, 2026 21:25

Add focused taint APIs, add TaintNodeRef for simpler location-based specification of sources/sinks (in addition to pattern-based spec), expand test coverage, optimize fixtures.

f5329bb

Signed-off-by: Saurabh Sinha <sinha108@gmail.com>

sinha108 wants to merge 5 commits into main from codeql-taint-analysis
