Add taint analysis (analysis level 3) via CodeQL #29
Open
sinha108 wants to merge 5 commits into
…xtures. Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
…models; add related test fixtures and unit tests. Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
…iltin_sinks, three-mode config control, and validation Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
…pecification of sources/sinks (in addition to pattern-based spec), expand test coverage, optimize fixtures. Signed-off-by: Saurabh Sinha <sinha108@gmail.com>
Motivation and Context
This PR adds analysis level 3: inter-procedural taint analysis that tracks data flow from user-controlled sources to security-sensitive sinks and reports each path as a typed, severity-rated vulnerability.
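To make "inter-procedural flow from source to sink" concrete, here is a minimal, self-contained sketch. It is not taken from the PR's fixtures: the function names and the in-memory SQLite table are hypothetical, and the `form` dict only stands in for request data. Whether the generated queries flag exactly this snippet is not claimed here.

```python
import sqlite3


def read_username(form: dict) -> str:
    # Source: a caller-controlled value (stand-in for request data).
    return form["username"]


def build_query(username: str) -> str:
    # Taint propagates through string formatting into the SQL text.
    return f"SELECT id FROM users WHERE name = '{username}'"


def find_user(conn: sqlite3.Connection, form: dict) -> list:
    # Sink: the tainted string reaches execute() in a different function
    # than the one that read it, hence "inter-procedural".
    return conn.execute(build_query(read_username(form))).fetchall()


def find_user_safe(conn: sqlite3.Connection, form: dict) -> list:
    # Parameter binding keeps the input out of the SQL text entirely.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (read_username(form),)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# A classic injection payload: the unsafe path matches every row.
payload = {"username": "x' OR '1'='1"}
unsafe_rows = find_user(conn, payload)
safe_rows = find_user_safe(conn, payload)
```

The unsafe variant is the kind of path a taint analysis would be expected to report, while the parameterized variant is the kind of path that should stay unreported.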
Detection is delegated to CodeQL's `codeql/python-all` built-in security models rather than manually enumerating APIs, giving broad coverage across 20 vulnerability classes (SQL injection, command injection, path traversal, XSS, SSRF, SSTI, unsafe deserialization, LDAP injection, XXE, NoSQL injection, ReDoS, and more) without maintaining fragile pattern lists.
A set of focused taint APIs allows follow-up queries pinned to specific call-site locations, either obtained from a prior `analyze_taint_flows()` run or provided directly as `TaintNodeRef` instances from any external tool (Joern, grep, symbol-table scan). All pinned locations are OR-combined into a single CodeQL query, avoiding O(N) sequential runs.
How Has This Been Tested?
Unit tests (no CodeQL required):
- … `disabled_builtin_sinks` filtering, and `validate_config`; all run in CI without any external dependency.
- `TaintNodeRef` construction and field defaults; `TaintSourceConfig`/`TaintSinkConfig`/`TaintSanitizerConfig` model validators for `pattern` vs `locations`; `PyTaintFlowStep` and `PyTaintAnalysisResult` optional-field coverage.
- `TaintQueryGenerator` (including the non-`.getACall()` else-branches, `none()` empty-predicate guards, config-sig `isBarrier` toggle, location-clause generation with and without column precision, and special-character escaping in file paths).
- … `PyTaintSource`/`TaintNodeRef` input shapes, and singular-wrapper delegation.
- `include_remote_flow_source` carry-through, `_filter_disabled` for sanitizers, all `_load_from_file` error paths, `save_config` round-trips for YAML and JSON, `.yml` extension, scalar-only config files.
Integration tests (require CodeQL CLI):
- … `sql_injection_app`, `command_injection_app`, `path_traversal_app`, `xss_app`, `flask_app`, `sanitizer_app`, `ssti_app`, `deserialization_app`, `ssrf_app`) each with known-vulnerable code. Tests assert expected vulnerability types, flow counts, severity values, and that sanitized paths are not reported.
- `analyze_taint_flows_from_sources([source])`, `analyze_taint_flows_to_sinks([sink])`, and `analyze_taint_flow_paths([source], [sink])` are exercised end-to-end against the `flask_app` fixture. Both `PyTaintSource` (bootstrapped from a prior full analysis) and raw `TaintNodeRef` (no bootstrapping) inputs are tested.
Total: 133 tests (109 unit, 24 integration), all passing.
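The location-clause cases listed under the unit tests (with and without column precision, special-character escaping, the `none()` guard) and the OR-combination of pinned locations can be pictured with a sketch like the one below. It is an illustration under assumptions, not the PR's `TaintQueryGenerator`: the `TaintNodeRef` class is a stand-in built from the fields named in the description, the helper names `escape_ql_string`, `location_clause`, and `combined_clause` are hypothetical, and the exact QL text the PR emits may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaintNodeRef:
    """Stand-in for the PR's TaintNodeRef (field names from the description)."""
    file_path: str
    start_line: int
    start_column: int = -1  # -1 means "no column constraint"


def escape_ql_string(value: str) -> str:
    # Escape backslashes first, then double quotes, for a QL string literal.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def location_clause(ref: TaintNodeRef, var: str = "node") -> str:
    # File path and start line are always constrained.
    parts = [
        f'{var}.getLocation().getFile().getAbsolutePath() = '
        f'"{escape_ql_string(ref.file_path)}"',
        f"{var}.getLocation().getStartLine() = {ref.start_line}",
    ]
    # The column constraint is added only when a column was supplied.
    if ref.start_column >= 0:
        parts.append(f"{var}.getLocation().getStartColumn() = {ref.start_column}")
    return "(" + " and ".join(parts) + ")"


def combined_clause(refs: List[TaintNodeRef], var: str = "node") -> str:
    # An empty list yields a predicate body that matches nothing.
    if not refs:
        return "none()"
    # All pinned locations go into one disjunction, so one query covers them.
    return "\n  or\n  ".join(location_clause(r, var) for r in refs)


refs = [
    TaintNodeRef("/app/views.py", 10),
    TaintNodeRef('/app/we"ird.py', 3, 7),
]
clause = combined_clause(refs)
print(clause)
```

Because every location lands in a single disjunction, the number of CodeQL runs stays at one no matter how many call sites are pinned.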
Breaking Changes
None. All new flags have safe defaults (`--taint-defaults` on, no `--taint-config` required) and existing invocations at levels 1 and 2 are unaffected. The `AnalysisOptions` dataclass has two new optional fields (`taint_config`, `taint_use_defaults`), both with defaults. The focused taint APIs are additive; no existing method signatures changed.
Types of changes
Checklist
Additional context
Architecture. The implementation has three layers:
- Config layer (`taint_config_defaults.py`, `taint_config_loader.py`): default sources/sanitizers, YAML/JSON file loading, name-based merge, `disabled_builtin_sinks` suppression, and `validate_config` warnings surfaced at load time. Each source, sink, and sanitizer entry accepts either a CodeQL API-graph `pattern` or an explicit `locations: List[TaintNodeRef]` list (or both), so call-site data from any tool can be incorporated without writing CodeQL expressions.
- Query generation layer (`taint_query_generator.py`): dynamically generates a `DataFlow::ConfigSig` / `TaintTracking::Global<Config>` query from the active configuration. Built-in sinks are driven by the `BUILTIN_SINKS` table; user-defined entries are appended as additional predicates, with location-based entries emitting `getAbsolutePath()` + `getStartLine()` [+ `getStartColumn()`] constraints for optional sub-line precision.
- Analysis layer (`codeql_analysis.py`): executes the generated query, parses results into `PyTaintFlow`/`PyTaintAnalysisResult` Pydantic models, and resolves source/sink locations against the symbol table when available. Exposes three focused APIs (`analyze_taint_flows_from_sources`, `analyze_taint_flows_to_sinks`, `analyze_taint_flow_paths`), each accepting `List[PyTaintSource | TaintNodeRef]` and generating a single OR-combined query regardless of list size.
Config modes. Three options controlled by `--taint-defaults`/`--no-taint-defaults`:
- … `--taint-config`): covers most projects out of the box.
- … `--taint-config file.yaml`): extends the defaults with project-specific sources/sinks.
- … `--taint-config file.yaml --no-taint-defaults`): replaces all defaults, useful for scoped audits.
`TaintNodeRef`. A minimal `(file_path, start_line, start_column=-1)` value accepted everywhere `PyTaintSource`/`PyTaintSink` are used in the focused APIs, and also as the `locations` field in config entries. This decouples the focused query APIs from the requirement to run a prior full analysis: call-site data from Joern, grep, or any other tool can be passed directly.
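The three config modes and the name-based merge can be pictured roughly as follows. This is a sketch under stated assumptions, not the loader in `taint_config_loader.py`: `resolve_sources`, `DEFAULT_SOURCES`, the entry names, and the `myqueue` module are all hypothetical, and the rule that a user entry overrides a default with the same name is an assumption about what "name-based merge" means.

```python
from typing import Dict, List, Optional

Entry = Dict[str, str]

# Illustrative stand-ins for built-in defaults (not the PR's actual list).
DEFAULT_SOURCES: List[Entry] = [
    {"name": "flask_request",
     "pattern": 'API::moduleImport("flask").getMember("request")'},
    {"name": "env_lookup",
     "pattern": 'API::moduleImport("os").getMember("getenv")'},
]


def resolve_sources(user_entries: Optional[List[Entry]],
                    use_defaults: bool = True) -> List[Entry]:
    """Return the active source list for one of the three modes."""
    merged: Dict[str, Entry] = {}
    if use_defaults:
        # Modes 1 and 2: start from the built-in defaults.
        for entry in DEFAULT_SOURCES:
            merged[entry["name"]] = entry
    # Modes 2 and 3: user entries are keyed by name, so a user entry
    # replaces a default with the same name (assumed merge rule).
    for entry in user_entries or []:
        merged[entry["name"]] = entry
    return list(merged.values())


project_entries: List[Entry] = [
    {"name": "env_lookup",
     "pattern": 'API::moduleImport("os").getMember("environ")'},
    {"name": "queue_message",
     "pattern": 'API::moduleImport("myqueue").getMember("receive")'},
]

defaults_only = resolve_sources(None)                        # no config file
extended = resolve_sources(project_entries)                  # config file + defaults
replaced = resolve_sources(project_entries, use_defaults=False)  # config file only
```

Keying the merge on `name` keeps the extend mode predictable: a project can override one default without having to restate the rest.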