Skip to content

Conversation

@PRAteek-singHWY
Copy link
Contributor

@PRAteek-singHWY PRAteek-singHWY commented Jan 15, 2026

🚀 Prune Gap Analysis Search to Save Time and Memory

This PR implements a tiered pruning strategy for Gap Analysis to significantly reduce execution time and memory usage during map analysis.
The change directly addresses Issue #506 and aligns with the original design discussion around stopping early when strong or medium links are found.


🧠 Problem

Gap analysis currently performs an expensive wildcard Neo4j traversal:

MATCH p = allShortestPaths((BaseStandard)-[*..20]-(CompareStandard))

This approach:

  • Traverses all relationship types
  • Generates a large number of weakly relevant paths
  • Consumes large amounts of memory
  • Can take days to complete on large datasets
  • Runs even when direct or strong links already exist

In practice, we are only interested in the strongest connections between standards.


✅ Solution: Tiered Pruning Strategy

The search is now executed in three tiers, with early exit once results are found.

Tier 1 – Strong Links

Executed first. If any paths are found, the search stops immediately.

Relationships included:

  • LINKED_TO
  • AUTOMATICALLY_LINKED_TO
  • SAME

These correspond to the strongest connections (penalty = 0) and include equivalence (SAME) relationships.


Tier 2 – Medium Links

Executed only if Tier 1 returns no results.

Relationships included:

  • LINKED_TO
  • AUTOMATICALLY_LINKED_TO
  • SAME
  • CONTAINS

This captures hierarchical relationships without falling back to a full wildcard traversal.


Tier 3 – Fallback (Wildcard)

Executed only if Tier 1 and Tier 2 return no paths.

[*..20]

This preserves existing behavior as a fallback to ensure no loss of coverage.


🧪 Testing

A new unit test has been added to verify pruning behavior:

  • Confirms that Tier 3 is not executed when Tier 1 returns results
  • Uses mocking to detect which Neo4j queries are executed
  • Protects against future regressions in pruning logic

Test command:

python3 -m unittest application/tests/gap_analysis_db_test.py

All existing gap analysis tests continue to pass.


📈 Impact


🔗 Related Issue

Prune map analysis search to save time and memory
Fixes #506


📝 Notes

  • Path scoring logic is unchanged
  • Relationship semantics are preserved
  • This PR focuses strictly on backend query pruning
  • Frontend categorization changes are intentionally deferred to a follow-up PR (Stage 2)

- Introduce tiered gap analysis queries (strong → medium → wildcard)
- Stop traversal early when strong or medium paths exist
- Preserve existing scoring and semantics
- Add unit test to verify Tier-3 traversal is skipped when not needed

Fixes OWASP#506
@PRAteek-singHWY
Copy link
Contributor Author

PRAteek-singHWY commented Jan 17, 2026

PR 716: Performance Benchmark

Hi @northdpole ,

Thanks for the feedback. I understand the need for safety when touching core functionality. To be absolutely sure, I ran a comparative benchmark on my local environment using the full OpenCRE dataset (18 standards).

1. The "Before vs After" Measurements

I measured the execution time of gap_analysis() for the ASVS -> WSTG pair on both branches.

Metric Main Branch (Current) PR #716 (Optimized) Improvement
Query Strategy Broad Wildcard [*..20] Tiered [:STRONG_LINKS] then fallback Algorithmic
Execution Time 13.998s 1.26s ~11x Speedup
Paths Found 12,539 (Mixed Quality) 84 (High Confidence) 99% Noise Reduction

Conclusion: The legacy code drowns the user in 12,000+ weak paths after a 14-second wait. The new code returns the 84 relevant paths in 1 second.


2. Methodology & Rationale

Why did we test "ASVS -> WSTG"?
This is the "Stress Test". These two standards are heavily interconnected.

  • Main Branch: Because it checks all relationships up to 20 hops, it gets stuck in a combinatorial explosion (finding 12,539 paths through unrelated nodes).
  • PR Branch: The "Confidence-First" search finds the 84 direct, strong links immediately and stops.
  • If it works for this complex pair, it works for everything.

Why "Bolt" Protocol?
We ran the test using bolt:// (binary protocol) against the local Docker container. This ensures we are measuring the true database execution time without HTTP overhead or network lag.

Environment:

  • Dataset: Standard local import (~18 Core Standards: ASVS, WSTG, CAPEC, NIST, etc.).
  • Test Script: Automated Python script running db.gap_analysis 5 times and averaging the result.

This benchmark confirms the change is both safe (finds the high-quality data) and performant (>10x faster).

#506

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prune map analysis search to save time and memory

1 participant