feat(pipeline): test entry points, receiver type inference, two-hop resolution #23
vanigabriel wants to merge 52 commits into DeusData:main
…rammar Fix C# support: wire tree-sitter-c-sharp grammar into parser
Add Kotlin (.kt, .kts) as the 13th supported language with full tree-sitter parsing, AST node extraction, call resolution, usage tracking, and string constant propagation.

Changes:
- internal/lang/kotlin.go: Kotlin LanguageSpec registration
- internal/lang/lang.go: Add Kotlin constant and AllLanguages()
- internal/parser/parser.go: Register tree-sitter-kotlin grammar
- internal/pipeline/pipeline.go: Kotlin in isConstantNode(), isExported(), extractCalleeName()
- internal/pipeline/resolve.go: resolveKotlin() for val declarations
- internal/pipeline/usages.go: Kotlin reference node types
- Tests: TestParseKotlin, TestResolveKotlinConcat, extension tests
- go.mod: Add tree-sitter-grammars/tree-sitter-kotlin v1.1.0
- README.md: Update language count to 13
feat: add Kotlin language support (.kt, .kts)
Drop world-read/execute on cache directories created by MkdirAll. Addresses gosec G301 findings — no functional change.
Three bugs prevented the update notice from appearing:
- checkForUpdate goroutine launched after blocking sessionOnce.Do, so it never ran before tool calls arrived
- 4KB body limit truncated the ~15KB GitHub API response
- Notice was a JSON map key with no guaranteed ordering

Fixes:
- Move go checkForUpdate() before sessionOnce.Do in all handlers
- Increase body read limit to 64KB
- Prepend notice as separate MCP TextContent block (appears first)
- Link to GitHub releases page instead of go install command
Language expansion (13 → 25): Add Ruby, C, Bash, Zig, Elixir, Haskell, OCaml, HTML, CSS, YAML, TOML, HCL with tree-sitter grammars and language specs. Restore Erlang and SQL with extraction fixes.

CLI install/update/uninstall commands: Auto-detect Claude Code and Codex CLI, register MCP server, install task-specific skills, self-update with SHA-256 verification.

Embedded skills (4): exploring, tracing, quality, reference; auto-trigger in Claude Code for graph-first code discovery.

Pipeline improvements: Docstring extraction, enrichment pass for params/returns/complexity, graceful context cancellation, improved test detection.

35-language benchmark (BENCHMARK.md): 12 questions × 35 repos, 91.8% overall score, Linux kernel stress test (20K nodes, zero timeouts). Replaces old benchmark artifacts.
Skills are shipped inside the binary and installed via `codebase-memory-mcp install`. The standalone skills/ directory is no longer needed.
- Feature line: list all 35 languages including Obj-C, Swift, Dart, Perl, Groovy, Erlang, R, SCSS, SQL, Dockerfile
- Performance section: updated with Django (49K nodes) and Linux kernel stress test numbers from the v0.3.0 benchmark
- Architecture section: 25 → 35 language specs
- Token efficiency: reference 35 real-world repos instead of single project
Skill files were only written inside installClaudeCode(), which required finding the `claude` CLI binary. In CI (Linux) with no claude installed, skills were never written, causing 4 test failures. Split into installSkills() (always runs) and registerClaudeCodeMCP() (only when claude CLI found).
- TestMain: append .exe to binary name on Windows
- Add testEnvWithHome/setTestHome helpers for cross-platform HOME override (Windows uses USERPROFILE, not HOME)
- Add exeSuffix() helper for fake binary creation
- Skip Unix-specific tests on Windows: shell RC detection, PATH append, fallback path lookup
- Normalize CRLF in skill frontmatter check (git may convert on Windows)
claude mcp remove without -s user prompts interactively for scope, which hangs when run from install/uninstall commands. Both registerClaudeCodeMCP (pre-cleanup remove) and deregisterMCP (uninstall) now pass -s user explicitly.
- Release workflow: build binary as `codebase-memory-mcp` (not `codebase-memory-mcp-<os>-<arch>`) inside tarballs. Users extract and get the correct name without renaming.
- Codex CLI: replace broken `codex mcp add/remove` calls with direct config.toml manipulation. Codex uses [mcp_servers.<name>] sections, not CLI subcommands.
- README: add explicit extract-and-move step to Quick Start.
New boolean input allows re-releasing the same version by deleting the existing release and force-updating the tag before rebuild.
…rade
- Add detect_changes tool: map git diff to affected symbols + blast radius with risk classification (unstaged/staged/all/branch scopes)
- Add risk_labels param to trace_call_path: depth-based impact classification (CRITICAL/HIGH/MEDIUM/LOW) with impact_summary counts
- Add Cursor + Windsurf MCP config auto-registration (install/uninstall)
- Upgrade go-sdk v1.3.1 → v1.4.0 (security fixes, spec compliance)
- Improve tool descriptions with inline regex examples for AI agents
- Update skills documentation and README
Bump version from 0.1.4 to 0.3.1, update all 5 platform download URLs and SHA256 hashes, update language count from 12 to 35.
New tools:
- get_architecture: codebase architecture overview with 12 selectable aspects (languages, packages, entry_points, routes, hotspots, boundaries, services, layers, clusters, file_tree, adr)
- manage_adr: CRUD for Architecture Decision Records with 6 fixed sections, section filtering, validation, and discovery of existing architecture docs

Architecture analysis fixes:
- Fix qnToPackage extracting segment[2] for meaningful sub-packages instead of segment[1] which collapsed everything to top-level dirs
- Filter test functions from entry_points, hotspots, and routes using both is_test property (Module/File nodes) and file_path pattern matching
- Filter boundaries to Function/Method/Class nodes only
- Add FindArchitectureDocs for discovering existing architecture documentation

Naming corrections:
- Rename Leiden to Louvain throughout (algorithm is simplified Louvain, not actual Leiden which requires CPM-based refinement)
- leiden.go -> louvain.go with all symbols renamed
- Remove unused ClusterInfo.Modularity field (always zero)

Other changes:
- Case-insensitive search by default for search_graph and search_code
- Remove read_file and list_directory tools (handled by coding agents natively)
- Update tool descriptions and README for 8000 char ADR limit

- CONTRIBUTING.md: build from source, run tests, PR guidelines, language fix workflow
- docs/index.html: SEO-optimized landing page with benchmark data, feature grid, comparison table
…on-name-resolution-all-funct Add test for R function name resolution
Pipeline quality:
- 17 import parsers (ES modules, Java, Kotlin, Scala, C#, C, C++, PHP, Ruby, Rust, Lua, Elixir, Bash, Zig, Erlang, Haskell, OCaml)
- Language-aware isClassDeclaration() for TS, Java, C#, Scala, Kotlin, PHP
- Route test filtering (isTestNode + containsTestSegment)
- Haskell/OCaml/Elixir callee extraction for apply/infix/application nodes
- JSX component refs via extractJSXComponentRefs()
- Typed + dynamic type inference split
- Extraction test coverage (1679 lines)

Snippet tool optimization for AI coding agents:
- Replace "error" key with "status"/"message" in disambiguation responses
- Add auto_resolve param (opt-in, picks best from <=2 candidates)
- Add include_neighbors param (opt-in, returns caller/callee names)
- Fix fuzzy fallback to extract last dot-segment from qualified names
- Add NodeNeighborNames store method for neighbor name lookups
Add t.Cleanup(router.CloseAll) to snippet and watcher test helpers. Windows does not allow deleting files held by another process, so the SQLite DB must be closed before t.TempDir() removal runs.
Replace naive modularityGain (scanned all nodes per call) with per-community commSumTot accumulators maintained incrementally. Each iteration is now O(m) instead of O(N^2). Django benchmark: communities pass 4m40s → 0.94s (297x faster), total indexing 5m38s → 24s.
- Replace Go-binding tree-sitter with vendored C grammars compiled via CGo (internal/cbm/); eliminates all Go module/linker fragility
- Add 32 new languages: Clojure, CMake, COBOL, Common Lisp, CUDA, Elm, Emacs Lisp, Fortran, F#, GLSL, GraphQL, INI, JSON, Julia, Makefile, Markdown, Meson, Nix, Protobuf, Svelte, Verilog, Vim Script, Vue, XML (plus C, C#, Erlang, SQL, YAML quality improvements)
- Adaptive worker concurrency and mmap prefetch for faster indexing
- Graph buffer: batched SQLite writes for 3-5x throughput improvement
- Fix cancellation test: use 50ms timeout (pipeline now indexes 212-file Erlang repo in ~400ms, well under the old 2s deadline)
- Fix isTestFunction: add "Test" prefix for C#, "test" prefix for Scala

…d tests
- Fix extract_defs.c declarator chain traversal for C-family function_definition nodes (off-by-one in field name length, missing depth traversal)
- Add resolve_func_name() support for CommonLisp defun, Makefile rule, VimScript, Julia, Elm value_declaration, Groovy field_function methods
- Add config linker pipeline: 3 strategies linking config keys to code symbols, dependency manifest entries to imports, and config file path references
- Add regression tests (~150 cases across 50+ languages), language failure tests, and config extraction tests
- Refactor discover.go, configures.go, configlink_strategies.go to reduce cognitive complexity (extract helpers for gocognit compliance)
- Fix Louvain community detection: use map lookup instead of O(N) scan

- Token efficiency section: 35 → 59 repos in benchmark description
- Edge types: remove stale (INHERITS, DEPENDS_ON_EXTERNAL, CONTAINS_MODULE), add current (ASYNC_CALLS, USAGE, CONFIGURES, WRITES, MEMBER_OF, etc.)
- Architecture: parser/ → cbm/ (vendored C grammars engine)
- Pipeline description: add config links pass
- Benchmark: update scoring terminology and overall score
- Not yet benchmarked: reduce to just Nix, Meson
syscall.Fadvise was removed in Go 1.26. Use golang.org/x/sys/unix instead.
-std=c11 disables _DEFAULT_SOURCE on glibc, hiding le16toh/be16toh from <endian.h>. Add -D_DEFAULT_SOURCE to CGo CFLAGS.
Lint + test + build on all 5 OS targets without creating a release. Trigger manually before running the release workflow.
MSYS2 lacks system ICU headers. Vendor utf8.h and utf16.h from tree-sitter v0.24.7 lib/src/unicode/ to fix Windows compilation.
Add all 6 ICU-subset headers (ptypes.h, umachine.h, urename.h, utf.h, utf16.h, utf8.h) from tree-sitter v0.24.7. MSYS2 lacks system ICU.
…ders The vendored ICU-subset headers use "unicode/umachine.h" includes. Adding src/ to CFLAGS lets the compiler resolve these on Windows where no system ICU is available.
Tests that replace PATH with an empty temp dir break DLL resolution on Windows (MSYS2 libgcc/libwinpthread). Append original PATH on Windows so CGo-linked binaries can still find required DLLs.
…itor support DeusData#19
- Fix Continue.dev invalid_type error: add properties:{} to list_projects schema (DeusData#15)
- Fix Windows paths: sanitize colons and backslashes in ProjectNameFromPath (DeusData#20)
- Fix update double-v prefix: strip v from version before display (DeusData#17)
- Fix update context canceled: DownloadAsset returns []byte, not ReadCloser (DeusData#17)
- Strip v prefix in release workflow ldflags (Unix + Windows builds)
- Add Gemini CLI, VS Code Copilot, and Zed to install/uninstall commands (DeusData#19)
- Add tests for new editor configs and Windows path cases

- Languages: 35 -> 59
- Add Gemini CLI, VS Code, Zed to supported clients list
- Update benchmark description to match v7 methodology
- Update meta tags and comparison table
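The Windows path fix in this commit can be pictured with a short sketch. This is not the repository's `ProjectNameFromPath`; `sanitizeProjectName` and its exact output format are assumptions made only to show why colons and backslashes need handling before a path becomes a project name.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeProjectName (hypothetical) turns a filesystem path into a name
// that is safe to use as an identifier or file name on every platform.
// The dash-joined output format is an assumption, not the project's format.
func sanitizeProjectName(path string) string {
	s := strings.ReplaceAll(path, `\`, "/") // unify Windows and Unix separators
	s = strings.ReplaceAll(s, ":", "")      // drop the drive-letter colon
	s = strings.Trim(s, "/")                // no leading or trailing separator
	return strings.ReplaceAll(s, "/", "-")
}

func main() {
	fmt.Println(sanitizeProjectName(`C:\Users\dev\repo`)) // C-Users-dev-repo
	fmt.Println(sanitizeProjectName("/home/dev/repo"))    // home-dev-repo
}
```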
…ceiver type inference

Three fixes to reduce false positives in dead code detection and improve call graph accuracy:

1. Mark test functions as entry points: Test functions (Go Test*/Benchmark*/Example*, Python test_*, etc.) are invoked by the test runner, not by the call graph. The C extractor only sets is_test on the module node, not on individual function defs. This fix post-processes defs in cbmParseFile to mark matching functions as entry points.
2. Receiver type inference: For Go methods like `func (h *Handler) Foo()`, the receiver `h` has type `Handler`. Parse the Receiver string and add to the TypeMap so type_dispatch can resolve calls like `h.Publish()`.
3. Two-hop chained field resolution: For patterns like `h.svc.Method()` where `h` is a receiver and `svc` is a struct field, resolve the last segment of the chain by name lookup, excluding candidates from the receiver's own module to avoid self-referencing.

Results on a ~18k node Go+React Native codebase:
- type_dispatch calls: 29 → 356 (12x improvement)
- Test entry points: 0 → 122
- Service method call coverage: ~50% → 91% (200/220)
- Self-referencing false edges: 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
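Receiver type inference (item 2) boils down to parsing a short string. The sketch below is an illustration under assumptions, not the PR's `parseGoReceiver`: the name `parseReceiver`, its return shape, and the handling of generics and unnamed receivers are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseReceiver (hypothetical stand-in for parseGoReceiver) turns a
// receiver string such as "(h *Handler)" into ("h", "Handler").
// It reports ok=false for unnamed or malformed receivers.
func parseReceiver(recv string) (name, typ string, ok bool) {
	s := strings.TrimSpace(recv)
	if len(s) < 2 || s[0] != '(' || s[len(s)-1] != ')' {
		return "", "", false
	}
	inner := strings.TrimSpace(s[1 : len(s)-1])
	i := strings.IndexAny(inner, " \t")
	if i < 0 {
		return "", "", false // unnamed receiver like "(Handler)" or "()"
	}
	name = inner[:i]
	typ = strings.TrimLeft(strings.TrimSpace(inner[i+1:]), "*")
	// Drop generic type parameters, e.g. "Cache[K, V]" becomes "Cache".
	if j := strings.IndexByte(typ, '['); j >= 0 {
		typ = typ[:j]
	}
	if typ == "" || strings.ContainsAny(typ, " \t") {
		return "", "", false
	}
	return name, typ, true
}

func main() {
	typeMap := map[string]string{}
	for _, recv := range []string{"(h *Handler)", "(s MyService)", "", "(Handler)"} {
		if name, typ, ok := parseReceiver(recv); ok {
			typeMap[name] = typ // receiver variable -> type name
		}
	}
	fmt.Println(typeMap) // map[h:Handler s:MyService]
}
```

With `h → Handler` in the map, a call written as `h.Publish()` can be looked up as `Handler.Publish` instead of falling back to name-only matching.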
DeusData left a comment
Thanks for this contribution — the impact numbers are impressive (12x more type_dispatch calls, 122 test entry points, service method coverage 50% → 91%).
I verified all referenced functions exist (isTestFunction, modulePrefix, bestByImportDistance) and the single caller of inferTypesCBM is correctly updated with the new defs parameter. The core logic is sound.
A couple of things before merging:
- Unit tests for `parseGoReceiver`: This is a pure function that's easy to test. A few cases like `(h *Handler)`, `(s MyService)`, empty string, malformed input would go a long way.
- Comment on the two-hop resolution block: It's primarily useful for Go receiver patterns (`h.svc.Method()`). A brief comment noting this would help future readers understand the scope. Also worth noting it handles exactly 3-level chains; deeper chains like `a.b.c.d()` would only resolve the last segment.
Otherwise this looks good — the self-reference exclusion via modulePrefix is smart, the confidence values (0.90/0.80/0.70) are well-graduated, and the test entry point marking correctly uses the existing isTestFunction infrastructure.
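The two-hop resolution and the self-reference exclusion discussed in this review can be sketched roughly as follows. Everything here is an assumption for illustration: `resolveTwoHop`, the `byName` index, and the "exactly one remaining candidate" rule are hypothetical, and the PR itself picks among candidates with `bestByImportDistance` rather than requiring uniqueness.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveTwoHop (hypothetical) resolves a chained call such as
// "h.svc.CreateEvent" when "h" is a typed receiver. Only the last segment
// is looked up by name; candidates inside the receiver's own module are
// skipped so a function never resolves to itself.
func resolveTwoHop(callee, receiverModule string,
	typeMap map[string]string, byName map[string][]string) (string, bool) {

	parts := strings.Split(callee, ".")
	if len(parts) < 3 {
		return "", false // not a chained field call
	}
	if _, typed := typeMap[parts[0]]; !typed {
		return "", false // first segment is not a known receiver
	}
	method := parts[len(parts)-1]

	match, n := "", 0
	for _, qn := range byName[method] {
		if strings.HasPrefix(qn, receiverModule+".") {
			continue // self-reference exclusion
		}
		match = qn
		n++
	}
	if n != 1 {
		return "", false // ambiguous or unknown: simplification in this sketch
	}
	return match, true
}

func main() {
	typeMap := map[string]string{"h": "Handler"}
	byName := map[string][]string{
		"CreateEvent": {"app.handler.CreateEvent", "app.service.CreateEvent"},
	}
	qn, ok := resolveTwoHop("h.svc.CreateEvent", "app.handler", typeMap, byName)
	fmt.Println(qn, ok) // app.service.CreateEvent true
}
```

Because intermediate segments are never type-checked, a deeper chain such as `a.b.c.d()` is treated the same way: only `d` is looked up.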
In monorepos with multiple apps (e.g., apps/mobile + apps/api-go), the unique_name and fuzzy resolution strategies could create false CALLS edges across app boundaries. For example, React Native's <Text> component would resolve to Go's sanitize.Text() because it was the only "Text" in the registry.

This fix adds isCrossApp() which extracts the app boundary segment from qualified names (e.g., "apps.mobile" vs "apps.api-go") and rejects matches that cross boundaries when the candidate is not import-reachable. Cross-app communication should use HTTP_CALLS edges, not direct CALLS.

Results on a Go + React Native monorepo:
- Cross-app false edges: 134 → 0
- sanitize.Text false callers: 113 → 22 (91 RN components removed)
- fuzzy cross-app edges: 244 → 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
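A rough sketch of the boundary check follows. The function bodies are assumptions: the real `isCrossApp()` may locate the boundary segment differently, and `appBoundary` and `acceptCandidate` are hypothetical names. The sketch assumes qualified names are dot-separated and contain an `apps` segment followed by the app name.

```go
package main

import (
	"fmt"
	"strings"
)

// appBoundary (hypothetical) returns the "apps.<name>" prefix of a
// qualified name, or "" when the name is not under an apps directory.
func appBoundary(qn string) string {
	segs := strings.Split(qn, ".")
	for i := 0; i+1 < len(segs); i++ {
		if segs[i] == "apps" {
			return segs[i] + "." + segs[i+1]
		}
	}
	return ""
}

// isCrossApp reports whether caller and candidate live in different apps.
func isCrossApp(callerQN, candidateQN string) bool {
	a, b := appBoundary(callerQN), appBoundary(candidateQN)
	return a != "" && b != "" && a != b
}

// acceptCandidate (hypothetical) rejects cross-app matches unless the
// candidate is reachable through the caller's imports.
func acceptCandidate(callerQN, candidateQN string, importReachable bool) bool {
	return importReachable || !isCrossApp(callerQN, candidateQN)
}

func main() {
	caller := "apps.mobile.src.screens.Home"
	goFunc := "apps.api-go.internal.sanitize.Text"
	rnComp := "apps.mobile.src.components.Text"

	fmt.Println(acceptCandidate(caller, goFunc, false)) // false
	fmt.Println(acceptCandidate(caller, rnComp, false)) // true
}
```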
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi @DeusData, I didn't expect to see a reply so soon. I was just telling Claude how poor a PR it was: changing current comments, no unit tests, etc. I'll provide a refactor asap 🙏
Reviewer feedback: keep existing comments unchanged. Restores the original docstring and inline comment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hey, apologies for the initial PR quality — I'm an AI agent (Claude) working on this codebase for a user's monorepo project, and the first version was too invasive. I changed existing comments unnecessarily, which wasn't respectful of your codebase. Here's what I've fixed in the latest push:
I acknowledge
Thanks for the thorough review and for being open to the contribution!
When service.CreateEvent() calls repository.CreateEvent() (same function name, different package), the resolver could match the caller to itself via same_module or fuzzy strategies, creating a spurious self-reference CALLS edge at confidence 0.9. Add a callerQN == resolvedQN guard in resolveFileCallsCBM for both the primary resolution path and the fuzzy fallback path. This skips any resolved target that equals the calling function's qualified name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds an optional qualified_name parameter that enables exact node lookup via FindNodeByQN, bypassing the ambiguous name-based resolution. When provided, qualified_name takes priority; falls back to function_name if QN misses. Fully backward compatible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exported methods in *handler* files with echo.Context signatures are
registered via method value references (g.POST("", h.Method)) that the
C extractor doesn't track as calls. This caused 83 handler methods to
have 0 inbound CALLS edges and be falsely flagged as dead code.
Detect these at parse time using the same pattern as the existing test
entry point fix: file path contains "handler", definition is an exported
Method, and signature contains "echo.Context".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
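The three conditions listed in this commit message translate into a small predicate. The sketch below is illustrative only: the `def` struct and its field names are hypothetical, and the Go language check reflects the guard added in a later commit of this PR rather than this one.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// def is a hypothetical, trimmed-down view of an extracted definition.
type def struct {
	Language  string
	Label     string // "Function", "Method", ...
	Name      string
	FilePath  string
	Signature string
}

// isEchoHandlerEntryPoint applies the heuristic from the commit message:
// Go file, path contains "handler", exported method, echo.Context parameter.
func isEchoHandlerEntryPoint(d def) bool {
	if d.Language != "go" || d.Label != "Method" {
		return false
	}
	if !strings.Contains(strings.ToLower(d.FilePath), "handler") {
		return false
	}
	first, _ := utf8.DecodeRuneInString(d.Name)
	if !unicode.IsUpper(first) {
		return false // unexported methods cannot be registered from outside
	}
	return strings.Contains(d.Signature, "echo.Context")
}

func main() {
	d := def{
		Language:  "go",
		Label:     "Method",
		Name:      "CreateEvent",
		FilePath:  "internal/handler/event_handler.go",
		Signature: "(c echo.Context) error",
	}
	fmt.Println(isEchoHandlerEntryPoint(d)) // true
}
```

A path-and-signature heuristic like this trades precision for simplicity: it avoids tracking method value references but only covers Echo-style handlers.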
New pass `passPubSubLinks()` detects in-process event bus patterns (Publish/Subscribe with shared event constants) and creates ASYNC_CALLS edges between publisher and subscriber functions.

Algorithm:
1. Find CALLS edges to known publish/subscribe method names
2. Resolve USAGE edges to identify shared event constants
3. Match publishers and subscribers by event constant
4. Create ASYNC_CALLS edges with event_bus async_type

Supports method names: Publish, Emit, Dispatch, Fire, Send, Notify, Trigger, Broadcast (publish) and Subscribe, On, AddListener, Listen, Handle, Register (subscribe).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Refine pub/sub detection to use only high-signal method names:
- Publish: publish, emit, dispatch, fire, broadcast
- Subscribe: subscribe, addlistener, listen

Removed generic names (send, notify, trigger, on, handle, register) that matched non-event-bus functions like cron schedulers and HTTP route registration.

Rewrote algorithm from USAGE-based event matching (broken: the C extractor skips identifiers inside call expressions) to direct handler linking: publisher functions → subscriber handler functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace cartesian-product pub/sub matching with event-aware routing. The pass now reads publisher/subscriber source files from disk and regex-extracts event constant names from Publish/Subscribe call sites. For subscriber functions with multiple Subscribe calls (e.g. RegisterListeners), handler calls are attributed to the nearest preceding Subscribe call by line proximity.

Results on Vibe codebase: 22 edges at 100% accuracy (was 43 at 53%). Each edge now includes event_name in properties for observability. Fallback to cartesian product (confidence=0.5) if source scanning yields no event names; zero fallbacks were triggered.

Includes 17 unit tests covering event extraction, handler attribution, source file caching, and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
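The "nearest preceding Subscribe call" attribution can be sketched as a sorted lookup. The types and the function name `attributeHandlers` are hypothetical, and the sketch assumes line numbers for Subscribe calls and handler calls have already been extracted from the source file.

```go
package main

import (
	"fmt"
	"sort"
)

// subscribeCall and handlerCall are hypothetical records produced by
// scanning a subscriber function's source.
type subscribeCall struct {
	Line  int
	Event string
}

type handlerCall struct {
	Line int
	Name string
}

// attributeHandlers maps each event name to the handlers whose call sites
// follow that event's Subscribe call most closely. Handlers that appear
// before any Subscribe call are left unattributed.
func attributeHandlers(subs []subscribeCall, handlers []handlerCall) map[string][]string {
	sorted := append([]subscribeCall(nil), subs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Line < sorted[j].Line })

	out := make(map[string][]string)
	for _, h := range handlers {
		// Index of the first Subscribe call located after the handler call.
		i := sort.Search(len(sorted), func(k int) bool { return sorted[k].Line > h.Line })
		if i == 0 {
			continue // no Subscribe call at or before this line
		}
		ev := sorted[i-1].Event
		out[ev] = append(out[ev], h.Name)
	}
	return out
}

func main() {
	subs := []subscribeCall{{Line: 10, Event: "EventUserCreated"}, {Line: 14, Event: "EventXPAwarded"}}
	handlers := []handlerCall{{Line: 11, Name: "sendWelcome"}, {Line: 15, Name: "AwardXP"}}
	fmt.Println(attributeHandlers(subs, handlers))
}
```

Compared with a cartesian product, each handler ends up linked to one event, which is what lets the edge carry an `event_name` property.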
…ection
- Fix handler name substring matching: verify character after handler
name is '(' to prevent "Award" from matching "AwardXP(" (HIGH)
- Remove unused .on() from subscribeEventPatterns regex since "on" is
not in subscribeMethodNames and would never trigger (MEDIUM)
- Add Go language guard to Echo handler entry point heuristic to
prevent false matches on non-Go files (MEDIUM)
- Add TestAttributeHandlersToEvents_SubstringNoFalseMatch test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
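The substring fix in the first bullet can be shown with a small check. `callsHandler` is a hypothetical name; the sketch only verifies the character after the match, as the commit message describes, and does not inspect the character before it.

```go
package main

import (
	"fmt"
	"strings"
)

// callsHandler (hypothetical) reports whether line contains a call to
// handler, meaning the handler name is immediately followed by '('.
// A plain substring test would let "Award" match "AwardXP(".
func callsHandler(line, handler string) bool {
	if handler == "" {
		return false
	}
	for start := 0; ; {
		i := strings.Index(line[start:], handler)
		if i < 0 {
			return false
		}
		end := start + i + len(handler)
		if end < len(line) && line[end] == '(' {
			return true
		}
		start += i + 1 // keep scanning after this non-call occurrence
	}
}

func main() {
	line := "h.AwardXP(ctx, userID)"
	fmt.Println(callsHandler(line, "Award"))   // false
	fmt.Println(callsHandler(line, "AwardXP")) // true
}
```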
Hi @DeusData! While testing the changes from this PR on our monorepo, we
New changes (7 commits added)
Resolver improvements:
Pub/Sub event bus detection (new pass:
Results (upstream main → this PR)
All changes follow the existing repo patterns (inline language dispatch, per-language regex in var blocks, pass-specific files in
Claude (AI assistant, working with @vanigabriel)
Summary
Three improvements to the call resolution pipeline that significantly reduce false positives in dead code detection:
1. Test functions as entry points: The C extractor sets `is_test` only on the module node, not individual functions. This fix post-processes definitions in `cbmParseFileFromSource` to mark `Test*`/`Benchmark*`/`Example*` (Go), `test_*` (Python), etc. as `is_entry_point=true` using the existing `isTestFunction()` from `testdetect.go`.
2. Receiver type inference: For Go methods like `func (h *Handler) Foo()`, parses the `Receiver` string (e.g., `(h *Handler)`) and adds `h → Handler` to the TypeMap. This enables `type_dispatch` resolution for receiver-based calls.
3. Two-hop chained field resolution: For patterns like `h.svc.Method()` where `h` is a typed receiver and `svc` is a struct field, resolves the last segment (`Method`) by name lookup, excluding candidates from the receiver's own module to prevent self-referencing edges.

Results
Tested on a ~18k node Go + React Native codebase:
- `type_dispatch` calls: 29 → 356 (12x improvement)
- Test entry points: 0 → 122
- Service method call coverage: ~50% → 91% (200/220)
- Self-referencing false edges: 0

Changes
- `internal/pipeline/pipeline_cbm.go`: Added test entry point marking in `cbmParseFileFromSource`, receiver type inference in `inferTypesCBM`, and `parseGoReceiver` helper
- `internal/pipeline/pipeline.go`: Added two-hop chained field resolution in `resolveCallWithTypes`

Test plan
- `go test ./internal/pipeline/...` passes
- `go build ./...` clean (no new warnings)

🤖 Generated with Claude Code