Commit 93d4419

feat(core): ship clone_guard_exit_divergence + clone_cohort_drift, optimize cache/report pipeline, harden deterministic contracts, and align docs/tests for 2.0.0b1
1 parent 6b6ed54 commit 93d4419

71 files changed

Lines changed: 4806 additions & 1298 deletions

CHANGELOG.md

Lines changed: 89 additions & 88 deletions
Original file line number | Diff line number | Diff line change
@@ -2,83 +2,79 @@
22

33
## [2.0.0b1]
44

5-
CodeClone 2.0 is a major upgrade that expands the project from a structural clone detector into a broader *
6-
*baseline-aware code-health and CI governance tool** for Python.
5+
CodeClone 2.0 is a major upgrade that evolves the project from a structural clone detector into a **baseline-aware**
6+
code-health and CI governance tool for Python.
77

8-
This beta introduces:
8+
This beta focuses on the new architecture, expanded code-health analysis, contract stability, and performance validation
9+
ahead of the final `2.0.0` release.
910

10-
- a new stage-based architecture
11-
- unified clone + metrics baseline flow
12-
- report schema `2.1`, cache schema `2.1`, and richer report provenance
13-
- expanded code-health analysis (complexity, coupling, cohesion, dependencies, dead code, health)
14-
- improved HTML and CLI reporting surfaces
15-
- substantial performance work for faster cold and warm runs
16-
17-
Compatibility remains a first-class concern in this release:
18-
19-
- baseline schema is bumped to `2.0`
20-
- `fingerprint_version` remains `1`
21-
- backward compatibility for legacy clone-only baselines is preserved
22-
23-
This is a beta release intended to validate the new architecture, reporting surface, and performance profile before the
24-
final `2.0.0` release.
25-
26-
### Fixes (feat/2.0.0)
11+
### Overview
2712

28-
- Fixed scanner root-exclude short-circuit: only an explicitly excluded root
29-
directory is skipped; excluded segments in parent path no longer suppress
30-
valid scans (prevents silent zero-file analysis for roots like `build/project`).
31-
- Optimized HTML snippet rendering path:
32-
- `_FileCache` now caches full file lines once per file and serves
33-
line-range slices without repeated full-file scans.
34-
- Pygments imports are cached per importer identity to avoid repeated
35-
dynamic import overhead in hot snippet loops while preserving testability.
36-
- Optimized block explainability AST stats:
37-
- added per-file statement index and range lookup via `bisect`,
38-
replacing repeated full `ast.walk()` scans per range.
39-
- Added scanner regression coverage for roots under excluded parent directories.
40-
- No baseline/cache/report schema contract changes; detector identity semantics
41-
and golden compatibility preserved.
13+
- New stage-based pipeline architecture with unified clone + metrics baseline flow.
14+
- Expanded code-health analysis: complexity, coupling, cohesion, dependencies, dead code, and health.
15+
- Improved HTML and CLI reporting surfaces.
16+
- Significant performance work for faster cold and warm runs.
17+
- Baseline schema `2.0`, report schema `2.1`, cache schema `2.2`; `fingerprint_version` remains `1` and legacy
18+
clone-only baselines stay compatible.
4219

4320
### Architecture
4421

45-
- Refactored CLI orchestration into a stage-based pipeline (`codeclone/pipeline.py`) to isolate discovery, processing,
46-
analysis, report writing, and gating.
22+
- Refactored CLI orchestration into a stage-based pipeline (`codeclone/pipeline.py`) that isolates discovery,
23+
processing, analysis, report writing, and gating.
4724
- Introduced explicit domain layers:
4825
- `codeclone/models.py` — typed core models
4926
- `codeclone/metrics/` — complexity, coupling, cohesion, dependencies, dead code, and health
50-
- `codeclone/report/` — merge, explain, serialize, and suggestions
27+
- `codeclone/report/` — merge, explain, serialize, suggestions
5128
- `codeclone/grouping.py` — clone grouping domain
52-
- Removed temporary legacy `_report_*` shim modules after migrating runtime and tests to `codeclone.report.*`.
29+
- Removed legacy `_report_*` shims after migrating runtime and tests to `codeclone.report.*`.
5330

5431
### Baseline, Cache, and Report Contracts
5532

5633
- Bumped baseline schema to `2.0` (`BASELINE_SCHEMA_VERSION`) while preserving compatibility checks for legacy `1.0`
5734
clone-only payloads.
58-
- Added unified baseline flow with optional top-level `metrics` stored in the same baseline file as clone keys.
35+
- Added a unified baseline flow with optional top-level `metrics` stored alongside clone keys in the same baseline file.
5936
- Tracked embedded metrics snapshot integrity via `meta.metrics_payload_sha256`.
6037
- Preserved embedded metrics payload and hash when updating clone baseline content.
61-
- Bumped cache schema to `2.1`.
62-
- Bumped report schema to `2.1`.
63-
- Consolidated report contract around canonical sections:
64-
`meta`, `inventory`, `findings`, `metrics`, with `derived` and `integrity`
65-
as explicit companion layers.
66-
- Structural findings now deduplicate repeated occurrences and use explicit
67-
`file_path` item layout instead of a sentinel `file_i=-1`.
68-
- Tightened `duplicated_branches` reporting to suppress trivial single-statement
69-
branch boilerplate without structural mass.
38+
- Bumped cache schema to `2.2` and report schema to `2.1`.
39+
- Extended cache metrics payload with canonical symbol-usage references:
40+
- `referenced_qualnames` in runtime entries
41+
- compact wire key `rq` in cache payload
42+
- Added additive cache payload key `sr` (segment report projection) to reuse merged
43+
segment suppression output on warm runs without cache schema/version bump.
44+
- Consolidated the report contract around canonical sections:
45+
`meta`, `inventory`, `findings`, `metrics`, with `derived` and `integrity` as companion layers.
46+
- Structural findings now deduplicate repeated occurrences and use an explicit `file_path` item layout instead of a
47+
sentinel `file_i = -1`.
48+
- Tightened `duplicated_branches` reporting to suppress trivial single-statement boilerplate without structural mass.
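
The integrity tracking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not CodeClone's actual implementation: the canonical JSON form (sorted keys, compact separators) and the helper names `metrics_payload_sha256` and `verify_metrics` are hypothetical; only the `meta.metrics_payload_sha256` field and the optional top-level `metrics` key come from the changelog.

```python
import hashlib
import json


def metrics_payload_sha256(metrics: dict) -> str:
    """Hash a metrics payload over a canonical JSON encoding.

    The canonical form (sorted keys, compact separators) is an assumption
    made for this sketch so that key order does not affect the digest.
    """
    canonical = json.dumps(metrics, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_metrics(baseline: dict) -> bool:
    """Return True when the embedded metrics match the recorded hash."""
    metrics = baseline.get("metrics")
    if metrics is None:
        # Metrics are optional, so a clone-only baseline stays valid.
        return True
    recorded = baseline.get("meta", {}).get("metrics_payload_sha256")
    return recorded == metrics_payload_sha256(metrics)
```

Because the hash covers only the `metrics` payload in this sketch, updating clone content elsewhere in the same file would leave the recorded hash valid, which is consistent with the "preserved embedded metrics payload and hash" entry.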
49+
50+
### Contract Stabilization Updates
51+
52+
- Added report-only structural finding families for clone cohort analysis:
53+
- `clone_guard_exit_divergence`
54+
- `clone_cohort_drift`
55+
- Added deterministic per-function stable structure facts in extraction/cache payloads and reused them for cohort
56+
structural findings without extra scans.
57+
- Extended cache wire `u` row with stable structure columns while preserving deterministic decode defaults for legacy
58+
rows.
59+
- Expanded `tests/fixtures/golden_v2` contracts:
60+
- analysis snapshots now lock `stable_structure` and `cohort_structural_findings`
61+
- CLI snapshots now lock structural group id/kind projections
62+
- Strengthened branch/invariant coverage for structural/report layers; coverage gate remains `>=99%`.
63+
- Synchronized contract docs with implemented code paths
64+
(`README`, architecture, cache/report schema appendices, testing book).
7065

7166
### Configuration and CLI UX
7267

73-
- Added project config loading from `pyproject.toml` under `[tool.codeclone]` with strict key and type validation.
68+
- Added project configuration loading from `pyproject.toml` under `[tool.codeclone]` with strict key and type
69+
validation.
7470
- Made precedence explicit: `CLI (explicit flags) > pyproject.toml > parser/runtime defaults`.
7571
- Added a Python 3.10-compatible TOML loading path (`tomli` fallback when `tomllib` is unavailable).
76-
- Added optional-value report flags with deterministic defaults when passed without a path:
77-
- `--html` -> `.cache/codeclone/report.html`
78-
- `--json` -> `.cache/codeclone/report.json`
79-
- `--md` -> `.cache/codeclone/report.md`
80-
- `--sarif` -> `.cache/codeclone/report.sarif`
81-
- `--text` -> `.cache/codeclone/report.txt`
72+
- Added optional-value report flags with deterministic default paths when passed without a value:
73+
- `--html` → `.cache/codeclone/report.html`
73+
- `--json` → `.cache/codeclone/report.json`
74+
- `--md` → `.cache/codeclone/report.md`
75+
- `--sarif` → `.cache/codeclone/report.sarif`
76+
- `--text` → `.cache/codeclone/report.txt`
8278
- Added optional-value path flags for default-path intent:
8379
- `--baseline`
8480
- `--metrics-baseline`
@@ -87,41 +83,44 @@ final `2.0.0` release.
8783
- Replaced confusing argparse-generated double-negation aliases with explicit flag pairs:
8884
- `--no-progress` / `--progress`
8985
- `--no-color` / `--color`
90-
- Clarified CLI runtime footer wording: `Pipeline done in X.XXs`.
91-
Reported time is pipeline time, not full process wall-clock including launcher or interpreter startup.
92-
- Refreshed the terminal UI for both normal and `--ci` modes:
86+
- Clarified the CLI runtime footer wording: `Pipeline done in X.XXs` (pipeline time only, not full process wall-clock).
87+
- Refreshed the terminal UI for normal and `--ci` modes:
9388
- clearer run header with scan-root context
9489
- structured analysis summary and quality-metrics panels
9590
- explicit cache, clone, and baseline counters
96-
- report path and pipeline-time footer integrated into the summary surface
97-
- Fixed `pyproject.toml` override handling for `metrics_baseline`: a configured non-default metrics baseline path is now
98-
respected even when `--metrics-baseline` is not passed explicitly.
99-
100-
### Documentation
101-
102-
- Updated the root `README.md` to reflect CodeClone 2.0 as a structural clone detector, baseline-aware governance tool,
103-
and code-health gate.
104-
- Added a dedicated `pyproject.toml` configuration section (`[tool.codeclone]`) to the README.
105-
- Documented default-path behavior for bare report flags (`--html`, `--json`, `--text`).
106-
- Moved the long JSON report shape example under a collapsible `<details>` block for readability.
107-
- Added conservative performance guidance in the README with local run numbers and a 100k LOC extrapolation.
108-
- Updated contract docs in `docs/book/*` to reference `codeclone/report/*` directly instead of legacy shim paths.
109-
- Documented CLI timing semantics in `docs/book/09-cli.md`.
91+
- report path and pipeline-time footer integrated into the summary
92+
- Fixed `pyproject.toml` override handling for `metrics_baseline`: a configured non-default path is now respected even
93+
when `--metrics-baseline` is not passed explicitly.
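
The optional-value flags and the precedence rule in this section can be illustrated with plain `argparse`. This is a minimal sketch, not the project's parser: the `--example-limit` option, the `example_limit` config key, its default of `10`, and the helper names are hypothetical and exist only to show the `CLI > pyproject.toml > defaults` ordering. The config is passed in as an already-parsed dict.

```python
import argparse

DEFAULT_HTML_PATH = ".cache/codeclone/report.html"  # default path listed above


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sketch")
    # nargs="?" makes the value optional:
    #   flag absent      -> default (None, no report requested)
    #   bare flag        -> const (the deterministic default path)
    #   flag with value  -> that value
    parser.add_argument("--html", nargs="?", const=DEFAULT_HTML_PATH, default=None)
    # Hypothetical option, used only to demonstrate precedence.
    parser.add_argument("--example-limit", type=int, default=None)
    return parser


def resolve(cli_value, config_value, default):
    """Pick the first explicitly set value: CLI, then config, then default."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return default


def effective_settings(argv: list, config: dict) -> dict:
    """Merge parsed CLI arguments with an already-loaded config table."""
    args = build_parser().parse_args(argv)
    return {
        "html": args.html,
        "example_limit": resolve(args.example_limit, config.get("example_limit"), 10),
    }
```

Keeping the parser default at `None` is what lets the resolver tell "flag not passed" apart from "flag passed with the default value", so a configured value is not silently overridden.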
11094

11195
### Report Provenance and UI
11296

113-
- Added scan identity fields to report metadata:
97+
- Added scan identity fields to report meta:
11498
- `project_name`
11599
- `scan_root`
116100
- Rendered `Project` and `Scan root` in the HTML provenance panel.
117101
- Added `Project name` and `Scan root` to TXT report metadata.
118102
- Propagated the same fields into JSON report `meta` via the shared report metadata builder.
119-
- Fixed baseline provenance after `--update-baseline`: report metadata now reflects the freshly saved clone baseline
120-
hash (`baseline_payload_sha256`) and verification state in the same run.
103+
- Fixed baseline provenance after `--update-baseline`: report metadata now reflects the freshly saved baseline hash
104+
(`baseline_payload_sha256`) and verification state in the same run.
121105
- Simplified dependency SVG rendering internals by removing unreachable guard branches while preserving deterministic
122106
output.
123107
- Made suggestions table headers consistently render glossary help badges through a single deterministic template path.
124108

109+
### Detection Quality
110+
111+
- Made the dead-code detector more conservative for non-actionable runtime patterns:
112+
- skips test paths and test entrypoint names
113+
- skips dunder methods
114+
- skips dynamic visitor methods (`visit_*`) and setup/teardown hooks
115+
- skips `Protocol` methods and stub-like callables (`@overload`, `@abstractmethod`)
116+
- Reduced false positives without changing clone detection semantics.
117+
- Dead-code liveness now ignores references originating from test files, including cached test-file references, so
118+
production symbols used only in tests are still reported as dead-code candidates.
119+
- Dead-code liveness now uses exact canonical qualname references (including import-alias and module-alias usage)
120+
before fallback local-name checks, reducing false positives on re-export and alias wiring.
121+
- Refactored `scanner.iter_py_files` into deterministic helpers without semantic changes, reducing method complexity and
122+
keeping metrics-gate parity with the baseline.
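
The skip rules listed above amount to a predicate over a symbol's name, file path, and decorators. The sketch below is an assumption-laden illustration, not the detector's real code: the function names, the hook names in `SETUP_TEARDOWN_HOOKS`, and the test-path heuristics are all hypothetical, and `Protocol` membership is omitted because it needs class-level context.

```python
from pathlib import PurePosixPath

# Assumed sets for illustration; the real detector may use different names.
SKIPPED_DECORATORS = frozenset({"overload", "abstractmethod"})
SETUP_TEARDOWN_HOOKS = frozenset(
    {"setUp", "tearDown", "setUpClass", "tearDownClass", "setup_method", "teardown_method"}
)


def is_test_path(path: str) -> bool:
    """Heuristic: file lives under a tests/ directory or is named like a test."""
    parts = PurePosixPath(path).parts
    filename = parts[-1] if parts else ""
    return (
        "tests" in parts[:-1]
        or filename.startswith("test_")
        or filename.endswith("_test.py")
    )


def is_non_actionable(name: str, path: str, decorators: tuple = ()) -> bool:
    """Return True when a symbol should not be reported as dead code."""
    if is_test_path(path) or name.startswith("test_"):
        return True  # test paths and test entrypoint names
    if name.startswith("__") and name.endswith("__"):
        return True  # dunder methods are invoked implicitly
    if name.startswith("visit_") or name in SETUP_TEARDOWN_HOOKS:
        return True  # dynamic visitor dispatch and framework hooks
    return any(decorator in SKIPPED_DECORATORS for decorator in decorators)
```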
123+
125124
### Performance
126125

127126
- Added adaptive multiprocessing thresholds so small batches stay sequential instead of paying process-pool overhead.
@@ -136,23 +135,25 @@ final `2.0.0` release.
136135
- Improved warm-run responsiveness substantially while preserving deterministic behavior and output contracts.
137136
- Deferred HTML renderer import in CLI so non-HTML runs do not pay template/render startup cost.
138137
- Disabled transient status spinner contexts when `--no-progress` is active to reduce terminal I/O overhead.
139-
- Added canonical cache-entry fast-path for already validated runtime entries while preserving fallback validation for
140-
raw
141-
or externally mutated payloads.
138+
- Added a canonical cache-entry fast path for already validated runtime entries while preserving fallback validation for
139+
raw or externally mutated payloads.
142140
- Reused a shared parsed baseline payload when clone and metrics baselines point to the same file to avoid duplicate
143141
JSON reads/parses in one run.
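
The adaptive multiprocessing entry at the top of this list can be pictured as a size check in front of the process pool. The following is a sketch under assumptions: the threshold value `64`, the worker default, and the names `should_use_pool` and `process_all` are hypothetical and not taken from CodeClone.

```python
from concurrent.futures import ProcessPoolExecutor

MIN_ITEMS_FOR_POOL = 64  # assumed threshold, not the project's actual value


def should_use_pool(item_count: int, workers: int, min_items: int = MIN_ITEMS_FOR_POOL) -> bool:
    """Use a pool only when there are enough items to amortize its startup cost."""
    return workers > 1 and item_count >= min_items


def process_all(items, func, workers: int = 4, min_items: int = MIN_ITEMS_FOR_POOL) -> list:
    """Apply func to every item, sequentially for small batches."""
    batch = list(items)
    if not should_use_pool(len(batch), workers, min_items):
        return [func(item) for item in batch]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so output stays deterministic.
        return list(pool.map(func, batch))
```

On the pool path, `func` has to be picklable (for example a module-level function), and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard.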
144142

145-
### Detection Quality
143+
### Fixes
146144

147-
- Made the dead-code detector more conservative for non-actionable runtime patterns:
148-
- skips test paths and test entrypoint names
149-
- skips dunder methods
150-
- skips dynamic visitor methods (`visit_*`) and setup/teardown hooks
151-
- Reduced false positives without changing clone detection semantics.
152-
- Dead-code liveness now ignores references originating from test files, including cached test-file references, so
153-
production symbols used only in tests are still reported as dead-code candidates.
154-
- Refactored `scanner.iter_py_files` into deterministic helpers without semantic changes, reducing method complexity to
155-
keep metrics-gate parity with baseline.
145+
- Fixed scanner root-exclude short-circuit: only an explicitly excluded root directory is skipped; excluded segments in
146+
a parent path no longer suppress valid scans, preventing silent zero-file analysis for roots like `build/project`.
147+
- Optimized HTML snippet rendering path:
148+
- `_FileCache` now caches full file lines once per file and serves line-range slices without repeated full-file
149+
scans
150+
- Pygments imports are cached per importer identity to avoid repeated dynamic import overhead while preserving
151+
testability
152+
- Optimized block explainability AST stats:
153+
- added per-file statement index and range lookup via `bisect`, replacing repeated full `ast.walk()` scans per range
154+
- Added scanner regression coverage for roots under excluded parent directories.
155+
- No baseline/cache/report schema contract changes in this branch; detector identity semantics and golden compatibility
156+
are preserved.
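
The `bisect`-based range lookup mentioned above can be sketched as follows. This is an illustration of the general technique, assuming the index stores statement start lines; the function names are hypothetical and the real explainability code may track more than a count.

```python
import ast
import bisect


def build_statement_index(source: str) -> list:
    """Walk the AST once and collect every statement's start line, sorted."""
    tree = ast.parse(source)
    return sorted(
        node.lineno for node in ast.walk(tree) if isinstance(node, ast.stmt)
    )


def count_statements(index: list, start_line: int, end_line: int) -> int:
    """Count statements starting within [start_line, end_line] via binary search."""
    lo = bisect.bisect_left(index, start_line)
    hi = bisect.bisect_right(index, end_line)
    return hi - lo
```

The single `ast.walk()` pass happens when the index is built; each later range query is two binary searches instead of another full tree traversal.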
156157

157158
### Tests and Tooling
158159

README.md

Lines changed: 7 additions & 1 deletion
@@ -25,6 +25,7 @@ all with baseline-aware governance that separates **known** technical debt from
2525
## Features
2626

2727
- **Clone detection** — function (CFG fingerprint), block (statement windows), and segment (report-only) clones
28+
- **Structural findings** — duplicated branch families, clone guard/exit divergence and clone-cohort drift (report-only)
2829
- **Quality metrics** — cyclomatic complexity, coupling (CBO), cohesion (LCOM4), dependency cycles, dead code, health
2930
score
3031
- **Baseline governance** — known debt stays accepted; CI blocks only new clones and metric regressions
@@ -147,6 +148,11 @@ Contract errors (`2`) take precedence over gating failures (`3`).
147148
| Text | `--text` | `.cache/codeclone/report.txt` |
148149

149150
All report formats are rendered from one canonical JSON report document.
151+
Structural findings include:
152+
153+
- `duplicated_branches`
154+
- `clone_guard_exit_divergence`
155+
- `clone_cohort_drift`
150156

151157
<details>
152158
<summary>JSON report shape (v2.1)</summary>
@@ -259,7 +265,7 @@ Architecture: [`docs/architecture.md`](docs/architecture.md) · CFG semantics: [
259265
| Docker benchmark contract | [`docs/book/18-benchmarking.md`](docs/book/18-benchmarking.md) |
260266
| Determinism | [`docs/book/12-determinism.md`](docs/book/12-determinism.md) |
261267

262-
## * Benchmarking
268+
## * Benchmarking
263269

264270
<details>
265271
<summary>Reproducible Docker Benchmark</summary>
