mpritchard2/trusty-git-analytics

 
 

trusty-git-analytics

Analyze git repositories to measure developer productivity — classify commit work types, track weekly velocity, and export CSV/JSON/Markdown reports.

What It Does

tga walks one or more local git repositories, collects every commit into a SQLite database, and classifies each commit into a work category (feature, bugfix, refactor, etc.) using a seven-tier classification cascade. It then aggregates the results into per-author, per-week, DORA, velocity, and quality reports. It is a feature-complete Rust port of gitflow-analytics that keeps the same YAML config schema and the same SQLite schema, so existing config files work without modification.

Installation

From crates.io (recommended)

cargo install trusty-git-analytics

This installs the tga binary to ~/.cargo/bin/. Ensure ~/.cargo/bin is in your PATH.

From source

git clone https://github.com/bobmatnyc/trusty-git-analytics
cd trusty-git-analytics
cargo install --path .

Verify installation

tga --version
tga --help

Quick Start

Run Your First Analysis

Step 1 — Create a config.yaml:

repositories:
  - path: ~/code/my-project
    name: my-project

output:
  directory: ./reports
  formats: [csv, json, markdown]

Step 2 — Run the full pipeline:

tga analyze --config config.yaml

Step 3 — Find reports in ./reports/:

reports/
├── authors.csv         # Per-author commit summary
├── weekly_activity.csv # Week-by-week breakdown
├── report.json         # Full structured payload
└── report.md           # Narrative Markdown report

Configuration

Minimal config.yaml

repositories:
  - path: ~/code/my-repo
    name: my-repo

All other sections are optional. When output.formats is omitted, all three formats (CSV, JSON, Markdown) are written.

Full config reference

| Key | Type | Default | Description |
|---|---|---|---|
| repositories | list | required | Repos to analyze |
| developer_aliases | map | {} | Canonical name → list of emails/aliases |
| team | object | | Alternative to developer_aliases; roster with email |
| output.directory | path | ./reports | Where reports are written |
| output.formats | list | [csv, json, markdown] | csv, json, and/or markdown |
| output.include_unclassified | bool | false | Include commits with no category |
| output.include_merges | bool | false | Include merge commits |
| output.include_files | bool | false | Include file-level change detail |
| classification.rules_file | path | | Path to custom rules YAML/JSON |
| classification.use_llm | bool | false | Enable LLM fallback tier |
| classification.llm_model | string | gpt-4o-mini | LLM model identifier |
| classification.confidence_threshold | float | 0.7 | Minimum acceptance confidence |
| classification.llm_fallback_threshold | float | 0.0 | Commits with confidence above this value skip the LLM tier |
| classification.llm_fallback_concurrency | uint | 4 | Max concurrent LLM requests during fallback |
| github.token | string | $GITHUB_TOKEN | GitHub PAT for PR fetch |
| github.org | string | | Org slug for org-wide PR queries |
| github.repo | string | | Single repo slug (owner/name) |
| github.fetch_prs | bool | false | Fetch pull request metadata |
| github.ticket_regex | string | | Override regex for detecting GitHub ticket refs in commit messages |
| jira.url | string | | JIRA base URL |
| jira.username | string | | JIRA API username (email for Cloud) |
| jira.token | string | | JIRA API token |
| jira.project_key | string | | Project key filter (e.g. API) |
| jira.ticket_regex | string | | Override regex for detecting JIRA ticket refs in commit messages |
| linear.ticket_regex | string | | Override regex for detecting Linear ticket refs in commit messages |
| pm.azure_devops.organization_url | string | | ADO org URL (e.g. https://dev.azure.com/myorg) |
| pm.azure_devops.pat | string | | Azure DevOps Personal Access Token |
| pm.azure_devops.project | string | | Default ADO project name |
| pm.azure_devops.fetch_on_reference | bool | false | Fetch work items when AB#N refs appear in commits |
| pm.azure_devops.fetch_prs | bool | false | Fetch ADO pull requests and reviewer data |
| pm.azure_devops.ticket_regex | string | AB#(\d+) | Override regex for detecting ADO work item refs in commit messages |
| pm.bitbucket.workspace | string | | Bitbucket Cloud workspace slug |
| pm.bitbucket.repo_slug | string | | Repository slug within the workspace |
| pm.bitbucket.fetch_prs | bool | false | Fetch Bitbucket Cloud pull request metadata |
| pm.bitbucket.token | string | $BITBUCKET_TOKEN | Bearer token (App password or OAuth) |
| pm.bitbucket.username | string | | Atlassian account username for Basic auth |
| pm.bitbucket.app_password | string | | Atlassian App password for Basic auth (alternative to token) |
| cache.directory | path | | Cache directory (supports ~) |
| version | string | | Schema version; stored for compatibility |
| profile | string | | Named profile; stored for compatibility |

Paths support ~ expansion. Config files from the Python gitflow-analytics tool load without changes — unknown keys are silently ignored.
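As an illustration of what ~ expansion in config paths typically involves, here is a minimal sketch. The function name expand_tilde and the choice to pass the home directory in as a parameter are assumptions made for this example; this is not tga's actual implementation.

```rust
use std::path::{Path, PathBuf};

/// Expand a leading `~` or `~/` using the supplied home directory.
/// Hypothetical helper: other forms (`~user/x`, `a/~/b`) are returned unchanged,
/// as is any path when no home directory is known.
fn expand_tilde(raw: &str, home: Option<&Path>) -> PathBuf {
    match (raw, home) {
        // A bare "~" is the home directory itself.
        ("~", Some(h)) => h.to_path_buf(),
        // "~/rest" becomes "<home>/rest".
        (p, Some(h)) if p.starts_with("~/") => h.join(&p[2..]),
        // Everything else is passed through untouched.
        _ => PathBuf::from(raw),
    }
}
```

In real use the home directory could come from std::env::var_os("HOME"); keeping it a parameter makes the helper easy to test.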

developer_aliases vs team.members

developer_aliases (Python-compatible flat map):

developer_aliases:
  "Alice Smith":
    - "alice@company.com"
    - "asmith@company.com"
    - "alice@personal.dev"
  "Bob Jones":
    - "bob@company.com"
    - "129991831+bobgithub@users.noreply.github.com"

team.members (structured roster with canonical email):

team:
  members:
    - name: Alice Smith
      email: alice@company.com
      aliases:
        - asmith@company.com
        - alice@personal.dev

When developer_aliases is non-empty it takes precedence over team.members. Use developer_aliases when migrating an existing Python config file; use team.members for new setups where canonical email matters for downstream tooling.
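To make the flat-map shape concrete, the sketch below inverts a developer_aliases-style map into an alias lookup and resolves a commit email to its canonical name. The function names and the case-insensitive matching are assumptions for illustration; tga's real identity resolution may differ.

```rust
use std::collections::HashMap;

/// Invert a `developer_aliases`-style map (canonical name -> emails/aliases)
/// into a lookup from lowercased alias to canonical name.
/// Lowercasing is an assumption of this sketch, not documented tga behavior.
fn build_alias_index(aliases: &HashMap<String, Vec<String>>) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, entries) in aliases {
        for entry in entries {
            index.insert(entry.to_lowercase(), canonical.clone());
        }
    }
    index
}

/// Resolve an author email to a canonical name, falling back to the raw email
/// when no alias matches.
fn resolve_author<'a>(index: &'a HashMap<String, String>, email: &'a str) -> &'a str {
    index
        .get(&email.to_lowercase())
        .map(String::as_str)
        .unwrap_or(email)
}
```

With the Alice Smith entry above, both alice@company.com and asmith@company.com would resolve to "Alice Smith", while an unlisted address is returned as-is.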

Example: multi-repo config with GitHub

See configs/example-config.yaml for a working example that covers multiple repositories, developer aliases, and CSV+Markdown output.

CLI Reference

All subcommands accept these global flags:

| Flag | Default | Description |
|---|---|---|
| --config <PATH> | config.yaml | Path to config YAML |
| --database <PATH> | tga.db | Path to SQLite database |
| -v / -vv / -vvv | warnings only | Increase log verbosity |

tga analyze

Run the full pipeline: collect → classify → report.

tga analyze [--config <PATH>] [--database <PATH>] [--output <DIR>]
            [--skip-collect] [--skip-classify] [--weeks <N>]

| Flag | Description |
|---|---|
| --skip-collect | Skip Stage 1; use commits already in the database |
| --skip-classify | Skip Stage 2; use existing classifications |
| --output <DIR> | Override output.directory from config |
| --weeks <N> | Limit collection to the last N weeks (overrides config start_date) |

# Full pipeline
tga analyze --config config.yaml

# Re-run reports only (commits already collected and classified)
tga analyze --skip-collect --skip-classify --output ./reports-v2

tga collect

Stage 1: extract commits from git repositories into the database.

tga collect [--config <PATH>] [--database <PATH>]
            [--repos <NAME,...>] [--since <DATE>] [--until <DATE>] [--weeks <N>]

| Flag | Description |
|---|---|
| --repos <NAME,...> | Comma-separated list of repository names to collect; others are skipped |
| --since <DATE> | Collect commits on or after this ISO 8601 date (overrides config and --weeks) |
| --until <DATE> | Collect commits on or before this ISO 8601 date (overrides config) |
| --weeks <N> | Limit collection to the last N weeks; --since takes precedence if both supplied |

tga collect --repos my-project --since 2024-01-01 --until 2024-03-31
tga collect --weeks 4   # collect last 4 weeks across all repos
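
The precedence among --since, --weeks, and the config start date described in the flag table can be sketched as a small resolution function. This is an illustrative sketch only: the name effective_since is hypothetical, and timestamps are plain Unix seconds to avoid depending on a date library.

```rust
const SECONDS_PER_WEEK: i64 = 7 * 24 * 60 * 60;

/// Pick the effective lower bound for collection, as Unix seconds.
/// Precedence (per the flag descriptions): explicit --since first,
/// then --weeks counted back from `now`, then the config value.
/// Returns None when nothing limits the start of the window.
fn effective_since(
    since: Option<i64>,
    weeks: Option<u32>,
    config_start: Option<i64>,
    now: i64,
) -> Option<i64> {
    since
        .or_else(|| weeks.map(|n| now - i64::from(n) * SECONDS_PER_WEEK))
        .or(config_start)
}
```

Chaining Option::or_else and Option::or keeps the precedence order readable in a single expression.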

tga classify

Stage 2: run the classification cascade over collected commits.

tga classify [--config <PATH>] [--database <PATH>]
             [--rules <PATH>] [--use-llm]

| Flag | Description |
|---|---|
| --rules <PATH> | Override classification.rules_file from config |
| --use-llm | Enable LLM fallback regardless of config setting |

tga classify --rules ./custom-rules.yaml --use-llm

tga report

Stage 3: generate reports from classified commits.

tga report [--config <PATH>] [--database <PATH>]
           [--output <DIR>] [--formats <FMT,...>]

| Flag | Description |
|---|---|
| --output <DIR> | Override output.directory from config |
| --formats <FMT,...> | Comma-separated: csv, json, markdown |

tga report --output ./q1-reports --formats csv,json

Pipeline Architecture

git repos ──┐
             │   collect      SQLite (tga.db)   classify       SQLite    report
GitHub API ──┼──────────────► [commits]        ──────────────► [classif]─────────► CSV (×9)
JIRA API ────┤   (libgit2,    [authors]         (7-tier                 ► JSON
Linear API ──┤   reqwest)     [pull_requests]   cascade,                ► Markdown
ADO API  ────┘                [work_items]      Rayon-parallel)

Stage 1 — collect (tga::collect): opens each repository with libgit2, walks the configured branch, extracts commit metadata and diff stats, resolves author identities, fetches GitHub PR / JIRA issue / Linear / Azure DevOps work item metadata via REST/GraphQL, and writes everything to SQLite.

Stage 2 — classify (tga::classify): reads unclassified commits from the database, runs each message through the seven-tier cascade (see below), and writes a classification verdict back. Rule-based tiers execute in parallel via Rayon.

Stage 3 — report (tga::report): reads the classified database, aggregates per-author, per-week, DORA, velocity, and quality statistics, and writes the configured output formats to the output directory.
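
The per-week aggregation in Stage 3 can be pictured as grouping commits by a (week, author, repository) key and summing their stats. The sketch below is a simplified illustration under that assumption; the Commit and Bucket types and the weekly_activity function are hypothetical and do not reflect tga's internal types.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down commit record for this sketch.
struct Commit {
    week: String,
    author: String,
    repository: String,
    insertions: u64,
    deletions: u64,
}

/// Running totals for one (week, author, repository) bucket.
#[derive(Debug, Default, PartialEq)]
struct Bucket {
    commit_count: u64,
    insertions: u64,
    deletions: u64,
}

/// Group commits by (week, author, repository) and sum their stats.
/// A BTreeMap keeps the buckets sorted by key, which gives stable report ordering.
fn weekly_activity(commits: &[Commit]) -> BTreeMap<(String, String, String), Bucket> {
    let mut buckets: BTreeMap<(String, String, String), Bucket> = BTreeMap::new();
    for c in commits {
        let key = (c.week.clone(), c.author.clone(), c.repository.clone());
        let bucket = buckets.entry(key).or_default();
        bucket.commit_count += 1;
        bucket.insertions += c.insertions;
        bucket.deletions += c.deletions;
    }
    buckets
}
```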

Classification

Seven-Tier Cascade

Each commit message is tested against tiers in order. The first tier to produce a confident result wins.

Tier 0 — Manual Override (confidence 1.0): looks up the (commit_hash, repo_path) pair in the classification_overrides table. Managed via tga override add|list|remove.

Tier 1.5 — Issue Type (confidence 0.90): when the commit has ticket references resolving to rows in issue_cache, maps the upstream issue type (bug, story, task, spike, etc.) directly to a change_type.

Tier 3 — JIRA Project Mapping (confidence 0.95): when jira_project_mappings is configured, maps the JIRA project key prefix of any [A-Z]+-\d+ reference to a change_type.

Tier 4 — Exact (Aho-Corasick): builds a single finite-state machine from every keyword list across every rule and scans the message in O(n) time. Matches feat:, fix:, chore:, etc. Confidence 0.85–0.95.

Tier 5 — Regex: applies pre-compiled regex patterns from the rule set. Handles anchored conventional-commit patterns (^feat(\([^)]*\))?!?:) and JIRA ticket IDs (\b[A-Z][A-Z0-9]+-\d+\b).

Tier 6 — Fuzzy heuristics: detects merge commits (via is_merge flag or Merge pull request prefix) and reverts (via Revert prefix). No external dependencies.

Tier 7 — LLM fallback (optional, async): calls an OpenAI-compatible API (OpenRouter by default, AWS Bedrock behind the bedrock cargo feature) when tiers 0–6 leave a commit in a fallthrough category. Disabled by default; enable with analysis.llm_classification.enabled: true or --use-llm. Results are only accepted when confidence >= confidence_threshold (default 0.7).
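
The "first confident tier wins" control flow can be sketched in a few lines. Everything here is a simplified stand-in: the Verdict type, the two toy tiers, their confidence values, and the "merge" category name are assumptions for illustration, and the real cascade has seven tiers with database and network lookups.

```rust
/// Hypothetical result of one tier.
#[derive(Debug, Clone, PartialEq)]
struct Verdict {
    category: &'static str,
    confidence: f64,
    tier: &'static str,
}

/// A tier inspects a commit message and may propose a verdict.
type Tier = fn(&str) -> Option<Verdict>;

/// Toy stand-in for the exact-keyword tier (plain prefix checks, not Aho-Corasick).
fn exact_prefix(message: &str) -> Option<Verdict> {
    let lower = message.to_lowercase();
    if lower.starts_with("feat:") {
        Some(Verdict { category: "feature", confidence: 0.9, tier: "exact" })
    } else if lower.starts_with("fix:") {
        Some(Verdict { category: "bugfix", confidence: 0.9, tier: "exact" })
    } else {
        None
    }
}

/// Toy stand-in for the merge/revert heuristics tier (confidence 0.8 is assumed).
fn heuristics(message: &str) -> Option<Verdict> {
    if message.starts_with("Merge pull request") {
        Some(Verdict { category: "merge", confidence: 0.8, tier: "heuristic" })
    } else if message.starts_with("Revert") {
        Some(Verdict { category: "revert", confidence: 0.8, tier: "heuristic" })
    } else {
        None
    }
}

/// Try tiers in order; the first verdict at or above `threshold` wins.
/// Iterators are lazy, so later tiers are not evaluated once one is accepted.
fn classify(message: &str, tiers: &[Tier], threshold: f64) -> Verdict {
    tiers
        .iter()
        .filter_map(|tier| tier(message))
        .find(|v| v.confidence >= threshold)
        .unwrap_or(Verdict { category: "uncategorized", confidence: 0.0, tier: "none" })
}
```

Ordering the slice from cheapest to most expensive tier means an optional, slow tier such as an LLM call only runs for messages that every earlier tier declined.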

Default Rules

| ID | Category | Keywords / Patterns |
|---|---|---|
| cc-feat | feature | feat:, feature:, ^feat(...)?!?: |
| cc-fix | bugfix | fix:, bugfix:, hotfix, ^fix(...)?!?: |
| cc-chore | chore | chore:, ^chore(...)?!?: |
| cc-docs | documentation | docs:, doc:, ^docs?(...)?!?: |
| cc-refactor | refactor | refactor:, ^refactor(...)?!?: |
| cc-test | test | test:, tests:, ^tests?(...)?!?: |
| cc-ci | ci | ci:, ^ci(...)?!?: |
| cc-perf | performance | perf:, ^perf(...)?!?: |
| cc-style | style | style:, ^style(...)?!?: |
| cc-build | build | build:, ^build(...)?!?: |
| cc-revert | revert | revert:, ^revert(...)?!?: |
| breaking-change | breaking | breaking change, breaking-change |
| jira-ticket | feature (ticketed) | \b[A-Z][A-Z0-9]+-\d+\b |
| kw-bug | bugfix | bug, defect |
| kw-security | bugfix (security) | security, cve-, vulnerability |

Commits that match no rule are assigned category uncategorized with confidence 0.0.
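
To show what the anchored conventional-commit patterns accept, here is a hand-written parser for the type(scope)!: header shape. It is a stdlib-only stand-in for the regex rules, not the regex engine tga uses; the Header type and both function names are hypothetical, and matching here is case-sensitive.

```rust
/// Parsed pieces of a conventional-commit header (hypothetical type).
#[derive(Debug, PartialEq)]
struct Header<'a> {
    kind: &'a str,
    scope: Option<&'a str>,
    breaking: bool,
}

/// Parse `type`, optional `(scope)`, optional `!`, then a required `:`.
fn parse_header(message: &str) -> Option<Header<'_>> {
    // The type is the leading run of ASCII letters.
    let kind_len = message.find(|c: char| !c.is_ascii_alphabetic())?;
    if kind_len == 0 {
        return None;
    }
    let (kind, mut rest) = message.split_at(kind_len);

    // Optional "(scope)"; the scope ends at the first ')'.
    let mut scope = None;
    if let Some(after_paren) = rest.strip_prefix('(') {
        let close = after_paren.find(')')?;
        scope = Some(&after_paren[..close]);
        rest = &after_paren[close + 1..];
    }

    // Optional '!' marks a breaking change.
    let breaking = rest.starts_with('!');
    if breaking {
        rest = &rest[1..];
    }

    // The header must end with ':'.
    rest.starts_with(':').then(|| Header { kind, scope, breaking })
}

/// Map a header type to a category, following the default rules table.
fn category_for(kind: &str) -> &'static str {
    match kind {
        "feat" | "feature" => "feature",
        "fix" | "bugfix" => "bugfix",
        "docs" | "doc" => "documentation",
        "test" | "tests" => "test",
        "perf" => "performance",
        "chore" => "chore",
        "refactor" => "refactor",
        "ci" => "ci",
        "style" => "style",
        "build" => "build",
        "revert" => "revert",
        _ => "uncategorized",
    }
}
```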

Custom Rules File

Supply your own rules alongside the defaults:

# my-rules.yaml
version: "1.0"
rules:
  - id: my-deploy
    category: deployment
    keywords:
      - "deploy:"
      - "release:"
    patterns:
      - "(?i)^deploy(ment)?:"
    priority: 80
    confidence: 0.9

tga classify --rules ./my-rules.yaml
# or in config.yaml:
# classification:
#   rules_file: ./my-rules.yaml

Output Formats

CSV

Two files are written when csv is in the format list:

authors.csv — one row per author:

| Column | Description |
|---|---|
| name | Canonical author name |
| email | Canonical author email |
| commit_count | Total commits |
| insertions | Total lines added |
| deletions | Total lines deleted |
| files_changed | Total files changed |
| first_commit | ISO 8601 timestamp of earliest commit |
| last_commit | ISO 8601 timestamp of most recent commit |

weekly_activity.csv — one row per week/author/repository bucket:

| Column | Description |
|---|---|
| week | ISO week label, e.g. 2024-W03 |
| author | Author name |
| repository | Repository name |
| commit_count | Commits in this bucket |
| insertions | Lines added in this bucket |
| deletions | Lines deleted in this bucket |
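
For readers unfamiliar with ISO week labels such as 2024-W03, the sketch below derives one from a calendar date using the standard ISO 8601 week rule (weeks start on Monday; week 1 contains the year's first Thursday). The function names are hypothetical, inputs are assumed to be valid Gregorian dates, and whether tga computes labels exactly this way is an assumption.

```rust
fn is_leap(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Day of year, 1-based.
fn ordinal(year: i32, month: u32, day: u32) -> i32 {
    const CUMULATIVE: [i32; 12] = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];
    let leap_day = if month > 2 && is_leap(year) { 1 } else { 0 };
    CUMULATIVE[(month - 1) as usize] + day as i32 + leap_day
}

/// ISO weekday: Monday = 1 ... Sunday = 7.
fn iso_weekday(year: i32, month: u32, day: u32) -> i32 {
    // Sakamoto's method yields 0 = Sunday ... 6 = Saturday.
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    let y = if month < 3 { year - 1 } else { year };
    let dow = (y + y / 4 - y / 100 + y / 400 + T[(month - 1) as usize] + day as i32) % 7;
    if dow == 0 { 7 } else { dow }
}

/// Number of ISO weeks in a year: 53 if Jan 1 is a Thursday,
/// or a Wednesday in a leap year; otherwise 52.
fn weeks_in_year(year: i32) -> i32 {
    let jan1 = iso_weekday(year, 1, 1);
    if jan1 == 4 || (jan1 == 3 && is_leap(year)) { 53 } else { 52 }
}

/// Format a date as an ISO week label, e.g. "2024-W03".
/// Dates near New Year may belong to the previous or next ISO year.
fn iso_week_label(year: i32, month: u32, day: u32) -> String {
    let week = (ordinal(year, month, day) - iso_weekday(year, month, day) + 10) / 7;
    let (iso_year, iso_week) = if week < 1 {
        (year - 1, weeks_in_year(year - 1))
    } else if week > weeks_in_year(year) {
        (year + 1, 1)
    } else {
        (year, week)
    };
    format!("{}-W{:02}", iso_year, iso_week)
}
```

Note that the year in the label is the ISO week-numbering year, so a commit dated 2024-12-30 falls in 2025-W01 under this rule.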

JSON

report.json — full structured payload:

{
  "generated_at": "2024-03-15T10:00:00Z",
  "period_start": "2024-01-01T00:00:00Z",
  "period_end":   "2024-03-14T23:59:59Z",
  "total_commits": 347,
  "total_authors": 8,
  "category_breakdown": { "feature": 120, "bugfix": 45, ... },
  "authors": [
    {
      "name": "Alice Smith",
      "email": "alice@company.com",
      "commit_count": 87,
      "insertions": 4200,
      "deletions": 1100,
      "files_changed": 310,
      "categories": { "feature": 50, "bugfix": 20, ... },
      "first_commit": "...",
      "last_commit": "..."
    }
  ],
  "repositories": [
    {
      "name": "my-project",
      "commit_count": 347,
      "author_count": 8,
      "insertions": 18000,
      "deletions": 6000,
      "top_categories": [["feature", 120], ["bugfix", 45]]
    }
  ],
  "weekly_activity": [
    {
      "week": "2024-W03",
      "author": "Alice Smith",
      "repository": "my-project",
      "commit_count": 12,
      "insertions": 500,
      "deletions": 120,
      "categories": { "feature": 8, "bugfix": 4 }
    }
  ]
}

Markdown

report.md — a narrative report containing a summary header, per-author commit table, category breakdown, and weekly activity section. Suitable for pasting into Confluence or a PR description.

Development

Build and Test

# Build everything
cargo build

# Build release binary
cargo build --release

# Run all tests
cargo test

# Lint (zero warnings required)
cargo clippy -- -D warnings

# Format check (CI gate)
cargo fmt --check

# Auto-format
cargo fmt

# Generate rustdoc
cargo doc --open

Running Against Real Repos

configs/example-config.yaml is a working example that analyzes repositories using developer_aliases. Copy it, adjust paths and names to match your setup, then run:

tga analyze --config configs/example-config.yaml --database tga.db

CI Gates

The GitHub Actions workflow (ci.yml) requires:

  • cargo fmt -- --check
  • cargo clippy --all-targets -- -D warnings
  • cargo test
  • cargo doc --no-deps with RUSTDOCFLAGS="-D warnings"

Crate Structure

Single tga crate (consolidated from the original 5-crate workspace):

| Module | Path | Purpose |
|---|---|---|
| tga::core | src/core/ | Shared types, config, DB schema, migrations, error types |
| tga::collect | src/collect/ | Stage 1: git extraction (libgit2), GitHub/JIRA/Linear/ADO clients, PmAdapter trait |
| tga::classify | src/classify/ | Stage 2: seven-tier classification cascade |
| tga::report | src/report/ | Stage 3: CSV/JSON/Markdown output |
| commands (binary-private) | src/commands/ | Subcommand handlers wired into src/main.rs |

License

Non-commercial use only. See LICENSE for terms.

Elastic License 2.0
