Skip to content

split-commits: Add skill for non-interactive commit splitting#5

Draft
cgwalters wants to merge 3 commits intobootc-dev:mainfrom
cgwalters:split-commits
Draft

split-commits: Add skill for non-interactive commit splitting#5
cgwalters wants to merge 3 commits intobootc-dev:mainfrom
cgwalters:split-commits

Conversation

@cgwalters
Copy link
Contributor

Add a new skill for splitting mixed working tree changes into clean,
logical commits. This is designed for AI agents that don't have TTY
access for interactive tools like git add -i.

The tool uses a session-based workflow: prepare topics, review hunks,
assign each hunk to a topic, then commit topics in logical order.
It validates staleness via blob hashes and uses git apply --cached
with 3-way merge fallback.

Assisted-by: OpenCode (Claude claude-opus-4-5@20251101)
Signed-off-by: Colin Walters walters@verbum.org

Add a simple Justfile with a 'check' target that runs Python syntax
validation on all scripts in the repository.

Signed-off-by: Colin Walters <walters@verbum.org>
Signed-off-by: Colin Walters <walters@verbum.org>
Add a new skill for splitting mixed working tree changes into clean,
logical commits. This is designed for AI agents that don't have TTY
access for interactive tools like `git add -i`.

The tool uses a session-based workflow: prepare topics, review hunks,
assign each hunk to a topic, then commit topics in logical order.
It validates staleness via blob hashes and uses `git apply --cached`
with 3-way merge fallback.

Ported from devaipod contrib/.

Assisted-by: OpenCode (Claude claude-opus-4-5@20251101)
Signed-off-by: Colin Walters <walters@verbum.org>
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable new skill for non-interactive commit splitting, which is well-documented and thoughtfully designed for AI agents. The core logic is solid, particularly the use of blob hashes for staleness checks. I've identified a few areas for improvement, including a couple of high-severity bugs related to parsing diffs and handling command-line arguments that could lead to incorrect behavior or crashes. I've also noted some opportunities to enhance code style and maintainability. My detailed feedback and suggestions are in the comments below.

Comment on lines +114 to +121
changed_files = git("diff", "--name-only").stdout.strip().split("\n")
file_blobs = {}
for filepath in changed_files:
if filepath:
file_blobs[filepath] = {
"index": get_index_blob(filepath),
"worktree": hash_object(filepath),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script does not correctly handle files that have been deleted from the working tree. git diff --name-only will list deleted files, but the subsequent call to hash_object(filepath) on line 120 will fail because the file no longer exists. This prevents the session from being prepared and makes it impossible to include file deletions in commits using this tool.

You should consider using git diff --name-status to detect file statuses and handle deleted files appropriately (e.g., by skipping the hash_object call for the worktree version of deleted files).

Comment on lines +180 to +185
match = re.match(r"@@ -(\d+)", hunk_header)
if match:
old_start = int(match.group(1))
match = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", hunk_header)
if match:
new_start = int(match.group(1))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regular expression logic to parse the new_start line number from the hunk header is incorrect. The re.match on line 183 will always fail because it tries to match from the beginning of the string, while the + part of the hunk header is not at the start. This results in new_start always being its default value of 0, which is a bug that breaks hunk parsing, especially for new files.

You should use a single, more robust regular expression with re.search to capture both old_start and new_start line numbers.

Suggested change
match = re.match(r"@@ -(\d+)", hunk_header)
if match:
old_start = int(match.group(1))
match = re.match(r"@@ -\d+(?:,\d+)? \+(\d+)", hunk_header)
if match:
new_start = int(match.group(1))
match = re.search(r"@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)?", hunk_header)
if match:
old_start = int(match.group(1))
new_start = int(match.group(2))

Comment on lines +449 to +455
elif cmd == "next":
hunk_id = int(sys.argv[2]) if len(sys.argv) > 2 else None
session.next(hunk_id)
elif cmd == "assign":
if len(sys.argv) < 4:
die("usage: git-split-commits assign HUNK_ID TOPIC")
session.assign(int(sys.argv[2]), sys.argv[3])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script does not handle cases where a non-integer value is provided for HUNK_ID in the next and assign commands. This will cause an unhandled ValueError and crash the script with a traceback.

You should wrap the int() conversions in a try...except ValueError block and call die() with a user-friendly error message to make the command-line interface more robust.


At the current time, these skills are not intended for use outside
of the organization.
This is intentially split into its own git repository to make it

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo in the word "intentially". It should be "intentionally".

Suggested change
This is intentially split into its own git repository to make it
This is intentionally split into its own git repository to make it

"""

import json
import os

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The os module is imported but never used. Unused imports should be removed to keep the code clean.

old_start = new_start = 0
if hunk_header.startswith("@@"):
# Format: @@ -old_start,old_count +new_start,new_count @@
import re

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Imports should be placed at the top of the file, as per PEP 8 style guidelines. This improves readability and avoids re-importing modules. Please move import re to the top-level imports section of the file.

if first_context and first_change:
break

hunk_id = len(list(self.hunks_dir.glob("*.patch")))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The method for generating hunk_id by counting files on disk with self.hunks_dir.glob("*.patch") is inefficient and brittle. It performs a filesystem operation for every single hunk, and could be unreliable if the state directory is manipulated externally.

A more robust and efficient approach would be to use a simple integer counter that is initialized in the prepare method and incremented for each hunk.

Comment on lines +228 to +231
if topic in topic_counts:
topic_counts[topic] += 1
else:
topic_counts[topic] = 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The else block here is unreachable. The assign command ensures that a hunk's topic is always one of the predefined topics, which are already in topic_counts. Therefore, topic in topic_counts will always be true. This block can be removed to simplify the code.

            topic_counts[topic] += 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant