Proposal: Streaming pager mode for git log -p and other large patch streams #247

@bezhermoso

Description

Description

How this was put together: The design and the prototype implementation were developed in collaboration with a coding agent (Claude Code). I drove the direction, made the judgment calls on scope and trade-offs, and reviewed the code; the agent did the codebase archaeology, drafted the implementation, and helped pressure-test the design. Disclosing because it shaped how this got from "the pager feels broken on git log -p" to a working stack of branches in roughly a sitting, and that's worth being transparent about.

Prototype: https://github.com/bezhermoso/hunk/tree/feat/pager-stream-prototype

Problem

When Hunk is configured as Git's pager (GIT_PAGER=hunk, core.pager=hunk, or invoked as hunk pager), it works well for the kind of input most pagers see day to day: a single git diff, a single git show, a git format-patch payload. As soon as the input gets large, e.g. running git log -p against an old monorepo, the experience falls apart in three distinct ways:

  1. Time to first paint is bound to total stream size. Hunk reads stdin to completion before doing anything else. Nothing renders until the entire stream has been buffered, decoded, and parsed. On a long-lived repository this can mean a multi-second to multi-minute blank screen, even though the user only wants to scan the first few commits.

  2. Peak memory grows with the input, and the legacy path doubles it. The full pager input lives in memory as a single decoded string, and the parsed metadata derived from it lives alongside, so peak memory is roughly twice the input size. Multi-gigabyte streams thrash the runtime or fail outright. There is no upper bound and no graceful degradation.

  3. Commit metadata vanishes. Hunk's patch model is per-file. git log -p is a per-commit format: commit headers (sha, author, date, message body) sit between the file diffs. The current parser has no concept of these boundaries, so commit metadata is silently dropped from the rendered output. Reading a commit log through Hunk shows a flat stream of file diffs with no indication of which commit produced which change.

The combined effect is that Hunk is currently not a viable pager replacement for the workflow most pagers exist to serve in a large repository: skimming git log -p for recent activity. The user has to fall back to less, which keeps Hunk siloed to the small/single-changeset case.

Proposal

Treat pager mode as a stream end-to-end and surface commit context inline with the diff the way git log -p does in less. I propose the following concepts, each with its own concerns, that together replace the single "buffer then parse" pipeline.

Streaming stdin I/O

The pager entrypoint should stop waiting for end-of-stream before doing any work. Stdin should be consumed as an async iterator of lines, decoded incrementally with UTF-8 stream-mode handling and partial-line carry across chunk boundaries. This is the foundation the rest of the pipeline builds on; whether the stream is actually git log -p output is decided later, by sniffing a bounded prefix, rather than up front.
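A minimal sketch of the decoding contract, in Python for illustration (Hunk's actual language and API may differ; `iter_lines` is a name I made up). The key properties are that multi-byte UTF-8 sequences split across chunk boundaries decode correctly, and that a partial line at the end of one chunk is carried into the next:

```python
import codecs

def iter_lines(chunks):
    """Yield complete lines from an iterable of byte chunks, decoding
    UTF-8 incrementally and carrying partial lines across boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    carry = ""  # partial line left over from the previous chunk
    for chunk in chunks:
        text = carry + decoder.decode(chunk)
        lines = text.split("\n")
        carry = lines.pop()  # last element is incomplete (or empty)
        yield from lines
    # Flush the decoder and emit any unterminated final line.
    tail = carry + decoder.decode(b"", final=True)
    if tail:
        yield tail
```

In an actual async entrypoint the same logic would sit inside an `async for` over stdin chunks; the carry/flush behavior is identical.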

Bounded patch sniffing

Deciding whether the input is a patch or generic plain text needs to happen on a prefix of the stream rather than the whole thing. A bounded sniffer (capped at a fixed byte and line budget) should consume the first portion, apply the same regex contract Hunk already uses (diff --git, a ---/+++ pair, @@), and return its decision plus the lines it has already consumed. The downstream stage will re-prepend the prefix and continue from where the sniffer left off. With this in place, false positives and false negatives will have a defined budget instead of depending on the size of the full input.
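Roughly, the shape I have in mind (budgets and regexes below are illustrative placeholders, not Hunk's real contract; the important part is returning the decision together with a resumed iterator that re-prepends the consumed prefix):

```python
import re
from itertools import chain

# Illustrative budgets; Hunk would pick its own.
MAX_SNIFF_LINES = 200
MAX_SNIFF_BYTES = 64 * 1024
PATCH_RE = re.compile(r"^(diff --git |--- |\+\+\+ |@@ )")

def sniff(lines):
    """Consume a bounded prefix of `lines`, decide patch vs. plain text,
    and return (is_patch, resumed_iterator) with the prefix re-prepended."""
    consumed, is_patch, size = [], False, 0
    for line in lines:
        consumed.append(line)
        size += len(line.encode("utf-8"))
        if PATCH_RE.match(line):
            is_patch = True
            break
        if len(consumed) >= MAX_SNIFF_LINES or size >= MAX_SNIFF_BYTES:
            break
    # chain() hands back everything we consumed, then the rest of stdin.
    return is_patch, chain(consumed, lines)
```

Because the sniffer never reads past its budget, a huge plain-text stream costs at most MAX_SNIFF_BYTES before Hunk decides to fall through to plain-text paging.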

Per-file chunker

A line-driven state machine should consume the post-sniff iterator and emit one event per file boundary. Each event must carry the verbatim chunk text that the existing diff parser can consume: same regex rules and same boundary heuristics as the legacy synchronous splitter, just driven incrementally. Per-chunk parsing must produce byte-identical results to the current "buffer everything then split" path, so inputs that already work continue to render the same way.
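A deliberately simplified sketch of the splitter (the real boundary heuristics are richer than a `diff --git` prefix check; this only shows the incremental shape, where at most one file's lines are held at a time):

```python
def chunk_files(lines):
    """Emit one verbatim chunk per file, splitting on `diff --git`
    boundaries. Simplified versus the legacy splitter's heuristics."""
    current = []
    for line in lines:
        if line.startswith("diff --git ") and current:
            yield "\n".join(current) + "\n"
            current = []  # drop the emitted file's lines immediately
        current.append(line)
    if current:
        yield "\n".join(current) + "\n"
```

The byte-identical requirement would be enforced by a golden test: feed the same input through this chunker and through the legacy "buffer then split" path and assert the concatenated chunks match exactly.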

Commit-boundary awareness

When the input is a git log -p stream, the chunker should recognize commit <sha> lines and accumulate the verbatim metadata block (commit/author/date-time headers, blank lines, indented message body, etc.) until the next file boundary. The captured text should ride along on the next file event as a single string. Carrying this over would let the renderer display them verbatim, and would let future changes to Git's log output flow through without any code change.
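One possible shape for the commit-aware variant, again as a hedged sketch (names and the sha regex are mine; this version elides commits that carry no diff, e.g. empty or merge commits, which the real implementation would need to handle):

```python
import re

COMMIT_RE = re.compile(r"^commit [0-9a-f]{7,40}\b")

def chunk_with_commits(lines):
    """Yield (commit_header, file_chunk) pairs from a `git log -p` stream.
    The verbatim commit block (headers, blank lines, indented message)
    rides along on the first file event of that commit; later files in
    the same commit carry an empty header."""
    header_lines, file_lines, pending_header = [], [], ""
    in_header = False

    def flush_file():
        nonlocal file_lines, pending_header
        if not file_lines:
            return None
        event = (pending_header, "\n".join(file_lines) + "\n")
        pending_header, file_lines = "", []
        return event

    for line in lines:
        if COMMIT_RE.match(line):
            if (event := flush_file()):
                yield event
            header_lines, in_header = [line], True
        elif line.startswith("diff --git "):
            if (event := flush_file()):
                yield event
            if in_header:
                pending_header = "\n".join(header_lines) + "\n"
                header_lines, in_header = [], False
            file_lines = [line]
        elif in_header:
            header_lines.append(line)
        else:
            file_lines.append(line)
    if (event := flush_file()):
        yield event
```

Because the header text is captured verbatim rather than parsed into fields, a change in Git's log output format flows through untouched, as the proposal intends.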

Perhaps the experience should be that piping git log -p into hunk pager gives the user a commit-by-commit diff viewer, allowing navigation backwards and forwards through the ancestry.

Expected caveats

This proposal would not fully solve the unbounded-memory problem. It should improve the constant factor and the time-to-paint, but the worst case would remain O(input).

  • What streaming would bound. Raw input would no longer be held as one giant string while parsing runs. The chunker should retain only the lines for the file currently being assembled, then drop them on emission to avoid doubling. Time-to-first-paint would no longer be tied to total stream size.
  • What streaming would not bound in a pure pager. Every parsed file (verbatim chunk text plus parsed metadata structures) would accumulate in the changeset for the session's lifetime so the user can scroll back through the history. This will continue to be the case unless we offload data somewhere it can be retrieved from later, or make the design decision to drop scroll-back history beyond a certain point. Pager input is one-shot stdin and can't be replayed.
  • Back-pressure on a pure pager. A reasonable next step would be to apply back-pressure on stdin, e.g. stop reading from the pipe when a high watermark is reached. The watermark could be a number of files, a number of commits, or a hybrid of both. If this is implemented as a commit-by-commit diff viewer/reviewer, the commit is the natural boundary. There should be no risk of files being dropped in the pager (which could cause /hunk-review to fail when commenting on files that haven't loaded yet).
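The watermark idea falls out naturally from a bounded queue between the parser and the renderer. A toy asyncio sketch (assumed structure, not Hunk's actual architecture; here the queue counts file events, but the capacity could just as well count commits):

```python
import asyncio

async def produce(events, queue):
    """Parser side: `put` blocks once the queue is full, which in a
    real pager is what stops the stdin read loop (back-pressure)."""
    for event in events:
        await queue.put(event)
    await queue.put(None)  # sentinel: end of stream

async def consume(queue, out):
    """Renderer side: draining the queue releases the back-pressure."""
    while True:
        event = await queue.get()
        if event is None:
            break
        out.append(event)

async def run_pipeline(events, watermark):
    # Queue capacity is the high watermark on in-flight parsed files.
    queue = asyncio.Queue(maxsize=watermark)
    out = []
    await asyncio.gather(produce(events, queue), consume(queue, out))
    return out
```

Nothing is ever dropped: the producer simply waits instead of discarding, so every file the user (or /hunk-review) asks about will eventually be loaded.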

Would love y'all's thoughts on this. I wouldn't be surprised if I've missed an obvious risk or consideration, as I haven't put as much thought into this in the context of agentic code review workflows. 🙇‍♂️
