
Commit 7cd3e08

fix: suppress noisy logs for tag parsing, update internals docs
1 parent 9a4beae commit 7cd3e08

2 files changed

Lines changed: 301 additions & 2 deletions

File tree

jgit-proxy-core/GIT_INTERNALS.md

Lines changed: 296 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -122,3 +122,299 @@ This produces a full-snapshot diff of the tagged commit, which is harmless for t
122122

123123
**`SecretScanningFilter`** — passes `commitFrom`/`commitTo` to `gitleaks git`.
124124
Gitleaks calls native `git log`, which peels tags natively. No special handling needed.
125+
126+
---
127+
128+
## Branches and refs
129+
130+
### What the proxy sees on the wire
131+
132+
Every `git push` sends one or more **packet lines** before the pack data.
133+
Each line has the format:
134+
135+
```
136+
<oldOid> <newOid> <refName>\0<capabilities>
137+
```
138+
139+
| Field | Meaning |
140+
|-------|---------|
141+
| `oldOid` | The SHA the client believes the ref currently points to on the remote. All-zeros (`0000…`) for a new ref (branch or tag). |
142+
| `newOid` | The SHA the client wants the ref to point to after the push. All-zeros for a **ref deletion**. |
143+
| `refName` | Full ref path: `refs/heads/main`, `refs/tags/v1.0`, etc. |
144+
145+
The null byte `\0` separates the ref triple from the capability string (e.g. `report-status side-band-64k`).
146+
Only the **first** packet line carries capabilities; subsequent lines omit the `\0…` suffix.
147+
148+
In proxy mode, `GitReceivePackParser.parsePush()` splits this line and populates `PushInfo`;
149+
in S&F mode, JGit's `ReceiveCommand` carries the same triple.
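
As an illustration of the line format described above, here is a minimal sketch that splits one packet-line payload into its ref triple and optional capability list. `RefUpdateLine` is a hypothetical class written for this example; it is not the project's `PushInfo` or `GitReceivePackParser`, and it assumes the 4-character pkt-line length prefix has already been stripped.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one ref-update line (illustration only, not the project's PushInfo). */
final class RefUpdateLine {
    final String oldOid;
    final String newOid;
    final String refName;
    final List<String> capabilities;

    private RefUpdateLine(String oldOid, String newOid, String refName, List<String> capabilities) {
        this.oldOid = oldOid;
        this.newOid = newOid;
        this.refName = refName;
        this.capabilities = capabilities;
    }

    /** Parses "<oldOid> <newOid> <refName>[\0<capabilities>]"; the length prefix is assumed to be removed. */
    static RefUpdateLine parse(String payload) {
        // A pkt-line payload may end with a single LF; drop it if present.
        String line = payload.endsWith("\n") ? payload.substring(0, payload.length() - 1) : payload;

        // Only the first line of a push carries "\0<capabilities>".
        int nul = line.indexOf('\0');
        String triple = nul >= 0 ? line.substring(0, nul) : line;
        String caps = nul >= 0 ? line.substring(nul + 1).trim() : "";

        String[] parts = triple.split(" ", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected '<old> <new> <ref>' but got: " + triple);
        }
        List<String> capList = caps.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(caps.split(" +"));
        return new RefUpdateLine(parts[0], parts[1], parts[2], capList);
    }
}
```

Calling `RefUpdateLine.parse(...)` on a later line of the same push (one without the `\0` suffix) yields an empty capability list.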
150+
151+
### Determining the push type from the packet line
152+
153+
The packet line SHAs encode what kind of ref update is happening:
154+
155+
| `oldOid` | `newOid` | `refName` | Meaning |
156+
|----------|----------|-----------|---------|
157+
| `000…0` | `abc123` | `refs/heads/feature` | **New branch** — create a branch pointing at `abc123` |
158+
| `abc123` | `def456` | `refs/heads/feature` | **Branch update** — fast-forward (or force push) from `abc123` to `def456` |
159+
| `abc123` | `000…0` | `refs/heads/feature` | **Branch deletion** — remove the ref entirely |
160+
| `000…0` | `abc123` | `refs/tags/v1.0` | **New tag** — see the "Tag objects" section above |
161+
162+
In S&F mode, JGit's `ReceiveCommand.Type` enum maps these directly: `CREATE`, `UPDATE`,
163+
`UPDATE_NONFASTFORWARD`, `DELETE`.
164+
165+
In proxy mode, `GitRequestDetails` exposes helper methods:
166+
- `isRefDeletion()`: `commitTo` is all-zeros
167+
- `isTagPush()`: `branch` starts with `refs/tags/`
168+
169+
There is no explicit `isNewBranch()` helper; filters check `commitFrom.matches("^0+$")` directly.
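
The table and helpers above can be condensed into one classification function. The sketch below is a hypothetical standalone helper (`PushClassifier` and its `Kind` enum are names invented here, not part of `GitRequestDetails`). It reuses the same `^0+$` zero check that the filters apply. It cannot tell a fast-forward from a force push, because that needs the commit graph and not just the packet line.

```java
/** Hypothetical classifier for illustration; the project exposes only isRefDeletion()/isTagPush(). */
final class PushClassifier {
    enum Kind { NEW_BRANCH, BRANCH_UPDATE, BRANCH_DELETION, NEW_TAG, TAG_UPDATE, TAG_DELETION, OTHER }

    /** Same zero-OID test the filters use on commitFrom. */
    static boolean isZero(String oid) {
        return oid != null && oid.matches("^0+$");
    }

    static Kind classify(String oldOid, String newOid, String refName) {
        boolean tag = refName.startsWith("refs/tags/");
        boolean branch = refName.startsWith("refs/heads/");
        if (!tag && !branch) {
            return Kind.OTHER; // e.g. refs/notes/*, not covered by the table above
        }
        if (isZero(newOid)) {
            return tag ? Kind.TAG_DELETION : Kind.BRANCH_DELETION;
        }
        if (isZero(oldOid)) {
            return tag ? Kind.NEW_TAG : Kind.NEW_BRANCH;
        }
        // Fast-forward vs. force push cannot be decided from the packet line alone.
        return tag ? Kind.TAG_UPDATE : Kind.BRANCH_UPDATE;
    }
}
```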
170+
171+
### New branches — what makes them tricky
172+
173+
A new-branch push (`oldOid` = zeros) doesn't tell you which commits are "new".
174+
The pack may contain many commits, but some of them may already exist on the remote
175+
under a different branch. Only the commits **not reachable from any existing ref** are
176+
genuinely new in this push.
177+
178+
Both modes solve this the same way — via `CommitInspectionService.getCommitRange()`:
179+
180+
```java
181+
// New branch path (fromId is null or zero):
182+
var logCmd = git.log().add(toId);
183+
for (Ref ref : repository.getRefDatabase().getRefsByPrefix("refs/heads/")) {
184+
if (ref.getObjectId() != null) logCmd.not(ref.getObjectId());
185+
}
186+
```
187+
188+
This walks backward from the pushed tip, excluding anything reachable from existing
189+
branch heads. The result is only the commits that are genuinely new.
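
To make the "exclude everything reachable from existing heads" idea concrete without a JGit dependency, here is a small sketch over a toy commit graph. The graph is modelled as a map from commit ID to parent IDs. `NewCommitWalk` and its method names are hypothetical, and the real implementation uses JGit's `LogCommand` as shown above.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the new-branch walk; commit IDs are plain strings. */
final class NewCommitWalk {
    /** Every commit reachable from the given starting points (inclusive), following parent links. */
    static Set<String> reachable(Map<String, List<String>> parents, Collection<String> starts) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>(starts);
        while (!stack.isEmpty()) {
            String commit = stack.pop();
            if (seen.add(commit)) {
                for (String parent : parents.getOrDefault(commit, Collections.<String>emptyList())) {
                    stack.push(parent);
                }
            }
        }
        return seen;
    }

    /** Commits reachable from the pushed tip but not from any existing branch head. */
    static Set<String> newCommits(Map<String, List<String>> parents, String tip, Collection<String> existingHeads) {
        Set<String> result = reachable(parents, Collections.singletonList(tip));
        result.removeAll(reachable(parents, existingHeads));
        return result;
    }
}
```

With history `A <- B <- C` on `main` and a new branch `B <- D <- E`, calling `newCommits(parents, "E", List.of("C"))` leaves only `D` and `E`, because `A` and `B` are already reachable from `main`.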
190+
191+
**S&F mode**: JGit's `ReceivePack` has already unpacked the objects into its own
192+
repository, so `getCommitRange()` works against that repo directly.
193+
194+
**Proxy mode**: `EnrichPushCommitsFilter` must first clone/fetch the upstream and
195+
unpack the push's pack data into the local clone (see "How proxy mode gets a repository"
196+
below), then `getCommitRange()` can walk the combined object store.
197+
198+
### Branch updates — the commit range
199+
200+
For an existing branch update (`oldOid` is a real SHA), the commit range is
201+
straightforward:
202+
203+
```
204+
git log oldOid..newOid
205+
```
206+
207+
`CommitInspectionService.getCommitRange()` uses `git.log().addRange(fromId, toId)`,
208+
which is JGit's equivalent. This returns exactly the commits introduced by this push.
209+
210+
### Force pushes (non-fast-forward)
211+
212+
A force push rewrites history. `oldOid` is no longer an ancestor of `newOid`.
213+
214+
In S&F mode, JGit classifies this as `ReceiveCommand.Type.UPDATE_NONFASTFORWARD`.
215+
`ForwardingPostReceiveHook.buildRefUpdates()` sets `force=true` for these so the
216+
upstream accepts the rewrite.
217+
218+
In proxy mode, the request is forwarded as-is — the upstream git server decides
219+
whether to accept the force push based on its own configuration. The proxy's filter
220+
chain still runs validation on the new commits, but `getCommitRange()` may behave
221+
unexpectedly: `addRange(oldOid, newOid)` only returns commits reachable from `newOid`
222+
but not `oldOid`. If the branches diverged, commits on the old branch that were
223+
dropped are **not** included — the range shows only what was added, not what was removed.
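
The asymmetry of the range on a force push can be seen on a toy graph. The sketch below uses a hypothetical `ForcePushRange` helper (not project code) that models `from..to` as a set difference of ancestor sets and checks whether an update is a fast-forward. Computing the reverse range `newOid..oldOid` is one way to list the dropped commits, which the proxy does not do according to the text above.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of range semantics on a rewritten branch; commit IDs are plain strings. */
final class ForcePushRange {
    /** The tip plus all of its ancestors. */
    static Set<String> ancestors(Map<String, List<String>> parents, String tip) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(tip);
        while (!stack.isEmpty()) {
            String commit = stack.pop();
            if (seen.add(commit)) {
                for (String parent : parents.getOrDefault(commit, Collections.<String>emptyList())) {
                    stack.push(parent);
                }
            }
        }
        return seen;
    }

    /** An update is a fast-forward when the old tip is an ancestor of the new tip. */
    static boolean isFastForward(Map<String, List<String>> parents, String oldOid, String newOid) {
        return ancestors(parents, newOid).contains(oldOid);
    }

    /** Models "from..to": commits reachable from 'to' but not from 'from'. */
    static Set<String> range(Map<String, List<String>> parents, String from, String to) {
        Set<String> result = ancestors(parents, to);
        result.removeAll(ancestors(parents, from));
        return result;
    }
}
```

For a branch rewritten from `A <- B <- C` to `A <- B <- D`, `range(parents, "C", "D")` contains only `D`, while `range(parents, "D", "C")` contains the dropped commit `C`.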
224+
225+
### Ref deletions
226+
227+
When `newOid` is all-zeros, the client is deleting a ref. There are no objects in the
228+
pack and no commits to validate.
229+
230+
**S&F mode**: `ReceiveCommand.Type.DELETE`. Hooks that iterate commands skip `DELETE`
231+
types explicitly (e.g. `CheckEmptyBranchHook`, `CheckHiddenCommitsHook`, `DiffGenerationHook`).
232+
`ForwardingPostReceiveHook` handles deletion by creating a `RemoteRefUpdate` with
233+
a null source ref — JGit translates this to a delete on the upstream.
234+
235+
**Proxy mode**: `GitReceivePackParser.parsePush()` checks `newCommit.equals(ZERO_OID)`
236+
and skips pack parsing entirely (there's nothing to parse). `GitRequestDetails` will
237+
have `commitTo` = zeros, `commit` = null, `pushedCommits` = empty.
238+
`isRefDeletion()` returns true, and filters should check this early and skip.
239+
240+
---
241+
242+
## How the proxy gets commit data
243+
244+
The two proxy modes obtain commit metadata very differently.
245+
246+
### S&F mode: JGit ReceivePack
247+
248+
JGit's `ReceivePack` handles the entire git protocol server-side. When the client
249+
pushes, JGit:
250+
251+
1. Receives the pack data and unpacks objects into the local repository
252+
2. Creates `ReceiveCommand` entries for each ref update
253+
3. Calls the pre-receive hook chain with access to the full `Repository`
254+
255+
Hooks can call any JGit API — `RevWalk`, `DiffFormatter`, `git.log()` — because
256+
the objects are already in the local object store. No special setup required.
257+
258+
The repository is a bare repo managed by the S&F servlet, one per provider+repo
259+
combination.
260+
261+
### Proxy mode: clone + unpack
262+
263+
Proxy-mode filters run as servlet filters on an HTTP request. They don't have a
264+
local repository by default — the request is just bytes on the wire being forwarded
265+
to the upstream.
266+
267+
`EnrichPushCommitsFilter` bridges this gap:
268+
269+
1. **Clone/fetch**: `LocalRepositoryCache.getOrClone(remoteUrl)` maintains a bare
270+
clone of each upstream repository. First push triggers a `git clone --bare --depth 100`;
271+
subsequent pushes do `git fetch --depth 100`. The cache is keyed by
272+
`owner_reponame` (derived from the URL).
273+
274+
2. **Unpack push data**: The push's pack data (from the HTTP request body) is fed
275+
into JGit's `PackParser`, which inserts the objects into the local clone's object
276+
store. This is the equivalent of what `ReceivePack` does internally in S&F mode.
277+
278+
3. **Walk commits**: With objects now in the local clone, `CommitInspectionService`
279+
can walk the commit range, generate diffs, etc.
280+
281+
The local clone is published on `GitRequestDetails.localRepository` so all downstream
282+
filters can use it.
283+
284+
#### Shallow clone implications
285+
286+
The default clone depth is 100 commits. This means:
287+
288+
- `getCommitRange()` for a new branch will only walk back 100 commits. Commits beyond
289+
that depth are not in the local clone and won't appear in the range.
290+
- `getDiff()` for a new branch uses `findNewBranchBase()` to diff against the parent
291+
of the oldest new commit. If the oldest new commit's parent is beyond the shallow
292+
boundary, `resolve(parentSha + "^{tree}")` returns null and the diff falls back to
293+
the empty tree (full-snapshot diff).
294+
- Secret scanning via gitleaks is passed `commitFrom..commitTo` and runs `git log`
295+
natively — it respects the shallow boundary silently.
296+
297+
For most pushes this is fine. A push with more than 100 new commits on a new branch
298+
is unusual, and the shallow clone can be deepened via configuration (`cloneDepth`).
299+
300+
---
301+
302+
## Diff generation
303+
304+
### Where diffs are generated
305+
306+
Diffs are generated in both modes but through different code paths:
307+
308+
| Mode | Component | When | What |
309+
|------|-----------|------|------|
310+
| S&F | `DiffGenerationHook` (order 280) | Pre-receive, after validation hooks pass | Push diff + optional default-branch diff |
311+
| Proxy | `ScanDiffFilter` (order 300) | In the filter chain, after `EnrichPushCommitsFilter` | Push diff only |
312+
313+
Both ultimately call `CommitInspectionService.getFormattedDiff(repo, fromCommit, toCommit)`.
314+
315+
### How diffs are computed
316+
317+
`CommitInspectionService.getDiff()` resolves both sides to tree objects, then runs
318+
JGit's `DiffFormatter`:
319+
320+
```java
321+
ObjectId oldId = isNullCommit(fromCommit)
322+
? findNewBranchBase(repository, toCommit) // new branch: diff against the first parent of the oldest new commit
323+
: repository.resolve(fromCommit + "^{tree}"); // existing branch: diff against old tip
324+
ObjectId newId = repository.resolve(toCommit + "^{tree}");
325+
```
326+
327+
The `^{tree}` peel works for both commits and annotated tags — it follows the chain
328+
down to the commit, then to its tree.
329+
330+
### New branch diff base (`findNewBranchBase`)
331+
332+
For a new-branch push, diffing against the empty tree would show the entire repo
333+
snapshot, which is useless for review and would trigger false-positive secret scan findings
334+
on existing files.
335+
336+
Instead, `findNewBranchBase()` finds the oldest new commit (same "exclude existing refs"
337+
walk as `getCommitRange()`), then returns the **tree of that commit's first parent**.
338+
This means the diff shows only the changes introduced by the new commits, not the
339+
entire history they're built on.
340+
341+
If the oldest new commit is a root commit (no parent), the base is null, and the
342+
diff does fall back to the empty tree — but this only happens for genuinely new
343+
repositories.
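
The base selection described in this section reduces to "take the oldest new commit and return its first parent, or nothing for a root commit". The sketch below shows that step only. It is a simplification: `NewBranchBase` is a hypothetical name, the new commits are assumed to arrive newest-first (as a log walk typically returns them), and it returns a parent commit ID where the real `findNewBranchBase()` returns that parent's tree.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustration of the base-selection rule only; not the project's findNewBranchBase(). */
final class NewBranchBase {
    /**
     * @param newCommitsNewestFirst commits that are new in this push, newest first (assumed ordering)
     * @param parents               commit ID to parent IDs, first parent at index 0
     * @return first parent of the oldest new commit, or null if that commit is a root commit
     */
    static String find(List<String> newCommitsNewestFirst, Map<String, List<String>> parents) {
        if (newCommitsNewestFirst.isEmpty()) {
            return null; // nothing new was pushed
        }
        String oldest = newCommitsNewestFirst.get(newCommitsNewestFirst.size() - 1);
        List<String> oldestParents = parents.getOrDefault(oldest, Collections.<String>emptyList());
        // A null result means the caller falls back to the empty tree (full-snapshot diff).
        return oldestParents.isEmpty() ? null : oldestParents.get(0);
    }
}
```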
344+
345+
### Default-branch diff (S&F only)
346+
347+
`DiffGenerationHook` generates a second diff when pushing to a non-default branch:
348+
the total diff of `defaultBranch..commitTo`. This helps reviewers see the full scope
349+
of a feature branch without having to check it out.
350+
351+
The default branch is resolved from `HEAD` (which in a bare clone is a symbolic ref
352+
to the remote's default branch), falling back to `refs/heads/main` or `refs/heads/master`.
353+
354+
This diff is stored as a separate `PushStep` with step name `diff:default-branch` and
355+
tagged as `type: auto:default-branch` so the dashboard UI can label it appropriately.
356+
357+
### Hidden commits detection
358+
359+
The "hidden commits" check exists in both modes (`CheckHiddenCommitsHook` / `CheckHiddenCommitsFilter`)
360+
and catches a subtle attack vector: a developer could create a branch from unapproved
361+
commits that haven't been pushed yet. Git's pack protocol bundles all objects needed
362+
by the receiving side, including ancestor commits that the remote doesn't have.
363+
364+
The algorithm is:
365+
366+
1. **introduced** = commits from `getCommitRange(oldId, newId)` — the explicit push range
367+
2. **allNew** = `RevWalk` from `newId`, marking all existing refs as uninteresting
368+
3. **hidden** = `allNew` minus `introduced`
369+
370+
If hidden is non-empty, the push is rejected. The developer needs to get the hidden
371+
commits approved and pushed first, then retry.
372+
373+
---
374+
375+
## Pack data parsing
376+
377+
### What `GitReceivePackParser` does (proxy mode only)
378+
379+
In proxy mode, `ParseGitRequestFilter` needs to extract commit metadata from the raw
380+
HTTP request body before JGit ever touches it. The request body contains:
381+
382+
1. Packet lines (ref updates + capabilities)
383+
2. A flush packet (`0000`)
384+
3. Pack data (the `PACK` signature followed by pack objects)
385+
386+
`GitReceivePackParser.parsePush()` reads the packet line via JGit's `PacketLineIn`,
387+
then parses the first object from the pack data manually:
388+
389+
- Scans for the `PACK` signature (4 bytes: `P`, `A`, `C`, `K`)
390+
- Skips the 12-byte pack header (signature + version + object count)
391+
- Reads the first pack entry's type+size header (variable-length encoding)
392+
- Inflates the zlib-compressed object data
393+
- If the type is `OBJ_COMMIT` (1), parses the raw commit content for author, committer,
394+
parent, message, and GPG signature
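
The header and first-entry steps listed above can be sketched with the JDK alone. This is an illustrative reader, not the project's `parsePackData()`: `FirstPackObject` is a hypothetical class, it inflates only when the first entry is a commit, and it does no bounds checking beyond the header, so a truncated body would surface as an exception.

```java
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Best-effort reader for the first pack entry (illustration only). */
final class FirstPackObject {
    static final int OBJ_COMMIT = 1;

    final int version;
    final long objectCount;
    final int type;
    final long size;
    final byte[] content; // inflated bytes for commits, null for any other type

    private FirstPackObject(int version, long objectCount, int type, long size, byte[] content) {
        this.version = version;
        this.objectCount = objectCount;
        this.type = type;
        this.size = size;
        this.content = content;
    }

    static FirstPackObject read(byte[] body) {
        int start = indexOfPack(body);
        if (start < 0 || body.length < start + 13) {
            throw new IllegalArgumentException("no PACK header found");
        }
        // 12-byte header: "PACK", 4-byte version, 4-byte object count (both big-endian).
        ByteBuffer header = ByteBuffer.wrap(body, start + 4, 8);
        int version = header.getInt();
        long count = header.getInt() & 0xFFFFFFFFL;

        // Entry header: bit 7 = continuation, bits 6-4 = type, bits 3-0 = low size bits;
        // each following byte contributes 7 more size bits.
        int pos = start + 12;
        int b = body[pos++] & 0xFF;
        int type = (b >> 4) & 0x07;
        long size = b & 0x0F;
        int shift = 4;
        while ((b & 0x80) != 0) {
            b = body[pos++] & 0xFF;
            size |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        if (type != OBJ_COMMIT) {
            return new FirstPackObject(version, count, type, size, null); // tags, deltas, etc. are skipped
        }

        byte[] out = new byte[(int) size];
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(body, pos, body.length - pos);
            int n = 0;
            while (n < out.length) {
                int r = inflater.inflate(out, n, out.length - n);
                if (r == 0) {
                    break; // stream ended or ran out of input
                }
                n += r;
            }
            if (n != out.length) {
                throw new IllegalArgumentException("truncated object data");
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid zlib data", e);
        } finally {
            inflater.end();
        }
        return new FirstPackObject(version, count, type, size, out);
    }

    /** Index of the first "PACK" signature, or -1 if absent. */
    private static int indexOfPack(byte[] body) {
        for (int i = 0; i + 4 <= body.length; i++) {
            if (body[i] == 'P' && body[i + 1] == 'A' && body[i + 2] == 'C' && body[i + 3] == 'K') {
                return i;
            }
        }
        return -1;
    }
}
```

The inflated `content` of a commit is the raw commit text (`tree …`, `parent …`, `author …`, a blank line, then the message), which is what the header-field parsing in the last bullet operates on.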
395+
396+
This is a **best-effort parse of the first object only**. It handles the common case
397+
(a commit push where the tip commit is the first pack entry) but intentionally does
398+
not handle:
399+
- Delta objects (`OBJ_OFS_DELTA`, `OBJ_REF_DELTA`) — logged as a warning
400+
- Tag objects (`OBJ_TAG`, type 4) — throws "No commit object found"
401+
- Packs where the commit is not the first entry
402+
- Empty packs (lightweight tag pointing to an existing commit)
403+
404+
These failures are caught by the `try/catch` in `parsePush()`, and
405+
`PushInfo.commit` is left null. `EnrichPushCommitsFilter` downstream recovers
406+
full commit data from the local clone anyway — the pack-parsed commit is just an
407+
early-availability optimization for `ParseGitRequestFilter`.
408+
409+
### Why the pack parser exists alongside `EnrichPushCommitsFilter`
410+
411+
`ParseGitRequestFilter` runs at order `MIN_VALUE + 1` — it's the first filter.
412+
It needs to populate `GitRequestDetails` before any other filter runs.
413+
`EnrichPushCommitsFilter` runs at `MIN_VALUE + 2` — immediately after — but requires
414+
a network clone/fetch which may fail.
415+
416+
The pack parser gives `ParseGitRequestFilter` a synchronous, no-network way to
417+
extract the head commit's metadata. If it succeeds, `requestDetails.commit` is
418+
available immediately. If it fails (tag push, delta-only pack, etc.), the commit
419+
is null and filters that need it wait for `EnrichPushCommitsFilter` to populate
420+
`pushedCommits` from the local clone.

jgit-proxy-core/src/main/java/org/finos/gitproxy/git/GitReceivePackParser.java

Lines changed: 5 additions & 2 deletions
@@ -38,9 +38,12 @@ public static PushInfo parsePush(String packetLine, byte[] packData) throws IOEx
3838
String newCommit = parts[1];
3939
String reference = parts[2].replace("\u0000", "").trim();
4040

41-
// For branch deletion (newCommit is all zeros), there's no pack data
41+
// Skip pack parsing for deletions (no objects) and tag pushes (first object is a
42+
// tag object or the pack is empty — neither produces a usable Commit; EnrichPushCommitsFilter
43+
// recovers full commit data from the local clone).
4244
Commit commit = null;
43-
if (!newCommit.equals(ZERO_OID) && packData != null && packData.length > 0) {
45+
boolean isTag = reference.startsWith("refs/tags/");
46+
if (!isTag && !newCommit.equals(ZERO_OID) && packData != null && packData.length > 0) {
4447
// Parse the commit content from pack data
4548
try {
4649
commit = parsePackData(packData);
