Skip to content

backfill: accept revision arguments#2070

Open
derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
derrickstolee:backfill-revs
Open

backfill: accept revision arguments#2070
derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
derrickstolee:backfill-revs

Conversation

@derrickstolee
Copy link

@derrickstolee derrickstolee commented Mar 16, 2026

The git backfill command assists in downloading missing blobs for blobless partial clones. However, its current version lacks some valuable functionality. It currently:

  1. Only walks commits reachable from HEAD.
  2. It walks all reachable commits to the full history.
  3. It can focus on the current sparse-checkout definition, but otherwise it doesn't focus on a given pathspec.

All of these are being updated by this patch series, which allows rev-list options to impact the path-walk. These include:

  1. Specifying a given refspec, including --all.
  2. Modifying the commit walk, including --first-parent, commit ranges, or recency using --since.
  3. Modifying the set of paths to download using pathspecs.

One particularly valuable situation here is that now a user can run git backfill -- <path> to download all versions of a specific file or a specific directory, accelerating history queries within that path without downloading more than necessary. This can accelerate git blame or git log -L for these paths, where normally those commands download missing blobs one-by-one during its diff algorithms.

This patch series is organized in the following way:

  1. A missing #include is added to prevent future compilation issues.
  2. The test repo in t5620 is expanded to make later tests more interesting.
  3. The backfill builtin parses the rev-list arguments. We test the top arguments that work as expected, though the pathspec arguments need extra work.
  4. Update the path-walk logic to work efficiently with some pathspecs, such as fixed prefix pathspecs, accelerating the computation.
  5. For more complicated pathspecs, do a post-filter in builtin/backfill.c instead of restricting the walk in the path-walk API.

The main goal of this series is to make such customizations possible, and to improve performance where common use cases are expected. I'm open to feedback as to whether we should consider more detailed performance analysis or whether we should wait for how users interact with these new options before overoptimizing unlikely use cases.

Updates in v2

  • Hard stops are replaced with a comma (and no punctuation) in the docs.
  • add_head_to_pending() simplifies some code.
  • My poor explanation of "starting commits" is updated.
  • Language around temporary prefix restriction is clarified.
  • Prefix match logic is simplified with dir_prefix().
  • Temporary memory leak (introduced in v1's patch 4 and removed in v1's patch 5) is removed in v2's patch 4.
  • Commit pruning is reenabled in v2's patch 5. There was no need for that with the way the logic works in the patch.
  • Add a new patch with a test demonstrating the new behavior that was being discussed in [1].

[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/

Thanks,
-Stolee

cc: gitster@pobox.com
cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: r.siddharth.shrimali@gmail.com
cc: ps@pks.im

The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
defined in object-name.h. Include it in revision.h so consumers of
REV_INFO_INIT do not need to include this hidden dependency.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Prepare the test infrastructure for upcoming changes that teach 'git
backfill' to accept revision arguments and pathspecs.

Add test_tick before each commit in the setup loop so that commit dates
are deterministic. This enables reliable testing with '--since'.

Rename the 'd/e/' directory to 'd/f/' so that the prefix 'd/f' is
ambiguous with the files 'd/file.*.txt'. This exercises the subtlety
in prefix pathspec matching that will be added in a later commit.

Create a branched version of the test repository (src-revs) with:
 - A 'side' branch merged into main, adding s/file.{1,2}.txt with
   two versions (4 new blobs, 52 total from main HEAD).
 - An unmerged 'other' branch adding o/file.{1,2}.txt (2 more blobs,
   54 total reachable from --all).

This structure makes --all, --first-parent, and --since produce
meaningfully different results when used with 'git backfill'.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2026

Submitted as pull.2070.git.1773707361.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2070/derrickstolee/backfill-revs-v1

To fetch this version to local tag pr-2070/derrickstolee/backfill-revs-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2070/derrickstolee/backfill-revs-v1

@jayesh0104
Copy link

/allow

@jayesh0104
Copy link

I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2026

Error: User jayesh0104 is not yet permitted to use GitGitGadget

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2026

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
>  1. Only walks commits reachable from HEAD.
>  2. It walks all reachable commits to the full history.
>  3. It can focus on the current sparse-checkout definition, but otherwise it
>     doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
>  1. Specifying a given refspec, including --all.

Makes sense.  You can only be on a single branch at a time, but may
want to work on multiple topics in reasonably quick succession in a
single repository.  Being able to prepare enough material to go back
to when working on whichever topic in a single backfill invocation
would be a welcome addition.

>  2. Modifying the commit walk, including --first-parent, commit ranges, or
>     recency using --since.
>  3. Modifying the set of paths to download using pathspecs.

Both are good mechanisms to express which subset of history you will
be working on.

> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.

Yup.  Even if your project is a huge monorepo that contains all, you
do not necessarily have to look at everything the organization has
all the time.  "git blame -C -C -C" would of course not work in such
an environment (would it end up on-demand lazy fetch these blobs, or
are there ways to say "I know the object store of my repository is
only sparsely populated, and I do not want you to on-demand download
the missing blobs---do your best to work with only what is already
available?), but that's a tradeoff a monorepo makes.

@@ -4,6 +4,7 @@
#include "commit.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
> defined in object-name.h. Include it in revision.h so consumers of
> REV_INFO_INIT do not need to include this hidden dependency.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  revision.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/revision.h b/revision.h
> index b36acfc2d9..18c9bbd822 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
>  #include "commit.h"
>  #include "grep.h"
>  #include "notes.h"
> +#include "object-name.h"
>  #include "oidset.h"
>  #include "pretty.h"
>  #include "diff.h"

OK.  Other symbols REV_INFO_INIT needs are REV_SORT_IN_GRAPH_ORDER
(in <commit.h>), CMIT_FMT_DEFAULT (in <pretty.h>), and STRVEC_INIT
(in <strvec.h>), and all three are already included there.

Makes sense.

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -	repo_init_revisions(ctx->repo, &revs, "");
> -	handle_revision_arg("HEAD", &revs, 0, 0);

So we used to "cheat" and did an initialization without even knowing
in which directory we were started ...

> +	/* Walk from HEAD if otherwise unspecified. */
> +	if (!ctx->revs.pending.nr)
> +		handle_revision_arg("HEAD", &ctx->revs, 0, 0);

... but by initializing the revs correctly in the caller, we would
be correcting it.  Looking good.

> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  					 builtin_backfill_usage, options);
>  
>  	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> -			     0);
> +			     PARSE_OPT_KEEP_UNKNOWN_OPT |
> +			     PARSE_OPT_KEEP_ARGV0 |
> +			     PARSE_OPT_KEEP_DASHDASH);
> +
> +	repo_init_revisions(repo, &ctx.revs, prefix);
> +	argc = setup_revisions(argc, argv, &ctx.revs, NULL);

OK.

@@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx,
match != MATCHED)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> The previous change allowed specifying revision arguments over the 'git
> backfill' command-line. This created the opportunity for pathspecs that
> specify a smaller set of starting commits, but otherwise did not restrict
> the blob paths that were downloaded.

"pathspecs that specify a smaller set of starting commits" is
puzzling, as starting commits would be coming from the revision
arguments.  "opportunity for pathspec to further filter commits
to those that touch only the matching paths...", or something?

> Update the path-walk API to accept certain kinds of pathspecs and to
> silently ignore anything too complex.

Hmph, "silently ignore", instead of "no, you cannot use that! and
die", or at least "sorry, I cannot do that, so the result may not be
what you wanted, you've been warned"?

> The current behavior focuses on
> pathspecs that match paths exactly. This includes exact filenames,
> including directory names as prefixes. Pathspecs containing wildcards
> or magic are cleared so the path walk downloads all blobs, as before.

Ah, "we punt and lift the limitation to grab everything, so at least
everything you wanted to have will become available to you, even
though we may download more than what you asked"?  OK, users would
survive that, and as we improve the pathspec support, the user
experience would only improve.  OK.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The previous change allowed specifying revision arguments over the 'git
>> backfill' command-line. This created the opportunity for pathspecs that
>> specify a smaller set of starting commits, but otherwise did not restrict
>> the blob paths that were downloaded.
> 
> "pathspecs that specify a smaller set of starting commits" is
> puzzling, as starting commits would be coming from the revision
> arguments.  "opportunity for pathspec to further filter commits
> to those that touch only the matching paths...", or something?

You're right. I'm using "starting commits" incorrectly. My view was too
focused on how the path-walk API starts from the commits output by the
revision walk to get a list of root trees and then walks by path from
that point.

I'll reword to make this more clear.

>> Update the path-walk API to accept certain kinds of pathspecs and to
>> silently ignore anything too complex.
> 
> Hmph, "silently ignore", instead of "no, you cannot use that! and
> die", or at least "sorry, I cannot do that, so the result may not be
> what you wanted, you've been warned"?

The behavior when silently ignoring is to over-download. The revision
walk still filters commits, but the path-walk then walks paths beyond
that pathspec. This will be fixed in the next commit, so adding an
error case didn't seem worth it. I'll do a better job foreshadowing.

>> The current behavior focuses on
>> pathspecs that match paths exactly. This includes exact filenames,
>> including directory names as prefixes. Pathspecs containing wildcards
>> or magic are cleared so the path walk downloads all blobs, as before.
> 
> Ah, "we punt and lift the limitation to grab everything, so at least
> everything you wanted to have will become available to you, even
> though we may download more than what you asked"?  OK, users would
> survive that, and as we improve the pathspec support, the user
> experience would only improve.  OK.

Exactly. I can word things better.

Thanks,
-Stolee

@@ -62,6 +62,8 @@ struct path_walk_context {
*/
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <stolee@gmail.com>
>
> Previously, walk_objects_by_path() silently ignored pathspecs containing
> wildcards or magic by clearing them. This caused all blobs to be
> downloaded regardless of the given pathspec. Wildcard pathspecs like
> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
> during 'git backfill').
>
> Support wildcard pathspecs by making three changes:
>
>  1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>     pathspec has no wildcards or magic, set this flag and use the
>     existing fast-path prefix matching in add_tree_entries(). When
>     wildcards are present, skip that block since prefix matching
>     cannot handle glob patterns.
>
>  2. Disable revision-level commit pruning (revs->prune = 0) for
>     wildcard pathspecs. The revision walk uses the pathspec to filter
>     commits via TREESAME detection. For exact prefix pathspecs this
>     works well, but wildcard pathspecs may fail to match through
>     TREESAME because fnmatch with WM_PATHNAME does not cross directory
>     boundaries. Disabling pruning ensures all commits are visited and
>     their trees are available for the path-walk to filter.

Hmph, I wonder how significant an impact does it have on the
performance that we have to disable pruning here.  With the bog
standard tree traversal, wouldn't tree_entry_interesting() already
be capable of doing this, even with fnmatch / WM_PATHNAME ?

>  3. Add a match_pathspec() check in walk_path() to filter out blobs
>     whose full path does not match the pathspec. This provides the
>     actual blob-level filtering for wildcard pathspecs.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The latter person cannot sign DCO or vouch for the origin of what
they have written in this patch, can they?

> ---
>  path-walk.c         | 22 ++++++++++++++--------
>  t/t5620-backfill.sh |  7 +++----
>  2 files changed, 17 insertions(+), 12 deletions(-)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 3/17/2026 6:19 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>> wildcards or magic by clearing them. This caused all blobs to be
>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>> during 'git backfill').
>>
>> Support wildcard pathspecs by making three changes:
>>
>>  1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>>     pathspec has no wildcards or magic, set this flag and use the
>>     existing fast-path prefix matching in add_tree_entries(). When
>>     wildcards are present, skip that block since prefix matching
>>     cannot handle glob patterns.
>>
>>  2. Disable revision-level commit pruning (revs->prune = 0) for
>>     wildcard pathspecs. The revision walk uses the pathspec to filter
>>     commits via TREESAME detection. For exact prefix pathspecs this
>>     works well, but wildcard pathspecs may fail to match through
>>     TREESAME because fnmatch with WM_PATHNAME does not cross directory
>>     boundaries. Disabling pruning ensures all commits are visited and
>>     their trees are available for the path-walk to filter.
> 
> Hmph, I wonder how significant an impact does it have on the
> performance that we have to disable pruning here.  With the bog
> standard tree traversal, wouldn't tree_entry_interesting() already
> be capable of doing this, even with fnmatch / WM_PATHNAME ?

I will explore what's possible here and see what I can do.

>>  3. Add a match_pathspec() check in walk_path() to filter out blobs
>>     whose full path does not match the pathspec. This provides the
>>     actual blob-level filtering for wildcard pathspecs.
>>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> 
> The latter person cannot sign DCO or vouch for the origin of what
> they have written in this patch, can they?
No they cannot. Sorry for this error.

Thanks,
-Stolee

@@ -63,9 +63,12 @@ OPTIONS
current sparse-checkout. If the sparse-checkout feature is enabled,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Kristoffer Haugsbakk" wrote on the Git mailing list (how to reply to this email):

On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>[snip]
> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
> index b8394dcf22..fdfe22d623 100644
> --- a/Documentation/git-backfill.adoc
> +++ b/Documentation/git-backfill.adoc
> @@ -63,9 +63,12 @@ OPTIONS
>  	current sparse-checkout. If the sparse-checkout feature is enabled,
>  	then `--sparse` is assumed and can be disabled with `--no-sparse`.
>
> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
> +
>  SEE ALSO
>  --------
>  linkgit:git-clone[1].
> +linkgit:git-rev-list[1].

Should there be a comma between these two?

>[snip]

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 18, 2026

User "Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 18, 2026

This branch is now known as ds/backfill-revs.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 18, 2026

This patch series was integrated into seen via git@b0faae0.

@gitgitgadget gitgitgadget bot added the seen label Mar 18, 2026
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 19, 2026

This patch series was integrated into seen via git@2e7d77b.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2026

This patch series was integrated into seen via git@03fbc16.

The existing implementation of 'git backfill' only includes downloading
missing blobs reachable from HEAD. Advanced uses may desire more general
commit limiting options, such as '--all' for all references, specifying a
commit range via negative references, or specifying a recency of use such as
with '--since=<date>'.

All of these options are available if we use setup_revisions() to parse the
unknown arguments with the revision machinery. This opens up a large number
of possibilities, only a small set of which are tested here.

For documentation, we avoid duplicating the option documentation and instead
link to the documentation of 'git rev-list'.

Note that these arguments currently allow specifying a pathspec, which
modifies the commit history checks but does not limit the paths used in the
backfill logic. This will be updated in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The previous change allowed specifying revision arguments over the 'git
backfill' command-line. This created the opportunity for restricting the
initial commit set by filtering the revision walk through a pathspec. Other
than filtering the commit set (and thereby the root trees), this did not
restrict the path-walk implementation of 'git backfill' and did not restrict
the blobs that were downloaded to only those matching the pathspec.

Update the path-walk API to accept certain kinds of pathspecs and to
silently ignore anything too complex, for now. We will update this in the
next change to properly restrict to even complex pathspecs.

The current behavior focuses on pathspecs that match paths exactly. This
includes exact filenames, including directory names as prefixes. Pathspecs
containing wildcards or magic are cleared so the path walk downloads all
blobs, as before.

The reason for this restriction is to allow for a faster execution by
pruning the path walk to only trees that could contribute towards one of
those paths as a parent directory.

The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a
previous commit to exercise the subtlety in prefix matching.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, walk_objects_by_path() silently ignored pathspecs containing
wildcards or magic by clearing them. This caused all blobs to be
downloaded regardless of the given pathspec. Wildcard pathspecs like
"d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
during 'git backfill').

Support wildcard pathspecs by making two changes:

 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
    pathspec has no wildcards or magic, set this flag and use the
    existing fast-path prefix matching in add_tree_entries(). When
    wildcards are present, skip that block since prefix matching
    cannot handle glob patterns.

 2. Add a match_pathspec() check in walk_path() to filter out blobs
    whose full path does not match the pathspec. This provides the
    actual blob-level filtering for wildcard pathspecs.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Before the recent changes to parse rev-list arguments inside of 'git
backfill', the builtin would take arbitrary arguments without complaint (and
ignore them). This was noticed and a patch was sent [1] which motivates this
change to encode this behavior in test.

[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/

Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 23, 2026

Submitted as pull.2070.v2.git.1774266019.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2070/derrickstolee/backfill-revs-v2

To fetch this version to local tag pr-2070/derrickstolee/backfill-revs-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2070/derrickstolee/backfill-revs-v2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants