backfill: accept revision arguments#2070
backfill: accept revision arguments#2070derrickstolee wants to merge 6 commits intogitgitgadget:masterfrom
Conversation
The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is defined in object-name.h. Include it in revision.h so consumers of REV_INFO_INIT do not need to include this hidden dependency. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Prepare the test infrastructure for upcoming changes that teach 'git
backfill' to accept revision arguments and pathspecs.
Add test_tick before each commit in the setup loop so that commit dates
are deterministic. This enables reliable testing with '--since'.
Rename the 'd/e/' directory to 'd/f/' so that the prefix 'd/f' is
ambiguous with the files 'd/file.*.txt'. This exercises the subtlety
in prefix pathspec matching that will be added in a later commit.
Create a branched version of the test repository (src-revs) with:
- A 'side' branch merged into main, adding s/file.{1,2}.txt with
two versions (4 new blobs, 52 total from main HEAD).
- An unmerged 'other' branch adding o/file.{1,2}.txt (2 more blobs,
54 total reachable from --all).
This structure makes --all, --first-parent, and --since produce
meaningfully different results when used with 'git backfill'.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
6bb9bb6 to
beb1c92
Compare
|
/submit |
|
Submitted as pull.2070.git.1773707361.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
/allow |
|
I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345 |
|
Error: User jayesh0104 is not yet permitted to use GitGitGadget |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> The git backfill command assists in downloading missing blobs for blobless
> partial clones. However, its current version lacks some valuable
> functionality. It currently:
>
> 1. Only walks commits reachable from HEAD.
> 2. It walks all reachable commits to the full history.
> 3. It can focus on the current sparse-checkout definition, but otherwise it
> doesn't focus on a given pathspec.
>
> All of these are being updated by this patch series, which allows rev-list
> options to impact the path-walk. These include:
>
> 1. Specifying a given refspec, including --all.
Makes sense. You can only be on a single branch at a time, but may
want to work on multiple topics in reasonably quick succession in a
single repository. Being able to prepare enough material to go back
to when working on whichever topic in a single backfill invocation
would be a welcome addition.
> 2. Modifying the commit walk, including --first-parent, commit ranges, or
> recency using --since.
> 3. Modifying the set of paths to download using pathspecs.
Both are good mechanisms to express which subset of history you will
be working on.
> One particularly valuable situation here is that now a user can run git
> backfill -- <path> to download all versions of a specific file or a specific
> directory, accelerating history queries within that path without downloading
> more than necessary. This can accelerate git blame or git log -L for these
> paths, where normally those commands download missing blobs one-by-one
> during its diff algorithms.
Yup. Even if your project is a huge monorepo that contains all, you
do not necessarily have to look at everything the organization has
all the time. "git blame -C -C -C" would of course not work in such
an environment (would it end up on-demand lazy fetch these blobs, or
are there ways to say "I know the object store of my repository is
only sparsely populated, and I do not want you to on-demand download
the missing blobs---do your best to work with only what is already
available?), but that's a tradeoff a monorepo makes.
|
| @@ -4,6 +4,7 @@ | |||
| #include "commit.h" | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The REV_INFO_INIT macro includes a use of the DEFAULT_ABBREV macro, which is
> defined in object-name.h. Include it in revision.h so consumers of
> REV_INFO_INIT do not need to include this hidden dependency.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/revision.h b/revision.h
> index b36acfc2d9..18c9bbd822 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -4,6 +4,7 @@
> #include "commit.h"
> #include "grep.h"
> #include "notes.h"
> +#include "object-name.h"
> #include "oidset.h"
> #include "pretty.h"
> #include "diff.h"
OK. Other symbols REV_INFO_INIT needs are REV_SORT_IN_GRAPH_ORDER
(in <commit.h>), CMIT_FMT_DEFAULT (in <pretty.h>), and STRVEC_INIT
(in <strvec.h>), and all three are already included there.
Makes sense.| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> - repo_init_revisions(ctx->repo, &revs, "");
> - handle_revision_arg("HEAD", &revs, 0, 0);
So we used to "cheat" and did an initialization without even knowing
in which directory we were started ...
> + /* Walk from HEAD if otherwise unspecified. */
> + if (!ctx->revs.pending.nr)
> + handle_revision_arg("HEAD", &ctx->revs, 0, 0);
... but by initializing the revs correctly in the caller, we would
be correcting it. Looking good.
> @@ -134,7 +135,12 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
> builtin_backfill_usage, options);
>
> argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
> - 0);
> + PARSE_OPT_KEEP_UNKNOWN_OPT |
> + PARSE_OPT_KEEP_ARGV0 |
> + PARSE_OPT_KEEP_DASHDASH);
> +
> + repo_init_revisions(repo, &ctx.revs, prefix);
> + argc = setup_revisions(argc, argv, &ctx.revs, NULL);
OK.| @@ -206,6 +206,49 @@ static int add_tree_entries(struct path_walk_context *ctx, | |||
| match != MATCHED) | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The previous change allowed specifying revision arguments over the 'git
> backfill' command-line. This created the opportunity for pathspecs that
> specify a smaller set of starting commits, but otherwise did not restrict
> the blob paths that were downloaded.
"pathspecs that specify a smaller set of starting commits" is
puzzling, as starting commits would be coming from the revision
arguments. "opportunity for pathspec to further filter commits
to those that touch only the matching paths...", or something?
> Update the path-walk API to accept certain kinds of pathspecs and to
> silently ignore anything too complex.
Hmph, "silently ignore", instead of "no, you cannot use that! and
die", or at least "sorry, I cannot do that, so the result may not be
what you wanted, you've been warned"?
> The current behavior focuses on
> pathspecs that match paths exactly. This includes exact filenames,
> including directory names as prefixes. Pathspecs containing wildcards
> or magic are cleared so the path walk downloads all blobs, as before.
Ah, "we punt and lift the limitation to grab everything, so at least
everything you wanted to have will become available to you, even
though we may download more than what you asked"? OK, users would
survive that, and as we improve the pathspec support, the user
experience would only improve. OK.
There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/17/2026 6:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The previous change allowed specifying revision arguments over the 'git
>> backfill' command-line. This created the opportunity for pathspecs that
>> specify a smaller set of starting commits, but otherwise did not restrict
>> the blob paths that were downloaded.
>
> "pathspecs that specify a smaller set of starting commits" is
> puzzling, as starting commits would be coming from the revision
> arguments. "opportunity for pathspec to further filter commits
> to those that touch only the matching paths...", or something?
You're right. I'm using "starting commits" incorrectly. My view was too
focused on how the path-walk API starts from the commits output by the
revision walk to get a list of root trees and then walks by path from
that point.
I'll reword to make this more clear.
>> Update the path-walk API to accept certain kinds of pathspecs and to
>> silently ignore anything too complex.
>
> Hmph, "silently ignore", instead of "no, you cannot use that! and
> die", or at least "sorry, I cannot do that, so the result may not be
> what you wanted, you've been warned"?
The behavior when silently ignoring is to over-download. The revision
walk still filters commits, but the path-walk then walks paths beyond
that pathspec. This will be fixed in the next commit, so adding an
error case didn't seem worth it. I'll do a better job foreshadowing.
>> The current behavior focuses on
>> pathspecs that match paths exactly. This includes exact filenames,
>> including directory names as prefixes. Pathspecs containing wildcards
>> or magic are cleared so the path walk downloads all blobs, as before.
>
> Ah, "we punt and lift the limitation to grab everything, so at least
> everything you wanted to have will become available to you, even
> though we may download more than what you asked"? OK, users would
> survive that, and as we improve the pathspec support, the user
> experience would only improve. OK.
Exactly. I can word things better.
Thanks,
-Stolee
| @@ -62,6 +62,8 @@ struct path_walk_context { | |||
| */ | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> Previously, walk_objects_by_path() silently ignored pathspecs containing
> wildcards or magic by clearing them. This caused all blobs to be
> downloaded regardless of the given pathspec. Wildcard pathspecs like
> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
> during 'git backfill').
>
> Support wildcard pathspecs by making three changes:
>
> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
> pathspec has no wildcards or magic, set this flag and use the
> existing fast-path prefix matching in add_tree_entries(). When
> wildcards are present, skip that block since prefix matching
> cannot handle glob patterns.
>
> 2. Disable revision-level commit pruning (revs->prune = 0) for
> wildcard pathspecs. The revision walk uses the pathspec to filter
> commits via TREESAME detection. For exact prefix pathspecs this
> works well, but wildcard pathspecs may fail to match through
> TREESAME because fnmatch with WM_PATHNAME does not cross directory
> boundaries. Disabling pruning ensures all commits are visited and
> their trees are available for the path-walk to filter.
Hmph, I wonder how significant an impact does it have on the
performance that we have to disable pruning here. With the bog
standard tree traversal, wouldn't tree_entry_interesting() already
be capable of doing this, even with fnmatch / WM_PATHNAME ?
> 3. Add a match_pathspec() check in walk_path() to filter out blobs
> whose full path does not match the pathspec. This provides the
> actual blob-level filtering for wildcard pathspecs.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The latter person cannot sign DCO or vouch for the origin of what
they have written in this patch, can they?
> ---
> path-walk.c | 22 ++++++++++++++--------
> t/t5620-backfill.sh | 7 +++----
> 2 files changed, 17 insertions(+), 12 deletions(-)There was a problem hiding this comment.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):
On 3/17/2026 6:19 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> Previously, walk_objects_by_path() silently ignored pathspecs containing
>> wildcards or magic by clearing them. This caused all blobs to be
>> downloaded regardless of the given pathspec. Wildcard pathspecs like
>> "d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
>> during 'git backfill').
>>
>> Support wildcard pathspecs by making three changes:
>>
>> 1. Add an 'exact_pathspecs' flag to path_walk_context. When the
>> pathspec has no wildcards or magic, set this flag and use the
>> existing fast-path prefix matching in add_tree_entries(). When
>> wildcards are present, skip that block since prefix matching
>> cannot handle glob patterns.
>>
>> 2. Disable revision-level commit pruning (revs->prune = 0) for
>> wildcard pathspecs. The revision walk uses the pathspec to filter
>> commits via TREESAME detection. For exact prefix pathspecs this
>> works well, but wildcard pathspecs may fail to match through
>> TREESAME because fnmatch with WM_PATHNAME does not cross directory
>> boundaries. Disabling pruning ensures all commits are visited and
>> their trees are available for the path-walk to filter.
>
> Hmph, I wonder how significant an impact does it have on the
> performance that we have to disable pruning here. With the bog
> standard tree traversal, wouldn't tree_entry_interesting() already
> be capable of doing this, even with fnmatch / WM_PATHNAME ?
I will explore what's possible here and see what I can do.
>> 3. Add a match_pathspec() check in walk_path() to filter out blobs
>> whose full path does not match the pathspec. This provides the
>> actual blob-level filtering for wildcard pathspecs.
>>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
>
> The latter person cannot sign DCO or vouch for the origin of what
> they have written in this patch, can they?
No they cannot. Sorry for this error.
Thanks,
-Stolee
| @@ -63,9 +63,12 @@ OPTIONS | |||
| current sparse-checkout. If the sparse-checkout feature is enabled, | |||
There was a problem hiding this comment.
"Kristoffer Haugsbakk" wrote on the Git mailing list (how to reply to this email):
On Tue, Mar 17, 2026, at 01:29, Derrick Stolee via GitGitGadget wrote:
>[snip]
> diff --git a/Documentation/git-backfill.adoc b/Documentation/git-backfill.adoc
> index b8394dcf22..fdfe22d623 100644
> --- a/Documentation/git-backfill.adoc
> +++ b/Documentation/git-backfill.adoc
> @@ -63,9 +63,12 @@ OPTIONS
> current sparse-checkout. If the sparse-checkout feature is enabled,
> then `--sparse` is assumed and can be disabled with `--no-sparse`.
>
> +You may also specify the commit limiting options from linkgit:git-rev-list[1].
> +
> SEE ALSO
> --------
> linkgit:git-clone[1].
> +linkgit:git-rev-list[1].
Should there be a comma between these two?
>[snip]|
User |
|
This branch is now known as |
|
This patch series was integrated into seen via git@b0faae0. |
|
This patch series was integrated into seen via git@2e7d77b. |
|
This patch series was integrated into seen via git@03fbc16. |
The existing implementation of 'git backfill' only includes downloading missing blobs reachable from HEAD. Advanced uses may desire more general commit limiting options, such as '--all' for all references, specifying a commit range via negative references, or specifying a recency of use such as with '--since=<date>'. All of these options are available if we use setup_revisions() to parse the unknown arguments with the revision machinery. This opens up a large number of possibilities, only a small set of which are tested here. For documentation, we avoid duplicating the option documentation and instead link to the documentation of 'git rev-list'. Note that these arguments currently allow specifying a pathspec, which modifies the commit history checks but does not limit the paths used in the backfill logic. This will be updated in a future change. Signed-off-by: Derrick Stolee <stolee@gmail.com>
The previous change allowed specifying revision arguments over the 'git backfill' command-line. This created the opportunity for restricting the initial commit set by filtering the revision walk through a pathspec. Other than filtering the commit set (and thereby the root trees), this did not restrict the path-walk implementation of 'git backfill' and did not restrict the blobs that were downloaded to only those matching the pathspec. Update the path-walk API to accept certain kinds of pathspecs and to silently ignore anything too complex, for now. We will update this in the next change to properly restrict to even complex pathspecs. The current behavior focuses on pathspecs that match paths exactly. This includes exact filenames, including directory names as prefixes. Pathspecs containing wildcards or magic are cleared so the path walk downloads all blobs, as before. The reason for this restriction is to allow for a faster execution by pruning the path walk to only trees that could contribute towards one of those paths as a parent directory. The test directory 'd/f/' (next to 'd/file*.txt') was prepared in a previous commit to exercise the subtlety in prefix matching. Signed-off-by: Derrick Stolee <stolee@gmail.com>
Previously, walk_objects_by_path() silently ignored pathspecs containing
wildcards or magic by clearing them. This caused all blobs to be
downloaded regardless of the given pathspec. Wildcard pathspecs like
"d/file.*.txt" are useful for narrowing which blobs to process (e.g.,
during 'git backfill').
Support wildcard pathspecs by making two changes:
1. Add an 'exact_pathspecs' flag to path_walk_context. When the
pathspec has no wildcards or magic, set this flag and use the
existing fast-path prefix matching in add_tree_entries(). When
wildcards are present, skip that block since prefix matching
cannot handle glob patterns.
2. Add a match_pathspec() check in walk_path() to filter out blobs
whose full path does not match the pathspec. This provides the
actual blob-level filtering for wildcard pathspecs.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
beb1c92 to
1168edf
Compare
Before the recent changes to parse rev-list arguments inside of 'git backfill', the builtin would take arbitrary arguments without complaint (and ignore them). This was noticed and a patch was sent [1] which motivates this change to encode this behavior in test. [1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/ Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com> Signed-off-by: Derrick Stolee <stolee@gmail.com>
|
/submit |
|
Submitted as pull.2070.v2.git.1774266019.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
The
git backfillcommand assists in downloading missing blobs for blobless partial clones. However, its current version lacks some valuable functionality. It currently:HEAD.All of these are being updated by this patch series, which allows rev-list options to impact the path-walk. These include:
--all.--first-parent, commit ranges, or recency using--since.One particularly valuable situation here is that now a user can run
git backfill -- <path>to download all versions of a specific file or a specific directory, accelerating history queries within that path without downloading more than necessary. This can accelerategit blameorgit log -Lfor these paths, where normally those commands download missing blobs one-by-one during its diff algorithms.This patch series is organized in the following way:
#includeis added to prevent future compilation issues.backfillbuiltin parses the rev-list arguments. We test the top arguments that work as expected, though the pathspec arguments need extra work.builtin/backfill.cinstead of restricting the walk in the path-walk API.The main goal of this series is to make such customizations possible, and to improve performance where common use cases are expected. I'm open to feedback as to whether we should consider more detailed performance analysis or whether we should wait for how users interact with these new options before overoptimizing unlikely use cases.
Updates in v2
[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/
Thanks,
-Stolee
cc: gitster@pobox.com
cc: "Kristoffer Haugsbakk" kristofferhaugsbakk@fastmail.com
cc: r.siddharth.shrimali@gmail.com
cc: ps@pks.im