fetch, clone: add fetch.blobSizeLimit config #2058
abraithwaite wants to merge 1 commit into gitgitgadget:master from
Conversation
[GitGitGadget posted its standard welcome comment for @abraithwaite, describing how to contribute patches to the Git mailing list: getting added to the list of permitted users via /allow, previewing and submitting with /preview and /submit, CC'ing reviewers via a footer in the PR description, monitoring the mailing-list thread and replying by mail, and sending revised iterations by force-pushing and issuing /submit again.]
Force-pushed from 525eef2 to 8d656a9.
External tools like git-lfs and git-fat use the filter clean/smudge mechanism to manage large binary objects, but this requires pointer files, a separate storage backend, and careful coordination. Git's partial clone infrastructure provides a more native approach: large blobs can be excluded at the protocol level during fetch and lazily retrieved on demand. However, enabling this requires passing `--filter=blob:limit=<size>` on every clone, which is not discoverable and cannot be set as a global default.

Add a new `fetch.blobSizeLimit` configuration option that enables size-based partial clone behavior globally. When set, both `git clone` and `git fetch` automatically apply a `blob:limit=<size>` filter. Blobs larger than the threshold that are not needed for the current worktree are excluded from the transfer and lazily fetched on demand when needed (e.g., during checkout, diff, or merge). This makes it easy to work with repositories that have accumulated large binary files in their history, without downloading all of them upfront.

The precedence order is:

1. Explicit `--filter=` on the command line (highest)
2. Existing `remote.<name>.partialCloneFilter`
3. `fetch.blobSizeLimit` (new, lowest)

Once a clone or fetch applies this setting, the remote is registered as a promisor remote with the corresponding filter spec, so subsequent fetches inherit it automatically. If the server does not support object filtering, the setting is silently ignored.

Signed-off-by: Alan Braithwaite <alan@braithwaite.dev>
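A minimal sketch of the intended usage, based on the commit message above (the `1m` threshold is an arbitrary example value):

```ini
# ~/.gitconfig — with this patch, clones and fetches behave as if
# --filter=blob:limit=1m had been passed, unless a command-line
# --filter or an existing remote.<name>.partialCloneFilter wins
[fetch]
	blobSizeLimit = 1m
```

This would be roughly equivalent to passing `--filter=blob:limit=1m` to every `git clone` by hand.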
Force-pushed from 8d656a9 to 818b64e.
/allow
User abraithwaite is now allowed to use GitGitGadget. WARNING: abraithwaite has no public email address set on GitHub; GitGitGadget needs an email address to Cc: you on your contribution, so that you receive any feedback on the Git mailing list. Go to https://github.com/settings/profile to make your preferred email public to let GitGitGadget know which email address to use.
/submit
Submitted as pull.2058.git.1772383499900.gitgitgadget@gmail.com
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Sun, Mar 01, 2026 at 04:44:59PM +0000, Alan Braithwaite via GitGitGadget wrote:
> From: Alan Braithwaite <alan@braithwaite.dev>
>
> External tools like git-lfs and git-fat use the filter clean/smudge
> mechanism to manage large binary objects, but this requires pointer
> files, a separate storage backend, and careful coordination. Git's
> partial clone infrastructure provides a more native approach: large
> blobs can be excluded at the protocol level during fetch and lazily
> retrieved on demand. However, enabling this requires passing
> `--filter=blob:limit=<size>` on every clone, which is not
> discoverable and cannot be set as a global default.
I'm not sure that we should make blob size limiting the default. The
problem with specifying a limit is that this is comparatively expensive
to compute on the server side: we have to look up each blob so that we
can determine its size. Unfortunately, such requests cannot (currently)
be optimized via for example bitmaps, or any other cache that we have.
So if we want to make any filter the default, I'd propose that we should
rather think about filters that are computationally less expensive, like
for example `--filter=blob:none`. This can be computed efficiently via
bitmaps.
The downside is of course that in this case we have to do way more
backfill fetches compared to the case where we only leave out a couple
of blobs. But unless we figure out a way to serve the size limit filter
in a more efficient way I'm not sure about proper alternatives.
Another question to consider: is it really sensible to set this setting
globally? It is very much dependent on the forge that you're connecting
to, as forges may not even allow object filters at all, or only a subset
of them.
Thanks!
Patrick
Jeff King wrote on the Git mailing list (how to reply to this email): On Mon, Mar 02, 2026 at 12:53:32PM +0100, Patrick Steinhardt wrote:
> On Sun, Mar 01, 2026 at 04:44:59PM +0000, Alan Braithwaite via GitGitGadget wrote:
> > From: Alan Braithwaite <alan@braithwaite.dev>
> >
> > External tools like git-lfs and git-fat use the filter clean/smudge
> > mechanism to manage large binary objects, but this requires pointer
> > files, a separate storage backend, and careful coordination. Git's
> > partial clone infrastructure provides a more native approach: large
> > blobs can be excluded at the protocol level during fetch and lazily
> > retrieved on demand. However, enabling this requires passing
> > `--filter=blob:limit=<size>` on every clone, which is not
> > discoverable and cannot be set as a global default.
>
> I'm not sure that we should make blob size limiting the default. The
> problem with specifying a limit is that this is comparatively expensive
> to compute on the server side: we have to look up each blob so that we
> can determine its size. Unfortunately, such requests cannot (currently)
> be optimized via for example bitmaps, or any other cache that we have.
We actually can do blob:limit filters with bitmaps. See 84243da129
(pack-bitmap: implement BLOB_LIMIT filtering, 2020-02-14). It's more
expensive than blob:none, but not much. Once we have the list of blobs
we can get their sizes directly from the packfile. It's stuff like
path-limiting that is truly expensive, because it requires a traversal.
All that said, I'd be wary of turning on partial clones like this by
default. I feel like there are still a lot of performance gotchas
lurking (and possibly some correctness ones, too).
-Peff
Junio C Hamano wrote on the Git mailing list (how to reply to this email): Patrick Steinhardt <ps@pks.im> writes:
> I'm not sure that we should make blob size limiting the default. The
> problem with specifying a limit is that this is comparatively expensive
> to compute on the server side: we have to look up each blob so that we
> can determine its size. Unfortunately, such requests cannot (currently)
> be optimized via for example bitmaps, or any other cache that we have.
> ...
> Another question to consider: is it really sensible to set this setting
> globally? It is very much dependent on the forge that you're connecting
> to, as forges may not even allow object filters at all, or only a subset
> of them.
Both are good questions, but to affect "clone" you'd need either
"git -c that.variable=setting clone" or have it in ~/.gitconfig no?
As to this extra variable, it can already be done with existing
remote.*.partialCloneFilter, it seems, so I do not know why we want
to add it.
"Alan Braithwaite" wrote on the Git mailing list (how to reply to this email): Patrick, Peff, Junio — thanks for taking the time to look at
this.
Patrick wrote:
> I'm not sure that we should make blob size limiting the
> default.
To clarify — this is a user-opt-in config, not a default. You
would only get partial clone behavior if you explicitly set
fetch.blobSizeLimit in your gitconfig.
Peff wrote:
> We actually can do blob:limit filters with bitmaps. See
> 84243da129 (pack-bitmap: implement BLOB_LIMIT filtering,
> 2020-02-14).
Good to know. I'm not positive, but my understanding is that
this patch only touches client code, and the server sees an
identical request to what `git clone --filter=blob:limit=1m`
already sends today. If that's correct, anyone can already
impose that cost — this patch just makes it easier to opt in.
> All that said, I'd be wary of turning on partial clones like
> this by default.
That's fair. I'm not attached to getting this merged — it was
more exploratory to start a discussion.
Junio wrote:
> As to this extra variable, it can already be done with
> existing remote.*.partialCloneFilter, it seems, so I do not
> know why we want to add it.
I may not understand the config as well as you do, but my
reading is that remote.*.partialCloneFilter requires a specific
remote name and only takes effect on subsequent fetches from an
already-registered promisor remote — not the initial clone. You
would also need remote.origin.promisor=true set globally, which
seems odd. If I'm understanding correctly, there is currently
no way to say "all new clones should use a blob size filter"
via config alone. But please correct me if I'm wrong.
Separately — is my understanding correct that partial clone
with blob:limit works today without server-side changes,
assuming uploadpack.allowFilter is enabled? If so, I'm happy
to maintain this as a local client patch for my own workflow.
Thanks again,
Alan
On Mon, Mar 2, 2026, at 10:57, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
>> I'm not sure that we should make blob size limiting the default. The
>> problem with specifying a limit is that this is comparatively expensive
>> to compute on the server side: we have to look up each blob so that we
>> can determine its size. Unfortunately, such requests cannot (currently)
>> be optimized via for example bitmaps, or any other cache that we have.
>> ...
>> Another question to consider: is it really sensible to set this setting
>> globally? It is very much dependent on the forge that you're connecting
>> to, as forges may not even allow object filters at all, or only a subset
>> of them.
>
> Both are good questions, but to affect "clone" you'd need either
> "git -c that.variable=setting clone" or have it in ~/.gitconfig no?
>
> As to this extra variable, it can already be done with existing
> remote.*.partialCloneFilter, it seems, so I do not know why we want
> to add it.
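For reference, the server-side knob Alan asks about is `uploadpack.allowFilter`: it must be enabled on the serving repository for any `--filter` request to be honored at all. A minimal sketch:

```ini
# Server-side config: advertise and honor the protocol's
# object-filter capability for this repository
[uploadpack]
	allowFilter = true
```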
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Mon, Mar 02, 2026 at 01:36:40PM -0800, Alan Braithwaite wrote:
> Peff wrote:
> > We actually can do blob:limit filters with bitmaps. See
> > 84243da129 (pack-bitmap: implement BLOB_LIMIT filtering,
> > 2020-02-14).
>
> Good to know. I'm not positive, but my understanding is that
> this patch only touches client code, and the server sees an
> identical request to what `git clone --filter=blob:limit=1m`
> already sends today. If that's correct, anyone can already
> impose that cost — this patch just makes it easier to opt in.
Ah, right, that's something I forgot. I've seen too many performance
issues recently with blob:limit fetches, so I jumped the gun.
> Junio wrote:
> > As to this extra variable, it can already be done with
> > existing remote.*.partialCloneFilter, it seems, so I do not
> > know why we want to add it.
>
> I may not understand the config as well as you do, but my
> reading is that remote.*.partialCloneFilter requires a specific
> remote name and only takes effect on subsequent fetches from an
> already-registered promisor remote — not the initial clone. You
> would also need remote.origin.promisor=true set globally, which
> seems odd. If I'm understanding correctly, there is currently
> no way to say "all new clones should use a blob size filter"
> via config alone. But please correct me if I'm wrong.
No, you're right about this one, and I think this is a sensible thing to
want. But what I'd like to see is a bit more nuance, I guess:
- It should be possible to specify the configuration per URL. If you
know that git.example.com knows object filters you may want to turn
them on for that domain specifically. So the mechanism would work
similar to "url.<base>.insteadOf" or "http.<url>.*" settings.
- The infrastructure shouldn't cast any specific filter into stone.
Instead, it should be possible to specify a default filter.
I'd assume that these settings should only impact the initial clone to
use a default filter in case the cloned URL matches the configured URL.
For existing repositories it shouldn't have any impact, as we should
continue to respect the ".git/config" there when it comes to promisors
and filters.
> Separately — is my understanding correct that partial clone
> with blob:limit works today without server-side changes,
> assuming uploadpack.allowFilter is enabled? If so, I'm happy
> to maintain this as a local client patch for my own workflow.
Yes, blob:limit filters are supported by many forges nowadays.
Patrick
"Alan Braithwaite" wrote on the Git mailing list (how to reply to this email): Patrick wrote:
> No, you're right about this one, and I think this is a
> sensible thing to want. But what I'd like to see is a bit
> more nuance, I guess:
>
> - It should be possible to specify the configuration per
> URL. If you know that git.example.com knows object
> filters you may want to turn them on for that domain
> specifically. So the mechanism would work similar to
> "url.<base>.insteadOf" or "http.<url>.*" settings.
>
> - The infrastructure shouldn't cast any specific filter
> into stone. Instead, it should be possible to specify a
> default filter.
Thanks, this is great feedback. I took a look at the existing
URL-based config patterns and I think the http.<url>.* model
is the right one to follow, since it already uses the
urlmatch_config_entry() infrastructure with proper URL
normalization, host globs, and longest-match specificity.
Here's what I'm thinking for a v2. I'd like to get feedback
on the design before implementing:
The config would use a new section that supports both a global
default and per-URL overrides, following the same pattern as
http.sslVerify vs http.<url>.sslVerify:
# Global default — applies to all clones/fetches
[fetch]
partialCloneFilter = blob:limit=1m
# Per-URL override — more specific match wins
[fetch "https://github.com/"]
partialCloneFilter = blob:limit=5m
[fetch "https://internal.corp.com/"]
partialCloneFilter = blob:none
Design points:
- Accepts any filter spec, not just blob:limit. This
addresses your point about not casting a specific filter
into stone.
- Uses fetch.<url>.partialCloneFilter, following the
http.<url>.* precedent. The urlmatch.c infrastructure
handles URL normalization, host globs (*.example.com),
default port stripping, and path-based specificity
ordering — so no new matching logic would be needed.
- A bare fetch.partialCloneFilter (no URL) acts as the
global default, the same way http.sslVerify is the
global default that http.<url>.sslVerify can override.
- Only applies to initial clone and to fetches where no
existing remote.<name>.partialCloneFilter is set. Existing
repos continue using their per-remote config.
- Explicit --filter on the command line still takes
precedence over everything.
- If the server does not support object filtering, the
setting is silently ignored (existing behavior).
I chose fetch.* rather than clone.* so that both git-clone
and git-fetch can use the same config. In practice this
mainly matters for the initial clone, since once the promisor
remote is registered, subsequent fetches inherit the filter
from remote.<name>.partialCloneFilter anyway.
Does this direction make sense? Happy to hear if there are
concerns before I start on a v2.
Thanks,
- Alan
Jeff King wrote on the Git mailing list (how to reply to this email): On Mon, Mar 02, 2026 at 01:36:40PM -0800, Alan Braithwaite wrote:
> Peff wrote:
> > We actually can do blob:limit filters with bitmaps. See
> > 84243da129 (pack-bitmap: implement BLOB_LIMIT filtering,
> > 2020-02-14).
>
> Good to know. I'm not positive, but my understanding is that
> this patch only touches client code, and the server sees an
> identical request to what `git clone --filter=blob:limit=1m`
> already sends today. If that's correct, anyone can already
> impose that cost — this patch just makes it easier to opt in.
Yes, that's correct. The server protects itself by refusing to support
certain filters that are too expensive. Usually by setting
uploadpackfilter.allow to "false", followed by enabling
uploadpackfilter.*.allow for particular filters.
When we added those, we left the defaults as-is (allowing everything).
That's OK for casual use amongst your own repositories, but terrible for
a hosting site. I don't know if it would be worth revisiting the
defaults.
But anyway, all orthogonal to the topic in this thread.
-Peff
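The server-side protection Peff describes can be sketched as config (the particular set of re-allowed filters here is illustrative):

```ini
# Deny all object filters by default, then re-allow only the
# ones that are cheap enough to serve
[uploadpackfilter]
	allow = false
[uploadpackfilter "blob:none"]
	allow = true
[uploadpackfilter "blob:limit"]
	allow = true
```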
Patrick Steinhardt wrote on the Git mailing list (how to reply to this email): On Tue, Mar 03, 2026 at 06:00:29AM -0800, Alan Braithwaite wrote:
> Patrick wrote:
> > No, you're right about this one, and I think this is a
> > sensible thing to want. But what I'd like to see is a bit
> > more nuance, I guess:
> >
> > - It should be possible to specify the configuration per
> > URL. If you know that git.example.com knows object
> > filters you may want to turn them on for that domain
> > specifically. So the mechanism would work similar to
> > "url.<base>.insteadOf" or "http.<url>.*" settings.
> >
> > - The infrastructure shouldn't cast any specific filter
> > into stone. Instead, it should be possible to specify a
> > default filter.
>
> Thanks, this is great feedback. I took a look at the existing
> URL-based config patterns and I think the http.<url>.* model
> is the right one to follow, since it already uses the
> urlmatch_config_entry() infrastructure with proper URL
> normalization, host globs, and longest-match specificity.
>
> Here's what I'm thinking for a v2. I'd like to get feedback
> on the design before implementing:
>
> The config would use a new section that supports both a global
> default and per-URL overrides, following the same pattern as
> http.sslVerify vs http.<url>.sslVerify:
>
> # Global default — applies to all clones/fetches
> [fetch]
> partialCloneFilter = blob:limit=1m
>
> # Per-URL override — more specific match wins
> [fetch "https://github.com/"]
> partialCloneFilter = blob:limit=5m
>
> [fetch "https://internal.corp.com/"]
> partialCloneFilter = blob:none
>
> Design points:
>
> - Accepts any filter spec, not just blob:limit. This
> addresses your point about not casting a specific filter
> into stone.
>
> - Uses fetch.<url>.partialCloneFilter, following the
> http.<url>.* precedent. The urlmatch.c infrastructure
> handles URL normalization, host globs (*.example.com),
> default port stripping, and path-based specificity
> ordering — so no new matching logic would be needed.
>
> - A bare fetch.partialCloneFilter (no URL) acts as the
> global default, the same way http.sslVerify is the
> global default that http.<url>.sslVerify can override.
>
> - Only applies to initial clone and to fetches where no
> existing remote.<name>.partialCloneFilter is set. Existing
> repos continue using their per-remote config.
>
> - Explicit --filter on the command line still takes
> precedence over everything.
>
> - If the server does not support object filtering, the
> setting is silently ignored (existing behavior).
>
> I chose fetch.* rather than clone.* so that both git-clone
> and git-fetch can use the same config. In practice this
> mainly matters for the initial clone, since once the promisor
> remote is registered, subsequent fetches inherit the filter
> from remote.<name>.partialCloneFilter anyway.
I think using something like "clone.<url>.defaultObjectFilter" would be
a more sensible design. The idea is that we'd only honor this filter on
the initial clone to basically be equivalent to `git clone --filter=`. I
don't think any subsequent fetches should be impacted at all, as turning
a full clone into a partial clone would need more consideration.
Patrick
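Patrick's alternative, sketched as config; the section and key names are hypothetical, since nothing in Git implements them yet:

```ini
# Hypothetical: honored only on the initial clone, where it would be
# equivalent to `git clone --filter=blob:limit=1m` for matching URLs
[clone "https://git.example.com/"]
	defaultObjectFilter = blob:limit=1m
```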
Junio C Hamano wrote on the Git mailing list (how to reply to this email): Patrick Steinhardt <ps@pks.im> writes:
> No, you're right about this one, and I think this is a sensible thing to
> want. But what I'd like to see is a bit more nuance, I guess:
>
> - It should be possible to specify the configuration per URL. If you
> know that git.example.com knows object filters you may want to turn
> them on for that domain specifically. So the mechanism would work
> similar to "url.<base>.insteadOf" or "http.<url>.*" settings.
>
> - The infrastructure shouldn't cast any specific filter into stone.
> Instead, it should be possible to specify a default filter.
>
> I'd assume that these settings should only impact the initial clone to
> use a default filter in case the cloned URL matches the configured URL.
> For existing repositories it shouldn't have any impact, as we should
> continue to respect the ".git/config" there when it comes to promisors
> and filters.
Ahh, thanks for pointing out the flaw in my thinking that forgets
that "remote.<name>.partialCloneFilter" would not work in the
initial state where there is no <name> associated with the remote
repository you are trying to contact. I agree that something like
"remote.<url>.partialCloneFilter" is a more proper way forward.
Junio C Hamano wrote on the Git mailing list (how to reply to this email): Patrick Steinhardt <ps@pks.im> writes:
> I think using something like "clone.<url>.defaultObjectFilter" would be
> a more sensible design. The idea is that we'd only honor this filter on
> the initial clone to basically be equivalent to `git clone --filter=`. I
> don't think any subsequent fetches should be impacted at all, as turning
> a full clone into a partial clone would need more consideration.
Yup, I like this one. Should <url> be giving a repository fully, or
be some pattern that groups similar repositories together? You
would not be wanting to clone exactly the same repository so many
times for a configuration variable to matter in general.
CC: ps@pks.im
CC: christian.couder@gmail.com
CC: jonathantanmy@google.com
CC: me@ttaylorr.com
CC: gitster@pobox.com
CC: Jeff King <peff@peff.net>