PO sync is unnaturally increasing repository sizes
Ben Cooksley
bcooksley at kde.org
Sat Aug 12 21:59:19 BST 2023
On Fri, Aug 11, 2023 at 6:22 AM Albert Astals Cid <aacid at kde.org> wrote:
> On Thursday, 10 August 2023 at 14:00:05 (CEST), Andre Heinecke wrote:
> > Hi,
> >
> > tl;dr: po sync is blowing up our repository sizes far more than appears
> > to be necessary. We might need a force push across all repos to correct
> > that. The Kleopatra repo has increased in size tenfold since po files were
> > added less than a year ago.
> >
> >
> > I recently noticed that Kleopatra has gained some weight. While she is an
> > old lady, and when she was split up from the old KDEPIM repo took all her
> > history with it she was always quite chubby. But not by that much. ( I am
> > messy with Mega / Mebi here since it is not important for the overall
> > picture)
> >
> > So let us see:
> > A fresh clone of Kleopatra:
> > 209M kleopatra
> > Running:
> > git filter-repo --path po --invert-paths
> > 21M kleopatra
> >
> > Let us do the same for KMail:
> > Before:
> > 169M kmail
> > after:
> > 56M kmail
> >
> > Now yes Kleopatra has quite a few translations. Their checked out size is
> > about 29Megabytes. But there is something wrong here.
> >
> > What I don't understand though is that if I look at the scripty commits in
> > the git log, nothing seems unusual.
> >
> > But let us take the language of Low Saxon. I hope that offends the fewest
> > people here. There have been no new translations there in ~10 years.
> >
> > Its checked out size is 460KB.
> >
> > In master we have:
> > 715 translated messages, 709 fuzzy translations, 428 untranslated messages.
> > Going back to the first revision that added po files:
> > 763 translated messages, 645 fuzzy translations, 391 untranslated messages.
> >
> > Sizes are fairly equal, with master of course a bit larger. Now this
> > language, unchanged in translation, has alone added 10 Megabytes. That is
> > about half of the size of the complete history of the real source code for
> > Kleopatra.
> >
> > du -hs .
> > 209M
> > git filter-repo --path po/nds --invert-paths --force
> > du -hs .
> > 199M
> >
> > Now here is what I don't understand. If I look at the changes
> > git log -p po/nds/kleopatra.po | wc -c
> > 164774
> > That seems reasonable for all the automatic scripty updates and even with
> > all the context lines, that is just about 165KB uncompressed.
> >
> > And this is where my git understanding runs into limits. To understand why
> > the history has gotten so large I tried some snippets from Stack Overflow
> > and from there with:
> > git rev-list --objects --all po/nds/kleopatra.po | git cat-file \
> > --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
> > sed -n 's/^blob //p' |
> > sort --numeric-sort --key=2 |
> > cut -c 1-12,41-
> >
> > I think that I can roughly see that apparently each commit in the repo has a
> > blob associated with it that is the same size as the file.
> >
> > So can some git sleuth please investigate what is happening here? This kind
> > of repo growth is unsustainable and at least for Kleopatra I see no possible
> > solution other than to figure this out and then remove the po history from the
> > last year with a force push :-/
> >
> > Don't get me wrong I like that the po files are now also in the repo, and
> > that this will of course increase the repo size, but something fishy is
> > going on here in my opinion.
>
> As discussed on Matrix, it seems doing
> git gc --aggressive
> brings down kleopatra size from 223M to 55M
>
> Is this something worth doing on the server side?
>
I have investigated this, and it isn't as simple as just running "git gc
--aggressive" on the server unfortunately.
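
For anyone who wants to see the effect of such a repack locally, here is a
minimal sketch on a throwaway repository. The file name messages.po, the
commit messages and the number of commits are made up for illustration, and
it only shows loose objects being packed with deltas; it does not claim to
reproduce the exact cause of the growth in our repositories.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.org"
git config user.name "Demo"
git config gc.auto 0            # keep automatic gc out of the measurement

# Hypothetical stand-in for a large translation file.
seq 1 20000 > messages.po
git add messages.po
git commit -q -m "Initial import"

# Many small automated updates, each storing a full new blob.
i=1
while [ "$i" -le 30 ]; do
  echo "change $i" >> messages.po
  git commit -q -am "Automated update $i"
  i=$((i + 1))
done

# Size of the object store in KiB.
size_kib() { du -sk .git/objects | cut -f1; }

before=$(size_kib)
git gc --aggressive --quiet --prune=now
after=$(size_kib)
echo "objects before: ${before} KiB, after: ${after} KiB"
```

On a real clone, "git count-objects -vH" before and after the repack gives
the same kind of comparison.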
The component responsible for looking after Git repositories in Gitlab
(known as Gitaly) makes use of a couple of tricks to help ensure efficient
use of resources and help things move quickly. One of these tricks involves
sharing the underlying objects in a Git repository among a parent
repository and its forks to save on disk space (this mechanism is
known as a pool repository in Gitaly). This is how forks of our
repositories use very little in the way of space, even though they take
significantly more to clone. It appears that in this case we would need to
have this repack (which is what git gc --aggressive is doing) done in the
pool repository, which I can't see a way to do.
I have now filed https://gitlab.com/gitlab-org/gitaly/-/issues/5508 to
discuss this with the Gitlab/Gitaly developers.
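
The object sharing described above can be approximated with plain Git
alternates. This is only a sketch of the general idea, assuming a local
"git clone --shared" is a fair stand-in; it is not how Gitaly itself sets up
or maintains pool repositories, and the names parent and fork are made up.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# "parent" stands in for the upstream repository (hypothetical name).
git init -q parent
cd parent
git config user.email "demo@example.org"
git config user.name "Demo"
head -c 200000 /dev/urandom | base64 > data.txt
git add data.txt
git commit -q -m "Add some data"
cd ..

# --shared makes the clone borrow objects from parent instead of copying them.
git clone -q --shared parent fork

# The fork records where the borrowed objects live...
cat fork/.git/objects/info/alternates
# ...and its own object store stays small compared to the parent's.
du -sh parent/.git/objects fork/.git/objects
```

As far as I understand it, a "git gc" run inside fork only repacks objects
that fork stores itself by default; the objects it borrows from parent stay
as they are, which is why the repack would have to happen where the shared
objects actually live.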
>
> Does anyone know of any potential downsides of that?
>
> Cheers,
> Albert
>
Thanks,
Ben
>
> >
> >
> > Best Regards,
> > Andre
>
>
>
>
>