<div dir="ltr"><div dir="ltr">On Fri, Aug 11, 2023 at 6:22 AM Albert Astals Cid <<a href="mailto:aacid@kde.org">aacid@kde.org</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thursday, 10 August 2023, at 14:00:05 (CEST), Andre Heinecke wrote:<br>
> Hi,<br>
> <br>
> tl;dr: po sync is blowing up our repository sizes far more than appears<br>
> to be necessary. We might need a force push across all repos to correct<br>
> that. The Kleopatra repo has increased in size tenfold since po files were<br>
> added less than a year ago.<br>
> <br>
> <br>
> I recently noticed that Kleopatra has gained some weight. She is an old<br>
> lady, and since she took all her history with her when she was split off<br>
> from the old KDEPIM repo, she was always quite chubby. But not by that<br>
> much. (I am sloppy about mega vs. mebi here since it does not matter for<br>
> the overall picture.)<br>
> <br>
> So let us see:<br>
> A fresh clone of Kleopatra:<br>
> 209M kleopatra<br>
> Running:<br>
> git filter-repo --path po --invert-paths<br>
> 21M kleopatra<br>
> <br>
> Let us do the same for KMail:<br>
> Before:<br>
> 169M kmail<br>
> after:<br>
> 56M kmail<br>
> <br>
> Now yes, Kleopatra has quite a few translations. Their checked-out size is<br>
> about 29 megabytes. But there is something wrong here.<br>
> <br>
> What I don't understand though is that if I look at the scripty commits in<br>
> the git log, nothing seems unusual.<br>
> <br>
> But let us take the Low Saxon language. I hope that offends the fewest<br>
> people here. There have been no new translations there in ~10 years.<br>
> <br>
> Its checked-out size is 460KB.<br>
> <br>
> In master we have:<br>
> 715 translated messages, 709 fuzzy translations, 428 untranslated messages.<br>
> Going back to the first revision that added po files:<br>
> 763 translated messages, 645 fuzzy translations, 391 untranslated messages.<br>
> <br>
> Sizes are fairly equal, with master of course a bit larger. Now this<br>
> language, unchanged in translation, has alone added 10 megabytes. That is<br>
> about half of the size of the complete history of the real source code for<br>
> Kleopatra.<br>
> <br>
> du -hs .<br>
> 209M<br>
> git filter-repo --path po/nds --invert-paths --force<br>
> du -hs .<br>
> 199M<br>
> <br>
> Now here is what I don't understand. If I look at the changes<br>
> git log -p po/nds/kleopatra.po | wc -c<br>
> 164774<br>
> That seems reasonable for all the automatic scripty updates and even with<br>
> all the context lines, that is just 1,6MB uncompressed.<br>
> <br>
> And this is where my git understanding runs into limits. To understand why<br>
> the history has gotten so large I tried some snippets from Stack Overflow,<br>
> and from there with:<br>
> git rev-list --objects --all po/nds/kleopatra.po |<br>
> git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |<br>
> sed -n 's/^blob //p' |<br>
> sort --numeric-sort --key=2 |<br>
> cut -c 1-12,41-<br>
> <br>
> I think I can roughly see that each commit in the repo apparently has a<br>
> blob associated with it that is the same size as the file.<br>
> <br>
> So can some git sleuth please investigate what is happening here? This kind<br>
> of repo growth is unsustainable, and at least for Kleopatra I see no possible<br>
> solution other than to figure this out and then remove the po history from<br>
> the last year with a force push :-/<br>
> <br>
> Don't get me wrong I like that the po files are now also in the repo, and<br>
> that this will of course increase the repo size, but something fishy is<br>
> going on here in my opinion.<br>
<br>
As discussed on Matrix, it seems doing<br>
git gc --aggressive<br>
brings down kleopatra size from 223M to 55M<br>
<br>
Is this something worth doing on the server side?<br></blockquote><div><br></div><div>I have investigated this, and unfortunately it isn't as simple as just running "git gc --aggressive" on the server.</div><div><br></div><div>The component responsible for looking after Git repositories in GitLab (known as Gitaly) uses a couple of tricks to ensure efficient use of resources and to keep things moving quickly. One of these tricks involves sharing the underlying objects in a Git repository between a parent repository and its forks to save disk space (this mechanism is known as a pool repository in Gitaly). This is how forks of our repositories use very little space on the server, even though they take significantly more to clone. It appears that in this case we would need to have this repack (which is what git gc --aggressive is doing) done in the pool repository, which I can't see a way to do.</div><div><br></div><div>I have now filed <a href="https://gitlab.com/gitlab-org/gitaly/-/issues/5508">https://gitlab.com/gitlab-org/gitaly/-/issues/5508</a> to discuss this with the GitLab/Gitaly developers.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
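</blockquote><div>As a rough illustration of the object sharing described above, the following sketch uses plain Git rather than Gitaly itself. It assumes that pool repositories build on Git's alternates mechanism, and every directory name in it ("parent", "fork", the scratch directory) is hypothetical.</div>
<pre>
```shell
set -eu

# Hypothetical scratch directory; all names below are illustrative only.
work=$(mktemp -d)
cd "$work"

# A stand-in for the parent repository, holding a single empty commit.
git init -q parent
git -C parent -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial"

# --shared makes the clone borrow objects from the parent through
# objects/info/alternates instead of copying them.
git clone -q --shared parent fork

# The alternates file of the fork points at the parent's object store.
alternates=$(cat fork/.git/objects/info/alternates)
echo "fork borrows objects from: $alternates"

# Repacking only the fork does not rewrite the borrowed objects,
# because they are stored in the parent, not in the fork.
git -C fork gc --quiet --aggressive
git -C fork count-objects -v
```
</pre>
<div>In a setup like this, the last command should list the parent's object store on an "alternate:" line, which hints at why the repack would have to run where the shared objects actually live.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">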
<br>
Does anyone know of any potential downsides to that?<br>
<br>
Cheers,<br>
Albert<br></blockquote><div><br></div><div>Thanks,</div><div>Ben</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> <br>
> <br>
> Best Regards,<br>
> Andre<br>
<br>
<br>
<br>
<br>
</blockquote></div></div>