clarification on git, central repositories and commit access lists

Mon Aug 20 19:41:05 BST 2007

On Sun, 19 Aug 2007, Adam Treat wrote:
> 
> I just watched your talk on git and wanted to ask for clarification on a 
> few points.  Many of us in the KDE community are interested in git and 
> some even contemplate using git as the official SCM tool in the future.  

As you are probably aware, some people have tried to import the whole KDE 
history into git. Quite frankly, the way git works (tracking whole trees 
at a time, never single files), that ends up being very painful, because 
it's an "all or nothing" approach. 

So I'm hoping that if you guys are seriously considering git, you'd also 
split up the KDE repository so that it's not one single huge one, but 
multiple smaller repositories (ie kdelibs might be one, and each major app 
would be its own), and then use the git "submodule" support to tie it 
all together.

> However, I think a few issues have been confused and want to see if you 
> can clarify.

Sure.

> Your talk focused heavily on the evils of a central repository versus 
> the benefits of a distributed model.  However, I wonder if what you 
> actually find distasteful is not a central repository per se, but rather 
> designing an SCM that relies upon *communication* with a central 
> repository to do branching/merging or offline development.

I certainly agree that almost any project will want a "central" repository 
in the sense that you want to have one canonical default source base that 
people think of as the "primary" source base. 

But that should not be a *technical* distinction, it should be a *social* 
one, if you see what I mean. The reason? Quite often, certain groups would 
know that there is a primary archive, but for various reasons would want 
to ignore that knowledge: the reasons can be any of

 - Release management: you often want the central "development" repository 
   to be totally separate from the release management tree. Yes, you can 
   approximate that with branches, but let's face it, the people involved 
   usually have a lot of overlap, but the overlap is not total, and the 
   *interest* isn't necessarily the same.

   For an example of "release management", think of multiple different 
   vendors. They would probably always start with your "central" release 
   tree (which in turn may well be different from your central development 
   tree!), but vendors invariably have their own timetables and customer 
   issues, so they usually need to make decisions that may not even make 
   sense for the "official" tree.

   An example of this in the kernel is how my tree is the central 
   development tree, then we have the "stable" tree (which is a *separate* 
   thing, maintained totally separately, but obviously based on my 
   releases), and then each vendor tends to have their own "release 
   trees". They are all different, they all have different policies and 
   reasons for existence, and they are *all* "central" depending on who 
   looks at them.

 - Branching. Yes, you can branch in a truly centralized model too, but 
   it's generally a "big issue" - the branches are globally visible 
   things, and you need permission from the maintainers of the centralized 
   model too.

   Both of those are *horrible* mistakes: the "globally visible" part 
   means that if you're not sure this makes sense, you're much less likely 
   to begin a branch - even if it's cheap, it's still something that 
   everybody else will see, and as such you can't really do "throwaway" 
   development that way. And let's face it, many cool ideas turn out to be 
   totally idiotic, but it might take a long time until it's obvious that 
   it was a bad idea.

   So you absolutely need *private* branches, that can become "central" for 
   the people involved in some re-architecting, even if they never ever 
   show up in the "truly central" repository. That's a huge deal for 
   development.

   The other problem is the "permission from maintainers" thing: I have an 
   ego the size of a small planet, but I'm not _always_ right, and in that 
   kind of situation it would be a total disaster if everybody had to ask 
   for my permission to create a branch to do some re-architecting work. 

   The fact that anybody can create a branch without me having to know 
   about it or care about it is a big issue to me: I think it keeps me 
   honest. Basically, the fundamental tool we use for the kernel makes 
   sure that if I'm not doing a good job, anybody else can show people 
   that they do a better job, and nobody is really "inconvenienced". 

   Compare that to some centralized model, and something like the gcc/egcs 
   fork: the centralized model made the fork so painful that it became a 
   huge political fight, instead of just becoming an issue of "we can do 
   this better"!

There are other reasons for having a *social* network that tends to have 
one or two fairly central nodes, but not having a *technical* limitation 
that enforces that. But the above are the two biggest and most important 
reasons, I think.

> After all, your repository acts as a de-facto central repository of the 
> linux kernel in as much as everyone pulls from it.  Without such a 
> central place to pull the linux kernel would not exist, rather what 
> you'd have is a bunch of forks which perhaps merge with each other from 
> time to time.

Well, I do want to make it clear that we *do* have such forks that pull 
from each other too. So the kernel actually does use the technology, it's 
just that you have to be involved in the particular subprojects to even 
know or care about it!

So it's not strictly true that there is a single "central" one, even if 
you ignore the stable tree (or the vendor trees). There are subsystems 
that end up working with each other even before they hit the central tree 
- but you are right that most people don't even see it. Again, it's the 
difference between a technical limitation, and a social rule: people use 
multiple trees for development, but because it's easier for everybody to 
have one default tree, that's obviously what most people who aren't 
actively developing do.

To put this in a KDE perspective: it would make tons and tons of sense to 
have one central place (kde.org) that most developers know about, and 
where they would fetch their sources from. But for various reasons (and 
security is one of them), that may not be the main place where most "core 
developers" really work. You would generally want to have separate places 
that are secure, and those separate places may be *different* for 
different developer groups.

For a kernel example: the "public" git tree is on the public kernel.org 
servers (including "git.kernel.org"), but that is actually not a machine 
that any developers really ever push to directly.

Many kernel developers use other kernel.org machines (because we have the 
infrastructure), but others will use their own setups entirely, because 
they might have issues like bandwidth (ie kernel.org may be reasonably 
well connected, but while it has mirrors elsewhere, the main machines are 
in the US, so some European developers prefer to just use servers that are 
closer). 

So if you look at my merge messages, for example, you'll see things like 
merges from lm-sensors.org, git.kernel.dk, ftp.linux-mips.org, oss.sgi.com
etc etc. The point being that yes, there is a central place that people 
know about, but at the same time, much of the *development* really happens 
outside that central place!

> For any software project to exist as opposed to a bunch of forks I think 
> you *have to have* a central repository from which everyone pulls, no?  
> Of course many branches might exist, but those branches must pull from a 
> central repository if they want to share *at least some* common code.

Practically speaking, you'd generally have one or a few central 
repositories, yes. But no, it really doesn't have to be a single one. And 
I'm not just talking about mirroring (which is really easy with a 
distributed setup), I'm literally talking about things like some people 
wanting to use the "stable" tree, and not my tree at all, or the vendor 
trees.

And they are obviously *connected*, but it doesn't have to be a totally 
central notion at all.

Think of the git trees as people: some people are more "central" than 
others, but in the end, the kernel is actually fairly unusual (at least 
for a big project) in having just *one* person that is so much in the 
"center" that everybody knows about him.

In most other projects, you literally would have different groups that 
handle different parts. In the KDE group, for example, there really is no 
reason why the people who work on one particular application should ever 
use the same "central" repository as the people who work on another app 
do.

You'd have a *separate* group (that probably also maintains some central 
part like the kdelibs stuff) that might be in charge of *integrating* it 
all, and that integration/core group might be seen to outsiders as the 
"one central repository", but to the actual application developers, that 
may actually be pretty secondary, and as with the kernel, they may 
maintain their own trees at places like ftp.linux-mips.org - and then just 
ask the core people to pull from them when they are reasonably ready.

See? There's really no more "one central place" any more. To the casual 
observer, it *looks* like one central place (since casual users would 
always go for the core/integration tree), but the developers themselves 
would know better. If you wanted to develop some bleeding edge koffice 
stuff, you'd use *that* tree - and it might not have been merged into the 
core tree yet, because it might be really buggy at the moment!

This is one of the big advantages of true distribution: you can have that 
kind of "central" tree that does integration, but it doesn't actually have 
to integrate the development "as it happens". In fact, it really really 
shouldn't. If you look at my merges, for example, when I merge big changes 
from somebody else who actually maintains them in a git tree, they will 
have often been done much earlier, and be a series of changes, and I only 
merge when they are "ready".

So the core/central people should generally not necessarily even do any 
real development at all: the tree that people see as the "one tree" is 
really mostly just an integration thing. When the koffice/kdelibs/whatever 
people decide that they are ready and stable, they can tell the 
integration group to pull their changes. There's obviously going to be 
overlap between developers/integrators (hopefully a *lot* of overlap), but 
it doesn't have to be that way (for example, I personally do almost *only* 
integration, and very little serious development).

> A central repository is also necessary for projects like KDE to enable 
> things like buildbots and commit mailing lists.

I disagree.

Yes, you want a central build-bot and commit mailing list. But you don't 
necessarily want just *one* central build-bot and commit mailing list. 

There's absolutely no reason why everybody would be interested in some 
random part of the tree (say, kwin), and there's no reason why the people 
who really only do kwin stuff should have to listen to everybody else's 
work. They may well want to have their *own* build-bot and commit mailing 
list!

So making one central one is certainly not a mistake, but making *only* a 
central one is. Why shouldn't the groups that do specialized work have 
specialized test-farms? The kernel does. The NFS stuff, for example, tends 
to have its own test infrastructure. 

Also, it's a mistake to think that one site has to do everything. That's 
not what we do in the kernel, for example. Yes, we have kernel.org, and 
it's reasonably central, but that doesn't mean that everything has to, or 
even should, happen within that organization.

So we've had people do build-bots and performance regressions, and 
specialized testing *outside* of kernel.org. For example, Intel and others 
have done things like performance regression testing that required 
specialized hardware and software (eg TPC-C performance numbers).

So we do commit mailing lists from kernel.org, but (a) that doesn't mean 
that everything else should be done from that central site and (b) it also 
doesn't mean that subprojects shouldn't do their *own* commit mailing 
lists. In fact, there's a "gitstat" project (which tracks the kernel, but 
it's designed to be available for *any* git project), and you can see an 
example of it in action at

	http://tree.celinuxforum.org/gitstat

(or get the source code from sourceforge), and the point is that all of 
this was done entirely *outside* the kernel.org framework.

So centralized is not at all always good. Quite the reverse: having 
distributed services allows *specialized* services, and it also allows the 
above kind of experimental stuff that does some (fairly simple, but maybe 
it will expand) data-mining on the project!

> These tools are important to the way we work and provide for many eyes 
> constantly reviewing changes to the codebase as well as regular 
> regression testing across diverse platforms.  In the future, whether git 
> or svn, I see no advantages in getting rid of a central repository from 
> which everyone pulls.  I wonder whether you really disagree.

So I do disagree, but only in the sense that there's a big difference 
between "a central place that people can go to" and "ONLY ONE central 
place".

See? Distribution doesn't mean that you cannot have central places - but 
it means that you can have *different* central places for different 
things. You'd generally have one central place for "default" things 
(kde.org), but other central places for more specific or specialized 
services!

And whether it's specialized by project, or by things like the above 
"special statistics" kind of thing, or by usage, is another matter! For 
example, maybe you have kde.org as the "default central place", but then 
some subgroup that specializes in mobility and small-memory-footprint 
issues might use something like kde.mobile.org as _their_ central site, 
and then developers would occasionally merge stuff (hopefully both ways!).

> In your talk you also focus on the evils of commit access lists, 
> comparing and contrasting with the web of trust the kernel uses where 
> you have no commit access lists at all.  However, isn't the kernel model 
> just a special case?  The linux kernel has a de-facto commit access list 
> of one: you.

No, really. It doesn't. It's the one you see from the outside, but the 
fact is, different sub-parts of the kernel really do use their own trees, 
and their own mailing lists. You, as a KDE developer, would generally 
never care about it, so you only _see_ the main one.

> This might work well for the kernel, but I fail to see how this really 
> reduces politics.  Many are still constantly pushing and arguing to 
> merge their branches upstream into your repository.  Would having a 
> central repository where you and all your trusted lieutenants push their 
> changes really be very different?

Yes it would be. You only see the end result now. You don't see how those 
lieutenants have their own development trees, and while the kernel is 
fairly modular (so the different development trees seldom have to interact 
with each other), they *do* interact. We've had the SCSI development tree 
interact with the "block layer" development tree, and all you ever see is 
the end result in my tree, but the fact is, the development happened 
entirely *outside* my tree.

The networking parts, for example, merge the crypto changes, and I then 
merge the end result of the crypto _and_ network changes.

Or take the powerpc people: they actually merge their basic architecture 
stuff to me, but their network driver stuff goes through Jeff Garzik - and 
you as a user never even realize that there was another "central" tree for 
network driver development, because you would never use it unless you had 
reported a bug to Jeff, and Jeff might have sent you a patch for it, or 
alternatively he might have asked if you were a git user, and if so, 
please pull from his 'e1000e' branch.

For an example of this, go to 

	http://git.kernel.org/

and look at all the projects there. There are lots of kernel subprojects 
that are used by developers - exactly so that if you report a bug against 
a particular driver or subsystem, the developer can tell you to test an 
experimental branch that may fix it.

> The KDE community has a very large commit access list and it is quite 
> easy to join.  Having a central git repository with a large set of 
> committers would seem to map well with our community.  I fail to see any 
> harm in this model.  The web of trust would still exist, it would just 
> be much larger and more inclusive than the model the kernel uses.  I 
> wonder if you disagree.

Hey, you can use your old model if you want to. git doesn't *force* you to 
change. But trust me, once you start noticing how different groups can 
have their own experimental branches, and can ask people to test stuff 
that isn't ready for mainline yet, you'll see what the big deal is all 
about.

Centralized _works_. It's just *inferior*.

> Another sticking point is the performance implications of a git 
> repository managing something the size of the KDE project.  I understand 
> the straightforward solution: just define content boundaries with a 
> separate git repo for each submodule: kdelibs.git, kdebase.git, 
> kdesupport.git, etc, etc.  And then have a super git repo with hooks 
> that point to these submodules.  However, I think this leads to a few 
> problems.
>
> What if I want to make a commit to kdelibs that will require changes in 
> other modules for them to compile.  I will no longer be able to make a 
> single atomic commit with changes to multiple submodules, right?

Sure you will. It's hierarchical, though.

What happens is that you do a single commit in each submodule that is 
atomic to that *private* copy of that submodule (and nobody will ever see 
it on its own, since you'd not push it out), and then in the supermodule 
you make *another* commit that updates the supermodule to all the changes 
in each submodule.

See? It's totally atomic. Anybody that updates from the supermodule will 
get one supermodule commit, and when that in turn fetches all the 
submodule changes, you never have any inconsistent state.
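
A toy sketch of that hierarchy, in Python rather than in git itself. This 
is not how git stores anything; the `Repo` and `SuperRepo` classes and the 
commit contents are made up for illustration. The only point is that a 
supermodule commit records one commit id per submodule, so whoever reads 
the supermodule sees either the old set of ids or the new set, never a mix.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repository (hypothetical): an append-only list of (id, content)."""
    name: str
    commits: list = field(default_factory=list)

    def commit(self, content: str) -> str:
        # The id depends on the parent id, so each commit names its history.
        parent = self.commits[-1][0] if self.commits else ""
        commit_id = hashlib.sha1(f"{parent}:{content}".encode()).hexdigest()[:12]
        self.commits.append((commit_id, content))
        return commit_id

    def head(self) -> str:
        return self.commits[-1][0]


@dataclass
class SuperRepo:
    """Toy supermodule (hypothetical): each commit pins one id per submodule."""
    history: list = field(default_factory=list)

    def commit(self, pins: dict) -> None:
        # Copy, so later changes to the caller's dict cannot alter history.
        self.history.append(dict(pins))

    def head(self) -> dict:
        return self.history[-1]


kdelibs = Repo("kdelibs")
kdebase = Repo("kdebase")
superrepo = SuperRepo()

# Initial consistent state.
superrepo.commit({"kdelibs": kdelibs.commit("api v1"),
                  "kdebase": kdebase.commit("uses api v1")})
old_pins = superrepo.head()

# Step 1: commit in each submodule's private copy.
new_libs = kdelibs.commit("api v2")
mid_pins = superrepo.head()   # what a supermodule user sees in between
new_base = kdebase.commit("uses api v2")

# Step 2: one supermodule commit that moves both pins together.
superrepo.commit({"kdelibs": new_libs, "kdebase": new_base})
new_pins = superrepo.head()
```

Here `mid_pins` is what somebody updating from the supermodule would get 
while the kdelibs change exists only in the private copy: still the old, 
consistent pair.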

> Also, won't we lose history when moving files/content between 
> submodules?

Yes. If you move stuff between repositories, you do lose history (or 
rather, it breaks it as far as git is concerned - you still obviously have 
both *pieces* of history, but to see it, you'd have to manually go and 
look).

The point of submodules is that they are totally independent entities in 
their own right, so that you can develop on a submodule without having to 
even know about or care about the supermodule.

Git actually does perform fairly well even for huge repositories (I fixed 
a few nasty problems with 100,000+ file repos just a week ago), so if you 
absolutely *have* to, you can consider the KDE repos to be just one single 
git repository, but that unquestionably will perform worse for some things 
(notably, "git annotate/blame" and friends).

But what's probably worse, a single large repository will force everybody 
to always download the whole thing. That does not necessarily mean the 
whole *history* - git does support the notion of "shallow clones" that 
just download part of the history - but since git at a very fundamental 
level tracks the whole tree, it forces you to download the whole "width" 
of the tree, and you cannot say "I want just the kdelibs part".
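
The depth/width distinction can be sketched with a toy model (plain 
Python, not git's real object format; the paths and the `tree_id` and 
`shallow_clone` helpers are invented for illustration). Each commit 
identifies a snapshot of the whole tree, so you can cut off old history, 
but you cannot keep a commit's identity while dropping half its tree.

```python
import hashlib


def tree_id(tree: dict) -> str:
    """Hash the whole tree: every path and its content feeds into the id."""
    h = hashlib.sha1()
    for path in sorted(tree):
        h.update(f"{path}={tree[path]}\n".encode())
    return h.hexdigest()


def shallow_clone(history: list, depth: int) -> list:
    """Keep only the newest `depth` snapshots; each one is still a full tree."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return history[-depth:]


# Three snapshots of a (made-up) two-directory project, oldest first.
history = [
    {"kdelibs/core.cpp": "v1", "koffice/kword.cpp": "v1"},
    {"kdelibs/core.cpp": "v2", "koffice/kword.cpp": "v1"},
    {"kdelibs/core.cpp": "v2", "koffice/kword.cpp": "v2"},
]

# Less depth: fewer snapshots, but the newest one has the full width.
shallow = shallow_clone(history, depth=1)
newest = shallow[0]

# Less width: dropping koffice/ gives a different tree with a different id.
only_kdelibs = {p: c for p, c in newest.items() if p.startswith("kdelibs/")}
same_id = tree_id(only_kdelibs) == tree_id(newest)
```

In this model `same_id` comes out false: the kdelibs-only tree is simply 
not the tree that the commit names.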

> And how will we break up the existing history between all of these 
> submodules?

There are a few options for that. 

One is to just import the SVN history per directory in the first place, 
but that makes it hard to then tie the history together in the 
supermodule.

The better approach is probably to import the *whole* thing (which will 
require a rather beefy machine), and then split it up from within git. 
There are various tools on the git side to basically rewrite the history 
in other formats, including splitting up a bigger repository (google for 
"git-split", for example). 

But I certainly won't lie to you: importing all the history of KDE is 
going to be a fairly big project, and it will require people who have good 
git knowledge to set it up. I suspect (judging by some noises I've seen on 
the git mailing list and irc channel) that you have those kinds of people 
already, but it may well be a good idea to _avoid_ doing it as one big 
"everything at once" kind of event.

So seriously, if there is currently some smaller part of the KDE SVN tree, 
and the people who work on that part are already more familiar with git 
than most KDE people necessarily are, I suspect that the best thing to do 
is to convert just that piece first, and have people migrate in pieces. 
Because any SCM move is going to be a learning process 
(the CVS->SVN one is much easier than most, since they really are largely 
just different faces of the same coin - no real changes in how things 
fundamentally work as far as the user experience is concerned).
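
Going back to the "import the whole thing, then split it up" option: 
conceptually, splitting means projecting every snapshot onto one 
subdirectory and dropping the commits that did not touch it. A toy sketch 
in Python, assuming history is just a list of path-to-content snapshots 
(the `split_history` helper and the paths are hypothetical, and this is 
not what any particular git tool does internally):

```python
def split_history(history: list, prefix: str) -> list:
    """Project each snapshot onto `prefix`, skipping commits that leave it unchanged."""
    result = []
    for tree in history:
        sub = {path[len(prefix):]: content
               for path, content in tree.items() if path.startswith(prefix)}
        if not sub and not result:
            continue  # the subdirectory did not exist yet
        if result and sub == result[-1]:
            continue  # this commit did not touch the subdirectory
        result.append(sub)
    return result


# Whole-project history, oldest first (made-up contents).
whole = [
    {"kdelibs/core.cpp": "v1"},
    {"kdelibs/core.cpp": "v1", "koffice/kword.cpp": "v1"},
    {"kdelibs/core.cpp": "v2", "koffice/kword.cpp": "v1"},
    {"kdelibs/core.cpp": "v2", "koffice/kword.cpp": "v2"},
]

kdelibs_history = split_history(whole, "kdelibs/")
koffice_history = split_history(whole, "koffice/")
```

Four commits in the big tree become two in each of the split-out 
histories, which is also why the pieces no longer line up one-to-one with 
the original.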

> Finally, a couple points...  CVS/SVN might be stupid and moronic, but I 
> think it is good to note they are not nearly as bad as some other SCM's.  
> Many SCM's used by some of the largest codebases in the world are still 
> lock-based.  If you think it is difficult to branch/merge using a 
> central server, remember that some poor folks can't even *change a 
> single file* without asking the central server for permission.

Sure. Crap exists. That doesn't make CVS/SVN _good_. It just means that 
there are even worse things out there.

> It is also good to note that a free distributed SCM was not available 
> until recently.  The kernel community might have had a special deal with 
> BitKeeper, but the same didn't apply to all open source projects AFAIK.  
> When KDE moved to svn it was the best tool for the job.  That might have 
> changed when git became easier to use, but at the time it was simply too 
> big of a barrier for new developers and too new.  And from what I 
> understand git support on other platforms is a recent development.

Git works pretty well on any random unix (although most users are on 
Linux, with a reasonable minority on OS X - everything else tends to be 
pretty spotty, and can at times require that you add compiler options 
etc). 

The native Windows support is pretty recent, and still in flux. It's now 
apparently quite usable, although I don't think there's any real 
integration with any native Windows development environments (ie it's all 
either command line or the "native" git visualization tools like git-gui 
or gitk).

			Linus