Another proposal for modernization of our infrastructure

Jan Kundrát jkt at kde.org
Wed Jan 28 11:27:06 GMT 2015


On Wednesday, 28 January 2015 10:08:54 CEST, Ben Cooksley wrote:
> 1) Most applications integrate extremely poorly with LDAP. They
> basically take the details once on first login and don't sync the
> details again after that (this is what both Chiliproject and
> Reviewboard do). How does Gerrit perform here?

Data are fetched from LDAP as needed. There's a local cache for speedup 
(with configurable TTL and support for explicit flushes).
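
For illustration, the TTL lives in gerrit.config and an explicit flush is 
a single SSH command; the cache name and host below are examples, not our 
final configuration:

  # raise the LDAP group cache TTL to one hour (takes effect after a restart)
  git config -f etc/gerrit.config cache.ldap_groups.maxAge "1 hour"

  # drop the cache right away, e.g. after someone's group membership changed
  ssh -p 29418 admin@gerrit.example.org gerrit flush-caches --cache ldap_groups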

> 2) For this trivial scratch repository script, will it offer it's own
> user interface or will developers be required to pass arguments to it
> through some other means? The code you've presented thus far makes it
> appear some other means will be required.

I might not fully understand this question -- I thought we had already 
discussed this. The simplest invocation can be as easy as `ssh 
user@somehost create-personal-project foobar`, with the SSH keys verified 
by OpenSSH. This is the same UI as our current setup. There are other 
options, some of them with fancy, web-based UIs.
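
To make this concrete, a rough sketch of what the server-side part could 
look like follows; it merely wraps Gerrit's standard create-project SSH 
command, and the "scratch/" prefix is purely illustrative:

  #!/bin/sh
  # create-personal-project <name> -- hypothetical wrapper which runs for an
  # already-authenticated $USER and delegates the actual work to Gerrit
  name="$1"
  exec ssh -p 29418 localhost gerrit create-project \
      --owner "$USER" --description "'Scratch repository of $USER'" \
      "scratch/$USER/$name"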

> 3) We've used cGit in the past, and found it suffered from performance
> problems with our level of scale. Note that just because a solution
> scales with the size of repositories does not necessarily mean it
> scales with the number of repositories, which is what bites cGit. In
> light of this, what do you propose?

One option suggested in the document is to keep our current quickgit 
setup, i.e. GitPHP. If it copes with our scale today, there's no need to 
change it, IMHO, and sticking with it looks like a safe choice to me. But 
there are many additional options (including gitiles).

> 4) Has Gerrit's replication been demonstrated to handle 2000 Git
> repositories which consume 30gb of disk space? How is metadata such as
> repository descriptions (for repository browsers) replicated?

Yes, Gerrit scales far beyond that. See e.g. the thread at 
https://groups.google.com/forum/#!topic/repo-discuss/5JHwzednYkc for real 
users' feedback about large-scale deployments.

> 5) If Gerrit or it's hosting system were to develop problems how would
> the replication system handle this? From what I see it seems highly
> automated and bugs or glitches within Gerrit could rapidly be
> inflicted upon the anongit nodes with no apparent safe guards (which
> our present custom system has). Bear in mind that failure scenarios
> can always occur in the most unexpected ways, and the integrity of our
> repositories is of paramount importance.

I agree that one needs proper, offline and off-site backups for critical 
data, and that any online Git replication is not a proper substitute for 
this. The plan for disaster recovery therefore is "restore from backup".

In terms of Gerrit, this means backing up all of the Git repositories and 
dumping the PostgreSQL database, and storing both in a location which 
cannot be wiped out or modified by an attacker who has root on the main 
Git server, or by a software bug in our Git hosting. One cannot get that 
with just Git replication, of course.
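
As a rough sketch of what such a nightly job boils down to (paths, 
database name and retention are placeholders, not a finished design):

  # dump the review metadata and archive the repositories...
  pg_dump --format=custom reviewdb > /backup/staging/reviewdb-$(date +%F).dump
  tar -czf /backup/staging/git-$(date +%F).tar.gz -C /srv/gerrit git
  # ...and ship the staging directory to off-site, append-only storage which
  # the Git server itself cannot overwrite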

What are the safeguard mechanisms that you mentioned, and what threats do 
they mitigate? I'm asking because e.g. the need for frequent branch 
deletion is minimized by Gerrit's code review process, which uses 
"branches" internally. What risks do you expect to see here?

> 6) Notifications: Does it support running various checks that our
> hooks do at the moment for license validity and the like? When these
> rules are tripped the author is emailed back on their own commits.

Yes, the proposed setup supports these. The best place to implement them 
is a CI job triggered through the ref-updated hook. My personal preference 
would be a ref-updated event handler in Zuul to ensure proper scalability, 
but there are other options.
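
A minimal sketch of that glue, assuming the stock hook interface (the 
notify-ci helper is hypothetical; with Zuul, its own Gerrit event listener 
replaces this script entirely):

  #!/bin/sh
  # $site_path/hooks/ref-updated -- parse the arguments Gerrit passes in and
  # hand the event over to the CI
  while [ $# -gt 0 ]; do
      case "$1" in
          --project) project="$2"; shift 2 ;;
          --refname) refname="$2"; shift 2 ;;
          --newrev)  newrev="$2";  shift 2 ;;
          *)         shift ;;
      esac
  done
  exec /usr/local/bin/notify-ci "$project" "$refname" "$newrev"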

> 7) Storing information such as tree metadata location within
> individual Git repositories is a recipe for delivering a system that
> will eventually fail to scale, and will abuse resources. Due to the
> time it takes to fork out to Git,

Gerrit uses JGit, a Java implementation of Git. There are no forks.

> plus the disk access necessary for
> it to retrieve the information in question, I suspect your generation
> script will take several load intensive minutes to complete even if it
> only covers mainline repositories. This is comparable to the
> performance of Chiliproject in terms of generation at the moment.

Gerrit 2.10, released yesterday, adds a REST API for fetching arbitrary 
file contents from Git, with aggressive caching. I would like to use that 
for generating the kde_projects.xml file.
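
Something along these lines, with the project and file names being purely 
illustrative (the endpoint returns the file base64-encoded):

  curl -s "https://gerrit.example.org/projects/kde%2Fexample/branches/master/files/metadata.yaml/content" \
      | base64 -d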

> The original generation of our Git hooks invoked Git several times per
> commit, which meant the amount of time taken to process 1000 commits
> easily reached 10 minutes. I rebuilt them to invoke git only a handful
> of times per push - which is what we have now.

Gerrit has a different architecture with no forks and aggressive caching. 
I'm all for benchmarking, though. Do you want a test repository to run your 
benchmarks against?

> 8) Shifting information such as branch assignments in the same manner
> will necessitate that someone have access to a copy of the Git
> repository to determine the branch to use. This is something the CI
> system cannot ensure, as it needs to determine this information for
> dependencies, and a given node may not have a workspace for the
> repository in question. It also makes it difficult to update rules
> which are common among a set of repositories such as those for
> Frameworks and Plasma (Workspace). I've no idea if it would cause
> problems for kdesrc-build, but that is also a possibility.

The kde_projects.xml file which stores a copy of these data will remain 
unchanged, and it should also remain the place consulted by e.g. the CI 
scripts or kdesrc-build. These tools will need no change.

What the proposal suggests is to base the generation of that file on data 
stored in Git rather than on a custom webapp.

> 9) You've essentially said you are going to eliminate our existing
> hooks.

The proposal said that it might be possible to replace a large part of 
the functionality with Gerrit's native features, with zero maintenance. If 
the remaining functionality (CRLF line endings and author human names for 
direct pushes) is important enough to warrant ongoing maintenance of the 
custom hooks, they can still be run without a problem.

> Does Gerrit support:
>     a) line ending checks, with exceptions for certain file types and
> repositories?

The proposal suggests handling this in the CI setup. This means that 
pushing CRLF data to our repos would be allowed, with a follow-up e-mail 
saying "hey, you're doing a bad thing". That's a trade-off for not having 
to maintain these scripts.

Alternative options for this include:
- preserving this part of the hooks and running them from Gerrit,
- extending an existing Git validation plugin to do this.
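
For completeness, the core of such a check is tiny; a hedged sketch which 
leaves out the per-file and per-repository exceptions that the real hooks 
honour:

  #!/bin/sh
  # $oldrev and $newrev describe the pushed range, as provided by the hook
  cr=$(printf '\r')
  if git diff "$oldrev..$newrev" | grep -q "^+.*$cr\$" ; then
      echo "warning: this push introduces CRLF line endings" >&2
  fi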

>     b) Convenient deactivation of these checks if necessary.

Yes, this is configurable.

>     c) Checks on the author name to ensure it bears a resemblence to
> something actual.

No, the author's name is not checked at the moment. If we decide to change 
this, it's going to be a couple-line patch, or a custom hook.

However, I do not think that checking names in the way the hooks do it 
now is actually a good thing. Please read http://wookware.org/name.html 
for an example of a real person from the UK who cannot commit to KDE.

The potential for mistakes is largely mitigated by the checks for e-mail 
validity. For this to be a problem, one would have to push a commit with a 
valid e-mail address but a wrong name ("jkt <jkt@kde.org>"). We should 
evaluate whether risking this is worth the reduced maintenance.

Also, this only affects direct pushes and KDE developers. Patch proposals 
from third parties can be easily and immediately downvoted by the CI, with 
a helpful message on what to fix.

>     d) Prohibiting anyone from pushing certain types of email address
> such as *@localhost.localdomain?

Yes:

A similar check applies to e-mail validation. An ACL verifies whether an 
e-mail matches one of the user's registered addresses. These addresses are 
either read from LDAP, or validated by a mail probe to make sure that they 
actually exist and belong to the user in question. This validation can be 
configured on a per-LDAP-group basis, so it is possible to allow KDE 
developers to push commits on behalf of third-party contributors while 
preventing regular users from faking their identity.
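
In project.config terms this is the Forge Author/Committer Identity 
permission granted to the right group; a hedged sketch of that edit (the 
group name is made up, and the group must also be resolvable, e.g. through 
the LDAP backend):

  git fetch origin refs/meta/config && git checkout FETCH_HEAD
  git config -f project.config --add 'access.refs/heads/*.forgeAuthor' \
      'group ldap/kde-developers'
  git config -f project.config --add 'access.refs/heads/*.forgeCommitter' \
      'group ldap/kde-developers'
  git commit -am "Allow KDE developers to push commits from third parties"
  git push origin HEAD:refs/meta/config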

> 10) You were aware of the DSL work Scarlett is doing and the fact this
> is Jenkins specific (as it generates Jenkins configuration). How can
> this work remain relevant?
> Additionally, Scarlett's work will introduce declarative configuration
> for jobs to KDE.

My understanding of Scarlett's work is that it aims at cleaning up our 
current configuration, making it work on Windows and OS X, and introducing 
a declarative language for preparing job descriptions. AFAIK, the only 
part which might be Jenkins-specific is the last bit, and I fully expect a 
declarative generator to be able to produce job descriptions for another 
system just by adding a proper output format. Moving to a declarative 
approach is the big change here; adding another output format is much less 
work.

> 11) We actually do use some of Jenkins advanced features, and it
> offers quite a lot more than just a visual view of the last failure.
> As a quick overview:
>     a) Tracked history for tests (you can determine if a single test
> is flaky and view a graph of it's pass/fail history).

Please see section 3.3.2, which discusses possible ways of dealing with 
flaky tests. IMHO, the key feature and our ultimate goal is "let's handle 
flaky tests efficiently", not "let's have a graph of failing tests" (how 
would that work with the non-linear history of pre-merge CI?).

>     b) Log parsing to track the history of compiler warnings and other
> matters of significance (this is fully configurable based on regexes)

That's covered in section 3.3.3. One alternative is to make the build 
warning-free on one well-known platform and to enforce -Werror there.
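
For a CMake-based project that's a one-liner on the reference builder (the 
flags are illustrative):

  cmake -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror" /path/to/source && make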

>     c) Integrated cppcheck and code coverage reports, actively used by
> some projects within KDE.

The Zuul-based CI setup launches KDE's existing build scripts and 
delivers their output. I chose to disable cppcheck for simplicity, and 
because none of the projects currently in Gerrit are covered by Jenkins' 
cppcheck runs on build.kde.org at this time. There is no reason for not 
enabling cppcheck runs again, of course. However, when I last looked at 
it, the include paths were not being passed properly and the data I got 
back were clearly bogus, so I decided to skip it for now. The same applies 
to coverage reports. Both will be provided, of course.
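
Re-enabling cppcheck is mostly a matter of passing the include paths 
explicitly; a hedged sketch with made-up paths:

  cppcheck --enable=warning,style -I src/ -I build/generated/ \
      --xml src/ 2> cppcheck-report.xml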

>     d) Intelligent dashboards which allow you to get an overview of a
> number of jobs easily.
>
> Bear in mind that these intelligent dashboards can be setup by anyone
> and are able to filter on a number of conditions. They can also
> provide RSS feeds and update automatically when a build completes.
>
> How would Zuul offer any of this? And how custom would this all have
> to be? Custom == maintenance cost.

The report explicitly acknowledges the need for future work on this 
status matrix, and proposes how to get there (section 3.3.4).

Regarding the maintenance costs, let's wait until it is ready and evaluate 
the maintenance burden at that point.

> Addendum: the variations, etc. offered by the Zuul instance which
> already exists in the Gerrit clone are made possible by the hardware
> resources Jan has made available to that system. Jenkins is fully
> capable of offering such builds as well with the appropriate setup,
> some of which are already used - see the Multi Configuration jobs such
> as the ones used by Trojita and Plasma Framework.

I believe that it is not about HW resources, but about the services' 
configuration. Does KDE's Jenkins, as it stands, support building against 
a systemwide version of Qt, for example?

> You've lost me i'm afraid with the third party integration - please
> clarify what you're intending here.

I am pointing out that it is easy to plug a third-party testing system 
into Gerrit/Zuul, mainly thanks to the open APIs and the system's 
architecture. If e.g. one of the FreeBSD guys wanted to help, they would 
have a way of getting involved without any explicit action from sysadmins. 
To me, that lowers the barrier to entry a bit, and it also frees up some 
sysadmin time for more important tasks, so I think it's a benefit of such 
a setup.
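
As an example of how little is needed on the other side, an external 
builder can report its result back through Gerrit's standard review 
command (the host, message and change number are made up):

  ssh -p 29418 freebsd-ci@gerrit.example.org gerrit review \
      --verified=-1 --message "'build failed on FreeBSD 10.1'" 12345,2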

> 12) The tone of the way the event stream feature is mentioned makes it
> sound like sysadmin actively prevents people from receiving the
> information they need. We have never in the past prevented people from
> receiving notifications they've requested - you yourself have one that
> triggers builds on the OBS for Trojita.

It was never my intention to imply anything like that; sorry for this. 
That section merely says that the current approach requires manual effort 
from sysadmins and custom code. In contrast, the proposed setup lets 
anyone listen for events in a machine-readable way without any prior 
effort from sysadmins to enable that.
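
Concretely, anyone with an SSH account can do e.g. the following and get a 
stream of JSON events, one per line; acting on them is then entirely up to 
the consumer:

  ssh -p 29418 user@gerrit.example.org gerrit stream-events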

> 13) You've used the terminology "we" throughout your document. Who are
> the other author(s)?

I think this is similar to the previous report. I received feedback on 
this paper from several developers. Due to the rather heated nature of the 
previous rounds of this discussion and some personal attacks, they 
preferred not to be credited as authors. The actual wording is mine; I 
wrote the text, so I'm listed as the only author.

Anyway, I hope that we'll be able to judge the merits of the individual 
proposals, and that this won't deteriorate into a popularity contest.

Cheers,
Jan

-- 
Trojitá, a fast Qt IMAP e-mail client -- http://trojita.flaska.net/



