Jenkins-kde-ci (many CI failures)

Tue May 10 07:39:22 UTC 2016

On Sunday 08 May 2016 23:06:02 Ben Cooksley wrote:
> On Sun, May 8, 2016 at 2:44 AM, David Faure <faure at kde.org> wrote:
> > kdewebkit just failed with "Broken pipe" (the TCP error you mentioned)
> > (and kxmlrpcclient failed again with an anongit error). This is like playing whack-a-mole...
> 
> Yeah :( Fortunately the Broken Pipe error is the least common one.
> 
> >
> > I thought TCP was more robust than that. Would it help to increase some
> > TCP-related timeout somewhere?
> 
> TCP should definitely be more reliable, I agree.
> I suspect the root cause of the Broken Pipe issue will be the same as
> the Temporary failure in name resolution error.
> 
> The /etc/hosts fix should be deployed shortly - the images are rebuilding now.

kmediaplayer job #63 failed with
ssh: Could not resolve hostname build.kde.org: Temporary failure in name resolution
at 12:56 yesterday (CI system time).

Is build.kde.org missing from /etc/hosts?
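
One way to check this on a builder image is to parse the hosts file and look for the name. The following is a minimal Python sketch, not what the CI actually runs: the helper names hosts_entries and has_hosts_entry are hypothetical, and the sample content (including the documentation-only address 192.0.2.10) is made up for illustration.

```python
def hosts_entries(hosts_text):
    """Map each hostname or alias found in hosts-file text to its IP address."""
    entries = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address with no names is not a usable entry
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first match, as resolvers read the file top to bottom.
            entries.setdefault(name.lower(), ip)
    return entries


def has_hosts_entry(hostname, hosts_text):
    """Return True if hostname appears as a name or alias in hosts_text."""
    return hostname.lower() in hosts_entries(hosts_text)


# Hypothetical sample content; 192.0.2.10 is a documentation-only address.
SAMPLE = """\
127.0.0.1   localhost
192.0.2.10  build.kde.org   # pinned to avoid DNS lookups
"""
```

Inside a container, this could be called as has_hosts_entry("build.kde.org", open("/etc/hosts").read()). A False result would mean the name still goes through DNS, which would be consistent with the "Temporary failure in name resolution" error.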

> The only thing I can think of at the moment is some kind of traffic
> storm on the network bridge which disrupts arp or something similar at
> that level when one or more containers start/stop in a short amount of
> time. This could very well be Docker itself determining which IP / MAC
> addresses it can use for the newly starting container - with
> connections being broken and data lost when it steps on one that is in
> use. I do seem to recall having the issue, albeit to a lesser extent,
> with the KVM setup as well. We definitely didn't have it with the LXC
> containers though, but those all had public IP addresses of some form
> or another (one was Public IPv6 only, with NAT IPv4)
> 
> The current setup (using one machine as an example, they're all
> identical except for the IP ranges used):
> 
> - Normal Linux bridges, set up using Debian's /etc/network/interfaces
> and bridge utilities.
> - Host takes 10.150.85.1/25 (br0) and 10.150.81.129/25 (br1)
> 
> - Docker containers are allocated the rest of the 10.150.85.1/25 IP
> block, and are connected to the corresponding bridge (br0)
> - Windows virtual machines are allocated static IP addresses in the
> 10.150.85.129/25 block, on the corresponding bridge (br1)
> 
> - VPN connection is established using OpenVPN, with the OpenVPN server
> routing 10.150.85.0/24 to the VPN client. Only traffic within the
> 10.150.85.0/16 subnet will be sent over the VPN. This is done to
> permit secure communication with the Docker management daemons, and to
> permit easy+secure access to the Windows VMs.
> 
> - Public network access is handled on the host (not the VPN server) using NAT.

I'm afraid I'm not enough of a network sysadmin to be able to find out what
might be wrong in this setup, if anything.
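
For readers who want to sanity-check the addressing in the quoted setup, here is a small sketch using Python's standard ipaddress module. It assumes, based only on the description above, that the VPN-routed 10.150.85.0/24 range is split into two /25 halves, with the lower half on br0 (Docker containers) and the upper half on br1 (Windows VMs). The function name describe is hypothetical.

```python
import ipaddress

# The range the OpenVPN server is described as routing to the VPN client.
VPN_ROUTE = ipaddress.ip_network("10.150.85.0/24")

# Splitting the /24 yields exactly two /25 halves.
lower, upper = VPN_ROUTE.subnets(new_prefix=25)


def describe(address):
    """Say which bridge an address would belong to under the assumed layout."""
    ip = ipaddress.ip_address(address)
    if ip in lower:
        return "br0 (Docker containers)"
    if ip in upper:
        return "br1 (Windows VMs)"
    return "outside the VPN-routed range"


# The host's br0 address and the Windows VM block, as written in the mail.
br0_host = ipaddress.ip_interface("10.150.85.1/25")
windows_block = ipaddress.ip_interface("10.150.85.129/25")
```

With this layout, br0_host.network equals the lower half and windows_block.network equals the upper half, so any configured address can be passed to describe() to see whether it falls inside the range the VPN routes.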

> I've bumped the limits on each anongit node so hopefully that will solve it.
> The limit was a bit on the conservative side anyway.
> 
> If Jenkins is making that number of Git connections at one moment...
> I'd be quite surprised.

I think I saw more anongit errors yesterday, but I didn't write them down. Let's see.

> Indeed. The main reason for performing many builds at once is to
> ensure the small projects don't get blocked up when big items (like
> Qt, PIM and Calligra) do a build.
> They've all been known to tie up a builder for more than an hour per
> build, which has led to a large pile of other builds blocking up behind
> them (which I've received complaints about as well)

I know, but this was a smaller problem than false positives IMHO :-)

-- 
David Faure, faure at kde.org, http://www.davidfaure.fr
Working on KDE Frameworks 5