Jenkins-kde-ci (many CI failures)

Ben Cooksley bcooksley at kde.org
Sun May 8 11:06:02 UTC 2016


On Sun, May 8, 2016 at 2:44 AM, David Faure <faure at kde.org> wrote:
> kdewebkit just failed with "Broken pipe" (the TCP error you mentioned)
> (and kxmlrpcclient failed again with an anongit error). This is like playing whack-a-mole...

Yeah :( Fortunately the Broken Pipe error is the least common one.

>
> I thought TCP was more robust than that. Would it help to increase some
> TCP-related timeout somewhere?

TCP should definitely be more reliable, I agree.
I suspect the root cause of the Broken Pipe issue will be the same as
the Temporary failure in name resolution error.

The /etc/hosts fix should be deployed shortly - the images are rebuilding now.

The only thing I can think of at the moment is some kind of traffic
storm on the network bridge which disrupts ARP or something similar at
that level when one or more containers start/stop in a short amount of
time. This could very well be Docker itself determining which IP / MAC
addresses it can use for the newly starting container - with
connections being broken and data lost when it steps on one that is in
use. I do seem to recall having the issue with the KVM setup as well,
albeit to a lesser extent. We definitely didn't have it with the LXC
containers though, but those all had public IP addresses of some form
or another (one was public IPv6 only, with NAT IPv4).

The current setup (using one machine as an example, they're all
identical except for the IP ranges used):

- Normal Linux bridges, set up using Debian's /etc/network/interfaces
and bridge utilities.
- Host takes 10.150.85.1/25 (br0) and 10.150.85.129/25 (br1)

- Docker containers are allocated the rest of the 10.150.85.1/25 IP
block, and are connected to the corresponding bridge (br0)
- Windows virtual machines are allocated static IP addresses in the
10.150.85.129/25 block, on the corresponding bridge (br1)

- VPN connection is established using OpenVPN, with the OpenVPN server
routing 10.150.85.0/24 to the VPN client. Only traffic within the
10.150.0.0/16 subnet will be sent over the VPN. This is done to
permit secure communication with the Docker management daemons, and to
permit easy+secure access to the Windows VMs.

- Public network access is handled on the host (not the VPN server) using NAT.
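
As a rough illustration of the address layout described above (not taken
from the actual host configuration), the split can be checked with
Python's ipaddress module. The variable names are made up for this
sketch; the addresses are the ones quoted in the list, and the /16 is
derived as the supernet containing the routed /24.

```python
import ipaddress

# Addresses as quoted in the list above; variable names are illustrative only.
routed = ipaddress.ip_network("10.150.85.0/24")       # routed to the VPN client
br0_host = ipaddress.ip_interface("10.150.85.1/25")   # host address on br0
windows = ipaddress.ip_interface("10.150.85.129/25")  # address quoted for the Windows block

docker_block = br0_host.network    # network that the br0 address belongs to
windows_block = windows.network    # network that the Windows addresses belong to

# The routed /24 splits into exactly these two /25 halves, with no overlap.
halves = list(routed.subnets(new_prefix=25))
assert halves == [docker_block, windows_block]
assert not docker_block.overlaps(windows_block)

# The /16 that contains the routed /24 (the scope of traffic sent over the VPN).
vpn_scope = routed.supernet(new_prefix=16)
assert routed.subnet_of(vpn_scope)

# Addresses left for Docker containers on br0 once the host has taken its own.
container_addresses = [h for h in docker_block.hosts() if h != br0_host.ip]

print("Docker block: ", docker_block)
print("Windows block:", windows_block)
print("VPN scope:    ", vpn_scope)
print("Container addresses available on br0:", len(container_addresses))
```

Deriving the blocks from the interface addresses, rather than typing the
network addresses by hand, makes a mistyped octet show up as a failed
assertion.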

>
> On Saturday 07 May 2016 23:51:43 Ben Cooksley wrote:
>>
>> We're exposing some flaws that were previously hidden by the fact we
>> only did a maximum of 3 builds at a time and for a while there, less
>> than that.
>> Now we do quite a few more (up to 9 at a time I think)
>
> Maybe we should lower this again then.
>
> anongit could be taught to whitelist CI nodes to not ever treat their requests as a
> DoS attack, too.

I've bumped the limits on each anongit node so hopefully that will solve it.
The limit was a bit on the conservative side anyway.

If Jenkins is making that number of Git connections at one moment...
I'd be quite surprised.
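
For illustration only, here is a toy sketch of the idea discussed above:
a per-source limit on concurrent connections, plus an exemption list for
CI nodes. It is not how the anongit nodes are actually configured; the
class, its method names, the limit value and the addresses (taken from
the documentation-only ranges) are all invented for this example.

```python
from collections import Counter


class ConnectionLimiter:
    """Toy per-source cap on concurrent connections, with an exempt list."""

    def __init__(self, limit, exempt=()):
        self.limit = limit            # max concurrent connections per source
        self.exempt = set(exempt)     # sources that are never limited
        self.active = Counter()       # currently open connections per source

    def try_open(self, source):
        """Return True and count the connection if it is allowed."""
        if source not in self.exempt and self.active[source] >= self.limit:
            return False
        self.active[source] += 1
        return True

    def close(self, source):
        """Record that one connection from this source has finished."""
        if self.active[source] > 0:
            self.active[source] -= 1


# Hypothetical addresses: one exempt CI node and one ordinary client.
CI_NODE = "192.0.2.10"
OTHER = "198.51.100.7"

limiter = ConnectionLimiter(limit=2, exempt={CI_NODE})
ci_results = [limiter.try_open(CI_NODE) for _ in range(5)]
other_results = [limiter.try_open(OTHER) for _ in range(3)]
print(ci_results)     # the exempt node is never refused
print(other_results)  # the third simultaneous connection is refused
```

Raising the limit and exempting known hosts are independent knobs: the
first loosens the cap for everyone, the second only for the listed
sources.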

>
> I think we also build stuff more often because nowadays when committing to
> kcoreaddons, all frameworks depending on kcoreaddons get rebuilt, IIRC this
> didn't use to happen. It's a good feature, but not if it creates too many false failures.

Agreed.

>
> A CI that has spurious failures 10 times a day only teaches people to "ignore CI noise" :(
> I'd rather see it slow and reliable, than fast and unreliable.

Indeed. The main reason for performing many builds at once is to
ensure the small projects don't get blocked up when big items (like
Qt, PIM and Calligra) do a build.
They've all been known to tie up a builder for more than an hour per
build, which has led to a large pile of other builds blocking up behind
them (which I've received complaints about as well).
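
To make the blocking effect concrete, here is a small simulation. It is
only a sketch: the durations are hypothetical, and it assumes builds are
started strictly in queue order on whichever builder frees up first,
which is a simplification of what Jenkins actually does.

```python
import heapq


def finish_times(durations, executors):
    """Finish time of each queued build, in queue order.

    Builds start in queue order on the first executor to become idle.
    """
    free_at = [0] * executors   # min-heap of times at which executors go idle
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available executor
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


# Hypothetical queue (minutes): one 90-minute build, then four 5-minute builds.
queue = [90, 5, 5, 5, 5]
one_builder = finish_times(queue, 1)
three_builders = finish_times(queue, 3)
print(one_builder)      # small builds all wait behind the big one
print(three_builders)   # small builds finish while the big one is still running
```

With a single builder every small build finishes after the 90-minute
one; with three, the small builds are done within ten minutes.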

>
> --
> David Faure, faure at kde.org, http://www.davidfaure.fr
> Working on KDE Frameworks 5
>

Cheers,
Ben


More information about the Kde-frameworks-devel mailing list