comments on KDE performance tips
Andrew Morton
kde-optimize@mail.kde.org
Mon, 13 Jan 2003 21:26:33 -0800
On Mon January 13 2003 05:40, Alexander Kellett wrote:
>
> On Mon, Jan 13, 2003 at 03:00:02PM +0100, Mathieu Kooiman wrote:
> > Hi,
> >
> > > - use special kernel. What are the desktop kernel options? Preemptive and
> > > I don't remember the other. Gentoo has it.
> >
> > The preemptive kernel patch by Robert Love and the low-latency patch by Alan Cox
> > (if my memory serves me right). I highly doubt this would contribute to better
> > 'performance' since these patches will only make the kernel (seem) more
> > responsive under load.
>
> well, performance on a desktop system _is_ responsiveness. (disclaimer: imo)
> fyi it was andrew morton who did the low-latency patch :)
Ingo did the original one. I did another one (basically the same thing, more
complete) for 2.4.
> more important to me was the i/o disk scheduler stuff
> that went into 2.5,
hm, I seem to be largely to blame for that. 2.5 kernels are generally tons
more responsive when under disk I/O load.
> the dma cdrom patch that gentoo uses
Man, I should start charging for this stuff.
> and possibly the o(1) scheduler, though that
> in some cases made things slower, unfortunately.
It does. Generally the O(1) scheduler improves responsiveness by a large
amount when there is CPU-intensive work going on (eg: compiling things). But
occasionally it just gets it completely wrong and goes horrid.
The CPU scheduler in 2.5.current is tons better than the one in RH8.0. And
Andrea Arcangeli's reworked O(1) scheduler, in the -aa kernels, is by far
the best.
> > I've done some stresstesting (although not with KDE) and in 'raw performance'
> > it was even a tad SLOWER than the stock 2.4.20 kernel.
>
> yet in other cases (possible anal) pre-empt makes it faster.
> i know for certain that preempt improved my (as a person) performance :)
Nope. I dispute that the low-latency patch or the preemptible kernel make a
perceptible difference, because it is *extremely* rare for the kernel to
hold off a context switch for longer than ten milliseconds. That is less
than a monitor refresh interval, so I don't think it makes any difference at
all.
One area which could benefit from kernel help is application startup. Here
is a `vmstat 1' trace during startup of the OpenOffice word processor:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0  81316  58952  64724 447660    0    0     0     0 1235  1426  2  1  97  0
 0  0  81316  58952  64724 447660    0    0     0     0 1001   430  1  0 100  0
 0  1  81316  56944  64836 449508    0    0  1960     0 1089   802 10  6  70 15
 0  1  81316  52592  64848 453444    0    0  3940   188 1195   513  7  1  50 42
 1  1  81316  47952  64912 457232    0    0  3852     0 1178   724  9  2  57 33
 1  0  81316  45200  64924 459784    8    0  2572     0 1158  1374 21  1  51 27
 1  0  81316  43984  65068 460024    0    0   384     0 1060   431 27  4  63  8
 1  0  81316  42640  65180 460504    0    0   580   388 1057   350 36  1  56  8
 2  0  81316  40408  65180 462484    0    0  1980     0 1031   308 49  2  49  1
 1  0  81316  38120  65184 464788    0    0  2308     0 1039   318 50  1  48  0
 1  0  81316  35688  65184 467220    0    0  2432     0 1038   302 50  1  48  0
 1  0  81316  33400  65184 469524    0    0  2304     0 1040   382 51  1  48  0
 2  0  81316  31112  65192 471828    0    0  2304    48 1040   308 50  1  49  0
 1  0  81316  30408  65192 472500    0    0   672     0 1045   352 51  1  49  0
 1  0  81316  30272  65192 472572    0    0    72     0 1375  1730 53  5  41  2
 0  1  81316  26560  65276 476128    0    0  3632     0 1369   967  7  3  58 34
 0  1  81316  22992  65280 479592    0    0  3468     0 1187   524  1  2  50 49
 0  1  81316  20496  65280 482148    0    0  2556     0 1214   475  1  1  49 50
 0  1  81316  17888  65288 484240    0    0  2092   132 1082   376 41  0  49  9
 0  1  81316  15392  65288 486476    0    0  2236     0 1066   350 39  1  49 12
 0  1  81316  10872  65296 490884    0    0  4416     0 1220   612  3  1  50 46
 0  1  81316   7188  65328 494112    0    0  3260     0 1184   638  9  2  54 36
 0  1  81316   9532  65088 491332    0    0  2720     0 1127   470 17  1  53 29
 1  0  81316   5828  65128 494584    0    0  3284   112 1189   671 14  2  54 31
 0  1  81316   7196  64332 493652    0    0  3736     0 1178   563  9  1  51 38
 0  1  81316   4840  64332 495804    0    0  3308     0 1192   473  3  1  50 47
 0  1  81316   5224  64356 495004    0    0  3108     0 1161   670 11  3  53 33
 0  0  81316   5288  64368 494752    0    0  2164     0 1125  1895 17  4  56 24
 0  0  81316   5224  64376 494796    0    0    44   176 1201   998  1  1  97  0
 0  0  81316   5224  64376 494796    0    0     0     0 1033   610  0  0  99  0
That's twenty-six seconds of disk reading. It's sustaining maybe 2.5
megabytes per second off a disk which can do 25 megabytes/sec.
Starting the application again, now that everything is in kernel pagecache,
takes about six seconds. That's pure compute, and is pretty gross.
So there's almost twenty seconds' worth of startup time here which could be
shaved off by improving the IO access patterns. This would require runtime
analysis of the pagefault pattern, and re-layout of the elf files based on
that analysis. Basically we want to pull each executable and shared library
into pagecache in one great big slurp.
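As an illustration of the "one great big slurp" idea, here is a minimal
sketch using Linux's readahead(2) syscall; the helper name is mine, not
anything from the thread, and a real tool would run this over each
executable and shared library before the application faults them in
piecemeal:

```c
/* Sketch: pull a whole file into the pagecache in one sequential slurp,
 * instead of thousands of seek-bound pagefaults at load time.
 * Assumes Linux's readahead(2); slurp_into_pagecache is a hypothetical name. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int slurp_into_pagecache(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) == 0)
        /* One big readahead covering the entire file; the pages stay
         * in the pagecache after we close the descriptor. */
        readahead(fd, 0, st.st_size);
    close(fd);
    return 0;
}
```

This only helps, of course, if the file's blocks are laid out linearly on
disk - which is exactly the re-layout problem described above.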
Kernel help would be needed for the instrumentation/analysis. The reorg of
the elf files would be a userspace issue. One possible solution would be,
once the elf files are reorganised, to change the libc dynamic loader so
that it starts asynchronous reads against all the libraries which earlier
instrumentation indicated will be needed.
I assume these applications are also reading zillions of little config,
font, icon, etc files as well. That'll hurt. One possible way of speeding
that up would be for the application to maintain its own on-disk cache of
all the info which it needs to start itself up. So on the second startup it
can all be read in with a single swipe. Obviously, information coherency is
an issue here.
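To make the coherency problem concrete, one crude approach is to trust the
cache only while it is newer than the directory holding the originals.
Everything here is a hypothetical sketch of that idea, not an API from any
real toolkit:

```c
/* Sketch: validity check for an application-side startup cache
 * (a single file concatenating the many small config/font/icon files).
 * cache_is_fresh is a hypothetical name. */
#include <sys/stat.h>

static int cache_is_fresh(const char *cache_path, const char *config_dir)
{
    struct stat c, d;
    if (stat(cache_path, &c) != 0 || stat(config_dir, &d) != 0)
        return 0;   /* no cache, or no originals: rebuild */
    /* Adding or removing a file bumps the directory mtime; an in-place
     * edit to an existing file would need per-file checks to catch. */
    return c.st_mtime >= d.st_mtime;
}
```

If the check passes, the application reads the one cache file in a single
sequential swipe; otherwise it falls back to the slow path and rewrites the
cache.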
It's all a fair bit of work, but *this* is where the gains are to be made.
In the case of this app, the twenty-six seconds could potentially be
reduced to eight seconds by getting the IO scheduling right.
Incidentally: be aware that the linker lays files out in strange manners -
it seeks all over the file placing bits and pieces everywhere, so there is
no correspondence between offset-into-file and offset-into-disk. But if you
then take the output of the linker and copy it somewhere else (ie: `make
install') then the copied file _will_ have good linear layout. This should
be borne in mind when studying application startup times: don't try to
measure the startup time for an executable or library which was laid out on
disk by /usr/bin/ld.