comments on KDE performance tips

Andrew Morton kde-optimize@mail.kde.org
Mon, 13 Jan 2003 21:26:33 -0800


On Mon January 13 2003 05:40, Alexander Kellett wrote:
>
> On Mon, Jan 13, 2003 at 03:00:02PM +0100, Mathieu Kooiman wrote:
> > Hi,
> >
> > > - use special kernel. What are the desktop kernel options? Pre-emptive
> > > and I don't remember the other. Gentoo has it.
> >
> > The preemptive kernel patch by Robert Love and the Low-Latency by Alan
> > Cox (if my memory serves me right). I highly doubt this would contribute
> > to better 'performance' since these patches will only make the kernel
> > (seem) more responsive under load.
>
> well, performance on a desktop system _is_ responsiveness. (disclaimer: imo)
> fyi it was andrew morton who did the low-latency patch :)

Ingo did the original one.  I did another one (basically the same thing, more
complete) for 2.4.

> more important to me was the i/o disk scheduler stuff
> that went into 2.5,

hm, I seem to be largely to blame for that.  2.5 kernels are generally tons
more responsive when under disk I/O load.

> the dma cdrom patch that gentoo uses

Man, I should start charging for this stuff.

> and possibly the o(1) scheduler, though that
> in some cases made things slower unfortunately.

It does.  Generally the O(1) scheduler improves responsiveness by a large
amount when there is CPU-intensive work going on (eg: compiling things).  But
occasionally it just gets it completely wrong and goes horrid.

The CPU scheduler in 2.5.current is tons better than the one in RH8.0.  And
Andrea Arcangeli's reworked O(1) scheduler, found in the -aa kernels, is by
far the best.

> > I've done some stress testing (although not with KDE) and in 'raw
> > performance' it was even a tad SLOWER than the stock 2.4.20 kernel.
>
> yet in other cases (possible anal) pre-empt makes it faster.
> i know for certain that preempt improved my (as a person) performance :)

Nope.  I dispute that the low-latency patch or the preemptible kernel make a
perceptible difference, because it is *extremely* rare for the kernel to hold
off a context switch for longer than ten milliseconds.  And that is less than
a monitor refresh interval (one frame at 60Hz is about 16.7 milliseconds).
So I don't think it makes any difference at all.

One area which could benefit from kernel help is application startup.  Here
is a `vmstat 1' trace during startup of the OpenOffice word processor:


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  81316  58952  64724 447660    0    0     0     0 1235  1426  2  1 97  0
 0  0  81316  58952  64724 447660    0    0     0     0 1001   430  1  0 100  0
 0  1  81316  56944  64836 449508    0    0  1960     0 1089   802 10  6 70 15
 0  1  81316  52592  64848 453444    0    0  3940   188 1195   513  7  1 50 42
 1  1  81316  47952  64912 457232    0    0  3852     0 1178   724  9  2 57 33
 1  0  81316  45200  64924 459784    8    0  2572     0 1158  1374 21  1 51 27
 1  0  81316  43984  65068 460024    0    0   384     0 1060   431 27  4 63  8
 1  0  81316  42640  65180 460504    0    0   580   388 1057   350 36  1 56  8
 2  0  81316  40408  65180 462484    0    0  1980     0 1031   308 49  2 49  1
 1  0  81316  38120  65184 464788    0    0  2308     0 1039   318 50  1 48  0
 1  0  81316  35688  65184 467220    0    0  2432     0 1038   302 50  1 48  0
 1  0  81316  33400  65184 469524    0    0  2304     0 1040   382 51  1 48  0
 2  0  81316  31112  65192 471828    0    0  2304    48 1040   308 50  1 49  0
 1  0  81316  30408  65192 472500    0    0   672     0 1045   352 51  1 49  0
 1  0  81316  30272  65192 472572    0    0    72     0 1375  1730 53  5 41  2
 0  1  81316  26560  65276 476128    0    0  3632     0 1369   967  7  3 58 34
 0  1  81316  22992  65280 479592    0    0  3468     0 1187   524  1  2 50 49
 0  1  81316  20496  65280 482148    0    0  2556     0 1214   475  1  1 49 50
 0  1  81316  17888  65288 484240    0    0  2092   132 1082   376 41  0 49  9
 0  1  81316  15392  65288 486476    0    0  2236     0 1066   350 39  1 49 12
 0  1  81316  10872  65296 490884    0    0  4416     0 1220   612  3  1 50 46
 0  1  81316   7188  65328 494112    0    0  3260     0 1184   638  9  2 54 36
 0  1  81316   9532  65088 491332    0    0  2720     0 1127   470 17  1 53 29
 1  0  81316   5828  65128 494584    0    0  3284   112 1189   671 14  2 54 31
 0  1  81316   7196  64332 493652    0    0  3736     0 1178   563  9  1 51 38
 0  1  81316   4840  64332 495804    0    0  3308     0 1192   473  3  1 50 47
 0  1  81316   5224  64356 495004    0    0  3108     0 1161   670 11  3 53 33
 0  0  81316   5288  64368 494752    0    0  2164     0 1125  1895 17  4 56 24
 0  0  81316   5224  64376 494796    0    0    44   176 1201   998  1  1 97  0
 0  0  81316   5224  64376 494796    0    0     0     0 1033   610  0  0 99  0

That's twenty-six seconds of disk reading.  It's sustaining maybe 2.5
megabytes per second (the `bi' column) off a disk which can do 25
megabytes/sec - so roughly 65 megabytes are coming in at a tenth of the
disk's streaming speed.

Starting the application again, now that everything is in kernel pagecache,
takes about six seconds.  That's pure compute, and is pretty gross.

So there's almost twenty seconds worth of startup time here which could be
shaved off by improving the IO access patterns.  This would require runtime
analysis of the pagefault pattern, and re-layout of the ELF files based on
that analysis.  Basically we want to pull each executable and shared library
into pagecache in one great big slurp.
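
The slurp itself needs no new kernel work.  A minimal sketch, assuming a
glibc which exposes the readahead(2) syscall that went into 2.4.13 (a plain
read() loop into a scratch buffer would achieve the same thing):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Pull a file's contents into pagecache with one big sequential
 * request, instead of letting demand paging seek all over it. */
static int slurp_into_pagecache(const char *path)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) {
                perror(path);
                if (fd >= 0)
                        close(fd);
                return -1;
        }
        /* One sequential IO stream: the disk can run at full speed. */
        if (readahead(fd, 0, st.st_size) < 0)
                perror("readahead");
        close(fd);
        return 0;
}

int main(int argc, char **argv)
{
        int i;

        /* eg: ./slurp /path/to/binary /path/to/lib*.so */
        for (i = 1; i < argc; i++)
                slurp_into_pagecache(argv[i]);
        return 0;
}

Of course this only wins if the file has a sane linear on-disk layout - see
the note about the linker at the end.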

Kernel help would be needed for the instrumentation/analysis.  The reorg of
the ELF files would be a userspace issue.  One possible solution would be,
once the ELF files are reorganised, to change the libc dynamic loader so that
it starts asynchronous reads against all the libraries which earlier
instrumentation indicated will be needed.
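
The shape of that loader-side change might be something like the sketch
below.  The list-file path and the fork()ed helper are inventions for
illustration - the real thing would live inside ld.so, driven by the library
list which the earlier instrumentation recorded:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Kick off background reads of every library which instrumentation
 * said this application will need.  The parent returns immediately
 * and gets on with normal startup work. */
static void prefetch_listed_libraries(const char *listfile)
{
        FILE *list;
        char path[4096];

        if (fork() != 0)
                return;                 /* parent: continue startup */

        list = fopen(listfile, "r");    /* hypothetical per-app list */
        if (!list)
                _exit(1);
        while (fgets(path, sizeof(path), list)) {
                struct stat st;
                int fd;

                path[strcspn(path, "\n")] = '\0';
                fd = open(path, O_RDONLY);
                if (fd < 0)
                        continue;
                if (fstat(fd, &st) == 0)
                        readahead(fd, 0, st.st_size);
                close(fd);
        }
        fclose(list);
        _exit(0);
}

That way the disk streams the libraries in while the CPU does its mapping and
relocation work, instead of the two being serialised behind each pagefault.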

I assume these applications are also reading zillions of little config, font,
icon, etc files as well.  That'll hurt.  One possible way of speeding that up
would be for the application to maintain its own cache (on-disk) of all the
info which it needed to start itself up.  So on the second startup it can all
be read in in a single swipe.  Obviously, information coherency is an issue
here.
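
Something along these lines, say - a sketch of the read side only, with an
invented record format; keying the coherency check off the source files'
mtimes is one obvious approach:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* One record per cached config/font/icon file.  The cache file is
 * laid out sequentially, so a single startup-time read pulls it all
 * into pagecache in one swipe. */
struct cache_record {
        char   path[256];       /* source file this blob came from */
        time_t mtime;           /* source file's mtime when cached */
        size_t size;            /* length of the blob which follows */
};

/* Return the cached copy of 'path', or NULL if the cache is missing
 * or stale - in which case fall back to reading the real file and
 * rewriting the cache. */
static char *cached_read(FILE *cache, const char *path)
{
        struct cache_record rec;
        struct stat st;
        char *buf;

        rewind(cache);
        while (fread(&rec, sizeof(rec), 1, cache) == 1) {
                if (strcmp(rec.path, path) != 0) {
                        fseek(cache, rec.size, SEEK_CUR);  /* skip blob */
                        continue;
                }
                /* coherency: the real file must not have changed */
                if (stat(path, &st) != 0 || st.st_mtime != rec.mtime)
                        return NULL;
                buf = malloc(rec.size);
                if (buf && fread(buf, 1, rec.size, cache) != rec.size) {
                        free(buf);
                        return NULL;
                }
                return buf;
        }
        return NULL;
}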

It's all a fair bit of work, but *this* is where the gains are to be made.
In the case of this app, the twenty-six seconds can potentially be reduced to
eight seconds by getting the IO scheduling right: six seconds of compute,
plus a couple of seconds to stream those 65 megabytes in at full disk speed.

Incidentally: be aware that the linker lays files out in strange manners - it
seeks all over the file placing bits and pieces everywhere, so there is no
correspondence between offset-into-file and offset-into-disk.  But if you
then take the output of the linker and copy it somewhere else (ie: `make
install') then the copied file _will_ have good linear layout.  This should
be borne in mind when studying application startup times: don't try to
measure the startup time for an executable or library which was laid out on
disk by /usr/bin/ld.