KHTML-Measurement (from my talk at aKademy)
Josef Weidendorfer
Josef.Weidendorfer at gmx.de
Thu Aug 26 12:08:15 CEST 2004
Hello,
listeners to my talk (Performance Analysis) at aKademy will remember the
example I had in the demo: Rendering of a large HTML-Page (500kB manual, i.e.
text).
There, OProfile did show that 20% of time (on my Centrino-Notebook!) was
spent in khtml:RenderObject:findNextLayer. If you look at the inclusive time
cost (i.e. with functions called from there - currently, this has to be done
by hand by looking at the call graph from the simulation), it is more in the
range of 30%.
In contrast, in simulation with Calltree/Callgrind this function was way at
the bottom (perhaps 1%). This shows that there can be huge differences to
reality.
I looked a little bit into it, and have at least some explanation for this.
First, it does not seem to be a bug in my simulation tool :-) If you suspect
things like this, look at the annotated disassembler, and check if this makes
sense to you. Any bug reports always welcome!
Second, OProfile gives you only results for one special system, i.e. a
Pentium-M running at 1,4 GHz with DDR-266 memory. So results can be quite
different to e.g. a P4 or Athlon.
The simulation currently does a time estimation by looking at memory/cache
behaviour only. The above function seems to be quite cache-friendly, so
running with/without cache simulation does not change much.
Another source for stall time (wasted time) on modern processors is branch
misprediction. Work is done in a pipelined manner. And if there is a
mispredicted jump target, the pipeline has to be flushed. If I remember
right, an Athlon or P-III has a pipeline length of around 15, a P4 has 21,
and a P4 Prescott has 30. So every mispredicted branch costs around 15 cycles
wasted time on my notebook.
Observation: The above function was called around 350000 times in a recursive
manner (this information comes from simulation!), and it contains 2 virtual
function calls and around 5 condition jumps. Conditional jumps in recursive
functions are *BAD* for any branch prediction algorithm. The same goes for
any indirect calls/jumps (like virtual functions in C++). So you get around
2,5 millions of branch predictions in this function. Multiplying this with
e.g. 15 wasted cycles for every misprediction, the function gets more to the
top.
If I would implement some simple branch prediction algorithm in the simulator,
I think the time estimation of the simulation will be nearer to reality.
So what to do here? Changing the implementation seems not to be easy/usefull.
Better to avoid the 350000 calls to findNextLayer when building up large
webpages. The function here is most often called from
khtml::RenderContainer::addChild. Perhaps there is a possibility for caching
the last layer used?
But I'm not a KHTML Guru. Comments?
Josef
More information about the Kde-optimize
mailing list