KHTML-Measurement (from my talk at aKademy)

Thu Aug 26 12:08:15 CEST 2004

Hello,

listeners to my talk (Performance Analysis) at aKademy will remember the 
example I had in the demo: Rendering of a large HTML-Page (500kB manual, i.e. 
text).

There, OProfile did show that  20% of time (on my Centrino-Notebook!) was 
spent in khtml:RenderObject:findNextLayer. If you look at the inclusive time 
cost (i.e. with functions called from there - currently, this has to be done 
by hand by looking at the call graph from the simulation), it is more in the 
range of 30%.

In contrast, in simulation with Calltree/Callgrind this function was way at 
the bottom (perhaps 1%). This shows that there can be huge differences to 
reality.

I looked a little bit into it, and have at least some explanation for this.
First, it does not seem to be a bug in my simulation tool :-) If you suspect 
things like this, look at the annotated disassembler, and check if this makes 
sense to you. Any bug reports always welcome!

Second, OProfile gives you only results for one special system, i.e. a 
Pentium-M running at 1,4 GHz with DDR-266 memory. So results can be quite 
different to e.g. a P4 or Athlon.

The simulation currently does a time estimation by looking at memory/cache 
behaviour only. The above function seems to be quite cache-friendly, so 
running with/without cache simulation does not change much.

Another source for stall time (wasted time) on modern processors is branch 
misprediction. Work is done in a pipelined manner. And if there is a 
mispredicted jump target, the pipeline has to be flushed. If I remember 
right, an Athlon or P-III has a pipeline length of around 15, a P4 has 21, 
and a P4 Prescott has 30. So every mispredicted branch costs around 15 cycles 
wasted time on my notebook.

Observation: The above function was called around 350000 times in a recursive 
manner (this information comes from simulation!), and it contains 2 virtual 
function calls and around 5 condition jumps. Conditional jumps in recursive 
functions are *BAD* for any branch prediction algorithm. The same goes for 
any indirect calls/jumps (like virtual functions in C++). So you get around 
2,5 millions of branch predictions in this function. Multiplying this with 
e.g. 15 wasted cycles for every misprediction, the function gets more to the 
top.

If I would implement some simple branch prediction algorithm in the simulator, 
I think the time estimation of the simulation will be nearer to reality.

So what to do here? Changing the implementation seems not to be easy/usefull. 
Better to avoid the 350000 calls to findNextLayer when building up large 
webpages. The function here is most often called from 
khtml::RenderContainer::addChild. Perhaps there is a possibility for caching 
the last layer used?

But I'm not a KHTML Guru. Comments?

Josef