KHTML-Measurement (from my talk at aKademy)

Thu Aug 26 18:15:33 CEST 2004

On Thursday 26 August 2004 15:32, Simon Perreault wrote:
> On Thursday August 26 2004 6:08, Josef Weidendorfer wrote:
> > Another source for stall time (wasted time) on modern processors is
> > branch misprediction. Work is done in a pipelined manner. And if there is
> > a mispredicted jump target, the pipeline has to be flushed. If I remember
> > right, an Athlon or P-III has a pipeline length of around 15, a P4 has
> > 21, and a P4 Prescott has 30. So every mispredicted branch costs around
> > 15 cycles wasted time on my notebook.
>
> I have been doing a lot of optimization of numerical software lately,
> nothing related to KDE, but heavy optimization nevertheless. Even for
> numerical computations, optimizing for branch prediction is at a way too
> low level. The effort will be huge for almost no gain. And, as you say,
> different CPUs have different algorithms, so this almost always negates any
> gain you might have. You'd better leave that optimization to compilers that
> feature profile-guided optimization, like Intel's. Never optimize at a
> level below the compiler's.

You are right.

But I never suggested optimizing the given function e.g. by doing strange 
things like inline assembler in KDE code. I simply stated the only 
explanation I can imagine for the difference of simulation results to 
reality.

I suggested getting rid of most of the calls to this function at all. And that 
is a very valid optimization possible in the scope of KDE applications.
In this case, it seems possible by doing some caching.
In my experience, when something is slow in GUI code, often things are done 
multiple times redundantly (e.g. unneeded repaints).
This is totally different from numerical computations, where one can not 
easily get rid of floating point operations, as they are needed by the 
numerical algorithm.

> By this I also suggest that Calltree/Callgrind should not be used to
> profile CPU-limited program, only for memory-limited programs. But if you

Hmm... How will you know before actually seeing measurements?
The advantage of calltree is that it is easier to setup and get some results, 
in contrast to e.g. OProfile (where you need root access). Additionally, you 
get call-graph and call counts, which is currently not possible with 
OProfile. This alone will make results easier to understand. Cache simulation 
can be left switched off.
And if results suggest that there are unneeded function calls, that's fine. 
Get rid of them, and it will make your program faster, regardless whether it 
is memory-bound or CPU-bound.

If you want to invest more time on analysis, it is always better to measure 
reality, and every optimization has to be checked against runtime figures.
But then, results from OProfile are difficult to understand, are subject to 
overlapping effects of multiple obscure microarchitecture issues, to 
processor erratas, and change with every processor.

As I said, I plan to add a heuristic to KCachegrind to extract the callgraph 
from a simulation run, and use this with sampling data from OProfile to get 
inclusive costs.

> really want to implement some sort of branch prediction algorithm
> simulation, you could simply use the statistical results of such an
> algorithm. On current processors, branch prediction is correct 90-95% of
> the time. So I imagine you could adjust the simulation using these figures,
> rather than implementing an algorithm directly.

Adding statistics gives you no more results at all. The only thing interesting 
here is to see if there are functions where the misprediction ratio is not in 
the 90%, and for this you need real simulation.

Josef