-Bdirect linking patch

Fri Nov 4 11:54:48 CET 2005

On Thursday 03 November 2005 23:42, Bastian, Waldo wrote:
> http://sourceware.org/ml/binutils/2005-10/msg00436.html

Obviously this patch is not really welcome ;-(
Of course prelinking is better, but often not possible or
inpracticable.

I noticed myself in profile experiments that symbol lookup
often is really a cache burden, both adding latency because
of main memory accesses and evicting quite some data
from L1/L2, leading to subsequent cache misses.

This stems from the fact that a symbol lookup does a hash table lookup
for every shared library loaded (until the symbol is found).
One hash table lookup on a P4 loads 64 bytes into L1 (and 128 byte into
L2), but only uses 4 bytes of it.
For every such lookup, additionally a string compare
has to be started, leading to eviction of another cache line.
Now take the number of shared libs for a KDE application (or OO)
and hash conflict cases into account, you probably get around 15 hash
lookups per symbol lookups on average, summing up to 2kB evicted from a
8KB L1 cache (on P4). Note that this 2kB most often has to be loaded
from main memory, leading to quite some slowdown because of memory
access latency.

It probably would be better to dynamically build up a persistant, fixed
size and mmapable large hash table for the most often used symbol lookups in
a system, which would allow to satisfy a symbol lookup with one hash table
lookup only. For this, it probably would be good to make a hash table entry
32 or 64 byte, and fill the space with the symbol names, such that one
cache line load is probably enough for both the lookup and the string
comparision. The hash table size itself will grow, but this is not important
as almost every hash table access usually leads to a main memory access,
and it is better to use the data which is loaded either way.

Any takers?
To be fair, I can not predict any speedup reachable with this technique, but
it should match the above mentioned -Bdirect patch, without changing the linker
and thus, existing binaries.
It seems to be a good idea, and we should have real numbers.

If you read about this low-level boring cache stuff until here :-),
I have something to do your own experiments:

Callgrind has an option --cacheuse=yes to enable further cache use statistics.
It will provide you with the amount of data which was loaded into cache
without ever being really used. I call this metric "spatial loss", abreviated
SpLoss1 for L1 and SpLoss2 for L2.
You should compare these values with the total amount of data being loaded
into L1 or L2 to get an idea of the percentage of transfer bandwidth wasted:
E.g. the total amount of data loaded into L2 is the number of L2 cache misses
multiplied with the L2 cache line size. In KCachegrind, you would write
"64 * L2m" as formula for a new event type (the simulator uses a 64 byte
L2 cache line for a P4).
This way, you can directly compare total vs. wasted loads side by side.

Be aware: Using "--cacheuse=yes" produces profile data with 13 event types.
Unfortunately, KCachegrind from KDE 3.4.x has a hard coded limit of 10 for
the maximal numbers of event types it can load, and produces a nonsense
error message like
  kcachegrind: ERROR: No event line found. Skipping './callgrind.out.22474
when you try to load such files. In KDE 3.5, I changed this
to 13.
For KDE 3.4.x, you need to modify the line
	#define MaxRealIndexValue 10
in kdesdk/kcachegrind/kcachegrind/tracedata.h, class TraceCost, to be able to
load these profile data files.

Josef