Review Request 124539: Optimize KDevHash for integral types.

Sat Aug 1 17:58:15 UTC 2015

> On Aug. 1, 2015, 6:36 a.m., Olivier Jean de Gaalon wrote:
> > (actually looking at the code)
> > 
> > qHash for integral types is just a cast to uint, maybe would be clearer to do that directly.
> > 
> > Often you end up having to balance the cost of hashing vs the cost of collisions, so a cheaper hash isn't in and of itself a win.
> > 
> > My benchmarking was focused on reducing collisions, so in retrospect it's likely the current implementation is sacrificing real performance. Possibly we shouldn't be taking byte-sized chunks of /any/ data type size (though then padding doesn't help the hash much).
> > 
> > OTOH, benching the hash seems like an exercise in futility: the best hash by this measurement will be "return 0". The cost of any other hash is pure waste unless it's improving the overall performance of the system via reduced collisions. Changes to the hash should then be ignoring these benchmarks and testing if real insertion and lookup patterns are improved... unfortunately benchmarking that is quite a bit more work.
> 
> Milian Wolff wrote:
>     I looked at std::hash, and that one also uses the value directly. I rather use qHash or std::hash as it documents clearly what I want it to be, and if they ever change the implementation it's automagically picked up by us then as well.
>     
>     And I looked at this code as I've seen it as a hotspots a couple of times. A hash function should be fast, but of course you are right in saying that the collision reduction is even more important. But can you explain why using the integral value directly would decrease the randomness (or however one calls it) of the hash function? Especially looking at our current code:
>     
>     
>         // any integral type gets interpreted as const char*
>         static unsigned int hash(const char* const data, const unsigned int size, unsigned int seed = DEFAULT_SEED)
>         {
>             // 4 iterations for (u) int, 8 for q(u)int64
>             for (unsigned int i = 0; i < size; ++i) {
>                 // data[i] gets casted to a (u)int again
>                 seed += data[i];
>                 seed += ( seed << 10 );
>                 seed ^= ( seed >> 6 );
>             }
>             return seed;
>         }
>     
>     wouldn't we achieve a similar result, but much faster, if we simply do
>     
>     
>         // any integral type gets interpreted as const char*
>         static unsigned int hash(uint hash, unsigned int seed = DEFAULT_SEED)
>         {
>             seed += hash;
>             seed += ( seed << 10 );
>             seed ^= ( seed >> 6 );
>             return seed;
>         }
>     
>     Then, when we keep in mind that DEFAULT_SEED is 2166136261u, we are extremely close to what boost::hash_combine does internally:
>     
>         seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
>     
>     Thus I propose I refactor KDev::hash to operate on boost::hash_combine until we get http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf . I cannot think that the boost code is worse than the KDevHash code, quite the contrary.
>     
>     When we do follow this approach, I also want to remove the implicit hash function that converts any T to `const char*` and hashes that. Rather, I want people (i.e. me - in the first step) to write custom qHash or std::hash functions for their types.

I agree that the implementation should be changed to use the much cheaper combination. There's no value in magic that happened to work in the few mediocre benchmarks I did, where I never benchmarked /performance/, only collisions.

The combination is all that really matters, and that's where the pre-kdevhash code was really failing, and boost::hash_combine is obviously an improvement to kdevhash. (IOW, we are effectively doing the same as the no-op std::hash/qhash to get bits from types, but with expensive 8-bit combination instead of uint combination)

So +2 to not even bothering with the sfinae stuff and just punting to std::hash (or qHash) + hash_combine.

Side note: I doubt that kdevhash is used for non-integral types at all (yes, that means the current implementation is /more/ braindead than you think). The whole point of its existence is to act as a nice API so that you don't shoot yourself in the foot writing a custom hash implementation for your type. We should just turn it into a nice wrapper for boost::combine_hash and be done with it.

- Olivier Jean de

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/124539/#review83272
-----------------------------------------------------------

On July 30, 2015, 10:04 p.m., Milian Wolff wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/124539/
> -----------------------------------------------------------
> 
> (Updated July 30, 2015, 10:04 p.m.)
> 
> 
> Review request for KDevelop.
> 
> 
> Repository: kdevplatform
> 
> 
> Description
> -------
> 
> Optimize KDevHash for integral types.
> 
> For integral types, we now reuse qHash internally and only
> fall back to the O(sizeof(T)) implementation for other
> types.
> 
> Before:
> 
> ********* Start testing of TestKDevHash *********
> Config: Using QtTest library 5.5.0, Qt 5.5.0 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 5.1.0)
> PASS   : TestKDevHash::initTestCase()
> PASS   : TestKDevHash::benchHash_int()
> RESULT : TestKDevHash::benchHash_int():
>      0.075 msecs per iteration (total: 77, iterations: 1024)
> PASS   : TestKDevHash::benchHash_uint()
> RESULT : TestKDevHash::benchHash_uint():
>      0.075 msecs per iteration (total: 77, iterations: 1024)
> PASS   : TestKDevHash::benchHash_quint64()
> RESULT : TestKDevHash::benchHash_quint64():
>      0.15 msecs per iteration (total: 77, iterations: 512)
> PASS   : TestKDevHash::benchHash_bool()
> RESULT : TestKDevHash::benchHash_bool():
>      0.018 msecs per iteration (total: 77, iterations: 4096)
> PASS   : TestKDevHash::cleanupTestCase()
> Totals: 6 passed, 0 failed, 0 skipped, 0 blacklisted
> ********* Finished testing of TestKDevHash *********
> 
> After:
> 
> ********* Start testing of TestKDevHash *********
> Config: Using QtTest library 5.5.0, Qt 5.5.0 (x86_64-little_endian-lp64 shared (dynamic) release build; by GCC 5.1.0)
> PASS   : TestKDevHash::initTestCase()
> PASS   : TestKDevHash::benchHash_int()
> RESULT : TestKDevHash::benchHash_int():
>      0.024 msecs per iteration (total: 51, iterations: 2048)
> PASS   : TestKDevHash::benchHash_uint()
> RESULT : TestKDevHash::benchHash_uint():
>      0.024 msecs per iteration (total: 51, iterations: 2048)
> PASS   : TestKDevHash::benchHash_quint64()
> RESULT : TestKDevHash::benchHash_quint64():
>      0.026 msecs per iteration (total: 54, iterations: 2048)
> PASS   : TestKDevHash::benchHash_bool()
> RESULT : TestKDevHash::benchHash_bool():
>      0.022 msecs per iteration (total: 92, iterations: 4096)
> PASS   : TestKDevHash::cleanupTestCase()
> Totals: 6 passed, 0 failed, 0 skipped, 0 blacklisted
> ********* Finished testing of TestKDevHash *********
> 
> 
> Diffs
> -----
> 
>   language/CMakeLists.txt 2c2b0285b68d209e07af185c43dcf157507f5a54 
>   language/util/kdevhash.h 40c32935aec9c7e5552769e780c033e96cf63166 
>   language/util/tests/CMakeLists.txt PRE-CREATION 
>   language/util/tests/test_kdevhash.cpp PRE-CREATION 
>   plugins/projectfilter/projectfilter.h 014e5a1eabbb3a2d9e71baeee3e38643cb8649e7 
> 
> Diff: https://git.reviewboard.kde.org/r/124539/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Milian Wolff
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.kde.org/pipermail/kdevelop-devel/attachments/20150801/dd846fed/attachment.html>