Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mark markg85 at gmail.com
Thu Oct 25 11:20:42 BST 2012


On Thu, Oct 25, 2012 at 11:41 AM, Frank Reininghaus
<frank78ac at googlemail.com> wrote:
> Hi Mark,
>
> 2012/10/25 Mark:
>> On Wed, Oct 24, 2012 at 1:13 AM, Christoph Feck wrote:
>>> Hi,
>>>
>>> you wrote:
>>>> Copy KStringHandler::naturalCompare into dolphin for further in
>>>> depth optimizations
>>>
>>> A good read about locale-aware sorting is this link from glibc doc:
>>>
>>> http://www.gnu.org/software/libc/manual/html_node/Collation-
>>> Functions.html
>>>
>>> Basically, it says to get an effective locale-aware sorting, you need
>>> to cache a sort key ("collating sequence") for each string, then use
>>> that for the sort algorithm. That sort key could additionally encode
>>> the digit sequences in "magnitude - digits" order, so that "a12" comes
>>> after "a2", and do case mapping for case insensitive compares.
>>>
>>> If you sort N = 200.000 entries, you only have to create 200.000
>>> collating sequences, instead of 2*N*log(N) = 7.000.000. Speed of
>>> comparing the sequences is that of a memcpy() if done properly. See
>>> Thiago's posts about QString performance optimizations.
>>>
>>> So what you have to work on before doing anything else, is to ask
>>> Frank, if he is okey with adding a sort key (QByteArray basically) for
>>> each file item. Your code would have to make sure it is always updated
>>> on file renames, locale changes, sort mode changes (natural vs.
>>> simple) etc.
>>>
>>> For more information about collating, I already gave you the link for
>>> the official Unicode collating code. You would need to improve the
>>> algorithm to include the "magnitude - digits" coding.
>>>
>>> Christoph Feck (kdepepo)
>>> KDE Quality Team
>>
>> Hi Christoph,
>>
>> Thanks a lot for your suggestion! However, there is a "bit" of a
>> knowledge problem here. I don't have the knowledge - by far - to even
>> implement something like you suggest or to even fully understand what
>> thiago wrote about regarding string optimizations. I like to read that
>> kind of stuff and i understand it partly, but implementing it is a
>> whole different story.
>>
>> I'm getting there in knowledge, just give me a few more years :)
>>
>> I guess i just keep it to the "simpler" optimizations for now. Unless
>> someone would like to help me implementing your suggestion?
>
> But which 'simple' optimizations that could yield a considerable
> performance improvement remain? Christoph's suggestion is essentially
> what you call 2.1 and 2.2.

With "simple" i mean "doable" and "understandable" for me. I'm not at
the programming level of you, Christoph and certainly not Thiago so i
have to take the routes i can manage. Yes, i will certainly try to
understand how Christoph' suggestion works and implement it, but don't
expect i can.

Lets just wait and see what i can come up with, oke? :)
Perhaps just doing 2.2 might already speed up naturalCompare
massively. I don't know, still have to implement and benchmark it.
>
> I like the idea, and I'm willing to review any patches, of course, but
> I'm afraid I'm too busy with other things to work on the code myself
> :-(
>
> Best regards,
> Frank

Question for that GCC link (in an attempt to understand it)
So i would end up with a:
        struct SortEntry
        {
                QByteArray collatingKey;
                KFileItem *fileItem;
        };

Where the collatingKey is meant to be what? Could you give an example
for - lets say - 5 different names?
- a0.txt
- 111.txt
- A01.txt
- aaa.txt
- a2.txt

I don't really understand what the "collating sequence" is supposed to
be.. The GCC link talks about it, but it still doesn't help me
understand what it is exactly. If one of you can explain it in simple
terms with the example files above, that would certainly help me.

Where i expect the end result to look like this (this is how windows
sorts it so lets take that as an example):
111.txt
a0.txt
A01.txt
a1.txt
aaa.txt

Cheers,
Mark




More information about the kfm-devel mailing list