Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Thu Oct 25 21:15:41 BST 2012

On Thu, Oct 25, 2012 at 8:24 PM, Mark <markg85 at gmail.com> wrote:
> On Thu, Oct 25, 2012 at 2:09 PM, Christoph Feck <christoph at maxiom.de> wrote:
>> On Thursday 25 October 2012 12:20:42 Mark wrote:
>>> So i would end up with a:
>>>         struct SortEntry
>>>         {
>>>                 QByteArray collatingKey;
>>>                 KFileItem *fileItem;
>>>         };
>>>
>>> Where the collatingKey is meant to be what?
>>
>> Let's start with comparing using strcmp. It works like this, omitting
>> details:
>>
>>         while (*s1 && *s2 && *s1 == *s2) {
>>                 ++s1;
>>                 ++s2;
>>         }
>>         return *s1 - *s2;
>>
>> As you see, the code comparing each character is very fast, but does
>> not handle anything special, in particular it does not
>> - compare case insensitive
>> - sort german 'ä' after 'a' but before 'b' ("locale aware")
>> - sort 'a12' after 'a2' ("natural sorting")
>>
>> The solution to the first inability is known to you: You convert each
>> string to lower case before comparing them. This converted string is
>> the actual "sort key", i.e. what you use for the sort's "lessThan"
>> functor:
>>
>>         'A' -> 'a'
>>         'B' -> 'b'
>>         'a' -> 'a'
>>
>> Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking
>> up the actual item which you stored within the "SortEntry".
>
> Till this point i fully understand it.
>>
>> This very same idea can be applied to tackle the second inability. You
>> convert each string to its locale-dependend "collating" string, often
>> just called a "sequence", because it ususally does contain non-
>> characters. Here are example keys that sort 'a' < 'ä' < 'b':
>>
>>         'a' ->  'a'
>>         'b' -> 'b'
>>         'ä' -> 'a' '\0377'
>>
>> The appended 0xFF character ensures that 'ä' is always after 'a',
>> regardless of which other character follows. But it will never be
>> after 'b', because of the first 'a' in the sort key.
>
> This is where things get odd... So, i should place 'ä' before 'a', but
> that means that i have to make a translation table somewhere from
> latin chars to alphabet chars.. Or am i misunderstanding something
> here?
>
> There must be some automatic way to do this, right?
>
>>
>> If you look up the Unicode collating algorithm, you will see that it
>> is much more complicated, but the basic idea is the same. It should
>> not bother you for the initial version; you can simply use glibc
>> function "strxfrm()" to get the collating sequence for your string
>> parts where you would call "localeAwareCompare" on.
>>
>> Also the last inability can be solved in the same way. Simply
>> _prepend_ a code which states the magnitude (number of digits).
>> Example:
>>
>>         'a1234' -> 'a' \004' '123'
>>         'a56' -> 'a' '\002' '56'
>>         'a8' -> 'a' '\001' '8'
>>
>> A combined algorithm could, for example, return these collating
>> sequences as sort keys:
>>
>>         'a12' -> 'a' '\002' '12'
>>         'ä8' -> 'a' '\0377' '\001' '8'
>>         'A6' -> 'a' \001' '6'
>>         'Ä56' -> 'a' '\0377' '\002' '56'
>>
>> If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which
>> might be what you want, but you could also want ('A6', 'ä8', 'a12',
>> 'ä56'), which would require a different algorithm for generating the
>> sort key.
>>
>> Christoph Feck (kdepepo)
>
> Lets skip the third issue for the time being, the locate aware stuff
> seems "slightly" more important then the natural part.
>
> So how about this stuff in the implementations side?
> Am i right that i'm going to get 2 new parts:
>
> 1. A part for building up the collation sequence. That part is outside
> the sorting algorithm.
> 2. The sorting algorithm becomes a LOT less code and simply does
> string compares based on the collation sequences. So no more difficult
> checks like are done now. Thus the sorting would indeed speed up
> massively, right?
>
> For Todd and Christoph, thanks a lot for writing such long informative
> replies! I really appreciate that and it really helps me understand
> how to implement this. Though i obviously still have questions ;)
>
> Cheers,
> Mark

I'm guessing it won't be accepted to use QCollater? ;-) (Qt 5.1 material)
http://qt.gitorious.org/qt/qtbase/blobs/1bf5865a4137c615e7b9b10fd157a32e2bf61274/src/corelib/tools/qcollator.cpp