Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Thu Oct 25 19:24:23 BST 2012

On Thu, Oct 25, 2012 at 2:09 PM, Christoph Feck <christoph at maxiom.de> wrote:
> On Thursday 25 October 2012 12:20:42 Mark wrote:
>> So i would end up with a:
>>         struct SortEntry
>>         {
>>                 QByteArray collatingKey;
>>                 KFileItem *fileItem;
>>         };
>>
>> Where the collatingKey is meant to be what?
>
> Let's start with comparing using strcmp. It works like this, omitting
> details:
>
>         while (*s1 && *s2 && *s1 == *s2) {
>                 ++s1;
>                 ++s2;
>         }
>         return *s1 - *s2;
>
> As you see, the code comparing each character is very fast, but does
> not handle anything special, in particular it does not
> - compare case insensitive
> - sort german 'ä' after 'a' but before 'b' ("locale aware")
> - sort 'a12' after 'a2' ("natural sorting")
>
> The solution to the first inability is known to you: You convert each
> string to lower case before comparing them. This converted string is
> the actual "sort key", i.e. what you use for the sort's "lessThan"
> functor:
>
>         'A' -> 'a'
>         'B' -> 'b'
>         'a' -> 'a'
>
> Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking
> up the actual item which you stored within the "SortEntry".

Till this point i fully understand it.
>
> This very same idea can be applied to tackle the second inability. You
> convert each string to its locale-dependend "collating" string, often
> just called a "sequence", because it ususally does contain non-
> characters. Here are example keys that sort 'a' < 'ä' < 'b':
>
>         'a' ->  'a'
>         'b' -> 'b'
>         'ä' -> 'a' '\0377'
>
> The appended 0xFF character ensures that 'ä' is always after 'a',
> regardless of which other character follows. But it will never be
> after 'b', because of the first 'a' in the sort key.

This is where things get odd... So, i should place 'ä' before 'a', but
that means that i have to make a translation table somewhere from
latin chars to alphabet chars.. Or am i misunderstanding something
here?

There must be some automatic way to do this, right?

>
> If you look up the Unicode collating algorithm, you will see that it
> is much more complicated, but the basic idea is the same. It should
> not bother you for the initial version; you can simply use glibc
> function "strxfrm()" to get the collating sequence for your string
> parts where you would call "localeAwareCompare" on.
>
> Also the last inability can be solved in the same way. Simply
> _prepend_ a code which states the magnitude (number of digits).
> Example:
>
>         'a1234' -> 'a' \004' '123'
>         'a56' -> 'a' '\002' '56'
>         'a8' -> 'a' '\001' '8'
>
> A combined algorithm could, for example, return these collating
> sequences as sort keys:
>
>         'a12' -> 'a' '\002' '12'
>         'ä8' -> 'a' '\0377' '\001' '8'
>         'A6' -> 'a' \001' '6'
>         'Ä56' -> 'a' '\0377' '\002' '56'
>
> If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which
> might be what you want, but you could also want ('A6', 'ä8', 'a12',
> 'ä56'), which would require a different algorithm for generating the
> sort key.
>
> Christoph Feck (kdepepo)

Lets skip the third issue for the time being, the locate aware stuff
seems "slightly" more important then the natural part.

So how about this stuff in the implementations side?
Am i right that i'm going to get 2 new parts:

1. A part for building up the collation sequence. That part is outside
the sorting algorithm.
2. The sorting algorithm becomes a LOT less code and simply does
string compares based on the collation sequences. So no more difficult
checks like are done now. Thus the sorting would indeed speed up
massively, right?

For Todd and Christoph, thanks a lot for writing such long informative
replies! I really appreciate that and it really helps me understand
how to implement this. Though i obviously still have questions ;)

Cheers,
Mark