Ideas for improving sorting. Goal: under 1 second for 200.000 files.
Christoph Feck
christoph at maxiom.de
Thu Oct 25 13:09:31 BST 2012
On Thursday 25 October 2012 12:20:42 Mark wrote:
> So i would end up with a:
> struct SortEntry
> {
> QByteArray collatingKey;
> KFileItem *fileItem;
> };
>
> Where the collatingKey is meant to be what?
Let's start with comparing using strcmp. It works like this, omitting
details:
while (*s1 && *s2 && *s1 == *s2) {
++s1;
++s2;
}
return *s1 - *s2;
As you see, the code comparing each character is very fast, but does
not handle anything special, in particular it does not
- compare case insensitive
- sort german 'ä' after 'a' but before 'b' ("locale aware")
- sort 'a12' after 'a2' ("natural sorting")
The solution to the first inability is known to you: You convert each
string to lower case before comparing them. This converted string is
the actual "sort key", i.e. what you use for the sort's "lessThan"
functor:
'A' -> 'a'
'B' -> 'b'
'a' -> 'a'
Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking
up the actual item which you stored within the "SortEntry".
This very same idea can be applied to tackle the second inability. You
convert each string to its locale-dependend "collating" string, often
just called a "sequence", because it ususally does contain non-
characters. Here are example keys that sort 'a' < 'ä' < 'b':
'a' -> 'a'
'b' -> 'b'
'ä' -> 'a' '\0377'
The appended 0xFF character ensures that 'ä' is always after 'a',
regardless of which other character follows. But it will never be
after 'b', because of the first 'a' in the sort key.
If you look up the Unicode collating algorithm, you will see that it
is much more complicated, but the basic idea is the same. It should
not bother you for the initial version; you can simply use glibc
function "strxfrm()" to get the collating sequence for your string
parts where you would call "localeAwareCompare" on.
Also the last inability can be solved in the same way. Simply
_prepend_ a code which states the magnitude (number of digits).
Example:
'a1234' -> 'a' \004' '123'
'a56' -> 'a' '\002' '56'
'a8' -> 'a' '\001' '8'
A combined algorithm could, for example, return these collating
sequences as sort keys:
'a12' -> 'a' '\002' '12'
'ä8' -> 'a' '\0377' '\001' '8'
'A6' -> 'a' \001' '6'
'Ä56' -> 'a' '\0377' '\002' '56'
If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which
might be what you want, but you could also want ('A6', 'ä8', 'a12',
'ä56'), which would require a different algorithm for generating the
sort key.
Christoph Feck (kdepepo)
More information about the kfm-devel
mailing list