Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Thu Oct 25 13:09:31 BST 2012

On Thursday 25 October 2012 12:20:42 Mark wrote:
> So i would end up with a:
>         struct SortEntry
>         {
>                 QByteArray collatingKey;
>                 KFileItem *fileItem;
>         };
> 
> Where the collatingKey is meant to be what?

Let's start with comparing using strcmp. It works like this, omitting 
details:

	while (*s1 && *s2 && *s1 == *s2) {
		++s1;
		++s2;
	}
	return *s1 - *s2;

As you see, the code comparing each character is very fast, but does 
not handle anything special, in particular it does not
- compare case insensitive
- sort german 'ä' after 'a' but before 'b' ("locale aware")
- sort 'a12' after 'a2' ("natural sorting")

The solution to the first inability is known to you: You convert each 
string to lower case before comparing them. This converted string is 
the actual "sort key", i.e. what you use for the sort's "lessThan" 
functor:

	'A' -> 'a'
	'B' -> 'b'
	'a' -> 'a'

Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking 
up the actual item which you stored within the "SortEntry".

This very same idea can be applied to tackle the second inability. You 
convert each string to its locale-dependend "collating" string, often 
just called a "sequence", because it ususally does contain non-
characters. Here are example keys that sort 'a' < 'ä' < 'b':

	'a' ->	'a'
	'b' -> 'b'
	'ä' -> 'a' '\0377'

The appended 0xFF character ensures that 'ä' is always after 'a', 
regardless of which other character follows. But it will never be 
after 'b', because of the first 'a' in the sort key.

If you look up the Unicode collating algorithm, you will see that it 
is much more complicated, but the basic idea is the same. It should 
not bother you for the initial version; you can simply use glibc 
function "strxfrm()" to get the collating sequence for your string 
parts where you would call "localeAwareCompare" on.

Also the last inability can be solved in the same way. Simply 
_prepend_ a code which states the magnitude (number of digits). 
Example:

	'a1234' -> 'a' \004' '123'
	'a56' -> 'a' '\002' '56'
	'a8' -> 'a' '\001' '8'

A combined algorithm could, for example, return these collating 
sequences as sort keys:

	'a12' -> 'a' '\002' '12'
	'ä8' -> 'a' '\0377' '\001' '8'
	'A6' -> 'a' \001' '6'
	'Ä56' -> 'a' '\0377' '\002' '56'

If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which 
might be what you want, but you could also want ('A6', 'ä8', 'a12', 
'ä56'), which would require a different algorithm for generating the 
sort key.

Christoph Feck (kdepepo)