Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mon Oct 29 15:23:09 GMT 2012

2012/10/29 Mark:
[...]
> Thanks a lot for your suggestions. I will certainly try to add the
> tests in the mix. However, the tests are likely to be different. The
> naturalCompare function returns an integer which is more then 0 and
> 1... My "alternative" is a lessThan replacement and only returns a
> bool.
>
> So obviously some tests are going to fail.. Is it even possible to
> change the tests to work for this?

I'd say that your "alternative" should also return an int. If you want
to provide a real alternative to

int KStringHandler::naturalCompare(const QString &_a, const QString
&_b, Qt::CaseSensitivity caseSensitivity),

I think that you should have

a) A function which takes a const QString& and a Qt:CaseSensitivity
argument and creates the collation sequence, and

b) A function which takes two collation sequences and returns +1, 0,
or -1 (could be memcmp() or something like it).

This would also mean that the same unit tests can easily be used for
both implementations.

> As for the numbers, i already have a different approach there :) The
> algorithm right now is as follows:
> - if the current value is a number and the next value isn't then the
> collation key becomes the integer representation of a number (not the
> unicode representation)
> - if the current value AND the next value are both numbers then the
> current value is 10 + the current number. The 10 makes sure that it's
> sorted after the 9. This is recursive so a input of: a1000.txt will
> become this (for the numbers) 1 = 11, 0 = 10, 0 = 10, 0 = 0.
>
> You might wonder why i'm using the number representation and not the
> unicode representation. The reasoning for that is putting a 10 after a
> 9 would mean that i have to pick the next unicode value for 1 which is
> ":". That works but makes it more complicated then needed and will
> also give issues if filenames have an actual ":" in them. So i just
> decided to numbers as they are. If i look at the unicode table [1]
> then i'm safe. The highest possible number would be 19 in my algorithm
> which is before any unicode character.

Hm, if I look at the Unicode table

> [1] http://www.tamasoft.co.jp/en/general-info/unicode.html

it looks like this approach would actually change the relative sorting
of digits and characters like #+,.- (think of filenames like "a1" and
"a-1").

Cheers,
Frank