Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mark markg85 at gmail.com
Mon Oct 29 16:52:56 GMT 2012


On Mon, Oct 29, 2012 at 4:23 PM, Frank Reininghaus
<frank78ac at googlemail.com> wrote:
> 2012/10/29 Mark:
> [...]
>> Thanks a lot for your suggestions. I will certainly try to add the
>> tests in the mix. However, the tests are likely to be different. The
>> naturalCompare function returns an integer which is more then 0 and
>> 1... My "alternative" is a lessThan replacement and only returns a
>> bool.
>>
>> So obviously some tests are going to fail.. Is it even possible to
>> change the tests to work for this?
>
> I'd say that your "alternative" should also return an int. If you want
> to provide a real alternative to
>
> int KStringHandler::naturalCompare(const QString &_a, const QString
> &_b, Qt::CaseSensitivity caseSensitivity),
>
> I think that you should have
>
> a) A function which takes a const QString& and a Qt:CaseSensitivity
> argument and creates the collation sequence, and
>
> b) A function which takes two collation sequences and returns +1, 0,
> or -1 (could be memcmp() or something like it).
>
> This would also mean that the same unit tests can easily be used for
> both implementations.
>
>> As for the numbers, i already have a different approach there :) The
>> algorithm right now is as follows:
>> - if the current value is a number and the next value isn't then the
>> collation key becomes the integer representation of a number (not the
>> unicode representation)
>> - if the current value AND the next value are both numbers then the
>> current value is 10 + the current number. The 10 makes sure that it's
>> sorted after the 9. This is recursive so a input of: a1000.txt will
>> become this (for the numbers) 1 = 11, 0 = 10, 0 = 10, 0 = 0.
>>
>> You might wonder why i'm using the number representation and not the
>> unicode representation. The reasoning for that is putting a 10 after a
>> 9 would mean that i have to pick the next unicode value for 1 which is
>> ":". That works but makes it more complicated then needed and will
>> also give issues if filenames have an actual ":" in them. So i just
>> decided to numbers as they are. If i look at the unicode table [1]
>> then i'm safe. The highest possible number would be 19 in my algorithm
>> which is before any unicode character.
>
> Hm, if I look at the Unicode table
>
>> [1] http://www.tamasoft.co.jp/en/general-info/unicode.html
>
> it looks like this approach would actually change the relative sorting
> of digits and characters like #+,.- (think of filenames like "a1" and
> "a-1").
>
> Cheers,
> Frank

Getting this right is very very very difficult..
I keep hitting little issues :(

As for the naturalCompare replacement.. Let me just get this working
properly before i start with that part :)




More information about the kfm-devel mailing list