Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mon Oct 29 12:30:10 GMT 2012

On Mon, Oct 29, 2012 at 9:42 AM, Frank Reininghaus
<frank78ac at googlemail.com> wrote:
> Hi Mark,
>
> 2012/10/28 Mark:
> [...]
>> http://paste.kde.org/584534/
>>
>> I already had the nagging feeling that the QChar->unicode() stuff was
>> making it more complicated then it had to bee in the lessThan
>> function. So i converted it all to ints. You can see that in the above
>> paste.
>>
>> Before sorting
>> "0000.txt"
>> "1a2b3a4.txt"
>> "a1.1001.txt"
>> "a10.txt"
>> "a6.txt"
>> "a3.txt"
>> "a8.txt"
>> "a1.txt"
>> "a2.txt"
>> "a9.txt"
>> "a5.txt"
>> "a4.txt"
>> "a14.txt"
>> "a11.txt"
>> "a12.txt"
>> "a1.2.txt"
>> "a1.11.txt"
>> "a7.txt"
>>
>> After sorting
>> "0000.txt"
>> "1a2b3a4.txt"
>> "a1.txt"
>> "a1.2.txt"
>> "a1.11.txt"
>> "a1.1001.txt"
>> "a2.txt"
>> "a3.txt"
>> "a4.txt"
>> "a5.txt"
>> "a6.txt"
>> "a7.txt"
>> "a8.txt"
>> "a9.txt"
>> "a10.txt"
>> "a11.txt"
>> "a12.txt"
>> "a14.txt"
>
> Looks like a reasonable sorting of these files, yes :-)
>
>> Victory i guess :)
>> Well, i actually think that a call for victory might be too soon. I
>> still have to test this on dolphin itself. Which i'm going to do right
>> now.
>
> Another thing that might be worth checking is if your comparison
> algorithm passes all unit tests:
>
> https://projects.kde.org/projects/kde/kdelibs/repository/revisions/master/entry/kdecore/tests/kstringhandlertest.cpp#L52
>
> There's some pretty tricky stuff in there, but any candidate for a new
> comparison function must pass these, of course - we obviously don't
> want regressions.
>
> Another thing that might be worth trying is to take a very large set
> of files (like all files on your hard drive) and check if they are
> sorted equally using the existing naturalCompare() function and your
> algorithm. If that is not the case, find out two files for which the
> lessThan() results differ, add a new unit test, then try to fix it.
>
> I haven't checked every detail of your code yet, but I'm not sure if
> replacing digits by ':' in the collation sequence is a good idea. This
> is only asking for trouble with file names which contain ':' and might
> also cause other issues. Christoph's suggestion to prepend numbers
> with their number of digits (which probably has to be done even
> recursively to make sure that numbers with 10 digits are really
> considered larger than a number with 9 digits) looks better to me.
>
> Best regards,
> Frank

Hi Frank,

Thanks a lot for your suggestions. I will certainly try to add the
tests in the mix. However, the tests are likely to be different. The
naturalCompare function returns an integer which is more then 0 and
1... My "alternative" is a lessThan replacement and only returns a
bool.

So obviously some tests are going to fail.. Is it even possible to
change the tests to work for this?

As for the numbers, i already have a different approach there :) The
algorithm right now is as follows:
- if the current value is a number and the next value isn't then the
collation key becomes the integer representation of a number (not the
unicode representation)
- if the current value AND the next value are both numbers then the
current value is 10 + the current number. The 10 makes sure that it's
sorted after the 9. This is recursive so a input of: a1000.txt will
become this (for the numbers) 1 = 11, 0 = 10, 0 = 10, 0 = 0.

You might wonder why i'm using the number representation and not the
unicode representation. The reasoning for that is putting a 10 after a
9 would mean that i have to pick the next unicode value for 1 which is
":". That works but makes it more complicated then needed and will
also give issues if filenames have an actual ":" in them. So i just
decided to numbers as they are. If i look at the unicode table [1]
then i'm safe. The highest possible number would be 19 in my algorithm
which is before any unicode character.

[1] http://www.tamasoft.co.jp/en/general-info/unicode.html