Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Sun Oct 28 19:27:52 GMT 2012

On Sun, Oct 28, 2012 at 5:55 PM, Mark <markg85 at gmail.com> wrote:
> On Sun, Oct 28, 2012 at 1:39 AM, Mark <markg85 at gmail.com> wrote:
>> On Sat, Oct 27, 2012 at 10:14 PM, Mark <markg85 at gmail.com> wrote:
>>> On Fri, Oct 26, 2012 at 9:48 PM, Mark <markg85 at gmail.com> wrote:
>>>> On Fri, Oct 26, 2012 at 9:40 PM, Bernd Buschinski
>>>> <b.buschinski at googlemail.com> wrote:
>>>>> On Friday 26 October 2012 21:21:39 Mark wrote:
>>>>>>
>>>>>> Oke, i got "something" working now but i kinda miss some more
>>>>>> knowledge in the sorting area.
>>>>>>
>>>>>> First the current implementation. I went for a structure like this:
>>>>>> struct KFileItem
>>>>>> {
>>>>>>     QString collatingSequence;
>>>>>>     QString filename;
>>>>>>     ....
>>>>>> };
>>>>>>
>>>>>> Note: this is just in a clean new (one file) project so don't bother the
>>>>>> names.
>>>>>>
>>>>>> filename is obviously the filename. collatingSequence is a copy of
>>>>>> filename. In that collatingSequence i'm tweaking the unicode values
>>>>>> and lower them for numbers. So the intention is to have numbers shown
>>>>>> before characters.
>>>>>>
>>>>>> So:
>>>>>> 000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116)
>>>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>>>> ...
>>>>>>
>>>>>> And i tweaked that to look like this: (1 + num)
>>>>>> 000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116)
>>>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> That part "seems" to be oke in theory but i'm getting stuck when
>>>>>> sorting. For this "proof of concept" i simply use qSort like so:
>>>>>> QList<KFileItem> fileList;
>>>>>>
>>>>>> ... add a bunch of items in fileList ...
>>>>>>
>>>>>> qSort(fileList.begin(), fileList.end(), lessThan);
>>>>>>
>>>>>> lessThan is where i'm stuck. I simply don't understand yet how
>>>>>> lessThan should work. What conditions i should do and when i should
>>>>>> return false or true.. My current version looks like this:
>>>>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>>>>> {
>>>>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>>>>
>>>>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>>>>     {
>>>>>>         if(sTwo->unicode() < sOne->unicode())
>>>>>>         {
>>>>>>             return false;
>>>>>>         }
>>>>>>
>>>>>>         ++sOne;
>>>>>>         ++sTwo;
>>>>>>     }
>>>>>>
>>>>>>     return true;
>>>>>> }
>>>>>>
>>>>>> Which compiles and runs, but does a very strange sorting :p
>>>>>> If someone can explain what the intended implementation is for a lessThan..?
>>>>>>
>>>>>> Kind regards,
>>>>>> Mark
>>>>>
>>>>> isn't the if wrong? shouldn't this be "if a is less than b"?
>>>>>
>>>>> so
>>>>>         if(sTwo->unicode() < sOne->unicode())
>>>>>         {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> should be
>>>>>
>>>>>         if(sOne->unicode() < sTwo->unicode())
>>>>>         {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>> or? (untested)
>>>>>
>>>>>
>>>>>
>>>>> all:
>>>>>
>>>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>>>> {
>>>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>>>
>>>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>>>     {
>>>>>         if(sOne->unicode() < sTwo->unicode())
>>>>>         {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>>         ++sOne;
>>>>>         ++sTwo;
>>>>>     }
>>>>>
>>>>>     return true;
>>>>> }
>>>>>
>>>>>
>>>>> Best regards
>>>>> Bernd
>>>>>
>>>>
>>>> Nope, i tried that but it doesn't improve sorting.
>>>> Here is the VERY HACKY single file project.
>>>> http://paste.kde.org/582536/
>>>>
>>>> Make an empty Qt console project and paste this as the main.cpp and it
>>>> should just run. There is quite a bit of debug info in there and quite
>>>> a lot more commented out.
>>>
>>> I have more stuff working now. You can find the current code here:
>>> http://paste.kde.org/583586/ the same warnings apply as last time :)
>>>
>>> The sorting of just text works fine, but i'm having issues as soon as
>>> i ass numbers in the mix.
>>> If you sort by character then a 0 comes before a 9, just as it should be.
>>> In this example code i have the following names:
>>> "0000.txt"
>>> "a4.txt"
>>> "a1.1001.txt"
>>> "a10.txt"
>>> "a1.txt"
>>> "a1.2.txt"
>>> "a1.11.txt"
>>>
>>> and once the sorting is run over it the result becomes:
>>> "0000.txt"
>>> "a1.2.txt"
>>> "a1.11.txt"
>>> "a1.txt"
>>> "a1.1001.txt"
>>> "a4.txt"
>>> "a10.txt"
>>>
>>> So i'm getting close, but that "a1.txt" is giving me troubles now.
>>> Note, for the sorting i just use the names without the .txt since that
>>> should be done when sorting on extension anyway.
>>>
>>> The "trick" i'm doing right now is looking ahead in the string. If the
>>> current character is a number and the following character is also a
>>> number then i invert the current character. So if the current
>>> character is a 1 and the next character is a 0 then i replace the
>>> current character with 9 - current character. Thus far this seems to
>>> work quite well, but i haven't drawn this out to see if it will work
>>> everywhere. I'm doing this because i really like to avoid detecting a
>>> number in a string. That will obviously work, but will also post new
>>> issues like the size of an int, float or  double.. I prefer the per
>>> character way if i can get that working.
>>>
>>> Again, that a1 is biting me now. Does anyone know a solution to get
>>> this order right? a1 should be above a1.1 and below 0000.
>>>
>>> Kind regards,
>>> Mark
>>
>> Little update: http://paste.kde.org/583670/
>> Prevents leading zeros from screwing my results.. I still have the
>> same issue with a1.txt -_- That one is really getting on my nerves
>> now.
>
> Another little update.
> Before sorting
> "0000.txt"
> "1a2b3a4.txt"
> "a1.1001.txt"
> "a10.txt"
> "a1.txt"
> "a1.2.txt"
> "a1.11.txt"
> "a7.txt"
>
> After sorting
> "0000.txt"
> "1a2b3a4.txt"
> "a1.1001.txt"
> "a1.txt"
> "a1.2.txt"
> "a1.11.txt"
> "a7.txt"
> "a10.txt"
>
>
> Somehow i need to place a1.1001.txt below a1.11.txt ... I'm currently
> out of ideas on how to do that.
> The code: http://paste.kde.org/584462/
>
> Once this "algorithm" work i will rework it to have just an integer
> array in KFileItem and an int arraySize. Then the actual sorting
> function (intSort) is even simpler and probably even faster.

Got it!
http://paste.kde.org/584534/

I already had the nagging feeling that the QChar->unicode() stuff was
making it more complicated then it had to bee in the lessThan
function. So i converted it all to ints. You can see that in the above
paste.

Before sorting
"0000.txt"
"1a2b3a4.txt"
"a1.1001.txt"
"a10.txt"
"a6.txt"
"a3.txt"
"a8.txt"
"a1.txt"
"a2.txt"
"a9.txt"
"a5.txt"
"a4.txt"
"a14.txt"
"a11.txt"
"a12.txt"
"a1.2.txt"
"a1.11.txt"
"a7.txt"

After sorting
"0000.txt"
"1a2b3a4.txt"
"a1.txt"
"a1.2.txt"
"a1.11.txt"
"a1.1001.txt"
"a2.txt"
"a3.txt"
"a4.txt"
"a5.txt"
"a6.txt"
"a7.txt"
"a8.txt"
"a9.txt"
"a10.txt"
"a11.txt"
"a12.txt"
"a14.txt"

Victory i guess :)
Well, i actually think that a call for victory might be too soon. I
still have to test this on dolphin itself. Which i'm going to do right
now.

Sorry for the noise in here, but i think i should mention every
progression just in case someone is reading this thread and tries to
figure out the same thing.

Cheers,
Mark