Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Sun Oct 28 00:39:48 BST 2012

On Sat, Oct 27, 2012 at 10:14 PM, Mark <markg85 at gmail.com> wrote:
> On Fri, Oct 26, 2012 at 9:48 PM, Mark <markg85 at gmail.com> wrote:
>> On Fri, Oct 26, 2012 at 9:40 PM, Bernd Buschinski
>> <b.buschinski at googlemail.com> wrote:
>>> On Friday 26 October 2012 21:21:39 Mark wrote:
>>>>
>>>> Oke, i got "something" working now but i kinda miss some more
>>>> knowledge in the sorting area.
>>>>
>>>> First the current implementation. I went for a structure like this:
>>>> struct KFileItem
>>>> {
>>>>     QString collatingSequence;
>>>>     QString filename;
>>>>     ....
>>>> };
>>>>
>>>> Note: this is just in a clean new (one file) project so don't bother the
>>>> names.
>>>>
>>>> filename is obviously the filename. collatingSequence is a copy of
>>>> filename. In that collatingSequence i'm tweaking the unicode values
>>>> and lower them for numbers. So the intention is to have numbers shown
>>>> before characters.
>>>>
>>>> So:
>>>> 000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116)
>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>> ...
>>>>
>>>> And i tweaked that to look like this: (1 + num)
>>>> 000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116)
>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>> ...
>>>>
>>>>
>>>> That part "seems" to be oke in theory but i'm getting stuck when
>>>> sorting. For this "proof of concept" i simply use qSort like so:
>>>> QList<KFileItem> fileList;
>>>>
>>>> ... add a bunch of items in fileList ...
>>>>
>>>> qSort(fileList.begin(), fileList.end(), lessThan);
>>>>
>>>> lessThan is where i'm stuck. I simply don't understand yet how
>>>> lessThan should work. What conditions i should do and when i should
>>>> return false or true.. My current version looks like this:
>>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>>> {
>>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>>
>>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>>     {
>>>>         if(sTwo->unicode() < sOne->unicode())
>>>>         {
>>>>             return false;
>>>>         }
>>>>
>>>>         ++sOne;
>>>>         ++sTwo;
>>>>     }
>>>>
>>>>     return true;
>>>> }
>>>>
>>>> Which compiles and runs, but does a very strange sorting :p
>>>> If someone can explain what the intended implementation is for a lessThan..?
>>>>
>>>> Kind regards,
>>>> Mark
>>>
>>> isn't the if wrong? shouldn't this be "if a is less than b"?
>>>
>>> so
>>>         if(sTwo->unicode() < sOne->unicode())
>>>         {
>>>             return false;
>>>         }
>>>
>>> should be
>>>
>>>         if(sOne->unicode() < sTwo->unicode())
>>>         {
>>>             return false;
>>>         }
>>>
>>> or? (untested)
>>>
>>>
>>>
>>> all:
>>>
>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>> {
>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>
>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>     {
>>>         if(sOne->unicode() < sTwo->unicode())
>>>         {
>>>             return false;
>>>         }
>>>
>>>         ++sOne;
>>>         ++sTwo;
>>>     }
>>>
>>>     return true;
>>> }
>>>
>>>
>>> Best regards
>>> Bernd
>>>
>>
>> Nope, i tried that but it doesn't improve sorting.
>> Here is the VERY HACKY single file project.
>> http://paste.kde.org/582536/
>>
>> Make an empty Qt console project and paste this as the main.cpp and it
>> should just run. There is quite a bit of debug info in there and quite
>> a lot more commented out.
>
> I have more stuff working now. You can find the current code here:
> http://paste.kde.org/583586/ the same warnings apply as last time :)
>
> The sorting of just text works fine, but i'm having issues as soon as
> i ass numbers in the mix.
> If you sort by character then a 0 comes before a 9, just as it should be.
> In this example code i have the following names:
> "0000.txt"
> "a4.txt"
> "a1.1001.txt"
> "a10.txt"
> "a1.txt"
> "a1.2.txt"
> "a1.11.txt"
>
> and once the sorting is run over it the result becomes:
> "0000.txt"
> "a1.2.txt"
> "a1.11.txt"
> "a1.txt"
> "a1.1001.txt"
> "a4.txt"
> "a10.txt"
>
> So i'm getting close, but that "a1.txt" is giving me troubles now.
> Note, for the sorting i just use the names without the .txt since that
> should be done when sorting on extension anyway.
>
> The "trick" i'm doing right now is looking ahead in the string. If the
> current character is a number and the following character is also a
> number then i invert the current character. So if the current
> character is a 1 and the next character is a 0 then i replace the
> current character with 9 - current character. Thus far this seems to
> work quite well, but i haven't drawn this out to see if it will work
> everywhere. I'm doing this because i really like to avoid detecting a
> number in a string. That will obviously work, but will also post new
> issues like the size of an int, float or  double.. I prefer the per
> character way if i can get that working.
>
> Again, that a1 is biting me now. Does anyone know a solution to get
> this order right? a1 should be above a1.1 and below 0000.
>
> Kind regards,
> Mark

Little update: http://paste.kde.org/583670/
Prevents leading zeros from screwing my results.. I still have the
same issue with a1.txt -_- That one is really getting on my nerves
now.