Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Sat Oct 27 21:14:29 BST 2012

On Fri, Oct 26, 2012 at 9:48 PM, Mark <markg85 at gmail.com> wrote:
> On Fri, Oct 26, 2012 at 9:40 PM, Bernd Buschinski
> <b.buschinski at googlemail.com> wrote:
>> On Friday 26 October 2012 21:21:39 Mark wrote:
>>>
>>> Oke, i got "something" working now but i kinda miss some more
>>> knowledge in the sorting area.
>>>
>>> First the current implementation. I went for a structure like this:
>>> struct KFileItem
>>> {
>>>     QString collatingSequence;
>>>     QString filename;
>>>     ....
>>> };
>>>
>>> Note: this is just in a clean new (one file) project so don't bother the
>>> names.
>>>
>>> filename is obviously the filename. collatingSequence is a copy of
>>> filename. In that collatingSequence i'm tweaking the unicode values
>>> and lower them for numbers. So the intention is to have numbers shown
>>> before characters.
>>>
>>> So:
>>> 000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116)
>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>> ...
>>>
>>> And i tweaked that to look like this: (1 + num)
>>> 000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116)
>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>> ...
>>>
>>>
>>> That part "seems" to be oke in theory but i'm getting stuck when
>>> sorting. For this "proof of concept" i simply use qSort like so:
>>> QList<KFileItem> fileList;
>>>
>>> ... add a bunch of items in fileList ...
>>>
>>> qSort(fileList.begin(), fileList.end(), lessThan);
>>>
>>> lessThan is where i'm stuck. I simply don't understand yet how
>>> lessThan should work. What conditions i should do and when i should
>>> return false or true.. My current version looks like this:
>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>> {
>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>
>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>     {
>>>         if(sTwo->unicode() < sOne->unicode())
>>>         {
>>>             return false;
>>>         }
>>>
>>>         ++sOne;
>>>         ++sTwo;
>>>     }
>>>
>>>     return true;
>>> }
>>>
>>> Which compiles and runs, but does a very strange sorting :p
>>> If someone can explain what the intended implementation is for a lessThan..?
>>>
>>> Kind regards,
>>> Mark
>>
>> isn't the if wrong? shouldn't this be "if a is less than b"?
>>
>> so
>>         if(sTwo->unicode() < sOne->unicode())
>>         {
>>             return false;
>>         }
>>
>> should be
>>
>>         if(sOne->unicode() < sTwo->unicode())
>>         {
>>             return false;
>>         }
>>
>> or? (untested)
>>
>>
>>
>> all:
>>
>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>> {
>>     const QChar* sOne = s1.collatingSequence.unicode();
>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>
>>     while(!sOne->isNull() && !sTwo->isNull())
>>     {
>>         if(sOne->unicode() < sTwo->unicode())
>>         {
>>             return false;
>>         }
>>
>>         ++sOne;
>>         ++sTwo;
>>     }
>>
>>     return true;
>> }
>>
>>
>> Best regards
>> Bernd
>>
>
> Nope, i tried that but it doesn't improve sorting.
> Here is the VERY HACKY single file project.
> http://paste.kde.org/582536/
>
> Make an empty Qt console project and paste this as the main.cpp and it
> should just run. There is quite a bit of debug info in there and quite
> a lot more commented out.

I have more stuff working now. You can find the current code here:
http://paste.kde.org/583586/ the same warnings apply as last time :)

The sorting of just text works fine, but i'm having issues as soon as
i ass numbers in the mix.
If you sort by character then a 0 comes before a 9, just as it should be.
In this example code i have the following names:
"0000.txt"
"a4.txt"
"a1.1001.txt"
"a10.txt"
"a1.txt"
"a1.2.txt"
"a1.11.txt"

and once the sorting is run over it the result becomes:
"0000.txt"
"a1.2.txt"
"a1.11.txt"
"a1.txt"
"a1.1001.txt"
"a4.txt"
"a10.txt"

So i'm getting close, but that "a1.txt" is giving me troubles now.
Note, for the sorting i just use the names without the .txt since that
should be done when sorting on extension anyway.

The "trick" i'm doing right now is looking ahead in the string. If the
current character is a number and the following character is also a
number then i invert the current character. So if the current
character is a 1 and the next character is a 0 then i replace the
current character with 9 - current character. Thus far this seems to
work quite well, but i haven't drawn this out to see if it will work
everywhere. I'm doing this because i really like to avoid detecting a
number in a string. That will obviously work, but will also post new
issues like the size of an int, float or  double.. I prefer the per
character way if i can get that working.

Again, that a1 is biting me now. Does anyone know a solution to get
this order right? a1 should be above a1.1 and below 0000.

Kind regards,
Mark