Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Fri Oct 26 20:21:39 BST 2012

On Fri, Oct 26, 2012 at 12:02 AM, Mark <markg85 at gmail.com> wrote:
> On Thu, Oct 25, 2012 at 11:00 PM, Christoph Feck <christoph at maxiom.de> wrote:
>> On Thursday 25 October 2012 20:24:23 Mark wrote:
>>> On Thu, Oct 25, 2012 at 2:09 PM, Christoph Feck
>> <christoph at maxiom.de> wrote:
>>> >         'a' ->  'a'
>>> >         'b' -> 'b'
>>> >         'ä' -> 'a' '\0377'
>>> >
>>> > The appended 0xFF character ensures that 'ä' is always after 'a',
>>> > regardless of which other character follows. But it will never be
>>> > after 'b', because of the first 'a' in the sort key.
>>>
>>> This is where things get odd... So, i should place 'ä' before 'a',
>>> but that means that i have to make a translation table somewhere
>>> from latin chars to alphabet chars.. Or am i misunderstanding
>>> something here?
>>
>> You are right, there need to be some tables, but as I wrote, you do
>> not bother, because you can use strxform() from glibc for this (or
>> libicu, which is what QCollator is going to use). Those libraries
>> already load locale-specific tables (every wondered, why "glibc-
>> locale" package is several hundred megabytes big? ;) Simply check the
>> source code of QString::localeAwareCompare(), it's no black magic to
>> call glibc.
>>
>>> So how about this stuff in the implementations side?
>>> Am i right that i'm going to get 2 new parts:
>>>
>>> 1. A part for building up the collation sequence. That part is
>>> outside the sorting algorithm.
>>
>> Yes. That's why Frank said this should be in kdelibs, right besides
>> the "naturalCompare" function. Looking at the QCollator link for
>> future Q5, we should model the function in kdelibs the same way, so
>> that it simply can be "moved" to Qt in frameworks.
>>
>>> 2. The sorting algorithm becomes a LOT less code and simply does
>>> string compares based on the collation sequences. So no more
>>> difficult checks like are done now. Thus the sorting would indeed
>>> speed up massively, right?
>>
>> Well, the comparing gets massively faster, but the actual sorting
>> algorithm gets a bit more complicated. First, you have to create a
>> temporary SortEntry list, create the keys, then sort the side-
>> information, and after that, have to use that sorted list to do the
>> actual model sorting. But those additional operations are only O(N),
>> not O(N*log N), which is why we expect it to be very worth for large
>> models.
>>
>>> For Todd and Christoph, thanks a lot for writing such long
>>> informative replies! I really appreciate that and it really helps
>>> me understand how to implement this. Though i obviously still have
>>> questions ;)
>>
>> Wer nicht fragt, bleibt dumm ;)
>>
>> Christoph Feck (kdepepo)
>
> I'm playing with this stuff right now. The comparison gets MASSIVELY
> faster since it are actually int compares! (or numeric compares) No
> string compares at all anymore.
> Btw, that german line you wrote. I can read that one you know ;)
>
> Cheers,
> Mark

Oke, i got "something" working now but i kinda miss some more
knowledge in the sorting area.

First the current implementation. I went for a structure like this:
struct KFileItem
{
    QString collatingSequence;
    QString filename;
    ....
};

Note: this is just in a clean new (one file) project so don't bother the names.

filename is obviously the filename. collatingSequence is a copy of
filename. In that collatingSequence i'm tweaking the unicode values
and lower them for numbers. So the intention is to have numbers shown
before characters.

So:
000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116)
a.txt (in unicode: 97, 46, 116, 120, 116)
...

And i tweaked that to look like this: (1 + num)
000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116)
a.txt (in unicode: 97, 46, 116, 120, 116)
...

That part "seems" to be oke in theory but i'm getting stuck when
sorting. For this "proof of concept" i simply use qSort like so:
QList<KFileItem> fileList;

... add a bunch of items in fileList ...

qSort(fileList.begin(), fileList.end(), lessThan);

lessThan is where i'm stuck. I simply don't understand yet how
lessThan should work. What conditions i should do and when i should
return false or true.. My current version looks like this:
bool intSort(const KFileItem& s1, const KFileItem& s2)
{
    const QChar* sOne = s1.collatingSequence.unicode();
    const QChar* sTwo = s2.collatingSequence.unicode();

    while(!sOne->isNull() && !sTwo->isNull())
    {
        if(sTwo->unicode() < sOne->unicode())
        {
            return false;
        }

        ++sOne;
        ++sTwo;
    }

    return true;
}

Which compiles and runs, but does a very strange sorting :p
If someone can explain what the intended implementation is for a lessThan..?

Kind regards,
Mark