Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Christoph Feck christoph at maxiom.de
Thu Oct 25 22:00:23 BST 2012


On Thursday 25 October 2012 20:24:23 Mark wrote:
> On Thu, Oct 25, 2012 at 2:09 PM, Christoph Feck 
<christoph at maxiom.de> wrote:
> >         'a' ->  'a'
> >         'b' -> 'b'
> >         'ä' -> 'a' '\0377'
> > 
> > The appended 0xFF character ensures that 'ä' is always after 'a',
> > regardless of which other character follows. But it will never be
> > after 'b', because of the first 'a' in the sort key.
> 
> This is where things get odd... So, i should place 'ä' before 'a',
> but that means that i have to make a translation table somewhere
> from latin chars to alphabet chars.. Or am i misunderstanding
> something here?

You are right, there need to be some tables, but as I wrote, you do 
not bother, because you can use strxform() from glibc for this (or 
libicu, which is what QCollator is going to use). Those libraries 
already load locale-specific tables (every wondered, why "glibc-
locale" package is several hundred megabytes big? ;) Simply check the 
source code of QString::localeAwareCompare(), it's no black magic to 
call glibc.

> So how about this stuff in the implementations side?
> Am i right that i'm going to get 2 new parts:
> 
> 1. A part for building up the collation sequence. That part is
> outside the sorting algorithm.

Yes. That's why Frank said this should be in kdelibs, right besides 
the "naturalCompare" function. Looking at the QCollator link for 
future Q5, we should model the function in kdelibs the same way, so 
that it simply can be "moved" to Qt in frameworks.

> 2. The sorting algorithm becomes a LOT less code and simply does
> string compares based on the collation sequences. So no more
> difficult checks like are done now. Thus the sorting would indeed
> speed up massively, right?

Well, the comparing gets massively faster, but the actual sorting 
algorithm gets a bit more complicated. First, you have to create a 
temporary SortEntry list, create the keys, then sort the side-
information, and after that, have to use that sorted list to do the 
actual model sorting. But those additional operations are only O(N), 
not O(N*log N), which is why we expect it to be very worth for large 
models.

> For Todd and Christoph, thanks a lot for writing such long
> informative replies! I really appreciate that and it really helps
> me understand how to implement this. Though i obviously still have
> questions ;)

Wer nicht fragt, bleibt dumm ;)

Christoph Feck (kdepepo)




More information about the kfm-devel mailing list