Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mark markg85 at gmail.com
Fri Dec 28 14:41:26 GMT 2012


On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus
<frank78ac at googlemail.com> wrote:
> Hi,
>
> 2012/10/24 Christoph Feck:
>> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote:
>>> Speed of comparing the sequences is that of a memcpy()
>>
>> Of course I meant memcmp().
>>
>>> So what you have to work on before doing anything else, is to ask
>>> Frank, if he is okey with adding a sort key (QByteArray basically)
>>> for each file item. Your code would have to make sure it is always
>>> updated on file renames, locale changes, sort mode changes
>>> (natural vs. simple) etc.
>>
>> Another idea is to not store this side information "permanently", but
>> only create it temporarily when sorting. So you effectively only sort
>> a list with side information.
>>
>>         struct SortEntry
>>         {
>>                 QByteArray collatingKey;
>>                 KFileItem *fileItem;
>>         };
>
> I have been thinking about sorting from time to time, and I had an
> idea how to to store the sort key. It could be done either
>
> a) In the ItemData's "values" hash (as a QVariant with a key like
> "naturalCollatingSequence"), or,
>
> b) in a new member "QByteArray naturalCollatingSequence" in ItemData
> (if a QByteArray seems more suitable because QByteArrays can be
> compared with < directly and the casting overhead for each comparison
> seems to much).
>
> When comparing two ItemData, we would then first check if both already
> have a collating sequence, and calculate and store them if that is not
> the case. One could then just compare those sequences.
>
> When an item changes, the collating sequence would be deleted from the
> hash (or set to an empty QByteArray, respectively).
>
> This would make sure that the collating sequence is calculated only
> once for each item, and only if it is really needed.
>
> Best regards,
> Frank

Hi Frank,

That's a nice idea and close to what i was experimenting with as well.

The issue that i keep having all the time is building a collation
sequence itself. For example, take the following "file names":

88.txt
100.txt

What i'm doing right now is translating everything to numbers, and
then just comparing the numbers. Another way would be to convert them
all to numbers and then make a new long string that contains all the
numbers. So i end up with:

88.txt
8 = 8
8 = 8
. = <some number>
t = <some number>
x = <some number>
t = <some number>

100.txt
1 = 1
0 = 0
0 = 0
. = <some number>
t = <some number>
x = <some number>
t = <some number>

That works, but only till 9.. That's a pitfall i discovered early on.
So then i continues and changed the algorithm to say: "if the next
character is also a number then the current number gets incremented by
10" so that will end up with: (omitting the .txt)
88.txt
8 = 18
8 = 8

100.txt
1 = 11
0 = 10
0 = 0

Doing that improved it greatly but it still gives unexpected results.
So what i was about to try now is making the algorithm say: "increment
the current number by the previous number. If there is no previous
number, but there are next numbers, just add 10." so that would end up
looking like this:
88.txt // stays the same
8 = 18
8 = 8

100.txt differs
1 = 11
0 = 10 (+11) = 21
0 = 0

I still have to see if that one nails all the issues.
Once i get that working then it's just a simple matter of putting all
the numbers in one string again, so for the last example that would
end up like so: "11210" as a string. But i'm not there - by far! It's
very complicated to get it right and i fear that this last option also
has some unexpected consequences.

If you have any other idea to get this working, please do share. This
issue is really nagging me and i want to get it fixed. It would be the
fastest natural sorting algorithm that i know of if it works :)

Cheers,
Mark




More information about the kfm-devel mailing list