Ideas for improving sorting. Goal: under 1 second for 200,000 files.

Frank Reininghaus frank78ac at googlemail.com
Sat Dec 29 12:05:18 GMT 2012


Hi,

2012/12/28 Mark:
> On Fri, Dec 28, 2012 at 3:41 PM, Mark wrote:
>> On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus
>> <frank78ac at googlemail.com> wrote:
>>> Hi,
>>>
>>> 2012/10/24 Christoph Feck:
>>>> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote:
>>>>> Speed of comparing the sequences is that of a memcpy()
>>>>
>>>> Of course I meant memcmp().
>>>>
>>>>> So what you have to work on before doing anything else, is to ask
>>>>> Frank, if he is okay with adding a sort key (QByteArray basically)
>>>>> for each file item. Your code would have to make sure it is always
>>>>> updated on file renames, locale changes, sort mode changes
>>>>> (natural vs. simple) etc.
>>>>
>>>> Another idea is to not store this side information "permanently", but
>>>> only create it temporarily when sorting. So you effectively only sort
>>>> a list with side information.
>>>>
>>>>         struct SortEntry
>>>>         {
>>>>                 QByteArray collatingKey;
>>>>                 KFileItem *fileItem;
>>>>         };
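
To make the quoted idea concrete, here is a minimal sketch (not actual
Dolphin code) of sorting through such temporary entries; the FileItem
struct and the makeCollatingKey() helper are placeholders for KFileItem
and for whatever key construction we eventually settle on:

    #include <QByteArray>
    #include <QList>
    #include <QString>
    #include <algorithm>

    struct FileItem {                  // stand-in for KFileItem in this sketch
        QString name;
    };

    struct SortEntry {
        QByteArray collatingKey;       // computed once per item and sort run
        FileItem *fileItem;
    };

    // Placeholder: the real key construction is what this thread is about.
    QByteArray makeCollatingKey(const QString &name)
    {
        return name.toLower().toUtf8();
    }

    void sortItems(QList<FileItem *> &items)
    {
        QList<SortEntry> entries;
        entries.reserve(items.size());
        for (FileItem *item : items)
            entries.append(SortEntry{makeCollatingKey(item->name), item});

        // Sorting only compares the precomputed byte sequences.
        std::sort(entries.begin(), entries.end(),
                  [](const SortEntry &a, const SortEntry &b) {
                      return a.collatingKey < b.collatingKey;
                  });

        for (int i = 0; i < entries.size(); ++i)
            items[i] = entries.at(i).fileItem;
    }
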
>>>
>>> I have been thinking about sorting from time to time, and I had an
>>> idea of how to store the sort key. It could be done either
>>>
>>> a) In the ItemData's "values" hash (as a QVariant with a key like
>>> "naturalCollatingSequence"), or,
>>>
>>> b) in a new member "QByteArray naturalCollatingSequence" in ItemData
>>> (if a QByteArray seems more suitable because QByteArrays can be
>>> compared with < directly and the casting overhead for each comparison
>>> seems too high).
>>>
>>> When comparing two ItemData, we would then first check if both already
>>> have a collating sequence, and calculate and store them if that is not
>>> the case. One could then just compare those sequences.
>>>
>>> When an item changes, the collating sequence would be deleted from the
>>> hash (or set to an empty QByteArray, respectively).
>>>
>>> This would make sure that the collating sequence is calculated only
>>> once for each item, and only if it is really needed.
>>>
>>> Best regards,
>>> Frank
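
For illustration, a rough sketch of that caching scheme with
hypothetical names (ItemData here is a simplified stand-in for the
model's real item data, and computeCollatingSequence() is a
placeholder):

    #include <QByteArray>
    #include <QString>

    struct ItemData {                          // simplified stand-in
        QString name;
        QByteArray naturalCollatingSequence;   // empty means "not computed yet"
    };

    // Placeholder for the actual key construction.
    QByteArray computeCollatingSequence(const QString &name)
    {
        return name.toLower().toUtf8();
    }

    // Compute the sequence on first use and cache it (option b above).
    const QByteArray &collatingSequence(ItemData &data)
    {
        if (data.naturalCollatingSequence.isEmpty())
            data.naturalCollatingSequence = computeCollatingSequence(data.name);
        return data.naturalCollatingSequence;
    }

    bool lessThan(ItemData &a, ItemData &b)
    {
        return collatingSequence(a) < collatingSequence(b);
    }

    // On a rename (or a locale/sort mode change), drop the cached sequence
    // so that it is recalculated the next time it is needed.
    void itemChanged(ItemData &data, const QString &newName)
    {
        data.name = newName;
        data.naturalCollatingSequence.clear();
    }
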
>>
>> Hi Frank,
>>
>> That's a nice idea and close to what I was experimenting with as well.
>>
>> The issue that I keep running into is building the collation
>> sequence itself. For example, take the following "file names":
>>
>> 88.txt
>> 100.txt
>>
>> What I'm doing right now is translating everything to numbers and
>> then just comparing the numbers. Another way would be to convert them
>> all to numbers and then build a new long string that contains all the
>> numbers. So I end up with:
>>
>> 88.txt
>> 8 = 8
>> 8 = 8
>> . = <some number>
>> t = <some number>
>> x = <some number>
>> t = <some number>
>>
>> 100.txt
>> 1 = 1
>> 0 = 0
>> 0 = 0
>> . = <some number>
>> t = <some number>
>> x = <some number>
>> t = <some number>
>>
>> That works, but only up to 9. That's a pitfall I discovered early on.
>> So I then continued and changed the algorithm to say: "if the next
>> character is also a number, then the current number gets incremented
>> by 10", which ends up with: (omitting the .txt)
>> 88.txt
>> 8 = 18
>> 8 = 8
>>
>> 100.txt
>> 1 = 11
>> 0 = 10
>> 0 = 0
>>
>> Doing that improved it greatly, but it still gives unexpected results.
>> So what I was about to try next is making the algorithm say: "increment
>> the current number by the previous number. If there is no previous
>> number, but there are next numbers, just add 10." That would end up
>> looking like this:
>> 88.txt // stays the same
>> 8 = 18
>> 8 = 8
>>
>> 100.txt // differs
>> 1 = 11
>> 0 = 10 (+11) = 21
>> 0 = 0
>>
>> I still have to see if that one nails all the issues.
>> Once I get that working, it's just a simple matter of putting all
>> the numbers in one string again, so for the last example that would
>> end up as the string "11210". But I'm not there - by far! It's
>> very complicated to get right, and I fear that this last option also
>> has some unexpected consequences.
>>
>> If you have any other idea to get this working, please do share. This
>> issue is really nagging me and I want to get it fixed. It would be the
>> fastest natural sorting algorithm that I know of if it works :)
>>
>> Cheers,
>> Mark
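
As a standalone illustration of the underlying pitfall (this snippet is
not from the thread): a plain byte-wise comparison puts "100.txt"
before "88.txt" because '1' < '8', so any per-digit mapping has to take
the length of the digit run into account as well:

    #include <QByteArray>
    #include <QDebug>

    int main()
    {
        const QByteArray a = "88.txt";
        const QByteArray b = "100.txt";

        // Prints false: byte-wise, "88.txt" compares greater than "100.txt",
        // although 88 < 100 numerically and natural sorting wants 88.txt first.
        qDebug() << (a < b);
        return 0;
    }
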
>
> Note: one way that will certainly fix it is prefixing with 0. So if we
> take the 88 and 100 that would end up like this:
> 088
> 100
>
> That will work with sorting, but all "collation sequences" will be as
> long as the longest string, which is probably not what we want.

Right, this is not possible because we never know in advance what the
'longest string' will be.

Christoph had an idea earlier in this thread. Here is a quote:

> Also the last inability can be solved in the same way. Simply
> _prepend_ a code which states the magnitude (number of digits).
> Example:
>
>        'a1234' -> 'a' '\004' '1234'
>        'a56' -> 'a' '\002' '56'
>        'a8' -> 'a' '\001' '8'

The only problem I see here is that numbers might have more than 10
digits, in which case the prepending might have to be done
recursively. But there is probably a way to resolve this.
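
A rough sketch of what such a key builder could look like, assuming the
digit count fits into a single byte (longer runs would still need the
recursive prefixing mentioned above, and non-ASCII characters and case
folding are ignored here):

    #include <QByteArray>
    #include <QString>

    QByteArray collatingKey(const QString &name)
    {
        QByteArray key;
        int i = 0;
        while (i < name.size()) {
            if (name.at(i).isDigit()) {
                // Collect the whole digit run and prepend its length, so that
                // shorter numbers sort before longer ones and equally long
                // numbers are compared digit by digit.
                const int start = i;
                while (i < name.size() && name.at(i).isDigit())
                    ++i;
                const QByteArray digits = name.mid(start, i - start).toLatin1();
                int length = int(digits.size());
                if (length > 255)
                    length = 255;          // would need recursion to be exact
                key.append(char(length));
                key.append(digits);
            } else {
                key.append(name.at(i).toLatin1());
                ++i;
            }
        }
        return key;
    }

    // collatingKey("a8")    -> 'a' '\001' "8"
    // collatingKey("a56")   -> 'a' '\002' "56"
    // collatingKey("a1234") -> 'a' '\004' "1234"
    // Comparing these keys with QByteArray's operator< gives a8 < a56 < a1234.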

I'll be mostly offline during the next few days, but after that I'm
happy to discuss any ideas on how to resolve the 'natural collating
sequence' issue.

Best regards,
Frank



