Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mark markg85 at gmail.com
Sat Dec 29 13:12:50 GMT 2012


On Sat, Dec 29, 2012 at 1:05 PM, Frank Reininghaus
<frank78ac at googlemail.com> wrote:
> Hi,
>
> 2012/12/28 Mark:
>> On Fri, Dec 28, 2012 at 3:41 PM, Mark wrote:
>>> On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus
>>> <frank78ac at googlemail.com> wrote:
>>>> Hi,
>>>>
>>>> 2012/10/24 Christoph Feck:
>>>>> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote:
>>>>>> Speed of comparing the sequences is that of a memcpy()
>>>>>
>>>>> Of course I meant memcmp().
>>>>>
>>>>>> So what you have to work on before doing anything else is to ask
>>>>>> Frank if he is okay with adding a sort key (QByteArray basically)
>>>>>> for each file item. Your code would have to make sure it is always
>>>>>> updated on file renames, locale changes, sort mode changes
>>>>>> (natural vs. simple) etc.
>>>>>
>>>>> Another idea is to not store this side information "permanently", but
>>>>> only create it temporarily when sorting. So you effectively only sort
>>>>> a list with side information.
>>>>>
>>>>>         struct SortEntry
>>>>>         {
>>>>>                 QByteArray collatingKey;
>>>>>                 KFileItem *fileItem;
>>>>>         };
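
(Just to check that I read this idea right: roughly something like the
sketch below, where computeCollatingKey() is only a placeholder for
whatever key generation we end up with, not existing code.)

    #include <KFileItem>
    #include <QByteArray>
    #include <QList>
    #include <QtAlgorithms>

    struct SortEntry
    {
        QByteArray collatingKey;
        KFileItem *fileItem;
    };

    // Placeholder: turns a display name into a natural collating key.
    QByteArray computeCollatingKey(const QString &name);

    bool lessThanByKey(const SortEntry &a, const SortEntry &b)
    {
        return a.collatingKey < b.collatingKey;   // byte-wise, memcmp-like
    }

    void sortByCollatingKey(QList<KFileItem *> &items)
    {
        // Build the temporary list with the side information once ...
        QList<SortEntry> entries;
        entries.reserve(items.count());
        foreach (KFileItem *item, items) {
            SortEntry entry;
            entry.collatingKey = computeCollatingKey(item->text());
            entry.fileItem = item;
            entries.append(entry);
        }

        // ... sort it by the precomputed keys ...
        qSort(entries.begin(), entries.end(), lessThanByKey);

        // ... and read the items back out in sorted order.
        for (int i = 0; i < entries.count(); ++i) {
            items[i] = entries.at(i).fileItem;
        }
    }
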
>>>>
>>>> I have been thinking about sorting from time to time, and I had an
>>>> idea of how to store the sort key. It could be done either
>>>>
>>>> a) In the ItemData's "values" hash (as a QVariant with a key like
>>>> "naturalCollatingSequence"), or,
>>>>
>>>> b) in a new member "QByteArray naturalCollatingSequence" in ItemData
>>>> (if a QByteArray seems more suitable because QByteArrays can be
>>>> compared with < directly and the casting overhead for each comparison
>>>> seems too high).
>>>>
>>>> When comparing two ItemData, we would then first check if both already
>>>> have a collating sequence, and calculate and store them if that is not
>>>> the case. One could then just compare those sequences.
>>>>
>>>> When an item changes, the collating sequence would be deleted from the
>>>> hash (or set to an empty QByteArray, respectively).
>>>>
>>>> This would make sure that the collating sequence is calculated only
>>>> once for each item, and only if it is really needed.
>>>>
>>>> Best regards,
>>>> Frank
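
(If I understand option b) correctly, it would boil down to something
like the simplified sketch below; this is not the real ItemData, and
computeCollatingSequence() is again just a placeholder.)

    #include <KFileItem>
    #include <QByteArray>

    // Simplified sketch, not the actual KFileItemModel::ItemData.
    struct ItemData
    {
        KFileItem item;
        QByteArray naturalCollatingSequence;   // empty = not computed yet
    };

    // Placeholder for the actual key generation.
    QByteArray computeCollatingSequence(const QString &name);

    bool lessThan(ItemData *a, ItemData *b)
    {
        // Compute the key lazily, the first time an item takes part in a
        // comparison; every later comparison is a cheap QByteArray check.
        if (a->naturalCollatingSequence.isEmpty()) {
            a->naturalCollatingSequence = computeCollatingSequence(a->item.text());
        }
        if (b->naturalCollatingSequence.isEmpty()) {
            b->naturalCollatingSequence = computeCollatingSequence(b->item.text());
        }
        return a->naturalCollatingSequence < b->naturalCollatingSequence;
    }

    // On rename, locale change, sort mode change, ... just invalidate:
    //     itemData->naturalCollatingSequence.clear();
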
>>>
>>> Hi Frank,
>>>
>>> That's a nice idea, and close to what I was experimenting with as well.
>>>
>>> The issue I keep running into is building the collating sequence
>>> itself. For example, take the following file names:
>>>
>>> 88.txt
>>> 100.txt
>>>
>>> What I'm doing right now is translating every character to a number
>>> and then comparing the numbers. Another way would be to convert them
>>> all to numbers and then build one long string that contains all the
>>> numbers. So I end up with:
>>>
>>> 88.txt
>>> 8 = 8
>>> 8 = 8
>>> . = <some number>
>>> t = <some number>
>>> x = <some number>
>>> t = <some number>
>>>
>>> 100.txt
>>> 1 = 1
>>> 0 = 0
>>> 0 = 0
>>> . = <some number>
>>> t = <some number>
>>> x = <some number>
>>> t = <some number>
>>>
>>> That works, but only up to 9. That's a pitfall I discovered early on.
>>> So I continued and changed the algorithm to say: "if the next
>>> character is also a digit, then the current number gets incremented
>>> by 10", which ends up with (omitting the .txt):
>>> 88.txt
>>> 8 = 18
>>> 8 = 8
>>>
>>> 100.txt
>>> 1 = 11
>>> 0 = 10
>>> 0 = 0
>>>
>>> Doing that improved it greatly, but it still gives unexpected results.
>>> So what I was about to try next is making the algorithm say: "increment
>>> the current number by the previous number; if there is no previous
>>> number, but more digits follow, just add 10", so that it would end up
>>> looking like this:
>>> 88.txt // stays the same
>>> 8 = 18
>>> 8 = 8
>>>
>>> 100.txt differs
>>> 1 = 11
>>> 0 = 10 (+11) = 21
>>> 0 = 0
>>>
>>> I still have to see whether that one nails all the issues.
>>> Once I get that working, it's just a simple matter of putting all
>>> the numbers into one string again, so the last example would end up
>>> as the string "11210". But I'm not there yet, by far! It's very
>>> complicated to get right, and I fear that this last option also has
>>> some unexpected consequences.
>>>
>>> If you have any other idea to get this working, please do share. This
>>> issue is really nagging me and I want to get it fixed. If it works, it
>>> would be the fastest natural sorting algorithm that I know of :)
>>>
>>> Cheers,
>>> Mark
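
(The pitfall in a nutshell, as a tiny standalone snippet; nothing
Dolphin-specific:)

    #include <QDebug>
    #include <QString>

    int main()
    {
        // Comparing character codes one by one: '1' (0x31) < '8' (0x38),
        // so "100.txt" sorts before "88.txt" even though 100 > 88.
        const QString a = QLatin1String("88.txt");
        const QString b = QLatin1String("100.txt");
        qDebug() << (a < b);   // false: lexicographically "100.txt" comes first
        return 0;
    }
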
>>
>> Note: one way that will certainly fix it is prefixing with 0. If we
>> take the 88 and 100 again, that would end up like this:
>> 088
>> 100
>>
>> That will sort correctly, but all "collating sequences" will be as
>> long as the longest string, which is probably not what we want.
>
> Right, this is not possible because we never know what the 'longest
> string' will be.
>
> Christoph had an idea earlier in this thread. Here is a quote:
>
>> Also the last inability can be solved in the same way. Simply
>> _prepend_ a code which states the magnitude (number of digits).
>> Example:
>>
>>        'a1234' -> 'a' '\004' '1234'
>>        'a56' -> 'a' '\002' '56'
>>        'a8' -> 'a' '\001' '8'
>
> The only problem I see here is that numbers might have more than 10
> digits, in which case the prepending might have to be done
> recursively. But there is probably a way to resolve this.
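
(To make the "prepend the magnitude" idea concrete, here is a rough
sketch of how such a key could be built; it is not existing code, it
only handles Latin-1 text, and runs longer than 255 digits would indeed
need the recursive prefixing mentioned above.)

    #include <QByteArray>
    #include <QString>

    static bool isAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    QByteArray collatingKey(const QString &name)
    {
        const QByteArray latin = name.toLatin1();   // sketch only: real code needs proper collation
        QByteArray key;
        int i = 0;
        while (i < latin.size()) {
            if (isAsciiDigit(latin.at(i))) {
                int start = i;
                while (i < latin.size() && isAsciiDigit(latin.at(i))) {
                    ++i;
                }
                // Drop leading zeros so that "007" gets the same magnitude as "7".
                while (start < i - 1 && latin.at(start) == '0') {
                    ++start;
                }
                const int digits = i - start;
                key.append(char(qMin(digits, 255)));    // the magnitude prefix
                key.append(latin.mid(start, digits));   // the digits themselves
            } else {
                key.append(latin.at(i));
                ++i;
            }
        }
        return key;   // 'a1234' -> 'a' '\004' '1234', 'a8' -> 'a' '\001' '8'
    }
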
>
> I'll be mostly offline during the next few days, but after that I'm
> happy to discuss any ideas how to resolve the 'natural collating
> sequence' issue.
>
> Best regards,
> Frank

Hi Frank,

Yes, that "might" work in most cases, but it involves detecting runs
of digits. What to do about file names like "a123_01.txt", or any
other single character between numbers? Detecting those is something
I'd like to avoid completely, since it is only likely to go wrong in
unexpected corner cases.

My last idea, where I would increment a number by 10, will also not
work, since the values get into the range of the a-zA-Z characters;
thus you could again get the nasty issue that a10000.txt would be
sorted before aa.txt.

So I took a step back and looked at what we really want to do. If I
take the a88.txt and a100.txt examples again and convert them into
numbers, we get:
a88.txt = 97, 8, 8, 46, 116, 120, 116
a100.txt = 97, 1, 0, 0, 46, 116, 120, 116

If we sort based on the individual ints (97 vs 97, 8 vs 1, ...), then
a100.txt will be sorted before a88.txt. So I was thinking: why do I
even go through the hassle of comparing individual characters? If we
look at the whole string at once, it works.

So, my solution? Bigint! If we take the numbers from above again and
put them into one big integer, we get:
a88.txt =    978846116120116
a100.txt = 9710046116120116

Now I can clearly see that a88.txt is _way_ smaller than a100.txt, so
the sorting will be done properly. The only question left is how to
represent that long integer. A bunch of small ints would work, but I'm
thinking of using the BigInteger class from QCA:
http://delta.affinix.com/docs/qca/classQCA_1_1BigInteger.html

The next question is what the performance penalty will be once I
start using a BigInteger.
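
Roughly, something like this is what I have in mind (just a sketch,
names made up; and instead of an actual BigInteger, the comparison
helper below treats the decimal strings as numbers directly, which
would sidestep the big-integer arithmetic entirely):

    #include <QByteArray>
    #include <QString>

    // Build the "one big number per name" key described above: digits
    // contribute their numeric value, every other character its Unicode
    // code point, all concatenated into one decimal string.
    QByteArray bigNumberKey(const QString &name)
    {
        QByteArray key;
        for (int i = 0; i < name.length(); ++i) {
            const QChar c = name.at(i);
            if (c.isDigit()) {
                key += QByteArray::number(c.digitValue());     // '8' -> "8"
            } else {
                key += QByteArray::number(int(c.unicode()));   // 'a' -> "97"
            }
        }
        return key;   // "a88.txt" -> "978846116120116"
    }

    // Comparing the keys as numbers does not need real big-integer
    // arithmetic: a longer decimal string is the larger number, and keys
    // of equal length can be compared byte by byte (assuming no leading
    // zeros, i.e. the name does not start with the digit 0).
    bool numericallyLess(const QByteArray &a, const QByteArray &b)
    {
        if (a.length() != b.length()) {
            return a.length() < b.length();
        }
        return a < b;
    }
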



