Ideas for improving sorting. Goal: under 1 second for 200.000 files.

Mark markg85 at gmail.com
Wed Jan 2 15:42:37 GMT 2013


On Sat, Dec 29, 2012 at 2:12 PM, Mark <markg85 at gmail.com> wrote:
> On Sat, Dec 29, 2012 at 1:05 PM, Frank Reininghaus
> <frank78ac at googlemail.com> wrote:
>> Hi,
>>
>> 2012/12/28 Mark:
>>> On Fri, Dec 28, 2012 at 3:41 PM, Mark wrote:
>>>> On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus
>>>> <frank78ac at googlemail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> 2012/10/24 Christoph Feck:
>>>>>> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote:
>>>>>>> Speed of comparing the sequences is that of a memcpy()
>>>>>>
>>>>>> Of course I meant memcmp().
>>>>>>
>>>>>>> So what you have to work on before doing anything else, is to ask
>>>>>>> Frank, if he is okey with adding a sort key (QByteArray basically)
>>>>>>> for each file item. Your code would have to make sure it is always
>>>>>>> updated on file renames, locale changes, sort mode changes
>>>>>>> (natural vs. simple) etc.
>>>>>>
>>>>>> Another idea is to not store this side information "permanently", but
>>>>>> only create it temporarily when sorting. So you effectively only sort
>>>>>> a list with side information.
>>>>>>
>>>>>>         struct SortEntry
>>>>>>         {
>>>>>>                 QByteArray collatingKey;
>>>>>>                 KFileItem *fileItem;
>>>>>>         };
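
In code, sorting with such temporary entries could look roughly like
the sketch below; makeCollatingKey() is a placeholder for whatever key
generation we settle on, not existing code:

    #include <algorithm>
    #include <QByteArray>
    #include <QList>
    #include <KFileItem>

    struct SortEntry
    {
        QByteArray collatingKey;
        KFileItem *fileItem;
    };

    // Placeholder: must turn a file name into a byte sequence whose
    // byte-wise order matches the desired natural order.
    QByteArray makeCollatingKey(const QString &name);

    bool lessThan(const SortEntry &a, const SortEntry &b)
    {
        // QByteArray's operator< compares byte-wise, i.e. at the
        // speed of memcmp(), as Christoph says.
        return a.collatingKey < b.collatingKey;
    }

    void sortItems(QList<KFileItem *> &items)
    {
        QList<SortEntry> entries;
        entries.reserve(items.count());
        foreach (KFileItem *item, items) {
            SortEntry entry;
            entry.collatingKey = makeCollatingKey(item->name());
            entry.fileItem = item;
            entries.append(entry);
        }

        std::sort(entries.begin(), entries.end(), lessThan);

        for (int i = 0; i < entries.count(); ++i) {
            items[i] = entries[i].fileItem;
        }
    }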
>>>>>
>>>>> I have been thinking about sorting from time to time, and I had an
>>>>> idea of how to store the sort key. It could be done either
>>>>>
>>>>> a) In the ItemData's "values" hash (as a QVariant with a key like
>>>>> "naturalCollatingSequence"), or,
>>>>>
>>>>> b) in a new member "QByteArray naturalCollatingSequence" in ItemData
>>>>> (if a QByteArray seems more suitable because QByteArrays can be
>>>>> compared with < directly and the casting overhead for each comparison
>>>>> seems too high).
>>>>>
>>>>> When comparing two ItemData, we would then first check if both already
>>>>> have a collating sequence, and calculate and store them if that is not
>>>>> the case. One could then just compare those sequences.
>>>>>
>>>>> When an item changes, the collating sequence would be deleted from the
>>>>> hash (or set to an empty QByteArray, respectively).
>>>>>
>>>>> This would make sure that the collating sequence is calculated only
>>>>> once for each item, and only if it is really needed.
>>>>>
>>>>> Best regards,
>>>>> Frank
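
A rough sketch of option (b) with the lazy calculation; ItemData is
simplified here, and makeCollatingKey() is again a placeholder for the
actual key generation:

    #include <QByteArray>
    #include <KFileItem>

    struct ItemData
    {
        KFileItem item;
        QByteArray naturalCollatingSequence;   // empty = not calculated yet
    };

    QByteArray makeCollatingKey(const QString &name);   // placeholder

    bool lessThan(ItemData *a, ItemData *b)
    {
        // Calculate and store the sequences only when they are first
        // needed, so each item pays the cost at most once.
        if (a->naturalCollatingSequence.isEmpty()) {
            a->naturalCollatingSequence = makeCollatingKey(a->item.name());
        }
        if (b->naturalCollatingSequence.isEmpty()) {
            b->naturalCollatingSequence = makeCollatingKey(b->item.name());
        }
        return a->naturalCollatingSequence < b->naturalCollatingSequence;
    }

    // When an item changes (rename, locale change, sort mode change):
    //     data->naturalCollatingSequence.clear();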
>>>>
>>>> Hi Frank,
>>>>
>>>> That's a nice idea and close to what I was experimenting with as well.
>>>>
>>>> The issue I keep running into is building the collating sequence
>>>> itself. For example, take the following "file names":
>>>>
>>>> 88.txt
>>>> 100.txt
>>>>
>>>> What I'm doing right now is translating everything to numbers and
>>>> then just comparing the numbers. Another way would be to convert them
>>>> all to numbers and then build one long string that contains all the
>>>> numbers. So I end up with:
>>>>
>>>> 88.txt
>>>> 8 = 8
>>>> 8 = 8
>>>> . = <some number>
>>>> t = <some number>
>>>> x = <some number>
>>>> t = <some number>
>>>>
>>>> 100.txt
>>>> 1 = 1
>>>> 0 = 0
>>>> 0 = 0
>>>> . = <some number>
>>>> t = <some number>
>>>> x = <some number>
>>>> t = <some number>
>>>>
>>>> That works, but only up to 9 - a pitfall I discovered early on. So I
>>>> continued and changed the algorithm to say: "if the next character is
>>>> also a digit, then the current number gets incremented by 10", which
>>>> ends up with (omitting the .txt):
>>>> 88.txt
>>>> 8 = 18
>>>> 8 = 8
>>>>
>>>> 100.txt
>>>> 1 = 11
>>>> 0 = 10
>>>> 0 = 0
>>>>
>>>> Doing that improved it greatly, but it still gives unexpected results.
>>>> So what I was about to try next is making the algorithm say: "increment
>>>> the current number by the previous number. If there is no previous
>>>> number, but there are next numbers, just add 10." That would end up
>>>> looking like this:
>>>> 88.txt // stays the same
>>>> 8 = 18
>>>> 8 = 8
>>>>
>>>> 100.txt differs
>>>> 1 = 11
>>>> 0 = 10 (+11) = 21
>>>> 0 = 0
>>>>
>>>> I still have to see if that one nails all the issues.
>>>> Once I get that working, it's just a simple matter of putting all the
>>>> numbers in one string again, so the last example would end up as the
>>>> string "11210". But I'm not there - by far! It's very complicated to
>>>> get right, and I fear that this last option also has some unexpected
>>>> consequences.
>>>>
>>>> If you have any other idea to get this working, please do share. This
>>>> issue is really nagging me and I want to get it fixed. If it works, it
>>>> would be the fastest natural sorting algorithm that I know of :)
>>>>
>>>> Cheers,
>>>> Mark
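
One way to sidestep the magic offsets entirely is to not build a key
at all, but to compare digit runs as numbers while walking both
strings: after skipping leading zeros, the longer run is the bigger
number. A minimal sketch (ties like "01" vs "1" would still need a
tiebreaker):

    #include <QString>

    // Returns <0, 0 or >0, like strcmp(), but digit runs compare as
    // numbers: "a88.txt" sorts before "a100.txt".
    int naturalCompare(const QString &a, const QString &b)
    {
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            if (a.at(i).isDigit() && b.at(j).isDigit()) {
                // Skip leading zeros on both sides.
                while (i < a.size() && a.at(i) == QLatin1Char('0')) ++i;
                while (j < b.size() && b.at(j) == QLatin1Char('0')) ++j;
                // Find the end of each digit run.
                int ia = i;
                int jb = j;
                while (ia < a.size() && a.at(ia).isDigit()) ++ia;
                while (jb < b.size() && b.at(jb).isDigit()) ++jb;
                // More digits means a bigger number.
                if (ia - i != jb - j) {
                    return (ia - i) - (jb - j);
                }
                // Same number of digits: compare them one by one.
                for (; i < ia; ++i, ++j) {
                    if (a.at(i) != b.at(j)) {
                        return a.at(i).unicode() - b.at(j).unicode();
                    }
                }
            } else {
                if (a.at(i) != b.at(j)) {
                    return a.at(i).unicode() - b.at(j).unicode();
                }
                ++i;
                ++j;
            }
        }
        // The string with characters left over sorts last.
        return (a.size() - i) - (b.size() - j);
    }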
>>>
>>> Note: one way that will certainly fix it is zero-padding. If we take
>>> the 88 and 100 examples, that would end up like this:
>>> 088
>>> 100
>>>
>>> That works for sorting, but then all "collation sequences" would be as
>>> long as the longest string, which is probably not what we want.
>>
>> Right, this is not possible because we never know what the 'longest
>> string' will be.
>>
>> Christoph had an idea earlier in this thread. Here is a quote:
>>
>>> Also the last inability can be solved in the same way. Simply
>>> _prepend_ a code which states the magnitude (number of digits).
>>> Example:
>>>
>>>        'a1234' -> 'a' '\004' '1234'
>>>        'a56' -> 'a' '\002' '56'
>>>        'a8' -> 'a' '\001' '8'
>>
>> The only problem I see here is that numbers might have more than 10
>> digits, in which case the prepending might have to be done
>> recursively. But there is probably a way to resolve this.
>>
>> I'll be mostly offline during the next few days, but after that I'm
>> happy to discuss any ideas on how to resolve the 'natural collating
>> sequence' issue.
>>
>> Best regards,
>> Frank
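
As a concrete sketch of that scheme: prefix every digit run with its
length as one raw byte, which covers runs of up to 255 digits
(anything longer would need the recursive extension mentioned above):

    #include <QByteArray>
    #include <QString>

    static bool isAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Builds a key where every digit run is prefixed with its length:
    // "a88" -> 'a' '\002' "88" and "a100" -> 'a' '\003' "100", so a
    // plain byte-wise comparison puts 88 before 100.
    QByteArray makeCollatingKey(const QString &name)
    {
        const QByteArray utf8 = name.toUtf8();
        QByteArray key;
        key.reserve(utf8.size() + 8);
        int i = 0;
        while (i < utf8.size()) {
            if (isAsciiDigit(utf8.at(i))) {
                // Drop leading zeros so "007" and "7" get the same key.
                while (i + 1 < utf8.size() && utf8.at(i) == '0'
                       && isAsciiDigit(utf8.at(i + 1))) {
                    ++i;
                }
                const int start = i;
                while (i < utf8.size() && isAsciiDigit(utf8.at(i))) {
                    ++i;
                }
                key.append(char(i - start));            // the magnitude byte
                key.append(utf8.mid(start, i - start)); // the digits
            } else {
                key.append(utf8.at(i));
                ++i;
            }
        }
        return key;
    }

Note that this needs no special-casing for mixed names like
a123_01.txt: every digit run simply gets its own magnitude byte.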
>
> Hi Frank,
>
> Yes, that "might" work in most cases, but that involves detecting
> numeric types. So what to do about filenames that contain:
> "a123_01.txt" or any otther single character between numbers?  That
> involves detecting them which is something i'd like to avoid
> completely since it's only likely to have errors because of the
> unexpected corner cases.
>
> My last idea, where I increment a number by 10, will also not work,
> since the values get into the range of the a-zA-Z character codes, so
> you could again get the nasty issue that a10000.txt is sorted before
> aa.txt.
>
> So I took a step back and looked at what we really want to do. If I
> take the a88.txt and a100.txt examples again and convert them into
> numbers, you get:
> a88.txt = 97, 8, 8, 46, 116, 120, 116
> a100.txt = 97, 1, 0, 0, 46, 116, 120, 116
>
> If we sort that based on the individual ints (97 vs 97, 8 vs 1, ...),
> then a100.txt is sorted before a88.txt. So I was thinking: why do I
> even go through the hassle of comparing individual characters? If the
> comparison looks at the whole string at once, it will just work.
>
> So, my solution? Bigint! If we take the numbers from above again and
> put them into one big integer, we get:
> a88.txt =    978846116120116
> a100.txt = 9710046116120116
>
> Now I can clearly see that a88.txt is _way_ smaller than a100.txt, so
> the sorting will be done properly. The only question then is how to
> represent that long integer; a bunch of small ints would work, but I'm
> thinking of using the BigInteger class:
> http://delta.affinix.com/docs/qca/classQCA_1_1BigInteger.html
>
> The next question is what the performance penalty will be once I
> start using a BigInteger.
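
As code, the BigInteger idea might look something like the sketch
below. It assumes QCA is available and initialized; whether the
concatenation is unambiguous in all corner cases, and how fast
BigInteger comparisons really are, are exactly the open questions:

    #include <QtCrypto>   // QCA, provides QCA::BigInteger
    #include <QString>

    // Digits keep their own value, every other character contributes
    // its Unicode code point, and the pieces are concatenated into one
    // decimal string: "a88.txt" -> "978846116120116".
    QCA::BigInteger toBigKey(const QString &name)
    {
        QString digits;
        for (int i = 0; i < name.size(); ++i) {
            const QChar c = name.at(i);
            if (c.isDigit()) {
                digits += c;                            // '8' -> "8"
            } else {
                digits += QString::number(c.unicode()); // 'a' -> "97"
            }
        }
        return QCA::BigInteger(digits);
    }

    // Usage (QCA needs an Initializer before any of its classes):
    //     QCA::Initializer init;
    //     bool less = toBigKey("a88.txt").compare(toBigKey("a100.txt")) < 0;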

Don't bother thinking about this issue anymore. My brother (a far
better programmer than me) managed to write an extremely fast natural
sorting algorithm: 1 second for 1 million items, or 1.8 seconds
including initialization. For 200.000 items it's far less than that,
and that's just on a cheap AMD Fusion PC.

I will blog about it sometime this week or next week; that depends a
bit on my experiments with a KDirModel rewrite :)



