Ideas for improving sorting. Goal: under 1 second for 200.000 files.
todd rme
toddrme2178 at gmail.com
Thu Oct 25 12:42:17 BST 2012
On Thu, Oct 25, 2012 at 12:20 PM, Mark <markg85 at gmail.com> wrote:
> On Thu, Oct 25, 2012 at 11:41 AM, Frank Reininghaus
> <frank78ac at googlemail.com> wrote:
>> Hi Mark,
>>
>> 2012/10/25 Mark:
>>> On Wed, Oct 24, 2012 at 1:13 AM, Christoph Feck wrote:
>>>> Hi,
>>>>
>>>> you wrote:
>>>>> Copy KStringHandler::naturalCompare into dolphin for further in
>>>>> depth optimizations
>>>>
>>>> A good read about locale-aware sorting is this link from glibc doc:
>>>>
>>>> http://www.gnu.org/software/libc/manual/html_node/Collation-Functions.html
>>>>
>>>> Basically, it says to get an effective locale-aware sorting, you need
>>>> to cache a sort key ("collating sequence") for each string, then use
>>>> that for the sort algorithm. That sort key could additionally encode
>>>> the digit sequences in "magnitude - digits" order, so that "a12" comes
>>>> after "a2", and do case mapping for case insensitive compares.
>>>>
>>>> If you sort N = 200.000 entries, you only have to create those
>>>> 200.000 collating sequences once, instead of performing the
>>>> 2*N*log(N) = 7.000.000 expensive locale-aware comparisons. Comparing
>>>> the sequences is then as fast as a memcpy() if done properly. See
>>>> Thiago's posts about QString performance optimizations.
>>>>
>>>> So what you have to work on before doing anything else is to ask
>>>> Frank if he is okay with adding a sort key (basically a QByteArray) for
>>>> each file item. Your code would have to make sure it is always updated
>>>> on file renames, locale changes, sort mode changes (natural vs.
>>>> simple) etc.
>>>>
>>>> For more information about collating, I already gave you the link for
>>>> the official Unicode collating code. You would need to improve the
>>>> algorithm to include the "magnitude - digits" coding.
>>>>
>>>> Christoph Feck (kdepepo)
>>>> KDE Quality Team
>>>
>>> Hi Christoph,
>>>
>>> Thanks a lot for your suggestion! However, there is a "bit" of a
>>> knowledge problem here. I don't have the knowledge - by far - to
>>> implement something like you suggest, or to fully understand what
>>> Thiago wrote about string optimizations. I like to read that kind of
>>> stuff and I understand it partly, but implementing it is a whole
>>> different story.
>>>
>>> I'm getting there in knowledge, just give me a few more years :)
>>>
>>> I guess I'll just keep to the "simpler" optimizations for now. Unless
>>> someone would like to help me implement your suggestion?
>>
>> But which 'simple' optimizations remain that could yield a
>> considerable performance improvement? Christoph's suggestion is
>> essentially what you call 2.1 and 2.2.
>
> With "simple" I mean "doable" and "understandable" for me. I'm not at
> the programming level of you, Christoph, and certainly not Thiago, so I
> have to take the routes I can manage. Yes, I will certainly try to
> understand how Christoph's suggestion works and implement it, but don't
> expect that I can.
>
> Let's just wait and see what I can come up with, okay? :)
> Perhaps just doing 2.2 might already speed up naturalCompare
> massively. I don't know, still have to implement and benchmark it.
>>
>> I like the idea, and I'm willing to review any patches, of course, but
>> I'm afraid I'm too busy with other things to work on the code myself
>> :-(
>>
>> Best regards,
>> Frank
>
> A question about that glibc link (in an attempt to understand it).
> So I would end up with a:
> struct SortEntry
> {
> QByteArray collatingKey;
> KFileItem *fileItem;
> };
>
> What is the collatingKey meant to contain? Could you give an example
> for - let's say - five different names?
> - a0.txt
> - 111.txt
> - A01.txt
> - aaa.txt
> - a1.txt
>
> I don't really understand what the "collating sequence" is supposed to
> be. The glibc link talks about it, but it still doesn't help me
> understand what it is exactly. If one of you can explain it in simple
> terms with the example files above, that would certainly help me.
>
> I expect the end result to look like this (this is how Windows
> sorts it, so let's take that as an example):
> 111.txt
> a0.txt
> A01.txt
> a1.txt
> aaa.txt
>
> Cheers,
> Mark
My understanding is that you would have some sort of mapping from
characters to sort order, a mapping which is locale-dependent. I don't
know exactly how it is implemented in the function described, but it
could be implemented as a sequence of unsigned integers. Characters
outside of digit sequences would be converted to unsigned integers.
Digit sequences, depending on how sophisticated the algorithm is,
would be converted to unsigned integers, signed integers (if it
detects negative signs before numbers), or even floats (if it detects
decimals).
Let's assume that, for this implementation, numbers sort before
characters, and each element in the array is a pair: one bit that says
whether the element is a number or a character, and an unsigned
integer. 0 means number, 1 means character. The sorting is
case-insensitive.
https://www.ddms.com/resources/help/reportsmenu/ascii_sort_order_chart.htm
111.txt --> [0,111] --- (1,14) (1,52) (1,56) (1,52)
a0.txt  --> (1,33) [0,0] --- (1,14) (1,52) (1,56) (1,52)
A01.txt --> (1,33) [0,1] --- (1,14) (1,52) (1,56) (1,52)
a1.txt  --> (1,33) [0,1] --- (1,14) (1,52) (1,56) (1,52)
aaa.txt --> (1,33) (1,33) (1,33) --- (1,14) (1,52) (1,56) (1,52)
For clarity I put numbers in [] and characters in (), although in the
implementation they will both be the same data structure. I also put
--- between the file name and the extension because the extension
is the same in each case.
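The pair encoding above can be sketched in plain C++. This is a minimal
sketch, not Dolphin's actual code: the character mapping here is just
case-folded ASCII, so the concrete values differ from the chart, and a
real implementation would use a locale table and Qt types like
QString/QByteArray.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One key element, matching the [0,n] / (1,n) notation above:
// .first is 0 for a number run and 1 for a character, .second the value.
using KeyElem = std::pair<uint8_t, uint64_t>;
using CollatingKey = std::vector<KeyElem>;

CollatingKey makeCollatingKey(const std::string &name)
{
    CollatingKey key;
    for (std::size_t i = 0; i < name.size(); ) {
        unsigned char c = static_cast<unsigned char>(name[i]);
        if (std::isdigit(c)) {
            // Collapse a whole digit run into one value, so "a12"
            // compares after "a2" by magnitude, and "01" equals "1".
            uint64_t value = 0;
            while (i < name.size() &&
                   std::isdigit(static_cast<unsigned char>(name[i]))) {
                value = value * 10 + (name[i] - '0');
                ++i;
            }
            key.emplace_back(0, value);
        } else {
            // Case-insensitive: fold to lowercase before mapping.
            // A locale-aware version would look the character up in a
            // collation table here instead.
            key.emplace_back(1, static_cast<uint64_t>(std::tolower(c)));
            ++i;
        }
    }
    return key;
}
```

Since std::pair and std::vector already compare lexicographically, two
keys can be compared with operator< directly, and a (0,n) number element
always sorts before a (1,n) character element.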
So, in practice, there would be two components.
One would be a mapping of some sort between characters and numerical
codes, which would be locale-dependent (I don't know the most efficient
data structure for this sort of mapping). This mapping would only be
done once per file name, either done in the sorting method and cached,
or done in the filename object and lazy-loaded the first time it is
used. Numbers would be converted directly to their uint equivalents.
The second would be the sorting algorithm. It would go through
corresponding elements one at a time until it finds a mismatch. In
each case it would first compare the first element of each
corresponding pair, the bit; numbers would be placed before characters
that way. If the bits are the same, a numerical comparison is done
between the second elements of the pair. This would put smaller
numbers before larger numbers, and lower-sorted characters before
higher-sorted characters.
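That comparison step can be sketched as follows (hypothetical names; in
Dolphin the entry would hold a KFileItem* and the key would be built
once per item and cached, as Christoph suggested):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// A precomputed entry: the cached collating key plus whatever it refers
// to (a KFileItem* in Dolphin; just the name in this sketch).
struct SortEntry {
    std::vector<std::pair<uint8_t, uint64_t>> collatingKey;
    std::string name;
};

// std::pair compares .first (the number/character bit) before .second,
// and std::vector compares element by element until the first mismatch,
// which is exactly the walk described above. No locale work happens at
// sort time; it all went into building the keys.
bool lessByKey(const SortEntry &a, const SortEntry &b)
{
    return a.collatingKey < b.collatingKey;
}

void sortEntries(std::vector<SortEntry> &entries)
{
    std::sort(entries.begin(), entries.end(), lessByKey);
}
```

With keys built once per file, the O(N log N) comparisons reduce to
cheap integer comparisons, which is the point of Christoph's suggestion.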
There are four important points here.
1. The conversion between character and number only happens once per filename.
2. The mapping between characters and numbers can be defined by the l10n
group on a per-locale basis, although this could be implemented later.
3. The algorithm doesn't know or care whether it is dealing with
numbers or characters, or what the locale is; all it sees is an
ordinary numerical comparison on bits and integers.
4. It would be possible to start with simple uint values and add
support for signed or floating-point values later, since this would be
independent of both the locale and the sorting algorithm (as long as
the sorting algorithm knows how to deal with the other numeric types).
Whether this is the absolute best method to implement such a system I
don't know. But hopefully it is at least a starting point to begin
thinking about it.
-Todd
More information about the kfm-devel mailing list