Sorting speed improvement
Eray Ozkural
exa at kablonet.com.tr
Wed Apr 23 19:40:40 CEST 2003
Hi Jaime,
Below you will find a more detailed analysis. I hope it helps you come up with
the ultimate string sort code!!!
On Wednesday 23 April 2003 17:25, Jaime Torres wrote:
> Second. In the University, I've learned two complexity notations:
> O(....) o in uppercase, that talks about the higher exponent in the
> complexity
> and
> o(.....) o in lowercase, that talks about all the factors in the
> complexity.
>
Err. To be exact, big-oh notation denotes an inequality which is used to
characterize the asymptotic behavior of a function... So the "higher
exponent" notion comes from that if you recall....
> In that way, an algorith with o(n^2*k) or o(3*n^2+50) is O(n^2) if k is
> constant.
> In this notation, you are talking about an o(n*k) complexity.
>
Definitely right. But your program (radix sort) is O(k.n) which was my point.
k isn't a constant in general radix sort. The correct analysis is:
The worst case running time of postman's sort is O(k.n) where k is the
maximum length of strings in input set and n the number of elements in the
input set.
[Note: In the bottom of the post, you will see how that bound is derived]
So, there really are two parameters.....
Note "worst case". Average case analysis is probably better than O(k.n)....
> Does anyone knows a sorting algorithm faster than a radix sort ?
>
There are linear-time algorithms called counting sort and bucket sort when we
have other conditions in the input set.... Radix sort is among the fastest
surely with O(d.n) worst case running time complexity! However, it is O(n)
only when there is a constant number of digits (d constant), like 32. This of
course corresponds to the condition of counting sort that the input set is
drawn from a finite interval in Z.... However, radix sort was obviously not
invented by that guy. His insight was that radix sort could be implemented
efficiently when we are sorting alphanumerical strings..... It therefore
isn't fair to call "postman's algorithm" the WORLD'S FASTEST SORTING PROGRAM.
There are three equivalent algorithms, and postman's algorithm is an instance
of one of them, a special case.
> I do not know where I said that this was "my" "program", but I will review
> it
Isn't the code you posted yours? That's what I meant by "program", not
algorithm.
> and fix it. And I will include links to the other pages that talk about
> this algorithm.
>
Sure. nist.gov has a page that google turns up.
> > 3) Your implementation seems to be OK but you have to be
> > skeptical when you
> > are implementing a sorting algorithm! Did you actually test it?
>
> Not the implementation I made yesterday night. Well, I'm using the
> generated QT to run KDE now, but this is not a good test.
>
Wrt correctness it seems right. It actually takes a while to understand an
implementation of radix sort with strings but I ran it a few times in my
head, it looks right ;)
> I do not know enougth of QT (still) to move the pointers, I just used the
> provided methods (QT methods are easy to learn).
>
Hmm. What I thought was this. If the algorithm actually copied strings it
would be useless. That would automatically make its complexity O(k^2.n) which
is worse than O(k.n.logn) for large n and k. So, you shouldn't be doing that.
:) However, there is reference counting in Qt, so maybe you are already
avoiding that in the code. I am not sure :)
> This implementation, therefore, is quite expensive in CPU and memory,
> and, of course, is not optimal at all. It was intended as a hint.
>
Just to make sure where the cost is. Assuming that we are avoiding the string
copy operation in the inner loop which would destroy the purpose of the
algorithm, and instead simply copying pointers to strings, the only remaining
cost is dynamic memory allocation/deletion. Here, you are using
QStringList::clear and QStringList::append which will incur dynamic memory
allocation code even if you are not copying wholesome strings into them.. (I
don't know the exact implementation in Qt right now maybe somebody else can
point out)
It looks like we are making O(k.n) appends here in the worst case. This is
likely to suffer a lot of context switches with ordinary memory allocation
routines, and I don't know if the fast malloc routines will help. The safer
way is to use a custom list operation with a fast memory allocation routine..
We can pre-allocate all the mem. The required size is proportional to the
size of the input list so it's rather easy to compute (since we have to store
only pointers in the intermediate lists) Now, it's getting interesting :))))
> > Also, w.r.t. point 1, quicksort with ordinary string
> > comparison would really
> > have O(n*logn*k) worst case running time complexity... Is
> > that the case
> > currently in QStringList? If so, I strongly suggest that this
> > algorithm be
> > made available as an optional code in Qt, maybe something like
> > QStringList::fastSortASCII(). Please license it under BSD
> > after you test it!
> > It would be nice if you could write a test suite code with
> > assert()'s spewed
> > out all over the place. You should already have such a thing, no?
>
> First. In the above notation, the Quicksort algorith is
> o(n*log2(n)*k) or O(n*log2(n)).
>
Errr. That's wrong, you are misinterpreting the big-oh notation, there are two
independent variables there...... That's why it goes into big-oh. Note again
that the maximum length of a string isn't constant here!!!
Also I've made a mistake in the above quote. Correction:
quicksort with ordinary string comparison would really have O(k.n^2) worst
case running time complexity and O(k.n.logn) average case running time
complexity.
I wonder if they are careful about moving only pointers to strings!!!
Otherwise their heapsort implementation would be O(k^2.n.logn) instead of
O(k.n.logn)!!!!
> Third. In the Quicksort, there are String comparations, that are more
> expensive than accesing an array.
>
?????? Both quicksort and heapsort have the same average case running time
complexity. (quicksort has quadratic worst case) *and* they are both
comparison sort. k goes into big-oh *because* of the string comparisons.
> Fourth. No, I do not have such an implementation, but I will do it (now
> that I have again some free time) and release it under an Open Source
> License.
That's nice! Would make it easier for other projects to adapt it.
Here I'm quoting from the analysis of radix sort from the aforementioned
textbook because I don't want you to go to all the trouble deriving it from
scratch :)
The analysis of the running time depends on the stable sort used as the
intermediate sorting algorithm. When each digit is in the range 1 to k, and k
is not too large, counting sort is the obvious choice. Each pass over n
d-digit numbers then takes time \theta(n+k). There are d passes, so the total
time for radix sort is \theta(dn + kd). When d is constant and k = O(n),
radix sort runs in linear time.
So, since d isn't constant in this problem, it isn't linear time!!
Tight bound \theta becomes upper bound big-oh because not all strings have the
same length!! k is not too large, as a matter of fact a constant, 256 for
ASCII but surely it's a large constant... so I think we can remove kd, the
resulting time complexity is O(d.n). This concludes the analysis! [*]
Happy hacking and thanks for your contribution,
[*] I know what I'm saying!!!
--
Eray Ozkural (exa) <erayo at cs.bilkent.edu.tr>
Comp. Sci. Dept., Bilkent University, Ankara KDE Project: http://www.kde.org
www: http://www.cs.bilkent.edu.tr/~erayo Malfunction: http://mp3.com/ariza
GPG public key fingerprint: 360C 852F 88B0 A745 F31B EA0F 7C07 AE16 874D 539C
More information about the Kde-optimize
mailing list