Sorting speed improvement

Wed Apr 23 19:40:40 CEST 2003

Hi Jaime,

Below you will find a more detailed analysis. I hope it helps you come up with 
the ultimate string sort code!!! 

On Wednesday 23 April 2003 17:25, Jaime Torres wrote:
> Second. In the University, I've learned two complexity notations:
> O(....)   o in uppercase, that talks about the higher exponent in the
> complexity
> and
> o(.....)  o in lowercase, that talks about all the factors in the
> complexity.
>

Err. To be exact, big-oh notation denotes an inequality which is used to 
characterize the asymptotic behavior of a function... So the "higher 
exponent" notion comes from that if you recall....

> In that way, an algorith with o(n^2*k) or o(3*n^2+50) is O(n^2) if k is
> constant.
> In this notation, you are talking about an o(n*k) complexity.
>

Definitely right. But your program (radix sort) is O(k.n) which was my point. 
k isn't a constant in general radix sort. The correct analysis is:
  The worst case running time of postman's sort is O(k.n) where k is the 
maximum length of strings in input set and n the number of elements in the 
input set.
[Note: In the bottom of the post, you will see how that bound is derived]

So, there really are two parameters.....

Note "worst case". Average case analysis is probably better than O(k.n)....

> Does anyone knows a sorting algorithm faster than a radix sort ?
>

There are linear-time algorithms called counting sort and bucket sort when we 
have other conditions in the input set.... Radix sort is among the fastest 
surely with O(d.n) worst case running time complexity! However, it is O(n) 
only when there is a constant number of digits (d constant), like 32. This of 
course corresponds to the condition of counting sort that the input set is 
drawn from a finite interval in Z.... However, radix sort was obviously not 
invented by that guy. His insight was that radix sort could be implemented 
efficiently when we are sorting alphanumerical strings..... It therefore 
isn't fair to call "postman's algorithm" the WORLD'S FASTEST SORTING PROGRAM. 
There are three equivalent algorithms, and postman's algorithm is an instance 
of one of them, a special case.

> I do not know where I said that this was "my" "program", but I will review
> it

Isn't the code you posted yours? That's what I meant by "program", not 
algorithm.

> and fix it.  And I will include links to the other pages that talk about
> this algorithm.
>

Sure. nist.gov has a page that google turns up.

> > 3) Your implementation seems to be OK but you have to be
> > skeptical when you
> > are implementing a sorting algorithm!  Did you actually test it?
>
> Not the implementation I made yesterday night. Well, I'm using the
> generated QT to run KDE now, but this is not a good test.
>

Wrt correctness it seems right. It actually takes a while to understand an 
implementation of radix sort with strings but I ran it a few times in my 
head, it looks right ;)

> I do not know enougth of QT (still) to move the pointers, I just used the
> provided methods (QT methods are easy to learn).
>

Hmm. What I thought was this. If the algorithm actually copied strings it 
would be useless. That would automatically make its complexity O(k^2.n) which 
is worse than O(k.n.logn) for large n and k. So, you shouldn't be doing that. 
:) However, there is reference counting in Qt, so maybe you are already 
avoiding that in the code. I am not sure :)

> This implementation, therefore, is quite expensive in CPU and memory,
> and, of course, is not optimal at all. It was intended as a hint.
>

Just to make sure where the cost is. Assuming that we are avoiding the string 
copy operation in the inner loop which would destroy the purpose of the 
algorithm, and instead simply copying pointers to strings, the only remaining 
cost is dynamic memory allocation/deletion. Here, you are using 
QStringList::clear and QStringList::append which will incur dynamic memory 
allocation code even if you are not copying wholesome strings into them.. (I 
don't know the exact implementation in Qt right now maybe somebody else can 
point out)

It looks like we are making O(k.n) appends here in the worst case. This is 
likely to suffer a lot of context switches with ordinary memory allocation 
routines, and I don't know if the fast malloc routines will help. The safer 
way is to use a custom list operation with a fast memory allocation routine.. 
We can pre-allocate all the mem. The required size is proportional to the 
size of the input list so it's rather easy to compute (since we have to store 
only pointers in the intermediate lists) Now, it's getting interesting :))))

> > Also, w.r.t. point 1, quicksort with ordinary string
> > comparison would really
> > have O(n*logn*k) worst case running time complexity... Is
> > that the case
> > currently in QStringList? If so, I strongly suggest that this
> > algorithm be
> > made available as an optional code in Qt, maybe something like
> > QStringList::fastSortASCII(). Please license it under BSD
> > after you test it!
> > It would be nice if you could write a test suite code with
> > assert()'s spewed
> > out all over the place.  You should already have such a thing, no?
>
> First. In the above notation, the Quicksort algorith is
> o(n*log2(n)*k) or O(n*log2(n)).
>

Errr. That's wrong, you are misinterpreting the big-oh notation, there are two 
independent variables there......  That's why it goes into big-oh. Note again 
that the maximum length of a string isn't constant here!!!

Also I've made a mistake in the above quote. Correction:
quicksort with ordinary string comparison would really have O(k.n^2) worst 
case running time complexity and O(k.n.logn) average case running time 
complexity.

I wonder if they are careful about moving only pointers to strings!!! 
Otherwise their heapsort implementation would be O(k^2.n.logn) instead of 
O(k.n.logn)!!!!

> Third. In the Quicksort, there are String comparations, that are more
> expensive than accesing an array.
>

?????? Both quicksort and heapsort have the same average case running time 
complexity. (quicksort has quadratic worst case) *and* they are both 
comparison sort. k goes into big-oh *because* of the string comparisons.

> Fourth. No, I do not have such an implementation, but I will do it (now
> that I have again some free time) and release it under an Open Source
> License.

That's nice! Would make it easier for other projects to adapt it.

Here I'm quoting from the analysis of radix sort from the aforementioned 
textbook because I don't want you to go to all the trouble deriving it from 
scratch :)

The analysis of the running time depends on the stable sort used as the 
intermediate sorting algorithm. When each digit is in the range 1 to k, and k 
is not too large, counting sort is the obvious choice. Each pass over n 
d-digit numbers then takes time \theta(n+k). There are d passes, so the total 
time for radix sort is \theta(dn + kd). When d is constant and k = O(n), 
radix sort runs in linear time.

So, since d isn't constant in this problem, it isn't linear time!!
Tight bound \theta becomes upper bound big-oh because not all strings have the 
same length!! k is not too large, as a matter of fact a constant, 256 for 
ASCII but surely it's a large constant... so I think we can remove kd, the 
resulting time complexity is O(d.n). This concludes the analysis! [*]

Happy hacking and thanks for your contribution,

[*] I know what I'm saying!!!

-- 
Eray Ozkural (exa) <erayo at cs.bilkent.edu.tr>
Comp. Sci. Dept., Bilkent University, Ankara  KDE Project: http://www.kde.org
www: http://www.cs.bilkent.edu.tr/~erayo  Malfunction: http://mp3.com/ariza
GPG public key fingerprint: 360C 852F 88B0 A745 F31B  EA0F 7C07 AE16 874D 539C