<p dir="ltr">Sorry, I forgot to "Reply All", so this didn't go to the list.</p>

<div class="gmail_quote">---------- Forwarded message ----------<br>From: "Henry de Valence" &lt;<a href="mailto:hdevalence@hdevalence.ca">hdevalence@hdevalence.ca</a>&gt;<br>Date: 16 Jul 2013 10:54<br>Subject: Re: [Kstars-devel] Some very preliminary results with OpenCL<br>

To: "Daniel Baboiu" &lt;<a href="mailto:daniel.baboiu@shaw.ca">daniel.baboiu@shaw.ca</a>&gt;<br>Cc: <br><br type="attribution"><p dir="ltr"><br>

On 16 Jul 2013 09:17, "Daniel Baboiu" &lt;<a href="mailto:daniel.baboiu@shaw.ca" target="_blank">daniel.baboiu@shaw.ca</a>&gt; wrote:<br>

><br>

> What is your setup, what CPU and GPU are you using? I've seen benchmarks<br>

> claiming 30x improvement (and I got similar results myself), but in<br>

> those cases, the CPU code was single-threaded. Once I multithreaded it<br>

> (with OpenMP -- quite simple to implement), the improvement dropped to<br>

> (still impressive) 5-6x.</p>

<p dir="ltr">I'm using an AMD Radeon 7850 and an Intel i5-3350P. In this case the CPU code is single-threaded, since KStars does no threading whatsoever.</p>

<p dir="ltr">I expect that the gap between the CPU and the GPU won't be nearly as dramatic once all is said and done and we are comparing the same algorithms on different hardware (the above is a comparison of a faster algorithm on faster hardware against a slower algorithm on slower hardware).</p>


<p dir="ltr">But the baseline isn't KStars with OpenMP and well-designed code, it's KStars as it is in the master branch: a mess of linked lists of arrays of objects with virtual methods that do the computation by modifying the internal state of those objects in complicated and unpredictable ways.</p>


<p dir="ltr">Most of the effort, indeed, is orthogonal to how the computation is actually carried out (OpenMP, OpenCL, single-threaded CPU code, etc.) and relates to slightly bigger issues: the algorithms we use, how the data is stored, and so on.</p>


<p dir="ltr">To get real numbers on how the CPU compares to the GPU we have to wait until I've finished more of the work: this is just an encouraging first proof-of-concept (a GPU is better at lots of little matrix operations, quelle surprise).</p>


<p dir="ltr">> Linearization of the problem is not necessary, as the GPU can handle<br>

> trig functions.</p>

<p dir="ltr">Yes, it is not necessary, but I think that it is desirable. All of the other transformations work on vectors, so doing this one with the existing algorithm requires one set of trig calls to obtain the spherical coordinates, another set to do the transformation, and yet another set to get a vector again. It seems less than optimal.</p>
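<p dir="ltr">To make the cost of that round trip concrete, here is a minimal Python sketch. It is not KStars code: the function names are hypothetical, and a plain longitude shift stands in for the real transformation, chosen only because the same step can also be written as a rotation acting directly on the vector. The angle-based path spends two inverse-trig calls and four trig calls per point, while the vector path needs only multiplications and additions once the rotation constants are known.</p>

```python
import math


def vector_to_spherical(v):
    """Unit vector -> (longitude, latitude) in radians: two inverse-trig calls."""
    x, y, z = v
    # Clamp z so rounding error cannot push asin() outside its domain.
    return math.atan2(y, x), math.asin(max(-1.0, min(1.0, z)))


def spherical_to_vector(lon, lat):
    """(longitude, latitude) -> unit vector: four trig calls."""
    cos_lat = math.cos(lat)
    return (cos_lat * math.cos(lon), cos_lat * math.sin(lon), math.sin(lat))


def via_angles(v, angle_step):
    """Apply an angle-space step to a vector by round-tripping through angles."""
    lon, lat = vector_to_spherical(v)
    lon, lat = angle_step(lon, lat)
    return spherical_to_vector(lon, lat)


def rotate_z(v, c, s):
    """Rotate v about the z axis, given c = cos(angle) and s = sin(angle).

    c and s are computed once per batch, so no trig is needed per point.
    """
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


# Example: shifting longitude by 0.1 rad, done both ways.
v = spherical_to_vector(0.3, 0.4)
by_angles = via_angles(v, lambda lon, lat: (lon + 0.1, lat))
by_vector = rotate_z(v, math.cos(0.1), math.sin(0.1))
```

<p dir="ltr">Both paths give the same point up to rounding; the difference is only in how many transcendental function calls each point costs.</p>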


<p dir="ltr">Also, if we are concerned enough about accuracy to be using doubles at such great expense (16x slower on my card, for instance, and I think nVidia cripples their consumer cards just as badly), I don't know that we should be using native_sin() and friends instead of the slower trig functions whose precision is specified by the standard.</p>


<p dir="ltr">The alternative is stereographic projection + scaling + deprojection, which just needs standard operations plus possibly a square root. It seems preferable, but I may implement the existing algorithm in the meantime just to be able to run a complete pipeline.</p>
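<p dir="ltr">As a rough illustration of the projection + scaling + deprojection idea, here is a minimal Python sketch. The choice of the south pole (0, 0, -1) as the projection point, the function names, and the scale factor k are all assumptions made for the example, not necessarily the formulation KStars would end up using.</p>

```python
def stereographic_project(v):
    """Project a unit vector from the south pole (0, 0, -1) onto the plane z = 0.

    Undefined at the south pole itself, where 1 + z == 0.
    """
    x, y, z = v
    return x / (1.0 + z), y / (1.0 + z)


def stereographic_deproject(p):
    """Map a plane point back to the unit sphere (inverse of the projection)."""
    px, py = p
    r2 = px * px + py * py
    d = 1.0 + r2
    return 2.0 * px / d, 2.0 * py / d, (1.0 - r2) / d


def scale_on_sphere(v, k):
    """Project, scale the plane coordinates by k, then deproject."""
    px, py = stereographic_project(v)
    return stereographic_deproject((k * px, k * py))


# Example: a unit vector 36.87 degrees from the +z axis, scaled by k = 3.
moved = scale_on_sphere((0.6, 0.0, 0.8), 3.0)
```

<p dir="ltr">With this choice of pole, scaling by k multiplies the tangent of half the angle from the +z axis by k, and the whole thing uses only additions, multiplications and divisions. No square root appears because the input is assumed to be normalized already; normalizing an arbitrary vector first would add one.</p>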


<p dir="ltr">Cheers,<br>

Henry</p>

</div>