<div dir="ltr"><div>Hi all,</div><div><br></div><div>I'd like to propose some architectural changes to the KStars data</div><div>processing pipeline. Generally speaking, this would involve rewriting</div><div>the portions of the code between the data storage (in catalogs) and the</div>
<div>painting interface so that the functions involved are OpenCL kernels</div><div>that can be executed in parallel on the CPU or with massive parallelism</div><div>(hundreds of "cores") on the GPU.</div><div>
<br>
The steps that KStars must perform on the sky objects before they can
be displayed on the screen are roughly as follows:

1. Precession/Nutation;
2. Equatorial -> Horizontal conversion;
3. Projection to screen coordinates.

Currently, both 1 and 2 are done on the CPU with code that is neither
thread- nor SIMD-parallel. 3 is also done on the CPU, even in the case
of the GL backend I implemented three years ago, due to the decision to
use legacy direct-mode rendering instead of OpenGL vertex shaders (as I
recall, this was motivated by compatibility concerns, but in retrospect
it looks like a rather poor choice).
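To make this concrete, here's a rough sketch of what the step-2 kernel
might look like in OpenCL C, with one work-item per star. The buffer
names, the choice of hour angle/declination inputs, and the azimuth
convention are placeholders of mine, not KStars' actual code:

    /* Sketch of an OpenCL C kernel: equatorial -> horizontal conversion,
     * one work-item per star.  ha/dec are hour angle and declination in
     * radians; sinLat/cosLat are precomputed for the observer's latitude.
     * Names and layout are placeholders, not KStars' real data structures. */
    __kernel void eq_to_horizontal(__global const float *ha,
                                   __global const float *dec,
                                   __global float *alt,
                                   __global float *az,
                                   const float sinLat,
                                   const float cosLat)
    {
        size_t i = get_global_id(0);

        float sinDec = sin(dec[i]);
        float cosDec = cos(dec[i]);
        float cosHa  = cos(ha[i]);

        /* Standard spherical-astronomy formula:
         * sin(alt) = sin(dec)sin(lat) + cos(dec)cos(lat)cos(H) */
        alt[i] = asin(sinDec * sinLat + cosDec * cosLat * cosHa);

        /* Azimuth here is measured westward from south; whatever
         * convention KStars uses internally would replace this. */
        az[i] = atan2(sin(ha[i]),
                      cosHa * sinLat - (sinDec / cosDec) * cosLat);
    }

Precession/nutation and the screen projection would follow the same
per-star pattern.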
The advantages of moving these tasks into CL kernels are these:

1. Most importantly, we gain the ability to execute code on the GPU.
   General-purpose GPU computing is already here, and it's going to be
   even more important in the future than it is now: today, a low-end
   $50 AMD CPU has a graphics processor on-chip with 128 processing
   elements, while a higher-end graphics card may have over 1024. The
   benefit grows even more when we talk about low-power embedded
   devices, since they usually have weak processors, but capable GPUs
   [1]. Using all the available hardware gives really dramatic
   improvements [2], and I think that the workload for KStars would be
   well-suited for it.
In the event where the user has hardware that only supports execution
on the CPU, we still gain:

2. It's very, very rare to see a single-core machine, but KStars uses
   only a single thread for all of the processing. OpenCL automatically
   runs code in parallel across all available cores. The three steps
   above are obviously parallel between stars, and should be run in
   two, four, ... threads as appropriate, with OpenCL doing the work of
   determining workgroup sizes (see the small host-side sketch below).
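For illustration, a launch like that could be as simple as the
following; the queue, kernel, and star count are assumed to come from
setup code I'm not showing, and passing NULL for the local work size is
what lets the OpenCL runtime choose the workgroup size:

    #include <CL/cl.h>

    /* Hypothetical helper: run the per-star kernel over all stars in one
     * go.  The runtime splits the global range into workgroups and
     * spreads them across whatever compute units (CPU cores or GPU
     * units) are available. */
    cl_int run_per_star_kernel(cl_command_queue queue, cl_kernel kernel,
                               size_t numStars)
    {
        size_t globalSize = numStars;          /* one work-item per star  */
        return clEnqueueNDRangeKernel(queue, kernel,
                                      1,       /* 1-dimensional range     */
                                      NULL,    /* no global offset        */
                                      &globalSize,
                                      NULL,    /* OpenCL picks local size */
                                      0, NULL, NULL);
    }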
3. KStars currently has rather poor memory-access patterns due to
   putting an OO code structure on a problem that is really more of a
   functional, data-processing problem. Using CL forces us to structure
   the code so that instead of calling functions many times on
   different bits of data at different locations in memory, we
   essentially call functions a few times on very large contiguous
   arrays of memory containing all of the data, resulting in better
   performance. (See, for example, Drepper's matrix multiplication
   example [3], where doing the extra work of malloc'ing 8MB of memory
   and filling it with a matrix transpose gives a nearly
   two-order-of-magnitude speed increase.) A sketch of what this layout
   might look like follows below.

   Note that #3 is something I'd like to test and get hard numbers on
   before writing any applications, and I have a student version of
   VTune that is supposedly able to profile these things, but it
   doesn't want to work properly.
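Here's what I mean by that layout, as plain C with made-up names (the
real structure would have to match what the catalogs actually store):
keep each quantity in one contiguous array and make a single pass over
it, whether that pass is a CPU loop or a kernel launch.

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical flat layout: every coordinate lives in one contiguous
     * array, instead of being scattered across heap-allocated objects. */
    struct StarBuffers {
        size_t count;
        float *ha, *dec;   /* inputs: hour angle, declination (radians) */
        float *alt, *az;   /* outputs: altitude, azimuth (radians)      */
    };

    /* CPU reference pass: one tight loop streaming through contiguous
     * memory, i.e. "call the function once on a big array" rather than
     * once per object.  Same math as the kernel sketched earlier. */
    static void convert_all(struct StarBuffers *b, float sinLat, float cosLat)
    {
        for (size_t i = 0; i < b->count; ++i) {
            float sinDec = sinf(b->dec[i]);
            float cosDec = cosf(b->dec[i]);
            float cosHa  = cosf(b->ha[i]);

            b->alt[i] = asinf(sinDec * sinLat + cosDec * cosLat * cosHa);
            b->az[i]  = atan2f(sinf(b->ha[i]),
                               cosHa * sinLat - (sinDec / cosDec) * cosLat);
        }
    }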
Finally, we also have this:

4. The main bottleneck of the OpenGL mode is sending stars to the
   graphics card. The next is projecting the stars. Here, the stars
   are already on the graphics card, and are sent there less
   frequently and from a position where we know how many we'll be
   sending (e.g. if we load a trixel at a time), so the first problem
   goes away, and the second also goes away because we use a vertex
   shader to do projection.
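I haven't settled on the mechanism, but one candidate for keeping the
stars resident on the graphics card is CL/GL buffer sharing (the
cl_khr_gl_sharing extension), so that the kernels write straight into
the vertex buffer the painter draws from. The sketch below, with
made-up names, is only meant to show the shape of that approach, not a
design decision:

    #include <CL/cl.h>
    #include <CL/cl_gl.h>

    /* Hypothetical helper: wrap the GL vertex buffer that holds star
     * positions as a CL buffer, so kernels can update it in place instead
     * of re-uploading every star on every redraw.  Requires a CL context
     * created with GL sharing enabled, and clEnqueueAcquireGLObjects /
     * clEnqueueReleaseGLObjects calls around each kernel launch. */
    cl_mem wrap_star_vbo(cl_context ctx, cl_GLuint vbo, cl_int *err)
    {
        return clCreateFromGLBuffer(ctx, CL_MEM_READ_WRITE, vbo, err);
    }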
This email is somewhat light on technical details from the KStars point
of view; I'll be writing another one soon with details about how I'd
like to do this, but I want to be very, very careful with it,
specifically in the area of how to prevent scope creep.

The reason is that the last time I did a Summer of Code project I
didn't do a very good job of planning how to avoid scope creep, and as
a result I got really bogged down in trying to fix everything in KStars
and burned myself out. So I think it's really important that the
project proposal has not just a really clear schedule, but also an
explicit statement of what not to do, and of how we make sure the
changes elsewhere in KStars stay minimal.
I'm hoping to have that ready by midweek; however, I'd appreciate any
comments that people have on the generalities in the meantime.

Cheers,

Henry de Valence

[1]: See also the list of companies here: http://hsafoundation.com/ to
get an idea of where OpenCL is headed in the embedded realm.

[2]: Compare GIMP performance with and without OpenCL here:
http://www.tomshardware.com/reviews/photoshop-cs6-gimp-aftershot-pro,3208-10.html

[3]: http://www.akkadia.org/drepper/cpumemory.pdf page 50