Hi!<br><br>Yesterday I played with CUDA a bit (just for fun, not for the sponsored work) and I wanted to share my thoughts about it.<br><br>First Impression<br><br>The first thing to say: "CUDA is a really nice thing!" I have a very low-end GPU (a GT610 with only 48 cores), but it performs pure calculations (not counting data transfers) almost twice as fast as the ex-top Intel Core i7 CPU does with its AVX extension. I'm afraid to even think about how fast these operations would run on high-end GPUs with 500+ cores.<br>
<br>The Test<br><br>In the attachment you can find a table that compares the speed of various implementations of Composite Over. The test composited a single buffer of 32 million random pixels (about 122 MiB) into a similar buffer using a mask.<br>
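For reference, the per-pixel math being benchmarked can be sketched as a plain scalar loop. This is only a simplified sketch of masked "composite over" for 8-bit RGBA with premultiplied alpha, not Krita's actual implementation (which handles more color spaces and uses Vc/SIMD fast paths):

```cpp
#include <cstdint>
#include <cstddef>

// Approximate (a * b) / 255 with correct rounding, the usual
// fixed-point trick for 8-bit channel multiplication.
static inline uint8_t mul255(uint8_t a, uint8_t b) {
    unsigned t = unsigned(a) * b + 128;
    return uint8_t((t + (t >> 8)) >> 8);
}

// Masked "composite over" on premultiplied 8-bit RGBA:
// dst = src * mask + dst * (1 - srcAlpha * mask).
void compositeOver(const uint8_t* src, uint8_t* dst,
                   const uint8_t* mask, size_t numPixels) {
    for (size_t i = 0; i < numPixels; ++i) {
        uint8_t m  = mask[i];
        // Source alpha attenuated by the mask value.
        uint8_t sa = mul255(src[i * 4 + 3], m);
        for (int c = 0; c < 4; ++c) {
            uint8_t s = mul255(src[i * 4 + c], m);
            dst[i * 4 + c] = uint8_t(s + mul255(dst[i * 4 + c], 255 - sa));
        }
    }
}
```

With a fully opaque source and a full mask the destination pixel becomes the source pixel; with a zero mask the destination is untouched.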
<br>Results<br><br>It's a pity, but in real life we don't get much improvement by using the GPU this way. Although the calculations are performed almost twice as fast as the CPU can manage, the benefit is negated by the delays caused by data transfers between the CPU and the GPU. According to the tests, these transfers may take up to 50% of the total time for Composite Over.<br>
<br>It is quite convenient to measure the results in units of memcpy time. In our case the data transfers take 5.36 memcpy time. We transfer about 396 MiB, which means that a single data copy to/from the GPU is about 1.5 times slower than a usual RAM-to-RAM copy.<br>
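That figure can be double-checked with a quick back-of-the-envelope calculation, using the numbers above (a ~122 MiB buffer, ~396 MiB of total CPU&lt;-&gt;GPU traffic, and 5.36 memcpy units spent on transfers):

```cpp
// Slowdown of one CPU<->GPU copy relative to a RAM-to-RAM memcpy,
// given the total traffic, the size of one buffer, and the measured
// transfer time expressed in memcpy units.
double transferSlowdown(double trafficMiB, double bufferMiB,
                        double memcpyUnits) {
    // 396 MiB of traffic is ~3.25 buffer-sized copies...
    double bufferCopies = trafficMiB / bufferMiB;
    // ...so each copy costs about 5.36 / 3.25 ~= 1.65 memcpy units.
    return memcpyUnits / bufferCopies;
}
```

transferSlowdown(396.0, 122.0, 5.36) gives about 1.65, i.e. roughly the "1.5 times slower" quoted above.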
<br>Taking into account that some fast paths of the Vc implementation of this op take about 1...3 memcpy time, this approach (with at least 5.36 memcpy time) will not work for us.<br><br>Idea<br><br>Well, it's obvious that the bottleneck of this approach is the data transfers, so we should avoid them somehow. What if we moved the storage of our layers from CPU to GPU memory? Not completely, of course: all the layers would still be stored in CPU RAM, but some of them (say, the active one) would have a full copy in GPU RAM. This means the paintops and the composition could be performed entirely on the GPU without any data transfers. That would give a 2x performance gain even on low-end GPUs (like mine with 48 cores), and I can't even say how fast it would run on high-end GPUs with 500+ cores.<br>
<br>What is more, if the projection of the image were stored in GPU memory, we would avoid one more data transfer: KisPaintDevice-&gt;QImage-&gt;OpenGL texture. The point is, OpenGL textures can be linked directly to CUDA buffers, so the projection would be written directly to the texture. Just so you know, according to my latest profiling, we currently spend 13.8% of the time on this transfer.<br>
<br>Of course, this idea sounds like a dream and, of course, there are lots of hidden complications. But I guess we need to think about it at least. It might be quite an interesting, though huge and difficult, project for GSoC, for example...<br>
<br clear="all"><br>-- <br>Dmitry Kazakov<br>