Krita Composition and CUDA

Dmitry Kazakov dimula73 at gmail.com
Sun Nov 4 12:26:03 UTC 2012


Hi!

Yesterday I played with CUDA a bit (just for fun, not for the sponsored
work) and I wanted to share my ideas about it.

First Impression

The first thing to say: "CUDA is a really nice thing!" I have a very
low-end GPU (a GT 610 with only 48 cores), but it performs the pure
calculations (not counting data transfers) almost twice as fast as a
formerly top-of-the-line Intel Core i7 CPU can with its AVX extension. I'm
afraid even to think about how fast these operations would run on high-end
GPUs with 500+ cores.

The Test

In the attachment you can find a table that compares the speed of various
implementations of Composite Over. The test composited a single buffer
containing 32 million random pixels (about 122 MiB) onto a similar buffer
through a mask.

Results

It's a pity, but in real life we don't get much improvement by using the
GPU this way. Although the calculations are performed almost twice as fast
as the CPU can manage, the benefit is negated by the delays caused by data
transfers between the CPU and the GPU. According to the tests, these
transfers can take up to 50% of the total time of Composite Over.

It is quite convenient to measure the results in units of memcpy time, i.e.
the time a plain RAM-to-RAM copy of the buffer takes. In our case the data
transfers take 5.36 memcpy times. We transfer about 396 MiB in total, which
is roughly 3.25 buffer-sized copies, so a single data copy to/from the GPU
is about 1.5 times slower than a usual RAM-to-RAM copy.
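
For reference, this kind of transfer/calculation split can be measured with
CUDA events. A rough sketch (assuming devSrc/devDst/devMask were allocated
with cudaMalloc and hostSrc/hostDst/hostMask are the CPU-side buffers; the
names are illustrative only):

cudaEvent_t t0, t1, t2, t3;
cudaEventCreate(&t0); cudaEventCreate(&t1);
cudaEventCreate(&t2); cudaEventCreate(&t3);

cudaEventRecord(t0);
cudaMemcpy(devSrc,  hostSrc,  bytes,     cudaMemcpyHostToDevice);
cudaMemcpy(devDst,  hostDst,  bytes,     cudaMemcpyHostToDevice);
cudaMemcpy(devMask, hostMask, maskBytes, cudaMemcpyHostToDevice);
cudaEventRecord(t1);

compositeOver<<<(nPixels + 255) / 256, 256>>>(devSrc, devDst, devMask, nPixels);
cudaEventRecord(t2);

cudaMemcpy(hostDst, devDst, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(t3);
cudaEventSynchronize(t3);

float toGpu, kernel, fromGpu;
cudaEventElapsedTime(&toGpu,   t0, t1);  // host -> device transfers
cudaEventElapsedTime(&kernel,  t1, t2);  // pure calculation
cudaEventElapsedTime(&fromGpu, t2, t3);  // device -> host transfer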

Taking into account that some fast paths of the Vc implementation of this
op take about 1-3 memcpy times, this approach (with at least 5.36 memcpy
times spent on transfers alone) will not work for us.

Idea

Well, it's obvious that the bottleneck of this approach is the data
transfers, so we should avoid them somehow. What if we moved the storage of
our layers from CPU to GPU memory? Not completely, of course: all the
layers would still be stored in CPU RAM, but some of them (say, the active
one) would also have a full copy in GPU RAM. That means the paintops and
the composition could be performed entirely on the GPU without any data
transfers. This would give a 2x performance gain even on low-end GPUs (like
mine with 48 cores), and I can't even say how fast it would run on high-end
GPUs with 500+ cores.
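
To make the idea a bit more concrete, here is a purely hypothetical sketch
(the GpuLayer struct and compositeOnGpu function are made up for
illustration) of how a device-resident layer would let us run the same
kernel without any per-operation transfers:

// Hypothetical device-resident layer: the pixel data lives in GPU memory.
struct GpuLayer {
    uchar4 *pixels;   // allocated with cudaMalloc
    int     nPixels;
};

// Compose one GPU-resident layer onto another; no cudaMemcpy involved,
// because both operands and the mask already live on the device.
void compositeOnGpu(const GpuLayer &src, GpuLayer &dst, const uint8_t *devMask)
{
    const int threads = 256;
    const int blocks  = (src.nPixels + threads - 1) / threads;
    compositeOver<<<blocks, threads>>>(src.pixels, dst.pixels, devMask,
                                       src.nPixels);
}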

What is more, if the projection of the image were stored in GPU memory, we
would avoid the KisPaintDevice -> QImage -> OpenGL texture data transfer.
The point is that OpenGL textures can be linked directly to CUDA buffers,
so the projection would be written directly into the texture. Just so you
know, according to my last profiling we currently spend 13.8% of the time
on this transfer.
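
The interop itself is already provided by the CUDA runtime. A minimal
sketch of writing a device-side projection buffer straight into an OpenGL
texture (error handling omitted; texId, devProjection, width and height are
assumed to exist, with devProjection holding tightly packed RGBA8 pixels):

#include <cuda_gl_interop.h>

// Register the canvas texture once, so CUDA can write into it directly.
cudaGraphicsResource_t res;
cudaGraphicsGLRegisterImage(&res, texId, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsWriteDiscard);

// Per frame: map the texture and copy the composed projection into it,
// staying entirely in GPU memory.
cudaGraphicsMapResources(1, &res, 0);
cudaArray_t texArray;
cudaGraphicsSubResourceGetMappedArray(&texArray, res, 0, 0);
cudaMemcpy2DToArray(texArray, 0, 0, devProjection,
                    width * 4, width * 4, height,
                    cudaMemcpyDeviceToDevice);
cudaGraphicsUnmapResources(1, &res, 0);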

Of course, this idea sounds like a dream and, of course, there are lots of
hidden complications. But I think we need to at least consider it. It might
be quite an interesting, though huge and difficult, project for GSoC, for
example...


-- 
Dmitry Kazakov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Krita_CUDA_vs_AVX_Comparison.pdf
Type: application/pdf
Size: 21655 bytes
Desc: not available
URL: <http://mail.kde.org/pipermail/kimageshop/attachments/20121104/c9674a1c/attachment-0001.pdf>

