[KPhotoAlbum] NVMe

Tue Oct 22 01:36:05 BST 2019

On Mon, 21 Oct 2019 20:59:43 +0200, Andreas Schleth wrote:
> Hi Robert,
> what do I do, to obtain your branch and how do I set the thread
> numbers/preload options?

git clone git@git.kde.org:/kphotoalbum.git
cd kphotoalbum
git checkout parallel-md5

The two tunables are IMAGE_SCOUT_THREAD_COUNT and PRELOAD_MD5 in
DB/NewImageFinder.cpp.  This is, of course, a prototype.
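
For anyone who wants a feel for what the two knobs control before
reading the C++, here is a rough sketch in Python.  It is not the code
in DB/NewImageFinder.cpp: SCOUT_THREAD_COUNT, scout() and load_images()
are names made up for illustration, and the only thing taken from the
branch is the idea of several scout threads reading files ahead of the
loader, with the MD5 optionally computed during that preload.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical stand-ins for the two tunables; the real ones live in
# DB/NewImageFinder.cpp and are not reproduced here.
SCOUT_THREAD_COUNT = 4
PRELOAD_MD5 = True


def scout(path, preload_md5=PRELOAD_MD5):
    """Read one file to warm the page cache; optionally hash it right away."""
    data = Path(path).read_bytes()
    digest = hashlib.md5(data).hexdigest() if preload_md5 else None
    return str(path), digest


def load_images(paths, threads=SCOUT_THREAD_COUNT, preload_md5=PRELOAD_MD5):
    """Return {path: md5}, with `threads` scouts reading ahead in parallel."""
    result = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for path, digest in pool.map(lambda p: scout(p, preload_md5), paths):
            if digest is None:
                # No preload: the consumer re-reads the file (ideally from
                # the page cache) and hashes it on its own thread.
                digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
            result[path] = digest
    return result
```

Calling load_images(paths, threads=1, preload_md5=False) versus
load_images(paths, threads=4, preload_md5=True) corresponds to the
"1 scout/no preload" and "4 scout/preload MD5" rows in the numbers
quoted below, at least in spirit.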

> How do you instrument the code to obtain such detailed performance numbers?
> My knowledge does not go much further than "time executable"...
> That said, I'll be happy to provide some numbers on my setup here.
> Cheers, Andreas

I simply ran

iostat 5

to get I/O throughput and total CPU consumption.
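
If iostat isn't installed, roughly the same read figures can be
derived on Linux from two samples of /proc/diskstats.  The sketch below
assumes the standard layout of that file (device name in the third
column, reads completed in the fourth, 512-byte sectors read in the
sixth); the function names are made up for illustration, and MB here
means 10^6 bytes, so the numbers won't match iostat's units exactly.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def read_counters(diskstats_text, device):
    """Return (reads_completed, bytes_read) for `device` from diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[2] == device:
            return int(fields[3]), int(fields[5]) * SECTOR_BYTES
    raise KeyError(device)


def rates(before, after, seconds):
    """Turn two counter samples into (reads/sec, MB/sec)."""
    ios = (after[0] - before[0]) / seconds
    mb = (after[1] - before[1]) / seconds / 1e6
    return ios, mb


def sample(device, interval=5.0):
    """Sleep for `interval` seconds and report read rates for `device`."""
    with open("/proc/diskstats") as f:
        before = read_counters(f.read(), device)
    time.sleep(interval)
    with open("/proc/diskstats") as f:
        after = read_counters(f.read(), device)
    return rates(before, after, interval)
```

Something like sample("nvme0n1") while the load is running gives one
five-second reading; the device name is whatever your system calls the
disk holding the images.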

That said, I also need to time the loading with a stopwatch: if I add
more threads for thumbnail generation, the throughput doesn't drop, but
it looks like fewer images are being read per second, probably because
after a while the thumbnailing starts reading more data.

I've done profiling (via kcachegrind) in earlier phases of this work
(Load-performance, elide_unnecessary_metadata, exifdb_improvements,
startup-performance, no-statvfs) because user CPU was involved in a
lot of those improvements.  This work has very little to do with user
CPU; it's a function of I/O throughput and to some extent scheduling,
which profiling won't help with.

> On 20.10.19 at 19:57, Robert Krawitz wrote:
>> So it looks like I've got the following numbers.  I'm showing 2
>> significant figures here; in reality, probably no more than 1, maybe
>> 1.5, are really significant in most cases.
>>
>> * PCIe gen3/x4 NVMe:
>>
>>    4 scout/preload MD5: 1.9 GB/sec
>>    4 scout/no preload: 490 MB/sec (75-80% CPU)
>>    1 scout/preload MD5: 480 MB/sec
>>    1 scout/no preload: 480 MB/sec (75-80% CPU)
>>    2 scout/preload MD5: 1.2 GB/sec
>>    5 scout/preload MD5: 1.9 GB/sec (maybe slightly faster than 4 scouts)
>>    6 scout/preload MD5: 1.75 GB/sec
>>
>>    All of these were about 90-95% CPU consumption except as noted,
>>    regardless of I/O throughput.  What I think is happening is that at
>>    the lower throughput the extra CPU is going toward building
>>    thumbnails.
>>
>> * HDD:
>>
>>    4 scout/preload MD5: 70-75 MB/sec (490 IO/sec)
>>    4 scout/no preload: 75-80 MB/sec (115 IO/sec)
>>    1 scout/preload MD5: 95 MB/sec (900 IO/sec)
>>    1 scout/no preload: 95-98 MB/sec (150 IO/sec)
>>    2 scout/no preload: 65-70 MB/sec
>>
>>    All generally <20% CPU
>>
>> * SATA SSD
>>
>>    4 scout/preload MD5: 380 MB/sec
>>    4 scout/no preload: 480 MB/sec
>>    1 scout/preload MD5: 200 MB/sec
>>    1 scout/no preload: 370 MB/sec
>>    2 scout/no preload: 440 MB/sec
>>    3 scout/no preload: 470 MB/sec
>>    5 scout/no preload: 470 MB/sec
>>
>>    CPU varied considerably, generally in parallel with I/O throughput.
>>
>> So the general themes are:
>>
>> * On NVMe devices (fast ones, at any rate), more scout threads (up to
>>    4 on my system, which coincidentally or not is the number of cores)
>>    and computing MD5 during preload gives a big benefit.  It appears
>>    that throughput scales with threads up to the number of cores
>>    available.  I won't have a chance this week, but at some point I'll
>>    have to try on my Ryzen 2700X (with 8 cores that are at least
>>    somewhat faster than those on my laptop).  I know that my NVMe can
>>    do better than 2 GB/sec.
>>
>> * On SATA SSDs, more scout threads is a benefit although it levels
>>    off, but computing MD5 on preload is distinctly detrimental.
>>
>> * On HDDs, more scout threads is detrimental, but when the MD5 is
>>    computed is of little import.
>>
>> It would be interesting to see what would happen on slower and faster
>> NVMe devices and slower/faster/more core CPUs.  It would also be
>> interesting to see what happens on network filesystems, if someone
>> wants to try, but if you do, make sure to record information about the
>> server, network, and remote filesystem location/type in addition to
>> the client.
>>
>> The main benefits to this work are probably for initial impression,
>> initial database load, and loading large numbers of images.  For
>> someone who wants to try out KPA or start a large database, having
>> very fast load times will make for a good first impression.  The
>> thumbnails won't all be built by the end of load, but from a user
>> interaction standpoint that likely doesn't matter; if they start from
>> the top and scroll down, the thumbnail building will probably already
>> be ahead.  For loading many gigabytes of images onto fast storage, the
>> benefits of correct tuning are obvious.
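
To put the NVMe scaling claim in numbers, the preload-MD5 rows quoted
above can be normalized to the single-scout figure.  This is only
arithmetic on the quoted measurements (with 1.9 GB/sec taken as 1900
MB/sec), so it inherits their 1 to 1.5 significant figures; speedup()
is a helper name made up for the example.

```python
# NVMe "preload MD5" throughput from the quoted table, in MB/sec,
# keyed by number of scout threads.
nvme_preload = {1: 480, 2: 1200, 4: 1900, 5: 1900, 6: 1750}


def speedup(table, baseline=1):
    """Throughput of each row relative to the `baseline` thread count."""
    return {n: round(mb / table[baseline], 2) for n, mb in sorted(table.items())}


print(speedup(nvme_preload))
# {1: 1.0, 2: 2.5, 4: 3.96, 5: 3.96, 6: 3.65}
```

So going from 1 to 4 scouts is close to a 4x gain, with no further
improvement at 5 and a slight drop at 6, which matches the observation
that the benefit tops out around the number of cores.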

-- 
Robert Krawitz                                     <rlk at alum.mit.edu>

***  MIT Engineers   A Proud Tradition   http://mitathletics.com  ***
Member of the League for Programming Freedom  --  http://ProgFree.org
Project lead for Gutenprint   --    http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton