[KPhotoAlbum] NVMe

Sun Oct 20 17:27:52 BST 2019

On Sun, 20 Oct 2019 07:55:12 +0200, Andreas Schleth wrote:
> Hi Robert,
> Out there (actually here) we also have the use case of slow rust storage via network (NFS). Using too much concurrent io might virtually kill my server. So an option to limit the io threads manually might be nice. Thumbnail generation might tap into the system generated thumbnails and reuse them...

My early work with the scout threads was actually aimed at improving
performance on HDDs.  I don't see any reason for more than one scout
thread in that case, but it really did help there by reducing the time
spent waiting for data to be present for MD5 checksumming.
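
To illustrate the single-scout idea, here is a minimal Python sketch
(not the KPhotoAlbum code, which is C++).  One scout thread reads files
ahead of the checksumming loop so the data is likely already in the
OS page cache when the MD5 is computed.  The names scout, checksum_all,
and max_lead are hypothetical, and relying on the page cache is an
assumption about how the read-ahead pays off.

```python
import hashlib
import queue
import threading


def scout(paths, ready, chunk_size=1 << 20):
    """Read each file to warm the OS page cache, then announce it as ready."""
    for path in paths:
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass  # data is discarded; the point is to pull it into cache
        ready.put(path)  # blocks when the scout is max_lead files ahead
    ready.put(None)  # sentinel: no more files


def checksum_all(paths, max_lead=8):
    """Checksum files in order while one scout thread reads ahead."""
    ready = queue.Queue(maxsize=max_lead)
    threading.Thread(target=scout, args=(paths, ready), daemon=True).start()
    sums = {}
    while (path := ready.get()) is not None:
        with open(path, "rb") as f:
            sums[path] = hashlib.md5(f.read()).hexdigest()
    return sums
```

The bounded queue keeps the scout from running arbitrarily far ahead,
which matters when RAM is tight and cached data could be evicted before
it is used.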

As for reusing system-generated thumbnails, that would only help (if
at all) on extremely low latency storage with an extremely efficient
filesystem.  The purpose of the thumbnail scheme as it exists is to
reduce filesystem latency when reading thumbnails for use and to
pre-scale the thumbnails correctly.  So thumbnails are packed into a
small number of much larger (32 MB) files with a general attempt to
preserve locality.  Using system-generated thumbnail files is very
slow because each thumbnail requires its own filesystem operation.
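
As a rough illustration of the packing idea, here is a Python sketch
that appends thumbnails to a few large packs and looks them up through
an index.  It is not the actual KPhotoAlbum on-disk format; the class
ThumbnailPacker, its index layout, and the in-memory bytearrays standing
in for pack files are all assumptions made for the example.

```python
PACK_LIMIT = 32 * 1024 * 1024  # 32 MB per pack, as mentioned above


class ThumbnailPacker:
    """Append thumbnails to large packs and keep an index of their locations."""

    def __init__(self, pack_limit=PACK_LIMIT):
        self.pack_limit = pack_limit
        self.packs = [bytearray()]  # stand-ins for on-disk pack files
        self.index = {}             # name -> (pack number, offset, length)

    def add(self, name, data):
        current = self.packs[-1]
        if current and len(current) + len(data) > self.pack_limit:
            self.packs.append(bytearray())  # current pack is full; start a new one
        pack_no = len(self.packs) - 1
        offset = len(self.packs[pack_no])
        self.packs[pack_no] += data
        self.index[name] = (pack_no, offset, len(data))

    def get(self, name):
        pack_no, offset, length = self.index[name]
        return bytes(self.packs[pack_no][offset:offset + length])
```

Thumbnails added one after another end up adjacent in the same pack, so
reading a run of them touches one large file rather than one small file
per thumbnail.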

Furthermore, in the normal process of thumbnail generation (as opposed
to rebuilding thumbnails), the data is already in RAM because it was
read in for MD5 generation and unless the system is very tight for
RAM, or checksum generation has run far ahead of thumbnail generation,
it will still be present when thumbnails are built.  The exception is
when I/O is very fast and there are a lot of scout threads, in
which case checksumming could get far ahead of thumbnail creation
(which is computationally more expensive).  That's the case with what
I'm doing on NVMe devices, but in that case, the I/O is fast enough to
keep the thumbnail generation well fed and the advantage is that
checksumming finishes quickly, allowing the user to access the images.
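
The arrangement where the scouts compute the MD5 and the loader only
computes it when no stored value exists (as proposed in the quoted
messages below) might look roughly like this in Python.  This is a
sketch under assumptions, not the parallel-md5 branch itself:
ChecksumCache, scout_files, and loader_checksum are hypothetical names,
and a thread pool stands in for the scout threads.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class ChecksumCache:
    """Thread-safe map from path to an MD5 digest computed by a scout."""

    def __init__(self):
        self._sums = {}
        self._lock = threading.Lock()

    def store(self, path, digest):
        with self._lock:
            self._sums[path] = digest

    def take(self, path):
        """Return and remove the stored digest, or None if there is none."""
        with self._lock:
            return self._sums.pop(path, None)


def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scout_files(paths, cache, scout_threads=3):
    """Have several scouts read files and store their checksums."""
    def work(path):
        cache.store(path, md5_of(path))
    with ThreadPoolExecutor(max_workers=scout_threads) as pool:
        list(pool.map(work, paths))  # wait for all scouts to finish


def loader_checksum(path, cache):
    """Use the scout's digest if present; otherwise compute it now."""
    digest = cache.take(path)
    return digest if digest is not None else md5_of(path)
```

The fallback in loader_checksum means the loader still works for files
the scouts have not reached or were never given.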

The question of NFS is tricky and hard to generalize.  My inclination
is to want to minimize I/O but actually have more scout threads
because network filesystems introduce more non-media overhead.  See my
discussion of throughput, media latency, and protocol latency in
DB/NewImageFinder.cpp.  Of course, that's looking at it from the
standpoint of putting kpa first, and you're quite correct that that
might not be good for other users of that server.

In any event, I don't think that globally setting 4 (or more!) scout
threads is desirable except for extremely fast storage.

> Am 20. Oktober 2019 07:09:54 MESZ schrieb Robert Krawitz <rlk at alum.mit.edu>:
>>On Sun, 13 Oct 2019 19:39:52 -0400 (EDT), Robert Krawitz wrote:
>>> Unfortunately, I'm not getting a lot of benefit from use of an NVMe;
>>> it looks like I'm hitting other limits right away (MD5 checksumming
>>> and thumbnail extraction) even with everything cached in memory.
>>>
>>> With increasing thread counts, it would be very useful to be able to
>>> farm out checksumming and thumbnail generation.  Checksum generation
>>> shouldn't be much of a problem; it could be computed by the scout
>>> thread and stored in a hash, and only computed by the loader if it
>>> doesn't exist.  The good news (subject to verifying that it did the
>>> checksum correctly) is that even 3-way parallelism (3 scout threads)
>>> got me to 1.8 GB/sec I/O rate, something like 200 files/sec.
>>> Unfortunately, this then gets ahead of thumbnail generation, with the
>>> result that images have size (-1, -1) until their thumbnails get
>>> created.  Still need to figure out how to deal with that.
>>
>>I've created a parallel-md5 branch that prototypes this.  With fast
>>backing store, such as the Inland Premium 2TB NVMe, image loading
>>really flies; I'm getting about 1.9 GB/sec loading images, limited by
>>MD5 checksum generation on a processor that's not especially fast by
>>recent standards (Xeon E3-1505M, 4x2 Skylake at 2.8/3.7 GHz).  I'm
>>using 4 scout threads to get there, with the scouts doing the MD5
>>calculation.  With RAID gen4 NVMe on a Threadripper or higher thread
>>count Epyc the results would be interesting.
>>
>>Thumbnail generation of course lags badly on my hardware.  The result
>>is that I'm actually doing about 2x as much total I/O, but the user
>>gets control back very quickly.  I managed to get the image size
>>during preload, so the -1 problem went away.
>>
>>The hard part's going to be figuring out how to autotune the number of
>>scout threads.
-- 
Robert Krawitz                                     <rlk at alum.mit.edu>

***  MIT Engineers   A Proud Tradition   http://mitathletics.com  ***
Member of the League for Programming Freedom  --  http://ProgFree.org
Project lead for Gutenprint   --    http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton