[KPhotoAlbum] Speed up new image load time

Tue May 30 00:35:50 BST 2017

On Mon, 29 May 2017 19:05:47 -0400 (EDT), Robert Krawitz wrote:
> On Mon, 29 May 2017 18:47:05 -0400 (EDT), Robert Krawitz wrote:
>> On Mon, 29 May 2017 17:27:49 -0400 (EDT), Robert Krawitz wrote:
>>> Some timings, for loading 1133 images:
>>>
>>>      	      Old 	  New
>>> 20 MP	      5:41	  0:32
>> ...
>>> It looks like storing the EXIF data in the database takes about 3
>>> seconds.  The next big time consumer is file version detection; if I
>>> turn that off, the total time drops off to about 7 seconds.  At that
>>> point, in a realistic scenario, I'd likely be I/O-bound; if I were
>>> loading 3000 images (30 GB, typically), I'd need on the order of
>>> 250-300 seconds just to read the data from disk.  But if someone were
>>> storing their images on NVMe, it might matter.
>>
>> Well, there's some very low hanging fruit here: the modified file
>> detection computes the MD5 checksum of each file twice!  It's a very
>> simple matter to get rid of one of those; the time drops to about 20
>> seconds (which is consistent with what I saw running md5sum on all of
>> the files: it took about 10 seconds).
>
> If I take out MD5 checksumming altogether it drops to about 8 seconds,
> as would be expected.
>
> Of that time, about 3-4 seconds is spent in what looks like saving the
> EXIF data, 2-3 seconds scanning the filesystem, and 2-3 seconds
> reading the files in (when I interrupted gdb several times during
> that, it looked like most of it was library routines scanning the EXIF
> headers).
>
> So, 20-ish seconds to read in 1100 files, which would normally be
> around 10 GB.  And that's with a fairly slow processor; with a
> contemporary fast processor it would be more like 10.  With a large
> amount of data, that would be completely I/O-bound unless you had an
> NVMe drive.
>
> I think this problem is solved.

I tried the same experiment on my server (i7-5820K, whose single-thread
performance is a bit more than twice that of my laptop).  The first
time I loaded the new files, it was on pace to take something like a
minute.  When I repeated it, with the files now cached, it took 15
seconds.  So the first load is I/O-bound, and short of not computing
the MD5, there's not much we can do.

One option, if duplicate file detection isn't turned on, would be to
compute the MD5 checksum only when the thumbnails are created or the
image viewed.  Since the working set of the images is frequently
larger than RAM, this would save on I/O.  But it would be rather
complicated, I suspect.
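
A minimal sketch of that deferred-checksum idea, written in Python for
brevity rather than in KPhotoAlbum's own C++.  ImageEntry, md5() and
hash_reads are hypothetical names made up for illustration; none of
this is KPhotoAlbum's actual code.  The checksum is not computed when
the file is first loaded, only when something asks for it, and the
result is then cached:

```python
import hashlib
from pathlib import Path


class ImageEntry:
    """Hypothetical stand-in for an image record (not KPhotoAlbum's class)."""

    def __init__(self, path):
        self.path = Path(path)
        self._md5 = None     # deliberately not computed at import time
        self.hash_reads = 0  # counts full-file reads, for illustration only

    def md5(self):
        # Compute on first request (assumed to come from thumbnail creation
        # or from viewing the image), then reuse the cached value.
        if self._md5 is None:
            digest = hashlib.md5()
            with self.path.open("rb") as f:
                # Read in 1 MiB chunks so large images are not held in RAM.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            self._md5 = digest.hexdigest()
            self.hash_reads += 1
        return self._md5
```

The saving would come from calling md5() at a point where the file is
being read anyway, so the data is likely still in the page cache.  The
sketch leaves out the hard parts: persisting the "not yet computed"
state in the database and handling files that change before they are
hashed.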

This may not be entirely accurate, because I ran it to a remote
display (my laptop).  But I suspect it's not off by much.
-- 
Robert Krawitz                                     <rlk at alum.mit.edu>