[KPhotoAlbum] Speed up new image load time

Tue May 30 20:20:13 BST 2017

Hi Robert,

Thanks for providing these patches! They are appreciated ;-)

I'm a little sleep deprived right now, so please bear with me if I don't merge 
them right away.

@Tobias: If you have time to review and merge Robert's patches, I won't mind 
:)

Cheers,
  Johannes

On Monday, 29 May 2017 19:35:50 CEST Robert Krawitz wrote:
> On Mon, 29 May 2017 19:05:47 -0400 (EDT), Robert Krawitz wrote:
> > On Mon, 29 May 2017 18:47:05 -0400 (EDT), Robert Krawitz wrote:
> >> On Mon, 29 May 2017 17:27:49 -0400 (EDT), Robert Krawitz wrote:
> >>> Some timings, for loading 1133 images:
> >>>               Old       New
> >>> 
> >>> 20 MP         5:41      0:32
> >> 
> >> ...
> >> 
> >>> It looks like storing the EXIF data in the database takes about 3
> >>> seconds.  The next big time consumer is file version detection; if I
> >>> turn that off, the total time drops off to about 7 seconds.  At that
> >>> point, in a realistic scenario, I'd likely be I/O-bound; if I were
> >>> loading 3000 images (30 GB, typically), I'd need on the order of
> >>> 250-300 seconds just to read the data from disk.  But if someone were
> >>> storing their images on NVMe, it might matter.
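As a rough check on the I/O estimate above: reading 30 GB in 250-300
seconds works out to a sustained rate of about 100-120 MB/s. The
sketch below just redoes that arithmetic. It assumes decimal units
(1 GB = 1000 MB), and the function name is made up for illustration.

```python
def implied_throughput_mb_s(total_gb: float, seconds: float) -> float:
    """Sustained read rate in MB/s needed to read total_gb in the given time.

    Assumes decimal units (1 GB = 1000 MB); this is an illustration of the
    estimate in the mail, not a measurement.
    """
    return total_gb * 1000.0 / seconds


if __name__ == "__main__":
    # The two ends of the 250-300 second range quoted for 30 GB of images.
    for secs in (250, 300):
        print(f"{secs} s -> {implied_throughput_mb_s(30, secs):.0f} MB/s")
```

That range is roughly what a single spinning disk delivers, which fits
the point that only much faster storage would make the CPU cost
visible.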
> >> 
> >> Well, there's some very low-hanging fruit here: the modified file
> >> detection computes the MD5 checksum of each file twice!  It's a very
> >> simple matter to get rid of one of those; the time drops to about 20
> >> seconds (which is consistent with what I saw running md5sum on all of
> >> the files: it took about 10 seconds).
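To illustrate the idea of hashing each file only once, here is a
minimal sketch in Python. It is not KPhotoAlbum's actual code (which
is C++/Qt); the names md5_of_file and ChecksumCache are hypothetical,
and the cache is assumed to live only for the duration of one scan.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ChecksumCache:
    """Remember checksums already computed during one scan (hypothetical helper)."""

    def __init__(self, hasher=md5_of_file):
        self._hasher = hasher
        self._sums = {}
        self.reads = 0  # number of times a file actually had to be hashed

    def get(self, path):
        # Only hash the file the first time it is asked for; later callers
        # (for example a second modified-file check) reuse the stored value.
        if path not in self._sums:
            self._sums[path] = self._hasher(path)
            self.reads += 1
        return self._sums[path]
```

With something like this, two call sites that both need the checksum
of the same file cost one pass over the data instead of two.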
> > 
> > If I take out MD5 checksumming altogether it drops to about 8 seconds,
> > as would be expected.
> > 
> > Of that time, about 3-4 seconds is spent in what looks like saving the
> > EXIF data, 2-3 seconds scanning the filesystem, and 2-3 seconds
> > reading the files in (when I interrupted it under gdb several times during
> > that, it looked like most of it was library routines scanning the EXIF
> > headers).
> > 
> > So, 20-ish seconds to read in 1100 files, which would normally be
> > around 10 GB.  And that's with a fairly slow processor; with a
> > contemporary fast processor it would be more like 10.  With a large
> > amount of data, that would be completely I/O-bound unless you had an
> > NVMe drive.
> > 
> > I think this problem is solved.
> 
> I tried the same experiment on my server (i7-5820K, with single-thread
> performance a bit more than twice that of my laptop).  The first time I
> loaded the new files, it was on pace to take something like a
> minute.  When I repeated it, it took 15 seconds.  That's I/O-bound,
> and short of not computing the MD5, there's not much we can do.
> 
> One option, if duplicate-file detection isn't turned on, would be to
> compute the MD5 checksum only when the thumbnails are created or the
> image viewed.  Since the working set of the images is frequently
> larger than RAM, this would save on I/O.  But it would be rather
> complicated, I suspect.
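A minimal sketch of that deferred-checksum idea, in Python rather
than KPhotoAlbum's C++: the checksum stays unknown at import time and
is computed from the same bytes the first time the file is read for a
thumbnail or for viewing. ImageRecord and read_for_display are
hypothetical names, and reading the whole file into memory is a
simplification.

```python
import hashlib


class ImageRecord:
    """Hypothetical database entry whose MD5 is filled in lazily."""

    def __init__(self, path):
        self.path = path
        self.md5 = None  # not computed when the image is first added

    def read_for_display(self):
        """Read the file once and hash the same bytes if no checksum exists yet.

        This stands in for the work done when a thumbnail is built or the
        image is viewed, so the checksum costs no extra disk read.
        """
        with open(self.path, "rb") as f:
            data = f.read()
        if self.md5 is None:
            self.md5 = hashlib.md5(data).hexdigest()
        return data
```

The complication this glosses over is that anything relying on the
checksum has to cope with records where it is still missing.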
> 
> These timings may not be entirely accurate, because I ran it with a
> remote display (my laptop).  But I suspect they're not off by much.