[KimDaBa] KimDaBa 2.0 is released.

Robert L Krawitz rlk at alum.mit.edu
Fri Oct 22 13:00:37 BST 2004


   From: "Jesper K. Pedersen" <blackie at blackie.dk>
   Date: Fri, 22 Oct 2004 08:42:32 +0200

   On Friday 22 October 2004 02:35, Robert L Krawitz wrote:
   |    From: jedd <jedd at progsoc.org>
   |    Date: Fri, 22 Oct 2004 00:53:35 +1000
   |
   |     Have you done any comparisons on different file systems?  I don't
   |     have any ext2/3 fs's big enough to take my collection, everything
   |     else is reiser.  On this system (points laptop with 2.2ghz / 512mb
   |     / lethargic internal hdd) it takes 13 seconds from <return> to
   |     KimDaBa -- and that's with the fairly standard 5,400ish pics
   |     spread through 420 directories.
   |
   | No, I just use reiserfs.
   |
   |     4 seconds on the second run.
   |
   | This is consistent with what I'm seeing.  It would be a worthwhile
   | experiment to eliminate the scan and see what it would do to startup time.

   I tried that last night, and was surprised how much it gained.  My
   next KimDaBa change will be to disable this scan (with an option to
   enable it), and try to use SAX again to trim the last few seconds
   of XML loading.  When I tried SAX last time I gained only about 2
   secs of my 11 secs load time, which I concluded wasn't worth it,
   but if my load time is now down to 4 secs, then it is indeed worth it.
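
The SAX approach mentioned above can be sketched briefly.  KimDaBa itself
is C++/Qt and its real index.xml schema isn't shown here, so the element
names below are hypothetical; the point is only that a streaming parser
visits elements one at a time instead of building the whole tree in memory:

```python
import xml.sax

# Hypothetical layout, NOT the real index.xml schema:
SAMPLE = b"""<images>
  <image file="a.jpg" label="cat"/>
  <image file="b.jpg" label="dog"/>
</images>"""

class ImageHandler(xml.sax.ContentHandler):
    """Stream the file element by element instead of building a DOM tree."""
    def __init__(self):
        super().__init__()
        self.images = []

    def startElement(self, name, attrs):
        if name == "image":
            # Copy the attributes we need; the parser reuses its buffers.
            self.images.append(dict(attrs))

handler = ImageHandler()
xml.sax.parseString(SAMPLE, handler)
print(len(handler.images))  # 2
```

The memory win comes from never holding more than one element's worth of
parser state at a time, which matters when the XML file is megabytes.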

The startup time will vary considerably with the CPU, memory, and disk
subsystems (the restart shouldn't be too sensitive to disk if there's
enough memory, since everything will be cached in main memory).  I
observed a restart time of 7 seconds on my laptop; the strace output
suggested that reading the index.xml file took about 2 seconds
(there were a bunch of sbrk's within this, as it was growing memory).
This is a slow machine by contemporary standards (1 GHz with 512 MB
PC100 SDRAM).

23518 20:11:49.278580 open("/home/rlk/images/index.xml", O_RDONLY|O_LARGEFILE) = 11
...
23518 20:11:51.362672 close(11)         = 0
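
The elapsed time can be read straight off the strace timestamps between
the open and close lines; a quick check of the arithmetic:

```python
from datetime import datetime

# Timestamps copied from the strace log above
t_open  = datetime.strptime("20:11:49.278580", "%H:%M:%S.%f")
t_close = datetime.strptime("20:11:51.362672", "%H:%M:%S.%f")

print((t_close - t_open).total_seconds())  # 2.084092
```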

Note that while it was reading the 2.6 MB index.xml file, the process
grew by about 38 MB:

23518 20:11:49.473808 brk(0)            = 0x8381000
23518 20:11:51.360448 brk(0xa880000)    = 0xa880000

It grew another 7 MB during the image scan.
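
The 38 MB figure follows directly from the two brk return values above:

```python
# brk return values from the strace log
before = 0x8381000
after  = 0xa880000

delta = after - before
print(delta)                    # 38793216 bytes
print(round(delta / 2**20, 1))  # 37.0 MiB, i.e. "about 38 MB"
```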

In my index.xml file, each image takes about 500 bytes -- obviously
this will vary with the number of keywords and other decoration an
individual uses.  Certainly in binary format, this could be more
compact, but for a 50,000 item database it's only about 25 MB, which
isn't serious.  However, it suggests that the variable memory load in
a 50K database would be about 450 MB, which means that a memory-based
representation would require at least about 1 GB to be useful.  The
question is whether a photographer serious enough to have a database
that size (or larger!) would spring for 1-2 GB of memory.  I think so,
myself.
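
The back-of-envelope numbers work out as follows, assuming the observed
ratios above (2.6 MB on disk, 38 MB + 7 MB of growth) scale linearly; the
constants are taken from this message, not measured elsewhere:

```python
BYTES_PER_IMAGE_ON_DISK = 500  # observed average in index.xml
INDEX_SIZE = 2.6e6             # bytes: this index.xml file
MEM_GROWTH = 45e6              # bytes: 38 MB XML load + 7 MB image scan

images = INDEX_SIZE / BYTES_PER_IMAGE_ON_DISK  # roughly 5200 images
mem_per_image = MEM_GROWTH / images            # roughly 8.7 KB resident each

print(round(50_000 * BYTES_PER_IMAGE_ON_DISK / 1e6))  # 25 MB on disk
print(round(50_000 * mem_per_image / 1e6))            # 433 MB resident
```

The ~433 MB resident estimate matches the "about 450 MB" figure above to
within the precision of these inputs.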

BTW, take a look at the save time, also.

   Robert, it seems like you helped me avoid going through all the
   trouble of implementing a database backend, just to realize that it
   didn't give me anything for real. Thanks!

The numbers suggest to me that ultimately you may want to consider a
back end that doesn't keep everything in RAM, but that it's not
particularly urgent just yet.  My personal sweet spot appears to be
something in the range of 5000 images/year; professional photographers
will need a lot more, although I couldn't say exactly how many
offhand, and it will vary tremendously.  However, assuming that
wedding photographers will shoot 500 images per event, three times per
week, they'll accumulate about 75,000 images per year.  The storage
requirements for the images themselves are not insignificant at these
levels; assuming 10 MB/image, that would be 750 GB storage per year.
So anyone with that many images is going to have a high-powered
computer and shouldn't balk at a few GB of RAM for indexing.
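
The wedding-photographer estimate above, spelled out (assuming roughly
50 shooting weeks per year, which the message implies but doesn't state):

```python
IMAGES_PER_EVENT = 500
EVENTS_PER_WEEK  = 3
WEEKS_PER_YEAR   = 50   # assumption: ~50 working weeks
MB_PER_IMAGE     = 10

images_per_year = IMAGES_PER_EVENT * EVENTS_PER_WEEK * WEEKS_PER_YEAR
print(images_per_year)                        # 75000
print(images_per_year * MB_PER_IMAGE / 1000)  # 750.0 GB per year
```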

Keep in mind that digital photography is still relatively new, so
people haven't accumulated really large databases yet, and scanning
thousands of back images is prohibitively time-consuming.  Therefore
databases will grow over time, and you have time to see what kind of
indexing requirements people have.



More information about the Kphotoalbum mailing list