[Digikam-devel] [Bug 159477] New: "Robust metadata support", "Crash resistance", "Versioning", "Outside Program Interference"

Sherwood Botsford sgbotsford at gmail.com
Mon Mar 17 15:50:12 GMT 2008


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
         
http://bugs.kde.org/show_bug.cgi?id=159477         
           Summary: "Robust metadata support", "Crash resistance",
                    "Versioning", "Outside Program Interference"
           Product: digikam
           Version: 0.8.2
          Platform: Debian stable
        OS/Version: Linux
            Status: UNCONFIRMED
          Severity: wishlist
          Priority: NOR
         Component: general
        AssignedTo: digikam-devel kde org
        ReportedBy: sgbotsford gmail com


Version:           0.8.2 (using KDE 4.0.0)
Installed from:    Debian stable Packages
OS:                Linux

While digikam has lots of features, I find myself using it with some fear.
As far as I can tell from the docs, digikam stores all of its data in an SQLite database.
Given the ease with which a database can be turned into a random number generator, I want belt and
suspenders.

My wishlist for a  photo database:

0.  I want to store up to a million images.

1.  I want to maintain a folder hierarchy.  I want the app to be able to manipulate both
the physical directory structure of the collection (folders) and the virtual structure (albums).
(The inability to manipulate the folder structure led to my discarding digikam the first time I tried it.)

2.  I want to be able to manipulate the folder hierarchy outside of the application.  

3.  I want a view of an album or a folder to optionally include views of subfolders/subalbums.  E.g.
I have an album called forests, with subalbums called birch forests, aspen forests, alder forests, spruce forests.
When I click forests, I want to be able to see either just the stuff in the top-level album, or all the
stuff in all the subalbums.

4.  I want to be able to filter on any combination of date, time of day, description, keywords, categories, and who's in the pic,
using and/or/not, plus agrep-style near misses.

5.  I want to be able to pick a photo, edit it in some other program, drop a copy in the folder tree
of the database, and have the database realize that it's a copy and that most of the metadata is the same.

6.  I want to be able to set the database a task that will spend the night grovelling through
an existing collection and, with reasonable accuracy, tell me that picture B is a cropped, resized, colour-adjusted version of picture A.

7.  I want the database to survive corruption.

Whew!

How can this be done?

A.  All data is written at least twice: once in the database, once in the metadata fields of the
picture itself, and once in a dot file in the directory where the picture lives.  Not all file formats support
writable metadata - e.g. most raw formats and many simple raster formats.
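
Something like the following sketch is what I have in mind for the dot file copy.  The sidecar
name (".<image>.meta") and the JSON layout here are made up, not anything digikam does today:

import json
import os

def write_sidecar(image_path, metadata):
    """Write a hidden per-image metadata file next to the image."""
    folder, name = os.path.split(image_path)
    sidecar = os.path.join(folder, "." + name + ".meta")
    with open(sidecar, "w") as f:
        json.dump(metadata, f, indent=2)
    return sidecar

# Example call (folder must exist).  The same record would also go into the
# database and, where the format allows it, into the image's own metadata fields.
# write_sidecar("/photos/forests/birch_001.jpg",
#               {"id": "a1b2c3", "keywords": ["forest", "birch"]})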

B.  Part of the metadata written is a unique ID for the image.  For image formats that support metadata,
this allows images to move around the file tree.  Digikam should be able to catch such moves by monitoring ctime
changes.  For non-writable formats, a hash value of the file can be stored both in the database and in the
directory.  Simple raster formats are almost certainly derivative; I'm not sure how best to deal with these.
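
For formats without writable metadata, the hash could be as simple as a digest of the file
bytes; a sketch (the choice of SHA-1 is arbitrary):

import hashlib

def file_hash(path, chunk_size=1 << 20):
    """Hex digest of the file contents; store it in the database and in the
    directory's dot file so a renamed or moved file can still be matched."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()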

C.  When an image is modified, a suffix is added to the ID.  One of the metadata fields can state how it was
modified if the change was made from within digikam.  If it was done by an external program, digikam prompts for information about
a photo or directory of photos when it discovers them.
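
One possible shape for the suffix scheme, purely as illustration (the "." separator and the
numbering are made up):

def derived_id(parent_id, existing_ids):
    """Next free child ID: 'a1b2c3' -> 'a1b2c3.1', 'a1b2c3.2', ...
    Nested edits just keep adding suffixes, so the history is visible in the ID."""
    n = 1
    while "%s.%d" % (parent_id, n) in existing_ids:
        n += 1
    return "%s.%d" % (parent_id, n)

print(derived_id("a1b2c3", {"a1b2c3.1"}))   # -> a1b2c3.2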

D.  I've heard rumours of invariants that, when run on an image, can show similarity/difference.  They are essentially the
opposite of a hash function: where the slightest difference gives an entirely different string, invariants give
similar strings for similar but not identical images.  I suspect that creating an invariant for scaling would be
fairly easy, and one for colour transforms wouldn't be too hard; ones for cropping would be a lot harder.  I'm pretty sure
that there is no single invariant that works all the time; it would take a bunch to be sure.  This would
need to be part of the housekeeping function.  Where digikam knows the images are derivative works, it's easy, and indeed
those cases can be used as a test bed for finding derivatives produced by outside programs.
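
For what it's worth, a very crude invariant of the scaling/brightness kind could look like the
sketch below: shrink to 8x8 greyscale and record which pixels beat the mean.  Scaled or lightly
colour-adjusted copies give similar bit strings; crops don't.  This assumes the Python Imaging
Library is available and is only meant to show the idea:

from PIL import Image

def average_hash(path, size=8):
    """64-bit-style signature: 1 where the shrunken pixel is brighter than the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

def hamming(a, b):
    """Number of differing bits; small values suggest derivative images."""
    return sum(x != y for x, y in zip(a, b))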

E.  The multiple locations of metadata give robustness.  If the database is completely corrupt, much of it
can be rebuilt by scanning the images and directories.  If a picture vanishes from a folder, both the metadata and the folder
data show what used to be there.  This could be matched with images that suddenly appear elsewhere in the directory tree.
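
The rebuild could be little more than a walk over the collection that reads the dot files back
in (same made-up sidecar layout as the earlier sketch); embedded image metadata could be read
the same way with an exif library:

import json
import os

def rebuild_from_sidecars(root):
    """Walk the collection and pour every sidecar record back into a fresh table."""
    records = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith(".") and name.endswith(".meta"):
                with open(os.path.join(folder, name)) as f:
                    record = json.load(f)
                records[record["id"]] = record
    return records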

F.  Maintaining an internal directory tree of hard links to images can also help track down outside file moves
(a sketch of the link-count check follows the trash list below).  If the move was a copy and delete, then the
link count drops to 1.  Digikam can compare images whose link count is 1 against newly appeared files to
establish the new location.  If it was a plain move, then the internal hard link still references the same file,
which disagrees with the path in the database.  Digikam updates the database and the folder data.
Additional robustness for metadata-capable files could be had by storing the current location in the tree in the file itself.
When digikam moves a file, this is automatically updated.  If an outside program moves the file, it still points to the
old location.  The housekeeper looks for inconsistencies.  If a file has been moved off the file system it gets trickier.
I would propose that digikam keep a list of one-link files.  If digikam can't find an orphaned file, then the single
remaining link is considered 'trash'.  Characteristics of the trash are user-selectable but should include:
* Don't empty the trash until I tell you to.
* Keep everything in the trash for N days.
* Keep the last X GB of images in the trash.
* Fuss at me if the trash is getting too full.
Trash appears as a folder in the database.
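
As promised above, a sketch of the link-count check, assuming digikam keeps a shadow tree of
hard links under its own directory (the function and its arguments are made up):

import os

def classify_shadow_link(shadow_path, db_path):
    """shadow_path: digikam's own hard link; db_path: where the database thinks the image is."""
    if os.stat(shadow_path).st_nlink == 1:
        return "orphaned"   # the user's copy was deleted (or copied-and-deleted elsewhere)
    if not os.path.exists(db_path) or not os.path.samefile(shadow_path, db_path):
        return "moved"      # the same file still exists, but not at the recorded path
    return "ok"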

This is all quite compute intensive.  But I'll point out that anyone who is serious about photography has at least
two cores, and maybe as many as 8, working for him.  Keep those other cores busy.  I suspect that, to do all this,
digikam needs to be separated into three programs: a front end, a database daemon, and a housekeeping daemon.
Doing it as three programs allows the housekeeper to be reniced to some non-obnoxious value so that even a
single-core machine doesn't slow to a crawl.
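
The housekeeper part could be as simple as a loop that first drops its own priority; a sketch
(the pass names in the comment are placeholders):

import os
import time

def run_housekeeper(passes, interval=3600):
    """Run maintenance passes forever at low priority."""
    os.nice(19)              # same effect as launching under "nice -n 19"
    while True:
        for task in passes:  # e.g. [verify_hashes, find_derivatives, check_links]
            task()
        time.sleep(interval)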


Examples of projects I've done:
I go on a trip.  I come back with 500 images.
1.  Edit with Photoshop.  Produce an edited PSD for each image.
2.  Export from Photoshop to JPEG.
3.  Create 12 sizes of each image, ranging from 2024x3032 down to 64x96.  These become
the web page images.

So 500 images have just become 7500 images.  Step 3 would probably happen outside the photo library.
Since it's script-driven, recreating those images is easy.

I go out on the tree farm and take my spring snaps.
1.  I bring back 60 raw format images.
2.  Open each one in Photoshop.
3.  Produce one to three cropped images.
4.  Save each cropped Photoshop file separately.
5.  Batch process the Photoshop directory to create a directory of JPEG versions of
all the images.
6.  Produce different sizes of JPEG file for use on a web page.

So for each image I end up with:
* The raw format original.
* An adjusted full-size PSD.
* 1-3 cropped PSDs.
* 2-4 full-resolution JPEGs derived from the PSDs.
* Some number of resized images derived from the JPEGs.


