[Digikam-users] backup and data integrity

Arnd Baecker arnd.baecker at web.de
Sun Feb 3 18:47:50 GMT 2008


On Wed, 23 Jan 2008, Gerhard Kulzer wrote:
[... previous discussion about checksum algorithms snipped ...]
> Hi Arnd,
> I'll try to summarize what we said last night on IRC, just as a public memo.
>
> Aim is to
> a) prevent corrupt images from being saved onto disk and to
> b) detect existing corrupt files on disk
>   (to prevent overwriting of potentially good backups)
>
> Strategies like DIF and HARD won't be available in the consumer market for
> another couple of years, but given the increase in size, speed and
> complexity of systems, consumer systems will implement some kind of ECC
> (horizon ~ 3y).
>
> Protection at the file-system level, as provided by zfs and btrfs, is good
> but insufficient, as it protects only the disk and not the transmission
> chain application - OS - I/O controller - file system.
>
> So we have to do it 'by hand' (meaning: in digikam).

Yes, full agreement!

> While saving a file after modification, for a):
> 1. keep it in memory
> 2. save it to disk
> 3. flush the disk to clear the OS cache
> (3a. optional: make sure all disk-internal buffers are cleared by reading
> other data the size of the disk buffer)
> 4. run a CRC/checksum on the file on disk and on the file in memory
> 4a. alternative: store the checksum in the metadata and save it with the file.

Does this work? I mean: you compute the checksum based on the
file contents. Then you add the checksum to the file, but
that changes the file contents and thus its checksum. So
there is no way to embed a checksum of the complete file
in the file itself; it can only work if the checksum covers
just the image data and excludes the metadata field that
holds it.
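
A tiny illustration of the problem (Python, purely for demonstration;
the byte string just stands in for real image data):

    import hashlib

    data = b"stand-in for image bytes"
    h1 = hashlib.sha1(data).hexdigest()           # checksum of original contents
    data_with_sum = data + h1.encode()            # "embed" the checksum in the file
    h2 = hashlib.sha1(data_with_sum).hexdigest()  # checksum of the new contents
    assert h1 != h2   # embedding changed the contents, so the stored sum is stale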

> 5. if there is a mismatch, re-write the file and repeat the procedure
>
> for problem b):
> 6. if 4a was used, a simple scrubbing scan can be launched, manually or
> scheduled at frequency X
> 6a. alternatively, try to open the files and look for errors (this method is
> not reliable: I have images that show only the upper part, are corrupt, and
> produce no error message; the more severe errors can be found this way,
> though)
> 7. generate a user alert so that one can manually compare backup and
> original.
>
> This method may seem tedious, but it has the advantage of being independent
> of OS and file system; it works on NFS as well.
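
To make steps 1-5 concrete, here is a minimal sketch in Python (purely
illustrative, digikam itself is C++/Qt; the function name is made up).
One caveat: the read-back may be served from the OS page cache, so this
verifies the write path but not the platters themselves; that is exactly
what the optional step 3a is about:

    import hashlib
    import os

    def save_and_verify(path, data, retries=3):
        """Steps 1-5: keep the image in memory, write it to disk,
        flush, re-read the on-disk copy and compare checksums;
        re-write on mismatch."""
        want = hashlib.sha1(data).hexdigest()   # checksum of the in-memory copy
        for _ in range(retries):
            with open(path, "wb") as f:         # 2. save it to disk
                f.write(data)
                f.flush()
                os.fsync(f.fileno())            # 3. flush OS buffers to the drive
            with open(path, "rb") as f:         # 4. checksum the on-disk copy
                got = hashlib.sha1(f.read()).hexdigest()
            if got == want:
                return want                     # write verified
        raise IOError("could not write %s without corruption" % path)  # 5. gave up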

OK, the next thing is a proposal for the more technical side
of integrating all this into digikam:

A) For every new image/file getting under digikam's control:
   compute a checksum/hash and
   add
      (hash, date of the hash computation,
       modification time of the file on disk)
   to the database
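
In Python pseudo-form this would be roughly the following (a sketch
only: digikam is C++/Qt, and the table/column names here are invented,
not digikam's real schema):

    import hashlib, os, sqlite3, time

    def file_hash(path):
        """SHA-1 of a file, read in 1 MB chunks to keep memory use flat."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def register_file(db, path):
        """A): store (hash, date of hash computation, file mtime)."""
        db.execute("INSERT OR REPLACE INTO integrity"
                   " (path, hash, hash_date, mtime) VALUES (?, ?, ?, ?)",
                   (path, file_hash(path), time.time(),
                    os.path.getmtime(path)))
        db.commit()

    # one-time setup of the invented table (last_check is used under D))
    db = sqlite3.connect("integrity.db")
    db.execute("CREATE TABLE IF NOT EXISTS integrity"
               " (path TEXT PRIMARY KEY, hash TEXT, hash_date REAL,"
               "  mtime REAL, last_check REAL)")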

B) When editing images, use the above-described procedure
  to ensure that the file is correctly written to disk:
  a) before editing: verify the hash (see the sketch below)
  b) after editing:
     store the corresponding (hash, date of hash, mod-time)
     in the database
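
For a), roughly this (again a sketch against the invented 'integrity'
table from A)):

    import hashlib

    def verify_before_edit(db, path):
        """Refuse to open a file for editing if its current hash no
        longer matches the one stored in the database."""
        with open(path, "rb") as f:
            current = hashlib.sha1(f.read()).hexdigest()
        row = db.execute("SELECT hash FROM integrity WHERE path = ?",
                         (path,)).fetchone()
        if row is None:
            raise KeyError("%s not registered yet" % path)
        if row[0] != current:
            raise IOError("%s seems corrupt: hash mismatch" % path)
        # after a verified save, register_file() from A) stores the
        # new (hash, date of hash, mtime) triple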

C) What about files which get modified/added by external tools?
   i) when digikam is running:
      All such changes are detected by KDirWatch.
      ((Is this statement correct? E.g. even if the file date
        is not changed?))
      a) addition of a new file: see A)
      b) modification of a file already in the database:
         Here a warning should be given
         (but not much can be done, right?).
         Apart from this: see A)
   ii) when digikam is not running:
      a) addition of a new file: see A)
      b) modification of a file already in the database:
         If the file modification time differs from the one
         in the database, this *could* be detected
         (see the sketch after this section).
         However, this might add some time to the initial
         scanning. ((not sure how much time ...))

         - if such a change is detected: see i) b) above
         - if such a change is not detected:
           possible problem.
           This can only be detected in a full check, see D)
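
The mtime comparison for ii) b) could look like this (sketch; schema
as invented under A)):

    import os

    def detect_external_changes(db):
        """Compare each file's current mtime with the stored one; a
        mismatch means the file was touched while digikam was not
        running.  A rewrite that preserves the mtime slips through --
        only the full check in D) catches those."""
        suspects = []
        for path, stored_mtime in db.execute(
                "SELECT path, mtime FROM integrity").fetchall():
            try:
                if os.path.getmtime(path) != stored_mtime:
                    suspects.append(path)   # warn, then re-register as in A)
            except OSError:
                suspects.append(path)       # file vanished or unreadable
        return suspects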

D) New check tool for data integrity:

   Visual side:
   - will display the oldest non-checked file
   - maybe a visual overview of files not checked (in a given time window)
     (could look similar to the time line ... ;-)
   - a reminder, on startup of digikam, to perform a check
     at regular intervals (user-specified).

  Actual check (sketched below):
   - just loop over all images, recompute the hash value,
     and update the last-check date in the database
   - a quick version could just check the modification times

  This tool should be stoppable/restartable at any time
  and run in the background,
  while one can do all the normal stuff with digikam.
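
In the same Python pseudo-form (the last_check column is part of the
invented schema from A); stopping is reduced to a callback here):

    import hashlib, time

    def scrub(db, should_stop=lambda: False):
        """Loop over all files, least recently checked first,
        recompute each hash and update its last-check date.
        Committing per file means a stop loses no work."""
        rows = db.execute("SELECT path, hash FROM integrity"
                          " ORDER BY last_check").fetchall()
        for path, stored_hash in rows:
            if should_stop():               # stoppable at any time
                break
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            if h.hexdigest() != stored_hash:
                print("CORRUPT:", path)     # raise the user alert here
            db.execute("UPDATE integrity SET last_check = ?"
                       " WHERE path = ?", (time.time(), path))
            db.commit()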

E) Backup

   Here we have to ensure that no "good" copies in
   the backup get overwritten by corrupted images from
   the main repository.

   Using just rsync does not seem sufficient:
   a) rsync --checksum takes a long time once the number
      of files is large
   b) it does not know about the hashes stored inside digikam's
      database

   This is of course a pity, because using standard unix tools
   is normally preferable to re-inventing the wheel.
   So we have to think about this point ...
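
   One cheap bridge to the unix tools: export the hashes we already
   have into the format that sha1sum -c understands, and let the
   backup run only if the verification passes. Sketch (schema as
   invented under A)):

    def write_manifest(db, out_path):
        """Dump (hash, path) pairs in sha1sum's manifest format, so
        'sha1sum -c manifest' can verify the tree without rehashing
        anything inside digikam."""
        with open(out_path, "w") as out:
            for path, stored_hash in db.execute(
                    "SELECT path, hash FROM integrity"):
                # two spaces between hash and name: sha1sum's format
                out.write("%s  %s\n" % (stored_hash, path))

   Then something like "sha1sum -c manifest && rsync -a ..." would
   refuse to copy from a repository that fails verification.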

   Note that this is related to
   - "Image backup with thumbs and metadata database for fast searching"
      http://bugs.kde.org/show_bug.cgi?id=133638
   - "backup on dvd (and maybe sync with dvd-ram?)"
      http://bugs.kde.org/show_bug.cgi?id=113715
   - "Sync Plugin: New Syncronisation Framework KIPI Plugin"
      http://bugs.kde.org/show_bug.cgi?id=143978
   and to some extent also to
   - "Wish: Offline manager for Digikam"
      http://bugs.kde.org/show_bug.cgi?id=114539
   - "Wish: easy transport of albums, including tags, comments, etc."
      http://bugs.kde.org/show_bug.cgi?id=103201

   For the moment I think we should postpone the details
   of this point until A) - C) are implemented and tested.
   External tools could then use the information in the database
   to work out the right approach for E).

Comments are very much appreciated!
(And: Should we turn this into a BKO wish?)

Best, Arnd


