[digikam] [Bug 374591] Deleting image only removes the file and sets the status to hidden but does not delete the image from DB

Mario Frank bugzilla_noreply at kde.org
Thu Feb 2 08:52:21 GMT 2017


https://bugs.kde.org/show_bug.cgi?id=374591

--- Comment #6 from Mario Frank <mario.frank at uni-potsdam.de> ---
Created attachment 103766
  --> https://bugs.kde.org/attachment.cgi?id=103766&action=edit
This patch introduces garbage collection as maintenance stage and reduces the
amount of generated garbage.

The garbage collection stage runs before the thumbnail rebuild stage.

Description of problems and approach.

The garbage that bloated my databases was quite annoying. Here is a
sketch of where it comes from:

1) Moving an image to trash. Every time I delete an image, the Images
table entry is set to status Removed (3) and the original album id is
removed from the entry. If I restore the image from trash, a new Images
entry is generated.

2) Deleting images/albums directly or from trash. The image files are
deleted from the hard drive, but the Images entries are not removed
(nor are ImageTagProperties and so on).

3) Moving/renaming images creates a new Images entry. The old one is set
to status Removed.

4) Deleting images does not remove the thumbnails (referenced by path/
uniqueHash).

5) Deleting face regions does not remove the region thumbnails (custom
identifier)

6) Removing a tag does not remove identities from the recognition db
(every identity should have the same faceEngineUUID as a tag).

Here is a description of what the patch does:
1) Creating less junk:
I introduced a new item status "Obsolete" and renamed the status
"Removed" to "Trashed".
Items are set to status Trashed if they are moved to trash. If items are
deleted directly/permanently,
they get the status "Obsolete".

If an image is restored from trash, I search for an item entry that has
status Trashed and the same properties as the new one. If I find one, I
reuse that entry, set its album (new or old) and set the status to
visible.
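
A minimal sketch of this reuse logic, written in Python for brevity and
not taken from the patch. The ImageEntry class and the restore_from_trash
function are hypothetical names. Only the value 3 for the trashed state
comes from the text above; the other status numbers are placeholders.
Comparing uniqueHash and file size is an assumed reading of "same
properties".

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Status(IntEnum):
    # Only 3 (formerly "Removed") is stated in the text; the other
    # values are placeholders for this sketch.
    VISIBLE = 1
    TRASHED = 3
    OBSOLETE = 4


@dataclass
class ImageEntry:
    id: int
    album: Optional[int]
    name: str
    unique_hash: str
    file_size: int
    status: Status


def restore_from_trash(entries: List[ImageEntry], album: int, name: str,
                       unique_hash: str, file_size: int) -> ImageEntry:
    """Reuse a matching Trashed entry if one exists, else add a new one."""
    for entry in entries:
        if (entry.status == Status.TRASHED
                and entry.unique_hash == unique_hash
                and entry.file_size == file_size):
            # Reuse the old row so its id, and anything attached to it,
            # survives the round trip through the trash.
            entry.album = album
            entry.name = name
            entry.status = Status.VISIBLE
            return entry
    # No trashed match: fall back to creating a fresh entry.
    new_entry = ImageEntry(
        id=max((e.id for e in entries), default=0) + 1,
        album=album, name=name, unique_hash=unique_hash,
        file_size=file_size, status=Status.VISIBLE)
    entries.append(new_entry)
    return new_entry
```

Because the old id is kept, rows in other tables that point at the image
stay valid after the restore.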

If an image is renamed/moved, I use the moveItem method of the core DB
to set the new album/name of the image.
This way, the ImageScanner does not think that this is a new image. The
old entry is reused.
This could solve the grouping problem.

I cannot solve points 4 to 6 in the same easy way, especially not the
thumbnails problem, since thumbnails can be referenced by image path,
image uniqueHash/file size and image path/face region (custom
identifier).

Thus, I made some clean-up routine for our databases.

2) Collecting junk:
I implemented the DbCleaner maintenance module. It runs at every start
of digikam (if configured so in setup->misc) and removes all stale
Images entries (detectable by status Obsolete). This does not take much
time. But the DbCleaner can do more. In the Maintenance dialog, I added
a stage Database Cleanup, which can be triggered. The stage can also
clean the thumbs db and recognition db, but this must be explicitly
selected in the menu as it can take more time. The thumbs and
recognition db are never cleaned at the start of digikam; I do not want
our users to wait minutes until they can work.
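
The cheap start-up step can be pictured as a single delete by status.
The following Python/sqlite3 sketch uses a heavily simplified,
hypothetical schema (the table names are taken from the text, but only
the id, status and imageid columns are modelled), and the numeric value
for Obsolete is a placeholder. It is not the code from the patch.

```python
import sqlite3

OBSOLETE = 4  # placeholder; the text does not give the numeric value


def remove_obsolete_images(conn: sqlite3.Connection) -> int:
    """Delete all Images rows marked Obsolete plus rows that depend on them."""
    with conn:  # one transaction, committed on success
        # Remove dependent rows first so nothing is left dangling.
        conn.execute(
            "DELETE FROM ImageTagProperties WHERE imageid IN "
            "(SELECT id FROM Images WHERE status = ?)", (OBSOLETE,))
        cur = conn.execute(
            "DELETE FROM Images WHERE status = ?", (OBSOLETE,))
    return cur.rowcount  # number of Images rows removed
```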

Now to what the DbCleaner does. As already said, it removes the stale
images. But let's take a step back. Getting the stale images is just one
call to core db. Getting the stale thumbnails is more complicated.
Getting the stale face identities is less complicated. In the first
phase of the DbCleaner I analyse the databases (thumbs and recognition
only if enabled).
Identities are stale if there is no tag in core db that has the same
faceEngineUUID as the identity.
Thumbnails are stale if all of the following hold:
    1) There is no image in core DB whose file path leads to that
       thumbnail (FilePaths table)
    2) There is no image in core DB whose uniqueHash and file size lead
       to that thumbnail (UniqueHashes table)
    3) There is no face region of an image whose custom identifier
       (image file path + region) leads to that thumbnail
       (CustomIdentifiers table)

So I first get all thumbnail ids from thumbs db into a list A and all
image ids into another list B.
Then I get the thumbnail ids for every image by their file path,
uniqueHash/file size and also the thumbnails for the face regions, and
remove those thumbnail ids from my list A.
The remainder in the list is thus neither connected to an image/video
nor to face regions.
I know that this is not a really efficient way. But if, let's say, a
face region is deleted, I cannot delete the thumbnail right away since
it could still be used by some other image, by file path for example.
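
The analysis phase described above can be sketched as set subtraction.
This is an illustrative Python sketch, not the patch itself: the
function names and the dictionary-based lookups (standing in for the
FilePaths, UniqueHashes and CustomIdentifiers tables) are assumptions
made for the example.

```python
def find_stale_thumbnails(all_thumb_ids, images, by_path, by_hash,
                          by_custom_id):
    """Return the thumbnail ids that no image or face region refers to.

    by_path:      {file path: thumbnail id}
    by_hash:      {(uniqueHash, file size): thumbnail id}
    by_custom_id: {(file path, region): thumbnail id}
    """
    remaining = set(all_thumb_ids)              # "list A"
    for image in images:                        # "list B"
        # discard() ignores ids that are absent, including None for a
        # lookup that found nothing.
        remaining.discard(by_path.get(image["path"]))
        remaining.discard(
            by_hash.get((image["unique_hash"], image["file_size"])))
        for region in image["face_regions"]:
            remaining.discard(by_custom_id.get((image["path"], region)))
    return remaining


def find_stale_identities(identity_uuids, tag_uuids):
    """Identities whose faceEngineUUID matches no tag are stale."""
    return set(identity_uuids) - set(tag_uuids)
```

A thumbnail only counts as stale once all three lookups have failed to
claim it for every image, which is why it cannot be dropped at the
moment a single reference disappears.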

When I am done with that, I first clean the core db (stale images).
After that, I clean the thumbs db and then the recognition db. So much
for the main process. The progress is shown to the user, along with
what is currently being done (analyse, clean core DB, clean thumbs DB,
clean recognition DB).
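
The ordering of the cleaning phases and the progress reporting could
look roughly like this. It is a hypothetical Python sketch: run_cleanup,
the phase tuples and the report callback are invented for illustration
and do not mirror the Maintenance tool API.

```python
def run_cleanup(phases, report):
    """Run the cleaning phases in order and report progress per item.

    phases: ordered list of (label, items, delete) tuples, where
            delete(item) removes one stale entry.
    report: callback(label, done, total) shown to the user.
    """
    total = sum(len(items) for _, items, _ in phases)
    done = 0
    for label, items, delete in phases:
        for item in items:
            delete(item)
            done += 1
            report(label, done, total)
    return done
```

The caller passes the phases in the order core DB, thumbs DB,
recognition DB, using the stale ids collected during analysis.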

Then I tested my implementation with my database. I have about 40000
images and my thumbnail db contains 96000 FilePaths entries, 205000
UniqueHashes entries and 180000 CustomIdentifiers entries. There are
about 255000 Thumbnails entries. The file size of the database (SQLite)
file is 2.9 GB.

About 200000 of the thumbnails are recognised as stale. Removing the
thumbnails one by one per thread, as is done for thumbnail generation
for example, took an exorbitant amount of time. To be frank (pun
intended), I gave up after 2 hours. The context switching is a pain,
since multi-threaded access to the thumbs db does not seem to work
(well). Then I adapted my cleaner threads to work in chunks, with the
chunk size modifiable (currently hard-coded). This worked better, but
there were still too many threads waiting for IO. Against all
expectations, I found that completely sequential cleaning is the
fastest. If I let one worker thread remove all stale thumbnails,
cleaning my complete databases takes only about 8 minutes on my 3 year
old Core i5 (with 16 GB RAM and no SSD). After vacuuming, my thumbs db
is only 650 MB in size.
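
For illustration, a single-worker clean-up followed by a vacuum might be
sketched like this in Python with sqlite3. The one-table schema and the
function name are assumptions, and the chunking here only bounds the
number of SQL parameters per statement; it is not the multi-threaded
chunking mentioned above, and none of this is the code from the patch.

```python
import sqlite3


def delete_thumbnails_sequentially(conn, stale_ids, chunk_size=500):
    """Delete stale thumbnails on one connection, then vacuum the file."""
    ids = list(stale_ids)
    removed = 0
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        with conn:  # one transaction per chunk, no concurrent writers
            cur = conn.execute(
                f"DELETE FROM Thumbnails WHERE id IN ({placeholders})",
                chunk)
        removed += cur.rowcount
    # VACUUM must run outside a transaction; it rewrites the database
    # file so the space freed by the deletes is actually released.
    conn.execute("VACUUM")
    return removed
```

Deleting rows alone does not shrink an SQLite file, which is why the
size drop is only visible after vacuuming.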
