[Nepomuk] Why store file urls?

Fri Nov 23 08:46:22 UTC 2012

Hey everyone

Last week I was somewhat shut out from the world, so I had some time to
think about a lot of different things in Nepomuk. This is one of the many
emails to come out of it.

For those of you who don't know, in Nepomuk we always store the complete
url of the file with the property nie:url.

Example -

<nepomuk:/res/23161f9c-8839-4de3-bba0-affdd6d654ef>
        rdf:type  nmm:MusicPiece
        rdf:type  nfo:FileDataObject
        rdf:type  nfo:Audio
        rdf:type  nie:InformationElement
        nie:url   file:///home/vishesh/Music/where_does_the_good_go.mp3

Storing this URL makes accessing file resources quite convenient. But I
fear it has been a terrible design decision. By storing the url we face the
following problems -

1. Changing the url of a directory is very expensive. This doesn't need to
be done very frequently, but occasionally the user might move/rename a
directory which contains a large number of files. The url of every one of
these files needs to be adjusted. Since changes in Nepomuk are not that
cheap, this results in virtuoso + nepomukstorage + nepomukfilewatch
consuming large amounts of cpu for quite some time.

This is *very* *very* noticeable when renaming a directory with over 1000
files.

2. Removable Media Handling - Our support for removable media is quite
poor. Currently, urls which are not fixed (because the mount point can
change) are stored under a "filex" scheme. Example -

<nepomuk:/res/7017a499-786b-4e97-a9f8-e9ee2506c322>
        rdf:type          nfo:FileDataObject
        nao:created       2012-11-02T17:52:16.022Z
        nao:lastModified  2012-11-02T17:52:16.088Z
        nie:url           <filex://72acd848acd8090d/Lost>

The "72acd848acd8090d" is the UUID of the device.

When any results containing "filex" are being returned, Solid is consulted
to check if that particular device is mounted, and accordingly the filex is
translated to "file:/mountpoint/". This way one can mount a removable
device under different locations and still not lose the data.
Theoretically.
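
To make the translation step concrete, here is a rough Python sketch of
the idea. It is not the actual Nepomuk code: the MOUNTED dictionary is a
hypothetical stand-in for whatever Solid reports, and "/media/usb0" is a
made-up mount point.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit

# Hypothetical stand-in for a Solid lookup: device UUID -> current mount point.
MOUNTED: Dict[str, str] = {"72acd848acd8090d": "/media/usb0"}


def filex_to_file(url: str, mounted: Dict[str, str]) -> Optional[str]:
    """Translate filex://<uuid>/<path> into file://<mountpoint>/<path>.

    Urls with any other scheme are returned unchanged. None is returned
    when the device is not currently mounted.
    """
    parts = urlsplit(url)
    if parts.scheme != "filex":
        return url
    mount_point = mounted.get(parts.netloc)  # netloc holds the device UUID
    if mount_point is None:
        return None
    return "file://" + mount_point.rstrip("/") + parts.path
```

With the example above, filex_to_file("filex://72acd848acd8090d/Lost",
MOUNTED) gives "file:///media/usb0/Lost".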

The problem with this approach is that every single url which is passed
through Nepomuk needs to be checked for the "filex" scheme and then
translated. Since we do not have a sparql parser this is done by employing
regular expressions to check for patterns with file:/mount/point and filex.

Valgrind logs show that for small queries a sizable amount (up to 40%) of
time is spent in just this regular expression based parsing. Additionally,
since queries can return any kind of data, all of the data passed from
virtuoso to Nepomuk has to go through these checks.

3. Database consistency -

Since we operate on an RDF based database which does not provide us any
kind of checking (primary key, types, etc), we need to do all of these
checks on our own. We currently have 3 properties which need to be given
special privileges when dealing with files - nie:url, nfo:filename, and
nie:isPartOf.

When a file is moved (and renamed) from one directory to another, all 3 of
these properties need to be updated. We currently have code in the storage
service to explicitly check if the url is being changed and accordingly
update the filename as well. These are special cases that we need to check
for each time which result in extra cpu cycles.

Additionally, we have special handling for nie:url which seems to complicate
the code like crazy. In fact even I try to stay away from some of the
"core" code related to this stuff because it is so insanely complicated.

Proposed Solution
---------------------------

We only store urls for non-file related stuff. Otherwise we rely on the
nfo:filename and nie:isPartOf relation to traverse the file system tree.
That way (1) can very easily be addressed. (2) can be stored as a
nfo:RemoveableMediaDevice with the appropriate mount point, and maybe we
can even give different treatment to RemoveableDevices and NetworkStorage.

(3) is harder to judge, because the existing code is so complicated. I
think this solution would result in slightly messier and slower code in
some places, but the main code should get simplified.
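
To illustrate why (1) becomes cheap, here is a small Python sketch. It is
only a toy model under my own assumptions: the Node class is hypothetical,
with "filename" standing in for nfo:filename and "parent" standing in for
nie:isPartOf. Nothing here is real Nepomuk API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    filename: str                    # stands in for nfo:filename
    parent: Optional["Node"] = None  # stands in for nie:isPartOf

    def path(self) -> str:
        # The path is derived on demand instead of being stored per file.
        prefix = self.parent.path() if self.parent else ""
        return prefix + "/" + self.filename


music = Node("Music", Node("vishesh", Node("home")))
songs = [Node(f"track{i}.mp3", music) for i in range(1000)]

# Renaming the directory is a single write, however many files it contains.
music.filename = "Songs"
```

After the rename, songs[0].path() gives "/home/vishesh/Songs/track0.mp3"
even though none of the 1000 file nodes was modified.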

Problems
--------------

Accessing a file's metadata is going to get trickier and slower. One will
have to load the nfo:filename for every element in the chain up to the root.
However, I think this is not something the users of our libraries
should/will notice. This can be done transparently.
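
As a rough sketch of what that lookup would involve, here is a Python
example. The STORE dictionary and the "res:..." identifiers are made up
for illustration and only mimic fetching nfo:filename and nie:isPartOf
for a resource. The cost is one lookup per path component.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for the database:
# resource id -> (nfo:filename, id of the nie:isPartOf parent, None at the top)
Store = Dict[str, Tuple[str, Optional[str]]]

STORE: Store = {
    "res:home": ("home", None),
    "res:vishesh": ("vishesh", "res:home"),
    "res:music": ("Music", "res:vishesh"),
    "res:song": ("where_does_the_good_go.mp3", "res:music"),
}


def local_path(store: Store, res_id: str) -> str:
    """Rebuild the absolute path by following the parent chain to the top."""
    names: List[str] = []
    current: Optional[str] = res_id
    while current is not None:
        filename, parent = store[current]  # one lookup per path component
        names.append(filename)
        current = parent
    return "/" + "/".join(reversed(names))
```

Here local_path(STORE, "res:song") walks four resources and returns
"/home/vishesh/Music/where_does_the_good_go.mp3". A library could hide
this behind the existing api and cache the directory part of the chain.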

What do you guys think?

-- 
Vishesh Handa