[Nepomuk] RFC: Storing information about the user's web browsing experience in Nepomuk

Sun Jul 28 19:24:30 UTC 2013

Hi,

At the end of a previous mail on this mailing-list, I talked about 
implementing a Firefox extension that stores information related to the 
user's web browsing experience in Nepomuk.

My first idea was to detect when the user is using a webmail service 
(for instance, Outlook, GMail, Yahoo! Mail or any webmail based on 
Roundcube), then parse the pages and index the contacts and mails found 
on them in Nepomuk. This would allow non-technical users who don't know 
how to use KMail or Thunderbird to have their mails indexed by Nepomuk.

I tried to think a bit more about the idea today, and I read 
documentation about Firefox and Chrome extensions to see what is 
possible. This mail presents the results of my early thinking. If you 
like the idea, I can implement it during the rest of the Google Summer 
of Code period (about two months from now).

After a bit of research, I found that there are already several Firefox 
extensions related to KDE. Some of them are also closely related to 
Nepomuk. This blog post[1], for instance, describes an extension that 
records in Nepomuk which website a downloaded file came from. The goal 
of the extension that I describe is to index many things in Nepomuk, 
while remaining simple and requiring only small changes in Nepomuk (for 
instance, I don't want to develop dozens of executables).

My idea is to base my work on nepomukfilewatch and the Nepomuk file 
indexers. The first step would be to implement file indexers for common 
Internet standards (email files as stored in maildir directories, vCard 
files, bookmark and history files, etc.) and Nepomuk-specific file 
formats (links between a downloaded file and its origin, records of 
visits to a web page, etc.).
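
To give an idea of what an email file indexer would have to extract, 
here is a minimal sketch in Python. It is only an illustration: a real 
Nepomuk indexer would be a C++ plugin, and the function name and the 
returned dictionary layout are assumptions made for this example. It 
parses one message file (as found in a maildir directory) and pulls out 
the fields that would be worth storing.

```python
from email import message_from_bytes, policy
from email.utils import getaddresses


def extract_mail_metadata(raw):
    """Parse one RFC 5322 message and return the fields worth indexing.

    The dictionary layout is hypothetical, not a Nepomuk format.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    return {
        "subject": str(msg.get("Subject", "")),
        # Each entry is a (display name, address) pair.
        "from": getaddresses(msg.get_all("From", [])),
        "to": getaddresses(recipients),
        "date": str(msg.get("Date", "")),
    }
```

The (name, address) pairs are what would map to contacts, so the same 
pass over a message yields both the mail and the people involved in it.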

Using the Nepomuk file indexer is required because Chrome extensions are 
not allowed to spawn new processes (and hence helper commands). Chrome 
and Firefox extensions can contact an HTTP server, but implementing a 
Nepomuk Indexer HTTP server seems a bit too complex. The good news is 
that extensions for both browsers can create files on disk.

When the user visits a web page, the browser extension creates files in 
$TMP/nepomukindex (where $TMP is the current user's temporary 
directory). The files can be Nepomuk-specific files describing the 
action of visiting a page (such a file contains only the visited URL) or 
downloading a file. If the user is on a webmail site, mails can be 
extracted (with the user's permission) and stored in a standard MIME 
format.
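
The writing side could look roughly like the following sketch. It is 
written in Python for brevity, whereas the real extension would be 
JavaScript. The file naming scheme (visit-<id>.visit) and the temporary 
".part" suffix are assumptions of this example, not an existing Nepomuk 
convention. The record is written under a temporary name and then 
renamed, so that a watcher never sees a half-written file.

```python
import os
import tempfile
import uuid
from pathlib import Path


def write_visit_record(url, index_dir=None):
    """Write a one-line file containing the visited URL.

    Defaults to $TMP/nepomukindex; file names are hypothetical.
    """
    if index_dir is None:
        index_dir = Path(tempfile.gettempdir()) / "nepomukindex"
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)

    final = index_dir / f"visit-{uuid.uuid4().hex}.visit"
    partial = final.with_suffix(".part")
    partial.write_text(url + "\n", encoding="utf-8")
    # Rename within the same directory so the complete file appears
    # under its final name in a single step.
    os.replace(partial, final)
    return final
```

For example, write_visit_record("http://www.kde.org/") would leave one 
new ".visit" file in the index directory.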

These two pieces (the extensions that extract information from web 
pages and store it in $TMP/nepomukindex, and the Nepomuk file indexers) 
are bridged by monitoring $TMP/nepomukindex with nepomukfilewatch. 
When the browser extension adds a file, nepomukfilewatch detects it 
and indexes the new file.
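
The detection step can be pictured with the simple polling sketch below. 
This is not how nepomukfilewatch is implemented; it only illustrates the 
idea of noticing files that were not there before. The function name and 
the rule of skipping files ending in ".part" (a hypothetical marker for 
files still being written) are assumptions of this example.

```python
from pathlib import Path


def poll_new_files(directory, seen):
    """Return files that appeared in `directory` since the last call.

    `seen` is a set of file names, updated in place.
    """
    new_files = []
    for path in sorted(Path(directory).iterdir()):
        # Ignore sub-directories and files still being written.
        if not path.is_file() or path.suffix == ".part":
            continue
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files
```

Each path returned would then be handed to the indexer matching its file 
type.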

There are surely many other ways of doing this, but this solution seems 
relatively clean, and has some advantages. For instance, I use 
Thunderbird and my e-mails are stored in a sub-directory of my home 
directory. With the e-mail file indexer, my e-mails will be indexed, 
without the need to develop a Thunderbird extension.

One problem I see is that Nepomuk may try to erase the indexed data after 
a reboot, if nepomukfilewatch sees that the files no longer exist in the 
temporary directory. Is there a way to avoid that?

What do you think about this idea? Is this something that could improve 
the KDE and Nepomuk user experience?

Denis Steckelmacher.

NOTE: Why Firefox and Chrome and not Rekonq or anything else? I chose 
these two browsers because they are widely used and provide an extension 
mechanism. I once heard that Rekonq may one day support Chrome 
extensions, but is that already the case?

[1]: http://martys.typepad.com/blog/2012/02/so-you-want-to-keep-the-url-of-downloaded-file-eh.html