[Nepomuk] RFC: Storing information about the user's web browsing experience in Nepomuk
Denis Steckelmacher
steckdenis at yahoo.fr
Sun Jul 28 19:24:30 UTC 2013
Hi,
At the end of a previous mail on this mailing-list, I talked about
implementing a Firefox extension that stores information related to the
user's web browsing experience in Nepomuk.
My first idea was to detect when the user uses a webmail service (for
instance, Outlook, GMail, Yahoo! Mail or any webmail based on Roundcube),
and then to parse the pages and index the contacts and mails found on
them in Nepomuk. This would allow non-technical users who don't know how
to use KMail or Thunderbird to have their mails indexed by Nepomuk.
I tried to think a bit more about the idea today, and I read
documentation about Firefox and Chrome extensions to see what is
possible. This mail presents the results of my early thinking. If you
like the idea, I can implement it during the rest of the Google Summer
of Code period (about two months from now).
After a bit of research, I found that there are already several Firefox
extensions related to KDE. Some of them are also closely related to
Nepomuk. This blog post[1], for instance, describes an extension that
records in Nepomuk which website a downloaded file came from. The goal
of the extension that I describe is to index many things in Nepomuk,
while remaining simple and requiring only small changes in Nepomuk (for
instance, I don't want to develop dozens of executables).
My idea is to base my work on nepomukfilewatch and the Nepomuk file
indexers. The first step would be to implement file indexers for common
Internet standards (email files as stored in maildir directories, vCard
files, bookmarks and history files, etc.) and Nepomuk-specific file
formats (links between a downloaded file and its origin, actions of
visiting a web page, etc.).
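
As a rough illustration of what an e-mail file indexer has to extract,
here is a minimal sketch in Python using only the standard library. It
is not Nepomuk code: the function name extract_mail_metadata and the
returned dictionary layout are assumptions made for this example, and a
real indexer would push these fields into Nepomuk instead of returning
them.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical helper: pulls the fields an indexer would likely care about
# out of one RFC 5322 message, such as a single file in a maildir directory.
def extract_mail_metadata(raw_message: str) -> dict:
    msg = message_from_string(raw_message)
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []))
    return {
        "subject": msg.get("Subject", ""),
        "from": [addr for _name, addr in senders],
        "to": [addr for _name, addr in recipients],
        "date": msg.get("Date", ""),
    }

# Made-up sample message, used only to exercise the function.
SAMPLE = (
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.org>\n"
    "Subject: Hello\n"
    "Date: Sun, 28 Jul 2013 19:24:30 +0000\n"
    "\n"
    "Body text.\n"
)
```

Calling extract_mail_metadata(SAMPLE) gives the subject, the sender and
recipient addresses, and the raw date string. The same shape of function
could be written for vCard or bookmark files.
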
Using the Nepomuk file indexer is required because Chrome extensions are
not allowed to spawn new processes (and hence helper commands). Chrome
and Firefox extensions can contact an HTTP server, but implementing a
Nepomuk Indexer HTTP server seems a bit too complex. The good news is
that the extensions of both browsers can create files on disk.
When the user visits a web page, the browser extension creates files in
$TMP/nepomukindex (where $TMP is the current user's temporary
directory). The files can be Nepomuk-specific files describing the
action of visiting a page (such a file contains only the visited URL) or
downloading a file. If the user is on a webmail site, then mails can be
extracted (with the user's permission) and stored in a standard MIME format.
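
To make the "visited page" file concrete, here is a minimal sketch of
the write side. A real extension would be written in JavaScript; Python
is used here only to show the file layout. The ".visit" suffix, the
random file name and the ".part" staging name are assumptions of this
example, not an existing Nepomuk format.

```python
import os
import tempfile
import uuid
from typing import Optional

# Hypothetical writer: stores one visited URL as one small file.
def record_visit(url: str, base_dir: Optional[str] = None) -> str:
    if base_dir is None:
        # Stands in for $TMP/nepomukindex from the description above.
        base_dir = os.path.join(tempfile.gettempdir(), "nepomukindex")
    os.makedirs(base_dir, exist_ok=True)

    final_path = os.path.join(base_dir, uuid.uuid4().hex + ".visit")
    staging_path = final_path + ".part"

    # Write under a staging name first, then rename, so that a watcher
    # which skips "*.part" names never reads a half-written record.
    with open(staging_path, "w", encoding="utf-8") as handle:
        handle.write(url + "\n")
    os.replace(staging_path, final_path)
    return final_path
```

The rename step is a design choice: the indexer only ever sees complete
".visit" files, provided it ignores names ending in ".part".
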
Bridging these two pieces (the extensions that extract information from
the web pages and store them in $TMP/nepomukindex, and the Nepomuk file
indexers) is done by monitoring $TMP/nepomukindex with nepomukfilewatch.
When the browser extension adds a file, nepomukfilewatch detects that
and indexes the new file.
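
The bridging step can be sketched as follows. This is a simple polling
loop that stands in for nepomukfilewatch; it does not reflect how
nepomukfilewatch is actually implemented. The names scan_once and watch,
and the index_file callback, are hypothetical.

```python
import os
import time
from typing import Callable, Set

# One pass over the directory: hand every file not seen before to the
# indexer callback and remember its name. Returns the number of new files.
def scan_once(directory: str, seen: Set[str],
              index_file: Callable[[str], None]) -> int:
    new_count = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name in seen or not os.path.isfile(path):
            continue
        index_file(path)
        seen.add(name)
        new_count += 1
    return new_count

# Repeats scan_once a fixed number of times, sleeping between passes.
def watch(directory: str, index_file: Callable[[str], None],
          interval_seconds: float = 1.0, rounds: int = 10) -> None:
    seen: Set[str] = set()
    for _ in range(rounds):
        scan_once(directory, seen, index_file)
        time.sleep(interval_seconds)
```

For example, watch("/tmp/nepomukindex", print) would print the path of
each file that appears in the directory during the next ten seconds.
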
There are surely many other ways to do this, but this solution seems
relatively clean and has some advantages. For instance, I use
Thunderbird and my e-mails are stored in a sub-directory of my home
directory. With the e-mail file indexer, my e-mails will be indexed
without the need to develop a Thunderbird extension.
One problem I see is that Nepomuk may try to erase the indexed data after
a reboot, if nepomukfilewatch sees that the files no longer exist
in the temporary directory. Is there a way to avoid that?
What do you think about this idea? Is this something that could improve
the KDE and Nepomuk user experience?
Denis Steckelmacher.
NOTE: Why Firefox and Chrome and not Rekonq or anything else? I chose
these two browsers because they are widely used and provide an extension
mechanism. I once heard that Rekonq may one day support Chrome
extensions, but is that already the case?
[1]: http://martys.typepad.com/blog/2012/02/so-you-want-to-keep-the-url-of-downloaded-file-eh.html