[Nepomuk] Nepomuk-WebMiner integration into nepomuk-core fileindexer

Wed Dec 26 17:41:55 UTC 2012

Hi all,

now that the WebMiner is working and in extragear, I'd like to talk about
how it could be integrated better into the current indexer.
The current solution works as an additional service that listens for
all newly added resources and calls the webminer in a QProcess.

Vishesh had the idea to combine this with the current indexer chain,
which would help to control the process better (suspend/resume based on
battery status and so on).

I've checked the source and saw that there currently are the
basicindexer, which fetches the mimetype information, and the
fileindexer, which takes all resources with the property
"kext:indexingLevel < 2" and extracts additional information (the
former strigi indexer).

At this point I'd like to introduce the WebMiner with a proper
queue/job like the fileindexer has, and let it work on all resources
with "kext:indexingLevel == 2 or < 3".

The WebMinerIndexerJob would call my current webminer, which would go
into nepomuk-core too (as a subfolder like the fileindexer).
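
To make the chain idea a bit more concrete, here is a minimal sketch in
python of how a resource could be routed by its indexing level, and how
a per-stage queue could be suspended and resumed. Only the level checks
("< 2" for the fileindexer, "== 2" for the WebMiner) are taken from the
description above. The names stage_for and StageQueue, and the idea
that a finished WebMiner run leaves the level at 3, are assumptions for
illustration and not existing nepomuk-core API.

```python
from collections import deque


def stage_for(level):
    """Return the name of the stage that should process a resource next.

    Assumption: level < 2 belongs to the fileindexer, level 2 to the
    WebMiner, and anything from 3 upwards is fully processed.
    """
    if level < 2:
        return "fileindexer"
    if level < 3:
        return "webminer"
    return None


class StageQueue:
    """Hypothetical queue for one stage that can be suspended and resumed."""

    def __init__(self, stage, job):
        self.stage = stage
        self.job = job  # callable that takes a resource URL
        self.pending = deque()
        self.suspended = False

    def enqueue(self, url, level):
        # Only accept resources whose level selects this stage.
        if stage_for(level) == self.stage:
            self.pending.append(url)
            return True
        return False

    def suspend(self):
        self.suspended = True

    def resume(self):
        self.suspended = False

    def process_next(self):
        # Do nothing while suspended (e.g. on battery) or when idle.
        if self.suspended or not self.pending:
            return None
        url = self.pending.popleft()
        self.job(url)
        return url
```

In the real indexer the suspend/resume calls would come from the
existing scheduling code rather than from the queue itself. The point
of the sketch is only that the WebMiner would become one more stage
that the same controls can pause.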

The parts I'd like to put into nepomuk-core are my plugin-based
web extraction plus some basic python plugins,
so everything I currently have for the WebMiner except the ui parts.

This would not change the build dependencies but add a few more
runtime dependencies.
In order to successfully fetch the data from the web we would need the
python modules
* re
* json
* urllib
* httplib2
* tvdb
* musicbrainzngs
* as well as the krosspython plugin

This would allow fetching:
* music data + cover from musicbrainz
* movie data + poster from themoviedb (imdb is not working anymore
and is way too unstable and slow)
* tvshow data + banner from thetvdb
* document data from Microsoft Academic Search/SpringerLink

Any additional plugins should go into some kind of extragear repository
or even kde-apps, for those who like to fetch data from other sources.
Currently these are the broken imdb plugin (hopefully this will be
fixed in the future) and the extended tvdbmal script, which also needs
PyKDE/PyQt and probably more. nepomuk-core could then at least fetch
most data out of the box.

The current indexing can then be controlled via the overall indexing
status and shown in the nepomuk-controller that sits in the
systemtray.

The current ui that can be used to manually find and save the metadata
would go somewhere else (kde-runtime/workspace or wherever it might
fit).

The biggest problem might be the generation of the SimpleResource
classes, which currently takes a very long time. Hopefully this can be
fixed too, as this problem has to be solved for any program that will
use them in the future anyway.

Any other ideas, suggestions or comments?
Would the mentioned python runtime dependencies work, or would they
still be a problem?
The good thing here is that even if those runtime dependencies are
missing, the user won't get a broken desktop. Instead the additional
data will just not be fetched from the web.
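
A rough sketch of how this graceful fallback could look on the python
side: before a plugin is used, check whether its modules can be
imported, and simply skip the plugin if not. The function names and the
plugin-to-module mapping below are made up for illustration and are not
how the WebMiner currently does it. The sketch also assumes a python
version that provides importlib.util.find_spec.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def usable_plugins(requirements):
    """Return the plugins whose required modules are all available.

    `requirements` maps a (hypothetical) plugin name to the list of
    python modules it needs. Plugins with missing modules are skipped
    instead of raising an error, so nothing else breaks.
    """
    return [
        plugin
        for plugin, modules in requirements.items()
        if not missing_modules(modules)
    ]
```

For example, usable_plugins({"music": ["json", "httplib2"]}) would list
the "music" plugin only on systems where httplib2 is installed.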

Regards,
Jörg