[Nepomuk] Nepomuk-WebMiner integration into nepomuk-core fileindexer

Wed Dec 26 17:55:15 UTC 2012

On Wednesday 26 Dec 2012 18:41:55 Jörg Ehrichs wrote:
> Hi all,
> 
> now that the WebMiner is working and in extragear, I would like to talk
> about how it could be integrated better into the current indexer.
> The current solution works as an additional service that listens for
> all newly added resources and calls the webminer in a QProcess.
> 
> Vishesh had the idea to combine this with the current indexer chain,
> which would help to control the process better (suspend/resume based on
> battery status and so on).
> 
> I've checked the source and saw that there currently are the
> basicindexer, which fetches mimetype information, and the fileindexer,
> which takes all resources with the property "kext:indexingLevel < 2"
> and extracts additional information (the former strigi indexer).
> 
> At this point I would like to introduce the WebMiner with a proper
> queue/job like the fileindexer, working on all resources with
> "kext:indexingLevel == 2  or < 3".
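A minimal sketch of how the staged selection by indexing level could look, written in Python only for illustration. The level values (2 = file indexer done, 3 = web miner done), the Resource class and all function names are assumptions for this example and are not taken from nepomuk-core:

```python
from dataclasses import dataclass

# Assumed level values, not taken from nepomuk-core:
# below 2 = file indexer still pending, 2 = file indexer done,
# 3 = web miner done.
LEVEL_FILE_INDEXED = 2
LEVEL_WEB_MINED = 3


@dataclass
class Resource:
    """Hypothetical stand-in for an indexed resource."""
    url: str
    indexing_level: int = 0


def file_indexer_queue(resources):
    """Resources the file indexer still has to handle (level < 2)."""
    return [r for r in resources if r.indexing_level < LEVEL_FILE_INDEXED]


def web_miner_queue(resources):
    """Resources that are ready for the web miner (level == 2)."""
    return [r for r in resources if r.indexing_level == LEVEL_FILE_INDEXED]


def run_web_miner(resources, fetch):
    """Call fetch(resource) for each queued resource, then bump its level."""
    processed = []
    for res in web_miner_queue(resources):
        fetch(res)
        res.indexing_level = LEVEL_WEB_MINED
        processed.append(res.url)
    return processed
```

Because each stage only selects resources at its own level, the web miner never sees a file before the file indexer has finished with it, and a resource that was already mined is not picked up again.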
> 
> The WebMinerIndexerJob would call my current webminer, which would go
> into nepomuk-core too (as a subfolder like the fileindexer)
> 
> The parts I would like to put into nepomuk-core are my plugin-based
> web extraction plus some basic python plugins, that is, everything I
> currently have for the WebMiner except the UI parts.
> 
> This would not change the build dependencies but add a few more
> runtime dependencies.
> In order to successfully fetch the data from the web we would need the
> python modules
> * re
> * json
> * urllib
> * httplib2
> * tvdb
> * musicbrainzngs
> * as well as the krosspython plugin
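Since these are runtime-only dependencies, one way to handle them is to check at startup which modules can be imported and to enable only the plugins whose requirements are met. The sketch below uses the standard importlib machinery; the function names and the plugin-to-module mapping in the usage note are hypothetical and not part of the WebMiner:

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def usable_plugins(requirements):
    """Given {plugin name: [required modules]}, return the sorted names
    of the plugins whose required modules are all available."""
    return sorted(
        plugin
        for plugin, modules in requirements.items()
        if not missing_modules(modules)
    )
```

Calling usable_plugins({"themoviedb": ["json", "httplib2"]}) (an assumed mapping) would then return an empty list on a system without httplib2, so that plugin is simply skipped and nothing else breaks.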
> 
> This would allow fetching:
> * music data + cover from musicbrainz
> * movie data + poster from themoviedb (imdb is not working anymore
> and is way too unstable and slow)
> * tvshow data + banner from thetvdb
> * document data from Microsoft Academic/SpringerLink
> 
> Any additional plugins, which currently means the broken imdb one
> (hopefully this will be fixed in the future) as well as the extended
> tvdbmal script that also needs PyKDE/PyQt and probably more, should go
> into some kind of extragear repository or even kde-apps for those who
> like to fetch data from other sources. nepomuk-core could then at
> least fetch most data out of the box.
> 
> The indexing can then be controlled via the overall indexing
> status and shown in the nepomuk-controller that sits in the
> system tray.
> 
> The current UI that can be used to manually find and save the metadata
> would go somewhere else (kde-runtime/workspace or wherever it might
> fit).
> 
> The biggest problem might be the generation of the SimpleResource
> classes, which currently takes a very long time. Hopefully this can be
> fixed too, as it has to be solved for any program that will use them
> in the future anyway.
> 
> Any other ideas, suggestions or comments?
> Would the mentioned runtime python dependencies work or will they
> still be a problem?
> The good thing here is that even if those runtime dependencies are
> missing, the user won't get a broken desktop. Instead, the additional
> data will just not be fetched from the web.
> 
> Regards,
> Jörg
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
Hi Jörg,

This is a great idea.
I think a simple way to add and maintain the python plugins is needed. Maybe
distributing the plugins through KHNS would be a nice way to let other people contribute?
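To illustrate what "simple to add" could mean: plugins could be plain files dropped into a directory (for example by a KHNS download) and picked up on the next run. The following is only a sketch under that assumption. The function name and the convention that every plugin defines a callable named fetch are made up for this example and are not the WebMiner's actual plugin API:

```python
import importlib.util
from pathlib import Path


def load_plugins(plugin_dir):
    """Import every *.py file in plugin_dir that defines a callable `fetch`.

    Returns a dict mapping the file name (without .py) to the module.
    """
    plugins = {}
    for path in sorted(Path(plugin_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception:
            # A broken plugin is skipped instead of stopping the indexer.
            continue
        # Assumed convention: a plugin must expose a callable `fetch`.
        if callable(getattr(module, "fetch", None)):
            plugins[path.stem] = module
    return plugins
```

With something like this, installing a contributed plugin is just copying one file, and a broken or incomplete plugin is ignored rather than affecting the others.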

Cheers,
-- 
Luis Silva
