[Nepomuk] Re: newbie questions about KDE Nepomuk

Mon Jan 3 10:45:21 CET 2011

Some follow-ups on Vishesh's answers:

On 01/02/2011 07:54 PM, Darren Cruse wrote:
> Hi guys sorry for some basic questions but I've been considering use
> of Nepomuk for a project gathering meta data from web pages...
> 
> FWIW An initial cut at the project was done using java and xquery that
> spidered the web pages and downloaded them prior to creating RDF/XML
> that drove a web based UI for searching.  And I'll spare the details,
> but some parts of that worked well, and some not so well, and I was
> trying to understand if something like Nepomuk would bring more "off
> the shelf" help for doing what we'd done, yet be open enough for us to
> enhance the generated meta data where needed.
> 
> My questions:
> 
> 1.  Is the Mandriva Linux distro the one most likely to get me the
> latest/greatest goodies for using Nepomuk KDE?
> 
> The initial effort happens to be on Ubuntu, but reading the archives,
> just installing kubuntu-desktop won't get me all that Mandriva has -
> correct?
> 
> (hope it's not a dumb question btw - old time Solaris guy here still a
> little green with Linux).
> 
> 2.  I assume html files are indexed?  But is anything more than basic
> meta data (file size, etc.) gotten?
> 
> In particular, the project requires that triples are created that
> refer to the other resources linked to by an html page.  i.e. The uris
> of images the page may use, other web pages it links to, flash files
> it might embed, etc. have to wind up as triples in the meta data.

This is not done currently but should be fairly easy. I even think it
could fit into the strigi plugin. All one has to do is to gather all the
links and resolve them. If they are absolute: simple. If they are
relative: check if the file exists and link to the corresponding Nepomuk
resource (in the strigi plugin this means to simply use the local file URL).
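
A rough sketch of that gather-and-resolve step, in plain Python with only
the standard library. This is not the strigi plugin API; the names
LinkCollector, resolve_links and local_target_exists are made up for
illustration, and the set of tags it looks at is an assumption.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import url2pathname


class LinkCollector(HTMLParser):
    """Collects the raw URL attribute of link-like tags (illustrative subset)."""

    # Which attribute carries the URL for each tag we care about (assumed list).
    URL_ATTRS = {"a": "href", "link": "href", "img": "src",
                 "script": "src", "embed": "src", "object": "data"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if wanted is not None and name == wanted and value:
                self.links.append(value)


def resolve_links(html_text, page_url):
    """Return every collected link as an absolute URL, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    # urljoin leaves absolute links alone and resolves relative ones
    # against the URL of the page that contains them.
    return [urljoin(page_url, link) for link in collector.links]


def local_target_exists(url):
    """True only for file: URLs whose target exists on disk."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False
    return Path(url2pathname(parsed.path)).exists()
```

For a page at file:///home/user/book/ch1.html, a relative link such as
"ch2.html" resolves to a file: URL in the same directory, which can then be
checked with local_target_exists() before a triple is written for it.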

> 3.  To add to the fun, the project also wants entities that are more
> conceptual.  e.g. if the html pages represent a book broken down into
> volumes and sections and chapters etc. the meta data must include the
> names of the volume, section, chapter, etc. that the html page refers
> to.  i.e. This is more in the realm of "entity extraction"/"NLP" kind
> of stuff.
>
> Are there examples of something like that around?  Where an app would
> customize the meta data being extracted?
> 
> Does this mean I'm using "Scribo" for its NLP extraction features?
> Or that I'm customizing how the "Strigi" indexing works?
> 
> Are such things a part of the current Mandriva distro or are these
> only in the playground?

Nothing has been done in this direction. But as long as you have the
chapter/section information storing it in Nepomuk is simple. The only
"hard" part is checking whether the book or chapter already exists. But I
think even that could be done easily by linking the actual files to the
book and chapter resources properly.
Of course you would need an ontology to describe books. I did not check
if something like that exists yet.
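
To illustrate the "does the book or chapter already exist" check, here is a
minimal find-or-create sketch in Python. It uses a toy in-memory triple set
instead of Nepomuk or Virtuoso, and the book vocabulary under
http://example.org/book-ontology# is purely hypothetical (as said, I do not
know of an existing ontology for this).

```python
import uuid

# Hypothetical vocabulary, invented for this sketch only.
EX = "http://example.org/book-ontology#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class TripleStore:
    """Toy in-memory stand-in for a real triple store."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def subjects(self, p, o):
        return {s for (s, pp, oo) in self.triples if pp == p and oo == o}


def get_or_create(store, rdf_type, title, parent=None):
    """Reuse the resource with this type and title (and parent), else create it."""
    candidates = store.subjects(RDF_TYPE, rdf_type) & store.subjects(EX + "title", title)
    if parent is not None:
        # Scope the lookup so "Introduction" in two books stays distinct.
        candidates &= store.subjects(EX + "partOf", parent)
    if candidates:
        return sorted(candidates)[0]
    uri = "urn:uuid:" + str(uuid.uuid4())
    store.add(uri, RDF_TYPE, rdf_type)
    store.add(uri, EX + "title", title)
    if parent is not None:
        store.add(uri, EX + "partOf", parent)
    return uri


def describe_page(store, file_url, book_title, chapter_title):
    """Link one HTML file to its (possibly pre-existing) book and chapter."""
    book = get_or_create(store, EX + "Book", book_title)
    chapter = get_or_create(store, EX + "Chapter", chapter_title, parent=book)
    store.add(file_url, EX + "describes", chapter)
    return book, chapter
```

Calling describe_page() for two files of the same chapter returns the same
book and chapter URIs both times, so the files end up linked to shared
resources rather than to duplicates.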

> 4.  Barring anything real specific for #3, do I understand that
> Virtuoso is now the default/preferred triple store?
> 
> And that I should be able to write software that adds/updates triples
> in Virtuoso directly if I choose to?
> 
> Or that hits a SPARQL endpoint, e.g. to display the info using a custom web app?
> 
> (part of this question also relates to the project using java btw so
> solutions that avoid the C++ api are a better fit for my work-mates -
> though I wouldn't mind :).
> 
> 5.  Not a show stopper but just curious:  Is Sesame still supported as
> an alternative backend store?

There is only Virtuoso and nothing else. Sesame was a nightmare.

Cheers,
Sebastian

> Apologize for all the newbie questions.
> 
> But so far Nepomuk looks like the bees knees btw. :)
> 
> Darren