[Nepomuk] newbie questions about KDE Nepomuk

Sun Jan 2 19:54:13 CET 2011

Hi guys, sorry for some basic questions, but I've been considering
using Nepomuk for a project that gathers metadata from web pages.

FWIW, an initial cut at the project was done in Java and XQuery: it
spidered the web pages and downloaded them, then created RDF/XML
that drove a web-based UI for searching.  I'll spare the details,
but some parts of that worked well and some not so well, and I'm
trying to understand whether something like Nepomuk would bring more
"off the shelf" help for what we'd done, yet be open enough for us to
enhance the generated metadata where needed.

My questions:

1.  Is the Mandriva Linux distro the one most likely to get me the
latest/greatest goodies for using Nepomuk KDE?

The initial effort happens to be on Ubuntu, but reading the archives,
just installing kubuntu-desktop won't get me all that Mandriva has -
correct?

(hope it's not a dumb question btw - old time Solaris guy here still a
little green with Linux).

2.  I assume HTML files are indexed?  But is anything more than basic
metadata (file size, etc.) extracted?

In particular, the project requires triples that refer to the other
resources an HTML page links to, i.e. the URIs of images the page
uses, other web pages it links to, Flash files it embeds, etc. all
have to wind up as triples in the metadata.
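
To make the requirement in question 2 concrete, here is a rough Java
sketch of the kind of triples meant.  It is not Nepomuk or Strigi code:
the class name LinkTriples, the example.org vocabulary and the
predicate names linksTo/embeds are all made up for illustration, and a
regex is only a stand-in for a real HTML parser.  It pulls href/src
attributes out of a page, resolves them against the page URI, and
prints one N-Triples line per link.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTriples {
    // Hypothetical vocabulary; a real project would pick an existing ontology.
    static final String VOCAB = "http://example.org/vocab#";

    // Matches href="..." or src="..." with double or single quotes.
    // Group 1 is the attribute name, group 3 or 4 the quoted value.
    private static final Pattern LINK = Pattern.compile(
        "\\b(href|src)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')",
        Pattern.CASE_INSENSITIVE);

    // Returns one N-Triples line per link found in the given HTML.
    static List<String> extract(String pageUri, String html) {
        URI base = URI.create(pageUri);
        List<String> triples = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String raw = m.group(3) != null ? m.group(3) : m.group(4);
            // Skip empty values and same-page fragment links.
            if (raw.isEmpty() || raw.startsWith("#")) {
                continue;
            }
            // Assumed mapping: href means "links to", src means "embeds".
            String predicate =
                m.group(1).equalsIgnoreCase("href") ? "linksTo" : "embeds";
            URI target;
            try {
                // Turn relative references into absolute URIs.
                target = base.resolve(raw.trim());
            } catch (IllegalArgumentException e) {
                continue; // not a syntactically valid URI reference
            }
            triples.add("<" + base + "> <" + VOCAB + predicate + "> <"
                + target + "> .");
        }
        return triples;
    }

    public static void main(String[] args) {
        String html = "<a href=\"ch2.html\">next</a> <img src='img/fig1.png'>";
        for (String t : extract("http://example.org/book/ch1.html", html)) {
            System.out.println(t);
        }
    }
}
```

The question is whether the indexer can already produce something
along these lines, or whether output like this would have to be added
on top.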

3.  To add to the fun, the project also wants entities that are more
conceptual, e.g. if the HTML pages represent a book broken down into
volumes, sections, chapters, etc., the metadata must include the
names of the volume, section, chapter, etc. that the HTML page refers
to.  This is more in the realm of "entity extraction"/NLP kind of
stuff.

Are there examples of something like that around, where an app
customizes the metadata being extracted?

Does this mean I'd be using "Scribo" for its NLP extraction features?
Or customizing how the "Strigi" indexing works?

Are such things a part of the current Mandriva distro or are these
only in the playground?

4.  Barring anything really specific for #3, do I understand correctly
that Virtuoso is now the default/preferred triple store?

And that I should be able to write software that adds/updates triples
in Virtuoso directly if I choose to?

Or software that hits a SPARQL endpoint, e.g. to display the info
using a custom web app?

(Part of this question also relates to the project using Java, btw, so
solutions that avoid the C++ API are a better fit for my work-mates,
though I wouldn't mind :).
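
For the SPARQL endpoint option, this is roughly what the Java side
might look like, using only the standard SPARQL protocol (a GET with a
URL-encoded "query" parameter) and the JDK's own HTTP client (Java 11
or later).  The endpoint URL is an assumption: whether the Virtuoso
instance Nepomuk runs exposes an HTTP endpoint at all is part of the
question.  The class name and the example.org predicate are made up.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SparqlSketch {
    // Assumed endpoint; it may not exist in a Nepomuk setup.
    static final String ENDPOINT = "http://localhost:8890/sparql";

    // Builds the GET URI for a SPARQL protocol query request.
    static URI queryUri(String endpoint, String sparql) {
        return URI.create(endpoint + "?query="
            + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    // Sends the query and returns the raw response body (JSON results).
    static String run(String endpoint, String sparql)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(queryUri(endpoint, sparql))
            .header("Accept", "application/sparql-results+json")
            .GET()
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Hypothetical predicate, standing in for whatever ontology is used.
        String q = "SELECT ?target WHERE { <http://example.org/book/ch1.html> "
            + "<http://example.org/vocab#linksTo> ?target }";
        System.out.println(queryUri(ENDPOINT, q));
        // run(ENDPOINT, q) would need a live endpoint, so it is not called here.
    }
}
```

If the store can be reached this way, the web app never has to touch
the C++ API.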

5.  Not a show stopper but just curious:  Is Sesame still supported as
an alternative backend store?

Apologies for all the newbie questions.

But so far Nepomuk looks like the bee's knees, btw. :)

Darren