[Nepomuk] Re: newbie questions about KDE Nepomuk

Mon Jan 3 04:08:02 CET 2011

Hi Darren

On Mon, Jan 3, 2011 at 12:24 AM, Darren Cruse <darren.cruse at gmail.com> wrote:

> Hi guys sorry for some basic questions but I've been considering use
> of Nepomuk for a project gathering meta data from web pages...
>
> FWIW An initial cut at the project was done using java and xquery that
> spidered the web pages and downloaded them prior to creating RDF/XML
> that drove a web based UI for searching.  And I'll spare the details,
> but some parts of that worked well, and some not so well, and I was
> trying to understand if something like Nepomuk would bring more "off
> the shelf" help for doing what we'd done, yet be open enough for us to
> enhance the generated meta data where needed.
>
> My questions:
>
> 1.  Is the Mandriva Linux distro the one most likely to get me the
> latest/greatest goodies for using Nepomuk KDE?
>
> The initial effort happens to be on Ubuntu, but reading the archives,
> just installing kubuntu-desktop won't get me all that Mandriva has -
> correct?
>
>
Nepomuk, like any other project, has stable code and experimental code.
The experimental code lives in the playground. All distros package the
stable code, but Mandriva additionally packages some of the experimental
code.

This is because Sebastian Trueg (the main Nepomuk developer) is employed
by Mandriva.

> (hope it's not a dumb question btw - old time Solaris guy here still a
> little green with Linux).
>
> 2.  I assume html files are indexed?  But is anything more than basic
> meta data (file size, etc.) gotten?
>

I just checked. The entire file's content is indexed as plain-text along
with the basic metadata.

> In particular, the project requires that triples are created that
> refer to the other resources linked to by an html page.  i.e. The uris
> of images the page may use, other web pages it links to, flash files
> it might embed, etc. have to wind up as triples in the meta data.
>

This sounds interesting. How would the triples be stored? As RDF/XML? Or
would the web pages contain a link to a Turtle file?
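
To make the question concrete, here is a minimal Python sketch of the kind
of extraction you describe: it walks an HTML page and emits one triple per
linked page, image, or embedded object. This is not Nepomuk or Strigi code.
The predicate URIs under http://example.org/terms# and the names
LinkTripleParser, extract_triples, and to_ntriples are made up for
illustration, and the output is only a simplified N-Triples-style dump.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical predicate URIs, invented for this illustration only.
PREDICATES = {
    ("a", "href"): "http://example.org/terms#linksTo",
    ("img", "src"): "http://example.org/terms#usesImage",
    ("embed", "src"): "http://example.org/terms#embeds",
}


class LinkTripleParser(HTMLParser):
    """Collects (subject, predicate, object) triples for outgoing references."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.triples = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            predicate = PREDICATES.get((tag, name))
            if predicate and value:
                # Resolve relative references against the page's own URI.
                target = urljoin(self.page_uri, value)
                self.triples.append((self.page_uri, predicate, target))


def extract_triples(page_uri, html):
    """Return the reference triples found in one HTML document."""
    parser = LinkTripleParser(page_uri)
    parser.feed(html)
    parser.close()
    return parser.triples


def to_ntriples(triples):
    """Serialize URI-only triples, one '<s> <p> <o> .' line each."""
    return "\n".join("<%s> <%s> <%s> ." % t for t in triples)
```

Calling extract_triples("http://example.org/book/ch1.html", page_source)
would give a list of tuples that could then be written out in whatever
format the store expects.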

>
> 3.  To add to the fun, the project also wants entities that are more
> conceptual.  e.g. if the html pages represent a book broken down into
> volumes and sections and chapters etc. the meta data must include the
> names of the volume, section, chapter, etc. that the html page refers
> to.  i.e. This is more in the realm of "entity extraction"/"NLP" kind
> of stuff.
>
> Are there examples of something like that around?  Where an app would
> customize the meta data being extracted?
>
> Does this mean I'm using "Scribo" for it's NLP extraction features?
> Or that I'm customizing how the "Strigi" indexing works?
>
>
Hmm. This is difficult to answer. We currently use Strigi just to index
files. Since webpages are HTML files, this does fall into that category a
little bit.

But since you're doing something more than what Strigi does, I think it
falls more along the lines of Scribo and NLP. A lot of work was done on
Scribo/NLP in Nov/Dec. All of it is in the playground.

> Are such things a part of the current Mandriva distro or are these
> only in the playground?
>

They are in the playground, but Mandriva ships some of them. If you want the
latest stuff, it's best to manually compile the playground.

>
> 4.  Barring anything real specific for #3, do I understand that
> Virtuoso is now the default/preferred triple store?
>
>
Yes, that is correct, but what is the specific stuff in #3? We store ALL our
triples in Virtuoso.

> And that I should be able to write software that adds/updates triples
> in Virtuoso directly if I choose to?
>

The Nepomuk architecture is, well, quite structured. For storing triples
(actually quadruples) we use Soprano [1]. We only use the Soprano API to
add/update triples.

Soprano, however, has a plugin-based backend framework. One of its jobs is
to provide a nice consistent API over any RDF store. The current backends
are Virtuoso, Redland, and Sesame2.
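
To illustrate the two ideas above (statements carry a fourth element, the
graph they belong to, and the store sits behind one common interface), here
is a rough Python sketch. This is NOT the Soprano API, which is C++. The
names Backend, InMemoryBackend, Model, add_statement, and list_statements
are hypothetical, and the "ex:" identifiers are placeholders.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical storage plugin interface; any RDF store could sit behind it."""

    @abstractmethod
    def add(self, quad):
        """Store one (subject, predicate, object, graph) tuple."""

    @abstractmethod
    def match(self, s=None, p=None, o=None, g=None):
        """Return stored quads matching the pattern; None is a wildcard."""


class InMemoryBackend(Backend):
    """Toy backend that keeps quads in a Python set."""

    def __init__(self):
        self._quads = set()

    def add(self, quad):
        self._quads.add(quad)

    def match(self, s=None, p=None, o=None, g=None):
        pattern = (s, p, o, g)
        return sorted(
            quad
            for quad in self._quads
            if all(want is None or want == got for want, got in zip(pattern, quad))
        )


class Model:
    """Front end the application talks to, regardless of the backend in use."""

    def __init__(self, backend):
        self._backend = backend

    def add_statement(self, subject, predicate, obj, graph):
        # The fourth element (graph) is what makes this a quadruple.
        self._backend.add((subject, predicate, obj, graph))

    def list_statements(self, **pattern):
        return self._backend.match(**pattern)
```

Swapping InMemoryBackend for another Backend subclass would leave the
calling code unchanged, which is the point of having a backend framework.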

> Or that hits a SPARQL endpoint, e.g. to display the info using a custom web
> app?
>

AFAIK we currently do NOT provide a SPARQL endpoint. Sebastian would be
better at answering this question. He wrote most (all?) of Soprano.

> (part of this question also relates to the project using java btw so
> solutions that avoid the C++ api are a better fit for my work-mates -
> though I wouldn't mind :).
>
>
The Nepomuk-KDE project doesn't use Java at all. It's all C++. But since we
are a part of KDE, you can use Nepomuk with any of the language bindings
provided by KDE [2].

I don't think the original Nepomuk project in Java is maintained at all.
Again, Sebastian!

> 5.  Not a show stopper but just curious:  Is Sesame still supported as
> an alternative backend store?
>

Soprano supports it, but Nepomuk doesn't use it. We only use Virtuoso. Quite
a bit of the Nepomuk APIs (specifically the Query API) have been tailored
to use Virtuoso-specific features. Additionally, there is some code in
Nepomuk that detects if any other backend is being used, and converts it to
Virtuoso.

We used to use Redland as the preferred backend till Dec 2009.

> Apologize for all the newbie questions.
>

No need to apologize. Keep the questions coming.

>
> But so far Nepomuk looks like the bees knees btw. :)
>

Heh. I actually had to look up the expression 'bees knees'. Thanks :)

>
> Darren
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
>

[1] http://sourceforge.net/projects/soprano/
[2] http://techbase.kde.org/Development/Languages
-- 
Vishesh Handa