[Nepomuk] [RFC] New File Indexer

Vishesh Handa me at vhanda.in
Tue Sep 11 15:36:19 UTC 2012


On Tue, Sep 11, 2012 at 8:34 PM, Dean Perry <happy.heyoka at gmail.com> wrote:

>
> Ok, so I'm not yet an expert on Nepomuk or Strigi, but I am investing time
> in coming up to speed with them.
>
> Vishesh Handa wrote:
>
> > I don't think this entire port should take me more than a week.
>
> I'll bet you a beer this is still being discussed a year from now :-)
>
>
> > This month I'm focusing on the file indexing part of Nepomuk, and right
> > now it takes forever for Strigi to index all my files.
>
> well, I feel and share your pain, but I wonder... the file indexer has
> been banging away on my machine for at least 14 hours now (I'm on Kubuntu
> 4.9, so no patch for the reindexing thing... anyway). I have been mostly
> away from my machine or doing light browsing/email for that time, so other
> than me writing this mail, Firefox, and the usual system/session stuff,
> there are no other demands on the CPU.
>
> Most of the 70% CPU utilization is Virtuoso, with blips every few seconds
> of 3% or so from nepomukindexer process instances.
>

Do you also have email indexing enabled? Because that is handled separately
by kdepim, though pushing the data does make Virtuoso act up.


>
>
> There is practically no disk I/O at all (500ms every 50-70s) - all my
> indexable folders are on a physically distinct drive so it's easy to notice.
>
> So my complaint is: why isn't the indexer using more resources?
>
> (i.e. it appears not to use resources when it could, and too many resources
> when it shouldn't, which is the reverse of how you want it).
>
> > I'm not the only one with this problem. We already have another project
> > called the nepomuk-metadata-extractor [1] which implements the following
> > indexers:
> > * PDF (Poppler-based)
>
> yeah, the Poppler pdfinfo already extracts more data than the current PDF
> indexer; I had been thinking about this myself. Go Jörg!
>
> > I would like to move these indexers into nepomuk-core [...] It would
> > then call the appropriate indexing class (if one exists), which would
> > populate the SimpleResourceGraph, or it would just add the appropriate
> > RDF types.
>
> I think you have it "inside out"; it needs to be *more pluggable*: make it
> easier to write a replacement indexer for a given MIME type, and perhaps
> find a clever way to separate Nepomuk domain-specific knowledge from
> file-type expertise.
>
> For example, off the top of my head, I can think of at least ten different
> types of file I would want indexed; I'm sure that everyone here could name
> ten different types. It is an endless and thankless task.
>

Of course. I understand that eventually it has to be pluggable. This email
was more of a first step - something which I could easily do in a week.


>
>
> As evidence - Jörg wrote:
>
> > This will help a lot to make indexing better and easier to contribute.
> > Strigi seems to be a very powerful solution. But writing the
> > streamanalyzers or fixing them isn't very intuitive.
>
> So, four suggestions (not sure how much of this is already done now):
>
> (1) Indexer framework is data-agnostic and only finds files/resources for
> indexing; two jobs only:
>
> - {a} wrangling which process to launch for a given MIME type, resource
> allocation, and preemptive termination of that process.
>
> - {b} handling triplets supplied by the process; simple validation and
> transaction support in case of crash or other preemptive termination.
>
> Why? Language-agnostic indexer code: C++, bash, assembler, Python, Erlang,
> or JavaScript. Whatever works for the resource type in question. It only
> has to know about being a regular process.
>

Currently most of this is done by the nepomukindexer process. It works as
follows -

1. Call the Strigi plugins to analyze the file and give the metadata back
to us.
2. Store it in Nepomuk - this is done as one transaction and performs the
validation as well.

If the nepomukindexer process crashes, then that file is ignored, and we
continue with the next file.
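In rough pseudo-Python (function names are illustrative; the real work
happens inside the nepomukindexer binary), that per-file flow is:

```python
def index_files(files, run_indexer):
    """Index each file in its own short-lived process (run_indexer is a
    stand-in for launching nepomukindexer). A crash only loses that one
    file's metadata: its transaction is never committed, and we move on.
    Returns the files whose indexer failed or crashed."""
    failed = []
    for path in files:
        try:
            ok = run_indexer(path)  # analyze + store as one transaction
        except Exception:
            ok = False              # treat a crash like a failed run
        if not ok:
            failed.append(path)     # skip this file, continue with the next
    return failed

def fake_indexer(path):
    """Simulated indexer: crashes on one file, succeeds on the rest."""
    if path == "bad.pdf":
        raise RuntimeError("simulated crash")
    return True
```

With the simulated indexer, index_files(["a.txt", "bad.pdf", "b.txt"],
fake_indexer) returns ["bad.pdf"] while the other two files are stored
normally - one bad file never takes the whole run down.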

>
>
> (2) Support multiple resources (of the same type) per process (for launch
> efficiency)
>
> The framework can keep a table of discovered resources of a given MIME
> type and, when it has enough (10? 20?), launch the right process. Maybe in
> the future we grade each indexer as lightweight or piggy and decide to
> launch several sets of processes for several MIME types in parallel.
>

I take it you mean a separate process for each analyzer. We currently use a
different approach - a different process for each file. Though this approach
seems interesting as well.
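A batching scheme along those lines could look like this sketch (BATCH_SIZE
and the launch callback are made up for illustration, not existing API):

```python
from collections import defaultdict

BATCH_SIZE = 10  # how many files to collect before launching a process

def batch_by_mime(discovered, launch):
    """Collect (path, mime) pairs in a table per MIME type and call
    launch(mime, paths) once a batch is full; flush leftovers at the end."""
    pending = defaultdict(list)
    for path, mime in discovered:
        pending[mime].append(path)
        if len(pending[mime]) >= BATCH_SIZE:
            launch(mime, pending.pop(mime))  # full batch: one process
    for mime, paths in pending.items():
        launch(mime, paths)                  # partial batches at the end
```

Grading indexers as lightweight or piggy would then just be a second table
consulted before launch, deciding how many batches may run in parallel.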


>
>
> (3) Support chains of processing per resource.
>
> Why? So as not to rely on having to re-implement features of a previous
> indexer. Say I write an MPEG-4 parser to extract closed-caption text; I do
> not have to reimplement Trueg's TV Show stuff.
>
> Order of operation might be important - post-processing seems like
> something that several people have asked about, and I'm certainly
> interested in "hooking" onto the indexer to capture each freshly completed
> file.
>

Of course. This is something that goes without saying.
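A chain could be as simple as folding each stage's triples into a shared
graph. In this sketch the indexer functions and property names are
hypothetical, not the real ontology terms:

```python
def run_chain(path, indexers):
    """Run each indexer stage in order on one resource; later stages see
    the triples emitted by earlier ones, so a post-processor need not
    re-extract anything."""
    graph = []  # (subject, property, value) triples accumulated so far
    for indexer in indexers:
        graph.extend(indexer(path, graph))
    return graph

def basic_video(path, graph):
    # First stage: plain metadata extraction (property names illustrative).
    return [(path, "nie:title", "Some Show S01E01")]

def tvshow_postprocess(path, graph):
    # Later stage: reuses the title found earlier instead of re-parsing.
    titles = [v for (_, prop, v) in graph if prop == "nie:title"]
    return [(path, "nmm:episode", 1)] if titles else []
```

run_chain("ep.mp4", [basic_video, tvshow_postprocess]) yields both triples,
and the order of the list is exactly the order of operation asked for above.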


>
>
> (4) Perhaps hand each process a handle (socket? dbus?) to write to
>
> Yeah, I've been reading about 'systemd' :-)
>

I like the concept of systemd too. Currently half of the Nepomuk
communication happens over a local socket, and the other half over dbus.
Eventually, I would like to move completely to the local socket, but that's
for later - and only if profiling shows that dbus actually is a limiting
factor.
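As a sketch of the handle idea: the child process only ever sees a socket it
writes lines to. The line format below is made up for illustration, not the
actual Nepomuk wire protocol:

```python
import socket

def indexer_child(sock, path):
    """A child indexer needs no framework knowledge: it just writes
    newline-terminated triples to whatever handle it was given."""
    # Made-up line format, not the actual Nepomuk wire protocol.
    sock.sendall(f"<{path}> nie:mimeType \"text/plain\"\n".encode())

# The parent keeps one end of a socket pair, the child gets the other.
parent_end, child_end = socket.socketpair()
indexer_child(child_end, "/tmp/example.txt")
child_end.close()
received = parent_end.recv(4096).decode()
parent_end.close()
```

Whether the handle is a socket or dbus, the child's contract stays the same:
write triples, exit.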


>
> Imagine the simplest indexer that adds only resource/tag/value triplets -
> it becomes just two nested loops:
>
> - iterate over resources
> -- iterate over metadata items:
> --- test if resource contains item 1 (eg: jpeg/exif exposure), output
> triple for item 1
> --- test if resource contains item 2 (eg: jpeg/exif iso), output triple
> for item 2
> - exit.
>

I'm not sure I understand what you mean here.


>
>
> What I'm trying to get at here is that if I have some document type that I
> am expert in, or for which good library support already exists (JPEG, PDF,
> and mp3 are good examples), then all I need to do is take a list of files
> and spit out triples, rather than understand how to plug into the framework.
>
> The only Nepomuk domain-specific knowledge I need is the correct property
> URIs and the appropriate format for the values of such properties.
>

That's exactly what I want :)
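A minimal indexer honouring that contract might be nothing more than this
(the property URI strings are placeholders, not checked against the real
ontologies):

```python
import os

# Placeholder property URIs; a real indexer would use the Nepomuk ontologies.
FILE_NAME = "nfo:fileName"
FILE_SIZE = "nfo:fileSize"

def simple_indexer(paths):
    """Take a list of files, spit out (subject, property, value) triples.
    No framework knowledge needed beyond the property URIs themselves."""
    for path in paths:
        yield (path, FILE_NAME, os.path.basename(path))
        if os.path.exists(path):
            yield (path, FILE_SIZE, os.path.getsize(path))
```

Everything else - validation, transactions, storage - stays on the
framework's side of the fence.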


>
>
> Anyway, enough already :-)
>
> dean
>
> _______________________________________________
> Nepomuk mailing list
> Nepomuk at kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
>
>


-- 
Vishesh Handa