[Nepomuk] [RFC] New File Indexer

Tue Sep 11 15:04:08 UTC 2012

Ok, so I'm no expert yet on Nepomuk or Strigi, but I am investing time in coming up to speed with them.

Vishesh Handa wrote:

>  I don't think this entire port should take me more than a week. 

I'll bet you a beer this is still being discussed a year from now :-)

> This month I'm focusing on the file indexing part of Nepomuk, and right now it takes forever for Strigi to index all
> my files.

Well, I feel and share your pain, but I wonder... the file indexer has been banging away on my machine for at least 14 
hours now (I'm on Kubuntu 4.9, so no patch for the reindexing thing... anyway).  I have been mostly away from my machine 
or doing light browsing/email for that time, so other than me writing this mail, Firefox and the usual system/session 
stuff, there are no other demands on the CPU.

Most of the 70% CPU utilization is Virtuoso, with blips every few seconds of 3% or so for nepomindex process instances.  

There is practically no disk I/O at all (500ms every 50-70s) - all my indexable folders are on a physically distinct 
drive so it's easy to notice.

So my complaint is: why isn't the indexer using more resources?
(i.e. it appears not to use resources when it could, and too many resources when it shouldn't, which is kind of the 
reverse of what you want).

>  I'm not the only one with this problem. We already have another project called the nepomuk-metadata-extractor [1] 
> which implements the following indexers -
> * PDF ( Poppler Based )

Yeah, the Poppler pdfinfo already extracts more data than the current PDF indexer; I had been thinking about this 
myself.  Go Jörg!

>  I would like to move these indexers into nepomuk-core [...] It would then call the appropriate indexing class (if one 
> exists) which would populate the SimpleResourceGraph or it would just add the appropriate rdf types.

I think you have it "inside out"; instead, it needs to be *more pluggable*: make it easier to write a replacement 
indexer for a given MIME type, and perhaps find a clever way to separate Nepomuk domain-specific knowledge from 
file-type expertise.

For example, off the top of my head, I can think of at least ten different types of file I would want indexed;  I'm sure 
that everyone here could name ten different types.  It is an endless and thankless task.

As evidence - Jörg wrote:
> This will help a lot to make indexing better and easier to contribute.
> Strigi seems to be a very powerful solution. But writing the
> streamanalyzers or fixing in them isn't very intuitive.

So, four suggestions (not sure how much of this is already done now):

(1) The indexer framework is data agnostic: it only finds files/resources for indexing, and beyond that has just two jobs
  - {a} wrangling which process to launch for a MIME type, resource allocation and preemptive termination of that process. 
  - {b} handling triples supplied by the process; simple validation and transaction support in case of a crash or other 
preemptive termination.

Why? Language-agnostic indexer code: C++, bash, assembler, Python, Erlang or JavaScript.  Whatever works for the 
resource type in question.  The indexer only has to be a regular process.
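
To make (1) a bit more concrete, here is a rough Python sketch of the framework side. Everything in it is an assumption 
for illustration: the INDEXERS table, the index_batch name, the "ex:" property and the one-triple-per-line, 
tab-separated wire format are all made up, and a tiny inline script stands in for a real indexer. The point is only 
that a crash, a timeout or a malformed line throws away the whole batch, which is the poor man's transaction.

```python
import subprocess
import sys

# Hypothetical registry: MIME type -> command line of an external indexer.
# A tiny inline Python script stands in for a real indexer binary here.
FAKE_INDEXER = [
    sys.executable, "-c",
    "import sys\n"
    "for p in sys.argv[1:]:\n"
    "    print(p, 'ex:indexed', 'true', sep='\\t')\n",
]
INDEXERS = {"image/jpeg": FAKE_INDEXER}


def index_batch(mime_type, paths, timeout=30):
    """Run the indexer for mime_type on paths; return validated triples.

    Returns [] if no indexer is registered, if the process crashes or
    times out, or if any output line is malformed, so that a partial
    result is never committed.
    """
    command = INDEXERS.get(mime_type)
    if command is None:
        return []
    try:
        proc = subprocess.run(
            command + list(paths),
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return []  # child was killed: discard everything
    if proc.returncode != 0:
        return []  # crash: discard everything
    triples = []
    for line in proc.stdout.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not all(parts):
            return []  # simple validation failed
        triples.append(tuple(parts))
    return triples
```

The indexer itself can be written in anything, since all the framework sees is an exit code and lines on stdout.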

(2) Support multiple resources (of the same type) per process (for launch efficiency)

The framework can keep a table of discovered resources of a given MIME type and, when it has enough (10? 20?), launch 
the right process.  Maybe in the future we grade each indexer as lightweight or piggy and decide to launch several sets 
of processes for several MIME types in parallel.
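
The table in (2) could be as simple as the sketch below. The BatchTable class and its method names are hypothetical, 
and the batch size is just the guess from above, not a measured value.

```python
from collections import defaultdict


class BatchTable:
    """Collect discovered paths per MIME type until a batch is full."""

    def __init__(self, batch_size=10):
        self.batch_size = batch_size
        self.pending = defaultdict(list)

    def add(self, mime_type, path):
        """Queue a path; return a full batch ready to launch, else None."""
        queue = self.pending[mime_type]
        queue.append(path)
        if len(queue) >= self.batch_size:
            del self.pending[mime_type]
            return queue
        return None

    def flush(self):
        """Return all partial batches, e.g. at the end of a scan."""
        leftovers = dict(self.pending)
        self.pending.clear()
        return leftovers
```

Whatever add() returns would be handed to the process for that MIME type in one launch; flush() covers the stragglers.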

(3) Support chains of processing per resource.

Why? So as not to have to re-implement features of a previous indexer.  Say I write an MPEG-4 parser to extract 
closed caption text; I do not have to reimplement Trueg's TV Show stuff. 

Order of operation might be important. Post-processing seems like something that several people have asked about, and 
I'm certainly interested in "hooking" onto the indexer to capture each freshly completed file.
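
A minimal sketch of what a chain in (3) might look like, assuming each stage is handed the triples produced so far and 
returns only its own additions. The stage functions and the "ex:" properties are invented for the example; they are not 
anyone's existing indexer.

```python
def run_chain(path, stages):
    """Run each stage in order; later stages can see earlier triples."""
    triples = []
    for stage in stages:
        # Each stage gets a read-only snapshot and returns only NEW triples.
        triples.extend(stage(path, tuple(triples)))
    return triples


def video_stage(path, known):
    """Hypothetical first stage: mark .mp4 files as video."""
    if path.endswith(".mp4"):
        return [(path, "ex:type", "video")]
    return []


def caption_stage(path, known):
    """Hypothetical later stage: only acts on resources already typed as video."""
    if (path, "ex:type", "video") in known:
        return [(path, "ex:hasCaptions", "unknown")]
    return []
```

Running the stages as [video_stage, caption_stage] yields both triples, while the reverse order yields only the type 
triple, which is why the order of operation matters.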

(4) Perhaps hand each process a handle (socket? dbus?) to write to

Yeah, I've been reading about 'systemd' :-)

Imagine the simplest indexer that adds only resource/tag/value triples - it becomes just two nested loops:
 -  iterate over resources
 -- iterate over metadata items.
 --- Test if resource contains item 1 (e.g. jpeg/exif exposure), output triple for item 1
 --- Test if resource contains item 2 (e.g. jpeg/exif iso), output triple for item 2
 - exit.

What I'm trying to get at here is that if I have some document type that I am expert in, or for which good library 
support already exists (JPEG, PDF and mp3 are good examples), then all I need to do is take a list of files and spit 
out triples, rather than understand how to plug into the framework.

The only Nepomuk domain specific knowledge I need is the correct property URI and the appropriate format for the values 
of such properties.
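
For what it's worth, the two nested loops could look roughly like this in Python. This is a sketch under assumptions: 
read_metadata is a placeholder for whatever library actually knows the file type, the "ex:" names stand in for the real 
property URIs, and tab-separated output on stdout is an invented convention.

```python
import sys

# The only "Nepomuk" knowledge: metadata key -> property URI (placeholders).
PROPERTIES = [("exposure", "ex:exposure"), ("iso", "ex:iso")]


def read_metadata(path):
    """Placeholder for a real library call (e.g. an EXIF reader)."""
    return {}


def iter_triples(paths, read=read_metadata):
    """Yield (resource, property, value) for every item that is present."""
    for path in paths:                   # iterate over resources
        metadata = read(path)
        for key, prop in PROPERTIES:     # iterate over metadata items
            if key in metadata:
                yield (path, prop, str(metadata[key]))


if __name__ == "__main__":
    # Take the list of files from the command line and spit out triples.
    for triple in iter_triples(sys.argv[1:]):
        print(*triple, sep="\t")
```

Swapping read_metadata for a real extractor is the only file-type expertise needed; the PROPERTIES table is the only 
place the ontology shows up.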

Anyway, enough already :-)

dean