starting to look at indexing...

Sun Nov 7 04:15:29 CET 2004

Ok, so I'm finally starting to get back to working on things.  I committed 
some more skeleton type of stuff today so that the code that's actually in 
CVS actually builds, though it doesn't do anything yet.

My rough plan is something like this:

Short term I want to build up some demo stuff based on some of the concepts 
that we've kicked around in the past.  I think the linkage ideas will 
probably come through in this, though the indexing will almost certainly have 
to be reworked a couple of times once we get more of the tools put together 
and such.

Right now I'm thinking about how to build up indexes.  In my first demos I'll 
probably just do word based indexing building up a reverse map from a 
dictionary to nodes.

The basic structures that I've got in mind right now are:

*) KLink::Word will correspond to a word in the dictionary (which is a type of 
KLink::Node)

*) KLink::Link will be a triplet containing a source node (for text a 
KLink::Word) a target node (in the simple case a file) and optionally a 
KLink::Anchor (which says how to index into target)

*) Everything will have an ID -- probably a 64 bit unsigned integer, so the 
above will be 3 IDs, anchors are likely to be link specific, so that adds 
some overhead, but I don't think they'll be used most of the time

*) Phases will be searched for using online search (searching the document on 
demand) with possible hints from anchors.  I'm still not sure on this one and 
may change my mind later, it may be useful to arbitrarily create anchors for 
documents over a certain size to chunk them

Full text isn't there at the moment -- though I'll probably patch kdelibs 
shortly so that we can start working with full text data in a few selected 
KFileMetaInfo plugins.

I'm hoping to have a toy demo in a few weeks for a presentation that I'm doing 
and I think this should be pretty doable given that I wrote a throw away 
mock-up for the KDE conference in about 3 days.  Things are much more 
complicated now, but I'm optimistic.

The other thing that I've been thinking on is how to handle non-Western 
languages where word separation isn't clear from a pure bytestream.

The thing that I've been thinking of is two fold:

*) First, having a function that does word separation that allows us to do 
basic text tokenization differently in different parts of the Unicode table.  
Chinese is the obvious example that comes to mind here that has a clear 
concept of words, but no character for separating them.

*) Second, building up two additional tries (character trees) for prefixes and 
suffixes.  So, for example "foobar" would have multiple entries:

  Prefix trie:
     f, fo, foo, foob, fooba, foobar
  Suffix trie:
     foobar, oobar, obar, bar, ar, a

This would hopefully allow for better indexing of languages that tend to use a 
lot of prefixes and suffixes; German, Slavic languages, and well, quite a few 
other language families do this a lot.  One thing that this structure would 
lend itself to is inherited links -- i.e. everything in the suffix tree 
that's contained in "oobar" wouldn't need to be duplicated in "obar" and 
"bar".

This would seem to offer a way to find words that are parts of compounds.  
i.e. my favorite German word of the last few weeks has been:

  "Halbjahreskartezuzahlungsfrei"

Which can be broken into "halb" "jahr" "karte" "zu" "zahlung" and "frie" (all 
distinct words).  With the structure above this would allow matching on 
"zahlung" or "karte".  This would naturally be weighted somewhat less in the 
results than a word match, but should still be useful.  (Thanks to Maks for 
helping to flesh out this idea.)

Anyway -- so those are the things going through my head.  I'd certainly be 
interested in hearing especially from the IR folks on here...

At some point we'll very likely want to pull in either all or parts of the 
code coming from an external too, but for the moment this is just kicking 
around ideas that will get some code created to play with.

-Scott

-- 
If the answer is "more lawyers" then the question shouldn't have been asked.