starting to look at indexing...

Gregory Newby newby at arsc.edu
Mon Nov 8 03:07:18 CET 2004


On Sun, Nov 07, 2004 at 04:15:29AM +0100, Scott Wheeler wrote:
> Ok, so I'm finally starting to get back to working on things.  I committed 
> some more skeleton-type stuff today so that the code that's in CVS 
> actually builds, though it doesn't do anything yet.
> 
> My rough plan is something like this:

Initial thoughts: this is a good way to start.  More:

> Short term I want to build up some demo stuff based on some of the concepts 
> that we've kicked around in the past.  I think the linkage ideas will 
> probably come through in this, though the indexing will almost certainly have 
> to be reworked a couple of times once we get more of the tools put together 
> and such.
> 
> Right now I'm thinking about how to build up indexes.  In my first demos I'll 
> probably just do word-based indexing, building up a reverse map from a 
> dictionary to nodes.
> 
> The basic structures that I've got in mind right now are:
> 
> *) KLink::Word will correspond to a word in the dictionary (which is a type of 
> KLink::Node)

Yes.

> *) KLink::Link will be a triplet containing a source node (for text, a 
> KLink::Word), a target node (in the simple case a file), and optionally a 
> KLink::Anchor (which says how to index into the target)

I'm vague on the use of Anchor.  Is this a document type?


> *) Everything will have an ID -- probably a 64-bit unsigned integer, so the 
> above will be 3 IDs.  Anchors are likely to be link-specific, so that adds 
> some overhead, but I don't think they'll be used most of the time

Yes, this is critical.  It gives you fixed-length records all-around.
Since 64-bit is twice the size of 32-bit, I would consider using 32-bit if
the primary destination is desktop and server systems (versus indexing
the Web or an enterprise).  ~2G (signed) or ~4G (unsigned) is a lot of
words, documents or anchors.

FWIW, a good estimate is that each new document will add one new
word.  So, if you're not going to have > 2G documents, 32bit is
enough.

It's OK to use 64-bit, and you can often get very good compression on
indexes so that the high bits, when unneeded, compress away.  But don't
choose 64-bit just because you suspect 32-bit is likely to become a
limitation; unless we're getting beyond the individual desktop level,
that's not very likely.  With that caveat, I'd say it's a good choice.
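
To put numbers on the fixed-length point, here's a back-of-the-envelope
C++ sketch (my names, not the real KLink API; the zero-means-no-anchor
convention is just an assumption):

  #include <stdint.h>

  // Hypothetical sketch only -- these are not the real KLink classes.
  // Switch NodeId to uint64_t if we ever expect more than ~4G nodes.
  typedef uint32_t NodeId;

  struct LinkRecord {
      NodeId source;   // e.g. a KLink::Word in the dictionary
      NodeId target;   // e.g. a file node
      NodeId anchor;   // 0 meaning "no anchor"; most links won't have one
  };
  // sizeof(LinkRecord) is 12 bytes with 32-bit IDs, 24 with 64-bit IDs,
  // so the link table on disk can stay a flat array of fixed-size rows.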


> *) Phrases will be searched for using online search (searching the document on 
> demand) with possible hints from anchors.  I'm still not sure on this one and 
> may change my mind later; it may be useful to arbitrarily create anchors for 
> documents over a certain size to chunk them

Chunking documents into subdocuments is desirable.  It doesn't
reduce the need for phrase search (which adds overhead for both
indexing & retrieval, but is very useful).  But it does help to
get higher precision in non-phrase searches because the context
for matching documents is smaller.  (That is, you can easily
do retrieval at the paragraph level [or whatever subdocument
chunk you choose], instead of just the whole document level.)
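
As a minimal sketch of what I mean by chunking (assuming plain text and
blank-line paragraph boundaries; the function name is made up):

  #include <sstream>
  #include <string>
  #include <vector>

  // Split a plain-text document into paragraph-sized "subdocuments" at
  // blank lines, so each chunk can get its own node ID and be returned
  // as a retrieval unit in its own right.
  std::vector<std::string> chunkIntoParagraphs(const std::string &text)
  {
      std::vector<std::string> chunks;
      std::istringstream in(text);
      std::string line, current;
      while (std::getline(in, line)) {
          if (line.empty()) {                 // blank line ends a paragraph
              if (!current.empty()) {
                  chunks.push_back(current);
                  current.clear();
              }
          } else {
              if (!current.empty())
                  current += ' ';
              current += line;
          }
      }
      if (!current.empty())
          chunks.push_back(current);
      return chunks;
  }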

> Full text isn't there at the moment -- though I'll probably patch kdelibs 
> shortly so that we can start working with full text data in a few selected 
> KFileMetaInfo plugins.
> 
> I'm hoping to have a toy demo in a few weeks for a presentation that I'm doing, 
> and I think this should be pretty doable given that I wrote a throwaway 
> mock-up for the KDE conference in about 3 days.  Things are much more 
> complicated now, but I'm optimistic.

Great!  Is there a source code repository we'll be using?

> The other thing that I've been thinking on is how to handle non-Western 
> languages where word separation isn't clear from a pure bytestream.

It's a tough problem.  The general approach I recommend is to have
drop-in parsing rules.  These will be needed both for languages and
for document types.

It could help to not do any stemming or truncation at all (that is,
changing "housing" into "house" or "hous"; Google supposedly does it
this way), but in a fully-featured system stemming and/or truncation
should be a user-selectable option at index time.

> The thing that I've been thinking of is twofold:
> 
> *) First, having a function that does word separation that allows us to do 
> basic text tokenization differently in different parts of the Unicode table.  
> Chinese is the obvious example that comes to mind here: it has a clear 
> concept of words, but no character for separating them.

My Chinese friends have shown me that words are often ambiguous, too :-(

There are two good approaches to this.  The first is to use a dictionary
of known words and apply a greedy algorithm to match those "known"
terms.  The limitation is that dictionaries are incomplete.

The second is to divide the text into bigrams or trigrams (that is,
groups of two or three characters).  This needs to be done to the
query as well, of course.  It has worked pretty well in evaluation
experiments, but of course it turns documents into mush.
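
To sketch both ideas (a toy version only: a std::set stands in for a
real dictionary, and I'm treating one byte as one character, which real
CJK text would not allow; none of these names come from existing code):

  #include <algorithm>
  #include <set>
  #include <string>
  #include <vector>

  // Greedy "maximum match": at each position take the longest dictionary
  // word that matches; if nothing matches, emit a single symbol and move on.
  std::vector<std::string> greedySegment(const std::string &text,
                                         const std::set<std::string> &dict,
                                         std::string::size_type maxLen = 8)
  {
      std::vector<std::string> words;
      std::string::size_type pos = 0;
      while (pos < text.size()) {
          std::string::size_type len = std::min(maxLen, text.size() - pos);
          while (len > 1 && !dict.count(text.substr(pos, len)))
              --len;
          words.push_back(text.substr(pos, len));
          pos += len;
      }
      return words;
  }

  // Bigram alternative: index overlapping two-symbol groups.  The query
  // has to be broken up the same way so the bigrams line up at search time.
  std::vector<std::string> bigrams(const std::string &text)
  {
      std::vector<std::string> grams;
      for (std::string::size_type i = 0; i + 1 < text.size(); ++i)
          grams.push_back(text.substr(i, 2));
      return grams;
  }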

> *) Second, building up two additional tries (character trees) for prefixes and 
> suffixes.  So, for example "foobar" would have multiple entries:
> 
>   Prefix trie:
>      f, fo, foo, foob, fooba, foobar
>   Suffix trie:
>      foobar, oobar, obar, bar, ar, r
> 
> This would hopefully allow for better indexing of languages that tend to use a 
> lot of prefixes and suffixes; German, Slavic languages, and well, quite a few 
> other language families do this a lot.  One thing that this structure would 
> lend itself to is inherited links -- i.e. everything in the suffix tree 
> that's contained in "oobar" wouldn't need to be duplicated in "obar" and 
> "bar".

(This is one way of doing truncation.)
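
Just to be sure we mean the same thing, here's a trivial sketch of the
entries that would feed the two tries (key generation only; the trie
storage and the inherited-link idea are left out, and the function
names are mine):

  #include <string>
  #include <vector>

  // All prefixes of a word: "foobar" -> f, fo, foo, foob, fooba, foobar
  std::vector<std::string> prefixKeys(const std::string &word)
  {
      std::vector<std::string> keys;
      for (std::string::size_type len = 1; len <= word.size(); ++len)
          keys.push_back(word.substr(0, len));
      return keys;
  }

  // All suffixes of a word: "foobar" -> foobar, oobar, obar, bar, ar, r
  std::vector<std::string> suffixKeys(const std::string &word)
  {
      std::vector<std::string> keys;
      for (std::string::size_type start = 0; start < word.size(); ++start)
          keys.push_back(word.substr(start));
      return keys;
  }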

My advice is to take a general approach to coding, and assume that
getting a set of rules for particular languages & document types
will happen later in many cases.  (Or that the approaches you choose
will be superseded by others' work.)

Essentially, you're talking about the low-level tokenizer (or
parser - the distinction isn't too clear for this type of application)
that will take a document and divide it into WORDS (each with a
position in the document), along with some context information for
the words, such as HTML/XML document structure, .rtf/.swf/etc.
headings, underlines, and so on.

Calling a different tokenizer for different document types is
clearly necessary (well, you could transform everything
to text first - then, different transformation engines are needed
instead).  Using different rule sets for how to identify words
from different languages and different document types will probably
be a user requirement.  Plus, as I said, eventually other folks
will want to offer alternatives.
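
Something along these lines is the kind of plumbing I have in mind (a
sketch only; it has no relation to the real KFileMetaInfo interfaces,
and all the names are made up):

  #include <string>
  #include <vector>

  // A token is a word plus its position and a bit of context
  // (heading, emphasis, XML element name, ...).
  struct Token {
      std::string word;
      unsigned    position;   // word offset within the (sub)document
      std::string context;    // e.g. "h1", "em", or empty for body text
  };

  // One tokenizer per document type (HTML, RTF, plain text, ...), each
  // of which can consult per-language rules for identifying words.
  // Other people can then drop in their own implementations later.
  class Tokenizer {
  public:
      virtual ~Tokenizer() {}
      virtual std::vector<Token> tokenize(const std::string &document,
                                          const std::string &language) = 0;
  };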

> This would seem to offer a way to find words that are parts of compounds.  
> i.e. my favorite German word of the last few weeks has been:
> 
>   "Halbjahreskartezuzahlungsfrei"
> 
> Which can be broken into "halb" "jahr" "karte" "zu" "zahlung" and "frei" (all 
> distinct words).  With the structure above this would allow matching on 
> "zahlung" or "karte".  This would naturally be weighted somewhat less in the 
> results than a word match, but should still be useful.  (Thanks to Maks for 
> helping to flesh out this idea.)
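
If I follow, the matching itself can start out as simple brute force
over the suffix entries, something like this sketch (a std::set stands
in for the tries, and the names are made up):

  #include <set>
  #include <string>
  #include <vector>

  // Find known words embedded in a compound: for every suffix of the
  // compound, check which of its prefixes are in the dictionary.
  std::vector<std::string> embeddedWords(const std::string &compound,
                                         const std::set<std::string> &dict)
  {
      std::vector<std::string> hits;
      for (std::string::size_type start = 0; start < compound.size(); ++start)
          for (std::string::size_type len = 1; start + len <= compound.size(); ++len)
              if (dict.count(compound.substr(start, len)))
                  hits.push_back(compound.substr(start, len));
      return hits;   // e.g. "karte" and "zahlung" for the word above
  }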

One quick note to augment what I mentioned above: different 
domains/disciplines have different needs.  Consider someone doing
medical/biological searching, versus someone with a lot of humanities
documents.  What about journal articles (say, in XML) with lots of
tables of numeric data, headings, etc.?  

> Anyway -- so those are the things going through my head.  I'd certainly be 
> interested in hearing especially from the IR folks on here...

Sorry for being quiet until now.  Nassib had some good comments
earlier, and we're both still excited about getting this rolling.
  -- Greg

Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
Univ of Alaska Fairbanks-909 Koyukuk Dr-PO Box 756020-Fairbanks-AK 99775-6020
e: newby AT arsc.edu v: 907-450-8663 f: 907-450-8601 w: www.arsc.edu/~newby


