starting to look at indexing...

Scott Wheeler wheeler at kde.org
Mon Nov 8 05:04:21 CET 2004


On Monday 08 November 2004 3:07, Gregory Newby wrote:
> > *) KLink::Link will be a triplet containing a source node (for text a
> > KLink::Word), a target node (in the simple case a file), and optionally a
> > KLink::Anchor (which says how to index into the target)
>
> I'm vague on the use of Anchor.  Is this a document type?

It's a generic mechanism for indexing into a resource, somewhat analogous to 
anchors in HTML.  However, in the current API sketch it's just a block of 
binary data whose interpretation is left up to the application associated 
with the link.

I'm not sure if the same structure could be used for chunking or not -- I'll 
have to think on that a bit.  One example though would be a link going into a 
mail with KMail as the associated application.  You'd want some way to 
indicate "this came from the subject of the message with this message ID" in 
a generic way; that's what the anchors are for.

Apps that supported anchors would need to have their own handlers for 
decoding and interpreting them.
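
To make that a bit more concrete, here's roughly the shape I have in mind -- 
none of these names are final, it's purely a sketch:

    #include <qglobal.h>    // Q_UINT64
    #include <qcstring.h>   // QByteArray

    namespace KLink
    {
        // 32 vs. 64 bits is discussed below.
        typedef Q_UINT64 NodeId;

        struct Anchor
        {
            // Opaque payload; only the application associated with the link
            // (KMail in the example above) knows how to decode it.
            QByteArray data;
        };

        struct Link
        {
            NodeId source;   // e.g. a KLink::Word
            NodeId target;   // in the simple case, a file
            Anchor anchor;   // optional; empty when unused
        };
    }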

> > *) Everything will have an ID -- probably a 64 bit unsigned integer, so
> > the above will be 3 IDs, anchors are likely to be link specific, so that
> > adds some overhead, but I don't think they'll be used most of the time
>
> Yes, this is critical.  It gives you fixed-length records all-around.
> Since 64bit is 2x 32bit in size, I would consider using 32bit if
> the primary destination is desktop and server systems (versus
> indexing the Web or an enterprise).  ~2G (signed) or ~4G
> is a lot of words, documents or anchors.

Yes, the issue that I thought of was that we'll have a generic concept of 
nodes and the indexes will be per node rather than per word.  A node can be a 
lot of things -- a file, a word, a URL, etc.  I wasn't sure how safe it would 
be to assume 2^32 of these, but it may be safe enough.  Maybe we should start 
with 32 bits, run some tests, and see how that number grows in the time 
before this stuff is ready to be released.
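
If the ID width lives behind a single typedef it's cheap to flip for those 
tests -- something like this (again, just a sketch):

    #include <qglobal.h>

    namespace KLink
    {
        // One switch for the node ID width, so moving to 64 bits later is
        // a recompile rather than an API change.
    #ifdef KLINK_WIDE_IDS
        typedef Q_UINT64 NodeId;
    #else
        typedef Q_UINT32 NodeId;
    #endif
    }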

> FWIW, a good estimate is that each new document will add one new
> word.  So, if you're not going to have > 2G documents, 32bit is
> enough.
>
> It's OK to use 64 bit, and you can often get very good
> compression on indexes so that the high bits, when unneeded,
> will compress.  But don't do it just because you think it's likely
> that this will be a limitation.  Unless we're getting beyond
> the individual desktop level, it's not that likely.  Beyond that,
> I'd say it's a good choice.

Yeah -- that's the other side of things.  At some point this framework will 
likely be used for indexing things like an entire LAN; at least conceptually 
that would extend the search space and make things a lot more interesting.

But that has a lot of other associated issues and isn't something that I'm 
ready to tackle conceptually just yet anyway.

> > *) Phrases will be searched for using online search (searching the
> > document on demand) with possible hints from anchors.  I'm still not sure
> > on this one and may change my mind later, it may be useful to arbitrarily
> > create anchors for documents over a certain size to chunk them
>
> Chunking documents into subdocuments is desirable.  It doesn't
> reduce the need for phrase search (which adds overhead for both
> indexing & retrieval, but is very useful).  But it does help to
> get higher precision in non-phrase searches because the context
> for matching documents is smaller.  (That is, you can easily
> do retrieval at the paragraph level [or whatever subdocument
> chunk you choose], instead of just the whole document level.)

Yeah -- I've also been thinking about the pros and cons of storing word 
positions vs. online search.  Today I'm thinking that we'll probably need 
word positions too, just because text retrieval is relatively slow for many 
formats (i.e. a KOffice document or something).
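
i.e. per (word, document) pair we'd store something roughly like this -- 
hypothetical names, nothing final:

    #include <qglobal.h>
    #include <qvaluelist.h>

    // Sketch: one posting per (word, node) pair.  Positions are optional so
    // that cheap-to-reparse documents don't have to pay for them, but slow
    // formats (KOffice etc.) can skip the online search entirely.
    struct Posting
    {
        Q_UINT64 node;                    // ID of the document/file node
        QValueList<Q_UINT32> positions;   // word offsets; may be empty
    };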

Originally I was thinking of chunking as being useful for online search, but 
that means that our layer for extracting text would need to be aware of the 
chunking mechanism, which I'm not sure is practical.  I'm still not sure how 
we're going to handle really big texts.  For the moment I'll probably just 
ignore them and put that on the conceptual TODO.

> > Full text isn't there at the moment -- though I'll probably patch kdelibs
> > shortly so that we can start working with full text data in a few
> > selected KFileMetaInfo plugins.
> >
> > I'm hoping to have a toy demo in a few weeks for a presentation that I'm
> > doing and I think this should be pretty doable given that I wrote a throw
> > away mock-up for the KDE conference in about 3 days.  Things are much
> > more complicated now, but I'm optimistic.
>
> Great!  Is there a source code repository we'll be using?

Yep -- I mentioned this a while back, but the API sketch is in KDE's CVS 
repository under kdenonbeta/klink.  There are already some issues that I want 
to rework, but that's where it'll be happening...

There are some instructions on using it here:

http://developer.kde.org/source/anoncvs.html

Of course at some point if you're interested in write access that's no problem 
to set up either.

> > The other thing that I've been thinking on is how to handle non-Western
> > languages where word separation isn't clear from a pure bytestream.
>
> It's a tough problem.  The general approach I recommend is to have
> drop-in parsing rules.  These will be needed for both languages and
> for document types.

Yeah, that's what I was thinking of.  I'll probably create a factory class 
for instantiating word splitters.  I'm fine with having a pretty dumb default 
implementation at first so that a little show-and-tell is possible, but I'm 
trying to keep the big picture in mind for further down the line.
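
Something along these lines is what I'm picturing -- just the shape, not 
real classes:

    #include <qstring.h>
    #include <qstringlist.h>

    // Sketch of the word-splitter idea.  Splitters are chosen per script or
    // locale; the default just breaks on whitespace and punctuation.
    class WordSplitter
    {
    public:
        virtual ~WordSplitter() {}
        virtual QStringList split(const QString &text) const = 0;
    };

    class WordSplitterFactory
    {
    public:
        // Dumb for now; later this would look at the Unicode ranges (or the
        // locale) in the text and pick an appropriate splitter.
        static WordSplitter *splitter(const QString &locale);
    };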

If I understand what you're getting at with documents, I think we'll at 
least be able to leverage our KFileMetaInfo system there and just add an 
extension for retrieving full text; though it may need to come in buffered 
blocks to avoid crazy levels of memory consumption in some cases.
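
i.e. rather than handing back one giant string, something like this 
(hypothetical, not an existing KFileMetaInfo interface):

    #include <qstring.h>

    // Sketch of a buffered full text extension: the indexer pulls text in
    // bounded blocks instead of loading the whole document at once.
    class FullTextExtractor
    {
    public:
        virtual ~FullTextExtractor() {}

        // Returns the next block of extracted text, or a null QString when
        // the document is exhausted.  blockSize is only a hint.
        virtual QString nextBlock(uint blockSize = 64 * 1024) = 0;
    };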

> It could help to not do any stemming or truncation (like changing
> "housing" into "house" or "hous") (Google does this, supposedly), but
> in a fully-featured system stemming and/or truncation should be an
> option for selection by the user (at index time).

Yeah -- I've now got Modern Information Retrieval, and it says that the 
benefits of stemming are somewhat debated.  However, I've also noticed that 
the book tends to mostly ignore non-Western languages, which we can't really 
afford to do.

> > The thing that I've been thinking of is two fold:
> >
> > *) First, having a function that does word separation that allows us to
> > do basic text tokenization differently in different parts of the Unicode
> > table. Chinese is the obvious example that comes to mind here that has a
> > clear concept of words, but no character for separating them.
>
> My Chinese friends have shown me that words are often ambiguous, too :-(
>
> Two good approaches to this are to use a dictionary of known
> words, and apply a greedy algorithm to match these "known" terms.
> Limitation is that dictionaries are incomplete.
>
> Second is to divide into bigrams or trigrams (that is, groups
> of two or three characters).  This needs to be done to the
> query, as well, of course.  It's worked pretty well in evaluation
> experiments, but of course turns documents into mush.

Hmm.  Interesting.  I'll try to keep the first approach in mind for later -- 
but I suppose that'll be encapsulated in the word splitter.  Do you know if 
there are equivalents to /usr/share/dict/words on UNIXes in those locales?  
That would certainly simplify finding such a dictionary.
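
Roughly, the greedy matching inside such a splitter would look something 
like this (a sketch; it assumes we've found a word list to feed it):

    #include <qstring.h>
    #include <qstringlist.h>
    #include <qmap.h>

    // Sketch of greedy longest-match segmentation against a dictionary.
    // Where no known word matches, fall back to a single character.
    QStringList greedySplit(const QString &text,
                            const QMap<QString, bool> &dictionary,
                            int maxWordLength = 8)
    {
        QStringList words;
        int i = 0;
        while (i < int(text.length())) {
            int length = maxWordLength;
            if (length > int(text.length()) - i)
                length = int(text.length()) - i;
            while (length > 1 && !dictionary.contains(text.mid(i, length)))
                --length;
            words.append(text.mid(i, length));
            i += length;
        }
        return words;
    }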

The second one sounds like it could get pretty expensive.  Do you have any 
papers or anything that might be relevant to look at on the topic?
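
(Mechanically it looks simple enough -- something like the sketch below -- 
it's mostly the size of the index and the result quality that I'm wondering 
about.)

    #include <qstring.h>
    #include <qstringlist.h>

    // Sketch: overlapping bigrams for a run of CJK text.  The query has to
    // be split the same way for the matches to line up.
    QStringList bigramSplit(const QString &run)
    {
        QStringList result;
        if (int(run.length()) < 2) {
            if (!run.isEmpty())
                result.append(run);
            return result;
        }
        for (int i = 0; i + 1 < int(run.length()); ++i)
            result.append(run.mid(i, 2));
        return result;
    }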

> > *) Second, building up two additional tries (character trees) for
> > prefixes and suffixes.  So, for example "foobar" would have multiple
> > entries:
> >
> >   Prefix trie:
> >      f, fo, foo, foob, fooba, foobar
> >   Suffix trie:
> >      foobar, oobar, obar, bar, ar, r
> >
> > This would hopefully allow for better indexing of languages that tend to
> > use a lot of prefixes and suffixes; German, Slavic languages, and well,
> > quite a few other language families do this a lot.  One thing that this
> > structure would lend itself to is inherited links -- i.e. everything in
> > the suffix tree that's contained in "oobar" wouldn't need to be
> > duplicated in "obar" and "bar".
>
> (This is one way of doing truncation.)

Yeah -- I think all of this is a bit funny: we've got a handful of folks 
with backgrounds in a variety of classes of algorithms, but almost no IR -- 
so we're kind of guessing at a lot of this.  :-)
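
For reference, the naive version of the suffix side -- without the inherited 
link trick, which is the part that still needs thought -- is just this 
(sketch only):

    #include <qstring.h>
    #include <qmap.h>
    #include <qvaluelist.h>
    #include <qglobal.h>

    // Sketch: insert every suffix of a word into a character trie.  Nothing
    // here avoids the duplication described above yet.
    struct TrieNode
    {
        QMap<QChar, TrieNode *> children;
        QValueList<Q_UINT64> postings;   // IDs of nodes containing the word
    };

    void insertSuffixes(TrieNode *root, const QString &word, Q_UINT64 id)
    {
        for (int start = 0; start < int(word.length()); ++start) {
            TrieNode *node = root;
            for (int i = start; i < int(word.length()); ++i) {
                QChar c = word[i];
                if (!node->children.contains(c))
                    node->children.insert(c, new TrieNode());
                node = node->children[c];
            }
            node->postings.append(id);
        }
    }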

> My advice is to take a general approach to coding, and assume that
> getting a set of rules for particular languages & document types
> will happen later in many cases.  (Or that the approaches you choose
> will be superseded by others' work.)

Well, I'm kind of trying to assume both right now.  I've said this to most of 
the other folks as well.  I expect almost everything that we're coding now to 
be thrown away in the next year, but that's a natural step.  I'm targeting 
having a system where the base is at least pretty stable for KDE 4.0 -- 
probably in about a year and a half.

> Essentially, you're talking about the low-level tokenizer (or
> parser - the distinction isn't too clear for this type of application)
> that will take a document and divide it into WORDS (each of which
> has a position in the document) as well as some context information
> for the words, such as HTML/XML document structure, .rtf/.swf/etc.
> headings, underlines, etc.

At some point I had thought of just coming up with a simplified rich text 
format for the full text output; I don't know if that's a common approach or 
not.

> Calling a different tokenizer for different document types is
> clearly necessary (well, you could transform everything
> to text first - then, different transformation engines are needed
> instead).  Using different rule sets for how to identify words
> from different languages and different document types will probably
> be a user requirement.  Plus, as I said, eventually other folks
> will want to offer alternatives.

At least up until now I've assumed some separation between the extraction 
and indexing layers.  Is that something I should be less certain of?  As I 
noted above, I was thinking that if the extraction layer returned a simple 
rich text format, we could probably generalize indexing from there; at least 
that's what's been going around in my head.
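
By "simple rich text" I don't mean much more than this (entirely 
hypothetical):

    #include <qstring.h>
    #include <qglobal.h>

    // Sketch: the extraction layer hands the indexer a flat stream of words
    // plus a little context, rather than the original document format.
    struct TextToken
    {
        QString word;
        Q_UINT32 position;   // word offset within the document
        bool heading;        // came from a heading or title
        bool emphasized;     // bold, underlined, etc.
    };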

> > This would seem to offer a way to find words that are parts of compounds.
> > i.e. my favorite German word of the last few weeks has been:
> >
> >   "Halbjahreskartezuzahlungsfrei"
> >
> > Which can be broken into "halb" "jahr" "karte" "zu" "zahlung" and "frei"
> > (all distinct words).  With the structure above this would allow matching
> > on "zahlung" or "karte".  This would naturally be weighted somewhat less
> > in the results than a word match, but should still be useful.  (Thanks to
> > Maks for helping to flesh out this idea.)
>
> One quick note to augment what I mentioned above: different
> domains/disciplines have different needs.  Consider someone doing
> medical/biological searching, versus someone with a lot of humanities
> documents.  What about journal articles (say, in XML) with lots of
> tables of numeric data, headings, etc.?

Yeah -- searching is something that will definitely need to be modular.  
With the classes of applications that we're kicking around using this stuff 
for, that's quite clear.  But I've previously been assuming that indexing 
will be more static; ideally we'd be able to encapsulate the data well 
enough in the index layer that various search types could be built on the 
same data store.
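
Roughly what I mean by that separation, as a sketch (none of these classes 
exist):

    #include <qstring.h>
    #include <qvaluelist.h>
    #include <qglobal.h>

    // The index layer exposes a small, fixed query surface...
    class IndexReader
    {
    public:
        virtual ~IndexReader() {}
        virtual QValueList<Q_UINT64> postings(const QString &word) const = 0;
    };

    // ...and different search types are built on top of the same store.
    class SearchStrategy
    {
    public:
        virtual ~SearchStrategy() {}
        virtual QValueList<Q_UINT64> search(const IndexReader &index,
                                            const QString &query) const = 0;
    };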

> > Anyway -- so those are the things going through my head.  I'd certainly
> > be interested in hearing especially from the IR folks on here...
>
> Sorry for being quiet earlier.  Nassib had some good comments
> earlier, and we're both still excited at getting this rolling.

Ah -- no problem.  I'm just now getting around to working on this again, and 
I hope to see it pick up a little steam since we now have a slightly larger 
group interested in it and, well, I want to show some stuff before too 
long.  :-)

-Scott

