starting to look at indexing...

Gregory Newby newby at arsc.edu
Tue Nov 9 04:27:20 CET 2004


On Mon, Nov 08, 2004 at 05:04:21AM +0100, Scott Wheeler wrote:
...
> On Monday 08 November 2004 3:07, Gregory Newby wrote:
> ...
> > > *) Phrases will be searched for using online search (searching the
> > > document on demand) with possible hints from anchors.  I'm still not sure
> > > on this one and may change my mind later; it may be useful to arbitrarily
> > > create anchors for documents over a certain size to chunk them
> >
> > Chunking documents into subdocuments is desirable.  It doesn't
> > reduce the need for phrase search (which adds overhead for both
> > indexing & retrieval, but is very useful).  But it does help to
> > get higher precision in non-phrase searches because the context
> > for matching documents is smaller.  (That is, you can easily
> > do retrieval at the paragraph level [or whatever subdocument
> > chunk you choose], instead of just the whole document level.)
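
As a rough sketch of that kind of paragraph-level chunking (the splitting
rule and the subdocument id scheme below are made up for illustration):

  # Sketch: split a document into paragraph subdocuments, each with its
  # own id, so matches can be reported per chunk rather than per file.
  def chunk_paragraphs(doc_id, text):
      paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
      return [("%s#p%d" % (doc_id, i), p) for i, p in enumerate(paragraphs)]

  print(chunk_paragraphs("report.txt", "First paragraph.\n\nSecond one."))
  # [('report.txt#p0', 'First paragraph.'), ('report.txt#p1', 'Second one.')]
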
> 
> Yeah -- I've also been thinking about the pros and cons of storing word 
> position vs. online search.  Today I'm thinking that we'll probably need word 
> position too, just because retrieval speed for text is relatively slow with 
> many formats (e.g. a KOffice document or something).

You'll need word position to do phrase searching.
(Or, you'll need to index phrases as chunks, but that doesn't
allow for arbitrary phrases.  One strategy I've seen is indexing
noun-noun phrases.  Personally, I like including position.)
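
To make the position idea concrete, here is a minimal sketch of a
positional inverted index with a phrase lookup over it (all names are
illustrative, not taken from any existing code):

  # Minimal sketch of a positional inverted index and phrase search.
  from collections import defaultdict

  index = defaultdict(lambda: defaultdict(list))  # term -> doc_id -> [positions]

  def add_document(doc_id, words):
      for pos, word in enumerate(words):
          index[word][doc_id].append(pos)

  def phrase_search(phrase_words):
      """Return doc_ids where the words occur at consecutive positions."""
      if not phrase_words:
          return set()
      # Candidate docs must contain every word of the phrase.
      docs = set(index[phrase_words[0]])
      for w in phrase_words[1:]:
          docs &= set(index[w])
      hits = set()
      for doc in docs:
          for p in index[phrase_words[0]][doc]:
              if all((p + i) in index[w][doc] for i, w in enumerate(phrase_words)):
                  hits.add(doc)
                  break
      return hits

  add_document("a.txt", ["the", "quick", "brown", "fox"])
  add_document("b.txt", ["brown", "quick", "the", "fox"])
  print(phrase_search(["quick", "brown"]))   # {'a.txt'}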



> ...
> > It could help not to do any stemming or truncation, i.e. changing
> > "housing" into "house" or "hous" (Google supposedly skips this), but
> > in a fully-featured system stemming and/or truncation should be an
> > option for selection by the user (at index time).
> 
> Yeah -- I've now got Modern Information Retrieval and it says that the 
> benefits of stemming are somewhat debated.  However, I've also noticed that 
> the book mostly ignores non-Western languages, which we can't really 
> do.
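
As a toy illustration of treating stemming as an index-time option (the
suffix rules below are deliberately naive stand-ins; a real system would
use a proper per-language stemmer such as Porter's):

  def naive_stem(word):
      # Illustrative only: strip a few common English suffixes.
      for suffix in ("ing", "ed", "es", "s"):
          if word.endswith(suffix) and len(word) > len(suffix) + 2:
              return word[:-len(suffix)]
      return word

  def normalize(word, use_stemming=False):
      word = word.lower()
      return naive_stem(word) if use_stemming else word

  print(normalize("Housing", use_stemming=True))   # "hous"
  print(normalize("Housing", use_stemming=False))  # "housing"
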
> 
> > > The thing that I've been thinking of is two fold:
> > >
> > > *) First, having a function that does word separation, allowing us to
> > > do basic text tokenization differently in different parts of the Unicode
> > > table.  Chinese is the obvious example that comes to mind here: it has a
> > > clear concept of words, but no character for separating them.
> >
> > My Chinese friends have shown me that words are often ambiguous, too :-(
> >
> > There are two good approaches to this.  The first is to use a dictionary
> > of known words and apply a greedy algorithm to match these "known" terms.
> > The limitation is that dictionaries are incomplete.
> >
> > The second is to divide into bigrams or trigrams (that is, groups
> > of two or three characters).  This needs to be done to the
> > query as well, of course.  It's worked pretty well in evaluation
> > experiments, but of course turns documents into mush.
> 
> Hmm.  Interesting.  I'll try to keep the first approach in mind for later -- 
> but I suppose that'll be encapsulated in the word splitter.  Do you know if 
> there are equivalents to /usr/share/dict/words on UNIXes in those locales?  
> That would certainly simplify finding such a dictionary.

I don't know.
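
For reference, the greedy ("longest match first") dictionary approach might
look roughly like this; the toy dictionary is made up, and the example also
shows how a greedy match can pick an unintended split, which is exactly the
ambiguity problem mentioned above:

  def greedy_segment(text, dictionary, max_word_len=4):
      words = []
      i = 0
      while i < len(text):
          # Try the longest dictionary entry starting at position i;
          # fall back to a single character if nothing matches.
          for length in range(min(max_word_len, len(text) - i), 0, -1):
              candidate = text[i:i + length]
              if length == 1 or candidate in dictionary:
                  words.append(candidate)
                  i += length
                  break
      return words

  dictionary = {"中国", "人民", "中国人"}        # toy dictionary
  print(greedy_segment("中国人民", dictionary))  # ['中国人', '民']
  # Greedy matching grabs 中国人 and strands 民, even though
  # 中国 / 人民 is the intended segmentation.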

> The second one sounds like it could get pretty expensive.  Do you have any 
> papers or anything that might be relevant to look at on the topic?

No, bigrams & trigrams are very easy & cheap to implement.  This
was an approach taken by many TREC experiments (http://trec.nist.gov,
in the Publications area) when there was a Chinese track.  Today,
there are separate conferences for CJK (Chinese-Japanese-Korean)
languages, and I'm not so sure of the state of the art.

There are only about 5K individual characters in simplified 
Chinese, though over 25K for traditional Chinese.  This is far
smaller than the number of English words, but I do not know
the number of "valid" combinations of characters to make "words".
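
A character-bigram splitter along those lines is only a few lines of code;
the same function would be applied to both documents and queries (the
character-range filter below is a simplification for illustration):

  def char_bigrams(text):
      # Keep only characters in the CJK Unified Ideographs block.
      chars = [c for c in text if '\u4e00' <= c <= '\u9fff']
      return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

  print(char_bigrams("中国人民"))   # ['中国', '国人', '人民']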

> ...
> > Essentially, you're talking about the low-level tokenizer (or
> > parser - the distinction isn't too clear for this type of application)
> > that will take a document and divide it into WORDS (each of which
> > has a position in the document) as well as some context information
> > for the words, such as HTML/XML document structure, .rtf/.swf/etc.
> > headings, underlines, etc.
> 
> At some point I had thought of just coming up with a simplified rich text 
> format for the full text output; I don't know if that's common or not.

This is all fairly new.  Google does it, of course, but it's
not that commonly seen elsewhere.  I think that something like
"simplified rich text" is the right approach, but there is not
a lot of experimental evidence.
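
One possible shape for that per-word output, purely as a sketch (the field
names and tags are made up): each word carries its position plus a set of
context tags derived from the document structure.

  from collections import namedtuple

  Token = namedtuple("Token", ["word", "position", "context"])

  def tokenize_with_context(parts):
      """parts: list of (text, context_tags) pairs from a format filter."""
      tokens, pos = [], 0
      for text, tags in parts:
          for word in text.split():
              tokens.append(Token(word.lower(), pos, frozenset(tags)))
              pos += 1
      return tokens

  parts = [("Indexing Basics", {"heading"}), ("Word position matters.", set())]
  for t in tokenize_with_context(parts):
      print(t)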

  -- Greg

PS: This year's TREC is next week, so I might pick up a
few new ideas :-)

