Anything about Tenor? & creating a content system -> RDF

Leo Sauermann leo at gnowsis.com
Wed Aug 10 10:44:55 CEST 2005


Hi guys,

as previously (and repeatedly) said to Scott, you guys are rebuilding
the Semantic Web. Your problem is that you want to redesign the
standards for querying, storage and text similarity. Take a look at
RDF (the Resource Description Framework) of the W3C now.

I want to point out two things to all of you:
* text similarity is no big thing
* your architecture will be something like Kat combined with an RDF graph


= text similarity is no big thing =
First of all, Roberto wrote:

It will allow searches like: "find a document similar to this". No
matter if the words used in the documents differ, the linguistic
fingerprints of documents belonging to the same domain are similar.
We want to exploit this property of language.
I would like you to notice that this feature is missing from ALL the
competitors.

That depends on which competitors you know. Commercial tools like
Convera, Autonomy, OpenText, Brainfiler, ... all support this. It is
already so normal that even the Lucene search engine core (as used in
Beagle) supports it. What you call a "fingerprint" is usually a vector
space model, and what you call language similarity is a TF/IDF matrix,
which is implemented in Lucene. The idea is old, well documented and
taught at university. Lucene also has multilingual stemmers and other
stuff needed for text similarity (like a broad open source developer
community).
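
If you want to see how unspectacular this is: here is a minimal toy
sketch of the vector space model with TF/IDF weighting and cosine
similarity, in plain Java (my own illustration, not Lucene's actual
implementation):

  import java.util.*;

  // Toy vector space model: documents become TF/IDF vectors, and
  // "find a document similar to this" is cosine similarity of vectors.
  public class TfIdf {
      // term frequency: how often each term occurs in one document
      static Map<String, Double> tf(String[] doc) {
          Map<String, Double> v = new HashMap<>();
          for (String t : doc) v.merge(t, 1.0, Double::sum);
          return v;
      }

      // inverse document frequency: rare terms weigh more than common ones
      static double idf(String term, List<String[]> corpus) {
          long n = corpus.stream()
                         .filter(d -> Arrays.asList(d).contains(term))
                         .count();
          return Math.log(1.0 + (double) corpus.size() / n);
      }

      // cosine similarity between two weighted term vectors
      static double cosine(Map<String, Double> a, Map<String, Double> b) {
          double dot = 0, na = 0, nb = 0;
          for (Map.Entry<String, Double> e : a.entrySet())
              dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
          for (double x : a.values()) na += x * x;
          for (double x : b.values()) nb += x * x;
          return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
      }

      public static void main(String[] args) {
          List<String[]> corpus = Arrays.asList(
              "the cat sat on the mat".split(" "),
              "the dog sat on the log".split(" "),
              "rdf describes graphs of resources".split(" "));
          List<Map<String, Double>> vecs = new ArrayList<>();
          for (String[] d : corpus) {
              Map<String, Double> v = tf(d);
              v.replaceAll((t, w) -> w * idf(t, corpus));
              vecs.add(v);
          }
          // documents 0 and 1 share vocabulary, document 2 does not
          System.out.printf("sim(0,1)=%.3f  sim(0,2)=%.3f%n",
              cosine(vecs.get(0), vecs.get(1)),
              cosine(vecs.get(0), vecs.get(2)));
      }
  }

Lucene does the same thing with a real index behind it, plus stemming,
stopword handling and so on.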

= your architecture will be something like Kat combined with an RDF graph  =

I agree with this:

so we have kat which has lots of code.
we have tenor which has lots of design.
there are four layers to be considered, from bottom to top:
0. storage
1. API
2. population
3. user interface

Now let's look at what a result might look like:

To see what a finished product looks like, download
http://aduna.biz/products/autofocus/index.html
It roughly does what you have in mind, without the interfaces and APIs.
Download it and let it build an index. Then go to the folder where the
index is and look at it. I will now describe how they did it, and
that's how you (and all the others) should do it.

0. Storage:
=For text similarity you will always need a text index. Period.=
The text index has to be separated from the database OR integrated at
a very deep level (like the FULLTEXT index available in MySQL. That
fulltext support required major changes to MySQL's architecture and to
the query format, which is the kind of big hack you don't want to do).
Good luck if you want to reimplement that: it's bone-hard work.
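
To make the query format point concrete, here is a hedged sketch of
what a MySQL FULLTEXT query looks like through JDBC (table and column
names are invented):

  import java.sql.*;

  public class FulltextDemo {
      public static void main(String[] args) throws SQLException {
          try (Connection c = DriverManager.getConnection(
                   "jdbc:mysql://localhost/kat", "user", "password");
               PreparedStatement st = c.prepareStatement(
                   // MATCH ... AGAINST is MySQL-only syntax and needs a
                   // FULLTEXT index on the column: that is the change to
                   // the query format mentioned above
                   "SELECT path FROM documents " +
                   "WHERE MATCH(contents) AGAINST (?)")) {
              st.setString(1, "semantic web");
              try (ResultSet rs = st.executeQuery()) {
                  while (rs.next()) System.out.println(rs.getString("path"));
              }
          }
      }
  }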

The Lucene index format is effective for this. Take it as an example,
or use it directly (there is a C port). It is used by Autofocus; look
at the folder "index" in your filesystem.

=For the graph, you should use RDF=
and any serialization or storage of it. RDF describes graphs, as you can
read here:
http://www.w3.org/TR/rdf-primer/
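
For example, here is a graph of two statements about one file, written
in the N-Triples serialization (file name and values are invented):

  <file:///home/leo/notes.txt> <http://purl.org/dc/elements/1.1/title> "My notes" .
  <file:///home/leo/notes.txt> <http://purl.org/dc/elements/1.1/creator> "Leo" .

Each line is one edge of the graph: subject, predicate (the edge
label), object. That is all there is to the data model.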

The advantage is that RDF is used in RSS, in PDF files, the whole
Mozilla XUL is based on it, the DMOZ directory, newsfeeds, ... you
will find this technology all over the place. Google for it. If you
have questions, don't hesitate to ask me:
leo at gnowsis.com

or the hundreds of mailing lists:
semantic-web at w3.org
rdfweb-dev at vapours.rdfweb.org - ok
semanticweb at yahoogroups.com
www-rdf-interest at w3.org

this is another interesting starting point:
http://www.w3.org/2001/sw/

Fact is: if you now begin to discuss a graph serialization format and
its query language and so on, you lose time in which you could
otherwise write code. It is really a waste of time to discuss graph
programming interfaces while I teach hundreds of students how to use
the existing APIs.

To prove the idea of RDF: Aduna Autofocus uses it in the other
directory, called "repository". There the structured data (RDF) is
stored in a binary format. It is also used by thousands of other
projects; google for them.

1. API
A generic Java API to handle RDF is here:
http://www-db.stanford.edu/~melnik/rdf/api.html

A widely used Java API is Jena:
jena.sf.net
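
To give you a taste, a minimal Jena 2 sketch that builds the
two-triple graph from above and serializes it (a sketch, assuming the
current Jena 2 class names):

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.rdf.model.Resource;
  import com.hp.hpl.jena.vocabulary.DC;

  public class JenaDemo {
      public static void main(String[] args) {
          Model model = ModelFactory.createDefaultModel();
          Resource file =
              model.createResource("file:///home/leo/notes.txt");
          file.addProperty(DC.title, "My notes");    // one triple
          file.addProperty(DC.creator, "Leo");       // another triple
          model.write(System.out, "N-TRIPLE");  // or "RDF/XML", "N3", ...
      }
  }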

There are C implementations of RDF APIs, most of them on SourceForge:
Redland, Raptor, 3store, ...

Querying these graphs can be done via various query languages; the
best is this one:
http://www.w3.org/TR/rdf-sparql-query/
(it also supports full-text search and you can extend it; we have done
something like this in the gnowsis project)
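
For a taste of SPARQL, a query that would find all documents with a
title in the little graph from above (prefix and variable names are
just illustration):

  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  SELECT ?doc ?title
  WHERE { ?doc dc:title ?title }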

2. population
That's up to you. But I would reuse the KAT extractors/filters for the
fulltext and expand them to also extract core metadata and express it
as Dublin Core data. This is the usual way to go; see the sketch below.
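
A hedged sketch of that population step, again with Jena; the
Extractor interface is a hypothetical stand-in for whatever KAT's
filters actually expose:

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.Resource;
  import com.hp.hpl.jena.vocabulary.DC;

  public class Populate {
      // hypothetical stand-in for a KAT extractor/filter
      interface Extractor {
          String title(String path);
          String fulltext(String path);
      }

      static void populate(Extractor ex, Model model, String path) {
          Resource doc = model.createResource("file://" + path);
          // structured metadata goes into the RDF graph as Dublin Core
          doc.addProperty(DC.title, ex.title(path));
          // the fulltext itself belongs in the Lucene index (layer 0),
          // not in the graph
      }
  }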


3. user interface
That's up to you.
Again, I can point you to Aduna Autofocus for a full-screen search app.
For a "search text field only" you have to respect the "Google-affine"
user and provide the minimal
"search: ____ ok" text field somewhere in your applications:
http://www.useit.com/alertbox/20050509.html

Preferably, the GUI would be embedded in Konqueror.


Again, if you two begin discussing graph serialization and
standardization, storage and so on, it makes the hair on my neck rise
and my stomach ache.

So, I hope this helps to bring the projects forward,
Leo

btw: my job is to standardize all of this stuff at the European level,
so if you want to know what the others are doing, contact me.

