Anything about Tenor? & creating a content system -> RDF

Wed Aug 10 17:44:12 CEST 2005

On Wednesday 10 August 2005 02:44, Leo Sauermann wrote:
> as previously (and repeatedly) said to Scott, you guys are rebuilding
> the Semantic Web.

actually, we're not =) one of the troubles of having had a several 
month gap in this project's progress is that people quickly return to their 
preconceived notions if they aren't continuously reminded otherwise. 

after all, when what you've got is a hammer, or in this case RDF....

i think that RDF may be a useful export/import format, but for internal usage 
it's not the game we need. this is not about creating key/value tags for 
discrete units of information (e.g. documents or emails) but representing the 
common elements between them.

more importantly, there are times when two units of information are directly 
related to each other but not in a key/value type relationship. when i make a 
sticky note and "stick" it to an email we could abuse key/value and say 
"note:$NOTEID" but that requires the knowledge of notes to use them and that 
simply should not be necessary. implicit linkage seems to be poorly serviced 
by RDF. please correct me if i'm wrong here.
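
to make the sticky note case concrete, here is a minimal sketch of what an untyped link could look like. everything in it is hypothetical (the LinkGraph class, the method names and the "email:1234" style identifiers are made up for illustration and are not Tenor's actual design):

```python
class LinkGraph:
    """Hypothetical in-memory graph of untyped links between units of information."""

    def __init__(self):
        # unit identifier -> set of identifiers it is linked to
        self._links = {}

    def link(self, a, b):
        # the link is symmetric and carries no key: neither side needs to
        # know what kind of thing sits on the other end
        self._links.setdefault(a, set()).add(b)
        self._links.setdefault(b, set()).add(a)

    def related(self, unit):
        # everything directly linked to `unit`, whatever its type
        return sorted(self._links.get(unit, ()))


graph = LinkGraph()
graph.link("email:1234", "note:42")       # the sticky note stuck to the email
graph.link("email:1234", "contact:aunt")  # the sender of that email

print(graph.related("email:1234"))
print(graph.related("note:42"))
```

the point of the sketch is that the email never stores a "note:" key. code that has never heard of notes can still walk from the email to whatever is attached to it.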

> * your architecture will be something like Kat combined with an RDF graph

yes, but it won't be RDF. to be perfectly honest, for desktop use RDF is 
impractical, lossy and boring. and by "boring" i don't mean "i have the 
innate need to invent something new". i mean "the people who use such a 
system will find it unuseful and therefore boring".

people don't want search tools. they want to look at that photo their aunt 
sent them. this is an important understanding.

> I agree with this:
>
> so we have kat which has lots of code.
> we have tenor which has lots of design.
> there are four layers to be considered, from bottom to top:
> 0. storage
> 1. API
> 2. population
> 3. user interface

yes, i think this is a great starting point (if i may say so myself ;) and am 
happy to see the discussion emerging around points 0 and 1

> to see what a finished product looks like, download
> http://aduna.biz/products/autofocus/index.html

> 0. Storage:
> =For text similarity you will always need a text-index. Point.=
> The text index has to be separated from the database OR included on a
> very deep and tough level (like the FULLTEXT index available in MySQL
> 4.2. This fulltext stuff involved major changes in architecture of mysql
> and of the querying format, which is somehow a big hack you don't want
> to do).
> Much fun to you, if you want to reimplement that: its bone-hard work

which is why i targeted pgsql originally since it has a powerful set of SQL to 
tap and a full text engine to boot. you are VERY correct that we should not 
get into the unnecessary business of writing fulltext indexes.

> The Lucene format is effective for this. Take it as example, or use
> it(there is a c port). It is used by autofocus, look at the folder
> "index" in your filesystem.

this is another option, whereby we would point from the link graph into the 
lucene generated files. the only annoyance is that the opposite direction 
would be more difficult. however, in my original db schema, there was a 
specific pool for "locations" which could easily be used for this as lucene 
could return a URL for the file indexed which would then be used to find our 
place in the graph.
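
a rough sketch of that lookup, assuming a "locations" table keyed by URL. the original schema isn't reproduced in this mail, so the table and column names below are guesses, sqlite3 stands in for the real database, and the URL is simply whatever the text index would hand back for a hit:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- hypothetical schema: graph nodes, untyped links, and a pool of locations
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE links (a INTEGER NOT NULL REFERENCES nodes(id),
                        b INTEGER NOT NULL REFERENCES nodes(id));
    CREATE TABLE locations (url TEXT PRIMARY KEY,
                            node_id INTEGER NOT NULL REFERENCES nodes(id));
""")
db.execute("INSERT INTO nodes VALUES (1, 'photo from aunt')")
db.execute("INSERT INTO nodes VALUES (2, 'email with attachment')")
db.execute("INSERT INTO links VALUES (1, 2)")
db.execute("INSERT INTO locations VALUES ('file:///home/user/photos/aunt.jpg', 1)")


def neighbours_for_url(url):
    """Map a URL returned by the text index to a node, then list its neighbours."""
    row = db.execute(
        "SELECT node_id FROM locations WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return []  # the index knows this file but the graph does not
    node_id = row[0]
    rows = db.execute(
        "SELECT n.label FROM links l "
        "JOIN nodes n ON n.id = CASE WHEN l.a = ? THEN l.b ELSE l.a END "
        "WHERE l.a = ? OR l.b = ? ORDER BY n.label",
        (node_id, node_id, node_id),
    ).fetchall()
    return [r[0] for r in rows]


print(neighbours_for_url("file:///home/user/photos/aunt.jpg"))
```

the same table read the other way (node_id to url) would cover the graph-to-index direction, at the cost of keeping the URLs in sync with what the indexer has seen.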

> =For the graph, you should use RDF=
> and any serialization or storage of it. RDF describes graphs, as you can
> read here:
> http://www.w3.org/TR/rdf-primer/
>
> The advantage is, that RDF is used in RSS, PDF files, the whole Mozilla
> XUL is based on it, DMOZ directory, newsfeeds, .....

and virtually nowhere useful to us =(

> Fact is: before you now begin to discuss on a graph  serialization
> format and its query language and so on, you lose time that you would
> otherwise write code. Its really a waste of time to discuss graph
> programming interfaces while I teach hundreds of students how to use
> existing APIs.

if only i could find an API that does what i want.

> To prove the idea of RDF: it is used in Aduna Autofocus in the other
> directory called "repository". There the structured data (RDF) is stored
> as binary format.

> 2. population
> Thats up to you.
> But I would reuse the KAT extractors/filters for fulltext and expand
> them to also extract core metadata and express it as DublinCore data.

this is my hope in bringing KAT and Tenor together, yes.

> Again, if you two begin discussing about graph serialization and
> standardization, storage and so on,
> it makes my neck hair raise and my stomach ache.

my stomach aches every time i look at the state of the search field, both 
academic and commercial, and see how they completely miss the point of what 
users actually want and what would really enable desktop software as a whole 
to move forward. so much effort and so little understanding of the targets 
they should be aiming at.

i believe the term is "ivory towers"

-- 
Aaron J. Seigo
GPG Fingerprint: 8B8B 2209 0C6F 7C47 B1EA  EE75 D6B7 2EB1 A7F1 DB43

Full time KDE developer sponsored by Trolltech (http://www.trolltech.com)