Anything about Tenor? & creating a content system -> RDF

Kévin Ottens ervin at ipsquad.net
Wed Aug 10 14:21:45 CEST 2005


On Wednesday 10 August 2005 10:44, Leo Sauermann wrote:
> Hi guys,

Hello,

> I want to point out three matters to all of you:
> * text similarity is no big thing

I disagree; it really depends on the result quality you are aiming for.

> * your architecture will be something like Kat combined with an RDF graph

Possibly. I don't know if RDF will be used or not, and I surely won't be the 
one making the decision here. Because of some bias I could have, I think it's 
healthier if I stay out of this kind of decision. I'll surely express 
some thoughts on it from time to time (like in this mail).

> That depends on which competitors you know. Commercial tools like Convera,
> Autonomy, Opentext, Brainfiler,... all support this. It is already so
> normal, that even the Lucene search engine core (as used in beagle)
> supports it. What you call "fingerprint" is usually a Vector Space model
> and what you call language similarity is a TF/IDF matrix, which is
> implemented in Lucene.

Just to be a bit more precise: TF/IDF is far from being a linguistic 
criterion, and it works particularly well with a large amount of textual data.
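
To illustrate the point about corpus size, here is a minimal Python sketch of 
TF/IDF weighting with cosine similarity. It is not Lucene's implementation: 
the function names (tf_idf_vectors, cosine) and the three sample sentences are 
made up for the example, and the weighting is the plain tf * log(N/df) 
variant.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample corpus, deliberately tiny.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell on monday",
]
vecs = tf_idf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two sentences about a cat
sim_02 = cosine(vecs[0], vecs[2])  # unrelated sentences
```

In this toy corpus "on" appears in every document, so its weight is zero and 
the first and third sentences have nothing in common as far as the model is 
concerned. The criterion is purely statistical, and with so few documents the 
idf part carries very little information.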

> The idea is so old, documented and taught in 
> university.

Right, and people are still searching for something better because of the 
unexpected results it sometimes gives, and also because it doesn't work well 
with a small corpus.

> Lucene also has multilingual stemmers and other stuff needed 
> for text similarity (like a broad open source developer community).

As for Lucene itself, I really agree that it's a great tool. I just want to 
point out that it's nice for full-text indexing, but anything dealing with 
"semantics" is doomed to fail given the current state of knowledge IMHO. 
Of course my position surely comes from the fact that I don't think 
semantics can "emerge" from statistical criteria; it's more an interpretative 
process that requires validation by a user.

> = your architecture will be something like Kat combined with an RDF graph 
> =

I agree about the "something like Kat combined with a graph", will it be RDF 
or not... we'll see.

> to see what a finished product looks like, download
> http://aduna.biz/products/autofocus/index.html
> It roughly does what you have in mind, without the interfaces and APIs.

Do we agree that it can only miss a lot of information then?
For example the author of some plain text files I have, etc.
That's really the big advantage I see in Tenor.

> 0. Storage:
> =For text similarity you will always need a text-index. Point.=

I fully agree with this.

> =For the graph, you should use RDF=
> and any serialization or storage of it. RDF describes graphs, as you can
> read here:
> http://www.w3.org/TR/rdf-primer/

That's where I'm not sure I agree.

> The advantage is, that RDF is used in RSS, PDF files, the whole Mozilla
> XUL is based on it, DMOZ directory, newsfeeds, ..... you will find this
> technology all over the place.

Ok, it's all over the place... does that mean it's the best tool for our 
task? That's far from certain; it really depends on the type of relationships 
that will be used.

> Fact is: before you now begin to discuss on a graph  serialization
> format and its query language and so on, you lose time that you would
> otherwise write code. It's really a waste of time to discuss graph
> programming interfaces while I teach hundreds of students how to use
> existing APIs.

I fully agree with this.

> To prove the idea of RDF: it is used in Aduna Autofocus in the other
> directory called "repository". There the structured data (RDF) is stored
> as binary format.
> It is also used by thousands of other projects. google for them.

In that case, I'd like a textual export so that I can form a clearer opinion 
about how RDF is used in Aduna Autofocus. Unfortunately, I can't find such an 
export.

> 1. API
> A generic java api to handle RDF is here:
> http://www-db.stanford.edu/~melnik/rdf/api.html
>
> a widely used java api is this here:
> jena.sf.net
>
> There are C implementations of RDF apis - most of them on SF. redland,
> raptor, threestore, ....

For a good start:
http://librdf.org/

> querying these graphs can be done via various query languages, the best
> is this here:
> http://www.w3.org/TR/rdf-sparql-query/
> (it also supports full text search and you can extend it. We have done
> something like this in the gnowsis project)

Could you give more insight on this?

> 2. population
> That's up to you.
> But I would reuse the KAT extractors/filters for fulltext and expand
> them to also extract core metadata and express it as DublinCore data.

That sounds reasonable to me. Once again, whether to use DublinCore is 
directly tied to whether RDF is used IMHO (as with any other RDF vocabulary, 
of course).

To conclude, I'll clarify one point: I'm not strongly against RDF, but 
before using it, it must be clearly stated which advantages it would bring to 
something like Tenor. My worry about seeing RDF in Tenor is that it's the 
first step toward more semantic web techniques coming in... which then 
generally means using ontologies. That naturally raises the question: how 
will the ontologies be built?

Ontologies are generally considered easy to obtain by a lot of people in the 
semantic web field. But as someone working on ontology building and 
maintenance, I can confirm that it's far from being an easy task: 
it's really time consuming when done by hand, and it's currently impossible 
to automate the process efficiently.

So, what would be the gain from using RDF in Tenor if it's not to process 
ontologies? If it's just about having a way to describe a graph, then other 
solutions should be evaluated as well; nothing guarantees that RDF is the best 
solution in our case.
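
To make "just a way to describe a graph" concrete, here is a minimal Python 
sketch that stores labelled links as plain (subject, predicate, object) 
triples and matches them against a pattern. It is only an illustration, not a 
proposal for Tenor's design and not an RDF API: the TripleStore class, the 
file names, the "relatedTo" link and the author name are all hypothetical, 
and the "dc:creator" and "dc:title" strings merely mirror Dublin Core element 
names.

```python
from collections import namedtuple

# A labelled edge of the graph: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])


class TripleStore:
    """Tiny in-memory graph; a hypothetical stand-in, not Tenor's design."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples, sorted; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.object == obj)
        )


# Hypothetical metadata about two plain text files.
store = TripleStore()
store.add("file:notes.txt", "dc:creator", "Alice")
store.add("file:notes.txt", "dc:title", "Meeting notes")
store.add("file:draft.txt", "dc:creator", "Alice")
store.add("file:draft.txt", "relatedTo", "file:notes.txt")

# Which files were written by Alice?
by_alice = [t.subject for t in store.match(predicate="dc:creator", obj="Alice")]
```

Nothing here needs RDF tooling, which is the point of the question: if this 
is all that is required, simpler representations deserve the same evaluation.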

Regards.

PS: Fortunately you didn't push for OWL yet. ;-)
-- 
Kévin 'ervin' Ottens, http://ervin.ipsquad.net
"Ni le maître sans disciple, Ni le disciple sans maître,
Ne font reculer l'ignorance."