Anything about Tenor? & creating a content system -> RDF

Leo Sauermann leo at gnowsis.com
Wed Aug 10 16:44:59 CEST 2005


Hi guys,

>>* text similarity is no big thing
>>    
>>
>
>I disagree, really depends on the result quality you are aiming.
>  
>
yup, but for your case I would start with stable technology and 
implement the conventional stuff mentioned before. A well-maintained TF/IDF 
matrix can be used for clustering documents and other nice stuff, and 
developers and end users will like it. Something with 20% more 
quality could mean 80% more programming and research effort.
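To make the conventional approach concrete, here is a minimal TF/IDF sketch in plain Java (the corpus, method names and weighting details are my own toy example, not from any particular library):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // term frequency counts for one document
    static Map<String, Integer> termCounts(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc.toLowerCase().split("\\s+"))
            counts.merge(term, 1, Integer::sum);
        return counts;
    }

    // classic tf * idf weight of a term in one document of the corpus
    static double tfIdf(String term, String doc, List<String> corpus) {
        Map<String, Integer> counts = termCounts(doc);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double tf = counts.getOrDefault(term, 0) / (double) total;
        long docsWithTerm = corpus.stream()
                .filter(d -> termCounts(d).containsKey(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "tenor indexes kde documents",
                "kde is a desktop",
                "documents have metadata");
        // "tenor" occurs in only one document, so it is a good
        // discriminator; "kde" occurs in two out of three documents,
        // so its weight is much lower
        System.out.println(tfIdf("tenor", corpus.get(0), corpus));
        System.out.println(tfIdf("kde", corpus.get(0), corpus));
    }
}
```

The per-document weight vectors produced this way are exactly the rows of the TF/IDF matrix you would cluster on.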

>  
>
>>The idea is so old, documented and tought in 
>>university.
>>    
>>
>
>Right, and people still search for something better because of the unexpected 
>results it gives sometime, and also because it doesn't work well with a 
>reduced corpus.
>  
>
ok, that is right. For a smaller corpus and for desktop search you will 
surely find improvements.

>point that it's nice for full-text indexing but anything dealing with 
>"semantic" is currently doomed to fail with the current state of knowledge 
>IMHO. Of course my position surely comes from the fact that I don't think 
>semantic can "emerge" from statistical criteria, it's more an interpretative 
>process that require validation by a user.
>  
>
well, if you look at it closer, you can add metadata (and any triple) to 
a Lucene document. They provide property/value pairs, which is a start. 
A real graph needs something else, but Lucene is quite OK for adding author, 
date, downloadedFrom, lastChange, thisAndThat to documents.
-> I don't want to convince anybody here to use Lucene; it does not 
fulfill all your needs.
But if you can, rip the code apart and copy/paste as much as you can.
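To make the pairs-vs-graph distinction concrete, here is a sketch in plain Java (no Lucene involved; the names and sample data are mine): property/value pairs hang off one document, while a triple names its subject explicitly, so the object of one statement can be the subject of another.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairsVsTriples {
    // Lucene-style: metadata is property/value pairs attached to ONE document
    static Map<String, String> luceneStyleDoc() {
        Map<String, String> doc = new HashMap<>();
        doc.put("author", "Leo");
        doc.put("downloadedFrom", "http://www.kde.org/");
        doc.put("lastChange", "2005-08-10");
        return doc;
    }

    // graph-style: each statement carries its own subject, so the object
    // of one triple ("person:leo") can be the subject of another
    record Triple(String subject, String predicate, String object) {}

    static List<Triple> graphStyle() {
        return List.of(
            new Triple("doc1", "author", "person:leo"),
            new Triple("person:leo", "name", "Leo"), // statement ABOUT the author
            new Triple("doc1", "lastChange", "2005-08-10"));
    }

    public static void main(String[] args) {
        // the pair model cannot say anything about "Leo" himself;
        // the triple model can, because "person:leo" is a node of its own
        System.out.println(luceneStyleDoc().get("author"));
        System.out.println(graphStyle().get(1));
    }
}
```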

>>to see what a finished product looks like, download
>>http://aduna.biz/products/autofocus/index.html
>>It roughly does what you have in mind, without the interfaces and APIs.
>>    
>>
>
>Do we agree that it could only miss a lot of information then?
>Like the author of some pure text files I have, etc.
>That's really the big advantage I see in Tenor.
>  
>
Sure, but in principle these facts can be entered in Aduna Autofocus 
(they just don't provide you the API, but I speak to their developers 
from time to time and will ask them if they could publish APIs).
It won't solve your problem anyway, because Aduna is Java and you are C/C++, 
but the interfaces may be a good source for copy/paste.


>>=For the graph, you should use RDF=
>>and any serialization or storage of it. RDF describes graphs, as you can
>>read here:
>>http://www.w3.org/TR/rdf-primer/
>>    
>>
>
>That's where I'm not sure to agree.
>  
>
ok, that is a good thing to do. Drill down your requirements and your wishes 
for the system and then see why RDF doesn't work; e.g. the performance may 
be bad or the meta-description language (RDFS/OWL) too complicated, etc.

>Ok, it's all over the place... does it means that's the best tool for our 
>task? That's far from being sure, it really depends the type of relationships 
>will be used.
>  
>
no, as RDF is the most generic data format there is. Period. You can 
express anything in RDF, all relationships. Any other language will be 
either more restrictive (meaning that it supports fewer relationships, or 
only relationships X and Y) or it will be isomorphic to RDF (meaning that 
you have reimplemented RDF).

there are some points in RDF that suck, like sequences or lists, but 
these can be handled.

You could even use your own complete format, with your own database mappings etc.
But if you design your whole thing so that the inside is "KAT/Tenor"-specific 
and the outside provides RDF import/export, you might get some good ideas for 
the thing.

>>To prove the idea of RDF: it is used in Aduna Autofocus in the other
>>directory called "repository". There the structured data (RDF) is stored
>>as binary format.
>>It is also used by thousands of other projects. google for them.
>>    
>>
>
>In this case, I'd like to be able to have a textual format in order to have a 
>clearer opinion about RDF usage in Aduna Autofocus. I can't find such an 
>export unfortunately.
>  
>
ok, I'll bug the developers for this one.

>>querying these graphs can be done via various query languages, the best
>>is this here:
>>http://www.w3.org/TR/rdf-sparql-query/
>>(it also supports full text search and you can extend it. We have done
>>something like this in the gnowsis project)
>>    
>>
>
>Could you give more insight on this?
>
>  
>
Full text search interfaces usually look like
google: "tenor kde"  (a)
real query: SELECT ALL websites X WHERE X containsText "tenor" AND X 
containsText "kde" CASE INSENSITIVE  (b)

so to query your engine you will have an end-user format (a) and a fully 
parsed and represented query (b).
In RDF we have SPARQL as the language for (b).
to make a fulltext search we use (just copy/pasted from a code snippet):
SPARQL:
"PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
"SELECT ?x ?label ?type WHERE { GRAPH ?source { \n" +
"  ?x rdfs:label ?label. \n" +
"  ?x rdf:type ?type \n" +
"  FILTER REGEX(?label, \"" + text + "\", \"i\") }} \n";

The answer would be (like in an SQL engine) a table of results, each row 
in the table having bindings for ?x, ?label and ?type, where ?x is a URI 
identifying the document, ?label the title or name of the document, and 
?type a URI identifying the MIME type or other type of the document.

so we parse (a) into (b) in SPARQL (that's actually where I got the code 
from) and then use the results from the SPARQL server.
I use a SPARQL server that uses the MySQL FULLTEXT feature to do the 
FILTER REGEX stuff. We just hacked it so that it ignores the regex 
semantics and uses conventional text similarity for the search.
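A minimal sketch of that (a) -> (b) step in Java (the query template is simplified from the snippet above, and the escaping is my own placeholder, not how our parser actually does it):

```java
public class QueryBuilder {
    // naive (a) -> (b): every whitespace-separated word of the user
    // query becomes one case-insensitive FILTER REGEX on the label
    // (a real parser would also escape regex metacharacters)
    static String toSparql(String userQuery) {
        StringBuilder sparql = new StringBuilder(
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
              + "SELECT ?x ?label WHERE {\n"
              + "  ?x rdfs:label ?label.\n");
        for (String term : userQuery.trim().split("\\s+")) {
            String safe = term.replace("\"", "\\\"");
            sparql.append("  FILTER REGEX(?label, \"")
                  .append(safe)
                  .append("\", \"i\")\n");
        }
        return sparql.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparql("tenor kde"));
    }
}
```

The nice part is that (b) stays the same no matter which backend (MySQL FULLTEXT, Lucene, whatever) actually executes the filters.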

So SPARQL supports both graph querying (querying data and metadata of 
objects and links) and also fulltext search. The implementation is 
independent; do it as you wish.

>>2. population
>>Thats up to you.
>>But I would reuse the KAT extractors/filters for fulltext and expand
>>them to also extract core metadata and express it as DublinCore data.
>>    
>>
>
>That sounds reasonable to me, once again using DublinCore is directly tied to 
>RDF use or not IMHO (like any other RDF vocabulary of course).
>  
>
No, it is not.
Microsoft Word supports Dublin Core without knowing of RDF.
Dublin Core primarily says:
People of earth, if you are a programmer, a bibliographer or any other 
dude working with documents:
if you have the title of a document, name the variable "title" and not "titel" 
or "name". It is "title" and "title" alone. It's a kind of metadata 
standard.

The core elements (title, creator, etc.) are defined here:
http://dublincore.org/documents/dcmi-terms/

RDF is just a pretty neat representation of Dublin Core (DC).
You can even use the DC tags in websites' meta tags, so no RDF there.
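For example, Dublin Core in plain HTML meta tags looks like this (the titles and values are just my example):

```html
<head>
  <title>Tenor design notes</title>
  <meta name="DC.title" content="Tenor design notes">
  <meta name="DC.creator" content="Leo Sauermann">
  <meta name="DC.date" content="2005-08-10">
</head>
```

Same element names as always, but not a single triple in sight.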

>As a conclusion, I'll clarify one point, I'm not strongly against RDF, but 
>before using it, it must be clearly stated which advantages it could give to 
>something like Tenor. My worries about seeing RDF in Tenor, is that it's the 
>first step toward having more semantic web technics entering... which then 
>generally means using ontologies. It naturally raises the question about how 
>will the ontologies be built?
>  
>
Don't worry about ontologies.
Old researchers and high-nosed people try to scare free software developers 
with the word ontology (late at night at campfires), but in real life you 
treat an ontology like any other data format description. You can even 
ignore all metadata standards and say:

well, we don't give a f*** about Dublin Core, let's call it
tenor:title
tenor:author

and it will also be RDF.
the only thing you might want to do (as an ontology) is to say things 
like this:

@prefix tenor: <http://www.kde.org/tenor/metadata/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
<http://www.kde.org/tenor/metadata/> a owl:Ontology.
tenor:title a rdf:Property;
    rdfs:comment "here we put the title of the document. If you are 
displaying documents in a list, use this string as a visual thing for 
the user to understand what they are looking at.";
    rdfs:label "title".

(this is RDF, written in the N3 syntax using Intellidimension's InferEd 
editor, in 3 minutes, but you could use Protege or a plain text file as well)

with this "ontology" of Tenor metadata, programmers would know that 
http://www.kde.org/tenor/metadata/title is the URI identifier for a 
value, and that this value should have the label "title" in the GUI 
(i18n: you could add labels in en, fr, de, ... languages to support 
multilingual GUIs), and that it is useful to look at the other possible 
properties as well.
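For the i18n point, the same property with labels in several languages would look like this in N3 (the French and German strings are just my examples):

```
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix tenor: <http://www.kde.org/tenor/metadata/>.
tenor:title a rdf:Property;
    rdfs:label "title"@en, "titre"@fr, "Titel"@de.
```

A GUI then just picks the label whose language tag matches the user's locale.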

So the ontology comes from you (you write it with an ontology editor or 
a text editor) or from a public ontology place where you can download 
them, like here:

http://www.schemaweb.info

(Dublin Core:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=36
serialized, computer-readable RDF:
http://www.schemaweb.info/webservices/rest/GetRDFByID.aspx?id=36)

For beer drinking (a good relation between people) there is this one, 
defining possible beer-drinking links between people:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=206

for kissing:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=203

and so on.....

>Ontologies are generally considered as easy to obtain by lot of people in the 
>semantic web field. But as someone working on the ontologies building and 
>maintenance topic I can confirm that that's far from being an easy task : 
>it's really time consuming when done by hand, and it's currently impossible 
>to automate the process efficiently.
>  
>
sure, but you will need a metadata format anyhow.
The big ontology work is not sitting down and writing an OWL file; the 
problem is to agree with people like the nifty guys on this mailing list 
whether we now call the thing "title" or "name" or "label". That is 
really the hard work.

So usually I take an existing ontology (like MP3 ID3 tags) and convert 
it to something like this:
https://gnowsis.opendfki.de/cgi-bin/trac.cgi/file/trunk/gnowsis/src/org/gnowsis/adapters/MP3/ontology.rdfs

the discussion work has already been done; you just sit down and rewrite 
it as a new document (which didn't exist for MP3 - there was only an HTML 
page with some text, and a formal description is nicer).

So why not use N3/RDF and RDFS to say what your metadata is like?
You could then write code generators that take the N3 file and generate 
.h and .cpp files automatically,
like we have done in
http://rdf2java.opendfki.de/cgi-bin/trac.cgi

or others have done in
http://ontoware.org/projects/rdfreactor/
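Such a generator can be tiny. Here is a sketch in plain Java (with the property names hardcoded instead of parsed from an N3 file, and the TENOR_ naming scheme is my own invention):

```java
import java.util.List;

public class HeaderGenerator {
    // emits a C++ header with one constant per ontology property, so
    // C++ code can refer to the URIs by name instead of retyping them
    static String generateHeader(String ns, List<String> properties) {
        StringBuilder h = new StringBuilder("#pragma once\n\n");
        for (String prop : properties) {
            h.append("static const char* TENOR_")
             .append(prop.toUpperCase())
             .append(" = \"").append(ns).append(prop).append("\";\n");
        }
        return h.toString();
    }

    public static void main(String[] args) {
        System.out.print(generateHeader(
                "http://www.kde.org/tenor/metadata/",
                List.of("title", "author")));
    }
}
```

Swap the hardcoded list for an N3 parser and the ontology file becomes the single source of truth for both the metadata schema and the C++ constants.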

>So, what would be the gain for using RDF in Tenor if it's not to process 
>ontologies? If it's just about having a way to describe a graph, then other 
>solutions should be evaluated as well, nothing guarantee that RDF is the best 
>solution in our case.
>  
>
So, what would be the gain of using C++ in Tenor if we don't use 
factories/late binding/libraries/freaky C++ directives that were never 
used? It's just a programming language, and C also does the trick. 
Nothing guarantees that C++ is the best solution in our case.

Yes, but the world as a whole is moving towards RDF, and you don't need 
ontologies; you just need the graph representation and good 
import/export to RDF. You need the ideas and the community of 
researchers, hackers and projects.
There may be a moment in time when you look at your product, Tenor, and 
say:

gee, I wish we had used C++ that day in 2005, then we would have 
namespaces now, and plain C is lame.

gee, I wish we had used RDF that day in 2005, then we would have ....

* the freaky GUI from http://www.mindswap.org/
* the cool metadata (kissing and beer drinking) from 
http://www.schemaweb.info
* that lovely worldwide RDF search engine on http://swoogle.umbc.edu/
* that lovely worldwide RDF search engine on 
http://www.semanticwebsearch.com/
* that metadata extractors/file filters from aduna, gnowsis, kowari, 
intellidimension,  ...
* that nice community of developers where i can sleep for free in all 
major towns

.. now, and the triple framework we built from scratch is lame.

>Regards.
>
>PS: Fortunately you didn't push for OWL yet. ;-)
>  
>

yup, now I did. Note that little owl:Ontology thing above. But don't use 
OWL, just use RDF.

cheers
Leo