Anything about Tenor? & creating a content system -> RDF
Leo Sauermann
leo at gnowsis.com
Wed Aug 10 16:44:59 CEST 2005
Hi guys,
>>* text similarity is no big thing
>
>I disagree, it really depends on the result quality you are aiming for.
>
Yup, but for your case I would start with stable technology and
implement the conventional stuff mentioned before. A well-maintained TF/IDF
matrix can be used for clustering documents and other nice stuff, and
developers and end users will like it. Something with 20% more
quality could mean 80% more programming and research effort.
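The conventional TF/IDF weighting mentioned above fits in a few lines; this is a minimal sketch, with a toy corpus and an illustrative function name (not code from any of the projects discussed here):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """One dict per document mapping term -> tf * log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        matrix.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return matrix

docs = ["rdf graph query", "rdf metadata", "fulltext search engine"]
m = tfidf_matrix(docs)
# "rdf" appears in 2 of 3 documents, so it gets weight log(3/2);
# terms unique to a single document get the larger weight log(3).
```

The per-document dicts are the rows of the matrix; cosine distance between them is what a clustering step would consume.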
>
>>The idea is so old, documented and taught in
>>university.
>
>Right, and people still search for something better because of the unexpected
>results it gives sometimes, and also because it doesn't work well with a
>reduced corpus.
>
OK, that is right. For a smaller corpus, and for desktop search, you will
surely find improvements.
>point that it's nice for full-text indexing, but anything dealing with
>"semantics" is currently doomed to fail with the current state of knowledge,
>IMHO. Of course my position surely comes from the fact that I don't think
>semantics can "emerge" from statistical criteria; it's more an interpretative
>process that requires validation by a user.
>
Well, if you look at it closer, you can add metadata (and any triple) to
a Lucene document. They provide property/value pairs, which is a start.
A real graph needs something else, but Lucene is quite OK for adding author,
date, downloadedFrom, lastChange, thisAndThat to documents.
-> I don't want to convince anybody here to use Lucene; it does not
fulfill all your needs.
But if you can, rip the code apart and copy/paste as much as you can.
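Lucene itself is Java, but the property/value idea is language-neutral. Here is a minimal sketch of pairing fulltext with flat metadata fields; the classes and method names are hypothetical and have nothing to do with Lucene's actual API:

```python
class Document:
    """A fulltext body plus flat property/value metadata."""
    def __init__(self, text, **properties):
        self.text = text
        self.properties = properties  # e.g. author, date, downloadedFrom

class Index:
    def __init__(self):
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)

    def search(self, word=None, **props):
        """Match on a fulltext word and/or exact property values."""
        hits = []
        for d in self.docs:
            if word is not None and word not in d.text.lower().split():
                continue
            if all(d.properties.get(k) == v for k, v in props.items()):
                hits.append(d)
        return hits

idx = Index()
idx.add(Document("notes on rdf graphs", author="leo", date="2005-08-10"))
idx.add(Document("kde release plan", author="someone"))
hits = idx.search(word="rdf", author="leo")  # matches only the first document
```

The limitation mentioned above shows immediately: the properties are flat pairs per document, so a real graph (links between documents) needs something beyond this.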
>>to see what a finished product looks like, download
>>http://aduna.biz/products/autofocus/index.html
>>It roughly does what you have in mind, without the interfaces and APIs.
>
>Do we agree that it could only miss a lot of information then?
>Like the author of some pure text files I have, etc.
>That's really the big advantage I see in Tenor.
>
Sure, but in principle these facts can be entered in Aduna AutoFocus
(they just don't provide you the API, but I speak to their developers
from time to time and will ask them if they could publish APIs).
It won't solve your problem in any case, because Aduna is Java and you are C/C++,
but the interfaces may be a good source for copy/paste.
>>=For the graph, you should use RDF=
>>and any serialization or storage of it. RDF describes graphs, as you can
>>read here:
>>http://www.w3.org/TR/rdf-primer/
>
>That's where I'm not sure to agree.
>
OK, that is good to do. Drill down your requirements and your wishes for
the system, and then see why RDF doesn't work: e.g. the performance may
be bad, or the meta-description language (RDFS/OWL) too complicated, ...
>OK, it's all over the place... does that mean it's the best tool for our
>task? That's far from being sure, it really depends on the type of relationships
>that will be used.
>
No, as RDF is the most generic data format there is. Period. You can
express anything in RDF. All relationships. Any other language will
either be restrictive (meaning that it supports fewer relationships, or only
relationships X and Y) or it will be isomorphic to RDF (meaning that you
reimplemented RDF).
There are some points in RDF that suck, like sequences or lists, but
these can be handled.
You could even use your own complete format, with your own database mappings etc.
But if you design your whole thing so that the inside is "KAT/Tenor" enabled
and the outside provides RDF import/export, you might get some good ideas for
the thing.
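The claim that any relationship fits can be made concrete: RDF's data model is just a set of (subject, predicate, object) statements, and pattern matching with wildcards is the core query operation. A toy sketch (illustrative names, not a real triple store):

```python
def match(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard,
    like a variable in a graph query."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

triples = {
    ("doc1", "tenor:title", "Release notes"),
    ("doc1", "tenor:author", "leo"),
    ("doc2", "tenor:author", "leo"),
}
# everything leo authored
leo_docs = {t[0] for t in match(triples, p="tenor:author", o="leo")}
# leo_docs == {"doc1", "doc2"}
```

Any new relationship is just another predicate string; nothing in the store has to change, which is exactly the genericity argument.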
>>To prove the idea of RDF: it is used in Aduna Autofocus in the other
>>directory called "repository". There the structured data (RDF) is stored
>>as binary format.
>>It is also used by thousands of other projects. google for them.
>
>In this case, I'd like to be able to have a textual format in order to have a
>clearer opinion about RDF usage in Aduna AutoFocus. I can't find such an
>export, unfortunately.
>
ok, I'll bug the developers for this one.
>>querying these graphs can be done via various query languages, the best
>>is this here:
>>http://www.w3.org/TR/rdf-sparql-query/
>>(it also supports full text search and you can extend it. We have done
>>something like this in the gnowsis project)
>
>Could you give more insight on this?
>
Full text search interfaces are usually like
google: "tenor kde" (a)
real query: SELECT ALL websites X WHERE X containsText "tenor" AND X
containsText "kde" CASE INSENSITIVE (b)
So to query your engine you will have an end-user format (a) and a fully
parsed and represented query (b).
In RDF we have the SPARQL language for (b).
To make a fulltext search we use (just copy/pasted from a code snippet):
SPARQL:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x ?label ?type WHERE {
  GRAPH ?source {
    ?x rdfs:label ?label .
    ?x rdf:type ?type
    FILTER REGEX(?label, "<search text>", "i")
  }
}
(where <search text> is the user's search string, interpolated by the
surrounding Java code the snippet was taken from)
The answer would be (like in a SQL engine) a table of results, each row
in the table having bindings for ?x, ?label and ?type, where ?x is a URI
identifying the document, ?label the title or name of the document, and
?type a URI identifying the MIME type or other type of the document.
So we parse (a) into (b) in SPARQL (that's actually where I got the code
from) and then use the results from the SPARQL server.
I use a SPARQL server that uses the MySQL FULLTEXT feature to do the
FILTER REGEX stuff. We just hacked it so that it ignores the idea of
REGEX and uses conventional text similarity for the search.
So SPARQL supports both graph querying (querying data and metadata of
objects and links) and also fulltext search. The implementation is
independent - do it as you wish.
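The (a)-to-(b) translation can be sketched as a tiny converter. The prefixes mirror the SPARQL style shown earlier, but the function itself is a hypothetical illustration, not gnowsis code:

```python
def to_sparql(user_query):
    """Turn a Google-style query like 'tenor kde' (a) into a parsed
    SPARQL query (b): one case-insensitive FILTER per term, ANDed by
    appearing in the same group."""
    filters = "".join(
        '  FILTER REGEX(?label, "%s", "i")\n' % term
        for term in user_query.split()
    )
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?x ?label WHERE {\n"
        "  ?x rdfs:label ?label .\n"
        + filters +
        "}"
    )

q = to_sparql("tenor kde")
```

A real parser would also escape regex metacharacters in the terms and handle quoting and operators; this only shows the shape of the translation.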
>>2. population
>>Thats up to you.
>>But I would reuse the KAT extractors/filters for fulltext and expand
>>them to also extract core metadata and express it as DublinCore data.
>
>That sounds reasonable to me, though once again IMHO using DublinCore is
>directly tied to RDF use (like any other RDF vocabulary of course).
>
No, it is not.
Microsoft Word supports DublinCore without knowing of RDF.
Dublin Core primarily says:
People of Earth, if you are a programmer, a bibliographer or any other
dude working with documents:
if you have the title of a document, name the variable "title" and not "titel"
or "name". It is "title" and "title" alone. It's some kind of metadata
standard.
The core elements (title, creator, etc.) are defined here:
http://dublincore.org/documents/dcmi-terms/
RDF is just a pretty neat representation of DublinCore (DC).
You can even use the DC tags in websites' meta-tags, so no RDF there.
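Those non-RDF meta-tags look like the output below; the field values are made up for illustration, and the `DC.` prefix is the usual naming convention for Dublin Core in HTML meta tags:

```python
def dc_meta_tags(**fields):
    """Render Dublin Core fields as plain HTML meta tags - no RDF involved."""
    return "\n".join(
        '<meta name="DC.%s" content="%s">' % (name, value)
        for name, value in sorted(fields.items())
    )

print(dc_meta_tags(title="Tenor design notes", creator="Leo Sauermann"))
# <meta name="DC.creator" content="Leo Sauermann">
# <meta name="DC.title" content="Tenor design notes">
```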
>As a conclusion, I'll clarify one point: I'm not strongly against RDF, but
>before using it, it must be clearly stated which advantages it could give to
>something like Tenor. My worry about seeing RDF in Tenor is that it's the
>first step toward having more semantic web techniques entering... which then
>generally means using ontologies. That naturally raises the question of how
>the ontologies will be built.
>
Don't worry about ontologies.
Old researchers and high-nosed people try to scare free developers with
the word "ontology" (late at night at campfires), but in real life you treat
an ontology like any other data format description. You can even ignore
all metadata standards and say:
well, we don't give a f**** about Dublin Core, let's call it
tenor:title
tenor:author
and it will still be RDF.
The only thing you might want to do (as an ontology) is to say things
like this:
@prefix tenor: <http://www.kde.org/tenor/metadata/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
<http://www.kde.org/tenor/metadata/> a owl:Ontology.
tenor:title a rdf:Property;
  rdfs:comment "here we put the title of the document. If you are
displaying documents in a list, use this string as a visual thing for
the user to understand what the user looks at.";
  rdfs:label "title".
(this is RDF, written in the N3 syntax using Intellidimension's InferEd
editor, in 3 minutes, but you could use Protégé or a text file also)
With this "ontology" of Tenor metadata, programmers would know that
http://www.kde.org/tenor/metadata/title is the URI identifier for a
value, and that this value should have the label "title" in the GUI
(i18n: you could add labels in en, fr, de, ... languages to support
multilingual GUIs), and that it is useful to look at the other possible
properties also.
So the ontology comes from you (you write it with an ontology editor or
a text editor),
or from a public ontology place where you can download them, like here:
http://www.schemaweb.info
(Dublin Core:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=36
serialized, computer-readable RDF:
http://www.schemaweb.info/webservices/rest/GetRDFByID.aspx?id=36)
For beer drinking (a good relation for people) there would be this one,
defining possible beer-drinking links between people:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=206
for kissing:
http://www.schemaweb.info/schema/SchemaDetails.aspx?id=203
and so on...
>Ontologies are generally considered easy to obtain by a lot of people in the
>semantic web field. But as someone working on the ontology building and
>maintenance topic, I can confirm that it's far from being an easy task:
>it's really time consuming when done by hand, and it's currently impossible
>to automate the process efficiently.
>
Sure, but you will need a metadata format anyhow.
The big ontology work is not sitting down and writing an OWL file; the
problem is to agree with people like the nifty guys in this email group
whether we now call the thing "title" or "name" or "label". That is really
the hard work.
So usually I take an existing "ontology" (like MP3 ID3 tags) and convert
it to something like this:
https://gnowsis.opendfki.de/cgi-bin/trac.cgi/file/trunk/gnowsis/src/org/gnowsis/adapters/MP3/ontology.rdfs
The discussion work has already been done; you just sit down and rewrite
it as a new document (which wasn't there for MP3 - there was only an HTML
page with some text. A formal description is nicer).
So why not use N3/RDF and RDFS to say what your metadata is like?
You could then write code generators that take the N3 file and generate
.h and .cpp files automatically,
like we have done in
http://rdf2java.opendfki.de/cgi-bin/trac.cgi
or others have done in
http://ontoware.org/projects/rdfreactor/
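Such a generator can be sketched in a few lines. The simplified N3 subset, the regex, and the emitted class are all illustrative and bear no relation to what rdf2java or RDFReactor actually produce:

```python
import re

# A tiny N3 fragment in the style shown above (illustrative only).
N3 = """
tenor:title a rdf:Property .
tenor:author a rdf:Property .
"""

def generate_cpp(n3_text):
    """Emit C++ accessor stubs for each declared tenor: property.
    This handles only the 'tenor:x a rdf:Property' pattern, not real N3."""
    props = re.findall(r"tenor:(\w+)\s+a\s+rdf:Property", n3_text)
    lines = ["class TenorMetadata {", "public:"]
    for p in props:
        lines.append("    std::string %s() const;" % p)
        lines.append("    void set_%s(const std::string &value);" % p)
    lines.append("};")
    return "\n".join(lines)

header = generate_cpp(N3)
```

A real generator would use an actual N3 parser and also read the rdfs:label and rdfs:comment values to generate documentation, but the principle is this simple.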
>So, what would be the gain of using RDF in Tenor if it's not to process
>ontologies? If it's just about having a way to describe a graph, then other
>solutions should be evaluated as well; nothing guarantees that RDF is the best
>solution in our case.
>
So, what would be the gain of using C++ in Tenor if we don't use
factories/late binding/libraries/freaky C++ directives that were never used?
It's just a programming language and C also does the trick. Nothing
guarantees that C++ is the best solution in our case.
Yes, but the world as a whole is moving towards RDF, and you don't need
ontologies; you just need the graph representation and good
import/export to RDF. You need the ideas and the community of
researchers, hackers and projects.
There may be a moment in time when you look at your product, Tenor, and
say:
gee, I wish we had used C++ that day in 2005, then we would have
namespaces now, and C lames.
gee, I wish we had used RDF that day in 2005, then we would have ...
* the freaky GUI from http://www.mindswap.org/
* the cool metadata (kissing and beer drinking) from
http://www.schemaweb.info
* that lovely worldwide RDF search engine at http://swoogle.umbc.edu/
* that lovely worldwide RDF search engine at
http://www.semanticwebsearch.com/
* the metadata extractors/file filters from aduna, gnowsis, kowari,
intellidimension, ...
* that nice community of developers where I can sleep for free in all
major towns
... now, and the triple framework we built from scratch lames.
>Regards.
>
>PS: Fortunately you didn't push for OWL yet. ;-)
>
Yup, I did that now - note that little owl:Ontology thing above. But don't
use OWL, just use RDF.
cheers
Leo
More information about the Klink mailing list