<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Hi guys,<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <blockquote type="cite">

    <pre wrap="">

* text similarity is no big thing

    </pre>

  </blockquote>

  <pre wrap=""><!---->

I disagree, really depends on the result quality you are aiming.

  </pre>

</blockquote>

Yup, but for your case I would start with stable technology and

implement the conventional stuff mentioned before. A well-kept TF/IDF

matrix can be used for clustering of documents and other nice stuff, and

developers and end users will like it. Something with 20% more

quality could mean 80% more programming and research effort.<br>
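<br>
To make the "conventional stuff" concrete, here is a minimal Python sketch of a TF/IDF matrix and the cosine similarity you would feed into clustering. The tokenizer (lowercase, split on whitespace), the sample documents and the function names are my assumptions for illustration, not KAT or Lucene code.<br>
<pre>
```python
import math
from collections import Counter


def tf_idf_matrix(docs):
    """Return (vocabulary, rows): one TF/IDF weight vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero for empty documents
        rows.append([
            (counts[term] / length) * math.log(len(docs) / df[term])
            for term in vocabulary
        ])
    return vocabulary, rows


def cosine(a, b):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical three-document corpus.
docs = [
    "kde desktop search with tenor",
    "tenor links documents on the kde desktop",
    "recipe for apple pie",
]
vocabulary, rows = tf_idf_matrix(docs)
print(cosine(rows[0], rows[1]))  # the two KDE documents share terms
print(cosine(rows[0], rows[2]))  # no shared terms
```
</pre>
With a corpus this small the weights are crude, which is exactly the reduced-corpus problem raised in the quoted mail.<br>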

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">The idea is so old, documented and tought in 

university.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Right, and people still search for something better because of the unexpected 

results it gives sometime, and also because it doesn't work well with a 

reduced corpus.

  </pre>

</blockquote>

OK, that is right. For a smaller corpus and for desktop search you will

surely find improvements.<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite"><!---->

  <pre wrap="">point that it's nice for full-text indexing but anything dealing with 

"semantic" is currently doomed to fail with the current state of knowledge 

IMHO. Of course my position surely comes from the fact that I don't think 

semantic can "emerge" from statistical criteria, it's more an interpretative 

process that require validation by a user.

  </pre>

</blockquote>

Well, if you look at it closer, you can add metadata (and any triple)

to a Lucene document. Lucene documents provide property/value pairs, which is a

start. A real graph needs something else, but Lucene is quite OK for adding

author, date, downloadedFrom, lastChange, thisAndThat to documents. <br>

-&gt; I don't want to convince anybody here to use Lucene; it does not

fulfill all your needs.<br>

But if you can, rip the code apart and copy/paste as much as you can.<br>
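<br>
A small sketch of what "property/value pairs are a start, a real graph needs something else" means. This is plain Python, not the Lucene API; the file URIs, the field values and the "references" property are made up for illustration.<br>
<pre>
```python
def document_triples(uri, fields):
    """Turn a document's property/value pairs into (subject, predicate, object) triples."""
    return [(uri, prop, value) for prop, value in sorted(fields.items())]


# Hypothetical document; the field names follow the examples in this mail.
fields = {
    "author": "Leo",
    "date": "2005-08-10",
    "downloadedFrom": "http://www.kde.org/",
}
triples = document_triples("file:///home/leo/tenor-notes.txt", fields)

# A link between two documents is also a triple, but its value is another
# node in the graph. That is the part flat property/value pairs do not cover.
triples.append(("file:///home/leo/tenor-notes.txt", "references",
                "file:///home/leo/rdf-primer.html"))

for triple in triples:
    print(triple)
```
</pre>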

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <blockquote type="cite">

    <pre wrap="">to see what a finished product looks like, download

<a class="moz-txt-link-freetext" href="http://aduna.biz/products/autofocus/index.html">http://aduna.biz/products/autofocus/index.html</a>

It roughly does what you have in mind, without the interfaces and APIs.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Do we agree that it could only miss a lot of information then?

Like the author of some pure text files I have, etc.

That's really the big advantage I see in Tenor.

  </pre>

</blockquote>

Sure, but in principle these facts can be entered in Aduna Autofocus

(they just don't provide you the API, but I speak to their developers

from time to time and will ask them if they could publish APIs).<br>

All in all, it won't solve your problem, because Aduna is Java and you are

C/C++.<br>

The interfaces may be a good source for copy/paste.<br>

<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <blockquote type="cite">

    <pre wrap="">=For the graph, you should use RDF=

and any serialization or storage of it. RDF describes graphs, as you can

read here:

<a class="moz-txt-link-freetext" href="http://www.w3.org/TR/rdf-primer/">http://www.w3.org/TR/rdf-primer/</a>

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That's where I'm not sure to agree.

  </pre>

</blockquote>

OK, that is good to do. Drill down into your requirements and your wishes for

the system and then see why RDF doesn't work, e.g. the performance may

be bad or the meta-description language (RDFS/OWL) too complicated, ...<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

Ok, it's all over the place... does it means that's the best tool for our 

task? That's far from being sure, it really depends the type of relationships 

will be used.

  </pre>

</blockquote>

No, as RDF is the most generic data format there is. Period. You can

express anything in RDF. All relationships.&nbsp; Any other language will be

either restrictive (meaning that it supports fewer relationships, or only

relationships X and Y) or it will be isomorphic to RDF (meaning that

you reimplemented RDF).<br>

<br>

there are some points in RDF that suck, like sequences or lists, but

these can be handled.<br>

<br>

You could even use your completely own format, with your own database mappings

etc.<br>

But if you design the whole thing so that it is "KAT/TENOR" enabled on

the inside and provides RDF import/export on the outside, you might get some good

ideas for the thing.<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <blockquote type="cite">

    <pre wrap="">To prove the idea of RDF: it is used in Aduna Autofocus in the other

directory called "repository". There the structured data (RDF) is stored

as binary format.

It is also used by thousands of other projects. google for them.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

In this case, I'd like to be able to have a textual format in order to have a 

clearer opinion about RDF usage in Aduna Autofocus. I can't find such an 

export unfortunately.

  </pre>

</blockquote>

ok, I'll bug the developers for this one.<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <blockquote type="cite">

    <pre wrap="">querying these graphs can be done via various query languages, the best

is this here:

<a class="moz-txt-link-freetext" href="http://www.w3.org/TR/rdf-sparql-query/">http://www.w3.org/TR/rdf-sparql-query/</a>

(it also supports full text search and you can extend it. We have done

something like this in the gnowsis project)

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Could you give more insight on this?

  </pre>

</blockquote>

Full text search interfaces are usually like<br>

google: "tenor kde"&nbsp; (a)<br>

real query: SELECT ALL websites X WHERE X containsText "tenor" AND X

containsText "kde" CASE INSENSITIVE (b)<br>

<br>

So to query your engine you will have an end-user format (a) and a fully

parsed and represented query (b).<br>

In RDF we have the SPARQL language for (b). <br>

To make a fulltext search we use (just copy/pasted from a code snippet):<br>

SPARQL:<br>

"PREFIX rdf:&nbsp;&nbsp;&nbsp;&nbsp; <a class="moz-txt-link-rfc2396E" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#">&lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;</a>&nbsp; "

+<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "PREFIX rdfs:&nbsp;&nbsp;&nbsp;

<a class="moz-txt-link-rfc2396E" href="http://www.w3.org/2000/01/rdf-schema#">&lt;http://www.w3.org/2000/01/rdf-schema#&gt;</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; " +<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "SELECT ?x ?label ?type WHERE {GRAPH ?source { \n" +<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "?x rdfs:label ?label. \n" +<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "?x rdf:type ?type \n" +<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "FILTER REGEX(?label, \""+ text + "\", \"i\")}} \n";<br>

<br>

The answer would be (like in an SQL engine) a table of results, each row

in the table having bindings for ?x, ?label and ?type, where ?x is a

URI identifying the document, ?label the title or name of the document

and ?type a URI identifying the MIME type or other type of the document.<br>
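<br>
To show the shape of that result table, here is a Python sketch that evaluates the same label/type pattern over an in-memory list of triples. It is not a SPARQL engine: the sample triples and type URIs are hypothetical, and the search text is escaped and matched literally, which is stricter than a real REGEX filter.<br>
<pre>
```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def label_search(triples, text):
    """Return one row of bindings per resource whose label matches text."""
    pattern = re.compile(re.escape(text), re.IGNORECASE)
    labels = [(s, o) for s, p, o in triples if p == RDFS_LABEL]
    types = [(s, o) for s, p, o in triples if p == RDF_TYPE]
    rows = []
    for x, label in labels:
        if not pattern.search(label):
            continue
        for subject, rdf_type in types:
            if subject == x:
                rows.append({"x": x, "label": label, "type": rdf_type})
    return rows


# Hypothetical data: two documents with a label and a type each.
triples = [
    ("file:///docs/tenor.txt", RDFS_LABEL, "Tenor design notes"),
    ("file:///docs/tenor.txt", RDF_TYPE, "http://example.org/mime/text/plain"),
    ("file:///docs/pie.txt", RDFS_LABEL, "Apple pie recipe"),
    ("file:///docs/pie.txt", RDF_TYPE, "http://example.org/mime/text/plain"),
]
for row in label_search(triples, "tenor"):
    print(row["x"], row["label"], row["type"])
```
</pre>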

<br>

So we parse (a) to (b) in SPARQL (that's actually where I got the code

from) and then use the results from the SPARQL server.<br>

I use a SPARQL server that uses the MySQL FULLTEXT feature to do the

FILTER REGEX stuff. We just hacked it so that it ignores the idea of

REGEX and uses conventional text similarity for the search.<br>
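<br>
The step from (a) to (b) can be sketched like this in Python. The ContainsText class and both function names are hypothetical; a real front end would also handle quoting, phrases and operators, and would emit SPARQL instead of matching in memory.<br>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainsText:
    """One condition of the parsed query: the document must contain `term`."""
    term: str
    case_insensitive: bool = True


def parse_user_query(user_query):
    """Split an end-user query such as "tenor kde" into AND-ed conditions."""
    return [ContainsText(term) for term in user_query.split()]


def matches(conditions, text):
    """True if `text` satisfies every condition."""
    for condition in conditions:
        haystack, needle = text, condition.term
        if condition.case_insensitive:
            haystack, needle = haystack.lower(), needle.lower()
        if needle not in haystack:
            return False
    return True


conditions = parse_user_query("tenor kde")
print(conditions)
print(matches(conditions, "Tenor links documents on the KDE desktop"))
print(matches(conditions, "Tenor only"))
```
</pre>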

<br>

So SPARQL supports both graph querying (querying data and metadata of

objects and links) and also fulltext search. The implementation is

independent: do it as you wish.<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap=""></pre>

  <blockquote type="cite">

    <pre wrap="">2. population

Thats up to you.

But I would reuse the KAT extractors/filters for fulltext and expand

them to also extract core metadata and express it as DublinCore data.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That sounds reasonable to me, once again using DublinCore is directly tied to 

RDF use or not IMHO (like any other RDF vocabulary of course).

  </pre>

</blockquote>

No, it is not.<br>

Microsoft Word supports Dublin Core without knowing of RDF.<br>

Dublin Core primarily says:<br>

People of earth, if you are a programmer, a bibliographer or any other

dude working with documents, <br>

and you have the title of a document, name the variable "TITLE" and not "titel"

or "name". It is "title" and "title" alone. It's a kind of metadata

standard.<br>

<br>

The core elements (title, creator, etc.) are defined here:<br>

<a class="moz-txt-link-freetext" href="http://dublincore.org/documents/dcmi-terms/">http://dublincore.org/documents/dcmi-terms/</a><br>

<br>

RDF is just a pretty neat representation of Dublin Core (DC).<br>

You can even use the DC tags in a website's meta tags, so no RDF there.<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

As a conclusion, I'll clarify one point, I'm not strongly against RDF, but 

before using it, it must be clearly stated which advantages it could give to 

something like Tenor. My worries about seeing RDF in Tenor, is that it's the 

first step toward having more semantic web technics entering... which then 

generally means using ontologies. It naturally raises the question about how 

will the ontologies be built?

  </pre>

</blockquote>

Don't worry about ontologies. <br>

Old researchers and high-nosed people try to scare free developers with

the word ontology (late night at campfires), but in real life, you

treat an ontology like any other data format description. You can even

ignore all metadata standards and say: <br>

<br>

well, we don't give a f**** about Dublin Core, let's call it <br>

tenor:title <br>

tenor:author<br>

<br>

and it will also be RDF.<br>

the only thing you might want to do (as an ontology) is to say things

like this:<br>

<br>

@prefix tenor: <a class="moz-txt-link-rfc2396E" href="http://www.kde.org/tenor/metadata/">&lt;http://www.kde.org/tenor/metadata/&gt;</a>.<br>

@prefix rdf: <a class="moz-txt-link-rfc2396E" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#">&lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;</a>.<br>

@prefix rdfs: <a class="moz-txt-link-rfc2396E" href="http://www.w3.org/2000/01/rdf-schema#">&lt;http://www.w3.org/2000/01/rdf-schema#&gt;</a>.<br>

@prefix owl: <a class="moz-txt-link-rfc2396E" href="http://www.w3.org/2002/07/owl#">&lt;http://www.w3.org/2002/07/owl#&gt;</a>.<br>

<a class="moz-txt-link-rfc2396E" href="http://www.kde.org/tenor/metadata/">&lt;http://www.kde.org/tenor/metadata/&gt;</a> a owl:Ontology.<br>

tenor:title a rdf:Property;<br>

&nbsp;&nbsp;&nbsp; rdfs:comment "here we put the title of the document. If you are

displaying documents in a list, use this string as a visual thing for

the user to understand what the user looks at.";<br>

&nbsp;&nbsp;&nbsp; rdfs:label "title".<br>

<br>

(This is RDF, written in the N3 syntax using Intellidimension's InferEd

editor, in 3 mins, but you could also use Protege or a text file.)<br>

<br>

With this "ontology" of Tenor metadata, programmers would know that

<a class="moz-txt-link-freetext" href="http://www.kde.org/tenor/metadata/title">http://www.kde.org/tenor/metadata/title</a> is the URI identifier for a

value, and that this value should have the label "title" in the GUI

(i18n: you could add labels in en, fr, de, ... to support

multilingual GUIs) and that it is useful to look at the other possible

properties also.<br>
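<br>
How a GUI could use such labels, sketched in Python. The label table, the German and French strings and the fallback rule (default language first, then the last segment of the URI) are my assumptions; in RDF the labels would be rdfs:label values with language tags.<br>
<pre>
```python
TITLE = "http://www.kde.org/tenor/metadata/title"

# Hypothetical labels per property URI and language code.
LABELS = {
    TITLE: {"en": "title", "de": "Titel", "fr": "titre"},
}


def gui_label(property_uri, language, default_language="en"):
    """Pick the label for a property, falling back to the default language,
    then to the last segment of the URI."""
    by_language = LABELS.get(property_uri, {})
    fallback = property_uri.rstrip("/").rsplit("/", 1)[-1]
    return by_language.get(language, by_language.get(default_language, fallback))


print(gui_label(TITLE, "de"))  # Titel
print(gui_label(TITLE, "ja"))  # title (no Japanese label, falls back to en)
print(gui_label("http://www.kde.org/tenor/metadata/author", "en"))  # author
```
</pre>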

<br>

So the ontology comes from you (you write it with an ontology editor or

a text editor)<br>

&nbsp;or from a public ontology place where you can download them, like here:<br>

<br>

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info">http://www.schemaweb.info</a><br>

<br>

Dublin Core:<br>

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info/schema/SchemaDetails.aspx?id=36">http://www.schemaweb.info/schema/SchemaDetails.aspx?id=36</a><br>

serialized, computer readable RDF:<br>

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info/webservices/rest/GetRDFByID.aspx?id=36">http://www.schemaweb.info/webservices/rest/GetRDFByID.aspx?id=36</a><br>

<br>

For beer drinking (a good relation for people) there would be this one,

defining possible beer drinking links between people:<br>

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info/schema/SchemaDetails.aspx?id=206">http://www.schemaweb.info/schema/SchemaDetails.aspx?id=206</a><br>

<br>

for kissing:<br>

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info/schema/SchemaDetails.aspx?id=203">http://www.schemaweb.info/schema/SchemaDetails.aspx?id=203</a><br>

<br>

and so on.....<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

Ontologies are generally considered as easy to obtain by lot of people in the 

semantic web field. But as someone working on the ontologies building and 

maintenance topic I can confirm that that's far from being an easy task : 

it's really time consuming when done by hand, and it's currently impossible 

to automate the process efficiently.

  </pre>

</blockquote>

Sure, but you will need a metadata format anyhow. <br>

The big ontology work is not sitting down and writing an OWL file; the

problem is to agree with people, like the nifty guys in this email group,

on whether we now call the thing "title" or "name" or "label". That is really

the hard work. <br>

<br>

So usually I take an existing ontology (like MP3 ID3 tags) and convert

it to something like this:<br>

<a class="moz-txt-link-freetext" href="https://gnowsis.opendfki.de/cgi-bin/trac.cgi/file/trunk/gnowsis/src/org/gnowsis/adapters/MP3/ontology.rdfs">https://gnowsis.opendfki.de/cgi-bin/trac.cgi/file/trunk/gnowsis/src/org/gnowsis/adapters/MP3/ontology.rdfs</a><br>

<br>

The discussion work has already been done; you just sit down and

rewrite it as a new document (which wasn't there for MP3: there was

only an HTML page with some text. A formal description is nicer).<br>

<br>

So why not use N3/RDF and RDFS to say what your metadata is like?<br>

You could then write code generators that take the N3 file and generate

.h and .cpp files automatically,<br>

like we have done in <br>

<a class="moz-txt-link-freetext" href="http://rdf2java.opendfki.de/cgi-bin/trac.cgi">http://rdf2java.opendfki.de/cgi-bin/trac.cgi</a><br>

<br>

or others have done in<br>

<a class="moz-txt-link-freetext" href="http://ontoware.org/projects/rdfreactor/">http://ontoware.org/projects/rdfreactor/</a><br>
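<br>
Such a generator does not have to be big. Here is a Python sketch that writes a C++ header with one URI constant per property. It is not how rdf2java or RDFReactor work; the property list stands in for a parsed N3 file, and the function name, the include guard and the generated layout are made up for illustration.<br>
<pre>
```python
def cpp_header(namespace, base_uri, properties):
    """Generate the text of a C++ header with one URI constant per property."""
    guard = namespace.upper() + "_VOCABULARY_H"
    lines = [
        "#ifndef " + guard,
        "#define " + guard,
        "",
        "namespace " + namespace + " {",
    ]
    for name, comment in properties:
        lines.append("// " + comment)
        lines.append('static const char* const %s = "%s%s";'
                     % (name.upper(), base_uri, name))
    lines += ["}", "", "#endif"]
    return "\n".join(lines) + "\n"


# Hypothetical input: (property name, rdfs:comment) pairs read from the ontology.
header = cpp_header(
    "tenor",
    "http://www.kde.org/tenor/metadata/",
    [("title", "title of the document"), ("author", "author of the document")],
)
print(header)
```
</pre>
Whenever the ontology gains a property, you rerun the generator instead of editing the header by hand.<br>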

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

So, what would be the gain for using RDF in Tenor if it's not to process 

ontologies? If it's just about having a way to describe a graph, then other 

solutions should be evaluated as well, nothing guarantee that RDF is the best 

solution in our case.

  </pre>

</blockquote>

So, what would be the gain of using C++ in Tenor if we don't use

Factories/lateBinding/libraries/freakyC++directivesthatwereneverused?

It's just a programming language and C also does the trick. Nothing

guarantees that C++ is the best solution in our case.<br>

<br>

Yes, but the world as a whole is moving towards RDF and you don't need

ontologies, you just need the graph representation and good

import/export to RDF. You need the ideas and the community of

researchers, hackers and projects.<br>

There may be a moment in time when you look at your product, Tenor,

and say:<br>

<br>

gee, I wish we had used C++ that day in 2005, then we would have

namespaces now and C lames.<br>

<br>

gee, I wish we had used RDF that day in 2005, then we would have&nbsp; ....<br>

<br>

* the freaky GUI from <a class="moz-txt-link-freetext" href="http://www.mindswap.org/">http://www.mindswap.org/</a><br>

* the cool metadata (kissing and beer drinking) from

<a class="moz-txt-link-freetext" href="http://www.schemaweb.info">http://www.schemaweb.info</a><br>

* that lovely worldwide RDF search engine on <a class="moz-txt-link-freetext" href="http://swoogle.umbc.edu/">http://swoogle.umbc.edu/</a><br>

* that lovely worldwide RDF search engine on

<a class="moz-txt-link-freetext" href="http://www.semanticwebsearch.com/">http://www.semanticwebsearch.com/</a><br>

* those metadata extractors/file filters from Aduna, gnowsis, Kowari,

Intellidimension,&nbsp; ...<br>

* that nice community of developers where i can sleep for free in all

major towns<br>

<br>

.. now and the triple framework we built from scratch lames.<br>

<br>

<blockquote cite="mid200508101421.56690.ervin@ipsquad.net" type="cite">

  <pre wrap="">

Regards.

PS: Fortunately you didn't push for OWL yet. ;-)

  </pre>

</blockquote>

<br>

Yup, that I did now. Note that little owl thing above. Don't use

OWL, just use RDF.<br>

<br>

cheers<br>

Leo<br>

</body>

</html>