Regarding our language tools: Data

Mon Feb 10 18:24:23 UTC 2014

On Sunday, February 09, 2014 16:52:58 Inge Wallin wrote:

> What I will do is to describe my experiences with some online tools, what I
> learned from them and discuss how we could apply that to our own
> applications.

Ok, I am going to continue this thread by discussing some topics in more 
detail. The first mail was an overview of what I think and why and was not
focussed on specific problems or solutions. This mail and possibly some more in 
this series are going to be.

This mail will be about our data.

I have thought a lot about who would use Parley in particular and come to the
conclusion that the normal school kid is not it. Definitely not in its current
incarnation, but I don't think that s/he would use it even if we made it more
focussed on the learning process.

Instead, I think that people like myself, adults who want to learn a new 
language should be our first target group. We are serious in our studies, we 
want to build a large vocabulary and we have the discipline to see things
through. But we want to be supported in our large-scale learning by the 
application, not use it mostly to create a file for today's homework of 15 
words and then study it during one evening.

I have a lot to say about the interaction design for this too, but I will 
leave that for another mail. Instead I will focus on the data design this 
time.

Now, the above means that we need to support learning large collections of
words efficiently. We also need to support creating them, and copying,
transferring and downloading them.

First of all, I think multimedia is an immensely important part of the 
learning process. I would say it's impossible to learn a word well in an 
unknown language if you can't hear it pronounced by a native speaker. So sound 
files should always be part of a collection. (As a side note, I will use the 
word collection here instead of lesson - a lesson is something you have with a 
teacher; the collection is a ...well... collection of words).

And we need tools to be able to create collections quickly.  But first, let us 
take a step back.

I have looked at the DTD for kvtml, the current XML-based file format for our
language tools. And I notice that there are some things missing. First of all 
there is no way to describe a language. I think we need a separate way of 
describing each language. They vary a lot, not just in their vocabulary but in 
many other aspects too. For one thing, different languages have different word 
classes. Most Western languages use conjugation of verbs. Asian languages
mostly do not. The word classes verb and adjective are present in almost all 
languages but others are not, e.g. particles. Some languages use genders for 
their nouns, others do not. And it goes on.

In this regard, I think it's also important that we be able to describe
variations of languages. For instance, American and British English are
almost completely the same but the spellings of some words differ 
(color/colour). And in some cases there are different words for the same thing 
(lift/elevator). So you should be able to indicate which variation a word 
belongs to.
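
One way to record this is to tag each word with the variants it belongs to.
The following Python sketch is only an illustration: the Word class, its
variants field and the variant codes ("en-US", "en-GB") are hypothetical and
not part of kvtml today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical vocabulary word, tagged with language variants."""
    text: str
    language: str
    # An empty set means the word is valid in every variant of the language.
    variants: frozenset = frozenset()


def words_for_variant(words, variant):
    """Keep words that are common to all variants or tagged with `variant`."""
    return [w for w in words if not w.variants or variant in w.variants]


english = [
    Word("yellow", "en"),
    Word("color", "en", frozenset({"en-US"})),
    Word("colour", "en", frozenset({"en-GB"})),
    Word("elevator", "en", frozenset({"en-US"})),
    Word("lift", "en", frozenset({"en-GB"})),
]

british = [w.text for w in words_for_variant(english, "en-GB")]
american = [w.text for w in words_for_variant(english, "en-US")]
```

A learner of British English would then get "yellow", "colour" and "lift",
while the American variant would give "yellow", "color" and "elevator".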

So we need a way to describe a language to make the UI of Parley relevant to 
the language that we study. For instance having a mode to study conjugation 
when I learn Thai does not make sense because Thai doesn't use conjugation at 
all.
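
To make this concrete, here is a minimal Python sketch of what a language
description could look like and how the UI could use it to decide which
study modes to offer. Everything in it is an assumption for illustration:
the LanguageDescription class, its fields and the mode names do not exist in
Parley or kvtml, and the word class lists are deliberately incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageDescription:
    """Hypothetical per-language description; not part of kvtml today."""
    code: str                  # e.g. "th" or "sv"
    name: str
    word_classes: frozenset    # word classes that exist in this language
    has_conjugation: bool = False
    has_noun_gender: bool = False


def available_study_modes(lang):
    """Return only the study modes that make sense for this language."""
    modes = ["vocabulary"]
    if lang.has_conjugation:
        modes.append("conjugation")
    if lang.has_noun_gender:
        modes.append("noun gender")
    return modes


# Illustrative values only; a real description would need far more detail.
thai = LanguageDescription(
    code="th",
    name="Thai",
    word_classes=frozenset({"noun", "verb", "adjective", "particle"}),
)
swedish = LanguageDescription(
    code="sv",
    name="Swedish",
    word_classes=frozenset({"noun", "verb", "adjective"}),
    has_conjugation=True,
    has_noun_gender=True,
)

thai_modes = available_study_modes(thai)
swedish_modes = available_study_modes(swedish)
```

With a description like this, the conjugation mode would simply not be
offered when the language being studied is Thai.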

I also think we should separate our vocabularies from our collections for 
efficiency reasons. The word "yellow" has the same pronunciation and would use
the same image whether it's part of a collection of colors or if it's
part of a general collection of the 500 most common words. 
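
As a rough sketch of this separation, a collection could hold only
references to vocabulary entries, so that the sound and image for a word are
stored once and shared. The class names, the word ids and the file paths in
this Python example are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VocabularyEntry:
    """One word in the shared vocabulary, stored once with its media."""
    word_id: str
    text: str
    sound_file: str   # hypothetical path
    image_file: str   # hypothetical path


@dataclass
class Collection:
    """A collection only lists word ids; it does not copy the entries."""
    title: str
    word_ids: list


vocabulary = {
    "en:yellow": VocabularyEntry(
        "en:yellow", "yellow", "sounds/en/yellow.ogg", "images/yellow.png"),
    "en:red": VocabularyEntry(
        "en:red", "red", "sounds/en/red.ogg", "images/red.png"),
    "en:water": VocabularyEntry(
        "en:water", "water", "sounds/en/water.ogg", "images/water.png"),
}

colors = Collection("Colors", ["en:yellow", "en:red"])
common = Collection("500 most common words", ["en:yellow", "en:water"])


def resolve(collection, vocab):
    """Look up the full entries for a collection in the shared vocabulary."""
    return [vocab[word_id] for word_id in collection.word_ids]


yellow_in_colors = resolve(colors, vocabulary)[0]
yellow_in_common = resolve(common, vocabulary)[0]
```

Both collections end up pointing at the very same "yellow" entry, so its
sound file and image exist only once.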

Naturally we should still have a file format that supports everything we
support now and allows easy download of a collection, complete with
everything needed for efficient learning. This is the current kvtml format,
and it is good for the end user to download and learn from.

But we should also have a kind of back-office storage of the full vocabulary of 
our supported languages. This could be kept in a central database that could 
be replicated in full or in part to any user. And the user could work on it 
and upload his or her extensions to it. In the end we would have a pretty 
extensive collection of words in many different languages. 

The second part of this central database would be a set of translations. If we 
consider the words as nodes in a graph, then the translations would be the 
edges. Some languages have one-to-one translation of certain words to other 
languages, some don't. For instance, I mentioned in my last mail the example 
of translating "clean" into Thai. But it didn't indicate if it was the verb
"to clean" or if it was the adjective "is clean". In Thai they are different 
words, and also in Swedish.
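
A small Python sketch of the graph idea, under the assumption that each node
carries its word class so that the verb and the adjective "clean" stay
separate. The TranslationGraph class and the node layout (language, text,
word class) are hypothetical, and the Swedish words are only there as an
example.

```python
from collections import defaultdict


class TranslationGraph:
    """Words are nodes, translations are undirected edges between them."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_translation(self, a, b):
        """Record that nodes a and b translate to each other."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def translate(self, node, target_language):
        """Return the direct translations of `node` in `target_language`."""
        neighbours = self._edges.get(node, ())
        return sorted(n for n in neighbours if n[0] == target_language)


# A node is a tuple: (language, text, word class).
graph = TranslationGraph()
graph.add_translation(("en", "clean", "verb"), ("sv", "rengöra", "verb"))
graph.add_translation(("en", "clean", "adjective"), ("sv", "ren", "adjective"))

verb_result = graph.translate(("en", "clean", "verb"), "sv")
adjective_result = graph.translate(("en", "clean", "adjective"), "sv")
missing_result = graph.translate(("en", "clean", "noun"), "sv")
```

Because the word class is part of the node, asking for the verb gives
"rengöra" and asking for the adjective gives "ren"; the two can no longer be
mixed up.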

Now, if we have the above in place, i.e. vocabularies centrally stored
(replicable to users' computers) and a set of translations of words in one
language to another, we should be able to almost autogenerate collections. To
create a collection, you would specify a list of words in the target language
(Thai in my case), say which language you want to go from (e.g. Swedish) and
say to the database tool: Create a collection of these words, complete with
images, sound files, etc. and store it on my hard disk. If some words don't
have translations in the system, the tool could fall back on transitive
translations, e.g. Swedish-English-Thai. Sometimes this would give the wrong
result, but that could be regarded as a bug which could be fixed by just
adding a specific Swedish-Thai translation for that particular word.
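
The transitive lookup could be done with a breadth-first search over the
translation graph, which finds a direct translation before it tries a chain
through another language. This Python sketch is just one possible approach;
the function names, the (language, text) node layout and the "transitive"
flag are assumptions made for the example.

```python
from collections import deque


def find_translation(edges, source, target_language):
    """Breadth-first search from `source` to any word in `target_language`.

    Returns the path of nodes, or None. Direct translations are found
    before transitive ones because shorter paths are explored first.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for neighbour in sorted(edges.get(path[-1], ())):
            if neighbour in seen:
                continue
            if neighbour[0] == target_language:
                return path + [neighbour]
            seen.add(neighbour)
            queue.append(path + [neighbour])
    return None


def build_collection(edges, target_words, source_language):
    """Pair each word to learn with a translation, flagging transitive ones."""
    collection = []
    for word in target_words:
        path = find_translation(edges, word, source_language)
        collection.append({
            "word": word[1],
            "translation": path[-1][1] if path else None,
            # More than two nodes means we went through a third language.
            "transitive": bool(path) and len(path) > 2,
        })
    return collection


# A node is a tuple: (language, text). There is no direct Thai-Swedish edge.
edges = {
    ("th", "สีเหลือง"): {("en", "yellow")},
    ("en", "yellow"): {("th", "สีเหลือง"), ("sv", "gul")},
    ("sv", "gul"): {("en", "yellow")},
}

via_english = build_collection(edges, [("th", "สีเหลือง")], "sv")

# "Fixing the bug" means adding the specific direct translation.
edges[("th", "สีเหลือง")].add(("sv", "gul"))
edges[("sv", "gul")].add(("th", "สีเหลือง"))
direct = build_collection(edges, [("th", "สีเหลือง")], "sv")
```

Entries flagged as transitive are the ones a maintainer would want to review
first, since those are where a wrong result is most likely.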

These collections could then be stored just like the current ones on a GHNS
server or in a Bodega store or anywhere else.

Another good thing with this approach is that we will probably be able to
autogenerate some of the vocabulary metadata (sound/images/...) from places
like Wikimedia and other free resources. And we could create tools, similar to
what our translators use, for maintaining the vocabularies, keeping them up to
date and extending them.

I will stop here and wait for any reactions.

	-Inge