Regarding our language tools: Data
inge at lysator.liu.se
Mon Feb 10 18:24:23 UTC 2014
On Sunday, February 09, 2014 16:52:58 Inge Wallin wrote:
> What I will do is to describe my experiences with some online tools, what I
> learned from them and discuss how we could apply that to our own
Ok, I am going to continue this thread by discussing some topics in more
detail. The first mail was an overview of what I think and why and was not
focussed on specific problems or solutions. This mail and possibly some more in
this series are going to be.
This mail will be about our data.
I have thought a lot about who would use Parley in particular and come to the
conclusion that the normal school kid is not it. Definitely not in its current
incarnation but I don't think that s/he would even if we made it more focussed
on the learning process.
Instead, I think that people like myself, adults who want to learn a new
language should be our first target group. We are serious in our studies, we
want to build a large vocabulary and we have the discipline to see things
through. But we want to be supported in our large-scale learning by the
application, not use it mostly to create a file for today's homework of 15
words and then study it during one evening.
I have a lot to say about the interaction design for this too, but I will
leave that for another mail. Instead I will focus on the data design this
time.
Now, the above means that we need to support learning large collections of
words efficiently. And we need to support creating those too. And
First of all, I think multimedia is an immensely important part of the
learning process. I would say it's impossible to learn a word well in an
unknown language if you can't hear it pronounced by a native speaker. So sound
files should always be part of a collection. (As a side note, I will use the
word collection here instead of lesson - a lesson is something you have with a
teacher; the collection is a ...well... collection of words).
And we need tools to be able to create collections quickly. But first, let us
take a step back.
I have looked at the DTD for kvtml, the current XML-based file format for our
language tools. And I notice that there are some things missing. First of all
there is no way to describe a language. I think we need a separate way of
describing each language. They vary a lot, not just in their vocabulary but in
many other aspects too. For one thing, different languages have different word
classes. Most Western languages use conjugation of verbs. Asian languages
mostly do not. The word classes verb and adjective are present in almost all
languages but others are not, e.g. particles. Some languages use genders for
their nouns, others do not. And it goes on.
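
As a rough illustration of what such a language description could carry, here
is a small Python sketch. It is only a sketch under assumptions: the class
LanguageDescription, its fields and the function practice_modes are made up
for this example and are not part of the kvtml DTD or of Parley, and the word
class lists are deliberately incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageDescription:
    """Hypothetical description of a language (not part of kvtml)."""
    code: str                      # e.g. "th" or "sv"
    name: str
    word_classes: frozenset        # word classes the language has (illustrative)
    has_conjugation: bool = False  # do verbs conjugate?
    genders: tuple = ()            # noun genders; empty if the language has none


def practice_modes(lang):
    """Return only the practice modes that make sense for this language."""
    modes = ["vocabulary"]
    if lang.has_conjugation:
        modes.append("conjugation")
    if lang.genders:
        modes.append("noun gender")
    return modes


thai = LanguageDescription(
    code="th",
    name="Thai",
    word_classes=frozenset({"noun", "verb", "adjective", "particle", "classifier"}),
)
swedish = LanguageDescription(
    code="sv",
    name="Swedish",
    word_classes=frozenset({"noun", "verb", "adjective"}),
    has_conjugation=True,
    genders=("common", "neuter"),
)

print(practice_modes(thai))
print(practice_modes(swedish))
```

With a description like this, an application could decide which practice modes
to offer from the data alone, without hard-coding rules per language.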
In this regard, I think it's also important that we be able to describe
variations of languages. For instance, American and British English are
almost completely the same but the spellings of some words differ
(color/colour). And in some cases there are different words for the same thing
(lift/elevator). So you should be able to indicate which variation a word
belongs to.
So we need a way to describe a language to make the UI of Parley relevant to
the language that we study. For instance having a mode to study conjugation
when I learn Thai does not make sense because Thai doesn't use conjugation at
all.
I also think we should separate our vocabularies from our collections for
efficiency reasons. The word "yellow" has the same pronunciation and would use
the same image whether it's part of a collection of colors or if it's
part of a general collection of the 500 most common words.
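
A minimal Python sketch of this separation, assuming a simple in-memory
layout: the vocabulary stores each word once, together with its media and an
optional language variation, and a collection is just a list of references
into it. The class Word, the id scheme and the file paths are all invented
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One vocabulary entry; all field names here are illustrative."""
    word_id: str
    language: str
    text: str
    variant: str = ""  # e.g. "en-US" or "en-GB"; empty if valid everywhere
    sound: str = ""    # path to a pronunciation recording
    image: str = ""    # path to an illustration


# The vocabulary stores each word, with its media, exactly once.
vocabulary = {
    "en:yellow": Word("en:yellow", "en", "yellow",
                      sound="sounds/en/yellow.ogg", image="images/yellow.png"),
    "en:color": Word("en:color", "en", "color", variant="en-US"),
    "en:colour": Word("en:colour", "en", "colour", variant="en-GB"),
}

# Collections only hold references (word ids) into the vocabulary.
collections = {
    "Colors": ["en:yellow", "en:color"],
    "500 most common words": ["en:yellow"],
}


def resolve(collection_name):
    """Look up the full Word entries for a collection."""
    return [vocabulary[word_id] for word_id in collections[collection_name]]


for word in resolve("Colors"):
    print(word.text, word.variant or "(all variations)", word.sound or "(no sound)")
```

Both collections end up pointing at the very same "yellow" entry, so its sound
and image only need to be stored and maintained in one place.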
Naturally we should still have a file format that supports everything we
support now and that allows easy download of a collection complete with
everything needed for efficient learning. This is the current kvtml format and
it is good for the end user to download and learn from.
But we should also have a kind of back-office storage of the full vocabulary of
our supported languages. This could be kept in a central database that could
be replicated in full or in part to any user. And the user could work on it
and upload his or her extensions to it. In the end we would have a pretty
extensive collection of words in many different languages.
The second part of this central database would be a set of translations. If we
consider the words as nodes in a graph, then the translations would be the
edges. Some languages have one-to-one translation of certain words to other
languages, some don't. For instance, I mentioned in my last mail the example
of translating "clean" into Thai. But it didn't indicate if it was the verb
"to clean" or if it was the adjective "is clean". In Thai they are different
words, and also in Swedish.
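
The graph idea could look roughly like this in Python. This is a sketch under
assumptions: each node is a (language, word, word class) tuple so that the verb
and the adjective "clean" are different nodes, and the function names are made
up. The Swedish words are real; the Thai entries are placeholders, not actual
Thai words.

```python
from collections import defaultdict

# Each node is (language, word, word class); edges are translations.
translations = defaultdict(set)


def add_translation(a, b):
    """Add an undirected translation edge between two word nodes."""
    translations[a].add(b)
    translations[b].add(a)


add_translation(("en", "clean", "verb"), ("sv", "städa", "verb"))
add_translation(("en", "clean", "adjective"), ("sv", "ren", "adjective"))
# Placeholders stand in for the two different Thai words.
add_translation(("en", "clean", "verb"), ("th", "<Thai verb>", "verb"))
add_translation(("en", "clean", "adjective"), ("th", "<Thai adjective>", "adjective"))


def translate(node, target_language):
    """Return the direct translations of a node into one language."""
    return sorted(n for n in translations.get(node, ()) if n[0] == target_language)


print(translate(("en", "clean", "verb"), "sv"))
print(translate(("en", "clean", "adjective"), "sv"))
```

Because the word class is part of the node, asking for a translation of
"clean" forces you to say which "clean" you mean, and the ambiguity goes away.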
Now, if we have the above in place, i.e. vocabularies centrally stored
(replicable to users' computers), and a set of translations of words in one
language to another, we should be able to almost autogenerate collections. To
create a collection, you would specify a list of words in the target language
(Thai in my case), say which language you want to go from (e.g. Swedish) and
say to the database tool: Create a collection of these words, complete with
images, sound bites, etc., and store it on my hard disk. If some words don't have
translations in the system, it could be forced to do transitive translations
e.g. Swedish-English-Thai. Sometimes this would give the wrong result, but
that could be regarded as a bug which could be fixed by just adding a specific
translation Swedish-Thai for that particular word.
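
One possible way to do the transitive lookup is a breadth-first search over the
translation graph, which always prefers the shortest chain and therefore picks
up a direct translation as soon as somebody adds one. The following Python is
only a sketch: the data is a toy example, find_translation is a made-up name,
and the Thai entry is a placeholder rather than a real Thai word.

```python
from collections import deque

THAI_YELLOW = ("th", "<Thai word for yellow>")  # placeholder, not real Thai

# Toy translation graph: (language, word) -> directly linked words.
edges = {
    THAI_YELLOW: {("en", "yellow")},
    ("en", "yellow"): {THAI_YELLOW, ("sv", "gul")},
    ("sv", "gul"): {("en", "yellow")},
}


def find_translation(start, target_language):
    """Breadth-first search for the shortest chain of translations.

    Returns the chain as a list of nodes, or None if no chain exists.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node[0] == target_language:
            return path
        for neighbour in sorted(edges.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# No direct Thai-Swedish edge yet, so the chain goes through English.
print(find_translation(THAI_YELLOW, "sv"))

# "Fixing the bug": add a direct edge, and the shorter chain is used instead.
edges[THAI_YELLOW].add(("sv", "gul"))
edges[("sv", "gul")].add(THAI_YELLOW)
print(find_translation(THAI_YELLOW, "sv"))
```

A collection generator would run this lookup for every word in the list and
then attach the sounds and images stored with each word.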
These collections could then be stored just like the current ones on a GHNS
server or in a Bodega store or anywhere else.
Another good thing with this approach is that we will probably be able to
autogenerate some of the vocabulary metadata (sound/images/...) from places
like Wikimedia and other free resources. And we could create tools similar to
what our translators use for maintaining the vocabularies and keep them up to
date and extend them.
I will stop here and wait for any reactions.