[kde-edu]: KVTML files for Mandarin Chinese

Fri Sep 21 17:11:51 CEST 2007

Jeremy wrote:
> Awesome, I've personally been looking for something
> like this for a while now.  CEDICT comes close for me, but not quite
> as nice as this.

Glad you like them. I created them from my own study materials, and I
imagine many people could find them useful without having to go
through the same trouble.

> Actually, someone
> in #kde-cn did a few kvtml files you might also be interested in.  They are
> in svn at /home/kde/trunk/l10n-kde4/zh_CN/data/kdeedu/kanagram/ .
> They were created for KAnagram's use, so are longer than a word
> per entry (One is Tang Poem, other 13 are chinese idioms).

The idioms sound extremely interesting, I'll have a look into them,
though I'm not at a level where I can invest much of my time into
learning them.

> I see your files are simplified characters, mind if I (or you) convert them
> to  traditional for zh_TW to also enjoy?

Of course I have nothing against that. The reasons why I haven't
invested much time into making the tables for traditional characters
are:

- The HSK tables in traditional characters make little sense as the
HSK is only conducted in simplified characters to my knowledge, so you
have to be able to at least read simplified characters to take the
test. These tables are also only available in simplified characters
and simp->trad conversion is not a trivial thing.

- The frequency tables for traditional characters are slightly
different from those for simplified characters, so I'd have to find
some other source for them.

- I had to manually touch up the automatically generated files to
clean up the duplicate characters and only pick the most common
pronunciation/meaning for a given character (these are meant for
beginners, after all). I don't have the knowledge to make these
decisions for traditional characters, at least not as much as I do for
simplified characters.

If you are interested, I can send you the python scripts I used for
generating these files. One script takes a list of characters (such as
the frequency table) and the utf8 cedict and outputs a tab-separated
value file which you can edit, and another script generates the kvtml
files from the edited tsv file.
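
As a rough illustration of the first step (a minimal sketch, not the
actual script mentioned above), the following assumes the usual CEDICT
line layout of "Traditional Simplified [pinyin] /gloss/gloss/". The
function names, the three-column TSV layout and the idea of emitting
one row per reading (so duplicates can be pruned by hand afterwards)
are all assumptions made for this example.

```python
import re

# Assumed CEDICT line layout: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict(lines):
    """Map each simplified headword to a list of (pinyin, glosses) readings."""
    readings = {}
    for line in lines:
        if line.startswith("#"):  # comment lines in the dictionary file
            continue
        match = CEDICT_LINE.match(line)
        if match is None:  # blank or malformed line: skip it
            continue
        _trad, simp, pinyin, glosses = match.groups()
        readings.setdefault(simp, []).append((pinyin, glosses.replace("/", "; ")))
    return readings


def to_tsv_rows(characters, readings):
    """Yield one tab-separated row per (character, reading), in input order.

    Characters with several readings produce several rows, which can then
    be trimmed manually to the most common pronunciation/meaning.
    """
    for ch in characters:
        for pinyin, glosses in readings.get(ch, []):
            yield "\t".join((ch, pinyin, glosses))


if __name__ == "__main__":
    # Hypothetical sample lines in CEDICT layout, for demonstration only.
    sample = [
        "# sample dictionary",
        "好 好 [hao3] /good/well/",
        "好 好 [hao4] /to be fond of/",
        "學 学 [xue2] /to learn/to study/",
    ]
    for row in to_tsv_rows(["学", "好"], parse_cedict(sample)):
        print(row)
```

With real data, the dictionary lines would be read from the UTF-8
CEDICT file and the character list from the frequency or HSK table;
the second step (TSV to kvtml) is not shown here.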

> Also, are these appropriate for zh_CN and zh_HK locales?
>  If so I'll add them to both in svn.

It's a good question. I used them with LC_CTYPE=zh_CN.utf-8 and they
worked fine.

> Do you mean adding english and or chinese definitions for each entry?  That
> would also be nice I think.

The HSK has the required vocabulary sorted by difficulty levels (A-D).
For this first set, I have only picked out single-character words,
because they are useful for people who are trying to memorise hanzi
characters.

But the HSK tables also include more complex words and phrases (ci)
containing 2-4 characters, which I didn't include. Think of them as
vocabulary lists that can also be learnt/revised using flashcards.

So it wouldn't be an improvement of the files I've submitted, but
additional files with additional vocabulary which is also very
important for beginner learners of Chinese.

> If you use irc, I'd like to discuss these and other possibilities with you
> sometime.  I'm jpwhiting on freenode most of the time.

I haven't really used irc in many years, but I'm usually very fast
with emails and we can gladly discuss things in more detail. I know
how much resources like this can help language learners (having
organised most of my knowledge in Kate :)) so I'm glad when I can
help.

cosmo