[Kde-accessibility] I need help to create a shadow dictionary in Catalan

Peter Grasch me at bedahr.org
Sun Jan 27 09:45:43 UTC 2013


Hi,

On Sat, 2013-01-26 at 13:57 +0100, Antoni Bella Pérez wrote:
>   I am creating a shadow dictionary for the Catalan language (another collaborator 
> will do the same for Spanish) and I have doubts about the format. (We are told that 
> a linguist will review it.) I think this is correct, see below, but I do not know 
> how to add terminal information.
Cool stuff!
You are using the HTK format. For this, the word itself needs to be
uppercased, like so:
ALCANÓ  [Alcanó]  a l k a n O

For this format, the sorting is important. A valid HTK lexicon needs to
be sorted alphabetically, but NOT with a locale-aware sort. The sort
order is simply the byte value of each character; on Linux you can
sort it easily with "LC_ALL=C sort dictionary".
Also note that at least in my experience, UTF-8 does not work reliably
with the HTK. If possible, use ISO-8859-15 instead.
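
If you would rather generate the file from a script, the uppercasing,
byte-order sorting and re-encoding described above could be sketched
roughly like this in Python. The function names and the sample words
are made up for illustration, the transcriptions of the extra words
are placeholders, and the two-space separator just copies the example
line above.

```python
def htk_entry(word, phonemes):
    """Format one entry as 'WORD  [Output]  p h o n e s'."""
    return f"{word.upper()}  [{word}]  {' '.join(phonemes)}"


def htk_lexicon(entries, encoding="iso-8859-15"):
    """Return the lexicon as bytes, sorted by raw byte value."""
    lines = [htk_entry(word, phonemes).encode(encoding)
             for word, phonemes in entries]
    # bytes objects compare by numeric byte value, never by locale rules,
    # which is the same ordering "LC_ALL=C sort" produces
    lines.sort()
    return b"\n".join(lines) + b"\n"


# Hypothetical sample data; only the ALCANÓ line comes from this thread.
sample = [
    ("Zebra", ["z", "e", "b", "r", "a"]),
    ("Àguila", ["a", "g", "i", "l", "a"]),
    ("Alcanó", ["a", "l", "k", "a", "n", "O"]),
]
htk_data = htk_lexicon(sample)
```

Note that the accented "À" ends up after "Z" here, which is exactly
what a locale-aware sort would not do.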

As you already realized, the HTK lexicon format does not allow you to
specify any grammatical information.

Out of the five formats that Simon supports, there are two formats that
do: PLS and Julius.

Julius vocabulary files look like this:
% Terminal
Word   t r a n s c r i p t i o n
Foo    f o:
Bar    b a r
% Next
Foo    f o:
...
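
A minimal sketch of writing that layout from Python, assuming the
words are already grouped by terminal. The function name and the tab
separator are my own choices for illustration, not something Simon
or Julius prescribes in this thread.

```python
def julius_vocabulary(words_by_terminal):
    """Render {terminal: [(word, phonemes), ...]} as Julius-style text."""
    out = []
    for terminal, words in words_by_terminal.items():
        # each group starts with a "% Terminal" header line
        out.append(f"% {terminal}")
        for word, phonemes in words:
            out.append(f"{word}\t{' '.join(phonemes)}")
    return "\n".join(out) + "\n"


# Same toy entries as in the example above.
vocab = {
    "Terminal": [("Foo", ["f", "o:"]), ("Bar", ["b", "a", "r"])],
    "Next": [("Foo", ["f", "o:"])],
}
julius_text = julius_vocabulary(vocab)
```

The same word may appear under several terminals, as "Foo" does here.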

PLS dictionaries use XML files. You can find a lot of examples (that are
even built for Simon) here: http://spirit.blau.in/simon
That site even has a Spanish one, btw. (although I'm not sure about the
quality)

If you're looking to build a dictionary with terminal information, I
think that PLS would probably be the best choice. It's even a W3C
recommendation: http://www.w3.org/TR/pronunciation-lexicon
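
To give an idea of the basic shape of a PLS file, here is a rough
Python sketch that writes one lexeme using only the standard library.
The element names and namespace are the ones from the W3C
recommendation; the "x-sampa" alphabet label, the function name and
the sample entry are placeholders of mine. How Simon's own PLS files
carry terminal information is not shown here, so check that against
the example dictionaries linked above.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def pls_lexicon(entries, lang="ca", alphabet="x-sampa"):
    """Build a PLS document from (grapheme, phoneme string) pairs."""
    # serialize the PLS namespace as the default xmlns
    ET.register_namespace("", PLS_NS)
    root = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": alphabet,          # placeholder label, an assumption
        f"{{{XML_NS}}}lang": lang,     # written out as xml:lang
    })
    for grapheme, phoneme in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = phoneme
    return ET.tostring(root, encoding="unicode")


pls_text = pls_lexicon([("Alcanó", "a l k a n O")])
```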

However, in my opinion, terminal information is really not that
important for Simon dictionaries. For pretty much all setups,
users will specify custom terminals anyway.

If you don't need terminal information, I'd still recommend PLS. But if
that format is too much overhead, just go with SPHINX dictionaries and
skip the terminals. SPHINX dictionaries look like this:
BAR  b a r
WORD  w o r d
WORD(2)  w o: r d
...
(Google for "cmudict" for a large example)
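
If you generate that from a word list, the "(2)" numbering for
alternative pronunciations could be handled along these lines. Again
just a sketch: the function name is made up, and the uppercasing and
two-space separator simply follow the example above.

```python
from collections import defaultdict


def sphinx_dictionary(entries):
    """Render (word, phonemes) pairs; repeated words get (2), (3), ..."""
    seen = defaultdict(int)
    lines = []
    for word, phonemes in entries:
        key = word.upper()
        seen[key] += 1
        # the first pronunciation is bare, later ones get a counter suffix
        label = key if seen[key] == 1 else f"{key}({seen[key]})"
        lines.append(f"{label}  {' '.join(phonemes)}")
    return "\n".join(lines) + "\n"


sphinx_text = sphinx_dictionary([
    ("Bar", ["b", "a", "r"]),
    ("Word", ["w", "o", "r", "d"]),
    ("Word", ["w", "o:", "r", "d"]),
])
```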

Best regards,
Peter


