[Kde-accessibility] I need help to create a shadow dictionary in Catalan

Antoni Bella Pérez antonibella5 at orange.es
Thu Jan 31 20:06:11 UTC 2013


On Sunday, 27 January 2013, at 10:45:43, Peter Grasch wrote:
> Hi,
> 
> On Sat, 2013-01-26 at 13:57 +0100, Antoni Bella Pérez wrote:
> >   I am creating a shadow dictionary for the Catalan language (another
> > collaborator will do the Spanish one) and I have doubts about the format.
> > (We are told that a linguist will review it.) I think this is correct, see
> > below, but I do not know how to add terminal information.
> 
> Cool stuff!
> You are using the HTK format. For this, the word name needs to be
> uppercased like so:
> ALCANÓ  [Alcanó]  a l k a n O
> 
> For this format, the sorting is important. A valid HTK lexicon needs to
> be sorted alphabetically but NOT using locale-aware sort. The sort
> condition is just the ASCII value of the character; on Linux you can
> sort it easily with "LC_ALL=C sort dictionary".
> Also note that at least in my experience, UTF-8 does not work reliably
> with the HTK. If possible, use ISO-8859-15 instead.
> 
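The HTK rules above (uppercased word, byte-order sort, ISO-8859-15 output) can be sketched in Python roughly as follows. This is only an illustration: the function names are made up, the phoneme strings other than the "Alcanó" one quoted above are invented, and the two-space separator simply copies the example line.

```python
def htk_entry(word, phonemes):
    """Format one line as: UPPERCASED  [Original]  p h o n e m e s"""
    return f"{word.upper()}  [{word}]  {' '.join(phonemes)}"


def htk_lexicon_bytes(entries, encoding="iso-8859-15"):
    """Return the whole lexicon as bytes, sorted by raw byte value.

    Sorting the *encoded* lines mimics what "LC_ALL=C sort" does on a
    file stored in that encoding (no locale-aware collation).
    """
    lines = [htk_entry(word, phonemes).encode(encoding)
             for word, phonemes in entries]
    lines.sort()  # bytes objects compare byte by byte
    return b"\n".join(lines) + b"\n"


if __name__ == "__main__":
    # Hypothetical sample entries; only "Alcanó" comes from the thread.
    sample = [
        ("Àger", ["a", "Z", "e"]),
        ("Alcanó", ["a", "l", "k", "a", "n", "O"]),
        ("alba", ["a", "l", "b", "a"]),
    ]
    with open("dictionary", "wb") as handle:
        handle.write(htk_lexicon_bytes(sample))
```

Note that "ÀGER" ends up after "ALCANÓ" here, because the byte for "À" is larger than the byte for "A"; a locale-aware sort would typically place it differently.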
> As you already realized, the HTK lexicon format does not allow you to
> specify any grammatical information.
> 
> Out of the five formats that Simon supports, there are two formats that
> do: PLS and Julius.
> 
> Julius vocabulary files look like this:
> % Terminal
> Word   t r a n s c r i p t i o n
> Foo    f o:
> Bar    b a r
> % Next
> Foo    f o:
> ...
> 
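To illustrate the Julius layout quoted above (a "% Terminal" line followed by the words that belong to it), here is a small Python sketch that groups entries by terminal. The function name and the tab separator are assumptions; the thread only shows the general shape of the file.

```python
def julius_vocabulary(entries):
    """Build Julius-style vocabulary text from (terminal, word, phonemes).

    Words are grouped under a "% <terminal>" header, and terminals appear
    in the order in which they are first seen.
    """
    by_terminal = {}
    for terminal, word, phonemes in entries:
        by_terminal.setdefault(terminal, []).append((word, phonemes))

    lines = []
    for terminal, words in by_terminal.items():
        lines.append(f"% {terminal}")
        for word, phonemes in words:
            lines.append(f"{word}\t{' '.join(phonemes)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same toy words as in the quoted example.
    print(julius_vocabulary([
        ("Terminal", "Foo", ["f", "o:"]),
        ("Terminal", "Bar", ["b", "a", "r"]),
        ("Next", "Foo", ["f", "o:"]),
    ]), end="")
```

The same word may appear under several terminals, as "Foo" does in the quoted example.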
> PLS dictionaries use XML files. You can find a lot of examples (that are
> even built for Simon) here: http://spirit.blau.in/simon
> He even has a Spanish one, btw. (although I'm not sure about the
> quality)

  These dictionaries are rather odd (not good quality, at least for Catalan).

> 
> If you're looking to build a dictionary with terminal information, I
> think that PLS would probably be the best choice. It's even a W3C
> recommendation: http://www.w3.org/TR/pronunciation-lexicon
> 
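For reference, a PLS file could be generated with Python's standard library along these lines. The element names and the namespace come from the W3C recommendation linked above; everything else is an assumption for illustration: the function name, the "x-sampa" alphabet label, space-separated phonemes, and the idea of carrying the terminal in the lexeme's "role" attribute (whether Simon reads it that way should be checked against its importer).

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_pls(entries, lang="ca", alphabet="x-sampa"):
    """Return a PLS lexicon as a string.

    entries: iterable of (word, phonemes, role); role may be None.
    """
    # Serialize PLS elements without a prefix (default namespace).
    ET.register_namespace("", PLS_NS)
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": alphabet, XML_LANG: lang},
    )
    for word, phonemes, role in entries:
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        if role:
            # Assumption: grammatical/terminal information goes here.
            lexeme.set("role", role)
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = word
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = " ".join(phonemes)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_pls([("Alcanó", ["a", "l", "k", "a", "n", "O"], "Noun")]))
```

Comparing the output with one of the existing Simon PLS files before producing a full dictionary would show whether the alphabet and role conventions match.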
> However, in my opinion, terminal information is really not that
> important for Simon dictionaries. For pretty much all setups,
> users will specify custom terminals anyway.
> 

  It will not be quick, but I will use PLS. I should talk to the other people
and see what we do.

> If you don't need terminal information, I'd still recommend PLS. But if
> that format is too much overhead, just go with SPHINX dictionaries and
> skip the terminals. SPHINX dictionaries look like this:
> BAR  b a r
> WORD  w o r d
> WORD(2)  w o: r d
> ...
> (Google for "cmudict" for a large example)
> 
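The SPHINX layout quoted above, including the "WORD(2)" form for a second pronunciation of the same word, can be produced with a few lines of Python. This is a sketch only: the function name is invented, and uppercasing and the two-space separator just follow the quoted example.

```python
from collections import defaultdict


def sphinx_dictionary(entries):
    """Build SPHINX-style dictionary text from (word, phonemes) pairs.

    The first pronunciation of a word is written as WORD; later ones
    are numbered WORD(2), WORD(3), and so on.
    """
    seen = defaultdict(int)
    lines = []
    for word, phonemes in entries:
        key = word.upper()
        seen[key] += 1
        label = key if seen[key] == 1 else f"{key}({seen[key]})"
        lines.append(f"{label}  {' '.join(phonemes)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Same toy words as in the quoted example.
    print(sphinx_dictionary([
        ("bar", ["b", "a", "r"]),
        ("word", ["w", "o", "r", "d"]),
        ("word", ["w", "o:", "r", "d"]),
    ]), end="")
```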

  Thank you for this extensive overview of the possibilities.

  Toni

