[Kde-accessibility] KDE Text-to-speech API 1.0 Draft 1

Bill Haneman Bill.Haneman at sun.com
Thu Apr 15 17:44:37 CEST 2004


Gary Cramblitt wrote:

>I have posted for comment a proposed new KDE Text-to-speech API at the 
>following URL.
>
>http://home.comcast.net/~garycramblitt/oss/apidocs/kttsd/html/classkspeech.html
>
>  
>
I promised some feedback last week; I do apologize but I've been very 
busy and the holiday slowed me down a bit too.

In gnome-speech we have found that a very useful user model is that of 
"speakers" (or you could call them "voices" if you prefer).  These are 
basically agents the user and/or client program sees as "the things that 
are talking".  In many cases there might only be one "default" speaker 
doing all the talking, but it's nice to have the option of more than one.

Clients can share speakers, no problem, but if one client tells a 
speaker to shut up, it silences the other clients' messages too.  
Sometimes that's what you want, sometimes not.

If you have a speech client as complex as a screenreader, you often want 
to "know who you're talking to", so there needs to be a way to select a 
speaker or even a TTS "engine".  The gnome-speech "SynthesisDriver" 
interface is basically the "driver".  So the typical use from a client:

1) query for available speech servers/services (i.e. TTS driver 
backends) with name and version.
2) decide which service to use, and ask about capabilities.

Important capabilities are things like "what named voices do you have?" 
and "what language can you speak?", and "what gender do you emulate?".  
Since drivers can usually support multiple "voices", it's easiest to do 
this by making a request for a certain voice name (since many long-time 
text-to-speech users are accustomed to thinking of their preferred 
voices by name), gender, or language, and having the engine return a 
list of voices which match the criteria.  Then the client selects a 
voice and asks the engine to create a "Speaker" using that voice.  It's 
true that gender can usually be inferred from the voice name, but the 
name and language attributes really matter: the first to the end user, 
and the second to the speech client, which needs to know whether the 
voice will be able to properly speak the text it is about to be sent.
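
In code, that flow looks roughly like this.  This is a C++-flavoured 
sketch: the VoiceInfo fields and the getVoices()/createSpeaker() 
signatures approximate the gnome-speech IDL rather than quoting it, and 
pickVoice() is just an illustrative helper.

    #include <string>
    #include <vector>

    // Illustrative stand-ins for the gnome-speech IDL types.
    enum Gender { GENDER_MALE, GENDER_FEMALE };
    struct VoiceInfo { std::string name; std::string language; Gender gender; };

    class Speaker;                         // sketched further down

    class SynthesisDriver {                // i.e. the TTS "engine"/driver
    public:
        virtual ~SynthesisDriver() {}
        virtual std::string driverName() = 0;
        virtual std::string driverVersion() = 0;
        virtual std::vector<VoiceInfo> getVoices(const VoiceInfo& criteria) = 0;
        virtual Speaker* createSpeaker(const VoiceInfo& voice) = 0;
    };

    Speaker* pickVoice(std::vector<SynthesisDriver*>& drivers)
    {
        // 1) the client has already queried for the installed drivers (each
        //    reporting a name and version) and settled on drivers[0]

        // 2) ask that driver for voices matching our criteria; fields left
        //    at their defaults act as wildcards, so this asks for "any
        //    US-English female voice"
        VoiceInfo wanted;
        wanted.language = "en_US";
        wanted.gender   = GENDER_FEMALE;
        std::vector<VoiceInfo> matches = drivers[0]->getVoices(wanted);

        // 3) bind a Speaker to the first matching voice and hand it back
        return drivers[0]->createSpeaker(matches[0]);
    }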

The Speaker objects themselves have a relatively simple API as well:
   
* get supported parameters
* get/set parameter values
* query a parameter for range and description

[this allows for easy extensibility, support for driver-specific 
parameters, and means we don't need separate get/set methods for 
commonly used parameters: we only need to agree on a set of 
commonly-used parameter names]

* say some text   [long Speaker.say (in string)]
* stop speaking  [boolean Speaker.stop]
* pause speaking (but don't empty the speech queue) [Speaker.wait]

It's also useful to be able to ask if a speaker is busy: [boolean 
Speaker.isSpeaking]
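
Rendered as a C++ sketch (the real gnome-speech IDL uses CORBA sequences 
and a Parameter struct, so treat these signatures as an approximation of 
the shape, not the literal interface):

    #include <string>
    #include <vector>

    struct ParameterRange {            // result of a range/description query
        double      min, max;
        std::string description;
    };

    class Speaker {
    public:
        virtual ~Speaker() {}

        // extensible, string-keyed parameters ("rate", "pitch", "volume", ...)
        virtual std::vector<std::string> getSupportedParameters() = 0;
        virtual double getParameterValue(const std::string& name) = 0;
        virtual bool   setParameterValue(const std::string& name, double v) = 0;
        virtual ParameterRange queryParameter(const std::string& name) = 0;

        // speaking
        virtual long say(const std::string& text) = 0;  // returns a text id
        virtual bool stop() = 0;                        // shut up, flush queue
        virtual void wait() = 0;                        // pause, keep the queue
        virtual bool isSpeaking() = 0;
    };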

Lastly, it's very important to be able to get callbacks/completion 
notification from TTS engines; no client of any size can get along 
without them.  Trust us :-) we've learned that implementing good quality 
screen reading without callbacks is basically impossible, because 
there's no way otherwise to support both serial queuing and interruption 
(and you need both in order to support messages of different levels of 
priority/urgency).

Our API for callbacks is simple:  Speaker.registerSpeechCallback (in 
SpeechCallback)
registers a callback for a particular speaker, so that notifications 
from the speaker in question are delivered to the client's speech 
callback.  As part of the notification, the client gets the "text id" 
corresponding to the return from the Speaker.say() command (you were 
wondering what that was for, weren't you?).  There are basically three 
types of notification:  start, progress, and end; not every TTS engine 
will support all three types of callback, but at the very least, "end" 
notification is important.
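
In the same C++-flavoured shorthand; only registerSpeechCallback(), the 
text id and the start/progress/end distinction come from the API 
described above, while the enum names and the offset argument are 
illustrative:

    enum CallbackType { SPEECH_STARTED, SPEECH_PROGRESS, SPEECH_ENDED };

    class SpeechCallback {
    public:
        virtual ~SpeechCallback() {}
        // textId is the value Speaker::say() returned for the utterance in
        // question, so the client can match notifications to its requests.
        virtual void notify(CallbackType type, long textId, long offset) = 0;
    };

    // A screen-reader-ish client mostly cares about "end":
    class ClientCallback : public SpeechCallback {
    public:
        void notify(CallbackType type, long textId, long /*offset*/)
        {
            if (type == SPEECH_ENDED) {
                // utterance textId is done: pop the next message off the
                // client's own queue and hand it to the speaker
            }
        }
    };

    // registration, once per speaker the client listens to:
    //     speaker->registerSpeechCallback(new ClientCallback());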

I think this is really a pretty simple API: it consists of two object 
types (SynthesisDriver, aka "engine", and Speaker) and one interface 
(SpeechCallback), with a total of about 15 methods, if you include 
separate get/set methods for all attributes.  The fact that it's defined 
in IDL doesn't mean it has to be CORBA, and the Bonobo::Unknown type 
from which the objects derive is almost trivially simple - just 
reference counting and interface query.

I agree that it doesn't make sense for KDE clients to use an explicit 
CORBA API - fortunately nothing of the sort is required in order to map 
nicely onto the gnome-speech APIs.  If we harmonize our APIs from a 
functional perspective, this means:

* more prospects of reusing back-end code for supporting TTS engines;
* easy bridging from one protocol to another;
* easy to create speech services/drivers that support both client 
frontends;
* easy to write clients that can use either back-end depending on the 
platform where they're deployed, since the semantics and logic are the 
same, and only the transport details differ.

The kttsd API as proposed on the web page you list is much more complex 
than this.  Our experience in the gnome-speech world has been that it's 
much better for the client to handle prioritization and interruption, 
and much harder to implement reprioritization and the like on the server 
side.  So I personally think the stop/pause/status calls should not be 
on the TTS "job" or utterance, but only on the voice/agent to which an 
utterance was sent; the client should re-send any text that has been 
purged from a Speaker's queue due to a higher-priority interruption.  We 
have also determined that the client needs to maintain the speech queues 
when dealing with multiple priorities.
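
To make that concrete, a client juggling two priorities might do 
something like this; the queue is purely client-side bookkeeping, not 
part of either API, and the Speaker interface is the one sketched 
earlier:

    #include <deque>
    #include <string>

    std::deque<std::string> pending;   // low-priority queue, kept by the client
    std::string current;               // utterance currently being spoken

    void sayUrgent(Speaker* speaker, const std::string& text)
    {
        if (speaker->isSpeaking()) {
            speaker->stop();               // silences 'current' mid-utterance
            pending.push_front(current);   // client re-sends the purged text
        }
        current = text;
        speaker->say(text);                // the urgent message goes out now
    }

    // On an "end" callback the client pops the next entry off 'pending',
    // stores it in 'current' and hands it to the speaker.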

SupportsMarkup seems useful; in gnome-speech I think a UseMarkup 
parameter would be returned by getSupportedParameters() for 
voices/engines which can do markup.
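
A client would then probe for it like any other parameter, along these 
lines (using the string-keyed parameter sketch from earlier, so the 
details are illustrative):

    #include <algorithm>
    #include <string>
    #include <vector>

    bool enableMarkupIfSupported(Speaker* speaker)
    {
        std::vector<std::string> params = speaker->getSupportedParameters();
        bool hasMarkup = std::find(params.begin(), params.end(),
                                   std::string("UseMarkup")) != params.end();
        if (hasMarkup)
            speaker->setParameterValue("UseMarkup", 1);  // send marked-up text
        return hasMarkup;
    }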

I don't think setFile is particularly useful, since the client that sent 
the file would have to stay alive in order to allow interruption/pausing 
of the speech anyway (using the proposed kttsd API).  Better to let the 
client send the file contents to the engine in substrings - it's very 
low bandwidth anyway.
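
Something along these lines on the client side does the same job (a 
sketch against the Speaker interface from earlier; splitting on blank 
lines is just one reasonable chunking choice):

    #include <fstream>
    #include <string>
    #include <vector>

    // Stream a document to a speaker in paragraph-sized substrings, keeping
    // the returned text ids so progress can be tracked via callbacks and
    // anything that gets purged can be re-sent by the client.
    std::vector<long> sayFile(Speaker* speaker, const std::string& path)
    {
        std::ifstream in(path.c_str());
        std::vector<long> ids;
        std::string line, paragraph;
        while (std::getline(in, line)) {
            if (line.empty() && !paragraph.empty()) {
                ids.push_back(speaker->say(paragraph));
                paragraph.clear();
            } else if (!line.empty()) {
                paragraph += line + " ";
            }
        }
        if (!paragraph.empty())
            ids.push_back(speaker->say(paragraph));
        return ids;
    }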

best regards

Bill

>Please note that this is a high-level API for KDE applications to interface 
>with KTTSD, the KDE Text-to-speech daemon.  It is not the same as the KTTSD  
>Plugin API that is also currently being discussed on this list, although it 
>is related of course.
>
>Some of the links on this page will take you to other pages that represent the 
>internal documentation for KTTSD.  Until I figure out how to keep Doxygen 
>from generating such links, please try to stay on page classkspeech.html in 
>your browser.
>
>Why the new API?
>------------------------
>
>There is a problem with the existing KTTSD API.  Applications currently have 3 
>choices for generating speech from text:
>
>  1.  sayWarning
>  2.  sayMessage
>  3.  setText
>
>sayWarning and sayMessage are intended for short, one sentence messages.  
>KMouth, for example, uses sayMessage.  Users do not have the capability to 
>rewind or replay these messages.  setText permits these capabilities, but 
>only one application at a time can call setText.  If application A calls 
>setText, and before KTTSD has finished speaking, application B calls setText, 
>then application A's speech is clobbered and replaced with application B's 
>text.  (Think in terms of much larger blocks of text.  For example, I'm 
>browsing the web and come across a good article.  I want my computer to read 
>the article to me, while I continue browsing elsewhere.)
>
>While it might have been possible to add a method or two that would have 
>enabled application B to detect that KTTSD was busy servicing application A, 
>I felt this placed an undue burden on application programmers.  Most apps will 
>want to send some text to KTTSD to be spoken and forget it, i.e. set and 
>forget.
>
>Instead, the new API provides for a queue of text jobs, very much like a print 
>queue.  When the setText job of one application is finished speaking, the 
>next job (application B) begins.  Using the KTTSD GUI, the user will be able 
>to pause, stop, rewind, skip, re-order and delete speech jobs.
>
>Note that the new API is 100% backwards compatible with the existing KTTSD 
>API, and therefore should not break any existing applications that are using 
>it.
>
>In addition to solving the problem I mentioned, the new API also offers some 
>enhanced capabilities, such as providing signal feedback to applications.  It 
>should be possible for apps to use these enhancements for doing more complex 
>TTS functionality.
>
>I did take a look at the Gnome Speech API, with the intention of designing a 
>compatible KDE API.  However, IMHO, this was not practical because of GSAPI's 
>heavy reliance on CORBA and its overly complex interface.
>
>I have already implemented much of this new API in code.  Unless there are 
>major objections, I intend to begin committing the new code to CVS in about 
>10 days (next weekend).  (In case you didn't know, KTTSD is currently in the 
>kdenonbeta module.)
>
>Please comment to this mailing list or e-mail me directly.  I look forward to 
>your input.
>
>  
>


