[Kde-accessibility] KDE Text-to-speech API 1.0 Draft 1

Gary Cramblitt garycramblitt at comcast.net
Sat Apr 17 16:21:32 CEST 2004


On Thursday 15 April 2004 11:44 am, Bill Haneman wrote:
> Gary Cramblitt wrote:
> >I have posted for comment a proposed new KDE Text-to-speech API at the
> >following URL.
> >
> >http://home.comcast.net/~garycramblitt/oss/apidocs/kttsd/html/classkspeech.html
>
> I promised some feedback last week; I do apologize but I've been very
> busy and the holiday slowed me down a bit too.

Thank you for your thoughtful comments.

>
> In gnome-speech we have found that a very useful user model is that of
> "speakers" (or you could call them "voices" if you prefer).  These are
> basically agents the user and/or client program sees as "the things that
> are talking".  In many cases there might only be one "default" speaker
> doing all the talking, but it's nice to have the option of more than one.
>
> Clients can share speakers, no problem, but if one client tells a
> speaker to shut up, it silences the other clients' messages too.
> Sometimes that's what you want, sometimes not.

Can you explain this more, please?  Is there a way to control whether speaking 
stops for all clients or not?  If one speaker is stopped, do other speakers 
continue? 

The issue as to whether all speech should stop because an application calls 
pause or stop is something that I'm currently debating with myself.  As 
currently written, the KTTSD API does not permit application A to stop 
application B's speech, but the user can stop or pause all speech in the 
KTTSD control panel.  I'd like to hear other people's opinions.

>
> If you have a speech client as complex as a screenreader, you often want
> to "know who you're talking to", so there needs to be a way to select a
> speaker or even a TTS "engine".  The gnome-speech "SynthesisDriver"
> interface is basically the "driver".  So the typical use from a client:
>
> 1) query for available speech servers/services (i.e. TTS driver
> backends) with name and version.
> 2) decide which service to use, and ask about capabilities.
>
> Important capabilities are things like "what named voices do you have?"
> and "what language can you speak?", and "what gender do you emulate?".
> Since drivers can usually support multiple "voices", it's easiest to do
> this by making a request for a certain voice name (since many long-time
> text-to-speech users are accustomed to thinking of their preferred
> voices by name), gender, or language, and having the engine return a
> list of voices which match the criteria.  Then the client selects a
> voice and asks the engine to create a "Speaker" using that voice.  It's
> true that gender can usually be inferred from the voice name, but the
> name and language attributes are really important; the first one is
> important to the end user, and the second is vital to the speech client
> since it needs to know whether the voice will be able to properly speak
> the text it's about to be sent.

KTTSD already has this sort of capability.  Users configure speech engines, 
picking a voice (which typically does include gender), and language code (en, 
es, etc).  It seems the difference between the KTTSD approach and GSAPI is 
that GSAPI requires the client applications to make these decisions, while in 
KTTSD, the user makes these decisions when they configure KTTSD.  
Applications need only specify which language they require (and even that is 
optional; if they don't specify a language code, a user-specified default is 
chosen.)
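To illustrate the selection logic (a minimal Python sketch, not KTTSD code; the engine and voice names here are invented examples):

```python
# Illustrative only: user-configured engines keyed by language code.
# The engine/voice names are hypothetical, not real KTTSD configuration.
CONFIGURED_ENGINES = {
    "en": {"engine": "Festival", "voice": "kal_diphone"},
    "es": {"engine": "Festival", "voice": "el_diphone"},
}
DEFAULT_LANGUAGE = "en"  # chosen by the user when configuring KTTSD

def select_engine(language_code=None):
    """Pick the engine for a request.  The application may omit the
    language code entirely; the user's default is used in that case,
    and unknown codes also fall back to the default."""
    code = language_code if language_code in CONFIGURED_ENGINES else DEFAULT_LANGUAGE
    return CONFIGURED_ENGINES[code]
```

An application that knows its text is Spanish passes "es"; one that doesn't know passes nothing and simply gets whatever the user configured as the default.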

So I'm wondering why it is important for Gnome applications to pick voices and 
genders.  It seems this places an additional burden on the application 
programmer and can lead to inconsistent implementations, since each 
application will handle user settings/defaults in its own way (if at all).  

>
> The Speaker object themselves have a relatively simple API as well;
>
> * get supported parameters
> * get/set parameter values
> * query a parameter for range and description
>
> [this allows for easy extensibility, support for driver-specific
> parameters, and means we don't need separate get/set methods for
> commonly used parameters: we only need to agree on a set of
> commonly-used parameter names]

I agree.  I'm waiting for the lower-level KTTSD plugin API to be developed 
before finalizing this part of the KTTSD API.  Are the parameters and format 
detailed anywhere in the GSAPI spec?  I don't recall seeing that.

>
> * say some text   [long Speaker.say (in string)]
> * stop speaking  [boolean Speaker.stop]
> * pause speaking (but don't empty the speech queue) [Speaker.wait]
>

If Speaker A is paused or stopped, does Speaker B start speaking if there are 
waiting items in the queue?  What is the difference between stop and pause?  
Does stop discard the text?  If there are additional items in the queue, does 
stop continue with those?

> It's also useful to be able to ask if a speaker is busy: [boolean
> Speaker.isSpeaking]

Yes.

>
> Lastly, it's very important to be able to get callbacks/completion
> notification from TTS engines; no client of any size can get along
> without them.  Trust us :-) we've learned that implementing good quality
> screen reading without callbacks is basically impossible, because
> there's no way otherwise to support both serial queuing and interruption
> (and you need both in order to support messages of different levels of
> priority/urgency).

KTTSD supports priority via sayWarning, sayMessage, and setText.  sayWarning 
is highest priority and gets spoken as soon as the currently speaking 
sentence completes (or job is paused or stopped).  sayMessage is next highest 
priority and gets spoken as soon as the currently speaking paragraph is 
finished.  setText is for regular text.
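The ordering can be sketched with a small priority queue (illustrative Python only, not the KTTSD implementation; the sentence/paragraph boundary handling is simplified to "whenever the next utterance is requested"):

```python
import heapq
import itertools

# Three priority levels as described: warnings preempt at the next
# sentence boundary, messages at the next paragraph boundary, and
# regular text runs last.
WARNING, MESSAGE, TEXT = 0, 1, 2

class SpeechQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO order within a priority

    def say_warning(self, text):
        heapq.heappush(self._heap, (WARNING, next(self._counter), text))

    def say_message(self, text):
        heapq.heappush(self._heap, (MESSAGE, next(self._counter), text))

    def set_text(self, text):
        heapq.heappush(self._heap, (TEXT, next(self._counter), text))

    def next_utterance(self):
        """Called at the next boundary to pick what to speak next."""
        if not self._heap:
            return None
        _, _, text = heapq.heappop(self._heap)
        return text

queue = SpeechQueue()
queue.set_text("Chapter one of a long document.")
queue.say_message("You have new mail.")
queue.say_warning("Battery critically low!")
```

Whatever order the items arrive in, the warning is spoken first, then the message, then the regular text resumes.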

>
> Our API for callbacks is simple:  Speaker.registerSpeechCallback (in
> SpeechCallback)
> registers a callback for a particular speaker, so that notifications
> from the speaker in question are delivered to the client's speech
> callback.  As part of the notification, the client gets the "text id"
> corresponding to the return from the Speaker.say() command (you were
> wondering what that was for, weren't you?).  There are basically three
> types of notification:  start, progress, and end; not every TTS engine
> will support all three types of callback, but at the very least, "end"
> notification is important.

KTTSD API has sentenceStarted and sentenceFinished signals, as well as other 
job-related signals such as textStarted, textFinished, textStopped, etc.  The 
DCOP signal/slot methodology is particularly useful for this because neither 
the app nor KTTSD need exist at the time a signal connection is requested.
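The notification flow is essentially an observer pattern; a toy model (plain Python, not DCOP — the signal names come from the API, but the argument lists are my assumption):

```python
class SignalEmitter:
    """Toy stand-in for DCOP signal delivery: clients connect named
    signals to callables, and the daemon emits them as speech progresses."""
    def __init__(self):
        self._slots = {}

    def connect(self, signal, slot):
        self._slots.setdefault(signal, []).append(slot)

    def emit(self, signal, *args):
        for slot in self._slots.get(signal, []):
            slot(*args)

events = []
kttsd = SignalEmitter()
# Hypothetical arguments: a job number and a sentence sequence number.
kttsd.connect("sentenceStarted", lambda job, seq: events.append(("start", job, seq)))
kttsd.connect("sentenceFinished", lambda job, seq: events.append(("done", job, seq)))

# The daemon would emit these around each synthesized sentence:
kttsd.emit("sentenceStarted", 1, 0)
kttsd.emit("sentenceFinished", 1, 0)
```

In real DCOP the connection is by name across processes, which is what allows either side to come and go independently.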

>
> I think this is really a pretty simple API, it consists of two object
> types: (SynthesisDriver aka "engine", and Speaker), and one interface
> (SpeechCallback), with a total of about 15 methods, if you include
> separate get/set methods for all attributes.  The fact that it's defined
> in IDL doesn't mean it has to be CORBA, and the Bonobo::Unknown type
> from which the objects derive is almost trivially simple - just
> reference counting and interface query.
>
> I agree that it doesn't make sense for KDE clients to use an explicit
> CORBA API - fortunately nothing of the sort is required in order to map
> nicely onto the gnome-speech APIs.  If we harmonize our APIs from a
> functional perspective, this means:
>
> * more prospects of reusing back-end code for supporting TTS engines;
> * easy bridging from one protocol to another;
> * easy to create a speech service/drivers that support both client
> frontends.
> * easy to write clients that can use either back-end depending on the
> platform
> where they're deployed, since the semantics and logic are the same, and
> only the transport details differ.
>
> The kttsd API as proposed on the web page you list is much more complex
> than this.

This is an astonishing statement that makes me wonder if you read the KTTSD 
API at all.  I suppose our differing views of complexity arise because 
KTTSD and GSAPI are trying to achieve different goals.  More on this 
below.

> Our experience in the gnome-speech world has been that it's 
> much better for the client to handle prioritization and interruption,
> and harder to implement server-side reprioritization/etc. so I
> personally think the stop/pause/status calls should not be on the TTS
> "job" or utterance, but only on the voice/agent which an utterance was
> sent to; the client should re-send any text that has been purged from a
> Speaker's queue due to higher-priority interruption.  We have also
> determined that the client needs to maintain the speech queues, when
> dealing with multiple priorities.

See my comments below.

>
> SupportsMarkup seems useful, in gnome-speech I think a UseMarkup
> parameter would be returned by getSupportedParameters() for
> voices/engines which can do markup.

Yes.  I added the supportsMarkup method as a convenience method until the 
parameterized methods can be nailed down.

>
> I don't think setFile is particularly useful, since the client that sent
> the file would have to stay alive in order to allow interruption/pausing
> of the speech anyway (using the kttsd proposed API).  Better to let the
> client send the file contents to the engine in substrings - it's very
> low bandwidth anyway.

setFile is a convenience method that permits apps to speak a text file and 
forget it.  In some cases it can reduce in-memory 
bandwidth.  It will be useful in particular for KDE Service Menus.  (For 
example, it will be possible to write a simple shell script to convert an 
HTML page to plain text (or JSML) and forward it to KTTSD.)  In any case, I 
don't think it hurts.
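A toy model of the set-and-forget semantics (a Python sketch, not the actual implementation): the daemon copies the file contents into its own queue, so the client is free to exit, or even delete the file, immediately after the call:

```python
import os
import tempfile

class Daemon:
    """Toy model of setFile: the daemon reads the file contents into
    its own queue, so the submitting client can go away afterwards."""
    def __init__(self):
        self.queue = []

    def set_file(self, path, language_code=None):
        with open(path, encoding="utf-8") as f:
            self.queue.append((f.read(), language_code))
        # the client returns here and need not stay alive

daemon = Daemon()
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello from a file.")
    path = f.name
daemon.set_file(path, "en")
os.unlink(path)  # the client's file can even disappear afterwards
```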

As to making the KTTSD and GSAPI compatible.

Yes, GSAPI does not require CORBA, but is obviously designed with CORBA in 
mind, while KTTSD is designed with DCOP in mind.  This leads to fundamentally 
different approaches.  When the Gnome programmer thinks about inter-process 
communication, his first instinct is to create an object.  The KDE programmer 
thinks in terms of DCOP messages and signals.

More fundamentally however, I think KTTSD and GSAPI are trying to achieve 
different goals.  KTTSD is trying to make speech synthesis for apps as simple 
as possible, while still permitting control and feedback for more advanced 
speech apps.  GSAPI is exposing a rich set of capabilities to the app and 
leaving many of the decisions and implementation details up to that app.

There are two key areas where this difference is apparent.  GSAPI requires the 
app to configure a speaker with voice (and possibly gender) and language.  In 
KTTSD this is done by the user when they configure KTTSD.  The app need only 
specify the desired language code, if known, and even that is optional.  So, 
to speak some text, a KDE app need only decide whether to call sayWarning, 
sayMessage, or setText/startText and pass the text to be spoken along with a 
language code (if known).  Hence my astonishment at your statement that the 
KTTSD API is "complex".  

GSAPI requires the app to manage serialization, interruption, queueing (and 
re-queueing).  In KTTSD, these functions are handled by KTTSD.  Applications 
need only decide the urgency of a message.  KTTSD provides a central control 
panel where users can pause, stop, restart, re-order, and delete speech jobs.  
Applications can control these functions if they need to, but my underlying 
assumption is that most apps won't want or need to.  It is difficult for me 
to see how burdening the application with these functions is useful, and 
coordination between multiple apps could become a nightmare.  You say that 
experience showed the necessity for this.  It might be helpful to me if you 
elaborated on this more.
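For concreteness, here is a toy model of the centralized job control I have in mind (illustrative Python; the method names are invented, not the KTTSD job interface):

```python
class JobManager:
    """Toy model of centralized job control: apps only submit jobs;
    the user (via a control panel) can pause, resume, reorder, or
    delete any job without the submitting app's involvement."""
    def __init__(self):
        self.jobs = []          # each job is [job_id, text, state]
        self._next_id = 1

    def submit(self, text):
        job_id = self._next_id
        self._next_id += 1
        self.jobs.append([job_id, text, "queued"])
        return job_id

    def pause(self, job_id):
        self._find(job_id)[2] = "paused"

    def resume(self, job_id):
        self._find(job_id)[2] = "queued"

    def move_to_front(self, job_id):
        job = self._find(job_id)
        self.jobs.remove(job)
        self.jobs.insert(0, job)

    def remove(self, job_id):
        self.jobs.remove(self._find(job_id))

    def _find(self, job_id):
        return next(j for j in self.jobs if j[0] == job_id)

panel = JobManager()
first = panel.submit("Long document text")
second = panel.submit("Another app's text")
panel.move_to_front(second)   # user reorders from the control panel
panel.pause(first)            # user pauses a job; the app is not involved
```

The point of the sketch is that the submitting applications do nothing after submit(); all the control lives in one place.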

Finally, I agree that were the KTTSD API and GSAPI compatible, it would 
facilitate interoperability between apps running on "foreign" desktops.  
However, in practice, I doubt that programmers will implement this very often, 
and given the different goals and approaches, bending KTTSD to fit GSAPI 
sacrifices too much.

>
> best regards
>
> Bill

Thanks again for your feedback.

-- 
Gary Cramblitt (aka PhantomsDad)
