[Kde-accessibility] Proposal (with implementation) for synthesizer interface

Gary Cramblitt garycramblitt at comcast.net
Wed Mar 31 02:42:50 CEST 2004


On Tuesday 30 March 2004 1:20 pm, Olaf Jan Schmidt wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Our extended API could maybe look like this:
>
> /**
>  * This is the template class for all the (processing) plug-ins,
>  * like Festival, FreeTTS or Command
>  */
> class PlugInProc : public QObject{
>     Q_OBJECT
>
>     public:
>         /**
>          * Constructor
>          */
>         PlugInProc( QObject *parent = 0, const char *name = 0);
>
>         /**
>          * Destructor
>          */
>         virtual ~PlugInProc();
>
>         /**
>          * Initialize the speech plugin
>          * @param lang   The standard language code (e.g. "de" or "en_US")
>          * @param config   The object for storing configuration settings
>          */
>         virtual bool init(const QString &lang, KConfig *config);

In order to reduce the delay between the first sentence and the start of 
sound, should the plugin load the speech engine on this call?  In most cases 
you probably would want it to, but if you have multiple plugins configured, 
maybe not.  Perhaps a per-plugin configuration option ("Load speech engine on 
kttsd startup?").
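
Something like this inside init(), perhaps (just a sketch; the config group 
and key names are made up):

  bool PlugInProc::init(const QString &lang, KConfig *config)
  {
      config->setGroup("PlugIn");  // hypothetical group name
      // Hypothetical per-plugin option; default to loading eagerly.
      m_loadOnStartup = config->readBoolEntry("LoadEngineOnStartup", true);
      if (m_loadOnStartup)
          loadEngine(lang);   // hypothetical helper; cuts first-sentence delay
      // otherwise defer loadEngine() until the first sayText() call
      return true;
  }
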
>
>         /**
>          * Add a text to the queue of texts that are to be spoken
>          * @param text   The text to be spoken
>          */
>         virtual void sayText(const QString &text);

As we discussed on IRC, I don't think the plugins should be responsible for 
queue management.  That's the job of kttsd.  At most, a plugin might need to 
keep track of three "sentences" (buffers):

  3.  The wave file containing a synthesized sentence and currently being 
spoken on the audio device (arts).
  2.  A sentence currently being synthesized to a wave file.
  1.  A sentence waiting to be sent to the synthesizer.

When a sentence moves out of state 1, the plugin should request another 
sentence from kttsd by emitting a Ready() signal.
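
In code, the plugin might track these with something like this (a sketch; the 
member names are mine):

  // Hypothetical pipeline buffers inside the plugin.
  QString m_waitingSentence;    // state 1: waiting to go to the synthesizer
  QString m_synthingSentence;   // state 2: being synthesized to a wave file
  QString m_speakingWaveFile;   // state 3: wave file playing on audio device

  void PlugInProc::advancePipeline()
  {
      // ...move state 2 to 3 when synthesis finishes, and state 1 to 2 when
      // the synthesizer is free...
      if (m_waitingSentence.isNull())
          emit Ready();         // ask kttsd for the next sentence
  }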

>
>         /**
>          * Start the actual talking
>          */
>         virtual void startTalking();

I don't think this is needed.  The plugin should assume that anything sent to 
it is to be spoken ASAP, but see below.

>
>         /**
>          * Stop synthesizing as fast as possible,
>          * delete any buffered speech.
>          */
>         virtual void stopTalking();

This stops the audio output.  The question is what to do with the sentences 
in states 1 through 3.  stopTalking will normally be called because the user 
clicked Pause or Stop in the GUI.  If they click Resume, it would be a shame 
to throw away the work that has been done.  On the other hand, the user may 
wish to back up a sentence or restart from the beginning.  So here's what I 
would suggest:

  /**
   * Stop audio output as fast as possible.
   * Continue speech synthesis, but do not send to audio device.
   * Do not request another sentence from kttsd until startTalking is called.
   */
  virtual void stopTalking();

  /**
   * Stop audio output as fast as possible (if in progress).
   * Stop synthesis (if in progress).
   * Clear all buffers.
   */
  virtual void clear();

  /**
   *  Resume (or start) processing.  If buffers contain sentences, speak them,
   *  otherwise emit Ready() signal to request next sentence.
   */
  virtual void startTalking();

  /**
   * sayText is called only in response to a Ready() signal from the plugin.
   * The text passed is the next sentence to be spoken.
   */
  virtual void sayText(const QString &text);
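
Putting those together, the revised interface might look like this (a sketch 
only; the Ready() signal is as discussed above):

  class PlugInProc : public QObject
  {
      Q_OBJECT

      public:
          PlugInProc(QObject *parent = 0, const char *name = 0);
          virtual ~PlugInProc();
          virtual bool init(const QString &lang, KConfig *config);
          virtual void sayText(const QString &text);
          virtual void startTalking();
          virtual void stopTalking();
          virtual void clear();

      signals:
          void Ready();   // plugin is ready for the next sentence
  };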

So the normal sequence of events would be:

  kttsd calls startTalking
  Ready() signal is received from plugin.
  kttsd calls sayText
  plugin begins synthesizing first sentence and emits Ready() signal.
  kttsd calls sayText
  plugin sends synthesized wave file to audio device, starts synth of next 
sentence and emits Ready() signal
  kttsd calls sayText
  etc.

  User clicks Pause
  kttsd calls stopTalking()
  User clicks Resume
  kttsd calls startTalking
  plugin continues audio output and synthesis of next sentence (if not already 
completed)
  when state 1 buffer is empty, plugin emits Ready() signal
  kttsd calls sayText
  etc.

  User clicks Pause
  kttsd calls stopTalking()
  User clicks Restart, PreviousSentence, or resets the text queues.
  kttsd calls clear()
  kttsd calls startTalking()
  plugin emits Ready() signal
  kttsd calls sayText
  etc.
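
On the kttsd side this only needs the Ready() signal wired to a slot that 
feeds the next sentence (a sketch; the slot and queue names are made up):

  connect(plugin, SIGNAL(Ready()), this, SLOT(sendNextSentence()));

  void KttsD::sendNextSentence()
  {
      if (!m_sentences.isEmpty()) {
          plugin->sayText(m_sentences.first());
          m_sentences.pop_front();   // m_sentences is a QStringList
      }
  }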


>
>         /**
>          * Set a speech synthesis parameter
>          * @param name   The name of the parameter
>          * @param value   The value of the parameter
>          */
>         virtual void sayText(const QString &name, const QString &value);

I think you meant 

         virtual void setParameter(const QString &name, const QString &value);

There is a need for applications using kttsd to obtain information about a 
plugin/synthesis engine's capabilities.  For example, I'm assuming some level 
of support for speech markup languages such as Java Speech Markup Language 
(JSML), Sable, or VoiceXML.  The application needs to know if the plugin can 
support markup languages and which ones.  So my point is that synthesis 
parameters are two way.  kttsd needs to be able to set parameters, but it 
also needs to be able to request them from the plugin.  This needs more work.  
I'm thinking some sort of data structure similar to the one in the GNOME 
Speech API might be appropriate.
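
For instance, alongside setParameter there could be something like this (a 
very rough straw man):

  // Straw-man capability structure, loosely modelled on GNOME Speech.
  struct PlugInCapabilities
  {
      QStringList markupLanguages;   // e.g. "JSML", "Sable", "VoiceXML"
      QStringList voices;            // voices the engine provides
      bool canSetPitch;
      bool canSetSpeed;
  };

  virtual QString getParameter(const QString &name);
  virtual PlugInCapabilities getCapabilities();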

Instead of a Ready() signal, it might be simpler if the plugin called a 
getNextSentence method in kttsd to retrieve the next sentence.  If we assume 
that plugins run in their own thread, that might avoid blocking issues.
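
The pull model in a plugin thread could be as simple as this (a sketch; 
getNextSentence would block until a sentence is available or return 
QString::null to stop, and the helpers are hypothetical):

  void PlugInProc::run()   // assuming the plugin also derives from QThread
  {
      QString sentence;
      while (!(sentence = m_kttsd->getNextSentence()).isNull()) {
          QString waveFile = synthesize(sentence);  // hypothetical helper
          playOnAudioDevice(waveFile);              // hypothetical helper
      }
  }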

In addition to the Ready() signal I already mentioned, the plugin should also 
emit signals in response to feedback from the audio device and from the 
synthesis engine.  For example, it should emit an error signal if either one 
reports an error.

Also, all three markup languages I mentioned have a feature called 
"markers".  Festival, for instance, reports the marker name when it processes 
a marker.  The plugin should pass these on to kttsd as signals.  I'm not sure 
what kttsd will do with them right now, probably just pass them on to the 
application, but we should plan for them.
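
So besides Ready(), the plugin might declare signals along these lines (the 
names are made up):

  signals:
      // The synthesis engine or the audio device reported a problem.
      void error(const QString &message);
      // A marker embedded in JSML/Sable/VoiceXML text was reached.
      void markerSeen(const QString &markerName);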

One other thing.  Since synthesis engines can and do crash, the plugin should 
be prepared for that and attempt to handle it gracefully.  For instance, on 
the first crash, simply restart the engine and try again by resending the 
sentence in state 2.  If that causes a crash, discard the sentence in state 2 
and restart the engine, sending it the sentence in state 1.  If that produces 
a crash, stop altogether until clear() is called, then try again when 
startTalking is called.  (I'm assuming that the user will have noticed the 
interrupted speech and done something to fix the situation.)
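
A rough sketch of that retry logic (the counter, helpers, and buffer names 
are hypothetical, reusing the state-1/state-2 buffers from above):

  // Called when the synthesis engine dies unexpectedly.
  void PlugInProc::engineCrashed()
  {
      ++m_crashCount;
      if (m_crashCount == 1) {
          restartEngine();
          synthesize(m_synthingSentence);   // retry the state-2 sentence
      } else if (m_crashCount == 2) {
          // Discard state 2 and promote the state-1 sentence.
          m_synthingSentence = m_waitingSentence;
          m_waitingSentence = QString::null;
          restartEngine();
          synthesize(m_synthingSentence);
      } else {
          // Give up until clear() and startTalking() are called again.
          stopTalking();
      }
  }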

Hope I've made at least some sense.

-- 
Gary Cramblitt (aka PhantomsDad)

