[Kde-accessibility] KDE Speech API Draft 2 and new KTTSD

Gary Cramblitt garycramblitt at comcast.net
Fri May 21 05:47:01 CEST 2004


Thank you very much for the feedback!

On Thursday 20 May 2004 12:16 pm, Olaf Jan Schmidt wrote:

> There are mainly three types of usage for speech synthesis:
>
> 1. Speaking and navigating through whole texts (can be interrupted by
> messages and screen reader speech)
> 2. Speaking single messages (can be interrupted by screen reader speech)

Why would you want to interrupt a single message with screen reader?

> 3. A screen-reader reading out whatever happens on the screen (can be cut
> off by new screen reader speech)

If "cut off", is the speaking text re-queued or is it canceled?

>
> The suggested API is very feature-rich for the first two uses, but the
> third use is not covered. 

What is not covered?  I admit I have zero experience with screen readers, but 
for the sake of discussion, let's imagine the following:  The screen reader 
is reading a page of text (let's say a web page).  Focus moves to a button, 
so the screen reader wants to pause speaking and speak the label (or name) of 
the button.  After speaking the button, screen reader wants to continue 
speaking the page where it left off.  Focus moves to a title bar, so screen 
reader wants to pause speaking and speak the title bar contents.  While 
speaking the title bar contents, focus goes to an image, so screen reader 
wants to abandon speaking the title bar and speak the "alt" text of the 
image.  All of this can be accomplished using the current API as follows:
We assume the screen reader has two "levels" of talking -- a background job 
and a foreground job that represents the control on the screen that has 
current focus.  The background job can be pre-empted by the foreground job, 
but resumes when foreground is no longer speaking.  Foreground jobs can be 
pre-empted by another foreground job, but in this case the pre-empted job is 
canceled.  (Probably over-simplified, but for sake of discussion...) So we 
keep track of two job numbers -- backJobNum and foreJobNum.  Here's what the 
screen reader's code might look like:

uint backJobNum		// Job Number of background
uint foreJobNum		// Job Number of foreground
// Queue and start the page of text using default language.
backJobNum = setText(<text of the page>, Null)
startText(backJobNum)

// <In response to button gotfocus signal>
// Pause the page, queue and start the button.
pauseText(backJobNum)
foreJobNum = setText(<button name or label>, Null)
startText(foreJobNum)
// Resume the page when the button is finished speaking.
resumeText(backJobNum)

// <In response to titlebar gotfocus signal>
// Pause the page, queue and start the titlebar contents.
pauseText(backJobNum)
// Cancel the button if still in progress (it isn't in our scenario, but that
// is OK).
removeText(foreJobNum)
foreJobNum = setText(<titlebar contents>, Null)
startText(foreJobNum)
// Resume the page when titlebar is finished speaking.
resumeText(backJobNum)

// <In response to image gotfocus signal>.
// Pause the page, cancel the titlebar, queue and start the image.
pauseText(backJobNum)
removeText(foreJobNum)
foreJobNum = setText(<image "alt" contents>, Null)
startText(foreJobNum)
// Resume the page when the image is finished speaking.
resumeText(backJobNum)

Other than combining setText and startText, I don't see how this can be much  
simpler!

BTW, I recognize that screen readers will want to use different voices for the 
fore- and background jobs.  I mentioned how that will be handled in the 
future by extending the second argument of setText with "talkers".  For 
example,

setText("Hello World", "en/male")
setText("Close", "American Female")
setText("Snapshot of my Mom", "en-GB/whisper")

> I must admit that I don't know much about 
> screen-readers, so I cannot really comment on this, but I am wondering
> whether it would make sense to simplify the API for the first two uses as
> much as possible in order to have it still manageable if we later add
> functionality for screen-readers as well.
>
> Keeping the API simple was also the reason why I was wondering whether
> only one application should be allowed to read out texts,

What would happen to App2 if App1 is currently speaking text?  Error?  Block 
App2?  Cancel App1 and replace with App2?  See my comments later.

> and whether all 
> navigation should be done in this application itself.

The API permits the application to provide navigation controls if desired.  
For those apps that do not, kttsmgr provides them.  It is burdensome to 
*require* all apps to provide navigation controls.

> The parsing of 
> texts into paragraphs and sentences would be removed from the API.

The reasons for sentence parsing in KTTSD are 1) to permit navigation in jobs 
for those apps that do not provide navigation controls, and 2) to permit 
stopping speech for those plugins that do not support instant stoppage.  None 
of the current plugins provides a capability to instantly stop speech in 
progress short of aborting the background process.  The next best solution is 
to finish speaking the current sentence, which is what it currently does.  As 
we've discussed, if KTTSD handles actual audio output via Arts, it should be 
possible to provide instant stoppage.  But there is still the need to 
navigate jobs submitted by apps that don't offer navigation controls.   So 
unless you are strongly opposed to sentence parsing in KTTSD, I think it 
should remain.
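
For reference, the sentence parsing I have in mind is nothing fancy.  Here's 
a minimal sketch (hypothetical code, not the actual KTTSD parser, using 
std::string rather than QString so it stands alone).  It splits on a 
terminator followed by whitespace, which is naive enough that it also splits 
after abbreviations like "Mr.", something a real parser would have to handle:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Naive sentence splitter: break on '.', '!' or '?' followed by whitespace
// (or end of text). A hypothetical sketch, not the actual KTTSD parser.
std::vector<std::string> splitSentences(const std::string& text)
{
    std::vector<std::string> sentences;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        current += text[i];
        bool terminator = (text[i] == '.' || text[i] == '!' || text[i] == '?');
        bool atEnd = (i + 1 == text.size());
        if (terminator &&
            (atEnd || std::isspace(static_cast<unsigned char>(text[i + 1])))) {
            // Trim whitespace carried over from the previous split.
            while (!current.empty() &&
                   std::isspace(static_cast<unsigned char>(current.front())))
                current.erase(current.begin());
            if (!current.empty())
                sentences.push_back(current);
            current.clear();
        }
    }
    // Keep any trailing text that lacks a terminator.
    while (!current.empty() &&
           std::isspace(static_cast<unsigned char>(current.front())))
        current.erase(current.begin());
    if (!current.empty())
        sentences.push_back(current);
    return sentences;
}
```

Even something this simple is enough to support prevSenText/nextSenText and 
sentence-boundary stopping.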

I'm less certain about paragraph parsing.  I suppose one might want to back up 
a paragraph or two (I was distracted by the dog barking at the front door.  
Back up two paragraphs please.)  But if you want to get rid of paragraph 
parsing, I won't argue it.

> This 
> would also allow to write clients that read out html or xml with speech
> mark-up rather than plain text.

It should be possible for KTTSD to detect that passed-in text is marked up.  
It would then change its sentence and paragraph parsing to match the markup 
language.  I haven't examined this in detail, but I don't see any obstacles.  
At the very least, it could abandon sentence and paragraph parsing altogether 
if the text has markup.
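
To make that concrete, the detection could be as simple as checking whether 
the text begins with an XML declaration or an opening tag.  A hypothetical 
sketch (again std::string instead of QString, not actual KTTSD code):

```cpp
#include <cctype>
#include <string>

// Hypothetical heuristic: treat text as markup if, after optional leading
// whitespace, it starts with an XML declaration or an opening tag such as
// <html>, <speak> or <!DOCTYPE ...>. Not actual KTTSD code.
bool looksLikeMarkup(const std::string& text)
{
    std::size_t i = text.find_first_not_of(" \t\r\n");
    if (i == std::string::npos)
        return false;
    if (text.compare(i, 5, "<?xml") == 0)
        return true;
    return text[i] == '<' && i + 1 < text.size() &&
           (std::isalpha(static_cast<unsigned char>(text[i + 1])) ||
            text[i + 1] == '!');
}
```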

>
> Kttsd would then be sent a list of single sentences, and allow to jump to
> textpart number n ("markers"):
>
> virtual uint kspeech::newText (const QString &talker=NULL)
> (returns job number)
>
> virtual uint kspeech::addToText (const QString &textpart);
> (returns id)
>
> virtual void kspeech::jumpTo (uint job, uint id);
>

Now we have additional complexity: "jobs" and "id"s within a job.  The jobNum 
argument would not be needed if we assumed that only one text job can be 
active at a time, but as I explain more below, I think it is a bad idea to 
allow only one text job at a time.

I'm not convinced of the need for applications to advance or rewind large 
numbers of sentences/paragraphs at a time.  Most of the time, the user will 
want to repeat the last one or two sentences (call prevSenText once or twice) 
or skip ahead a paragraph or two (call nextParText once or twice).  Under what 
circumstances would we want to "Jump ahead to 52nd sentence" or "rewind to 
sentence 11"?  If you really want that capability, I suppose we can add an 
additional argument to the prev/next Sen/Par Text methods to advance or 
rewind N sentences/paragraphs.   Or, if you want to reduce the 4 relative 
motion methods to one, we can define

moveRelative(uint jobNum, int sen=0, int par=0)

Advances or rewinds the indicated number of paragraphs and/or sentences from 
the current sentence.  If _par_ is 0, advances or rewinds _sen_ sentences 
regardless of paragraphs.  If _sen_ is 0, advances or rewinds to the first 
sentence _par_ paragraphs from the current paragraph.  If both _par_ and 
_sen_ are 0, rewinds to the first sentence of the current paragraph.
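
To pin down the semantics above, here is a sketch of moveRelative over an 
in-memory job (hypothetical types and names, not actual KTTSD code; where both 
_par_ and _sen_ are non-zero I assume the paragraph motion happens first, 
landing on that paragraph's first sentence, then the sentence motion):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the proposed moveRelative() semantics over an in-memory job.
// The job is a list of paragraphs, each a list of sentences; (par, sen)
// is the current position. Hypothetical types/names, not actual KTTSD code.
struct TextJob {
    std::vector<std::vector<std::string>> paragraphs;
    int par = 0;  // current paragraph index (0-based)
    int sen = 0;  // current sentence index within the paragraph (0-based)

    void moveRelative(int senDelta, int parDelta)
    {
        if (paragraphs.empty())
            return;
        if (parDelta == 0 && senDelta == 0) {
            sen = 0;  // rewind to the first sentence of the current paragraph
            return;
        }
        if (parDelta != 0) {
            // Move parDelta paragraphs, landing on the first sentence.
            par = std::clamp(par + parDelta, 0,
                             static_cast<int>(paragraphs.size()) - 1);
            sen = 0;
        }
        // Move senDelta sentences, crossing paragraph boundaries as needed.
        for (; senDelta > 0; --senDelta) {
            if (sen + 1 < static_cast<int>(paragraphs[par].size()))
                ++sen;
            else if (par + 1 < static_cast<int>(paragraphs.size())) {
                ++par;
                sen = 0;
            } else
                break;  // already at the end of the job
        }
        for (; senDelta < 0; ++senDelta) {
            if (sen > 0)
                --sen;
            else if (par > 0) {
                --par;
                sen = static_cast<int>(paragraphs[par].size()) - 1;
            } else
                break;  // already at the start of the job
        }
    }
};
```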

As I mentioned, maybe we just get rid of the notion of paragraphs altogether.  

If you still want the absolute motion, we can define

void jumpTo(uint jobNum=0, uint sen=0, uint par=0)

Advances or rewinds to the indicated paragraph and sentence.  If _par_ is 0, 
advances or rewinds to sentence _sen_ regardless of paragraphs.  If _sen_ is 
0, advances or rewinds to the first sentence of paragraph _par_.  If _sen_ or 
_par_ are beyond the limits of the job, advances to the end of the job (job 
becomes finished).
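
Again to be concrete, a sketch of that absolute jumpTo (hypothetical code; 
_sen_ and _par_ are 1-based with 0 meaning "unspecified", and I assume _sen_ 
is ignored when a non-zero _par_ is given):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the proposed absolute jumpTo(). sen and par are 1-based
// positions; 0 means "unspecified". A position past the end of the job
// marks it finished. Hypothetical types/names, not actual KTTSD code.
struct Cursor {
    int par = 0;         // 0-based paragraph index
    int sen = 0;         // 0-based sentence index within the paragraph
    bool finished = false;
};

Cursor jumpTo(const std::vector<std::vector<std::string>>& paragraphs,
              unsigned sen, unsigned par)
{
    Cursor c;
    if (par == 0) {
        // Absolute sentence number counted across the whole job.
        unsigned n = (sen == 0) ? 1 : sen;
        for (std::size_t p = 0; p < paragraphs.size(); ++p) {
            if (n <= paragraphs[p].size()) {
                c.par = static_cast<int>(p);
                c.sen = static_cast<int>(n) - 1;
                return c;
            }
            n -= static_cast<unsigned>(paragraphs[p].size());
        }
        c.finished = true;  // past the end: the job becomes finished
        return c;
    }
    if (par > paragraphs.size()) {
        c.finished = true;
        return c;
    }
    c.par = static_cast<int>(par) - 1;
    c.sen = 0;  // first sentence of paragraph par
    return c;
}
```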

But as I say, I don't see much need for either of these methods.  If an app 
*really* wanted to track every sentence individually, it could queue one 
sentence per job, as needed, i.e.,

QStringList sentences;
QStringList::Iterator it = sentences.begin();
uint currentJobNum;

// Start first sentence.
currentJobNum = setText(*it, Null);
startText(currentJobNum);

// sentenceFinished signal received from KTTSD
void slot_SentenceFinished()
{
	// Start the next sentence, if any.
	++it;
	if (it == sentences.end()) return;
	currentJobNum = setText(*it, Null);
	startText(currentJobNum);
}

void SkipAhead(int n)
{
	// Stop current sentence (if any).
	removeText(currentJobNum);
	// Move text iterator ahead.
	// Probably better ways to code this.
	for ( ; n != 0; --n)
	{
		if (it == sentences.end()) break;
		++it;
	}
	if (it != sentences.end())
	{
		// Start the sentence.
		currentJobNum = setText(*it, Null);
		startText(currentJobNum);
	}
}

>
> Another thing I was always wondering about was if it is really necessary
> to have two seperate queues for messages and warnings. Maybe the code
> would be smaller and easier to maintain with either having simply one
> queue, or having a priority flag instead. But I leave this to your
> decision as a maintainer, I don't really care much about this.

I wonder about this myself.  Right now, the distinction is that Warnings are 
spoken at the end of the current sentence, while Messages are spoken at the 
end of the current paragraph.  Kind of a subtle (and useless) difference 
really.  What makes better sense to me is for Warnings to be spoken *right 
now*, while Messages are spoken *as soon as practicable*, i.e., at the end of 
the next sentence.  If we enhance the plugin API to permit instant stoppage, 
we should make this change.

The code for managing both Warnings and Messages isn't very complicated, so 
there isn't much benefit in eliminating one of them.

Now about limiting the API to only one text job at a time.  I've done a lot of 
thinking about this and strongly urge we not do that.  I assume we want to 
encourage KDE programmers to add speech capabilities to their apps.  If they 
look at the API and see:

setText(const QString& text, const QString& talker)

Queues a text message for speaking on the indicated talker.   If KTTSD is 
already speaking text, an error occurs.

Nobody will want to code a wait loop to wait until the current text job ends.  
They will naturally look to the sayWarning and sayMessage methods instead, 
which we want to discourage for normal use.  sayWarning and sayMessage should 
be reserved for high-priority messages.  If we provide a weak API for normal 
messages, then programmers will tend to treat everything as high priority.

Also, consider what happens if an application fails to remove a text job, 
either due to bad code or a crash.  A "paused" text job would block any 
further text jobs.

If the rule is "the new text cancels text in progress", programmers will have 
the same reaction.  "You mean my speech job can be replaced by another 
application!  Uhm, maybe I should use sayMessage instead..."  And the 
programmer must code a signal handler if they need to queue more than one set 
of text.

So the API I've proposed provides the most robust set of capabilities and 
greatest flexibility for text jobs, reserving sayWarning and sayMessage for 
high-priority jobs as they are intended.

(BTW, multiple text jobs have already been implemented in the latest code in 
CVS.  kttsmgr includes a job manager fashioned closely after the print 
manager.  Take a look!)

Thanks again for your feedback.

Regards,

-- 
Gary Cramblitt (aka PhantomsDad)
