[Kde-accessibility] KDE Speech API Draft 2 and new KTTSD
Gary Cramblitt
garycramblitt at comcast.net
Fri May 21 05:47:01 CEST 2004
Thank you very much for the feedback!
On Thursday 20 May 2004 12:16 pm, Olaf Jan Schmidt wrote:
> There are mainly three types of usage for speech synthesis:
>
> 1. Speaking and navigating through whole texts (can be interrupted by
> messages and screen reader speech)
> 2. Speaking single messages (can be interrupted by screen reader speech)
Why would you want to interrupt a single message with the screen reader?
> 3. A screen-reader reading out whatever happens on the screen (can be cut
> off by new screen reader speech)
If "cut off", is the speaking text re-queued or is it canceled?
>
> The suggested API is very feature-rich for the first two uses, but the
> third use is not covered.
What is not covered? I admit I have zero experience with screen readers, but
for the sake of discussion, let's imagine the following: The screen reader
is reading a page of text (let's say a web page.) Focus moves to a button,
so the screen reader wants to pause speaking and speak the label (or name) of
the button. After speaking the button, screen reader wants to continue
speaking the page where it left off. Focus moves to a title bar, so screen
reader wants to pause speaking and speak the title bar contents. While
speaking the title bar contents, focus goes to an image, so screen reader
wants to abandon speaking the title bar and speak the "alt" text of the
image. All of this can be accomplished using the current API as follows:
We assume the screen reader has two "levels" of talking -- a background job
and a foreground job that represents the control on the screen that has
current focus. The background job can be pre-empted by the foreground job,
but resumes when foreground is no longer speaking. Foreground jobs can be
pre-empted by another foreground job, but in this case the pre-empted job is
canceled. (Probably over-simplified, but for the sake of discussion...) So we
keep track of two job numbers -- backJobNum and foreJobNum. Here's what the
screen reader's code might look like:
uint backJobNum;  // Job number of the background job.
uint foreJobNum;  // Job number of the foreground job.

// Queue and start the page of text using the default language.
backJobNum = setText(<text of the page>, Null);
startText(backJobNum);

// <In response to button gotfocus signal>
// Pause the page, queue and start the button.
pauseText(backJobNum);
foreJobNum = setText(<button name or label>, Null);
startText(foreJobNum);
// Resume the page when the button is finished speaking.
resumeText(backJobNum);

// <In response to titlebar gotfocus signal>
// Pause the page, queue and start the titlebar contents.
pauseText(backJobNum);
// Cancel the button if it is still in progress (it isn't in our scenario,
// but that is OK).
removeText(foreJobNum);
foreJobNum = setText(<titlebar contents>, Null);
startText(foreJobNum);
// Resume the page when the titlebar is finished speaking.
resumeText(backJobNum);

// <In response to image gotfocus signal>
// Pause the page, cancel the titlebar, queue and start the image.
pauseText(backJobNum);
removeText(foreJobNum);
foreJobNum = setText(<image "alt" contents>, Null);
startText(foreJobNum);
// Resume the page when the image is finished speaking.
resumeText(backJobNum);
Other than combining setText and startText, I don't see how this can be much
simpler!
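One detail the comments above gloss over is how the screen reader knows the
button is finished speaking. Presumably it connects to a completion signal
from KTTSD, along the lines of the sentenceFinished signal mentioned further
down. A minimal sketch, assuming a hypothetical textFinished(uint jobNum)
signal (the actual signal name is not settled):

// Sketch only: assumes KTTSD emits textFinished(uint jobNum) when a
// text job completes. The signal name here is an assumption.
void slot_TextFinished(uint jobNum)
{
    if (jobNum == foreJobNum)
    {
        // Foreground speech is done; let the page continue.
        resumeText(backJobNum);
    }
}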
BTW, I recognize that screen readers will want to use different voices for the
fore- and background jobs. I mentioned how that will be handled in the
future by extending the second argument of setText with "talkers". For
example,
setText("Hello World", "en/male")
setText("Close", "American Female")
setText("Snapshot of my Mom", "en-GB/whisper")
> I must admit that I don't know much about
> screen-readers, so I cannot really comment on this, but I am wondering
> whether it would make sense to simplify the API for the first two uses as
> much as possible in order to have it still manageable if we later add
> functionality for screen-readers as well.
>
> Keeping the API simple was also the reason why I was wondering whether
> only one application should be allowed to read out texts,
What would happen to App2 if App1 is currently speaking text? Error? Block
App2? Cancel App1 and replace with App2? See my comments later.
> and whether all
> navigation should be done in this application itself.
The API permits the application to provide navigation controls if desired.
For those apps that do not, kttsmgr provides them. It is burdensome to
*require* all apps to provide navigation controls.
> The parsing of
> texts into paragraphs and sentences would be removed from the API.
The reasons for sentence parsing in KTTSD are 1) permit navigation in jobs for
those apps that do not provide navigation controls, and 2) permit stopping
speech for those plugins that do not support instant stoppage. None of the
current plugins provides a way to instantly stop speech in progress short of
aborting the background process. The next best solution is to finish speaking
the current sentence, which is what KTTSD currently does. As
we've discussed, if KTTSD handles actual audio output via Arts, it should be
possible to provide instant stoppage. But there is still the need to
navigate jobs submitted by apps that don't offer navigation controls. So
unless you are strongly opposed to sentence parsing in KTTSD, I think it
should remain.
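To give a flavor of what sentence parsing means here, a very rough sketch;
the real parsing in KTTSD is more careful than this one regular expression:

#include <qstringlist.h>
#include <qregexp.h>

// Split plain text into sentences by breaking after '.', '!' or '?'
// followed by whitespace. Illustrative only.
QStringList splitIntoSentences(const QString &text)
{
    QString marked = text;
    marked.replace(QRegExp("([.!?])\\s+"), "\\1\n");
    return QStringList::split('\n', marked);
}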
I'm less certain about paragraph parsing. I suppose one might want to back up
a paragraph or two (I was distracted by the dog barking at the front door.
Back up two paragraphs please.) But if you want to get rid of paragraph
parsing, I won't argue it.
> This
> would also allow writing clients that read out html or xml with speech
> mark-up rather than plain text.
It should be possible for KTTSD to detect that passed-in text is marked up.
It would then change its sentence and paragraph parsing to match the markup
language. I haven't examined this in detail, but I don't see any obstacles.
At the very least, it could abandon sentence and paragraph parsing altogether
if the text has markup.
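A crude detection heuristic might be no more than this (purely illustrative,
not a design):

// Guess whether the passed-in text is XML-style markup rather than
// plain text. Illustrative sketch only.
bool looksLikeMarkup(const QString &text)
{
    return text.stripWhiteSpace().startsWith("<");
}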
>
> Kttsd would then be sent a list of single sentences, and allow jumping to
> textpart number n ("markers"):
>
> virtual uint kspeech::newText (const QString &talker=NULL)
> (returns job number)
>
> virtual uint kspeech::addToText (const QString &textpart);
> (returns id)
>
> virtual void kspeech::jumpTo (uint job, uint id);
>
Now we have additional complexity: we have "jobs" and "id"s within a job. We
could assume that only one text job can be active at a time, in which case the
jobNum argument might not be needed, but as I explain further below, I think
it is a bad idea to allow only one text job at a time.
I'm not convinced about the need for applications to advance or rewind large
numbers of sentences/paragraphs at a time. Most of the time, the user will
want to repeat the last one or two sentences (call prevSenText once or twice),
or skip ahead a paragraph or two (call nextParText once or twice). Under what
circumstances would we want to "Jump ahead to 52nd sentence" or "rewind to
sentence 11"? If you really want that capability, I suppose we can add an
additional argument to the prev/next Sen/Par Text methods to advance or
rewind N sentences/paragraphs. Or, if you want to reduce the 4 relative
motion methods to one, we can define
moveRelative(uint jobNum, int sen=0, int par=0)
Advances or rewinds the indicated number of paragraphs and/or sentences from
the current sentence. If _par_ is 0, advances or rewinds _sen_ sentences
regardless of paragraphs. If _sen_ is 0, advances or rewinds to the first
sentence _par_ paragraphs from the current paragraph. If both _par_ and
_sen_ are 0, rewinds to the first sentence of the current paragraph.
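With those semantics, typical calls would be:

moveRelative(jobNum, -2);      // rewind two sentences
moveRelative(jobNum, 0, 1);    // advance to the first sentence of the next paragraph
moveRelative(jobNum);          // rewind to the start of the current paragraph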
As I mentioned, maybe we just get rid of the notion of paragraphs altogether.
If you still want the absolute motion, we can define
void jumpTo(uint jobNum=0, uint sen=0, uint par=0)
Advances or rewinds to the indicated paragraph and sentence. If _par_ is 0,
advances or rewinds to sentence _sen_ regardless of paragraphs. If _sen_ is
0, advances or rewinds to the first sentence of paragraph _par_. If _sen_ or
_par_ are beyond the limits of the job, advances to the end of the job (job
becomes finished).
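For example:

jumpTo(jobNum, 52);       // jump to sentence 52, ignoring paragraphs
jumpTo(jobNum, 0, 11);    // jump to the first sentence of paragraph 11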
But as I say, I don't see much need for either of these methods. If an app
*really* wanted to track every sentence individually, it could queue one
sentence per job, as needed, i.e.,
QStringList sentences;
QStringList::Iterator it = sentences.begin();
uint currentJobNum;

// Start the first sentence.
currentJobNum = setText(*it, Null);
startText(currentJobNum);

// sentenceFinished signal received from KTTSD.
void slot_SentenceFinished()
{
    // Start the next sentence, if there is one.
    ++it;
    if (it != sentences.end())
    {
        currentJobNum = setText(*it, Null);
        startText(currentJobNum);
    }
}

void SkipAhead(int n)
{
    // Stop the current sentence (if any).
    removeText(currentJobNum);
    // Move the text iterator ahead.
    // There are probably better ways to code this.
    for ( ; n != 0; --n)
    {
        if (it == sentences.end()) break;
        ++it;
    }
    if (it != sentences.end())
    {
        // Start the sentence.
        currentJobNum = setText(*it, Null);
        startText(currentJobNum);
    }
}
>
> Another thing I was always wondering about was if it is really necessary
> to have two separate queues for messages and warnings. Maybe the code
> would be smaller and easier to maintain with either having simply one
> queue, or having a priority flag instead. But I leave this to your
> decision as a maintainer, I don't really care much about this.
I wonder about this myself. Right now, the distinction is that Warnings are
spoken at the end of the current sentence, while Messages are spoken at the
end of the current paragraph. Kind of a subtle (and useless) difference
really. What makes better sense to me is for Warnings to be spoken *right
now*, while Messages are spoken *as soon as practicable*, i.e., at the end of
the next sentence. If we enhance the plugin API to permit instant stoppage,
we should make this change.
The code for managing both Warnings and Messages isn't very complicated, so
there isn't much benefit in eliminating one of them.
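For what it's worth, if we did collapse them, Olaf's one-queue-with-a-priority-flag
idea might look something like this (a sketch only; nothing like this exists
in KTTSD today):

#include <qvaluelist.h>
#include <qstring.h>

enum Priority { Warning, Message };  // Warning = interrupt right now,
                                     // Message = wait for a sentence boundary.

struct QueuedUtterance
{
    QString text;
    QString talker;
    Priority priority;  // Warnings are dequeued ahead of Messages.
};

QValueList<QueuedUtterance> speechQueue;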
Now about limiting the API to only one text job at a time. I've done a lot of
thinking about this and strongly urge we not do that. I assume we want to
encourage KDE programmers to add speech capabilities to their apps. If they
look at the API and see:
setText(const QString& text, const QString& talker)
Queues a text message for speaking on the indicated talker. If KTTSD is
already speaking text, an error occurs.
Nobody will want to code a wait loop to wait until the current text job ends.
They will naturally look to the sayWarning and sayMessage methods instead,
which we want to discourage for normal use. sayWarning and sayMessage should
be reserved for high-priority messages. If we provide a weak API for normal
messages, then programmers will tend to treat everything as high priority.
Also, consider what happens if an application fails to remove a text job,
either due to bad code or a crash. A "paused" text job would block any
further text jobs.
If the rule is "the new text cancels text in progress", programmers will have
the same reaction. "You mean my speech job can be replaced by another
application! Uhm, maybe I should use sayMessage instead..." And the
programmer must code a signal handler if they need to queue more than one set
of text.
So the API I've proposed provides the most robust set of capabilities and
greatest flexibility for text jobs, reserving sayWarning and sayMessage for
high-priority jobs as they are intended.
(BTW, multiple text jobs have already been implemented in the latest code in
CVS. kttsmgr includes a job manager fashioned closely after the print
manager. Take a look!)
Thanks again for your feedback.
Regards,
--
Gary Cramblitt (aka PhantomsDad)