Randa Meeting: Notes on Voice Control in KDE
Frederik Gladhorn
gladhorn at kde.org
Fri Sep 15 08:39:13 BST 2017
We here at Randa had a little session about voice recognition and control of
applications.
We tried to roughly define what we mean by that - a way of talking to the
computer as Siri/Cortana/Alexa/Google Now and other projects demonstrate,
conversational interfaces. We agreed that we want this and that people expect
it more and more.
Striking a balance between privacy and gathering enough data to enable this
is a big concern; more on that later.
While there is general interest (almost everyone here went out of their way to
join the discussion), it didn't seem like anyone here at the moment wanted to
drive this forward themselves, so it may just not go anywhere due to lack of
people willing to put in time. Otherwise it may be something worth considering
as a community goal.
The term "intent" seems to be OK for the event that arrives at the
application. More on that later.
We tried to break down the problem and arrived at two possible scenarios:
1) voice recognition: audio -> string representation in the user's language
1.1) translation to English: user-language string -> English string
2) intent extraction: English string -> intent
or alternatively:
1) voice recognition: audio -> string representation in the user's language
2) intent extraction: user-language string -> intent
In both cases:
3) applications get "intents" and react to them.
So basically one open question is whether we need a translation step or
whether we can go directly from a string in any language to an intent.
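To make the pipeline concrete, here is a minimal sketch of what the stages
could look like as C++/Qt signatures; none of this exists yet, all names are
made up for illustration:

    // Hypothetical signatures for the pipeline stages above
    // (invented for illustration, not an existing API):
    #include <QByteArray>
    #include <QString>
    #include <QVariantMap>

    struct Intent {
        QString action;      // e.g. "media.play"
        QVariantMap payload; // free-form property bag, see further down
    };

    QString recognizeSpeech(const QByteArray &audio); // step 1
    QString translateToEnglish(const QString &text);  // step 1.1 (optional)
    Intent extractIntent(const QString &sentence);    // step 2
    void dispatchToApplications(const Intent &in);    // step 3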
We think it is neither feasible nor desirable to let every app do its own
magic. Thus a central "daemon" process does step 1, listening to audio and
translating it to a string representation.
Then, assuming we want translation step 1.1, we need to find a way to do the
translation.
For step 1, Mozilla's DeepSpeech seems like a candidate; it appears to be
progressing quickly.
We assume that mid-term we will need machine learning for step 2 - gathering
sample sentences (somewhere between thousands and millions) to enable the
step of going from sentence to intent.
We might get away with a set of simple heuristics to get this kick-started,
but over time we would want to use machine learning to do this step. Here it's
important to gather enough sample sentences to be able to train a model. We
basically assume we need to encourage people to participate and send us the
recognized sentences to get enough raw material to work with.
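As an illustration of what such a heuristic kick-start could look like, here
is a minimal sketch that maps sentences to intents with regular expressions,
reusing the Intent shape from the sketch above; the patterns and action names
are invented:

    // A minimal heuristic sentence-to-intent matcher (illustrative only).
    #include <QRegularExpression>
    #include <QString>
    #include <QVariantMap>

    struct Intent {
        QString action;
        QVariantMap payload;
    };

    Intent extractIntent(const QString &sentence)
    {
        // "write a new email to Volker" -> email.compose(recipient: "Volker")
        static const QRegularExpression mail(
            QStringLiteral("^write (?:a new )?email to (?<recipient>.+)$"),
            QRegularExpression::CaseInsensitiveOption);

        const QRegularExpressionMatch match = mail.match(sentence.trimmed());
        if (match.hasMatch()) {
            return { QStringLiteral("email.compose"),
                     { { QStringLiteral("recipient"),
                         match.captured(QStringLiteral("recipient")) } } };
        }
        if (sentence.contains(QStringLiteral("pause"), Qt::CaseInsensitive))
            return { QStringLiteral("media.pause"), {} };

        return { QStringLiteral("unknown"), {} }; // hand over to ML later
    }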
One interesting point is that ideally we can keep context, so that users can
do follow-up queries/commands.
Some of the context may be expressed with state machines (talk to Emanuelle
about that).
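A tiny sketch of that idea using Qt's QStateMachine; the Conversation signals
are invented stand-ins for whatever the daemon would emit when intents
arrive:

    // Conversational context as a state machine (minimal sketch).
    #include <QCoreApplication>
    #include <QObject>
    #include <QState>
    #include <QStateMachine>

    class Conversation : public QObject
    {
        Q_OBJECT
    signals:
        void queryArrived();    // e.g. "who is playing right now?"
        void followUpArrived(); // e.g. "play more by them"
        void conversationDone();
    };

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        Conversation conv;
        QStateMachine machine;

        auto *idle = new QState(&machine);      // no context remembered
        auto *inContext = new QState(&machine); // last answer is remembered

        idle->addTransition(&conv, &Conversation::queryArrived, inContext);
        inContext->addTransition(&conv, &Conversation::followUpArrived, inContext);
        inContext->addTransition(&conv, &Conversation::conversationDone, idle);

        machine.setInitialState(idle);
        machine.start();
        return app.exec();
    }

    #include "main.moc" // assuming this file is main.cpp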
Clearly the whole topic needs research; we want to build on other people's
work and cooperate as much as possible.
Hopefully we can find some centralized daemon to run on Linux that does a lot
of the work in steps 1 and 2 for us.
Step 3 requires work on our side (in Qt?) for sure.
What should intents look like? Lists of property bags?
Should apps have a way of saying which intents they support?
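One conceivable answer, assuming intents are property bags (QVariantMap as
above) and assuming a central registry that does not exist yet; everything
here is made up for illustration:

    // Hypothetical announcement of supported intents to the central daemon;
    // the IntentRegistry API is invented for this sketch.
    #include <QString>
    #include <QStringList>

    class IntentRegistry
    {
    public:
        // An app announces the intent actions it can handle.
        static void announce(const QString &appId, const QStringList &actions);
    };

    void registerMediaPlayer()
    {
        IntentRegistry::announce(QStringLiteral("org.kde.someplayer"),
                                 { QStringLiteral("media.play"),
                                   QStringLiteral("media.pause"),
                                   QStringLiteral("media.next") });
    }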
A starting point could be to use the common media player D-Bus interface
(MPRIS) to control the media player using voice.
Should exposing intents be a D-Bus thing to start with?
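The media player case would work over D-Bus today: MPRIS
(org.mpris.MediaPlayer2) is a real, standardized interface, so the daemon
could translate a "pause" intent into a call like this; only the player name
(here VLC) is an example:

    // Pause a running media player via the standard MPRIS D-Bus interface.
    #include <QCoreApplication>
    #include <QDBusConnection>
    #include <QDBusInterface>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        // Any MPRIS-capable player works; "vlc" is just an example name.
        QDBusInterface player(
            QStringLiteral("org.mpris.MediaPlayer2.vlc"),
            QStringLiteral("/org/mpris/MediaPlayer2"),
            QStringLiteral("org.mpris.MediaPlayer2.Player"),
            QDBusConnection::sessionBus());

        if (player.isValid())
            player.call(QStringLiteral("Pause")); // "Play", "Next", ... likewise

        return 0;
    }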
For querying data, we may want to interface with Wikipedia, MusicBrainz,
etc., but is that more part of the central daemon or should there be an app
for it?
We probably want to be able to start applications when the appropriate
command arrives: "write a new email to Volker" launches Kube with the
composer open and ideally the recipient filled in, or it may ask the user "I
don't know who that is, please help me...".
So how do applications define what intents they process?
How can applications ask for details? After receiving an intent they may need
to ask for more data.
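One conceivable shape for that round trip, purely hypothetical: a handler
returns either success or a question the daemon should relay back to the
user:

    // Hypothetical result of handling an intent: either done, or a
    // follow-up question for the daemon to ask the user.
    #include <QString>

    struct HandlerResult {
        enum Status { Handled, NeedsInput, Failed } status;
        QString question; // set when status == NeedsInput
    };

    HandlerResult handleComposeMail(const QString &recipient)
    {
        if (recipient.isEmpty())
            return { HandlerResult::NeedsInput,
                     QStringLiteral("I don't know who that is, please help me...") };
        // ... open the composer with the recipient filled in ...
        return { HandlerResult::Handled, QString() };
    }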
There is also the kpurpose framework; I have no idea what it does and should
read up on it.
This is likely to be a completely new kind of input: it may arrive while the
app is in some state, maybe with an open modal dialog - will we see new
crashes because we're not prepared?
Are there patterns/building blocks to make it easier when an app is in a
certain state?
Maybe we should look at transactional computing and finite state machines? We
could look at network protocols as an example; they have error recovery etc.
What would integration with online services look like? A lot of this is about
querying information.
Should it be offline by default and only delegate to online services when the
user asks for it?
We need to build integration with, for example, public transport apps.
For the centralized AI we should join forces with other projects.
Maybe Qt will provide the connection to third-party engines on Windows and
macOS; that would be a good testing ground.
And to end with a less serious idea: we need a big bike-shed discussion about
wake-up words.
We already came up with: "OK KDE" (try saying that out loud), "OK Konqui", or
"Oh Kate!"
I hope some of this makes sense; I'd love to see more people stepping up,
figuring out what is needed, and moving it forward :)
Cheers,
Frederik