Randa Meeting: Notes on Voice Control in KDE
Frederik Gladhorn
gladhorn at kde.org
Fri Sep 15 08:39:13 BST 2017
We here at Randa had a little session about voice recognition and control of
applications.
We tried to roughly define what we mean by that - a way of talking to the
computer as Siri/Cortana/Alexa/Google Now and other projects demonstrate,
conversational interfaces. We agreed that we want this and that people expect
it more and more.
Striking a balance between privacy and gathering enough data to enable this
is a big concern; more on that later.
While there is general interest (almost everyone here went out of their way to
join the discussion), it didn't seem like anyone here at the moment wanted to
drive this forward themselves, so it may just not go anywhere due to lack of
people willing to put in time. Otherwise it may be something worth considering
as a community goal.
The term "intent" seems to be OK for the event that arrives at the
application. More on that later.
We tried to break down the problem and arrived at two possible scenarios:
1) voice recognition: audio -> string representation in the user's language
1.1) translation to English: user-language string -> English string
2) intent extraction: English string -> intent
or alternatively:
1) voice recognition: audio -> string representation in the user's language
2) intent extraction: user-language string -> intent
In both cases:
3) applications get "intents" and react to them.
So basically one open question is whether we need a translation step or
whether we can go directly from a string in any language to an intent.
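To make the pipeline concrete, here is a minimal sketch of what the stages
could look like as C++/Qt signatures; none of this exists yet, all names are
made up for illustration:

    // Hypothetical signatures for the pipeline stages above
    // (invented for illustration, not an existing API):
    #include <QByteArray>
    #include <QString>
    #include <QVariantMap>

    struct Intent {
        QString action;      // e.g. "media.play"
        QVariantMap payload; // free-form property bag, see further down
    };

    QString recognizeSpeech(const QByteArray &audio); // step 1
    QString translateToEnglish(const QString &text);  // step 1.1 (optional)
    Intent extractIntent(const QString &sentence);    // step 2
    void dispatchToApplications(const Intent &in);    // step 3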
We think it is neither feasible nor desirable to let every app do its own
magic. Thus a central "daemon" process does step 1, listening to audio and
translating it to a string representation.
Then, assuming we want translation step 1.1, we need to find a way to do the
translation.
For step 1, Mozilla's DeepSpeech seems like a candidate; it appears to be
progressing quickly.
We assume that mid-term we will need machine learning for step 2 - gathering
sample sentences (somewhere between thousands and millions) to enable the
step of going from sentence to intent.
We might get away with a set of simple heuristics to get this kick-started,
but over time we would want to use machine learning to do this step. Here it's
important to gather enough sample sentences to be able to train a model. We
basically assume we need to encourage people to participate and send us the
recognized sentences to get enough raw material to work with.
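As an illustration of what such a heuristic kick-start could look like, here
is a minimal sketch that maps sentences to intents with regular expressions,
reusing the Intent shape from the sketch above; the patterns and action names
are invented:

    // A minimal heuristic sentence-to-intent matcher (illustrative only).
    #include <QRegularExpression>
    #include <QString>
    #include <QVariantMap>

    struct Intent {
        QString action;
        QVariantMap payload;
    };

    Intent extractIntent(const QString &sentence)
    {
        // "write a new email to Volker" -> email.compose(recipient: "Volker")
        static const QRegularExpression mail(
            QStringLiteral("^write (?:a new )?email to (?<recipient>.+)$"),
            QRegularExpression::CaseInsensitiveOption);

        const QRegularExpressionMatch match = mail.match(sentence.trimmed());
        if (match.hasMatch()) {
            return { QStringLiteral("email.compose"),
                     { { QStringLiteral("recipient"),
                         match.captured(QStringLiteral("recipient")) } } };
        }
        if (sentence.contains(QStringLiteral("pause"), Qt::CaseInsensitive))
            return { QStringLiteral("media.pause"), {} };

        return { QStringLiteral("unknown"), {} }; // hand over to ML later
    }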
One interesting point is that ideally we can keep context, so that users can
do follow-up queries/commands.
Some of the context may be expressed with state machines (talk to Emanuelle
about that).
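A tiny sketch of that idea using Qt's QStateMachine; the Conversation signals
are invented stand-ins for whatever the daemon would emit when intents
arrive:

    // Conversational context as a state machine (minimal sketch).
    #include <QCoreApplication>
    #include <QObject>
    #include <QState>
    #include <QStateMachine>

    class Conversation : public QObject
    {
        Q_OBJECT
    signals:
        void queryArrived();    // e.g. "who is playing right now?"
        void followUpArrived(); // e.g. "play more by them"
        void conversationDone();
    };

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);
        Conversation conv;
        QStateMachine machine;

        auto *idle = new QState(&machine);      // no context remembered
        auto *inContext = new QState(&machine); // last answer is remembered

        idle->addTransition(&conv, &Conversation::queryArrived, inContext);
        inContext->addTransition(&conv, &Conversation::followUpArrived, inContext);
        inContext->addTransition(&conv, &Conversation::conversationDone, idle);

        machine.setInitialState(idle);
        machine.start();
        return app.exec();
    }

    #include "main.moc" // assuming this file is main.cpp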
Clearly the whole topic needs research; we want to build on other people's
work and cooperate as much as possible.
Hopefully we can find some centralized daemon to run on Linux that does a lot
of the work in steps 1 and 2 for us.
Step 3 requires work on our side (in Qt?) for sure.
What should intents look like? Lists of property bags?
Should apps have a way of saying which intents they support?
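One conceivable answer, assuming intents are property bags (QVariantMap as
above) and assuming a central registry that does not exist yet; everything
here is made up for illustration:

    // Hypothetical announcement of supported intents to the central daemon;
    // the IntentRegistry API is invented for this sketch.
    #include <QString>
    #include <QStringList>

    class IntentRegistry
    {
    public:
        // An app announces the intent actions it can handle.
        static void announce(const QString &appId, const QStringList &actions);
    };

    void registerMediaPlayer()
    {
        IntentRegistry::announce(QStringLiteral("org.kde.someplayer"),
                                 { QStringLiteral("media.play"),
                                   QStringLiteral("media.pause"),
                                   QStringLiteral("media.next") });
    }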
A starting point could be to use the common media player D-Bus interface
(MPRIS) to control the media player using voice.
Should exposing intents be a D-Bus thing to start with?
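The media player case would work over D-Bus today: MPRIS
(org.mpris.MediaPlayer2) is a real, standardized interface, so the daemon
could translate a "pause" intent into a call like this; only the player name
(here VLC) is an example:

    // Pause a running media player via the standard MPRIS D-Bus interface.
    #include <QCoreApplication>
    #include <QDBusConnection>
    #include <QDBusInterface>

    int main(int argc, char *argv[])
    {
        QCoreApplication app(argc, argv);

        // Any MPRIS-capable player works; "vlc" is just an example name.
        QDBusInterface player(
            QStringLiteral("org.mpris.MediaPlayer2.vlc"),
            QStringLiteral("/org/mpris/MediaPlayer2"),
            QStringLiteral("org.mpris.MediaPlayer2.Player"),
            QDBusConnection::sessionBus());

        if (player.isValid())
            player.call(QStringLiteral("Pause")); // "Play", "Next", ... likewise

        return 0;
    }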
For querying data, we may want to interface with Wikipedia, MusicBrainz,
etc., but is that more part of the central daemon or should there be an app
for it?
We probably want to be able to start applications when the appropriate
command arrives: "write a new email to Volker" launches Kube with the
composer open and ideally the recipient filled in, or it may ask the user "I
don't know who that is, please help me...".
So how do applications define what intents they process?
How can applications ask for details? After receiving an intent they may need
to ask for more data.
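One conceivable shape for that round trip, purely hypothetical: a handler
returns either success or a question the daemon should relay back to the
user:

    // Hypothetical result of handling an intent: either done, or a
    // follow-up question for the daemon to ask the user.
    #include <QString>

    struct HandlerResult {
        enum Status { Handled, NeedsInput, Failed } status;
        QString question; // set when status == NeedsInput
    };

    HandlerResult handleComposeMail(const QString &recipient)
    {
        if (recipient.isEmpty())
            return { HandlerResult::NeedsInput,
                     QStringLiteral("I don't know who that is, please help me...") };
        // ... open the composer with the recipient filled in ...
        return { HandlerResult::Handled, QString() };
    }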
There is also the kpurpose framework; I have no idea what it does and should
read up on it.
This is likely to be a completely new kind of input: it may arrive while the
app is in some state, maybe with an open modal dialog - will we see new
crashes because we're not prepared?
Are there patterns/building blocks to make it easier when an app is in a
certain state?
Maybe we should look at transactional computing and finite state machines? We
could look at network protocols as an example; they have error recovery etc.
What would integration with online services look like? A lot of this is about
querying information.
Should it be offline by default and only delegate to online services when the
user asks for it?
We need to build integration with, for example, public transport apps.
For the centralized AI we should join forces with other projects.
Maybe Qt will provide the connection to third-party engines on Windows and
macOS; that would be a good testing ground.
And to end with a less serious idea: we need a big bike-shed discussion about
wake-up words.
We already came up with: "OK KDE" (try saying that out loud), "OK Konqui", or
"Oh Kate!"
I hope some of this makes sense; I'd love to see more people stepping up,
figuring out what is needed, and moving it forward :)
Cheers,
Frederik