Randa Meeting: Notes on Voice Control in KDE

Thomas Pfeiffer thomas.pfeiffer at kde.org
Fri Sep 15 18:23:03 BST 2017


> On 15. Sep 2017, at 12:54, Sebastian Kügler <sebas at kde.org> wrote:
> 
> Hey!
> 
> Interesting discussion. Did you guys factor in the work done by Mycroft
> on that front? I think there's a great deal of overlap, and already
> some really interesting results shown for example in the Mycroft
> Plasmoid:

Exactly. Please do not reinvent the wheel here. This is a job for Mycroft, which has already solved the vast majority of problems you’d need to solve, and is already proven to work in Plasma.
Duplicating that work would just be a waste.

The big problem Mycroft currently has is that it uses Google for voice recognition; our goal should be to push for the adoption of Mozilla Common Voice in Mycroft, instead of redoing everything Mycroft does.

So yeah, I’m 1000% for allowing voice control in KDE applications as well as Plasma, but I’m 99% sure that the way to go there is Mycroft.

Cheers,
Thomas

> On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
>> We here at Randa had a little session about voice recognition and
>> control of applications.
>> We tried to roughly define what we mean by that - a way of talking to
>> the computer as Siri/Cortana/Alexa/Google Now and other projects
>> demonstrate: conversational interfaces. We agreed that we want this,
>> and people expect it more and more.
>> Striking a balance between privacy and getting some data to enable
>> this is a big concern, see later.
>> While there is general interest (almost everyone here went out of
>> their way to join the discussion), it didn't seem like anyone here at
>> the moment wanted to drive this forward themselves, so it may just
>> not go anywhere due to lack of people willing to put in time.
>> Otherwise it may be something worth considering as a community goal.
>> 
>> 
>> The term "intent" seems to be OK for the event that arrives at the
>> application. More on that later.
>> 
>> We tried to break down the problem and arrived at two possible
>> scenarios:
>> 1) voice recognition -> string representation in user's language
>> 1.1) translation to English -> string representation in English
>> 2) English sentence -> English string to intent
>> 
>> or alternatively:
>> 1) voice recognition -> string representation in user's language
>> 2) user language sentence -> user language string to intent
>> 
>> 3) applications get "intents" and react to them.
>> 
>> So basically one open question is whether we need a translation step
>> or whether we can map directly from a string in any language to an
>> intent.
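The two pipeline variants could be sketched roughly like this; all function names and the toy lookup tables below are hypothetical stand-ins for the real engines, just to show where the optional translation step sits:

```python
# Toy sketch of the two pipeline variants. recognize() stands in for a
# real speech-to-text engine; the dictionaries are illustrative only.

def recognize(audio: str) -> str:
    # Placeholder: a real engine turns audio into text in the user's
    # language. Here we pretend the "audio" already is its transcript.
    return audio

def translate_to_english(text: str) -> str:
    # Variant 1 inserts this step (1.1); a toy dictionary stands in.
    toy = {"spiele musik": "play music"}
    return toy.get(text.lower(), text)

def to_intent(text: str) -> dict:
    # Step 2: map an (English) sentence to an intent.
    if "play" in text.lower() and "music" in text.lower():
        return {"intent": "media.play"}
    return {"intent": "unknown"}

# Variant with translation step (1 -> 1.1 -> 2):
intent = to_intent(translate_to_english(recognize("Spiele Musik")))
# Variant without it (1 -> 2) would need a per-language intent model.
```

The trade-off the text raises is visible here: with step 1.1 only one intent model is needed, but translation quality becomes a single point of failure.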
>> 
>> We do not think it feasible nor desirable to let every app do its own
>> magic. Thus a central "daemon" process does step 1, listening to
>> audio and translating it to a string representation.
>> Then, assuming we want to do a translation step 1.1 we need to find a
>> way to do the translation.
>> 
>> For step 1, Mozilla's DeepSpeech seems like a candidate; it appears
>> to be progressing quickly.
>> 
>> We assume that mid-term we need machine learning for step 2 - gather
>> sample sentences (somewhere between thousands and millions) to enable
>> the step of going from sentence to intent.
>> We might get away with a set of simple heuristics to get this
>> kick-started, but over time we would want to use machine learning to
>> do this step. Here it's important to gather enough sample sentences
>> to be able to train a model. We basically assume we need to encourage
>> people to participate and send us the recognized sentences to get
>> enough raw material to work with.
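The "simple heuristics to get this kick-started" could look like the following minimal keyword matcher; the rule table and intent names are invented for illustration, not a proposed vocabulary:

```python
# A minimal keyword-heuristic mapper from sentence to intent - the kind
# of stopgap that could work before enough sample sentences exist to
# train a model. Rules and intent names are made up for illustration.

RULES = [
    ({"play", "music"}, "media.play"),
    ({"pause"}, "media.pause"),
    ({"write", "email"}, "mail.compose"),
]

def sentence_to_intent(sentence: str) -> str:
    words = set(sentence.lower().replace(",", "").split())
    for keywords, intent in RULES:
        if keywords <= words:  # all rule keywords present in sentence
            return intent
    return "unknown"

print(sentence_to_intent("Please play some music"))  # media.play
```

Every sentence that falls through to "unknown" is exactly the raw material worth collecting (with the user's consent) to train the later model.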
>> 
>> One interesting point is that ideally we can keep context, so that
>> users can do follow-up queries/commands.
>> Some of the context may be expressed with state machines (talk to
>> Emanuelle about that).
>> Clearly the whole topic needs research, we want to build on other
>> people's stuff and cooperate as much as possible.
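Keeping context for follow-ups could start as simply as remembering the previous turn's slots; a fuller design might use the state machines mentioned above. Everything in this sketch is a hypothetical structure:

```python
# Sketch: remember the last intent's slots so a follow-up command
# ("pause it") can be resolved against the previous target.

class DialogContext:
    def __init__(self):
        self.last_slots = {}

    def resolve(self, intent: str, slots: dict) -> dict:
        # Fill slots missing from this turn with the previous turn's,
        # then remember the merged result for the next follow-up.
        merged = {**self.last_slots, **slots}
        self.last_slots = merged
        return {"intent": intent, "slots": merged}

ctx = DialogContext()
first = ctx.resolve("media.play", {"artist": "Queen"})
follow_up = ctx.resolve("media.pause", {})  # "pause it"
# follow_up still carries the target from the previous turn.
```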
>> 
>> Hopefully we can find some centralized daemon thing to run on Linux
>> and do a lot of the work in step 1 and 2 for us.
>> Step 3 requires work on our side (in Qt?) for sure.
>> What should intents look like? lists of property bags?
>> Should apps have a way of saying which intents they support?
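One possible answer to both questions - intents as property bags, and apps declaring what they support - is sketched below. The class and intent names are invented; a real design might express the same thing over D-Bus instead of in-process:

```python
# Sketch: an intent as a "property bag" (a name plus free-form
# properties), and an application declaring which intents it supports.

from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str                               # e.g. "mail.compose"
    properties: dict = field(default_factory=dict)

class MailApp:
    SUPPORTED_INTENTS = {"mail.compose", "mail.search"}

    def handle(self, intent: Intent) -> bool:
        if intent.name not in self.SUPPORTED_INTENTS:
            return False  # daemon can route it elsewhere
        # ... dispatch to the right UI action here ...
        return True

app = MailApp()
ok = app.handle(Intent("mail.compose", {"recipient": "Volker"}))
```

Declaring `SUPPORTED_INTENTS` up front is what would let the central daemon route an intent to the right application, or start one that advertises support.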
>> 
>> A starting point could be to use the common media player interface to
>> control the media player using voice.
>> Should exposing intents be a dbus thing to start with?
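The common media player interface is presumably MPRIS2, where a voice intent maps naturally to a plain D-Bus method call. The sketch below only assembles a `dbus-send` command line (so it runs anywhere); the bus name, object path, and interface are the standard MPRIS2 ones, while the intent-to-method table is a guess:

```python
# Forwarding a media intent as an MPRIS2 D-Bus call. The MPRIS2
# object path and interface names are standardized; the mapping from
# our hypothetical intent names to methods is illustrative.

INTENT_TO_MPRIS = {
    "media.play": "Play",
    "media.pause": "Pause",
    "media.next": "Next",
}

def mpris_command(intent: str,
                  player: str = "org.mpris.MediaPlayer2.vlc") -> list:
    method = INTENT_TO_MPRIS[intent]
    return [
        "dbus-send", "--print-reply",
        "--dest=" + player,
        "/org/mpris/MediaPlayer2",
        "org.mpris.MediaPlayer2.Player." + method,
    ]

cmd = mpris_command("media.pause")
# subprocess.run(cmd) would pause the player, if one is running.
```

Because every MPRIS2-capable player exposes the same interface, this one mapping would cover VLC, Elisa, Amarok, and friends without per-app work.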
>> 
>> For querying data, we may want to interface with Wikipedia,
>> MusicBrainz, etc., but is that more part of the central daemon or
>> should
>> there be an app?
>> 
>> We probably want to be able to start applications when the appropriate
>> command arrives: "write a new email to Volker" launches Kube with the
>> composer open and, ideally, the recipient filled in, or it may ask the
>> user "I don't know who that is, please help me...".
>> So how do applications define what intents they process?
>> How can applications ask for details? After receiving an intent, they
>> may need to ask for more data.
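The "ask for more data" flow could work by having each intent declare its required slots, with the handler returning a clarification question instead of acting when one is missing. All names here are illustrative:

```python
# Sketch: each intent declares required slots; when one is missing the
# handler answers with a question rather than executing the intent.

REQUIRED_SLOTS = {"mail.compose": ["recipient"]}

def handle(intent: str, slots: dict) -> tuple:
    for slot in REQUIRED_SLOTS.get(intent, []):
        if slot not in slots:
            return ("ask", f"I don't know the {slot} yet, please help me.")
    return ("done", f"Executing {intent} with {slots}")

status, reply = handle("mail.compose", {})
# status == "ask": the daemon would speak `reply` and wait for an answer.
status2, _ = handle("mail.compose", {"recipient": "Volker"})
```

The user's answer would then flow back through steps 1 and 2 as a follow-up, which is another reason the context-keeping discussed earlier matters.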
>> 
>> There is also the Purpose framework; I have no idea what it does and
>> should read up on it.
>> 
>> Voice is likely to be a completely new kind of input that arrives
>> while the app is in some arbitrary state - it may have an open modal
>> dialog, and we may see new crashes because we're not prepared. Are
>> there patterns/building blocks to make this easier when an app is in
>> a certain state?
>> Maybe we should look at transactional computing and finite state
>> machines? We could look at network protocols as example, they have
>> error recovery etc.
>> 
>> What would integration with online services look like? A lot of this
>> is about querying information.
>> Should it be offline by default, delegating to online services only
>> when the user asks for it?
>> 
>> We need to build integration for things like public transport apps.
>> For the centralized AI part, we should join forces with other
>> projects.
>> Maybe Qt will provide the connection to 3rd-party engines on Windows
>> and macOS - a good testing ground.
>> 
>> And to end with a less serious idea, we need a big bike-shed
>> discussion about wake up words.
>> We already came up with: OK KDE (try saying that out loud), OK Konqui
>> or Oh Kate!
>> 
>> I hope some of this makes sense. I'd love to see more people stepping
>> up to figure out what is needed and move it forward :)
>> 
>> Cheers,
>> Frederik
> 
> 
> -- 
> sebas
> 
> http://www.kde.org | http://vizZzion.org




More information about the kde-community mailing list