Randa Meeting: Notes on Voice Control in KDE

Sebastian Kügler sebas at kde.org
Fri Sep 15 11:54:47 BST 2017


Hey!

Interesting discussion. Did you guys factor in the work done by Mycroft
on that front? I think there's a great deal of overlap, and already
some really interesting results shown for example in the Mycroft
Plasmoid:

https://www.youtube.com/watch?v=sUhvKTq6c40 (somewhat dated, but gives
a decent impression)

Cheers,
-- sebas

On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
> We here at Randa had a little session about voice recognition and
> control of applications.
> We tried to roughly define what we mean by that - a way of talking to
> the computer as Siri/Cortana/Alexa/Google Now and other projects
> demonstrate, conversational interfaces. We agreed that we want this
> and that people expect it more and more.
> Striking a balance between privacy and getting some data to enable
> this is a big concern, see later.
> While there is general interest (almost everyone here went out of
> their way to join the discussion), it didn't seem like anyone here at
> the moment wanted to drive this forward themselves, so it may just
> not go anywhere due to lack of people willing to put in time.
> Otherwise it may be something worth considering as a community goal.
> 
> 
> The term "intent" seems to be OK for the event that arrives at the
> application. More on that later.
> 
> We tried to break down the problem and arrived at two possible
> scenarios:
> 1) voice recognition -> string representation in user's language
> 1.1) translation to English -> string representation in English
> 2) English sentence -> English string to intent
> 
> or alternatively:
> 1) voice recognition -> string representation in user's language
> 2) user language sentence -> user language string to intent
> 
> 3) applications get "intents" and react to them.
> 
> So basically one open question is whether we need a translation step
> or whether we can map a string in any language directly to an intent.
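> 
> To make the "intent" idea a bit more concrete, here is a rough sketch
> (Qt/C++, all names invented) of what such an intent event could look
> like once it arrives at an application: an action name plus a property
> bag of parameters.
> 
>   #include <QString>
>   #include <QVariantMap>
> 
>   // Hypothetical shape of an intent: a well-known action name plus a
>   // free-form property bag of parameters.
>   struct Intent {
>       QString action;          // e.g. "media.play", "email.compose"
>       QVariantMap parameters;  // e.g. {"to": "Volker"}
>       QString sourceLanguage;  // language of the recognized sentence
>   };
> 
>   // Example: "write a new email to Volker" could become
>   //   action = "email.compose", parameters = {"to": "Volker"}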
> 
> We do not think it is feasible or desirable to let every app do its
> own magic. Thus a central "daemon" process does step 1, listening to
> audio and translating it to a string representation.
> Then, assuming we want to do translation step 1.1, we need to find a
> way to do the translation.
> 
> For step 1, Mozilla's DeepSpeech seems like a candidate; it appears to
> be progressing quickly.
> 
> We assume that mid-term we will need machine learning for step 2:
> gathering sample sentences (somewhere between thousands and millions)
> to enable the step of going from sentence to intent.
> We might get away with a set of simple heuristics to get this
> kick-started, but over time we would want to use machine learning to
> do this step. Here it's important to gather enough sample sentences
> to be able to train a model. We basically assume we need to encourage
> people to participate and send us the recognized sentences to get
> enough raw material to work with.
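> 
> To kick-start this before any model is trained, the heuristics could
> be as simple as a handful of patterns mapping a recognized sentence to
> an intent. A rough sketch (Qt/C++, reusing the hypothetical Intent
> struct from above, patterns invented):
> 
>   #include <QRegularExpression>
> 
>   // Naive kick-start heuristics: fixed patterns that map a recognized
>   // sentence to an intent, long before any trained model exists.
>   Intent sentenceToIntent(const QString &sentence)
>   {
>       static const QRegularExpression playRe(
>           QStringLiteral("^play (?:some )?(.+)$"),
>           QRegularExpression::CaseInsensitiveOption);
>       static const QRegularExpression mailRe(
>           QStringLiteral("^write a new email to (.+)$"),
>           QRegularExpression::CaseInsensitiveOption);
> 
>       Intent intent;
>       QRegularExpressionMatch m = playRe.match(sentence);
>       if (m.hasMatch()) {
>           intent.action = QStringLiteral("media.play");
>           intent.parameters.insert(QStringLiteral("query"), m.captured(1));
>           return intent;
>       }
>       m = mailRe.match(sentence);
>       if (m.hasMatch()) {
>           intent.action = QStringLiteral("email.compose");
>           intent.parameters.insert(QStringLiteral("to"), m.captured(1));
>           return intent;
>       }
>       intent.action = QStringLiteral("unknown");
>       intent.parameters.insert(QStringLiteral("sentence"), sentence);
>       return intent;
>   }
> 
> Every sentence that falls through to "unknown" would be exactly the
> kind of sample we would want people to send us for training.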
> 
> One interesting point is that ideally we can keep context, so that
> users can do follow-up queries/commands.
> Some of the context may be expressed with state machines (talk to
> Emanuelle about that).
> Clearly the whole topic needs research, we want to build on other
> people's stuff and cooperate as much as possible.
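> 
> As a very rough illustration of the context idea: the daemon could
> remember the slots of the previous intent and use them to fill gaps in
> a follow-up ("play something by Queen" ... "play the next album").
> Again Qt/C++, everything invented:
> 
>   // Minimal conversational context: remember the parameters of the
>   // last intent and copy them into missing slots of a follow-up.
>   class ConversationContext
>   {
>   public:
>       Intent resolve(Intent next)
>       {
>           for (auto it = m_last.parameters.constBegin();
>                it != m_last.parameters.constEnd(); ++it) {
>               if (!next.parameters.contains(it.key()))
>                   next.parameters.insert(it.key(), it.value());
>           }
>           m_last = next;
>           return next;
>       }
> 
>   private:
>       Intent m_last;
>   };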
> 
> Hopefully we can find some centralized daemon thing to run on Linux
> and do a lot of the work in steps 1 and 2 for us.
> Step 3 requires work on our side (in Qt?) for sure.
> What should intents look like? Lists of property bags?
> Should apps have a way of saying which intents they support?
> 
> A starting point could be to use the common media player interface to
> control the media player using voice.
> Should exposing intents be a dbus thing to start with?
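> 
> If we start with the media player, the common interface is presumably
> MPRIS on D-Bus, so the daemon could turn a "pause" intent into a plain
> D-Bus call. A minimal sketch (Qt/C++; the concrete service name is
> assumed here and would have to be discovered at runtime):
> 
>   #include <QCoreApplication>
>   #include <QDBusInterface>
> 
>   // Turn a recognized "pause the music" command into a call on the
>   // MPRIS player interface over D-Bus.
>   int main(int argc, char **argv)
>   {
>       QCoreApplication app(argc, argv);
> 
>       QDBusInterface player(
>           QStringLiteral("org.mpris.MediaPlayer2.vlc"),  // assumed player
>           QStringLiteral("/org/mpris/MediaPlayer2"),
>           QStringLiteral("org.mpris.MediaPlayer2.Player"));
> 
>       if (player.isValid())
>           player.call(QStringLiteral("Pause"));  // standard MPRIS method
> 
>       return 0;
>   }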
> 
> For querying data, we may want to interface with Wikipedia,
> MusicBrainz, etc., but is that more part of the central daemon or
> should there be an app?
> 
> We probably want to be able to start applications when the
> appropriate command arrives: "write a new email to Volker" launches
> Kube with the composer open and, ideally, the recipient filled out, or
> it may ask the user "I don't know who that is, please help me...".
> So how do applications define what intents they process?
> How can applications ask for details? After receiving an intent they
> may need to ask for more data.
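> 
> One way to answer both questions, sketched very roughly: each app
> could export a small D-Bus object that lists the intents it handles
> and that can answer an intent with a follow-up question when data is
> missing. Everything below (interface name, path, methods) is invented:
> 
>   #include <QCoreApplication>
>   #include <QDBusConnection>
>   #include <QObject>
>   #include <QStringList>
>   #include <QVariantMap>
> 
>   // Hypothetical object an app exports on D-Bus so the central daemon
>   // can discover and deliver intents.
>   class IntentHandler : public QObject
>   {
>       Q_OBJECT
>       Q_CLASSINFO("D-Bus Interface", "org.kde.VoiceIntents")
> 
>   public Q_SLOTS:
>       // Lets the daemon discover what this app can do.
>       QStringList supportedIntents() const
>       {
>           return { QStringLiteral("email.compose") };
>       }
> 
>       // Returns a question for the user when data is missing,
>       // or an empty string when the intent was handled.
>       QString handleIntent(const QString &action, const QVariantMap &parameters)
>       {
>           if (action == QLatin1String("email.compose")
>                   && !parameters.contains(QStringLiteral("to")))
>               return QStringLiteral("I don't know who that is, please help me...");
>           // ...open the composer, fill in the recipient, etc.
>           return QString();
>       }
>   };
> 
>   int main(int argc, char **argv)
>   {
>       QCoreApplication app(argc, argv);
>       IntentHandler handler;
>       QDBusConnection::sessionBus().registerObject(
>           QStringLiteral("/VoiceIntents"), &handler,
>           QDBusConnection::ExportAllSlots);
>       return app.exec();
>   }
> 
>   #include "main.moc"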
> 
> There is also the Purpose framework; I have no idea what it does and
> should read up on it.
> 
> This is likely to be a completely new kind of input arriving while
> the app is in some state: it may have an open modal dialog, and we may
> see new crashes because we're not prepared for it. Are there
> patterns/building blocks to make it easier when an app is in a certain
> state?
> Maybe we should look at transactional computing and finite state
> machines? We could look at network protocols as an example; they have
> error recovery etc.
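> 
> To make the state machine thought a little more concrete, a toy
> sketch (Qt/C++, invented names) of an app deferring intents while a
> modal dialog is open instead of hitting an unprepared code path:
> 
>   #include <QQueue>
> 
>   // Toy finite state machine: while a modal dialog is open, incoming
>   // intents are queued and replayed once the app is idle again.
>   class IntentGate
>   {
>   public:
>       enum class State { Idle, ModalDialogOpen };
> 
>       void setState(State s)
>       {
>           m_state = s;
>           if (m_state == State::Idle)
>               while (!m_pending.isEmpty())
>                   apply(m_pending.dequeue());
>       }
> 
>       void receive(const Intent &intent)
>       {
>           if (m_state == State::ModalDialogOpen)
>               m_pending.enqueue(intent);  // defer until the dialog closes
>           else
>               apply(intent);
>       }
> 
>   private:
>       void apply(const Intent &intent)
>       {
>           Q_UNUSED(intent)
>           // hand over to the application's normal action handling
>       }
> 
>       State m_state = State::Idle;
>       QQueue<Intent> m_pending;
>   };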
> 
> What would integration with online services look like? A lot of this
> is about querying information.
> Should it be offline by default, delegating to online services only
> when the user asks for it?
> 
> We need to build, for example, public transport app integration.
> For the centralized AI part we should join other projects.
> Maybe Qt will provide the connection to 3rd-party engines on Windows
> and macOS; that would be a good testing ground.
> 
> And to end with a less serious idea, we need a big bike-shed
> discussion about wake-up words.
> We already came up with: OK KDE (try saying that out loud), OK Konqui
> or Oh Kate!
> 
> I hope some of this makes sense. I'd love to see more people stepping
> up, figuring out what is needed, and moving it forward :)
> 
> Cheers,
> Frederik


-- 
sebas

http://www.kde.org | http://vizZzion.org


