Randa Meeting: Notes on Voice Control in KDE
Sebastian Kügler
sebas at kde.org
Fri Sep 15 11:54:47 BST 2017
Hey!
Interesting discussion. Did you guys factor in the work done by Mycroft
on that front? I think there's a great deal of overlap, and already
some really interesting results shown for example in the Mycroft
Plasmoid:
https://www.youtube.com/watch?v=sUhvKTq6c40 (somewhat dated, but gives
a decent impression)
Cheers,
-- sebas
On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
> We here at Randa had a little session about voice recognition and
> control of applications.
> We tried to roughly define what we mean by that - a way of talking to
> the computer as Siri/Cortana/Alexa/Google Now and other projects
> demonstrate, conversational interfaces. We agreed that we want this
> and that people expect it more and more.
> Striking a balance between privacy and getting some data to enable
> this is a big concern, see later.
> While there is general interest (almost everyone here went out of
> their way to join the discussion), it didn't seem like anyone here at
> the moment wanted to drive this forward themselves, so it may just
> not go anywhere due to lack of people willing to put in time.
> Otherwise it may be something worth considering as a community goal.
>
>
> The term "intent" seems to be OK for the event that arrives at the
> application. More on that later.
>
> We tried to break down the problem and arrived at two possible
> scenarios:
> 1) voice recognition -> string representation in user's language
> 1.1) translation to English -> string representation in English
> 2) English sentence -> English string to intent
>
> or alternatively:
> 1) voice recognition -> string representation in user's language
> 2) user language sentence -> user language string to intent
>
> 3) applications get "intents" and react to them.
>
> So basically one open question is whether we need a translation step
> or whether we can map a string in any language directly to an intent.
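>
> To make the two variants concrete, the stages could be written down as
> plain function signatures; all of the names below are made up, this is
> only meant to pin down what flows between the steps:
>
> #include <QByteArray>
> #include <QString>
> #include <QVariantMap>
>
> using Intent = QVariantMap;                       // "property bag", see below
>
> QString speechToText(const QByteArray &audio);    // step 1: audio -> user-language text
> QString translateToEnglish(const QString &text);  // step 1.1 (first variant only)
> Intent  textToIntent(const QString &sentence);    // step 2: sentence -> intent
>
> // Variant 1: textToIntent(translateToEnglish(speechToText(audio)))
> // Variant 2: textToIntent(speechToText(audio)), one mapping per language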
>
> We think it is neither feasible nor desirable to let every app do its
> own magic. Thus a central "daemon" process does step 1, listening to
> audio and translating it to a string representation.
> Then, assuming we want translation step 1.1, we need to find a way to
> do the translation.
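>
> Purely as a sketch of the plumbing (the service, path and signal name
> are invented), such a daemon could broadcast every recognized sentence
> on the session bus and let whatever implements step 2 pick it up:
>
> #include <QCoreApplication>
> #include <QDBusConnection>
> #include <QDBusMessage>
> #include <QString>
>
> int main(int argc, char *argv[])
> {
>     QCoreApplication app(argc, argv);
>
>     // Pretend step 1 just produced this string from the microphone.
>     const QString sentence = QStringLiteral("pause the music");
>
>     // Broadcast it; interested parties (step 2, debugging tools, ...)
>     // can listen for the signal.  All names are placeholders.
>     QDBusMessage msg = QDBusMessage::createSignal(
>         QStringLiteral("/recognizer"),
>         QStringLiteral("org.example.VoiceRecognizer"),
>         QStringLiteral("sentenceRecognized"));
>     msg << sentence;
>     QDBusConnection::sessionBus().send(msg);
>
>     return 0;
> }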
>
> For step 1, Mozilla Deep Voice seems like a candidate; it seems to be
> progressing quickly.
>
> We assume that mid-term we need machine learning for step 2 - gather
> sample sentences (somewhere between thousands and millions) to enable
> the step of going from sentence to intent.
> We might get away with a set of simple heuristics to get this
> kick-started, but over time we would want to use machine learning to
> do this step. Here it's important to gather enough sample sentences
> to be able to train a model. We basically assume we need to encourage
> people to participate and send us the recognized sentences to get
> enough raw material to work with.
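>
> As an illustration of what such "simple heuristics" could mean before
> any model exists, here is a naive keyword-matching sketch; the phrases,
> intent names and keys are all invented:
>
> #include <QDebug>
> #include <QRegularExpression>
> #include <QString>
> #include <QVariantMap>
>
> // Very naive sentence -> intent mapping, only meant as a kick-start.
> QVariantMap textToIntent(const QString &sentence)
> {
>     QVariantMap intent;
>     const QString s = sentence.toLower();
>
>     if (s.contains(QStringLiteral("pause"))) {
>         intent[QStringLiteral("action")] = QStringLiteral("media.pause");
>     } else if (s.contains(QStringLiteral("play"))) {
>         intent[QStringLiteral("action")] = QStringLiteral("media.play");
>     } else {
>         // e.g. "write a new email to volker" -> compose intent with a slot
>         const QRegularExpression re(QStringLiteral("write .*email to (\\w+)"));
>         const auto match = re.match(s);
>         if (match.hasMatch()) {
>             intent[QStringLiteral("action")] = QStringLiteral("email.compose");
>             intent[QStringLiteral("to")] = match.captured(1);
>         }
>     }
>     return intent;
> }
>
> int main()
> {
>     qDebug() << textToIntent(QStringLiteral("Write a new email to Volker"));
>     return 0;
> }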
>
> One interesting point is that ideally we can keep context, so that
> users can do follow-up queries/commands.
> Some of the context may be expressed with state machines (talk to
> Emanuelle about that).
> Clearly the whole topic needs research, we want to build on other
> people's stuff and cooperate as much as possible.
>
> Hopefully we can find some centralized daemon thing to run on Linux
> and do a lot of the work in steps 1 and 2 for us.
> Step 3 requires work on our side (in Qt?) for sure.
> What should intents look like? lists of property bags?
> Should apps have a way of saying which intents they support?
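>
> One possible reading of "lists of property bags", written with Qt
> types; the key names, the action strings and the supportedIntents()
> idea are assumptions for the sake of the example, not a proposal:
>
> #include <QDebug>
> #include <QStringList>
> #include <QVariantMap>
>
> // An intent as a property bag: one well-known key for the action,
> // everything else free-form parameters for that action.
> QVariantMap composeIntent()
> {
>     QVariantMap intent;
>     intent[QStringLiteral("action")] = QStringLiteral("email.compose");
>     intent[QStringLiteral("to")] = QStringLiteral("Volker");
>     return intent;
> }
>
> // An app could announce what it handles (via its desktop file, DBus or
> // some registry); here just a plain list of action names.
> QStringList supportedIntents()
> {
>     return { QStringLiteral("email.compose"), QStringLiteral("email.search") };
> }
>
> int main()
> {
>     qDebug() << supportedIntents() << composeIntent();
>     return 0;
> }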
>
> A starting point could be to use the common media player interface to
> control the media player using voice.
> Should exposing intents be a dbus thing to start with?
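>
> That common interface would presumably be MPRIS over DBus. A minimal
> sketch that pauses a player (VLC's bus name used here); how an intent
> gets mapped to the method call is of course an assumption:
>
> #include <QCoreApplication>
> #include <QDBusConnection>
> #include <QDBusInterface>
>
> int main(int argc, char *argv[])
> {
>     QCoreApplication app(argc, argv);
>
>     // MPRIS: every compliant player exposes org.mpris.MediaPlayer2.Player
>     // at /org/mpris/MediaPlayer2; only the bus name differs per player.
>     QDBusInterface player(QStringLiteral("org.mpris.MediaPlayer2.vlc"),
>                           QStringLiteral("/org/mpris/MediaPlayer2"),
>                           QStringLiteral("org.mpris.MediaPlayer2.Player"),
>                           QDBusConnection::sessionBus());
>
>     // e.g. an intent like {"action": "media.pause"} could end up here:
>     player.call(QStringLiteral("Pause"));
>     return 0;
> }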
>
> For querying data, we may want to interface with Wikipedia,
> MusicBrainz, etc., but is that more part of the central daemon or
> should there be an app?
>
> We probably want to be able to start applications when the appropriate
> command arrives: "write a new email to Volker" launches Kube with the
> composer open and ideally the recipient filled in, or it may ask the
> user "I don't know who that is, please help me...".
> So how do applications define which intents they process?
> How can applications ask for details? After receiving an intent they
> may need to ask for more data.
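>
> One shape an app-side handler could take, with completely invented
> names; the only point is that the reply can either accept the intent
> or carry a follow-up question back to the user:
>
> #include <QDebug>
> #include <QString>
> #include <QVariantMap>
>
> // Hypothetical app-side handler: either accepts the intent or answers
> // with a question the central daemon can speak back to the user.
> QVariantMap handleIntent(const QVariantMap &intent)
> {
>     QVariantMap reply;
>     if (intent.value(QStringLiteral("action")).toString()
>             == QStringLiteral("email.compose")) {
>         if (!intent.contains(QStringLiteral("to"))) {
>             reply[QStringLiteral("status")] = QStringLiteral("need-more-data");
>             reply[QStringLiteral("question")] =
>                 QStringLiteral("I don't know who that is, please help me.");
>             return reply;
>         }
>         // ... open the composer with the recipient pre-filled ...
>         reply[QStringLiteral("status")] = QStringLiteral("handled");
>     }
>     return reply;
> }
>
> int main()
> {
>     QVariantMap intent;
>     intent[QStringLiteral("action")] = QStringLiteral("email.compose");
>     qDebug() << handleIntent(intent); // no "to" field -> follow-up question
>     return 0;
> }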
>
> There is also the kpurpose framework; I have no idea what it does and
> should read up on it.
>
> This is likely to be a completely new kind of input arriving while the
> app is in some state, maybe with an open modal dialog; will we see new
> crashes because we're not prepared? Are there patterns/building blocks
> to make it easier when an app is in a certain state?
> Maybe we should look at transactional computing and finite state
> machines? We could look at network protocols as an example; they have
> error recovery etc.
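>
> On the state machine idea, Qt already ships QStateMachine; a toy sketch
> of gating intents on application state (the states and the trigger are
> invented) could look like this:
>
> #include <QCoreApplication>
> #include <QDebug>
> #include <QState>
> #include <QStateMachine>
> #include <QTimer>
>
> int main(int argc, char *argv[])
> {
>     QCoreApplication app(argc, argv);
>
>     // Two invented application states: intents are applied while
>     // "idle" and deferred while a modal dialog is open.
>     QStateMachine machine;
>     auto *idle = new QState(&machine);
>     auto *modalDialog = new QState(&machine);
>     machine.setInitialState(idle);
>
>     QObject::connect(idle, &QState::entered,
>                      [] { qDebug() << "idle: handle incoming intents"; });
>     QObject::connect(modalDialog, &QState::entered,
>                      [] { qDebug() << "modal dialog open: queue intents"; });
>
>     // Stand-in for "a modal dialog was opened": a timer firing once.
>     QTimer dialogOpened;
>     dialogOpened.setSingleShot(true);
>     idle->addTransition(&dialogOpened, SIGNAL(timeout()), modalDialog);
>     dialogOpened.start(1000);
>
>     machine.start();
>     return app.exec();
> }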
>
> What would integration with online services look like? A lot of this
> is about querying information.
> Should it be offline by default and only delegate to online services
> when the user asks for it?
>
> We need to build, for example, public transport app integration.
> For a centralized AI we should join other projects.
> Maybe Qt will provide the connection to 3rd-party engines on Windows
> and macOS; that would be a good testing ground.
>
> And to end with a less serious idea, we need a big bike-shed
> discussion about wake up words.
> We already came up with: OK KDE (try saying that out loud), OK Konqui
> or Oh Kate!
>
> I hope some of this makes sense; I'd love to see more people stepping
> up, figuring out what is needed, and moving it forward :)
>
> Cheers,
> Frederik
--
sebas
http://www.kde.org | http://vizZzion.org