Randa Meeting: Notes on Voice Control in KDE

Aditya Mehra aix.m at outlook.com
Tue Sep 19 14:07:16 BST 2017


Hi Frederik,

It's awesome that you are trying out Mycroft. Do check out some of the cool Plasma skills Mycroft already has for controlling your workspace; they are installable directly from the plasmoid.

I also understand that the plasmoid isn't packaged yet and that installing it manually from git can be a long procedure. If you are running Kubuntu 17.04 or higher, KDE Neon, or the Fedora 25/26 Spin, I have written a small installer to make installation easier for anyone who wants to try out Mycroft and the plasmoid on Plasma. It installs Mycroft and the plasmoid together, including the Plasma desktop skills.

It's still new and might have bugs. If you want to give it a go, you can get the AppImage for the Mycroft installer here: https://github.com/AIIX/mycroft-installer/releases/

I think it would be great if more people in the community gave Mycroft and the plasmoid a go; it would certainly help with looking at the finer details of where improvements can be made in Mycroft.

I am also available for a discussion at any time, and to answer any queries, installation issues, etc. You can ping me on Mycroft's chat channels (user handle: @aix) or over email.

Regards,
Aditya

________________________________
From: Frederik Gladhorn <gladhorn at kde.org>
Sent: Tuesday, September 19, 2017 2:24:53 AM
To: Aditya Mehra; kde-community at kde.org
Cc: Thomas Pfeiffer
Subject: Re: Randa Meeting: Notes on Voice Control in KDE

Hello Aditya :)

Thanks for your mail. I have tried Mycroft a little and am very interested in
it as well (I didn't manage to get the plasmoid up and running, but that's
more due to lack of effort than anything else). Your talk and demo at Akademy
were very impressive.

We did briefly touch on Mycroft, and it certainly is a project that we should
cooperate with in my opinion. I sometimes like to start by looking at the big
picture and figuring out the details from there; if Mycroft covers a lot of
what we intend to do, then that's perfect. I have just started looking around
and simply don't feel like I can recommend anything yet, since I'm pretty new
to the topic.

Your mail added one more component to the list that I didn't think about at
all: networking and several devices working together in some form.

On Saturday 16 September 2017 00.08.10 CEST Aditya Mehra wrote:
> Hi Everyone :),
>
>
> Firstly I would like to start off by introducing myself. I am Aditya, and I
> have been working on the Mycroft - Plasma integration project for some time.
> This includes front-end work, like the plasmoid, as well as back-end
> integration with various Plasma desktop features (KRunner, Activities,
> KDE Connect, wallpapers, etc.).
>
Nice, I didn't know that there was more than the plasmoid! This is very
interesting to hear, I'll have to have a look at what you did so far.

>
> I have carefully read through the email and would like to add some points to
> this discussion. (P.S. Please don't consider me partial to the Mycroft
> project in any way; I am not employed by them, but am contributing full time
> out of my love for Linux as a platform and the wish to have voice control
> over my own Plasma desktop environment.) Apologies in advance for the long
> email, but here are some of my thoughts and points I would like to add to
> the discussion:
>
>
> a) Mycroft AI is an open source digital assistant trying to bridge the gap
> between the AI assistant / voice control platforms of proprietary operating
> systems, such as Google Now, Siri, Cortana and Bixby, and an open source
> environment.
>
Yes, that does align well.
>
> b) The Mycroft project is built on the same principle of having a
> conversational interface with your computer, but while maintaining privacy
> and independence based on the user's own choice (explained below).
>
>
> c) The basics of how Mycroft works:
>
> Mycroft AI is based on Python and mainly runs four services:
>
>     i) A websocket server, more commonly referred to as the messagebus,
> which is responsible for creating and accepting websocket connections used
> to talk between clients (for example the plasmoid, mobile, hardware, etc.).
>
>     ii) The second service is the 'Adapt' intent parser, which acts as a
> platform to understand the user's intent (for example "open firefox",
> "create a new tab" or "dict mode") with multi-language support, and
> triggers the action the user states.

I'd like to learn more about this part; I guess it's under heavy development.
It did work nicely for me with the Raspberry Pi Mycroft version. But glancing
at the code, is this based on a few heuristics at the moment, or is there a
collection of data and machine learning involved?
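
For reference, the registration side of Adapt looks roughly like this, going
by the public adapt-parser examples (the keyword and entity names below are
purely illustrative):

    # Sketch based on the public adapt-parser examples; the entity and
    # intent names are made up for illustration.
    from adapt.intent import IntentBuilder
    from adapt.engine import IntentDeterminationEngine

    engine = IntentDeterminationEngine()
    engine.register_entity("open", "OpenKeyword")
    engine.register_entity("firefox", "Application")

    launch_intent = IntentBuilder("LaunchIntent") \
        .require("OpenKeyword") \
        .require("Application") \
        .build()
    engine.register_intent_parser(launch_intent)

    # "open firefox" should yield a LaunchIntent with a confidence score.
    for intent in engine.determine_intent("open firefox"):
        if intent and intent.get("confidence", 0) > 0:
            print(intent)

A skill then maps an intent like that to a handler which actually performs
the action.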

>
>     iii) The third service is the STT (speech-to-text) service: it is
> responsible for converting speech to text, which is then sent over to the
> Adapt interface to perform the specified intent.
>
>     iv) The fourth service is called "Mimic", which, much like the espeak
> TTS engine, converts text to speech, except that Mimic does it with
> customized voices and support for various formats.
>
Technically espeak has a bunch of voices as well, but it's good to see TTS
evolving, very good.
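
To make the messagebus part (i) more concrete: as I understand it, clients
talk to Mycroft by exchanging small JSON messages over that websocket. A
minimal sketch, assuming the default local bus address of
ws://localhost:8181/core and the websocket-client Python package (both are
assumptions on my side):

    # Sketch of talking to the Mycroft messagebus; the address and the
    # message type are assumptions based on the default configuration.
    import json
    from websocket import create_connection  # pip install websocket-client

    ws = create_connection("ws://localhost:8181/core")

    # Inject an utterance as if the user had spoken it.
    ws.send(json.dumps({
        "type": "recognizer_loop:utterance",
        "data": {"utterances": ["open firefox"]},
        "context": {},
    }))

    # Print whatever the bus broadcasts next (e.g. intent or speak events).
    print(ws.recv())
    ws.close()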
>
> d) The Mycroft project is under the Apache license, which means it is
> completely open and customizable: every interested party can fork their own
> customized environment, or even drastically rewrite the parts of the back
> end they feel would better suit their own use case, including hosting their
> own instance if they feel mycroft-core upstream cannot reach those levels
> of functionality. Additionally, Mycroft can also be configured to run
> headless.
>
>
> e) With regard to privacy concerns and the use of Google STT, the upstream
> Mycroft community is already working towards moving to Mozilla DeepSpeech
> as its main STT engine as it matures (one of their top-ranked goals). On
> the sidelines there are already forks that use completely offline STT
> interfaces, for example the "jarbas ai" fork, and everyone in the community
> is trying to integrate with more open source voice models like CMU Sphinx,
> etc. Sadly, I would call this a battle between data availability and
> community contribution to voice on one side, and an already trained Google
> engine with the advantages of proprietary multi-language support and highly
> trained voice models on the other.
>
This is indeed super interesting. We just saw the Mozilla project as a likely
contender; if other projects take the pole position, that's just as fine by
me. I just want something that is open source and can be used privately
without sending all data around the globe. I do think privacy is something we
should aim for, so this sounds like we're aligned.
>
> f) The upstream Mycroft community is still very new in terms of larger open
> source projects, but it is very open to interacting with everyone from the
> KDE community and with developers who want to extend the platform to the
> Plasma desktop environment, and it is committed to supporting this effort
> in every way. That includes myself: I am constantly looking forward to
> integrating even more with Plasma and KDE applications and projects on all
> fronts, including cool functionality, accessibility, dictation mode, etc.
>
It's encouraging to hear that you have positive experiences interacting with
them :)
>
> g) Some goodies about Mycroft I would like to add: the "hey mycroft" wake
> word is completely customizable, and you can change it to whatever suits
> your taste (whatever phonetic names PocketSphinx accepts). Additionally, as
> a community you can also decide not to use Mycroft's servers or services at
> all, and can define your own API settings for things like Wolfram Alpha,
> wake words and other API calls, including data telemetry and STT. There is
> no requirement to use Google STT or the default Mycroft Home API services,
> even currently.
>
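For anyone who wants to try the wake word change: the per-user settings live
in a JSON config file, commonly ~/.mycroft/mycroft.conf (the path and keys
here are my assumption and may differ between versions). A small sketch of
changing it from Python:

    # Sketch: patch the user-level Mycroft config. The path and the
    # "listener"/"wake_word" keys are assumptions based on the default
    # configuration layout and may differ between versions.
    import json
    import os

    conf_path = os.path.expanduser("~/.mycroft/mycroft.conf")
    config = {}
    if os.path.exists(conf_path):
        with open(conf_path) as f:
            config = json.load(f)

    # Pick any phrase PocketSphinx can match phonetically.
    config.setdefault("listener", {})["wake_word"] = "hey plasma"

    with open(conf_path, "w") as f:
        json.dump(config, f, indent=2)
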
>
> h) As the project is based on Python, the best way I have come across to
> interact with Plasma services is through D-Bus interfaces: the more
> applications are ready to open up their functionality over D-Bus, the
> faster we can integrate voice control on the desktop. On the technical side
> this approach is not limited to D-Bus either; developers who prefer not to
> interact with D-Bus can choose to directly expose functionality using C
> types in the functions they would like to expose to voice interaction.

I do think D-Bus can work just fine. I'd love to hear your thoughts about
intents, conversational interfaces, and what apps should do to enable this.
For me that is actually the most pressing question for KDE: what do we need
as the interface between applications and the voice-controlled service (e.g.
Mycroft)? Do you agree that some form of "intents" is the right thing, and
what should they contain? Is there some structure that Mycroft uses today?
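
Just to make the D-Bus route concrete, here is a minimal sketch of a Python
skill calling a desktop service over the session bus. I am using the standard
org.freedesktop.Notifications interface only because it exists everywhere; a
real skill would target the specific interfaces a KDE application exposes
(this assumes the dbus-python package):

    # Sketch: call a desktop D-Bus service from Python (dbus-python).
    # org.freedesktop.Notifications is used only as a well-known example;
    # a real skill would call application-specific interfaces.
    import dbus

    bus = dbus.SessionBus()
    obj = bus.get_object("org.freedesktop.Notifications",
                         "/org/freedesktop/Notifications")
    notifications = dbus.Interface(obj, "org.freedesktop.Notifications")

    # Show a notification as if the assistant had something to report.
    notifications.Notify("mycroft", 0, "", "Mycroft",
                         "Opening Firefox...", [], {}, 5000)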

>
>
> i) There are already awesome Mycroft skills being developed by the open
> source community, including interaction with the Plasma desktop and things
> like Home Assistant, Mopidy, Amarok, Wikipedia (migrating to Wikidata),
> OpenWeather, other desktop applications and many cloud services such as
> image recognition, and more at: https://github.com/MycroftAI/mycroft-skills
>
Great, that answers my previous question to some degree; I'll have a look.
>
> j) Personally, and on behalf of upstream, I would like to invite everyone
> interested in taking voice control and interaction with digital assistants
> forward on the Plasma desktop and Plasma Mobile platforms to come and join
> the Mycroft Mattermost chat at https://chat.mycroft.ai, where we can create
> our own KDE channel and talk directly to the upstream Mycroft team (they
> are more than happy to interact with everyone from KDE on a one-to-one
> basis, to answer queries and concerns, and to take voice control and
> digital assistance to the next level), or to use an IRC channel where
> everyone, including myself and upstream, can interact to take this forward.
>

Thanks a lot for your mail :)

Cheers,
Frederik

>
>
> Regards,
>
> Aditya
>
> ________________________________
> From: kde-community <kde-community-bounces at kde.org> on behalf of Frederik
> Gladhorn <gladhorn at kde.org> Sent: Friday, September 15, 2017 1:09 PM
> To: kde-community at kde.org
> Subject: Randa Meeting: Notes on Voice Control in KDE
>
> We here at Randa had a little session about voice recognition and control
> of applications.
> We tried to roughly define what we mean by that - a way of talking to the
> computer as Siri/Cortana/Alexa/Google Now and other projects demonstrate:
> conversational interfaces. We agreed that we want this and that people
> expect it more and more.
> Striking a balance between privacy and getting some data to enable this is
> a big concern, see later.
> While there is general interest (almost everyone here went out of their way
> to join the discussion), it didn't seem like anyone here at the moment
> wanted to drive this forward themselves, so it may just not go anywhere due
> to lack of people willing to put in time. Otherwise it may be something
> worth considering as a community goal.
>
>
> The term "intent" seems to be OK for the event that arrives at the
> application. More on that later.
>
> We tried to break down the problem and arrived at two possible scenarios:
> 1) voice recognition -> string representation in user's language
> 1.1) translation to English -> string representation in English
> 2) English sentence -> English string to intent
>
> or alternatively:
> 1) voice recognition -> string representation in user's language
> 2) user language sentence -> user language string to intent
>
> 3) applications get "intents" and react to them.
>
> So basically one open question is whether we need a translation step or
> whether we can translate directly from a string in any language to an
> intent.
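
To visualize the two variants, a rough sketch of the pipeline shapes in
Python (every function name here is hypothetical, nothing of this exists):

    # Hypothetical sketch of the two pipelines discussed above.
    def recognize(audio: bytes, language: str) -> str:
        """Step 1: speech in the user's language -> text string."""
        raise NotImplementedError

    def translate(text: str, source_language: str) -> str:
        """Optional step 1.1: user-language text -> English text."""
        raise NotImplementedError

    def to_intent(text: str, language: str) -> dict:
        """Step 2: sentence -> intent (some kind of property bag)."""
        raise NotImplementedError

    def handle(audio: bytes, language: str) -> dict:
        # Variant with the translation step: recognize, translate, then a
        # single English intent parser. The alternative skips translate()
        # and runs a per-language intent parser instead.
        return to_intent(translate(recognize(audio, language), language), "en")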
>
> We do not think it feasible nor desirable to let every app do its own magic.
> Thus a central "daemon" process does step 1, listening to audio and
> translating it to a string representation.
> Then, assuming we want to do the translation step 1.1, we need to find a
> way to do the translation.
>
> For step 1 Mozilla DeepSpeech seems like a candidate, and it seems to be
> progressing quickly.
>
> We assume that mid-term we will need machine learning for step 2, and that
> we will have to gather sample sentences (somewhere between thousands and
> millions) to enable the step of going from sentence to intent.
> We might get away with a set of simple heuristics to get this kick-started,
> but over time we would want to use machine learning to do this step. Here
> it's important to gather enough sample sentences to be able to train a
> model. We basically assume we need to encourage people to participate and
> send us the recognized sentences to get enough raw material to work with.
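
To illustrate what the kick-start heuristics could look like before any
machine learning is involved, a purely hypothetical sketch (the patterns and
intent names are made up):

    # Hypothetical keyword/regex heuristics for sentence -> intent;
    # nothing here is an existing KDE or Mycroft API.
    import re

    RULES = [
        (re.compile(r"^(open|start|launch) (?P<app>.+)$", re.I), "app.launch"),
        (re.compile(r"^write (a )?new email to (?P<person>.+)$", re.I),
         "email.compose"),
        (re.compile(r"^(play|pause|stop) (the )?music$", re.I), "media.control"),
    ]

    def to_intent(sentence: str):
        """Return an intent as a small property bag, or None."""
        for pattern, name in RULES:
            match = pattern.match(sentence.strip())
            if match:
                return {"intent": name, "slots": match.groupdict()}
        return None

    print(to_intent("write a new email to Volker"))
    # -> {'intent': 'email.compose', 'slots': {'person': 'Volker'}}

Such rules obviously don't scale, which is exactly why the sample sentences
and a trained model become interesting.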
>
> One interesting point is that ideally we can keep context, so that users
> can do follow-up queries/commands.
> Some of the context may be expressed with state machines (talk to Emanuelle
> about that).
> Clearly the whole topic needs research; we want to build on other people's
> work and cooperate as much as possible.
>
> Hopefully we can find some centralized daemon to run on Linux and do a lot
> of the work in steps 1 and 2 for us.
> Step 3 requires work on our side (in Qt?) for sure.
> What should intents look like? Lists of property bags?
> Should apps have a way of saying which intents they support?
>
> A starting point could be to use the common media player interface to
> control the media player using voice.
> Should exposing intents be a D-Bus thing to start with?
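
The media player case is nice because that common interface already exists
on D-Bus as MPRIS. A minimal sketch of driving whichever MPRIS-capable
player happens to be running (assuming the dbus-python package):

    # Sketch: control the first MPRIS media player found on the session
    # bus, via the standard org.mpris.MediaPlayer2.Player interface.
    import dbus

    bus = dbus.SessionBus()
    players = [name for name in bus.list_names()
               if name.startswith("org.mpris.MediaPlayer2.")]
    if players:
        obj = bus.get_object(players[0], "/org/mpris/MediaPlayer2")
        player = dbus.Interface(obj, "org.mpris.MediaPlayer2.Player")
        player.PlayPause()  # what a "pause the music" intent could map to
    else:
        print("No MPRIS media player running")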
>
> For querying data, we may want to interface with Wikipedia, MusicBrainz,
> etc., but is that more part of the central daemon, or should there be an
> app for it?
>
> We probably want to be able to start applications when the appropriate
> command arrives: "write a new email to Volker" launches Kube with the
> composer open and ideally the receiver filled out, or it may ask the user
> "I don't know who that is, please help me...".
> So how do applications define what intents they process?
> How can applications ask for details? After receiving an intent they may
> need to ask for more data.
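
One entirely hypothetical way an application-side handler could signal that
it needs more data (none of this exists, it is only meant to make the
question concrete):

    # Hypothetical application-side intent handler; every name is made up.
    KNOWN_CONTACTS = {"volker": "volker@example.org"}

    def handle_compose_email(intent: dict) -> dict:
        person = intent.get("slots", {}).get("person", "").lower()
        address = KNOWN_CONTACTS.get(person)
        if address is None:
            # Instead of failing, hand a follow-up question back to the
            # voice service so it can ask the user for clarification.
            return {"status": "need_more_data",
                    "question": "I don't know who that is, please help me..."}
        return {"status": "ok", "action": "open_composer", "to": address}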
>
> There is also the kpurpose framework; I have no idea what it does and
> should read up on it.
>
> This is likely to be completely new input arriving while an app is in some
> state; it may have an open modal dialog, and we might see new crashes
> because we're not prepared.
> Are there patterns/building blocks to make it easier when an app is in a
> certain state?
> Maybe we should look at transactional computing and finite state machines?
> We could look at network protocols as an example; they have error recovery
> etc.
>
> What would integration with online services look like? A lot of this is
> about querying information.
> Should it be offline by default, and delegate to online services when the
> user asks for it?
>
> We need to build, for example, public transport app integration.
> For a centralized AI we should join other projects.
> Maybe Qt will provide the connection to third-party engines on Windows and
> macOS; that would be a good testing ground.
>
> And to end with a less serious idea, we need a big bike-shed discussion
> about wake-up words.
> We already came up with: OK KDE (try saying that out loud), OK Konqui or
> Oh Kate!
>
> I hope some of this makes sense. I'd love to see more people stepping up,
> figuring out what is needed, and moving it forward :)
>
> Cheers,
> Frederik

