Meeting

Peter Grasch peter at grasch.net
Mon Feb 17 19:18:01 UTC 2014


Hey,

thanks everyone for coming.

Minutes (I'll keep them short):
* Sadly little has happened in the last few months, but Simon (the
person) has been very active in related projects and may be able to open
source / share a few results with us, especially related to audio alignment.
* Our main, immediate goals so far are:
** Setting up a large (ish) scale audio alignment system for acoustic
modeling (me and, if he can, Simon)
** Setting up the KDE speech website (Mario Fux)
** Working on a user-deployable technology preview of Simon as a dictation
system (me, mostly)

Please find the full log of today's meeting attached to this email.

Best regards,
Peter
-------------- next part --------------
[15:51:08] <bedahr> ralfherzog: hey there
[15:51:12] <bedahr> long time no see
[15:51:16] <bedahr> how are you?
[15:52:22] <fregl> hi
[15:52:46] <ralfherzog> hello all, hi bedahr. thanks, I am fine.
[15:52:59] <bedahr> hey Frederik! nice to see you too!
[15:54:28] <unormal_> Ah, now I get it. And the "afraidness" disappears...
[15:54:35] <fregl> hey :) please ping me when you do meetings, I'm generally interested (while I won't get around to doing much work directly)
[15:54:39] <bedahr> unormal_: haha
[15:54:51] <bedahr> fregl: ping
[15:54:52] <bedahr> :)
[15:54:59] <bedahr> so it's 4 pm by my clock
[15:55:12] <bedahr> let me check the list
[15:55:18] <bedahr> is Mathias Lenz here?
[15:55:20] <fregl> :D
[15:55:25] <bedahr> Peter Bouda?
[15:56:02] <bedahr> okay let's give them 5 minutes
[15:56:55] <fregl> what is the topic of the meeting? I have to admit that I'm not following that closely ;)
[15:57:10] <bedahr> fregl: basically "what the hell is going on?"
[15:57:44] <bedahr> a general where we are and where we are going (in the near future) meeting
[15:57:50] <fregl> hehe, sounds good, I keep asking myself something like that
[15:57:59] <bedahr> haha yeah
[15:58:27] <fregl> did you decide on only working on speech recognition btw?
[15:58:46] <bedahr> well it's my immediate focus
[15:58:49] <fregl> in a sense tts is solved on a certain level I guess
[15:58:50] <fregl> ok
[15:58:52] <bedahr> ah
[15:58:54] <bedahr> yeah, that's true
[15:58:58] --> skpvox (~simon at 182-248.63-188.cust.bluewin.ch) has joined #kde-speech
[15:59:04] <fregl> that's cool
[15:59:04] <bedahr> I mean Simon does do TTS to a certain extent
[15:59:10] <skpvox> hi
[15:59:23] <fregl> what are you using for TTS? jovie?
[15:59:29] <bedahr> but the truth is that tts is imho fairly usable already using only foss stuff. asr, however,...
[15:59:41] <ralfherzog> I saw your dictation demonstration video on your blog, bedahr. it is working so much better ... :)
[16:00:02] <bedahr> we have backends for Jovie, web services (specifically OpenMARY), and pre-recorded text snippets
[16:00:11] <bedahr> okay, Simon is here. Hello skpvox!
[16:00:13] <fregl> cool
[16:00:17] <fregl> I
[16:00:22] <bedahr> and the 5 minutes are over
[16:00:22] <fregl> I'll shut up now :)
[16:00:24] <bedahr> let's just start
[16:00:27] <bedahr> haha, no that's fine
[16:00:46] <bedahr> okay, first point of our super detailed agenda: "Status report of the participants"
[16:00:50] <bedahr> I'll start, alright?
[16:01:57] <skpvox> ok
[16:01:59] <bedahr> so as most of you probably know, I was abroad / traveling for about the last 6 months now. That meant that the amount of free time I had to spend on KDE speech was really, really low.
[16:03:05] <bedahr> I tried my best to stay the communications hub and keep the team I had built up over the summer engaged but without me constantly bugging everyone, I'm sad to report that most of the team dispersed fairly quickly
[16:03:35] <bedahr> okay enough sad news
[16:03:38] <bedahr> now the good stuff
[16:03:47] <bedahr> I did make / attend a small KDE sprint in October where I tried to finish up the robust aligner subproject
[16:04:41] <bedahr> for those who don't know: the robust audio aligner is a project to allow us to align potentially very long recordings to their rough transcriptions (think audiobooks and their associated ebook). The output would be individual sentence chunks with their associated transcription
[16:06:17] <bedahr> I'm happy to report that I have a basically working prototype, but I am not yet entirely convinced by its efficiency. It does not produce any false transcriptions (yay) but depending on the input, it disregards a lot of actual matches so you end up with very little audio extracted compared to the computational complexity. However, I believe this to be solvable with simple parameter tuning
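(For concreteness, the output bedahr describes could look roughly like the sketch below; the chunk structure and the tab-separated file format are illustrative assumptions, not the actual prototype's output.)

    # Hypothetical shape of the aligner output: each matched sentence becomes
    # a chunk carrying its time span in the recording plus its transcription.
    from dataclasses import dataclass

    @dataclass
    class AlignedChunk:
        start: float  # seconds from the beginning of the recording
        end: float
        text: str     # the sentence taken from the rough transcription

    def write_chunks(chunks, path="alignment.tsv"):
        with open(path, "w") as out:
            for c in chunks:
                out.write(f"{c.start:.2f}\t{c.end:.2f}\t{c.text}\n")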
[16:06:57] <bedahr> so that's fairly good. On the outreach / PR side, I have been more active
[16:07:55] <bedahr> I got Simon accepted to Facebook's Open Academy program and I now mentor 3 students of the University of Texas on a project to extend Simon's dialog system to take advantage of the new large vocabulary capability and to allow much more natural dialog sequences (think Siri)
[16:08:32] <bedahr> also, in a few days' time, I will leave for India to attend conf.kde.in to give a talk about the integration of Simon in 3rd party applications.
[16:09:04] <bedahr> okay, that's about all I did recently. I'm just gonna ping people for their report now. skpvox: you're up!
[16:09:09] <skpvox> ok
[16:09:25] <skpvox> i have been mostly busy with other projects since your absence
[16:09:35] <skpvox> nevertheless, I managed to get quite some stuff done in regards to ASR
[16:10:03] <skpvox> I have done a lot of work on a project for language learners
[16:10:19] <skpvox> and for this project I also wrote a robust audio aligner
[16:10:30] <skpvox> it's still work in progress, but here are some of the features
[16:10:36] <skpvox> * it is parallelized and can do a 20 hour audio book in about 10-15 minutes
[16:10:44] <skpvox> * it works on large audio files (tested on 20+ hours audiobooks)
[16:10:59] <skpvox> * it works for nearly every language (as long as it has at least an espeak dictionary)
[16:11:12] <skpvox> * bad alignments can be re-processed after decoding has finished
[16:11:16] <skpvox> * it calculates a confidence for each sentence based on the ASR
[16:11:21] <skpvox> * when used in semi-supervised mode it does live acoustic adaptation to improve results
[16:11:25] <skpvox> * it identifies sentences that do not begin/end with a period of silence (these are most likely cut off)
[16:12:33] <bedahr> side question: still the decoding + moses alignment route?
[16:12:38] <skpvox> yes
[16:12:41] <skpvox> not moses, hunalign
[16:12:56] <bedahr> ah
[16:12:57] <bedahr> yeah
[16:12:58] <bedahr> okay
[16:13:11] <bedahr> how do you do decoding for "nearly every language"?
[16:13:18] <skpvox> i generate a dictionary using espeak
[16:13:25] <skpvox> it works well enough
[16:13:35] <skpvox> (except for french, because the dictionary is a mess)
[16:13:38] <bedahr> haha
[16:13:52] <bedahr> yeah I mean I get that part, but where do you get the acoustic model that covers all those phonemes?
[16:13:59] <bedahr> or does espeak use a heavily reduced set?
[16:14:33] <skpvox> I use an english acoustic model. Missing phonemes so far have not been an issue for the alignment
[16:14:44] <bedahr> interesting
[16:14:55] <bedahr> that explains why espeak sounds so crappy xD
[16:15:00] <skpvox> it works for: german, english, swedish, italian, spanish, polish
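(To make the espeak step concrete: a minimal sketch of generating a pronunciation dictionary from a word list, assuming espeak is installed. The file names and the use of espeak's -x phoneme output are assumptions for illustration, not skpvox's actual code.)

    # Minimal sketch: build a pronunciation dictionary with espeak.
    # -q suppresses audio, -x prints espeak's phoneme mnemonics, -v picks the language.
    import subprocess

    def espeak_phonemes(word, lang="de"):
        result = subprocess.run(["espeak", "-q", "-x", "-v", lang, word],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    with open("words.txt") as words, open("lexicon.dict", "w") as lexicon:
        for word in (w.strip() for w in words if w.strip()):
            lexicon.write(f"{word}\t{espeak_phonemes(word)}\n")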
[16:15:15] <bedahr> anyway, that is fairly impressive and certainly much, much farther along than my version
[16:15:47] <skpvox> i will meet with one of the KALDI developers for a 2 day session tomorrow
[16:15:51] <bedahr> the reason I wanted to go the alignment route over decoding was performance but if you say that your system is fast enough (and it certainly looks like it), it may not be worth the hassle
[16:16:05] <bedahr> yeah, sorry for interrupting, carry on
[16:16:27] <skpvox> it's fast enough if parallelized, but resources are relatively cheap ($5 a month per box. 32 boxes are enough)
[16:17:22] <bedahr> okay
[16:17:28] <skpvox> that's basically it from my side
[16:17:28] <bedahr> what are you meeting about?
[16:17:43] <skpvox> he will give me a small introduction to KALDI (on a paid basis)
[16:17:59] <bedahr> okay
[16:18:01] <skpvox> he was one of the researchers working on SIRI
[16:18:15] <bedahr> nice, that's surely going to be interesting
[16:18:31] <skpvox> former DARPA CALO
[16:19:36] <bedahr> is your aligner system open source?
[16:20:12] <skpvox> not right now
[16:20:26] <bedahr> :(
[16:20:29] <bedahr> planning to open source?
[16:20:55] <skpvox> considering it, shouldn't be a problem as it's quite simple
[16:21:15] <bedahr> yeah, you should
[16:21:19] <bedahr> obviously :)
[16:21:31] <unormal_> skpvox: You can even open source complex stuff (SCNR ;-)
[16:21:32] <skpvox> it would also need some clean up, as otherwise i'd probably be the only person understanding it ;)
[16:21:47] <bedahr> haha, that can be done in the open too, though
[16:22:08] <bedahr> okay
[16:22:17] <bedahr> up next: ralfherzog; wanna tell us a bit about your situation?
[16:22:24] <ralfherzog> sure.
[16:22:44] <ralfherzog> My focus has always been PLS dictionary development.
[16:23:09] <ralfherzog> Here are the available dictionaries: http://spirit.blau.in/simon/import-pls-dictionary/
[16:23:58] <skpvox> ralfherzog: combining your dictionaries with the aligner could be interesting
[16:24:14] <ralfherzog> I would like to help if someone is interested in improving one of the dictionaries for their own language. With little work, a lot of improvement is possible
[16:24:33] <bedahr> skpvox: phonetically rich dictionaries (ralfherzog's use the IPA) would require dedicated acoustic models, though
[16:25:05] <skpvox> we can probably do a rough alignment with the english one first and then build a dedicated model once we have sufficient data
[16:25:18] <bedahr> sure, for English we don't have an issue
[16:25:26] <bedahr> the problem is bootstrapping
[16:25:50] <bedahr> but I love that espeak apparently uses a very, very small set of (common) phonemes (otherwise it wouldn't work)
[16:25:53] <skpvox> i meant we use the english one for other languages as an initial dictionary and then build upon the high confidence utterances that we extracted from that
[16:26:21] <bedahr> ah, sure, but that means you never ever cover the phonemes that aren't covered by the English one
[16:26:29] <bedahr> so it doesn't really help the bootstrapping
[16:26:53] <skpvox> why not? i guess even sentences with non-english phonemes would be aligned via the english AM
[16:27:29] <bedahr> the decoder can't produce a hypothesis containing a word that contains a phoneme that's not in your AM.
[16:27:39] <skpvox> the majority of sentences that fail during alignment are due to sloppy pronunciation, missing words, ambiguities (affects confidence) and no pause between sentences
[16:28:06] <bedahr> yeah
[16:28:29] <skpvox> bedahr: the decoder would use the espeak dictionary for the initial alignment, if a phoneme is missing the dict entry would be "odd" (because it is missing one phoneme), but it still exists
[16:28:58] <bedahr> yes, rewriting the transcriptions to a "safe" subset at first would work
[16:29:16] <bedahr> and then build a proper AM with the result of that (and the real dict)
[16:29:18] <bedahr> and then re-align
[16:29:20] <bedahr> sounds good
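(A hypothetical sketch of the "safe subset" rewriting just agreed on: phonemes missing from the English acoustic model get mapped to acoustically close ones. The mapping table below is invented for illustration and would have to be chosen by hand per language.)

    # Illustrative only: rewrite a pronunciation to phonemes the English AM knows.
    SAFE_MAP = {
        "y:": "i:",   # e.g. a front rounded vowel approximated by /i:/
        "x":  "k",    # e.g. the German "ach" sound approximated by /k/
    }

    def to_safe_subset(pronunciation, known_phonemes):
        rewritten = []
        for phoneme in pronunciation.split():
            if phoneme in known_phonemes:
                rewritten.append(phoneme)
            else:
                # fall back to the original symbol if no mapping is defined
                rewritten.append(SAFE_MAP.get(phoneme, phoneme))
        return " ".join(rewritten)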
[16:29:26] <skpvox> my alignment aligns all sentences, so also the ones with non-english phonemes
[16:29:45] <skpvox> i'd be surprised if all those with non-english phonemes have low confidence
[16:29:55] <skpvox> ok
[16:30:48] <bedahr> what do you mean? if the transcription is "A. B! X." and "B" contains a phoneme that is not in your dictionary then at best your hypothesis can be "A. K! X." how would that not garner a low confidence score?
[16:31:46] <bedahr> but anyway, it's certainly doable if you preprocess the dictionary (potentially even manually to rewrite it to "acoustically close" phones)
[16:31:48] <skpvox> the confidence is taken from hunalign
[16:32:13] <skpvox> it's based on how many words align with the reference
[16:32:15] <bedahr> you mean it'd be high because "K" would sound similar to "B" and would thus be recognized in most cases where "B" would be used?
[16:32:16] <skpvox> and how different they are
[16:32:41] <skpvox> yes
[16:33:01] <skpvox> the hypothesis might be "A. B. X."
[16:33:21] <skpvox> the confidence is using a number of factors, if it's e.g. a 20 word sentence
[16:33:33] <bedahr> hm, I want to talk a bit more about this but I feel we should do it after the main meeting. could we leave it for now and come back to it later?
[16:33:38] <skpvox> words in the middle of a sentence have less impact on confidence
[16:33:43] <skpvox> ok
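(A toy illustration of the word-overlap idea behind the confidence score discussed above. This is not hunalign's actual scoring, just the intuition that a hypothesis matching most reference words still scores reasonably high even if one word is off.)

    # Toy confidence: fraction of reference words that also appear in the hypothesis.
    def overlap_confidence(hypothesis, reference):
        hyp_words = hypothesis.lower().split()
        ref_words = reference.lower().split()
        if not ref_words:
            return 0.0
        matches = sum(1 for word in ref_words if word in hyp_words)
        return matches / len(ref_words)

    # overlap_confidence("a k x", "a b x") -> 0.66..., still a usable match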
[16:33:56] <bedahr> alright. ralfherzog: are you still active at building your dictionaries?
[16:34:26] <ralfherzog> No, at the moment not. But I would like to help if someone is interested in improving one of my dictionaries
[16:34:32] <bedahr> okay
[16:34:32] <skpvox> bedahr: i have to leave right now to a meeting. I will check the backlog later.
[16:34:44] <bedahr> skpvox: okay, ping me later
[16:34:53] <skpvox> overall I suggest we use small tasks on trello to organize this project
[16:35:05] <bedahr> yeah, I'd like to talk to you anyway
[16:35:07] <skpvox> we use trello and it works really well if the cards are small enough
[16:35:09] <skpvox> ok
[16:35:11] <skpvox> let's talk later
[16:35:13] <bedahr> bye
[16:35:45] <bedahr> okay, since Peter and Mathias apparently did not turn up, the only one left on my list is unormal_
[16:35:52] <bedahr> (just for the record, Mario)
[16:36:12] <unormal_> Not much from me. 
[16:36:40] <unormal_> Besides family, work-work, studies and the Randa Meetings, I've just got time to lurk a bit and stay up to date on what you guys are doing.
[16:36:47] <bedahr> okay
[16:36:50] <bedahr> yeah, no worries
[16:37:04] <bedahr> alright, then let's move on to short and medium term planning
[16:37:21] <bedahr> ugh, would be good if Simon would still be around. I'll definitely talk to him later about this as well
[16:38:23] <bedahr> basically, I have a few tasks that I think would be very important short term. First of all is the aligner / transcriber project that we already touched on today. it's by far our best chance to build a sustainable system for creating high-quality acoustic models
[16:40:30] <bedahr> it appears that Simon already has something (close to?) production ready, so if we could get access to that, that would be great. Otherwise, I will have to complete my version (different approach - different benefits and drawbacks). The main problem with this is infrastructure which isn't free and - at scale - also not *that* cheap. Again, Simon repeatedly said that he had systems available for that so I need to talk to him about that
[16:41:16] <bedahr> the other side are 2 promo tasks: 1. I think we should get a website up for the kde-speech project. It doesn't have to be at all fancy but should say what it is, what it is about, how to join, where to find us, etc.
[16:41:52] <unormal_> bedahr: Would it make sense to contact the new visual design team of KDE for this?
[16:41:56] <bedahr> the other promo thing is that I would like to get a tech preview of Simon as a dictation system out in the coming months, probably around this summer
[16:42:10] <bedahr> unormal_: I honestly don't know, I didn't follow that thread too closely. would it?
[16:43:03] <unormal_> If you want something fancy, why not. But we still need written content anyway, I think.
[16:43:31] <bedahr> yeah
[16:43:39] <bedahr> honestly, I think we'd be fine with a stock drupal page as a start
[16:43:50] <bedahr> we'd just need someone to write a bit of stuff down
[16:44:23] -*- unormal_ agrees
[16:44:58] <bedahr> unormal_: do you think that is something you could help with?
[16:45:53] <unormal_> Writing something up? What's your timeline for this?
[16:46:02] <bedahr> to infinity and beyond!
[16:46:05] <bedahr> scnr
[16:46:21] <bedahr> no, honestly, it's been blank for months now, it's not really something we'll need tomorrow
[16:46:50] <bedahr> it's really just to have something official to point people to
[16:47:12] --> mahula (~mahula at e178029047.adsl.alicedsl.de) has joined #kde-speech
[16:47:22] <bedahr> mahula: hey, who are you?
[16:48:03] <unormal_> So are you thinking of something like a simon.kde.org sub-page, or more like openspeech.kde.org, or something different? And on there, some pages with the people and what we do and plan?
[16:48:23] <bedahr> http://speech.kde.org
[16:48:29] <bedahr> and yeah
[16:48:34] <bedahr> I reserved the domain a long while ago
[16:48:39] <bedahr> it has a standard KDE drupal instance on it
[16:49:21] <unormal_> Ah, ok. I add it to my todo list then.
[16:50:15] <bedahr> awesome; I gave you admin rights
[16:50:15] <mahula> bedahr: i'm Mathias, and I'm sorry for being late
[16:50:46] <bedahr> ah Mathias Lenz?
[16:50:51] <mahula> yes
[16:50:54] <unormal_> bedahr: And how (where) do I exercise these admin rights?
[16:50:58] <bedahr> okay, do you want to introduce yourself real quick?
[16:51:13] <bedahr> unormal_: http://speech.kde.org/?q=user
[16:52:10] <mahula> ok. i'm a Berlin-based computational linguist, focusing on speech recognition, interested in linux, open source software and development in C++
[16:53:01] <bedahr> mahula: could you please tell us a bit more about your previous work in the field?
[16:55:33] <mahula> i did my diploma thesis on speech recognition with open source software on handheld devices for elderly people. while working on that i used and kind of studied pocketsphinx, the smaller sister of CMU Sphinx.
[16:57:17] <bedahr> okay
[16:57:46] <bedahr> so do you have a specific area you would like to see yourself work in? research interest maybe?
[17:00:52] <bedahr> while we wait for mahula, ralfherzog: you don't have any free time to spend, right?
[17:01:33] <ralfherzog> I could spend a few hours per week, but I got to work. So not that much time, but a little
[17:01:43] <mahula> maybe working on a training database for at least the german language and also on software development ... for example i could imagine using the C-based pocketsphinx, sphinxbase and related training tools for the open speech initiative's purposes
[17:02:19] <bedahr> mahula: okay, so working on the acoustic models / decoders if needed. sounds good
[17:03:03] <bedahr> I'll post the backlog afterwards; we talked a bit about our strategy of using forced alignment of e.g. audio books to generate large corpora of training data for acoustic model generation
[17:03:49] <bedahr> this is something we are currently working on. Especially skpvox has made some great inroads in that area already. We're supposed to meet later (he's in another meeting now), I'd invite you to stick around if you're interested
[17:04:54] <mahula> that sounds good
[17:05:18] <bedahr> organization wise, we do have a trello board already: https://trello.com/b/xwW2oMc0/simon-dictation
[17:05:37] <mahula> ok, i'll check that 
[17:07:07] <bedahr> I've updated the task list a bit to reflect what we talked about and to remove now stale tasks from the queue
[17:07:56] <bedahr> mahula: Are you "Mathias Lenz (mathiaslenz)"?
[17:08:01] <mahula> yes
[17:08:09] <bedahr> okay, I added you to the board
[17:08:25] <mahula> thank you
[17:08:28] <bedahr> that's it from my side. does anyone else have something to add to our meeting's agenda?
[17:09:46] <mahula> no, not yet, i think from my side
[17:10:17] <bedahr> okay, then I'd like to conclude this meeting
[17:10:19] <bedahr> thank you for attending
[17:10:56] <bedahr> mahula: if you want to, just idle here and I'll ping you when Simon is back (if I'm at the computer myself by then, otherwise we'll have to reschedule; in any case, I'll keep you apprised)
[17:10:56] <unormal_> Thanks for hosting.
[17:11:01] <bedahr> thanks
[17:11:24] <ralfherzog> yes, thanks for hosting
[17:11:24] <-> unormal_ is now known as unormal
[17:12:06] <mahula> thanx for having me

