Speech Interaction under Linux
These days I have been trying to create a speech interaction system using free software on top of Arch Linux. It took me quite some time to find good software, and I've written an interface that might be interesting for others too, so I've decided to share my experiences. There are different levels one might want to achieve in speech interaction. The ultimate goal probably is to control the whole computer by voice alone, including dictating texts in your native language that are instantly turned into written words. According to what I've read, this is already possible under Linux; I could not get it to work, though. Windows Vista already ships software that works quite well (I've tried it), and Windows 7 is said to be even better. It seems the community still has some catching up to do. So far I have managed to create and use a simple dialogue system with a limited input vocabulary.
Basics of Speech Recognition
Basically, a speech recognition engine needs several model files in order to correctly recognise speech in a given language:
- An acoustic model that formally describes single sounds (phonemes) - the challenge here is that every person has their own characteristic voice
- A language model that describes how words can be combined (e.g. a formal grammar or an n-gram model) - this can be omitted, but it is practically indispensable for good results
- A dictionary that describes the pronunciation of every input word
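To make the last item concrete, here is what a pronunciation dictionary looks like in the CMU/Sphinx format: each line maps a word to its phoneme sequence, and alternate pronunciations get a numbered suffix. (These entries follow the conventions of the English CMU Pronouncing Dictionary; treat them as an illustration, not an excerpt.)

```
HELLO      HH AH L OW
WORLD      W ER L D
TOMATO     T AH M EY T OW
TOMATO(2)  T AH M AA T OW
```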
Acoustic models are often created in the training phase of the speech recognition system. There are, however, ready-to-use acoustic models for several languages. According to Wikipedia, there are speaker-dependent and speaker-independent acoustic models. The latter obviously are more difficult to create, and they are typically limited to a smaller vocabulary. Acoustic models are commonly represented as Hidden Markov Models (HMMs).
When a speech engine processes input, it prepares several hypotheses to be tested. The hypotheses are assigned a score based mainly on the language model. If the language model is a formal grammar, the system will check whether the combination of parts of speech determined in the hypothesis is valid for the current language. If it is an n-gram model, it will have a large database of word sequences, each n words long (usually two or three). These sequences statistically represent common combinations of words. The language model can be omitted from the process, interpreting every single word on its own, but in that case the quality of the results worsens notably.
With the help of the phonetic transcriptions in the dictionary, the system can look up the sounds it "heard" and determine which word they represent.
Software and formats
I have encountered two major recognition engines so far: CMU Sphinx and Julius. The latter is an open-source project from Japan. That's pretty much everything I can say about it, as it failed to work for me when I tried it. I will try to find and cover more engines soon (one might be the Open Mind Speech project). CMU Sphinx comes in different versions: 2 and 3 are written in C, 4 is written in Java. Additionally, there is PocketSphinx, also written in C, and specifically optimized for handheld devices. This is the one I've been experimenting with. All versions are released under a BSD-style license.
One notable project is Simon: it tries to integrate the training phase and voice control in one modern interface and aims at persons with a disability. It uses Julius as a speech recognition backend and HTK (the Hidden Markov Model Toolkit) to deal with acoustic models. Simon itself is free software, but HTK is not: it is available free of charge after registration and its source code can be inspected, but its license restricts redistribution. Unfortunately, Simon keeps segfaulting on my system, rendering it useless for me.
There is a lot more software that operates on the same layer as Simon. Examples are VoxImp and Gnome Voice Control.
Creating a custom dialogue system
This is currently the most important thing for me, and it is really easy. The first thing to do is to create a speech corpus: a plain text file with every sentence you want the computer to recognize, line by line. Then you submit that text file to this Carnegie Mellon University online tool and download a language model and a dictionary suitable for the Sphinx software (and probably for others as well).

Next you need a working copy of PocketSphinx, which is not currently possible through the AUR. You should download the release versions of sphinxbase and pocketsphinx and compile them yourself. Note that before ./configuring sphinxbase, you need to change ./configure.in if you want to use OSS.

As soon as you have libpocketsphinx.so, libsphinxbase.so and libsphinxad.so in /usr/lib and an acoustic model for your desired language (one for English is provided in /usr/share/pocketsphinx/model/hmm/wsj1), you can start programming the dialogue system with libsrs, which I created and which is available through the AUR (or, if you do not use Arch, through git.supraverse.net/libsrs.git). If you are a C++ programmer, take another look at the example at the top of this page. It is really that easy! The only thing you need in addition is to specify the paths to the different models and call Speech::initSpeech(). Have fun!
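As an example, a corpus for a small hypothetical media-player dialogue might look like this (one sentence per line; the vocabulary is entirely up to you, but keeping it small improves recognition accuracy):

```
open the player
play music
pause music
next song
previous song
stop music
```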