Project summary
Even at its best, current automatic speech recognition (ASR) performance is one order of magnitude below human performance. A new statistical framework is needed that incorporates knowledge sources in a combined knowledge-based and data-driven paradigm. The project is part of a joint international effort to develop the next generation of speech technology, knowledge-rich speech processing, and will focus on speech signal processing.
The full system will be applied to information retrieval tasks on the RUNDKAST database, an audio database of Norwegian broadcast news shows. For comparison, a baseline HMM system will be implemented alongside the knowledge-rich system.
The project will consist of three interconnected activities:
- 1. Front-end development.
- The purpose of the ASR front end is to extract all necessary information for the task of discriminating sounds, words and utterances in a manner that is maximally robust to irrelevant variations. We will investigate and develop a set of analysis and detection algorithms based on knowledge of human speech production, perception and cognitive processing.
- 2. Statistical framework.
- In contrast to current systems, the proposed front end will produce a stream of temporally asynchronous and statistically dependent observations. This requires a different statistical framework for bottom-up verification, evaluation and combination of hypotheses, from front-end observations up to sentence hypotheses.
- 3. Spoken information retrieval.
- Vast amounts of information are stored in audio and multimedia archives worldwide. Most of the spoken information is not transcribed, and thus not text-searchable. Speech recognition is a means either of automatically transcribing spoken audio or of directly searching audio files by keyword. In this activity, the new algorithms will be tested and benchmarked against conventional technology for the tasks of transcription and information retrieval on the RUNDKAST database.
Project goals
The main goal of the project is to lay a foundation for the next generation of speech processing algorithms, which will have the potential to achieve near-human performance. The project will be part of an international research network, and will contribute in particular to three areas:
- developing knowledge-based speech analysis methods that take into account the properties of the speech signal as well as human perception;
- developing a statistical framework for combining asynchronous and partly redundant information sources in order to exploit the new analyses; and
- verifying the performance of the new methods in a multilingual setting, in particular by providing experimental results for Norwegian.
The methods will be applied to the task of information retrieval from audio databases, in particular for a database of Norwegian broadcast news shows.