Language Resources

Current methods in language technology rely heavily on large speech and text corpora. These corpora can be classified as general or application-specific. Typically, general corpora are used to extract information of a generic nature and to obtain initial models (e.g., acoustic models, grammar models, dialog models) and rules. The application-specific corpora are then used to fine-tune the initial models and rules to achieve enhanced performance for a specific application.

For Norwegian, a small number of generic speech corpora exist, produced through participation in the EU projects SAM and SpeechDat(II). Hardly any Norwegian text corpora are large enough to support corpus-based methodologies for spoken dialog systems. A corpus will be created through the University of Bergen's participation in the EU project PAROLE, although even larger corpora are desirable, e.g. for statistical language modelling for speech recognition.

Also, a smaller speech and text corpus for the bus traffic information domain has been collected through the TABOR project.

It would be highly desirable, both for the present project and for future research in spoken language technology, which depends on a sufficient language resource infrastructure, to produce a large speech corpus with good phonetic, acoustic and dialectal coverage. Such an undertaking is well beyond the current scope, so this project will aim to exploit existing general speech and text corpora. There will, however, be a need to produce application-specific corpora.

Corpus collection and annotation is a costly and time-consuming process, and it is important that a large portion of the collected data have a life-span beyond the present project. This will be facilitated by observing the recommendations for corpus production from the ESPRIT SAM project, the EAGLES working groups and the SpeechDat(II) project.

Project Goals in Language Resources

The following language resources will be produced as part of the project:
Part I:
  1. Wizard-of-Oz (WoZ) recordings with a relatively small number (10-20) of subjects, to verify the dialog and to collect realistic speech for initial training of speech recognizers.
  2. Database collections of written simulations (cf. BUSTUC).
  3. WoZ recordings with an intermediate number (~100-150) of subjects, using the final version of the dialog structure, for ASR training.
Part II:
  1. Recordings of dialogs with naive users interacting with an initial version of the automatic dialog system. The subjects should be chosen to give representative coverage of speaker variation, including dialectal variation. The recordings will be used to improve the different parts of the system, i.e. the recognizer, the dialog structure, etc.

For further information contact Professor Torbjørn Svendsen
Last modified: Tue Oct 21 15:40:26 MET DST 1997