Language Resources
Current methods in language technology rely heavily on the use
of large speech and text corpora. These corpora can be classified as general or
application-specific. Typically, general corpora are used to extract
information of a generic nature, often to obtain initial
models (e.g., acoustic models, grammar models, dialog models) and rules.
Application-specific corpora are then used to fine-tune the initial models
and rules to achieve better performance in a specific application.
For the Norwegian language, there exists a small number of generic
speech corpora produced through participation in the EU projects SAM
and SpeechDat(II). Hardly any Norwegian text corpora are large enough
to support corpus-based methodologies for spoken dialog systems.
The University of Bergen will create a corpus through its participation
in the EU project PAROLE, although even larger corpora are desirable
for, e.g., statistical language modelling for speech recognition.
In addition, a smaller speech and text corpus for the bus traffic information
domain has been collected through the TABOR project.
Although it would be highly desirable, both for the present project
and for future research in spoken language technology which depends
upon a sufficient language resource infrastructure, to produce a large
speech corpus with good phonetic, acoustic and dialectal coverage,
such an undertaking is well beyond the current scope. In this project
we will aim to exploit existing general speech and text corpora. There
will, however, be a need to produce application-specific corpora.
Corpus collection and annotation are costly and time-consuming,
so it is important that a large portion of the collected
data remains useful beyond the present project. This will be
facilitated by observing the recommendations for corpus production from
the ESPRIT SAM project, the EAGLES working groups and the
SpeechDat(II) project.
Project Goals in Language Resources
The following language resources will be produced as part of the
project:
- Part I:
- Wizard-of-Oz (WoZ) recordings with a relatively small number (10-20) of
subjects, to verify the dialog and to collect realistic speech
for initial training of speech recognizers.
- Database collections of written simulations (cf. BUSTUC)
- WoZ recordings with an intermediate number (~100-150) of
subjects using the final version of the dialog structure, for ASR
training
- Part II:
- Recordings of dialogs with naive users interacting with an initial
version of the automatic dialog system. The selection of
speakers should give representative coverage of
speaker variation, including dialectal variation. The
recordings will be used to improve the different parts of
the system, i.e. the recognizer, the dialog structure, etc.
For further information contact
Professor Torbjørn Svendsen
Last modified: Tue Oct 21 15:40:26 MET DST 1997