Project summary
Language technology depends on access to large amounts of language data. A database of broadcast news is a reasonable way to meet this need in the immediate future: it can reduce the current shortage of speech data for ongoing research activities. It can also supplement existing but currently unavailable (proprietary) language data, and form a natural part of a future national language resource bank. Databases of broadcast news can be used for a number of research topics in speech technology, such as:
- supplement to existing databases for training, speaker adaptation and testing of speech recognition on read speech
- basic data for research on recognition of spontaneous speech
- database for research on automatic indexing of audio data
- database for research on topic and/or speaker segmentation
- speech/non-speech detection (e.g. background music)
- entry ticket to international research cooperation on speech technology for broadcast news applications
The project will consist of the following main parts:
- Establish agreements with information providers (broadcasting companies, particularly Norwegian Broadcasting - NRK)
- Define a standard for annotation and transcription based on international conventions
- Establish tools for annotation and transcription, preferably based on existing tools
- Annotate the database, which includes:
  - Orthographic transcription
  - Annotation of speaker turns
  - Annotation of music, jingles, background noise etc.
  - Annotation of speaking style (e.g. manuscript reading, interview/dialogue, debate, spontaneous monologue)
  - Bandwidth (studio recording, wireless link, telephone, satellite telephone, mobile)
  - Environment (studio, street, sports arena etc.)
  - Language, dialect
- Validation of recordings and transcriptions
- Documentation and accessibility
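
The annotation layers listed above could be captured in a single per-segment record. The following is only a minimal sketch under stated assumptions, not the project's annotation standard, which is still to be defined: the names Segment, SpeakingStyle, Bandwidth and speaker_turns are hypothetical, the category values are taken from the examples in the list, and the sample segments are made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class SpeakingStyle(Enum):
    # Values mirror the speaking-style examples in the project summary.
    MANUSCRIPT_READING = "manuscript reading"
    INTERVIEW_DIALOGUE = "interview/dialogue"
    DEBATE = "debate"
    SPONTANEOUS_MONOLOGUE = "spontaneous monologue"


class Bandwidth(Enum):
    # Values mirror the bandwidth examples in the project summary.
    STUDIO = "studio recording"
    WIRELESS_LINK = "wireless link"
    TELEPHONE = "telephone"
    SATELLITE_TELEPHONE = "satellite telephone"
    MOBILE = "mobile"


@dataclass
class Segment:
    """One annotated stretch of audio (hypothetical record layout)."""
    start: float                              # seconds from start of recording
    end: float                                # seconds from start of recording
    speaker: Optional[str] = None             # None for non-speech segments
    transcription: str = ""                   # orthographic transcription
    style: Optional[SpeakingStyle] = None
    bandwidth: Optional[Bandwidth] = None
    environment: Optional[str] = None         # e.g. "studio", "street"
    language: Optional[str] = None
    dialect: Optional[str] = None
    background: List[str] = field(default_factory=list)  # e.g. ["music"]

    def duration(self) -> float:
        return self.end - self.start


def speaker_turns(segments: List[Segment]) -> List[Tuple[Optional[str], float, float]]:
    """Merge consecutive segments by the same speaker into (speaker, start, end) turns."""
    turns: List[Tuple[Optional[str], float, float]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker continues: extend the current turn.
            turns[-1] = (seg.speaker, turns[-1][1], seg.end)
        else:
            turns.append((seg.speaker, seg.start, seg.end))
    return turns


# Made-up sample data, for illustration only.
example = [
    Segment(0.0, 4.5, speaker="anchor", style=SpeakingStyle.MANUSCRIPT_READING,
            bandwidth=Bandwidth.STUDIO, environment="studio"),
    Segment(4.5, 9.0, speaker="anchor", style=SpeakingStyle.MANUSCRIPT_READING,
            bandwidth=Bandwidth.STUDIO, environment="studio"),
    Segment(9.0, 12.0, speaker="reporter", style=SpeakingStyle.INTERVIEW_DIALOGUE,
            bandwidth=Bandwidth.MOBILE, environment="street", background=["noise"]),
]
example_turns = speaker_turns(example)
```

Keeping speaker turns as something derived from time-stamped segments, rather than stored separately, is one possible design choice; it keeps the transcription, speaker and acoustic-condition layers aligned on the same time boundaries.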
