A Transcribed Database of Broadcast News for Applications in Language Technology  
 
About the project

The project's main purpose is to collect a speech database. RUNDKAST is financed by the Faculty of Information Technology, Mathematics and Electrical Engineering at NTNU.

 

Goals

The main goal of the RUNDKAST project is to establish a digital database of audio recordings from radio news broadcasts that is transcribed and annotated for research in language technology. The database will be part of a necessary infrastructure for language technology research in Norway, and will be included in a future national Norwegian language resource bank.

Project summary

A prerequisite for language technology is access to large amounts of language data. A database of broadcast news is a reasonable alternative for the immediate future. It can help reduce the current lack of speech data for ongoing research activities. At the same time, it can supplement existing but currently unavailable (proprietary) language data, and be a natural part of a future national language resource bank. Databases of broadcast news can be used for a number of research topics in speech technology, such as:

  • supplement to existing databases for training, speaker adaptation and testing of speech recognition on read speech
  • basic data for research on recognition of spontaneous speech
  • database for research on automatic indexing of audio data
  • database for research on topic and/or speaker segmentation
  • speech/non-speech detection (e.g. background music)
  • entry ticket to international research cooperation on speech technology for broadcast news applications

The project will consist of the following main parts:

  • Establish agreements with information providers (broadcasting companies, particularly Norwegian Broadcasting - NRK)
  • Define a standard for annotation and transcription based on international conventions
  • Establish tools for annotation and transcription, preferably based on existing tools
  • Annotate the database, which includes
    • Orthographic transcription
    • Annotation of speaker turns
    • Annotation of music, jingles, background noise etc.
    • Annotation of speaking style (e.g. manuscript reading, interview/dialogue, debate, spontaneous monologue)
    • Bandwidth (studio recording, wireless link, telephone, satellite telephone, mobile)
    • Environment (studio, street, sports arena etc.)
    • Language, dialect
  • Validation of recordings and transcriptions
  • Documentation and accessibility
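
The annotation layers listed above could be collected per speech segment in a single record. The following is only a minimal sketch under that assumption: the class and field names (`Segment`, `SpeakingStyle`, `Bandwidth`, and so on) are hypothetical and are not taken from the RUNDKAST annotation standard; only the category values come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SpeakingStyle(Enum):
    # Values taken from the speaking-style examples in the project list.
    MANUSCRIPT_READING = "manuscript reading"
    INTERVIEW_DIALOGUE = "interview/dialogue"
    DEBATE = "debate"
    SPONTANEOUS_MONOLOGUE = "spontaneous monologue"


class Bandwidth(Enum):
    # Values taken from the bandwidth examples in the project list.
    STUDIO = "studio recording"
    WIRELESS_LINK = "wireless link"
    TELEPHONE = "telephone"
    SATELLITE_TELEPHONE = "satellite telephone"
    MOBILE = "mobile"


@dataclass
class Segment:
    """One annotated stretch of audio (hypothetical layout)."""
    start: float                      # seconds from start of recording
    end: float                        # seconds from start of recording
    text: str                         # orthographic transcription
    speaker: Optional[str] = None     # speaker turn label
    style: Optional[SpeakingStyle] = None
    bandwidth: Optional[Bandwidth] = None
    environment: Optional[str] = None  # e.g. "studio", "street"
    language: Optional[str] = None
    dialect: Optional[str] = None
    background: List[str] = field(default_factory=list)  # e.g. "music", "jingle"

    def duration(self) -> float:
        """Length of the segment in seconds."""
        return self.end - self.start


# Illustrative values only, not data from the database.
example = Segment(
    start=0.0,
    end=4.5,
    text="good evening",
    speaker="anchor",
    style=SpeakingStyle.MANUSCRIPT_READING,
    bandwidth=Bandwidth.STUDIO,
    environment="studio",
    background=["jingle"],
)
```

Using enumerations for the closed categories (speaking style, bandwidth) and free text for the open ones (environment, dialect) is one possible design choice; the actual annotation standard may draw that line differently.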

 

 Last update: Nov 14, 2006
 

 

Current status

Approximately 80 hours of speech have been transcribed. The transcription was performed by 4th- and 5th-year students at NTNU using Transcriber 1.5.1, which produces XML transcription files. The database is now being organized and checked prior to formal (external) validation, which is expected to be completed before the end of 2006.
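
As an illustration of how XML transcription files of this kind might be processed, the sketch below sums speaking time per speaker. It assumes the general Transcriber layout (`Speaker` elements with `id` and `name`, and `Turn` elements with `speaker`, `startTime` and `endTime` attributes); the sample content is hand-written for this example and is not taken from the RUNDKAST database, whose files may differ in detail. The function name `speech_seconds_per_speaker` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the assumed shape of a Transcriber file.
SAMPLE_TRS = """<Trans version="1" audio_filename="example">
  <Speakers>
    <Speaker id="spk1" name="Anchor"/>
    <Speaker id="spk2" name="Reporter"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0.0" endTime="12.5">
      <Turn speaker="spk1" startTime="0.0" endTime="5.0">
        <Sync time="0.0"/>
        good evening
      </Turn>
      <Turn speaker="spk2" startTime="5.0" endTime="12.5">
        <Sync time="5.0"/>
        thank you
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def speech_seconds_per_speaker(trs_text):
    """Return a dict mapping speaker name to total turn duration in seconds."""
    root = ET.fromstring(trs_text)
    # Map speaker ids to readable names; fall back to the id if no name is found.
    names = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    totals = defaultdict(float)
    for turn in root.iter("Turn"):
        duration = float(turn.get("endTime")) - float(turn.get("startTime"))
        # A turn without a speaker attribute (e.g. music only) is skipped;
        # a turn listing several ids counts fully for each of them.
        for spk in (turn.get("speaker") or "").split():
            totals[names.get(spk) or spk] += duration
    return dict(totals)


print(speech_seconds_per_speaker(SAMPLE_TRS))
# prints {'Anchor': 5.0, 'Reporter': 7.5}
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.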

We are now preparing the validation of the database. In addition, we plan to perform phonemic annotation of a small portion of the database. The immediate application for the phonemically annotated speech is research in the SIRKUS project.