1. Introduction
This project plan is a revised and more detailed version of the
description given in the original plan. The project partners also
participate in another, related project supported by NFR: BRAGE –
Brukergrensesnitt med naturlig tale
(www.tele.ntnu.no/projects/brage). This is a project within the
NFR-program KUNSTI. We will aim at establishing close contact between
the two projects in order to benefit from the obvious synergy effect
achievable by co-ordinating the activities in the two projects.
In the following we will describe the main tasks to be performed in the
project at a higher level of detail than given in the original
proposal.
2. Background
In modern society people become more and more dependent on computers and electronic communication to carry out their everyday tasks. A significant step in this development is the introduction of networked palmtop computers. This is due to a continued miniaturisation of electronic circuits and deployment of a ubiquitously available broadband wireless communication infrastructure. This makes it realistic to envisage that in the near future it will be natural for people to carry with them small personal computing devices connected to the internet or a remote server via wireless communication, in the same way as mobile phones now are carried. These systems will provide access to a large set of information and communication services assisting people in whatever they have to do, also when in motion.
The handheld systems will demand a corresponding miniaturised multimodal user interface that should be able to function in an interactive manner. The most natural user interface for a small, mobile device is speech. Because of the limited size of the device it will not be feasible to use a standard type of keyboard. In addition, speech as an I/O device requires very little physical space, and is the most natural and efficient way for humans to communicate. Also, it might be operated in a hands-/eyes-busy situation, and can even function in a multilingual mode. Thus a natural research topic is a user interface based on speech and complementary I/O-options like pen/gestures. These basic I/O-modes should be integrated with the underlying process of dialogue design.
Due to the mobility of users and the characteristics of wireless communication, the operating environment for these services will be more dynamic than what is typical for traditional distributed systems. Building systems that must operate satisfactorily under such conditions will pose new challenges and require new solutions and new engineering methods. Services need to make the most of their surroundings and adapt themselves to overcome any limitations temporarily posed by the current user and network context, and likewise exploit any opportunities for improvement offered. For instance, the service may have to adapt the user interface in response to a change in the conditions surrounding the user, or to the connection quality. Also, the service may have alternative modalities in order to fulfil the needs of a user irrespective of any disabilities (design-for-all).
It is believed that the sales of handheld networked computers will grow rapidly and thus outnumber the sales of traditional PCs. At the same time there will be large investments in the wireless communication infrastructure. Consequently there will be a huge demand for services that utilise this infrastructure and are capable of generating sufficient income to recoup these investments. Norway, together with the other Nordic countries, is a leading region world-wide in the deployment and use of wireless communication and computers, and the Norwegian software industry is in a good position to establish itself as a major player in this market. Therefore it is particularly important to build the competence needed to exploit this opportunity and to disseminate that competence to the industry.
3. Short Project Description
The activity will focus on the convergence between communication systems, advanced dialogue management and spoken language technology. The work will be directed towards a target application: geographical information systems for mobile users. The system will comprise the following parts:
- A small mobile device (e.g. available in a car or as a handheld unit) with multimodal I/O-possibilities (i.e. speech, pen, text, graphics).
- A communication link between the mobile device and a remote server.
- A server containing the complete information database and most of the dialogue system modules. Parts relevant for a particular user may be downloaded to the mobile device for further processing.
- GPS (Global Positioning System) to determine the location of the user.
The foreseen application will provide the user with location-based services, e.g. tourist information about the area where the user is currently located. Information may be presented to the user by speech, text or graphics (maps). The user interface should be adaptable to the varying needs of the user. For instance, in very noisy environments a written response is preferable to a spoken one. Also, the system should be able to adapt to the resources available at a particular time.
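As a minimal sketch of the adaptation described above, the following Python fragment picks an output modality from the ambient noise level and the available link capacity. The names, the 70 dB noise limit and the 64 kbit/s figure are illustrative assumptions and are not taken from the project plan.

```python
from dataclasses import dataclass

# Illustrative thresholds (assumptions, not values from the project plan).
NOISE_LIMIT_DB = 70.0   # above this, spoken output is assumed hard to hear
MAP_MIN_KBITS = 64.0    # below this, maps are assumed too slow to download


@dataclass
class Context:
    noise_db: float           # ambient noise level around the user
    bandwidth_kbits: float    # currently available link capacity
    hands_busy: bool = False  # e.g. the user is driving


def choose_output(ctx: Context, wants_map: bool) -> str:
    """Return 'speech', 'text' or 'graphics' for the next system response."""
    # Maps only help if the user can look at the screen and the link can carry them.
    if wants_map and not ctx.hands_busy and ctx.bandwidth_kbits >= MAP_MIN_KBITS:
        return "graphics"
    # In a very noisy environment a written response is preferable to a spoken one.
    if ctx.noise_db > NOISE_LIMIT_DB and not ctx.hands_busy:
        return "text"
    return "speech"
```

A real system would presumably derive such rules from user studies rather than from fixed thresholds; the point here is only that the choice depends on both the user context and the network context.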
4. Main Goals
- Award 3 doctorates.
- Develop software and competence for mobile computing and speech based user interfaces.
- Develop a prototype for a mobile geographical information system.
- Produce an average of 3 international publications annually.
5. Project Organisation and Scheduling
The project is organised into the following work packages:
- WP1: Communication System
- WP2: Advanced Dialogue System
- WP3: Speech Technology
- WP4: Prototype Development
- WP5: Exploitation and Dissemination
- WP6: Project Management and Administration
The project is a co-operation between NTNU and SINTEF. Three doctoral students will be employed by the project, one each in WP1, WP2, and WP3. We also expect to supervise final year siv.ing. students each year within the scope of the project.
With respect to project organisation and scheduling, some changes have been necessary compared to the original project proposal:
a) Project management.
Professor Torbjørn Svendsen, NTNU, will serve as the Project Manager from August 2003, when he returns from his sabbatical. Until then, Associate Professor Magne Hallstein Johnsen will be acting Project Manager. Magne Hallstein Johnsen is the Project Manager for the BRAGE project mentioned in the introduction.
b) Start date.
The start date of the project is postponed by 3 months (from 2003-01-01 to 2003-04-01). The main reason is the desire to obtain continuity for the researchers who will be working on the project. Trym Holter at SINTEF will be back from his 27-month stay at Motorola Research Laboratories, Australia, in April 2003, and will immediately be engaged in the project. The change of start date will also shorten the period during which the project has interim project management.
c) End date.
It is assumed that the term of employment for the doctoral students will be 4 years, including one year financed by NTNU. The year financed by NTNU will probably be reciprocated by the equivalent of one work-year teaching assistance by the students, leaving 3 work-years for the doctoral studies. The doctoral students will be commencing their studies in the summer of 2003. In order to co-ordinate the ending of the project with the planned graduation of the doctoral students, the new project end date is 2007-06-30.
6. Detailed Project Outline
In the following we summarise detailed project plans for each of the 6 work packages listed in the previous section. We emphasise that these are preliminary plans, given in as much detail and with as much accuracy as we find possible half a year before project start. A more detailed plan for the first year will be worked out at the project start. We do not expect any changes in the main goals or in the work to be performed; however, the timeline for the various activities may be adjusted.
For the work to be performed, we have a specific target application in mind. However, we will aim at keeping the generic aspect of our research in focus, in order to make generalisation and adaptation to other applications easily feasible.
6.1 WP1: Communication System
6.1.1 Objectives
- Graduate 1 dr. ing. student.
- Define and adapt a specific wireless communication system applicable for the target application.
- Study and evaluate the effects of a varying communication channel on the target application.
- Investigate and implement ways to deal with noisy, time-varying channels for the target application.
6.1.2 Background and state-of-the-art
When moving around, the mobile user gains access to various types of networks, and varying resources are accessible through these networks. The network conditions, such as load and noise conditions also vary, leading to variable Quality of Service (QoS). This is particularly the case for wireless networks, where capacity is limited and error rates high. It will be important to investigate how to cope with these varying conditions and how they affect performance of the application at hand.
The majority of the R&D in the field of mobile multimodal access systems appears to use existing PDAs or palmtop computers to simulate future handsets. Sometimes a GSM card is added to the device; in other cases experiments rely on some kind of wireless LAN for the coupling of the device to the network. Increasingly, communication is based on GPRS or (simulated) UMTS networks. At the same time developments are underway to connect mobile terminals to local and global networks through Bluetooth or similar short-range (Piconet) wireless networks. How the communication shall be performed in the present project is a topic that needs to be considered at an early stage.
A user PDA and a server will typically communicate by UMTS and/or WLAN. In both cases a combination of speech, text and graphics will be transmitted. Typically the speech recognition part must be performed in a server. The total system must show robust performance with respect to varying channel quality. For speech this means that the recognition rate should degrade only gracefully. In mobile/nomadic systems distributed processing is a realistic option, thus speech could be transmitted in ‘raw’ form (as samples), coded (as in GSM) or in a pre-processed form (first part of recognition). A recent EU-project, ETSI/AURORA, has investigated the recognition performance as a function of several such schemes (including GSM) for narrow-band communication. However, future communication protocols and speech quality demands will also lead to problems regarding varying delays and/or lost packets, varying signal-to-noise level, burst errors etc.
6.1.3 Description of Work
The WP involves studies of the types of wireless communication systems applicable for the target application, and adaptation to the particular application. Investigation on the effect of varying channel quality on the system performance (e.g. speech recognition) is an important topic.
An evaluation of various types of wireless communication systems will be performed with the specific target application in mind. Based on this evaluation a decision will be made on what system to use for the target application. The system will be adapted to the target application.
A study on how a noisy, time-varying communication channel affects the target application will be performed. Ways to deal with these effects will be investigated.
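One common way to study such a channel in simulation is a two-state burst-loss model (often called a Gilbert-Elliott model), in which the channel alternates between a good and a bad state so that packet losses arrive in bursts. The sketch below is only an assumption about how such a study could be set up, not the method chosen by the project; all function names, parameter names and default values are hypothetical.

```python
import random


def simulate_losses(n_packets, p_good_to_bad, p_bad_to_good,
                    loss_good=0.0, loss_bad=1.0, seed=0):
    """Return a list of booleans, True where a packet is lost.

    Two-state burst-loss model: the channel starts in the good state and
    may switch state before each packet is sent.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    bad = False
    lost = []
    for _ in range(n_packets):
        # State transition first, then decide the fate of this packet.
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        else:
            if rng.random() < p_good_to_bad:
                bad = True
        p_loss = loss_bad if bad else loss_good
        lost.append(rng.random() < p_loss)
    return lost


def loss_rate(lost):
    """Fraction of packets lost (0.0 for an empty trace)."""
    return sum(lost) / len(lost) if lost else 0.0
```

A loss trace of this kind could be applied to the transmitted speech data, so that recognition performance can be measured as a function of the loss rate and the burst length.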
The thesis topic of the doctoral student will be targeted towards the influence of channel imperfections on recognition performance, and the extent to which advanced techniques (like adaptive modulation etc.) can cope with the degradation of the recognition performance.
6.1.4 Deliverables
- D1.1: Report on evaluation of types of wireless communication systems.
- D1.2: Report on adaptation of the selected wireless communication system for the target application.
- D1.3: Report on the effects of varying channel conditions.
- D1.4: Dr. ing. thesis.
6.1.5 Milestones and Expected Results
- M1.1: spring 2003: Public announcement of dr. ing. grant.
- M1.2: autumn 2003: Decision on the type of wireless communication system to use in the target application. Report on evaluation.
- M1.3: spring 2004: Implementation of the communication link in the target system finished. Report on the implementation and adaptation.
- M1.4: autumn 2004: Report on the effects of varying channel conditions.
- M1.5: spring 2007: Dr. ing. thesis finished.
6.2 WP2: Advanced Dialogue System
6.2.1 Objectives
- Graduate 1 dr. ing. student.
- Define a strategy for a mixed-initiative spoken dialogue system for the target application.
- Evaluate the inclusion of multimodality in the target application.
- Implement a multimodal, mixed initiative dialogue system for the target application.
- Co-ordinate the activity with BRAGE.
6.2.2 Background and state-of-the-art
Most of the services offered will have an interactive form; i.e. the user and the system will co-operate to reach the goal through some kind of dialogue. A dialogue is a tool to help the user reach the goal. Obviously the only adequate criterion for evaluating such systems is whether the user agrees that he/she has got the information he/she required; i.e. whether the dialogue ended successfully. In principle there is an unlimited number of ways to organise such a dialogue, but normally the user must be allowed to have some control over the evolution. However, the success rate will normally be correlated with the ability to split up the dialogue into a modest set of subdialogues each having a restricted task domain and vocabulary. A multimodal mode is mandatory if the user's requests include graphics (maps etc.). Further, the combination of speech with for instance a pen (“tap and talk”) normally leads to a significant improvement with respect to the success rate. Alternative modes of operation are an important aspect of the design-for-all principle, and are needed in order to obtain barrier-free access for impaired people.
The US Defense Advanced Research Projects Agency (DARPA) has expressed its interest in moving speech-centric applications to mobile environments, and contracts have been signed in order to develop mobile multimodal interfaces. A PDA-based prototype has been shown dealing with a GIS-based system. The user may tap and speak simultaneously (“show me how to get there”). At Microsoft, the MIPAD platform demonstrates a “tap and talk” strategy where the user must tap a specific icon, and then start talking. The system has been demonstrated for Personal Information Management (PIM) services (setting up appointments, e-mail handling). The EURESCOM project MUST (“Multimodal multilingual information services for small mobile terminals”) presents a useful overview of the state of the art within this area.
6.2.3 Description of Work
A mixed initiative dialogue system based on multimodal interaction will be developed. The system should be able to adapt to the user with respect to the topic of the request, the amount of simultaneous input information and possible changes of subtopic. A corresponding dialogue structure will be developed. The system should further be able to handle errors (system misunderstandings) and guide the user to reach his/her goals. Therefore, strategies for error handling and verification of the speech input will be investigated. It is assumed that the multimodal interaction on the PDA will be based on “tap and talk” as input, while maps, text and/or synthesised speech and input verification will constitute the system output.
The thesis topic of the doctoral student will be targeted towards the study of multimodal, mixed initiative dialogue analysis and design.
6.2.4 Deliverables
- D2.1: Report on dialogue strategy.
- D2.2: Report on multimodality in the mixed initiative target application.
- D2.3: Report on error handling strategies.
- D2.4: Dr. ing. thesis.
6.2.5 Milestones and Expected Results
- M2.1: spring 2003: Public announcement of dr. ing. grant.
- M2.2: winter 2003/2004: Report on multimodality in mixed initiative target application.
- M2.3: autumn 2004: Implementation of the first version of the dialogue system in the target application.
- M2.4: spring 2005: Report on error handling strategies.
- M2.5: winter 2005/2006: Error-robust implementation of the target application, v.1.
- M2.6: spring 2007: Dr. ing. thesis finished.
6.3 WP3: Speech Technology
6.3.1 Objectives
- Graduate 1 dr. ing. student.
- Record an application specific speech database.
- Evaluate speech recognition engines and speech synthesis systems for use in the target application.
- Develop acoustic models, pronunciation lexicons, and language models for the speech recogniser in the target application.
- Implement online speaker adaptation in the target application.
- Utilise confidence measures in the error handling and verification strategy.
- Develop and implement language identification and a bilingual speech module in the target application.
6.3.2 Background and state-of-the-art
Speech recognition and synthesis have made significant progress in recent years. This has resulted in a variety of speech-based dialogue systems, although mainly over telephone; i.e. with speech-based I/O only. The performance of a dialogue system is measured by the so-called success rate; i.e. the extent to which the user gets the information he/she wants. The success rate is obviously dependent on the speaking style, the dialogue structure, the feedback to the user, and the speech recognition performance (word error rate). Acceptable performance with respect to the latter has been achieved under normal conditions:
- Not too adverse noise background and channel degradation.
- Limited vocabularies corresponding to structured subdialogues.
- Restricted/normalised speaking style.
- Speech quality monitoring.
Dialogue systems with a multimodal user interface have not yet been commercially deployed. However, a lot of research interest has recently been reported. It is believed that the multimodality will add to the performance (success rate) especially in more difficult environments. It is also believed that users subjectively prefer such multimodal interfaces to for instance speech only.
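The word error rate referred to above as the measure of speech recognition performance is conventionally computed as the word-level edit distance between the reference transcription and the recogniser output, divided by the number of reference words. The following Python function is a minimal sketch of that computation; the function name and the example utterances used with it are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For instance, recognising "show the nearest pharmacy" when the user said "show me the nearest pharmacy" counts as one deletion in five reference words. Note that the value can exceed 1.0 when the hypothesis contains many inserted words.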
6.3.3 Description of Work
Robust speech recognition, synthesis, and language identification are the main parts necessary in this WP. No work on speech synthesis will be performed in this project; however we have to decide on which synthesiser to use in the target application. The same goes for the speech recognition engine. We will evaluate and decide upon a particular engine to use for the target application.
Recording in a multimodal environment will be performed. A limited speech database recorded in the application domain will be necessary for domain adaptation, speaker adaptation and test. Additional speech databases (e.g. SpeechDat) will be utilised in order to obtain acoustic models and pronunciation lexicons.
The system will be speaker independent; however, online adaptation will be utilised in order to improve the recognition performance. The dialogue system may utilise confidence information or N-best results from the speech recogniser in its error handling strategy. We will investigate strategies for utilising the available information from the speech recogniser in helping the user to achieve his/her goal. Also, since the target application is very well suited for foreign language speaking tourists, we will incorporate a bilingual speech module in the system. A language identification module therefore has to be developed.
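To illustrate how confidence information and N-best results might feed an error handling strategy, the sketch below maps an N-best list to one of three dialogue actions. It is an assumed, simplified scheme: the thresholds, the action names and the (hypothesis, confidence) data layout are hypothetical and would have to be tuned and defined for the recogniser actually selected.

```python
from typing import List, Tuple

# Illustrative thresholds (assumptions); real values would be tuned on test data.
ACCEPT_ABOVE = 0.80
REJECT_BELOW = 0.40


def next_action(n_best: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Map an N-best list of (hypothesis, confidence) pairs to a dialogue action.

    Returns (action, hypothesis), where action is 'accept', 'confirm' or 'reprompt'.
    """
    if not n_best:
        return ("reprompt", "")
    # Use the hypothesis with the highest confidence score.
    best_text, best_conf = max(n_best, key=lambda item: item[1])
    if best_conf >= ACCEPT_ABOVE:
        return ("accept", best_text)
    if best_conf < REJECT_BELOW:
        return ("reprompt", "")
    # Medium confidence: ask the user to verify before acting on the input.
    return ("confirm", best_text)
```

The middle band corresponds to explicit verification of the speech input, while the lower band leads to a new prompt, possibly in another modality.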
The thesis topic of the doctoral student will be targeted towards bilingual speech recognition and language identification.
6.3.4 Deliverables
- D3.1: Application specific speech database.
- D3.2: Bilingual language identification module.
- D3.3: Bilingual speech recognition module for the target application.
- D3.4: Dr. ing. thesis.
6.3.5 Milestones and Expected Results
- M3.1: spring 2003: Public announcement of dr. ing. grant.
- M3.2: summer 2003: Specification of application specific database.
- M3.3: autumn 2003: Speech synthesiser and speech recognition engine selected.
- M3.4: winter 2003/2004: Recording of the application specific database finished.
- M3.5: spring 2004: First version of the speech module implemented in the target application (without speaker adaptation and language identification).
- M3.6: autumn 2004: Online speaker adaptation implemented in the target application.
- M3.7: autumn 2006: Language identification and bilingual speech module implemented in the target application.
- M3.8: spring 2007: Dr. ing. thesis finished.
6.4 WP4: Prototype Development
6.4.1 Objectives
- Establish a flexible hardware platform suitable for the target application.
- Establish a flexible, generic software solution for a dialogue system.
- Implement the target application on the hardware platform.
6.4.2 Description of Work
A hardware platform suitable for the target application will be established. This will typically consist of the following parts:
- A small mobile device (e.g. available in a car or as a handheld unit) with multimodal I/O-possibilities (i.e. speech, pen, text, graphics).
- A communication link between the mobile device and a remote server.
- A server containing the complete information database. Parts relevant for a particular user may be downloaded to the mobile device for further processing. Processing not possible in the mobile device will be performed in the server (e.g. speech recognition).
- GPS (Global Positioning System) to determine the location of the user.
We will utilise existing solutions for the hardware platform. Also, for the generic software solution we will use existing systems. This software is a general framework for a dialogue system on which we will build our target application. There are several options, and a choice of a specific system will be the first task in this WP. For the software solution there are two strong candidates. Preferably we will explore the possibilities of utilising the same platform as in the BRAGE project (i.e. software based on Tabulib from Telenor R&D and the GALAXY system from MIT). Alternatively, a software solution from AT&T will be freely available for our project.
The information database for the target application will consist of a geographical information system (GIS) and relevant data for a particular geographical area (e.g. nearest police station, post office, pharmacy, etc.). Such a database already exists in Norway (Bravida Geomatikk AS), and we will explore the possibilities of utilising this database in the present project. Alternatively, we will establish one within the project, valid for a limited geographical area.
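The kind of location-based query described above (for example, finding the nearest pharmacy from the user's GPS position) can be sketched as follows. The record layout, the category names and the sample coordinates are made up for illustration and do not reflect the structure of any existing database; the distance is the standard great-circle (haversine) distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(pois, lat, lon, category):
    """Return the closest point of interest of the given category, or None."""
    candidates = [p for p in pois if p["category"] == category]
    if not candidates:
        return None
    return min(candidates, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))


# Hypothetical sample records; names and coordinates are invented for illustration.
POIS = [
    {"name": "Pharmacy A", "category": "pharmacy", "lat": 63.4305, "lon": 10.3951},
    {"name": "Pharmacy B", "category": "pharmacy", "lat": 63.4200, "lon": 10.4500},
    {"name": "Post office", "category": "post-office", "lat": 63.4330, "lon": 10.4000},
]
```

A linear scan is adequate for a limited geographical area; a full GIS would normally answer such queries through a spatial index.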
Based on the outcome from WP1, WP2, and WP3, the various modules constituting the complete dialogue system will be integrated on the particular platform.
6.4.3 Deliverables
- D4.1: A hardware platform.
- D4.2: Infrastructure software for the dialogue system.
- D4.3: Software for the target application.
6.4.4 Milestones and Expected Results
- M4.1: autumn 2003: Preliminary report on hardware platforms.
- M4.2: spring 2004: Hardware platform established (including generic software for the dialogue system).
- M4.3: autumn 2004: First version of target application in operation.
6.5 WP5: Exploitation and Dissemination
6.5.1 Objectives
- Bring the concept of a multimodal dialogue system close to market.
- Publish project results to the national and international education and research community.
- Publish project results to Norwegian industry (potential manufacturer).
- Publish project results to the public community (potential users).
6.5.2 Description of Work
A project Web-site will be established (Norwegian and English versions) and maintained throughout the lifecycle of the project. Through international and national scientific meetings and publications, we will assure that the results reach the right target groups in the consumer sector, the technology sector, and the research community. We will also aim at getting feedback from external sources, and explore the possibilities of collaboration with other related activities in order to improve our results.
6.5.3 Deliverables
- D5.1: An average of 3 international publications annually.
- D5.2: National publications through scientific meetings and national journals (including popular science).
6.5.4 Milestones and Expected Results
- M5.1: Spring 2003: Project Web-site established.
- M5.2: Spring 2004: The first project paper ready for publication.
More details will follow in the yearly progress reports.
6.6 WP6: Project Management and Administration
6.6.1 Objectives
- Establish an effective project management organisation.
- Establish contacts with related activities in order to utilise potential synergetic effects.
- Guide the dissemination and exploitation of the project results.
- Establish good working routines within the project in order to control the progress in the project.
6.6.2 Description of Work
The project manager will be responsible for the daily running of the project to ensure that it proceeds on time and within budget. Co-ordination and co-operation with other related activities, in particular the BRAGE-project, will be the responsibility of the project manager.
6.6.3 Deliverables
- D6.1: Interim reports on financial status (every fourth month).
- D6.2: Annual progress reports.
- D6.3: Final report.
6.6.4 Milestones and Expected Results
- M6.1: spring 2003: Kick-off meeting.
- M6.2: spring 2007: Final report.
7. Key Personnel
The following persons will be central for the proposed activity:
- Torbjørn Svendsen, Professor, dept. of Telecommunication, NTNU
- Magne Hallstein Johnsen, Associate Professor, dept. of Telecommunication, NTNU
- Geir Øien, Professor, dept. of Telecommunication, NTNU
- Tore Amble, Associate Professor, dept. of Computer and Information Science, NTNU
- Trym Holter, Research Scientist, SINTEF Telecom and Informatics
- Erik Harborg, Research Scientist, SINTEF Telecom and Informatics
CVs for these persons were included in the original proposal.
8. Interaction With Other Relevant Projects
The Norwegian Research Council (NFR) has initiated a program (KUNSTI) within the area of Spoken Language Technology. The applicants for the present proposal are also involved in a project proposal (BRAGE) to the KUNSTI program regarding natural language user interfaces, which now has been accepted by NFR. We will establish a close contact between the two projects, in order to utilise the strong synergy effects by co-ordinating the activities.
Also, NTNU is involved in another project proposal within the KUNSTI-program, FONEMA. The main goal of this proposal is to obtain improved quality for Norwegian text-to-speech (speech synthesis). A decision has not yet been taken by NFR regarding this proposal. If accepted, we will establish close contact between the projects in order to utilise the results.
9. International Co-operation
Both NTNU and SINTEF have a long tradition of co-operating with universities, research institutes and companies abroad. This international network has been established through co-operative projects (EU), participation in international organisations and committees (COST, ISO, ISCA, ELRA, ERCIM), and exchange of technical staff and students. In the present project we will offer each of the dr. ing. students a stay of at least 6 months at another university or research institution. The established international network will be an important factor in accomplishing this.
Depending on the choice of software platform (see Section 6.4), we will establish close contact with AT&T in order to utilise their platform for dialogue system development.
In the original proposal we have included a list of related previous and ongoing activities, involving one or both of the project partners, which may support the activities to be performed in the present proposal. Of particular importance we emphasise the following:
NTNU participates in COST278. MUST has participation from Telenor R&D, a partner in the BRAGE project.
Also, NTNU and SINTEF are partners in an international consortium,
which recently submitted an EoI in response to Call EOI.FP6.2002
within the EU Sixth Framework Programme. The title is “PICO –
Personal Interface Communicator”. The aim of PICO is to provide the
user with an immediate multimodal user interface (UI) in a wearable
system, and to enable the control of complex infotainment devices by a
wirelessly conveyed UI protocol that is to be standardised. If the
Commission accepts the upcoming proposal, we will establish close
co-operation between the two activities.