Project Description and Goals

Robots are about to reach beyond their traditional well-structured factory floors to become companions for humans in complex, unpredictable environments. As a companion, the robot must still be able to safely navigate and manipulate objects, but it must also be able to cooperate and interact with people robustly and naturally, in a way that matches human intuition. Therefore, the traditional robotic research paradigm, which studies physical interactions between robots and objects, needs to be complemented by the investigation of cognitive interactions between robots and humans. Human-robot interaction (HRI) can only be effective on the premise that perception, e.g., seeing and hearing, and perception-action cycles are properly addressed. In comparison to the well-established vision-based HRI methodologies, auditory HRI has been much less investigated. Ideally, HRI would use voice communication as much as humans do among each other, but current limitations in robot audition do not allow for effective, natural, untethered acoustic communication between robots and humans in real-world environments. This is mainly because the human communication partners and other sound sources of interest will be at some distance, so that the microphone signals picked up by the robot are strongly impaired by additional noise and reverberation. Aggravating the problem even further, the robot itself produces significant ‘ego noise’ with its mechanical drives and electronics. Compared to other hands-free human-machine audio interfaces, e.g., in cars, the human-robot distance is usually larger and the environment more adverse, and as a consequence, speech recognition and sound classification performance is far inferior. This implies that robot-embodied cognition cannot use the corresponding inputs from the acoustic domain and exploit their potential for HRI, as robot audition (i) cannot beneficially be combined with other sensory modalities, e.g., vision, and hence (ii) cannot interact with the large repertoire of multimodal behaviours that are instinctively used for casual human-to-human communication and interaction, such as speech and prosodic sounds accompanied by hand gestures and head motions.

While the human auditory system has a sophisticated hearing mechanism and unique cognitive capabilities to extract the desired auditory information, the corresponding auditory signal processing for robots is still in its infancy. This defines the objective for EARS: joining forces from acoustic signal processing, robot vision, and robot cognition, and with a research-oriented industrial partner offering a leading humanoid robot as a development platform, EARS will provide intelligent ‘ears’ for a consumer humanoid robot and use them for HRI in real-world environments. While not aiming to mimic the human hearing system, these humanoid ears should approach close-to-human auditory capability. An advanced acoustic interface shall be able to localise multiple sound sources of interest and to extract the desired signals from complex acoustic real-world scenarios. Source localisation, tracking and recognition will be supported by robot vision in a data fusion framework. From the acquired desired acoustic signals and the audio-visual fusion, embodied robot cognition will derive HRI actions and knowledge of the entire scenario, and feed this back to the acoustic interface for further auditory scene analysis. Clearly, speech signals play a dominant role among the signals of interest; for these, novel acoustic signal processing algorithms will be derived from experience with other domains, e.g., smartphones, hearing aids and interactive TV. In synergy with novel approaches for active acoustic sensing in dynamic real-world environments, these algorithms will aim at overcoming current limitations of human-robot speech communication.

As a prototypical scenario for EARS, we consider a welcoming robot in a hotel lobby. During busy hours, the lobby is filled with personnel and guests, producing a large variety of sounds in addition to many speech signals. Nao+, a humanoid robot, sits at the reception desk when Jack comes out of the elevator, looking for help. Nao+, equipped with novel microphone arrays distributed over its head and limbs, and with a binocular camera pair, slowly turns its head around, collecting acoustic information, possibly disambiguated using visual information, and picks up the sound of the elevator doors as Jack comes out. Jack, unsure where to turn and looking for someone’s attention, immediately catches Nao+’s attention from three meters away. Nao+ rapidly turns its head toward Jack, waving its arms and saying “Can I help you?” Jack approaches the reception desk, dropping off his keys and collecting a city map, assisted by Nao+. At the same time, Nao+ has detected several other human speakers in the lobby and keeps track of their positions, in case any of them should need its help. Ready to interact with a guest in need, or to draw the attention of another guest further away from the reception desk, Nao+ wishes Jack goodbye.

The above hotel lobby scenario is illustrative of the large number of situations that a welcoming robot will face in realistic settings, such as entrance halls (museums, exhibitions, hospitals, conference venues, companies), help and information desks (train stations, airports, entertainment halls, government services to the public), company show rooms, waiting rooms, restaurants, etc. When addressed by a keyword (‘Hello’/ ‘Hi’/ ‘Please’/ ‘Help’/ ‘Excuse me’/ etc.), the EARS robot will be able to extract the voice of the requesting person out of the mixture of many voices and sounds and recognize the request even if uttered from a distance of a few meters, all in the presence of background noise and reverberation. Essentially, the robot’s ears act as an auditory front-end for a human-robot dialogue with an expert system of arbitrary complexity (including interpreter services, web search, etc.), limited only by the automatic speech recognition (ASR) capability given the acoustic input. Beyond just answering requests, the robot analyses the scene and watches out for people seeking help, identified as acoustic sources following erratic trajectories that stop at seemingly meaningless positions, or recognized through hand gestures typical of people requesting someone’s attention. By opening its arms, the humanoid robot increases its array aperture, so that localisation, spatial filtering, noise reduction and dereverberation are supported and a human voice can be better extracted and recognized among the many sound sources nearby. The ASR engine of the robot extracts a keyword from an utterance, e.g., ‘Where is the ELEVATOR?’, and the robot points with its arms to the elevators, adding ‘This way, please, madam. The elevators are behind the marble statues to your right’.
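
To make the aperture argument concrete, the following is a minimal sketch, not an EARS algorithm, comparing the far-field response of a plain delay-and-sum beamformer for a compact head-mounted array against the same array extended by two hypothetical arm-mounted microphones; all microphone positions, the analysis frequency and the delay-and-sum weighting are assumptions made purely for illustration.

```python
import numpy as np

C = 343.0        # speed of sound in m/s
FREQ = 1000.0    # analysis frequency in Hz (illustrative choice)

def beam_pattern(mic_xy, look_az_deg, az_grid_deg, freq=FREQ):
    """Far-field delay-and-sum response of a planar array.
    Azimuth is measured from the robot's forward (+y) axis; x is lateral."""
    k = 2.0 * np.pi * freq / C
    def steering(az_deg):
        az = np.deg2rad(az_deg)
        direction = np.array([np.sin(az), np.cos(az)])   # unit vector toward the source
        return np.exp(1j * k * (mic_xy @ direction))
    weights = steering(look_az_deg) / len(mic_xy)        # plain delay-and-sum weights
    return np.array([np.abs(np.conj(weights) @ steering(az)) for az in az_grid_deg])

def half_power_beamwidth(pattern, az_grid_deg, look_az_deg=0.0):
    """Width of the contiguous region around the look direction where the
    response stays within 3 dB of its maximum."""
    look = int(np.argmin(np.abs(az_grid_deg - look_az_deg)))
    thresh = pattern[look] / np.sqrt(2.0)
    lo = hi = look
    while lo > 0 and pattern[lo - 1] >= thresh:
        lo -= 1
    while hi < len(pattern) - 1 and pattern[hi + 1] >= thresh:
        hi += 1
    return az_grid_deg[hi] - az_grid_deg[lo]

# Hypothetical microphone coordinates in metres (lateral x, forward y).
head_mics = np.array([[-0.05, 0.0], [0.05, 0.0], [0.0, 0.05], [0.0, -0.05]])
arm_mics  = np.array([[-0.40, 0.0], [0.40, 0.0]])     # arms opened sideways
full_array = np.vstack([head_mics, arm_mics])

az = np.arange(-90.0, 90.5, 0.5)
head_only = beam_pattern(head_mics, 0.0, az)
arms_open = beam_pattern(full_array, 0.0, az)

print(f"-3 dB beamwidth, head only : {half_power_beamwidth(head_only, az):5.1f} deg")
print(f"-3 dB beamwidth, arms open : {half_power_beamwidth(arms_open, az):5.1f} deg")
```

Under these assumed positions, the half-power beamwidth narrows considerably once the arm microphones are included; the resulting sparse aperture also raises sidelobes away from the look direction, which is one of the trade-offs a real robomorphic array design would have to address.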

The strategy for reaching the ambitious technical goals of the project relies on the following key components and associated guiding principles for their development:

  • The robot can listen to multiple target sources in a noisy and reverberant environment and recognize voices and sounds reliably from a distance. To this end, EARS will design novel microphone arrays that allow major advances regarding localisation of multiple acoustic sources of interest, spatial filtering, source separation, noise suppression and dereverberation for the desired signals, and, in addition, the inference and use of knowledge about the acoustic environment. As an example, the robot should be able to extract a speech signal suitable for automatic speech recognition from a distance of several meters and suppress one or more competing talkers and diffuse background noise at signal-to-noise ratio (SNR) levels of 0 dB or less. It should also be able to learn the reverberation time of the acoustic environment. For full-duplex speech-based HRI, acoustic echo cancellation will be implemented that should successfully suppress acoustic echoes at levels around 0 dB. Exploiting robot vision, the fusion of acoustic and visual information will further increase robustness in localising, tracking, and classifying multiple sources and in recognizing their signals; a minimal localisation sketch (GCC-PHAT for one microphone pair) is given after this list.
  • The robot learns awareness of its environment and of events within it, such as sounds associated with human activities. It will infer the environment from acoustic signals (microphones) and jointly exploit vision to disambiguate (support) the acoustic information. Sound sources will be classified as to whether or not they should receive the robot’s attention; a toy sketch of such a classification step is given after this list. Environment parameters will be extracted from the signals so that awareness of the environment is built up through learning over time.
  • The robot can interact naturally with human communication partners via gestures and voice. New concepts for more natural and intuitive human-robot interaction will be developed by directly linking the large repertoire of robot behaviour (expressed via movements, gestures, and synthesized voice) to the acoustic and visual inputs and the disambiguated knowledge of the overall scenario. This close link between low-level sensor data and high-level processes, such as speech recognition and robot actions, will allow immediate feedback to the sensing level for configuring an adaptive robomorphic microphone array for ‘active sensing’, and will support instantaneous gesture-based HRI that meets human expectations; a toy sketch of such a feedback loop is given after this list. It will include attentional models as well as internal simulation and prediction of human and robot behaviour, and should mark a significant step towards human-like cognition.
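
As a concrete, deliberately simplified illustration of the first item, the sketch below estimates the direction of arrival of a single source from one microphone pair with the classical GCC-PHAT method; it is not the localisation algorithm to be developed in EARS, and the sampling rate, microphone spacing and synthetic noise signal are assumptions for the example only.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None, interp=16):
    """Estimate the time difference of arrival of `sig` relative to `ref`
    using the generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=interp * n)             # interpolated cross-correlation
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(interp * fs)

# --- Synthetic two-microphone test (all values are illustrative assumptions) ---
fs = 16000                      # sampling rate in Hz
c = 343.0                       # speed of sound in m/s
mic_distance = 0.10             # 10 cm between the two 'ear' microphones
true_angle = np.deg2rad(40.0)   # source direction relative to broadside
true_tau = mic_distance * np.sin(true_angle) / c

rng = np.random.default_rng(0)
source = rng.standard_normal(fs)                      # 1 s of noise as a stand-in for speech
delay_samples = int(round(true_tau * fs))
mic_ref = source + 0.1 * rng.standard_normal(fs)      # additive sensor noise
mic_sig = np.roll(source, delay_samples) + 0.1 * rng.standard_normal(fs)

tau_hat = gcc_phat(mic_sig, mic_ref, fs, max_tau=mic_distance / c)
angle_hat = np.rad2deg(np.arcsin(np.clip(tau_hat * c / mic_distance, -1.0, 1.0)))
print(f"true DOA: {np.rad2deg(true_angle):.1f} deg, estimated DOA: {angle_hat:.1f} deg")
```

A full multi-source localiser on the robomorphic array would combine many such pair-wise estimates (or a steered-response approach) with tracking and audio-visual fusion; the sketch only shows the elementary time-delay step.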
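
For the second item, the following toy sketch shows the flavour of classifying sound frames into ‘attend’ versus ‘ignore’ and updating the class models over time; the three spectral features, the nearest-centroid rule and the synthetic training frames are illustrative assumptions and stand in for the far richer classifiers and learned environment models envisaged in EARS.

```python
import numpy as np

def spectral_features(frame, fs):
    """A few crude descriptors of a short audio frame: log energy,
    spectral centroid, and spectral flatness."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    energy = np.log(np.sum(spec ** 2))
    centroid = np.sum(freqs * spec) / np.sum(spec)
    flatness = np.exp(np.mean(np.log(spec))) / np.mean(spec)
    return np.array([energy, centroid, flatness])

class NearestCentroidSoundClassifier:
    """Toy classifier that labels frames by the closest class centroid
    in feature space and can be updated incrementally over time."""

    def __init__(self):
        self.centroids = {}     # label -> (mean feature vector, frame count)

    def update(self, label, features):
        mean, count = self.centroids.get(label, (np.zeros_like(features), 0))
        count += 1
        mean = mean + (features - mean) / count     # running mean -> learning over time
        self.centroids[label] = (mean, count)

    def classify(self, features):
        return min(self.centroids,
                   key=lambda lbl: np.linalg.norm(features - self.centroids[lbl][0]))

# --- Illustrative use with synthetic frames (not real recordings) ---
fs, frame_len = 16000, 1024
rng = np.random.default_rng(1)
clf = NearestCentroidSoundClassifier()

t = np.arange(frame_len) / fs
for _ in range(50):
    # 'speech-like': tonal mixture with a little noise; 'background': broadband noise
    speechish = (np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)
                 + 0.05 * rng.standard_normal(frame_len))
    background = 0.3 * rng.standard_normal(frame_len)
    clf.update("attend", spectral_features(speechish, fs))
    clf.update("ignore", spectral_features(background, fs))

test = np.sin(2 * np.pi * 250 * t) + 0.05 * rng.standard_normal(frame_len)
print("decision for test frame:", clf.classify(spectral_features(test, fs)))
```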
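
For the third item, the toy decision loop below illustrates how high-level scene knowledge could be fed back to the sensing level, e.g., steering the head toward the most salient speech source and opening the arms to widen the array aperture when localisation confidence is low; the data structure, thresholds and actions are hypothetical and merely indicate the kind of cognition-to-sensing feedback intended for ‘active sensing’.

```python
from dataclasses import dataclass

@dataclass
class TrackedSource:
    """Belief about one sound source (all fields are illustrative)."""
    azimuth_deg: float        # estimated direction of arrival
    confidence: float         # 0..1, e.g. from audio-visual fusion
    is_speech: bool           # output of a sound classifier
    addressed_robot: bool     # e.g. a keyword such as 'Hello' was spotted

def choose_sensing_action(sources, confidence_threshold=0.6):
    """Map the current scene belief to a sensing/interaction action.

    A toy stand-in for the feedback from cognition to the acoustic front-end:
    attend the most salient speech source, and widen the array aperture
    (open the arms) when localisation is still too uncertain."""
    speech = [s for s in sources if s.is_speech]
    if not speech:
        return {"action": "scan", "head_azimuth_deg": None, "open_arms": False}

    # Salience: sources that addressed the robot come first, then by confidence.
    target = max(speech, key=lambda s: (s.addressed_robot, s.confidence))
    open_arms = target.confidence < confidence_threshold
    return {"action": "attend",
            "head_azimuth_deg": float(target.azimuth_deg),
            "open_arms": open_arms}

# --- Illustrative scene belief ---
scene = [
    TrackedSource(azimuth_deg=-35.0, confidence=0.45, is_speech=True,  addressed_robot=True),
    TrackedSource(azimuth_deg=60.0,  confidence=0.90, is_speech=True,  addressed_robot=False),
    TrackedSource(azimuth_deg=10.0,  confidence=0.80, is_speech=False, addressed_robot=False),
]
print(choose_sensing_action(scene))
# -> attends the guest at -35 deg and opens the arms because confidence is low
```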

In summary, EARS will (i) develop new methodologies for smart sensing, acoustic signal processing, audio-visual data fusion, and human-robot interaction, driven by real-world environment scenarios, (ii) explore the potential of interaction between these methodologies, (iii) verify the resulting concepts on an inexpensive consumer robot in real-world scenarios, and (iv) promote dissemination and commercial exploitation by data-sharing and an open-source software platform for the robot.

The overarching goal of EARS is to provide highly effective ‘ears’ for a humanoid robot and to demonstrate their benefits in synergy with robot vision for HRI applications. Consequently, as the outcome of EARS, a consumer-robot prototype will

  • be able to listen to multiple target sources in a noisy and reverberant environment and recognize voices and sounds reliably from a distance.
  • learn awareness of its environment and events within the environment, such as sounds associated with human activities.
  • interact naturally with human communication partners via gestures and voice.