Microphone Array Signal Processing for Robot Audition

H. W. Loellmann (FAU Erlangen-Nuremberg), Alastair H. Moore, Patrick A. Naylor (Imperial College London), Boaz Rafaely (Ben-Gurion University of the Negev), Radu Horaud (INRIA Grenoble), Alexandre Mazel (Softbank Robotics), W. Kellermann (FAU Erlangen-Nuremberg)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: Robot audition for humanoid robots interacting naturally with humans in an unconstrained real-world environment is a hitherto unsolved challenge. The recorded microphone signals are usually distorted by background and interfering noise sources (speakers) as well as room reverberation. In addition, the movements of a robot and its actuators cause ego-noise, which degrades the recorded signals significantly. The movement of the robot body and its head also complicates the detection and tracking of the desired, possibly moving, sound sources of interest. This paper presents an overview of the concepts in microphone array processing for robot audition and some recent achievements.
Copyright Notice ©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper: EARS_HSCMA_2017_FAU_HL

Audio-visual Tracking by Density Approximation in a Sequential Bayesian Filtering Framework

I. D. Gebru (INRIA Grenoble), C. Evers, P. A. Naylor (Imperial College London), R. Horaud (INRIA Grenoble)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: This paper proposes a novel audio-visual tracking approach that constructively exploits the audio and visual modalities in order to estimate the trajectories of multiple people in a joint state space. The tracking problem is modeled in a sequential Bayesian filtering framework, within which the posterior density is represented by a Gaussian Mixture Model (GMM). To ensure that a GMM representation can be retained sequentially over time, the predictive density is approximated by a GMM using the Unscented Transform, and a density interpolation technique is introduced to obtain a continuous representation of the observation likelihood, which is also a GMM. Furthermore, to prevent the number of mixture components from growing exponentially over time, a density approximation based on the Expectation Maximization (EM) algorithm is applied, resulting in a compact GMM representation of the posterior density. Recordings using a camcorder and a microphone array are used to evaluate the proposed approach, demonstrating significant improvements in tracking performance compared to two benchmark visual trackers.

This contribution received a Best Paper Award.
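A small sketch may help make the EM-based density approximation concrete. The snippet below is a hedged, generic variant rather than the authors' exact algorithm: a posterior GMM that has grown too large is approximated by sampling it and refitting a compact GMM with scikit-learn's EM implementation; all sizes and parameters are illustrative assumptions.

```python
# Hedged sketch of EM-based GMM compaction (not the paper's exact method):
# approximate a grown posterior mixture by sampling it and refitting a
# smaller GMM with EM. All sizes and parameters are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# A "grown" posterior: 12 components over a 2-D state (e.g. a position).
n_big, dim = 12, 2
weights = rng.dirichlet(np.ones(n_big))
means = rng.uniform(-5.0, 5.0, size=(n_big, dim))
covs = np.stack([np.diag(rng.uniform(0.2, 1.0, dim)) for _ in range(n_big)])

# Draw samples from the large mixture ...
counts = rng.multinomial(5000, weights)
samples = np.vstack([rng.multivariate_normal(means[i], covs[i], size=c)
                     for i, c in enumerate(counts) if c > 0])

# ... and refit a compact 3-component GMM with EM.
compact = GaussianMixture(n_components=3, covariance_type="full",
                          random_state=0).fit(samples)
print("compact GMM means:\n", compact.means_.round(2))
```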

Towards Acoustically Robust Localization of Speakers in a Reverberant Environment

B. Rafaely (Ben-Gurion University of the Negev), D. Kolossa, and Y. Maymon (Ruhr University Bochum)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: Direction-of-arrival (DoA) estimation of a speaker in a room using microphone arrays is an important task in audio signal processing in general, and in robot audition in particular. Recently, a novel DoA estimation method developed for spherical arrays achieved accurate performance even under real-world conditions with strong reverberation. The method identifies time-frequency bins dominated by the direct signal from the source and employs only these bins for DoA estimation. A recent extension allowed the use of shorter time frames by applying Gaussian mixture model (GMM) based clustering to the DoA statistics. However, performance still degrades under challenging acoustical conditions, such as close to a reflecting surface. In this paper, a novel analysis is presented that provides insight into the acoustic significance of the individual Gaussians in the mixture, clearly showing the distinctiveness of the Gaussian corresponding to the direct-path signal. The results presented here can be employed in the design of acoustically robust DoA estimation under strong reverberation.
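To make the two ingredients of the abstract concrete, the sketch below combines a direct-path-dominance style test (keep only time-frequency bins whose local correlation matrix is close to rank one) with GMM clustering of the surviving DoA estimates; the dominant Gaussian then yields the source direction. The synthetic data, the 0.8 threshold, and all sizes are assumptions for illustration only.

```python
# Hedged sketch: direct-path-dominance style bin selection followed by
# GMM clustering of the surviving per-bin DoA estimates.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic per-bin data: local 4x4 correlation matrices and DoA estimates
# (azimuth, elevation) in degrees; the true source sits at (40, 10).
n_bins = 400
doas, ratios = [], []
for _ in range(n_bins):
    direct = rng.random() < 0.3            # 30% direct-path dominated bins
    a = rng.standard_normal(4) + 1j * rng.standard_normal(4)
    R = np.outer(a, a.conj())              # rank-1 (direct-path) part
    R += (0.05 if direct else 1.0) * np.eye(4)   # diffuse/reverberant part
    ev = np.linalg.eigvalsh(R)
    ratios.append(ev[-1] / ev.sum())       # dominance of largest eigenvalue
    doa = np.array([40.0, 10.0]) + (3 if direct else 60) * rng.standard_normal(2)
    doas.append(doa)

doas, ratios = np.array(doas), np.array(ratios)
selected = doas[ratios > 0.8]              # DPD-style selection (assumed gate)

gmm = GaussianMixture(n_components=3, random_state=0).fit(selected)
dominant = gmm.means_[np.argmax(gmm.weights_)]
print("estimated DoA (az, el):", dominant.round(1))
```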

A Unified Framework for Multiple Arrays on a Robot and Application to Sound Localization

L. Madmoni (Ben-Gurion University of the Negev), H. Barfuss (FAU Erlangen-Nuremberg), B. Rafaely (Ben-Gurion University of the Negev), and W. Kellermann (FAU Erlangen-Nuremberg)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: The auditory system of humanoid robots has recently gained significant attention in the research community. This system enables interaction with the surroundings via microphone arrays embedded in the robot. Typically, a single array topology is designed, which may limit the performance of the auditory system. In a recent paper, a system with two array types was studied: the first consists of sensors distributed over the robot's head, and the second is a body-mounted robomorphic array. These arrays have only been studied separately, without exploiting their performance as a single array. In this work, a unified framework is developed for the two arrays in the spherical harmonics domain, and an initial investigation of direction-of-arrival (DOA) estimation with the unified framework is presented. The simulation study shows that the unified framework outperforms DOA estimation with the individual arrays, mainly in the low frequency range.
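The core idea of a unified framework, treating the union of all sensors as a single array, can be sketched in a generic free-field form (the paper itself works in the spherical harmonics domain). Below, two hypothetical sub-array geometries are stacked into one steering model and scanned with steered response power (SRP); the geometries, frequency, and noise level are assumptions.

```python
# Hedged free-field sketch of a unified array: stack both sub-arrays'
# sensor positions, build one steering vector over the union, and scan
# candidate azimuths with steered response power (SRP).
import numpy as np

rng = np.random.default_rng(2)
c, f = 343.0, 500.0                        # speed of sound [m/s], frequency [Hz]
k = 2 * np.pi * f / c                      # wavenumber

# Hypothetical geometries (metres): a small head array and a larger
# body-mounted array 20 cm below it.
head = 0.05 * np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]])
body = 0.20 * np.array([[1, 0, -1], [-1, 0, -1], [0, 1, -1], [0, -1, -1]])
mics = np.vstack([head, body])             # unified array = union of sensors

def steering(az):
    """Far-field plane-wave steering vector for a horizontal azimuth."""
    u = np.array([np.cos(az), np.sin(az), 0.0])
    return np.exp(1j * k * (mics @ u))

true_az = np.deg2rad(60.0)                 # simulated source at 60 degrees
x = steering(true_az)
x = x + 0.1 * (rng.standard_normal(len(mics))
               + 1j * rng.standard_normal(len(mics)))

grid = np.deg2rad(np.arange(0.0, 360.0, 1.0))
power = [np.abs(steering(az).conj() @ x) ** 2 for az in grid]
print("estimated azimuth:", np.rad2deg(grid[int(np.argmax(power))]), "deg")
```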

Multi-Source Estimation Consistency for Improved Multiple Direction-of-Arrival Estimation

S. Hafezi, A. H. Moore, and P. A. Naylor (Imperial College London)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: In Direction-of-Arrival (DOA) estimation for multiple sources, removal of noisy data points from a set of local DOA estimates increases the resulting estimation accuracy, especially when there are many sources with small angular separation. In this work, we propose a post-processing technique that enhances DOA extraction from a set of local estimates by exploiting the consistency of these estimates within the time frame, based on an adaptive multi-source assumption. Simulations in a realistic reverberant environment with sensor noise and up to 5 sources demonstrate that the proposed technique outperforms the baseline and state-of-the-art approaches. In these tests, the proposed technique had a worst-case average error of 9°, and was robust to within 5° across widely varying source separations and to within 3° across the number of sources.
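A generic illustration of consistency-based post-processing follows (not the authors' exact technique): cluster the local DoA estimates within a frame, discard estimates that are inconsistent with their nearest cluster centroid, and re-estimate the source directions from the survivors. The 15° gate and the cluster count are assumed, and azimuth wrap-around is ignored for brevity.

```python
# Hedged sketch of consistency-based removal of noisy local DoA estimates.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Synthetic local azimuth estimates (degrees) from one time frame:
# two sources at 30 and 80 degrees plus scattered noisy estimates.
est = np.concatenate([30 + 4 * rng.standard_normal(60),
                      80 + 4 * rng.standard_normal(60),
                      rng.uniform(0, 360, 30)]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(est)
dist = np.abs(est - km.cluster_centers_[km.labels_])   # per-point residual
keep = dist.ravel() < 15                               # consistency gate (assumed)

refined = KMeans(n_clusters=2, n_init=10, random_state=0).fit(est[keep])
print("refined DoAs:", np.sort(refined.cluster_centers_.ravel()).round(1))
```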

HRTF-based Two-Dimensional Robust Least-Squares Frequency-Invariant Beamformer Design for Robot Audition

H. Barfuss, M. Buerger, J. Podschus, and W. Kellermann (FAU Erlangen-Nuremberg)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: In this work, we propose a two-dimensional Head-Related Transfer Function (HRTF)-based robust beamformer design for robot audition, which allows for explicit control of the beamformer response for the entire three-dimensional sound field surrounding a humanoid robot. We evaluate the proposed method by means of both signal-independent and signal-dependent measures in a robot audition scenario. Our results confirm the effectiveness of the proposed two-dimensional HRTF-based beamformer design compared to our previously published one-dimensional design, which was carried out for a fixed elevation angle only.
Copyright Notice ©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper: EARS_HSCMA_2017_FAU_HB
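For orientation, a robust least-squares beamformer design at a single frequency can be sketched as a Tikhonov-regularized fit of the array response to a desired pattern over a grid of directions. Free-field steering vectors stand in for the measured HRTFs used in the paper, and the array geometry, desired pattern, and regularizer below are all assumptions.

```python
# Hedged single-frequency sketch of a robust least-squares beamformer:
# fit the array response over a grid of directions to a desired pattern,
# with Tikhonov regularization for robustness (bounded white-noise gain).
import numpy as np

c, f, M = 343.0, 1000.0, 6                 # speed of sound, frequency, #mics
k = 2 * np.pi * f / c
mics = 0.04 * np.arange(M)[:, None] * np.array([[1.0, 0.0]])  # linear array

az = np.deg2rad(np.arange(0.0, 360.0, 5.0))            # design grid
dirs = np.stack([np.cos(az), np.sin(az)], axis=1)
A = np.exp(1j * k * (mics @ dirs.T))                   # M x D steering matrix

# Desired pattern: a beam toward 90 degrees (assumed Gaussian shape).
d = np.exp(-(np.rad2deg(az) - 90.0) ** 2 / (2.0 * 15.0 ** 2))

lam = 1e-2                                 # assumed robustness regularizer
w = np.linalg.solve(A @ A.conj().T + lam * np.eye(M), A @ d)

response = np.abs(w.conj() @ A)
print("beam points to %.0f deg" % np.rad2deg(az[int(np.argmax(response))]))
```

The normal equations solved here, w = (A Aᴴ + λI)⁻¹ A d, minimize ‖Aᴴw − d‖² + λ‖w‖²; the regularizer λ trades pattern accuracy for robustness against sensor noise and mismatch, which is the role the robustness constraint plays in the paper's design.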

Speaker Tracking in Reverberant Environments Using Multiple Directions of Arrival

C. Evers (Imperial College London), B. Rafaely (Ben-Gurion University of the Negev), P. A. Naylor (Imperial College London)
Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, USA, March 1-3, 2017
Abstract: Accurate estimation of the Direction of Arrival (DOA) of a sound source is an important prerequisite for a wide range of acoustic signal processing applications. However, in enclosed environments, early reflections and late reverberation often lead to localization errors. Recent work demonstrated that improved robustness against reverberation can be achieved by clustering only the DOAs from direct-path bins in the short-time Fourier transform of a speech signal of several seconds duration from a static talker. Nevertheless, for moving talkers, short blocks of at most several hundred milliseconds are required to capture the spatio-temporal variation of the source direction. Processing of short blocks of data in reverberant environments can lead to clusters whose centroids correspond to spurious DOAs away from the source direction. In this paper, we therefore propose a novel multi-detection source tracking approach that estimates the smoothed trajectory of the source DOAs. Results for realistic room simulations validate the proposed approach and demonstrate significant improvements in estimation accuracy compared to single-detection tracking.
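As a simplified stand-in for the multi-detection tracker (the paper uses a more principled Bayesian formulation), the sketch below runs a scalar Kalman filter on the source azimuth and, at each frame, fuses all gated DoA detections instead of committing to a single one. The gate width, noise variances, and clutter model are assumptions.

```python
# Hedged stand-in for multi-detection DoA tracking: a scalar Kalman filter
# that fuses every gated detection per frame rather than picking one.
import numpy as np

rng = np.random.default_rng(5)
true_traj = 20 + 0.5 * np.arange(100)        # source moving 20 -> 69.5 deg

x, P, q, r, gate = 20.0, 25.0, 0.5, 16.0, 20.0
track = []
for az in true_traj:
    # Several detections per frame: clutter plus noisy source detections.
    dets = np.concatenate([rng.uniform(0, 360, 2),
                           az + 4 * rng.standard_normal(3)])
    P = P + q                                # random-walk prediction
    sel = dets[np.abs(dets - x) < gate]      # gate out spurious detections
    for z in sel:                            # sequential scalar updates
        K = P / (P + r)
        x, P = x + K * (z - x), (1 - K) * P
    track.append(x)

err = np.abs(np.array(track) - true_traj)
print("mean abs error: %.2f deg" % err.mean())
```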

Spatio-Spectral Masking for Spherical Array Beamforming

U. Abend and B. Rafaely (Ben-Gurion University of the Negev)
IEEE International Conference on the Science of Electrical Engineering (ICSEE) 2016, Eilat, Israel, Nov. 16-18, 2016
Abstract: Beamforming using spherical arrays has become increasingly popular in recent years. However, the performance of beamforming algorithms is greatly affected by the limited number of sensors. This work offers a novel approach based on pre-processing of the spatial data in order to better separate the signal from the noise, thus improving beamforming performance. The method involves transformation of the data to the spatio-spectral domain using the spatially-localized spherical Fourier transform, followed by masking. The masking function is defined using a priori knowledge of the signal-to-noise ratio. The performance of the proposed algorithm is evaluated in a simulation study, showing improvement over conventional spatial filtering.
Copyright Notice ©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper: Paper_ICSEE_2016_BGU_UA
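The masking step itself can be illustrated in a hedged, generic form: given a priori SNR per transform-domain coefficient, attenuate low-SNR coefficients before further processing. A Wiener-style gain stands in for the paper's masking function, and ordinary Fourier coefficients stand in for the spatially-localized spherical Fourier transform; the SNR values are assumed.

```python
# Hedged sketch of SNR-driven masking in a transform domain (plain FFT
# coefficients stand in for the spatially-localized spherical Fourier
# transform used in the paper).
import numpy as np

rng = np.random.default_rng(6)
n = 256
signal = np.sin(2 * np.pi * 8 * np.arange(n) / n)     # narrowband "signal"
noisy = signal + 0.5 * rng.standard_normal(n)

S = np.fft.rfft(noisy)
# Assumed a priori knowledge: signal power concentrated in bin 8.
snr = np.full(len(S), 0.05)
snr[8] = 50.0
mask = snr / (1.0 + snr)                              # Wiener-style mask
enhanced = np.fft.irfft(mask * S, n)

print("noise power before/after: %.3f / %.3f"
      % (np.mean((noisy - signal) ** 2), np.mean((enhanced - signal) ** 2)))
```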

Efficient Relative Transfer Function Estimation Framework in the Spherical Harmonics Domain

Y. Biderman, B. Rafaely (Ben-Gurion University of the Negev), S. Gannot (Bar-Ilan University), and S. Doclo (University of Oldenburg)
European Signal Processing Conference (EUSIPCO) 2016, Budapest, Hungary, September 2016.
Abstract: In acoustic conditions with reverberation and coherent sources, various spatial filtering techniques, such as the linearly constrained minimum variance (LCMV) beamformer, require accurate estimates of the relative transfer functions (RTFs) between the sensors with respect to the desired speech source. However, the time-domain support of these RTFs may affect the estimation accuracy in several ways. First, short RTFs justify the multiplicative transfer function (MTF) assumption when the length of the signal time frames is limited. Second, they require fewer parameters to be estimated, hence reducing the effect of noise and model errors. In this paper, a spherical microphone array based framework for RTF estimation is presented, where the signals are transformed to the spherical harmonics (SH) domain. The RTF time-domain supports are studied under different acoustic conditions, showing that SH-domain RTFs are shorter than conventional space-domain RTFs.
Copyright Notice ©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper: Paper_EUSIPCO_2016_BGU_YB
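A generic space-domain illustration of RTF estimation may be useful (the paper's framework operates in the SH domain instead): with a single desired source, the RTF of sensor 2 relative to sensor 1 follows from cross- and auto-power spectral densities as H(f) = S12(f) / S11(f). The impulse response, sampling rate, and frame length below are assumptions.

```python
# Hedged space-domain RTF estimation via Welch cross-/auto-spectra.
import numpy as np
from scipy.signal import csd, lfilter, welch

rng = np.random.default_rng(7)
fs = 16000
s = rng.standard_normal(fs)               # 1 s of white "source" signal
h = np.array([0.0, 0.0, 1.0, 0.6, 0.3])   # assumed relative impulse response
x1 = s                                    # reference sensor
x2 = lfilter(h, [1.0], s)                 # second sensor = filtered reference

freqs, S11 = welch(x1, fs=fs, nperseg=512)
_, S12 = csd(x1, x2, fs=fs, nperseg=512)  # cross-PSD between the sensors
rtf = S12 / S11                           # H1 estimate of the RTF

# Back in the time domain, the RTF support should match h (delay of 2 samples).
rir = np.fft.irfft(rtf)
print("estimated relative impulse response:", rir[:5].round(2))
```

The short time-domain support of h here is exactly the property the paper studies: the shorter the RTF, the better the multiplicative transfer function assumption holds for a given frame length.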

Analysis of Distortion in Audio Signals Introduced by Microphone Motion

V. Tourbabin and B. Rafaely (Ben-Gurion University of the Negev)
European Signal Processing Conference (EUSIPCO) 2016, Budapest, Hungary, September 2016.
Abstract: Signals recorded by microphones form the basis for a wide range of audio signal processing systems. In some applications, such as humanoid robots, the microphones may be moving while recording the audio signals. A common practice is to assume that the microphone is stationary within a short time frame. Although this assumption may be reasonable under some conditions, there is currently no theoretical framework that predicts the level of signal distortion due to motion as a function of system parameters. This paper presents such a framework for linear and circular microphone motion, providing upper bounds on the motion-induced distortion and showing that the dependence of this upper bound on motion speed, signal frequency, and time-frame duration is linear. A simulation study of a humanoid robot rotating its head while recording a speech signal validates the theoretical results.
Copyright Notice ©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper: Paper_EUSIPCO_2016_BGU_VT
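The stated linear dependence can be checked with a back-of-the-envelope model (a plausible first-order argument, not the paper's exact bound): within a frame of duration T, a microphone moving at speed v travels at most vT, so a plane wave of frequency f acquires a worst-case phase error of 2πfvT/c, which is linear in v, f, and T.

```python
# Hedged first-order model of motion-induced phase error (illustrative,
# not the paper's derived bound).
import numpy as np

c = 343.0                                  # speed of sound [m/s]

def phase_error(v, f, T):
    """Worst-case motion-induced phase error [rad] over one frame."""
    return 2 * np.pi * f * v * T / c

# Example: a mic on a turning robot head moving at 0.5 m/s, a 4 kHz
# signal component, and a 32 ms frame.
print("phase error: %.3f rad" % phase_error(0.5, 4000.0, 0.032))
# Doubling any single parameter doubles the error (linearity):
print("doubled speed ratio:", phase_error(1.0, 4000.0, 0.032)
      / phase_error(0.5, 4000.0, 0.032))
```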