Head-Related Impulse Responses (HRIRs) have been measured for the (Benchmark II) prototype head for the NAO robot. This prototype head was developed within the EARS project as part of Deliverable D5.3. The head contains 12 microphones in a pseudo-spherical arrangement whose positions were determined as part of Deliverable D1.2. The head used for the HRIR measurements is not identical to, but was manufactured to the same specifications as, the robot head used for the IEEE-AASP Challenge on Acoustic Source Localization and Tracking (LOCATA). A mat-file with the measured HRIRs and the corresponding documentation (pdf-file) are provided in this zip archive.
The EARS map objects are Matlab classes designed to store and visualise data for acoustic scene mapping. They allow the storage of a) individual speakers at one time step using a mapFeature object, b) a collection of speakers at one time step using a map object, and c) a trajectory describing the evolution of the map over time. The objects are designed to contain data from both sound source localisation (SSL) and speaker tracking algorithms, providing a complete representation of the acoustic scene.
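The actual EARS map objects are Matlab classes; the following Python sketch (all class and field names are hypothetical, not the toolbox API) merely illustrates the three-level layering described above, from single-speaker feature to per-time-step map to trajectory:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MapFeature:
    """One speaker estimate at a single time step (position + optional ID)."""
    x: float
    y: float
    speaker_id: int = -1  # -1: unidentified (e.g. a raw SSL detection)

@dataclass
class Map:
    """All speaker estimates for one time step."""
    time: float
    features: List[MapFeature] = field(default_factory=list)

@dataclass
class MapTrajectory:
    """Evolution of the acoustic-scene map over time."""
    maps: List[Map] = field(default_factory=list)

    def add(self, m: Map) -> None:
        self.maps.append(m)

# Usage: two speakers detected at t = 0.0 s
scene = MapTrajectory()
scene.add(Map(time=0.0,
              features=[MapFeature(1.0, 0.5, 1), MapFeature(-0.3, 2.0, 2)]))
print(len(scene.maps[0].features))  # number of speakers at the first time step
```

The same containers can hold either instantaneous SSL detections (unidentified features) or tracker output (features carrying a persistent speaker identity).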
This page provides supplementary audio and visual data for the ICASSP 2015 submission “Phase-Optimized K-SVD for Signal Extraction from Underdetermined Multichannel Sparse Mixtures” by Antoine Deleforge and Walter Kellermann (manuscript available at http://arxiv.org/abs/1410.2430). This paper introduces a new sparse matrix factorization technique operating in the complex Fourier domain. In contrast to existing non-negative factorization methods, the proposed approach is complex-valued and multichannel, and estimates the instantaneous phase of all involved sound sources.
Download MATLAB code and usage examples for PO-KSVD:
The method is applied to the challenging problem of “egonoise” reduction, i.e., reducing the noise produced by a robot performing motor actions such as hand waving or walking. All recordings were made with the commercial robot NAO V5 by Aldebaran Robotics, in the audio lab of the LMS chair (Erlangen, Germany). The T60 reverberation time of the room was around 200 ms. Although the recordings used by the proposed method are 4-channel (left, right, front and rear microphones), this page provides stereo sounds corresponding to the left and right microphones only, for a better listening experience.
Below are two videos of the robot NAO waving and walking. The soundtracks correspond to the sounds recorded at the left and the right microphones of the robot (best heard with headphones).
As can be heard, these signals are highly non-stationary and possess an intricate spatial distribution, making them challenging to model or extract from a mixture.
Multichannel Wiener pre-filtering
To reduce the noise produced by the CPU fan, the multichannel Wiener filtering technique described in [Löllmann et al. 2014] was used on all recordings, as illustrated below:
Fan noise (training signal):
Input noisy signal (waving right arm + speech + fan noise):
Cleaned signal:
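The actual pre-filter of [Löllmann et al. 2014] is multichannel and more elaborate, but the core idea of this step can be sketched with a single-channel spectral Wiener filter whose noise PSD is estimated from a fan-noise training signal. The Python sketch below uses synthetic stand-in signals; all parameter values are illustrative:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Simple STFT with a Hann window; returns a (freq, time) array."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def wiener_gain(Y, noise_psd, floor=0.05):
    """Per-bin Wiener gain G = max(1 - N / |Y|^2, floor)."""
    snr_post = np.abs(Y) ** 2 / (noise_psd[:, None] + 1e-12)
    return np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1e-12), floor)

rng = np.random.default_rng(0)
fs, n = 16000, 16000
fan_train = 0.1 * rng.standard_normal(n)   # stand-in for the fan-noise training signal
speech = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
noisy = speech + 0.1 * rng.standard_normal(n)

# Noise PSD estimated by averaging over the training-signal frames
noise_psd = np.mean(np.abs(stft(fan_train)) ** 2, axis=1)
Y = stft(noisy)
Y_clean = wiener_gain(Y, noise_psd) * Y    # apply the spectral gain
```

The gain floor limits how far any time-frequency bin is attenuated, a common way to trade residual noise against musical-noise artifacts.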
All spectrograms shown on this page correspond to the left microphone channel and use the following color code:
The “waving” egonoise
Test mixtures were generated by summing utterances from the GRID corpus and out-of-training “waving noise” recordings. The utterances were emitted by a loudspeaker placed 1 meter in front of the robot at null elevation, and recorded with the fan turned off. The waving noise was recorded with the fan turned on and with NAO repeatedly waving its right arm. It was then pre-processed using the multichannel Wiener filtering method described in the previous section.
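Mixture construction of this kind is simply an addition of the two recordings; the generic procedure can be sketched as follows (Python, with synthetic stand-in signals; the 0 dB mixing SNR is an arbitrary assumption, not a value stated here):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise ratio of the sum is snr_db."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
fs = 16000
utterance = np.sin(2 * np.pi * 220 * np.arange(8000) / fs)  # stand-in for a GRID utterance
egonoise = rng.standard_normal(8000)                        # stand-in for out-of-training waving noise
mixture = mix_at_snr(utterance, egonoise, snr_db=0.0)
```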
Below are the results obtained with the proposed PO-KSVD+mask and PO-KSVD methods, compared with conventional NMF [Yifeng and Ngom 2013] and conventional K-SVD [Aharon et al. 2006]. All methods were trained on a 1-minute recording of NAO repeatedly moving its arm.
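For reference, the conventional K-SVD baseline [Aharon et al. 2006] alternates sparse coding (here via Orthogonal Matching Pursuit) with rank-1 SVD updates of each dictionary atom. The minimal real-valued Python sketch below illustrates that baseline only, not the proposed complex-domain PO-KSVD; the toy data and all sizes are arbitrary:

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal Matching Pursuit: k-sparse code of y over dictionary D."""
    idx, r = [], y.copy()
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ r))))        # most correlated atom
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        r = y - D[:, idx] @ coef                           # orthogonal residual
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                         # unit-norm atoms
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, sparsity) for y in Y.T])
        for j in range(n_atoms):                           # atom-by-atom update
            used = np.nonzero(X[j])[0]
            if used.size == 0:
                continue
            # Residual without atom j, restricted to the frames that use it
            E = Y[:, used] - D @ X[:, used] + np.outer(D[:, j], X[j, used])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]                              # best rank-1 fit
            X[j, used] = s[0] * Vt[0]
    return D, X

# Toy run: learn a small dictionary from random training frames
rng = np.random.default_rng(2)
Y = rng.standard_normal((16, 200))
D, X = ksvd(Y, n_atoms=24, sparsity=3, n_iter=5)
err = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)
```

In the egonoise application, the columns of Y would be (magnitude) spectral frames of the noise training recording; the proposed method instead factorizes complex multichannel spectra and optimizes the phases as well.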
PO-KSVD + mask:
The “walking” egonoise
Similarly, test mixtures were generated by summing utterances from the GRID corpus and out-of-training “walking noise” recordings. The utterances were emitted by a loudspeaker placed 1 meter in front of the robot at null elevation, and recorded with the fan turned off. The walking noise was recorded with the fan turned on and with NAO walking in place. Again, it was pre-processed using multichannel Wiener filtering.
Below are the results obtained with the proposed PO-KSVD+mask and PO-KSVD methods, compared with conventional NMF [Yifeng and Ngom 2013] and conventional K-SVD [Aharon et al. 2006]. All methods were trained on a 1-minute recording of NAO walking in place.