Phase-Optimized K-SVD: A New Sparse Representation for Multichannel Mixtures

This page provides supplementary audio and visual data for the ICASSP 2015 submission “Phase-Optimized K-SVD for Signal Extraction from Underdetermined Multichannel Sparse Mixtures” by Antoine Deleforge and Walter Kellermann (manuscript available at http://arxiv.org/abs/1410.2430). The paper introduces a new sparse matrix factorization technique operating in the complex Fourier domain. In contrast to existing non-negative factorization methods, the proposed approach is complex-valued and multichannel, and it estimates the instantaneous phase of all involved sound sources.

 

Download MATLAB code and usage examples for PO-KSVD:

The method is applied to the challenging problem of “egonoise” reduction, i.e., how to reduce the auditory noise produced by a robot performing motor actions such as hand waving or walking. All recordings were made with the commercial robot NAO V5 from Aldebaran Robotics, in the audio lab of the LMS chair (Erlangen, Germany). The T60 reverberation time of the room was around 200 ms. Although the recordings used by the proposed method are 4-channel (left, right, front and rear microphones), this page provides stereo sounds corresponding to the left and right microphones only, for a better listening experience.

Below are two videos of the robot NAO waving and walking. The soundtracks correspond to the sounds recorded at the left and the right microphones of the robot (best heard with headphones).

As can be heard, these signals are highly non-stationary and possess an intricate spatial distribution, making them challenging to model or extract from a mixture.

 

Multichannel Wiener pre-filtering

To reduce the noise produced by the CPU fan, the multichannel Wiener filtering technique described in [Löllmann et al. 2014] was used on all recordings, as illustrated below:

MWF_explained
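As a rough illustration of the idea, the sketch below computes a generic textbook multichannel Wiener filter for a single frequency bin: the noise covariance is estimated from a noise-only training signal (here, the fan noise) and subtracted from the covariance of the noisy input. This is a minimal sketch under stated assumptions and is not necessarily the exact variant of [Löllmann et al. 2014]; the function names `mwf_weights` and `apply_mwf`, the array layout, and the choice of reference channel are all hypothetical.

```python
import numpy as np

def mwf_weights(noisy_frames, noise_frames, ref=0):
    """Generic MWF weights for one frequency bin (illustrative only).

    noisy_frames, noise_frames: complex arrays of shape (channels, frames)
    holding STFT coefficients of the noisy input and of a noise-only
    training recording. `ref` is the (assumed) reference microphone.
    """
    # Sample covariance matrices of the noisy input and of the noise.
    phi_y = noisy_frames @ noisy_frames.conj().T / noisy_frames.shape[1]
    phi_n = noise_frames @ noise_frames.conj().T / noise_frames.shape[1]
    # Assuming target and noise are uncorrelated, the target covariance
    # is approximated by the difference of the two.
    phi_s = phi_y - phi_n
    # w = phi_y^{-1} phi_s e_ref minimizes E|s_ref - w^H y|^2.
    return np.linalg.solve(phi_y, phi_s[:, ref])

def apply_mwf(w, frames):
    """Filter the frames of one frequency bin: output = w^H y."""
    return w.conj() @ frames
```

In practice the weights would be computed independently for every frequency bin of the 4-channel STFT, and the covariance estimates would typically be regularized or smoothed over time.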

Fan noise (training signal):

 

Input noisy signal (waving right arm + speech + fan noise):

 

Cleaned signal:

 

All spectrograms shown on this page correspond to the left microphone channel and use the following color code:

color_bar
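For readers who want to produce comparable plots from the downloadable sounds, here is a minimal sketch of a log-magnitude spectrogram for a single channel. The frame length, hop size, Hann window, and dB floor are assumptions chosen for illustration and are not the settings used for the figures on this page; `log_spectrogram` is a hypothetical helper name.

```python
import numpy as np

def log_spectrogram(x, frame_len=512, hop=256, floor_db=-80.0):
    """Log-magnitude spectrogram of a 1-D signal, in dB relative to its peak.

    Returns an array of shape (frequency bins, frames). All parameter
    values are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping windowed frames.
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)            # (frames, bins)
    mag_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    mag_db -= mag_db.max()                        # peak at 0 dB
    # Clip at the floor and put frequency on the first axis.
    return np.maximum(mag_db, floor_db).T
```

For a stereo file loaded as an array of shape (samples, 2), the left channel would be passed as `log_spectrogram(audio[:, 0])`, assuming the left microphone is stored first.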

The “waving” egonoise

Test mixtures were generated by summing utterances from the GRID corpus and out-of-training “waving noise” recordings. The utterances were emitted by a loudspeaker placed 1 meter in front of the robot at zero elevation, and recorded with the fan turned off. The waving noise was recorded with the fan turned on, and with NAO repeatedly waving its right arm. It was then pre-processed using the multichannel Wiener filtering method described in the previous section.
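The summation step can be sketched as follows. This is only an illustration of sample-wise mixing of two separately recorded multichannel signals; the helper names `make_mixture` and `input_snr_db`, the (samples, channels) array layout, and the trimming to a common length are assumptions, and no particular signal-to-noise ratio is implied.

```python
import numpy as np

def make_mixture(speech, noise):
    """Sum two multichannel recordings sample by sample.

    speech, noise: real arrays of shape (samples, channels) recorded
    with the same microphones. Both are trimmed to the shorter length
    (an assumption made for this sketch).
    """
    n = min(len(speech), len(noise))
    return speech[:n] + noise[:n]

def input_snr_db(speech, noise):
    """Speech-to-noise energy ratio of the mixture, in dB."""
    n = min(len(speech), len(noise))
    return 10.0 * np.log10(np.sum(speech[:n] ** 2) / np.sum(noise[:n] ** 2))
```

Because the speech and the egonoise are recorded separately, the clean speech remains available as a ground truth against which the output of each method can be compared.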

waving_mixture

Clean speech:

 

“Waving” Noise:

 

Noisy input:

 

Below are the results obtained using the proposed PO-KSVD+mask and PO-KSVD methods, as compared to results obtained using conventional NMF [Yifeng and Ngom 2013] and conventional K-SVD [Aharon et al. 2006]. All methods were trained with a 1-minute recording of NAO repeatedly moving its arm.

waving_results

PO-KSVD + mask:

 

PO-KSVD:

 

NMF:

 

K-SVD:

 

The “walking” egonoise

Similarly, test mixtures were generated by summing utterances from the GRID corpus and out-of-training “walking noise” recordings. The utterances were emitted by a loudspeaker placed 1 meter in front of the robot at zero elevation, and recorded with the fan turned off. The walking noise was recorded with the fan turned on, and with NAO walking in place. Again, it was pre-processed using multichannel Wiener filtering.

walking_mixture

Clean speech:

 

“Walking” noise:

 

Noisy input:

 

Below are the results obtained using the proposed PO-KSVD+mask and PO-KSVD methods, as compared to results obtained using conventional NMF [Yifeng and Ngom 2013] and conventional K-SVD [Aharon et al. 2006]. All methods were trained with a 1-minute recording of NAO walking in place.

walking_results

PO-KSVD + mask:

 

PO-KSVD:

 

NMF:

 

K-SVD: