Time-domain Separation Priority Pipeline-based Cascaded Multi-task Learning for Monaural Noisy and Reverberant Speech Separation

Shaoxiang Dang, Nagoya University, Japan, dang.shaoxiang.s0@s.mail.nagoya-u.ac.jp; Tetsuya Matsumoto, Nagoya University, Japan; Yoshinori Takeuchi, Daido University, Japan; Hiroaki Kudo, Nagoya University, Japan
 
Suggested Citation
Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi and Hiroaki Kudo (2025), "Time-domain Separation Priority Pipeline-based Cascaded Multi-task Learning for Monaural Noisy and Reverberant Speech Separation", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e23. http://dx.doi.org/10.1561/116.20250022

Publication Date: 28 Aug 2025
© 2025
 
Subjects
Audio signal processing, Enhancement, Source separation, Signal reconstruction, Deep learning
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Problem Formulation and Related Works 
Proposed Methods 
Experiments 
Results 
Conclusion 
References 

Abstract

Monaural speech separation is a crucial task in speech processing: separating a single-channel recording of multiple speakers into individual streams. The problem is particularly challenging in noisy and reverberant environments, where the target information becomes obscured. Cascaded multi-task learning breaks a complex task down into simpler sub-tasks and leverages additional information for step-by-step learning, making it an effective approach for integrating multiple objectives. However, its sequential nature often leads to over-suppression, which degrades the performance of downstream modules. This article presents three main contributions. First, we propose a separation-priority pipeline that protects the critical separation sub-task from over-suppression. Second, to extract deeper multi-scale features, we design a consistent-stride deep encoder-decoder structure combined with depth-wise multi-receptive-field fusion. Third, we advocate a training strategy that pre-trains each sub-task and then applies time-varying and time-invariant weighted fine-tuning to further mitigate over-suppression. Our methods are evaluated on the open-source Libri2Mix and real-world LibriCSS datasets, and experimental results across diverse metrics demonstrate that each proposed innovation improves overall model performance.
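For readers skimming the abstract, one way to picture the "time-varying and time-invariant weighted fine-tuning" is as a per-sub-task loss weighted by both a fixed coefficient and a factor that changes over the fine-tuning schedule. The PyTorch sketch below is only an illustrative reading under that assumption; the function name combined_multitask_loss, the static_weights argument, and the linear ramp schedule are hypothetical and not taken from the paper.

```python
import torch

def combined_multitask_loss(sub_losses, static_weights, step, total_steps):
    """Sum per-sub-task losses weighted by a fixed (time-invariant) coefficient
    and a schedule-dependent (time-varying) factor.

    Hypothetical illustration only; the paper's exact weighting scheme may differ.
    """
    alpha = step / max(total_steps, 1)               # 0 -> 1 over fine-tuning
    n = len(sub_losses)
    total = torch.zeros(())
    for k, (loss_k, w_k) in enumerate(zip(sub_losses, static_weights)):
        ramp = (1.0 - alpha) + alpha * (k + 1) / n   # later stages gain weight as training proceeds
        total = total + w_k * ramp * loss_k
    return total

# Example with placeholder losses for three cascaded sub-tasks
# (e.g., denoising, dereverberation, separation):
losses = [torch.tensor(0.8), torch.tensor(0.5), torch.tensor(1.2)]
print(combined_multitask_loss(losses, static_weights=[1.0, 1.0, 2.0],
                              step=300, total_steps=1000))
```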

DOI:10.1561/116.20250022