Mixture of Spectral Experts for Audio Deepfake Detection

Published:

Please cite:
@inproceedings{yaxuan2026_moe,
title={Mixture of Spectral Experts for Audio Deepfake Detection},
author={Yaxuan Qiu and Zhe Li and Mieradilijiang Maimaiti and Zunwang Ke and Yanbing Li and Wushour Silamu},
booktitle={Interspeech},
year={2026},
}

Abstract

Recent advances in neural speech synthesis have produced highly natural waveforms, making audio deepfake detection increasingly challenging as spoofing artifacts become less perceptible. Although pre-trained speech models provide robust representations, they may overlook low-level physical cues, particularly magnitude and phase information. To address this limitation, we propose a detection framework that combines a frequency audio encoder (FAE) with spectral parameter-efficient fine-tuning. The FAE explicitly models magnitude and phase cues, while the proposed Mixture of Spectral Experts (MoSE) efficiently adapts the pre-trained speech model to generation-dependent distribution shifts. By applying low-rank updates in the singular value decomposition (SVD) domain while keeping the singular bases frozen, MoSE facilitates task-specific adaptation to spoofing-related spectral artifacts. Evaluations on ASVspoof 2019 LA, ASVspoof 2021 LA/DF, and In-the-Wild benchmarks demonstrate the effectiveness of our approach and its strong generalization to unseen channel variations and real-world spoofing attacks.
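
To make the idea of low-rank updates in the SVD domain with frozen singular bases more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation. The layer shapes, rank, number of experts, softmax router, and all variable names are assumptions introduced purely for illustration; the abstract does not specify how experts are routed or combined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_out, d_in, rank, n_experts = 8, 6, 2, 3

# Stand-in for one pre-trained weight matrix of the speech model.
W = rng.standard_normal((d_out, d_in))

# Factor the weight once: W = U @ diag(s) @ Vt. U, s, and Vt stay frozen.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = s.shape[0]  # number of singular values

# Each assumed expert owns a trainable low-rank update A[e] @ B[e] (k x k)
# expressed in the coordinates of the frozen singular bases.
# B starts at zero, so the adapted layer initially reproduces W exactly.
A = rng.standard_normal((n_experts, k, rank)) * 0.01
B = np.zeros((n_experts, rank, k))

# Assumed router: a linear map followed by softmax over experts.
G = rng.standard_normal((d_in, n_experts)) * 0.01


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x):
    """Apply the adapted layer to x of shape (batch, d_in)."""
    gates = softmax(x @ G)                        # (batch, n_experts)
    h = x @ Vt.T                                  # project onto right singular basis
    base = h * s                                  # frozen singular values
    low = np.einsum("bk,erk->ber", h, B)          # apply B[e] per expert
    delta = np.einsum("ber,ejr->bej", low, A)     # apply A[e] per expert
    mixed = np.einsum("be,bej->bj", gates, delta)  # gate-weighted sum of experts
    return (base + mixed) @ U.T                   # map back with left singular basis


x = rng.standard_normal((4, d_in))
y = forward(x)
print(y.shape)
```

In this sketch, only `A`, `B`, and `G` would be trained, which amounts to `n_experts * 2 * k * rank + d_in * n_experts` parameters instead of the `d_out * d_in` entries of the full weight matrix.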

[PDF]