This page presents reconstructed speech samples from the Efficient Speech Codec (ESC) proposed in the paper “ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers”. It includes multilingual reconstructions (English, Italian, and French speech at 16 kHz, taken from the test set of [1]) produced by our ESC codec and by the DAC codec [2], which we reproduced by training on the same speech dataset, as detailed in the paper.

results.png

In general, ESC achieves reconstruction quality comparable to the state-of-the-art neural audio codec DAC at similar compression levels (ESC-Large vs. DAC-Base), while significantly reducing model complexity: a 4.8x smaller model, 1.4x faster encoding, and 6.4x faster decoding on a single CPU.
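For readers less familiar with what "compression level" means for a residual-vector-quantized codec, the sketch below shows the usual bitrate arithmetic: each code frame costs `num_codebooks * log2(codebook_size)` bits, and the bitrate is that cost multiplied by the frame rate. The frame rate (150 frames per second) and codebook size (1024 entries) are illustrative assumptions chosen only so that the numbers land on the 3, 6, and 9 kbps settings listed on this page; they are not taken from this page and may not match the actual ESC or DAC configurations. The comparison against raw audio likewise assumes 16-bit PCM at 16 kHz.

```python
import math


def rvq_bitrate_kbps(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a residual vector quantizer: one index per codebook per frame."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame / 1000.0


def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 16) -> float:
    """Bitrate of uncompressed mono PCM audio (16-bit depth is an assumption)."""
    return sample_rate_hz * bits_per_sample / 1000.0


# Hypothetical values for illustration only, not the published model settings.
FRAMES_PER_SECOND = 150
CODEBOOK_SIZE = 1024

if __name__ == "__main__":
    raw_kbps = pcm_bitrate_kbps(16_000)
    for num_codebooks in (2, 4, 6):
        kbps = rvq_bitrate_kbps(FRAMES_PER_SECOND, num_codebooks, CODEBOOK_SIZE)
        ratio = raw_kbps / kbps
        print(f"{num_codebooks} codebooks: {kbps:.2f} kbps "
              f"({ratio:.1f}x smaller than 16-bit PCM)")
```

Under these assumptions, using more codebooks raises the bitrate in equal steps, which is why codecs of this kind are typically compared at a few fixed bitrates.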

Check out the capabilities of our model through the multilingual speech examples (along with mel-spectrograms) below:

Screenshot 2024-10-21 at 13.14.22.png

Multilingual Speech Samples


Ground-Truth Speech Audio (English, Italian, French)

english_speech_img.png

english_speech.wav

italian_speech_img.png

italian_speech.wav

french_speech_img.png

french_speech.wav

Models @ bitrates

ESC-Base-Adv @ 3.00kbps

ESC-Base-Adv @ 6.00kbps

ESC-Base-Adv @ 9.00kbps

DAC-Tiny-Adv @ 3.00kbps

DAC-Tiny-Adv @ 6.00kbps

DAC-Tiny-Adv @ 9.00kbps

ESC-Large @ 3.00kbps

ESC-Large @ 6.00kbps

ESC-Large @ 9.00kbps

DAC-Base-Adv @ 3.00kbps

DAC-Base-Adv @ 6.00kbps

DAC-Base-Adv @ 9.00kbps