Example speech reconstructions from the Efficient Speech Codec (ESC) proposed in the paper “ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers”. This page presents multilingual reconstructions (English, Italian, and French speech at 16 kHz, taken from the test set of [1]) produced by our ESC codec and by the DAC codec [2], which we reproduced on the same training speech dataset, as detailed in the paper.
Overall, ESC achieves reconstruction quality comparable to that of the state-of-the-art neural audio codec DAC at the same range of compression levels (ESC-Large vs. DAC-Base), while significantly reducing model complexity: a 4.8x smaller model, 1.4x the encoding speed, and 6.4x the decoding speed on a single CPU.
Check out the capabilities of our model through the multilingual speech examples (along with mel-spectrograms) below:
Ground Truth Speech Audio (English; Italian; French Speech)
Models @ bitrates
ESC-Base-Adv @ 3.00kbps
ESC-Base-Adv @ 6.00kbps
ESC-Base-Adv @ 9.00kbps
DAC-Tiny-Adv @ 3.00kbps
DAC-Tiny-Adv @ 6.00kbps
DAC-Tiny-Adv @ 9.00kbps
ESC-Large @ 3.00kbps
ESC-Large @ 6.00kbps
ESC-Large @ 9.00kbps
DAC-Base-Adv @ 3.00kbps
DAC-Base-Adv @ 6.00kbps
DAC-Base-Adv @ 9.00kbps
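To put the bitrates listed above in perspective, the following sketch estimates how much smaller each setting is than uncompressed audio. It assumes the 16 kHz source speech is stored as 16-bit mono PCM; the bit depth and channel count are assumptions for illustration and are not stated on this page, and the helper names are hypothetical.

```python
# Rough compression ratio of each codec bitrate relative to raw PCM.
SAMPLE_RATE_HZ = 16_000               # from this page: 16 kHz speech
BIT_DEPTH = 16                        # assumption: 16-bit PCM (not stated here)
CHANNELS = 1                          # assumption: mono speech
CODEC_BITRATES_KBPS = [3.0, 6.0, 9.0]  # bitrates listed in the table above


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000.0


def compression_ratio(codec_kbps: float) -> float:
    """How many times smaller the coded stream is than the assumed raw PCM."""
    return pcm_bitrate_kbps(SAMPLE_RATE_HZ, BIT_DEPTH, CHANNELS) / codec_kbps


ratios = {kbps: compression_ratio(kbps) for kbps in CODEC_BITRATES_KBPS}

for kbps, ratio in ratios.items():
    print(f"{kbps:.2f} kbps -> about {ratio:.1f}x smaller than raw PCM")
```

Under these assumptions the raw stream is 256 kbps, so the lowest setting in the table corresponds to a reduction of roughly 85x. A different bit depth would scale all ratios proportionally.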