Efficient-Speech-Codec (ESC) - Audio Speech Samples

This page presents example speech reconstructions from the Efficient Speech Codec (ESC) proposed in the paper “ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers”. It includes multilingual samples (English, Italian, and French speech at 16 kHz, taken from the test set of [1]) reconstructed by our ESC codec as well as by the DAC codec [2] reproduced on the same training speech dataset, as detailed in the paper.

In general, ESC achieves reconstruction quality comparable to that of the state-of-the-art neural audio codec DAC over the same range of compression levels (ESC-Large vs. DAC-Base), while significantly reducing model complexity: a 4.8x smaller model, with 1.4x the encoding speed and 6.4x the decoding speed on a single CPU.

Check out the capabilities of our model through the multilingual speech examples (along with mel-spectrograms) below:

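For readers who want to produce comparable mel-spectrogram views of their own 16 kHz audio, the following is a minimal NumPy-only sketch. The function names and all parameters (FFT size, hop length, number of mel bands) are illustrative assumptions, not the settings used to render the figures on this page, and a synthetic 440 Hz tone stands in for a real speech clip.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one common convention; others exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to sr/2."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=128, n_mels=80):
    """Return a (n_mels, n_frames) log-power mel-spectrogram in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T    # floor avoids log(0)

# Stand-in signal: one second of a 440 Hz tone at 16 kHz (not real speech).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2.0 * np.pi * 440.0 * t)
spec = log_mel_spectrogram(tone, sr=sr)
print(spec.shape)  # (80, 122)
```

Applying the same function to an original clip and to its reconstruction, then plotting the two arrays side by side, gives a rough visual comparison of the kind the spectrogram figures are meant to convey.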