![]() SoundStream is the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs. The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network. ![]() ![]() During training, the encoder, quantizer and decoder parameters are optimized using a combination of reconstruction and adversarial losses, computed by a discriminator, which is trained to distinguish between the original input audio and the reconstructed audio. Learning a Scalable Codec with Residual Vector Quantization During inference, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal. The encoder of SoundStream produces vectors that can take an indefinite number of values. In order to transmit them to the receiver using a limited number of bits, it is necessary to replace them by close vectors from a finite set (called a codebook), a process known as vector quantization.
0 Comments
Leave a Reply. |