Speaker: Alexandre Défossez
Date: January 10, 2019 at 13:00 – B011
Abstract:
Recent progress in deep learning for audio synthesis opens the way to models that
directly produce the waveform, shifting away from the traditional paradigm of
relying on vocoders or MIDI synthesizers for speech or music generation. Despite
their successes, current state-of-the-art neural audio synthesizers such as WaveNet
and SampleRNN [24, 17] suffer from prohibitive training and inference times
because they are based on autoregressive models that generate audio samples one
at a time at a rate of 16 kHz. In this work, we study the more computationally
efficient alternative of generating the waveform frame-by-frame with large strides.
We present SING, a lightweight neural audio synthesizer for the original task of
generating musical notes given desired instrument, pitch and velocity. Our model
is trained end-to-end to generate notes from nearly 1000 instruments with a single
decoder, thanks to a new loss function that minimizes the distances between the
log spectrograms of the generated and target waveforms. On the generalization
task of synthesizing notes for pairs of pitch and instrument not seen during training,
SING produces audio with significantly improved perceptual quality compared to a
state-of-the-art autoencoder based on WaveNet [4] as measured by a Mean Opinion
Score (MOS), and is about 32 times faster for training and 2,500 times faster for
inference.
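
To make the idea of a loss on log spectrograms concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says the loss minimizes distances between log spectrograms, so the use of an L1 distance, a Hann window, the values of `n_fft`, `hop` and `eps`, and the function names `log_power_spectrogram` and `spectral_loss` are all choices made for this example.

```python
import numpy as np


def log_power_spectrogram(wave, n_fft=1024, hop=256, eps=1.0):
    """Return log(eps + |STFT|^2) for a 1-D waveform.

    n_fft, hop and eps are illustrative defaults (assumptions), and the
    waveform must contain at least n_fft samples.
    """
    wave = np.asarray(wave, dtype=np.float64)
    window = np.hanning(n_fft)
    # Number of full frames that fit in the signal (no padding).
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real FFT of each frame: shape (n_frames, n_fft // 2 + 1).
    spectrum = np.fft.rfft(frames, axis=1)
    return np.log(eps + np.abs(spectrum) ** 2)


def spectral_loss(generated, target, **stft_kwargs):
    """Mean absolute difference between the two log power spectrograms.

    The L1 distance is an assumption; any distance between the two
    spectrograms fits the description in the abstract.
    """
    gen_spec = log_power_spectrogram(generated, **stft_kwargs)
    tgt_spec = log_power_spectrogram(target, **stft_kwargs)
    return float(np.mean(np.abs(gen_spec - tgt_spec)))


if __name__ == "__main__":
    sample_rate = 16000  # 16 kHz, the rate mentioned in the abstract
    t = np.arange(sample_rate) / sample_rate
    note_a = np.sin(2 * np.pi * 440.0 * t)  # one second at 440 Hz
    note_b = np.sin(2 * np.pi * 880.0 * t)  # one second at 880 Hz
    print("same note:      ", spectral_loss(note_a, note_a))
    print("different notes:", spectral_loss(note_a, note_b))
```

Because the loss compares spectrogram magnitudes rather than raw samples, it does not require the generated waveform to match the target sample by sample. In an actual training setup the same computation would be written in a framework with automatic differentiation so that gradients can flow back to the decoder.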