Sound demos for "On Fast Sampling of Diffusion Probabilistic Models"
Section Ⅰ: Neural vocoding on the LJ Speech dataset
The audio samples are generated by conditioning on the ground-truth mel spectrogram.
The pretrained model is DiffWave trained with channel = 128 and T = 200.
We provide samples from the original DiffWave and from its fast synthesis algorithm with S = 6 steps.
For FastDPM, we provide samples generated with S = 5 steps and with S = 6 steps.
All four settings (VAR / STEP + DDPM-rev / DDIM-rev) are included.
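As a rough illustration of what reducing T = 200 training steps to S = 5 or 6 sampling steps involves, the sketch below picks S evenly spaced step indices out of T. It is only an assumption about how a step-based schedule could be chosen: the function name `uniform_step_subsequence` and the even spacing are hypothetical and are not taken from this page or from the FastDPM code.

```python
from typing import List


def uniform_step_subsequence(T: int, S: int) -> List[int]:
    """Return S evenly spaced indices from 0..T-1 (hypothetical helper).

    Illustrative only: assumes the short reverse process visits a
    subsequence of the T training steps, including both endpoints.
    """
    if not 1 <= S <= T:
        raise ValueError("S must satisfy 1 <= S <= T")
    if S == 1:
        return [T - 1]
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [(i * (T - 1)) // (S - 1) for i in range(S)]


steps_s6 = uniform_step_subsequence(200, 6)
steps_s5 = uniform_step_subsequence(200, 5)
```

With T = 200, this gives [0, 39, 79, 119, 159, 199] for S = 6 and [0, 49, 99, 149, 199] for S = 5.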
FastDPM (S = 5):
VAR + DDPM-rev
VAR + DDIM-rev
STEP + DDPM-rev
STEP + DDIM-rev
FastDPM (S = 6):
VAR + DDPM-rev
VAR + DDIM-rev
STEP + DDPM-rev
STEP + DDIM-rev
Reference audio:
DiffWave (T = 200)
DiffWave fast (S = 6)
Ground Truth (recorded)
Section Ⅱ: Class-conditional waveform generation on the SC09 dataset
The audio samples are generated by conditioning on the digit labels (0 to 9).
The pretrained model is DiffWave trained with channel = 256 and T = 200.
The number of steps of the approximate reverse process in FastDPM is S = 50.
Samples from four settings (VAR / STEP + DDPM-rev / DDIM-rev) are provided.
Results are arranged by conditioning digit label.
VAR + DDIM-rev (κ = 0.5)
STEP + DDIM-rev (κ = 0.5)
VAR + DDPM-rev
VAR + DDIM-rev (κ = 0.0)
Section Ⅲ: Unconditional waveform generation on the SC09 dataset
The audio samples are generated without any conditional information.
The pretrained model is DiffWave trained with channel = 256 and T = 200.
The number of steps of the approximate reverse process in FastDPM is S = 50.
Samples from four settings (VAR / STEP + DDPM-rev / DDIM-rev) are provided.