Sound demos for "On Fast Sampling of Diffusion Probabilistic Models"



Section Ⅰ: Neural vocoding on the LJ Speech dataset

The audio samples are generated by conditioning on the ground-truth mel spectrogram. The pretrained model is DiffWave trained with channel = 128 and T = 200. As references, we provide samples from the original DiffWave (T = 200) and from its fast synthesis algorithm with S = 6 steps. For FastDPM, we provide samples generated with S = 5 and S = 6 steps. For each S, all four combinations of VAR / STEP with DDPM-rev / DDIM-rev are included.
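To make the layout of the FastDPM samples in this section explicit, the following minimal sketch enumerates the sample sets described above: two step counts times four settings. The constant and function names are hypothetical and chosen only for illustration; the values come from the paragraph above.

```python
from itertools import product

# Values taken from the description above; the names are illustrative only.
STEP_COUNTS = (5, 6)                          # S, number of FastDPM steps
NOISE_LINKS = ("VAR", "STEP")                 # first half of each setting label
REVERSE_PROCESSES = ("DDPM-rev", "DDIM-rev")  # second half of each setting label


def fastdpm_settings():
    """Return one (S, label) pair per FastDPM sample set (hypothetical helper)."""
    return [
        (s, f"{link} + {rev}")
        for s, link, rev in product(STEP_COUNTS, NOISE_LINKS, REVERSE_PROCESSES)
    ]


for s, label in fastdpm_settings():
    print(f"S = {s}: {label}")
```

The enumeration order follows the order in which the samples are listed on this page: all four settings for S = 5, then all four for S = 6.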

FastDPM (S = 5):

  VAR + DDPM-rev
  VAR + DDIM-rev
  STEP + DDPM-rev
  STEP + DDIM-rev


FastDPM (S = 6):

  VAR + DDPM-rev
  VAR + DDIM-rev
  STEP + DDPM-rev
  STEP + DDIM-rev


Reference audio:

                     
  DiffWave (T = 200)
  DiffWave fast (S = 6)
  Ground Truth (recorded)


Section Ⅱ: Class-conditional waveform generation on the SC09 dataset

The audio samples are generated by conditioning on the digit labels (0 to 9). The pretrained model is DiffWave trained with channel = 256 and T = 200. The approximate reverse process in FastDPM uses S = 50 steps. Samples are provided for the four settings listed below, drawn from the combinations of VAR / STEP with DDPM-rev / DDIM-rev. Within each setting, results are arranged by conditioning digit label.

  VAR + DDIM-rev (κ = 0.5)  
  STEP + DDIM-rev (κ = 0.5)  
  VAR + DDPM-rev   
  VAR + DDIM-rev (κ = 0.0)  
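The SC09 samples use S = 50 steps with a model trained for T = 200 steps. As a rough illustration of what that reduction means, the sketch below keeps S evenly spaced step indices out of T. Even spacing and the function name are assumptions made for this example; this page does not describe how the VAR or STEP variants choose their noise levels, so this should not be read as the method used for the samples.

```python
def evenly_spaced_steps(T: int, S: int) -> list[int]:
    """Pick S step indices in [0, T - 1], including both endpoints.

    Assumption: indices are evenly spaced. This is an illustrative choice,
    not the selection rule used to generate the demo samples.
    """
    if not 1 < S <= T:
        raise ValueError("need 1 < S <= T")
    return [round(i * (T - 1) / (S - 1)) for i in range(S)]


# Settings quoted on this page for the SC09 experiments.
steps = evenly_spaced_steps(T=200, S=50)
print(len(steps), steps[0], steps[-1])
```

With these values the sampler would visit a quarter of the original steps, which is where the speed-up over the full T = 200 process comes from.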


Section Ⅲ: Unconditional waveform generation on the SC09 dataset

The audio samples are generated without any conditioning information. The pretrained model is DiffWave trained with channel = 256 and T = 200. The approximate reverse process in FastDPM uses S = 50 steps. Samples are provided for the four settings listed below, drawn from the combinations of VAR / STEP with DDPM-rev / DDIM-rev.

  VAR + DDPM-rev   
  STEP + DDPM-rev   
  VAR + DDIM-rev (κ = 0.5)  
  VAR + DDIM-rev (κ = 0.0)