Reconstruct mel spectrogram from librosa #19
Replies: 3 comments
-
Hi, thanks a lot. I am glad you like it 🙂 If I understand you correctly, you would like to reconstruct a Mel spectrogram you obtained from a wav file using librosa. However, the demo (in this cell; is its output what you want?) also extracts the Mel spectrogram from the raw audio using librosa: SpecVQGAN/feature_extraction/demo_utils.py, lines 348 to 353 (at eee222d), calls get_spectrogram(), which is implemented in SpecVQGAN/feature_extraction/extract_mel_spectrogram.py, lines 166 to 187 (at eee222d). The transforms you need to apply to convert the sound samples to a Mel spectrogram are in SpecVQGAN/feature_extraction/extract_mel_spectrogram.py, lines 141 to 151 (at eee222d).

Just make sure your Mel spectrogram is extracted with the same parameters and that you apply the same transforms (log, clipping, etc.; see the lines above).
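For reference, a rough sketch of that pipeline in Python (the parameter values and post-processing constants below are only illustrative; copy the exact ones from extract_mel_spectrogram.py):

```python
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Sketch of the extraction in feature_extraction/extract_mel_spectrogram.py.
    The defaults above are placeholders; use the exact values from the repo."""
    y, _ = librosa.load(wav_path, sr=sr)
    # mel spectrogram from the raw waveform via librosa
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # post-processing chain (lower-threshold, log10, rescale, clip to [0, 1]);
    # double-check these constants against the transforms defined in the file
    mel = np.log10(np.maximum(mel, 1e-5))
    mel = np.clip((mel * 20 - 20 + 100) / 100, 0.0, 1.0)
    return mel
```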
-
Also, check whether the Neural Audio Codec colab demo makes it any clearer.
-
Hello, thank you very much for these! Will check them out! :D
-
Hello! First of all, thanks for this wonderful repo. I would just like to ask how to reconstruct the Mel spectrogram I generated with librosa. I can do this via VQGAN using this code:
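Roughly, the helper along the lines of the taming-transformers reconstruction example (a sketch; `model` is assumed to be a loaded VQGAN/VQModel):

```python
import torch

def preprocess_vqgan(x):
    # VQGAN expects inputs in [-1, 1], so map from [0, 1]
    return 2. * x - 1.

@torch.no_grad()
def reconstruct_with_vqgan(x, model):
    # encode to the quantized latent, then decode back to an image
    z, _, [_, _, indices] = model.encode(x)
    xrec = model.decode(z)
    return xrec
```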
Here, xrec is the reconstructed image (from the VQGAN).
I also add a preprocessing step before reconstructing, using this code (the same one as in DALL-E's VQ-VAE):
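Roughly, the DALL-E-style preprocessing (a sketch of the helper from the DALL-E notebook; `target_image_size` is illustrative):

```python
import PIL
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from dall_e import map_pixels  # from the openai/DALL-E package

target_image_size = 256  # illustrative

def preprocess(img):
    # resize the shorter side to target_image_size, center-crop,
    # convert to a batched tensor, and map pixels as DALL-E expects
    s = min(img.size)
    if s < target_image_size:
        raise ValueError(f'min dim for image {s} < {target_image_size}')
    r = target_image_size / s
    s = (round(r * img.size[1]), round(r * img.size[0]))
    img = TF.resize(img, s, interpolation=PIL.Image.LANCZOS)
    img = TF.center_crop(img, output_size=2 * [target_image_size])
    img = torch.unsqueeze(T.ToTensor()(img), 0)
    return map_pixels(img)
```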
In the end, I just call these two functions to reconstruct the image:
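That is, something like this (the file name is just a placeholder):

```python
from PIL import Image

img = Image.open('mel_spectrogram.png')                    # placeholder input image
x = preprocess(img)                                        # DALL-E style preprocessing -> [0, 1] tensor
xrec = reconstruct_with_vqgan(preprocess_vqgan(x), model)  # reconstruction from the VQGAN codebook
```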
I was wondering how I could use your model instead to reconstruct in a similar way. I checked the demo and saw that it extracts the audio from a video; I am wondering how I can directly reconstruct a Mel spectrogram generated with librosa.
Thank you very much in advance :D