Inference #28
Comments
Yes, we used those 886 audio files for evaluation. Can you specify which checkpoint you used and which results you were not able to obtain? |
I used https://huggingface.co/declare-lab/tango to generate the 886 audio files with Guidance Scale=3 and Steps=200, and got: {"frechet_distance": 28.07995041974766, "frechet_audio_distance": 2.2381015516014955, "kullback_leibler_divergence_sigmoid": 3.8415958881378174, "kullback_leibler_divergence_softmax": 2.097446918487549, "lsd": 2.0631229603209094, "psnr": 15.874651663776682, "ssim": 0.4171875863485156, "ssim_stft": 0.09866382013407798, "inception_score_mean": 7.612150196882789, "inception_score_std": 0.8235111705490618, "kernel_inception_distance_mean": 0.010067609062191894, "kernel_inception_distance_std": 1.404596756557554e-07}
Do I need to constrain the length of the generated audio to match the length of the original audio for the metrics to be comparable?
No, the length doesn't have to be controlled. I added the inference_hf.py script for running evaluation from our huggingface checkpoints. Can you try and check the scores you obtain from this script? I just did two runs and got the following scores: {
"frechet_distance": 24.4243,
"frechet_audio_distance": 1.7324,
"kl_sigmoid": 3.5901,
"kl_softmax": 1.3216,
"lsd": 2.0861,
"psnr": 15.6047,
"ssim": 0.4061,
"ssim_stft": 0.1027,
"is_mean": 7.5181,
"is_std": 0.6758,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:0",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974057_steps_200_guidance_3"
}

{
"frechet_distance": 24.9405,
"frechet_audio_distance": 1.6633,
"kl_sigmoid": 3.551,
"kl_softmax": 1.3122,
"lsd": 2.0957,
"psnr": 15.5877,
"ssim": 0.405,
"ssim_stft": 0.1027,
"is_mean": 7.187,
"is_std": 0.5192,
"kid_mean": 0.0066,
"kid_std": 0.0,
"Steps": 200,
"Guidance Scale": 3,
"Test Instances": 886,
"scheduler_config": {
"num_train_timesteps": 1000,
"beta_start": 0.00085,
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"trained_betas": null,
"variance_type": "fixed_small",
"clip_sample": false,
"prediction_type": "v_prediction",
"thresholding": false,
"dynamic_thresholding_ratio": 0.995,
"clip_sample_range": 1.0,
"sample_max_value": 1.0,
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.8.0",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1
},
"args": {
"test_file": "data/test_audiocaps_subset.json",
"text_key": "captions",
"device": "cuda:3",
"test_references": "data/audiocaps_test_references/subset",
"num_steps": 200,
"guidance": 3,
"batch_size": 8,
"num_test_instances": -1
},
"output_dir": "outputs/1688974524_steps_200_guidance_3"
} Our results in the paper are averaged over multiple runs, as there is some randomness in the diffusion inference process.
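To illustrate the averaging mentioned above, here is a minimal sketch that computes the mean and sample standard deviation of a few metrics over the two runs posted in this thread. The values are copied from those two logs; the `summarize` helper is hypothetical, and this is not necessarily how the numbers in the paper were aggregated.

```python
from statistics import mean, stdev

# Scores copied from the two runs posted above (200 steps, guidance scale 3).
runs = [
    {"frechet_distance": 24.4243, "frechet_audio_distance": 1.7324,
     "kl_softmax": 1.3216, "is_mean": 7.5181},
    {"frechet_distance": 24.9405, "frechet_audio_distance": 1.6633,
     "kl_softmax": 1.3122, "is_mean": 7.187},
]


def summarize(runs):
    """Return {metric: (mean, sample std)} across runs (hypothetical helper)."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        # stdev needs at least two values; fall back to 0.0 for a single run.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[metric] = (mean(values), spread)
    return summary


summary = summarize(runs)
for metric, (avg, spread) in summary.items():
    print(f"{metric}: {avg:.4f} +/- {spread:.4f}")
```

With only two runs the spread estimate is rough, but it gives a sense of how much run-to-run variation to expect when comparing against a single run.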
Thank you for the explanation.
I found that the sampling rate of the reference audio has an impact on the final result. What was the sampling rate of your reference audio before converting it to 16 kHz?
All our reference audio files are in 16 kHz. I checked the AudioLDM Eval repository, and they now mention that the sampling rate can have an effect on the evaluation scores. Their paper and evaluation code indicate that their scores are reported for 16 kHz, so we also report results at the same sampling rate for a fair comparison.
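Since the sampling rate of the reference audio can shift the scores, it may help to verify it before running the evaluation. The following is a rough sketch using only the Python standard library. It assumes the references are PCM `.wav` files (the `wave` module cannot read other encodings), and `find_sample_rate_mismatches` is a hypothetical helper, not part of the repository.

```python
import wave
from pathlib import Path


def find_sample_rate_mismatches(reference_dir, expected_rate=16000):
    """Return {filename: rate} for WAV files whose rate differs from expected_rate."""
    mismatches = {}
    for path in sorted(Path(reference_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches[path.name] = rate
    return mismatches


if __name__ == "__main__":
    # Directory taken from the "test_references" argument in the logs above.
    bad = find_sample_rate_mismatches("data/audiocaps_test_references/subset")
    print(f"{len(bad)} file(s) not at 16 kHz: {bad}")
```

An empty result means every reference file found is already at 16 kHz; any file listed would need resampling before the scores can be compared with the ones reported here.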
Hello, during the inference phase, do I only need to use the 886 audio files from your data/test_audiocaps_subset.json? I have been unable to obtain the results from your paper, even when using your checkpoint.