musicly-ai/singing_voice_conversion
About
This is the Replicate version of singing_voice_conversion from Amphion.

Example Output
Output
Performance Metrics
- Prediction Time: 35.15s
- Total Time: 59.18s
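The execution logs further down this page print a timer for each stage of the run. As a rough cross-check against the metrics above, the sketch below parses those timer lines and sums them. The helper name `stage_timings` and the regular expression are illustrative assumptions, and the sum is only meaningful if the logged timers do not overlap, which the page does not state.

```python
import re

# Stage timer lines copied from the example execution logs on this page.
LOG_EXCERPT = """
Prepare for meta eval data: 0.0s
Prepare for acoustic features: 2.0s
Prepare for content features: 16.4s
Model Init: 0.4s
Model inference: 15.6s
"""

# Matches a line of the form "<stage name>: <seconds>s".
STAGE_RE = re.compile(
    r"^(?P<stage>[A-Za-z ]+): (?P<seconds>\d+(?:\.\d+)?)s$", re.MULTILINE
)


def stage_timings(log_text: str) -> dict:
    """Return {stage name: seconds} for every timer line in the log text."""
    return {m["stage"]: float(m["seconds"]) for m in STAGE_RE.finditer(log_text)}


timings = stage_timings(LOG_EXCERPT)

# Sum of the logged stage timers (assumes the stages do not overlap).
logged_total = round(sum(timings.values()), 1)

# Metrics reported on this page.
prediction_time = 35.15
total_time = 59.18

# Time spent outside the prediction itself; the page does not break this down.
overhead = round(total_time - prediction_time, 2)

print(timings)
print("sum of logged stages:", logged_total)
print("total minus prediction:", overhead)
```

The logged stages add up to roughly the reported prediction time, with content feature extraction and model inference accounting for most of it.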
All Input Parameters
{ "target_singer": "Adele", "source_audio_path": "https://replicate.delivery/pbxt/K82snUEMoacRPUe9mkP5jRWZE2Z6tZcMy82T1fnXOCa4TXgU/female_vocal.wav" }
Input Parameters
- target_singer: Name of the target singer (for example, "Adele").
- source_audio_path (required): Path to the source audio file. The example above passes a URL to a .wav file.
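The two parameters above can be passed to the model through the Replicate Python client. The following is a minimal sketch, assuming the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `convert` are hypothetical, and the shape of the returned output is not documented on this page.

```python
from typing import Optional

# Version ID taken from the "Version Details" section of this page.
MODEL_VERSION = "20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10"
MODEL_REF = f"musicly-ai/singing_voice_conversion:{MODEL_VERSION}"


def build_input(source_audio_path: str, target_singer: Optional[str] = None) -> dict:
    """Build the input payload; only source_audio_path is required."""
    if not source_audio_path:
        raise ValueError("source_audio_path is required")
    payload = {"source_audio_path": source_audio_path}
    if target_singer is not None:
        payload["target_singer"] = target_singer
    return payload


def convert(source_audio_path: str, target_singer: Optional[str] = None):
    """Run the model on Replicate (hypothetical helper, needs network access)."""
    # Imported here so the payload helper works without the third-party package.
    import replicate

    return replicate.run(
        MODEL_REF, input=build_input(source_audio_path, target_singer)
    )
```

Calling `convert(url, "Adele")` with the URL of a vocal recording would reproduce the example input shown on this page.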
Output Schema
Output
Example Execution Logs
the target_singer is : Adele
source_audio_path: /tmp/tmp63uozntufemale_vocal.wav
------------args_list-------------------
['--config', 'ckpts/svc/vocalist_l1_contentvec+whisper/args.json', '--acoustics_dir', 'ckpts/svc/vocalist_l1_contentvec+whisper', '--vocoder_dir', 'pretrained/bigvgan', '--target_singer', 'vocalist_l1_Adele', '--trans_key', 'autoshift', '--diffusion_inference_steps', '1000', '--source', '/tmp/', '--output_dir', 'result', '--log_level', 'debug']
------------args_list--------------
There are 1 source audios:
**********
Conversion for tmp63uozntufemale_vocal...
Prepare for meta eval data: 0.0s
100%|██████████| 1/1 [00:02<00:00, 2.03s/it]
Loading F0...: 100%|██████████| 1/1 [00:00<00:00, 2794.34it/s]
Singers statistics: 100%|██████████| 1/1 [00:00<00:00, 1478.95it/s]
Prepare for acoustic features: 2.0s
Loading Whisper Model...
Using GPU...
100%|██████████| 1/1 [00:04<00:00, 4.38s/it]
Load Contentvec Model...
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | current directory is /src
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-28 18:15:30 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
100%|██████████| 1/1 [00:03<00:00, 3.63s/it]
Prepare for content features: 16.4s
2023-12-28 18:15:35 | INFO | inference |
========================================================
2023-12-28 18:15:35 | INFO | inference | || New inference process started. ||
2023-12-28 18:15:35 | INFO | inference | ========================================================
2023-12-28 18:15:35 | INFO | inference |
2023-12-28 18:15:35 | DEBUG | inference | Using DEBUG logging level.
2023-12-28 18:15:35 | DEBUG | inference | Acoustic dir: ckpts/svc/vocalist_l1_contentvec+whisper
2023-12-28 18:15:35 | DEBUG | inference | Vocoder dir: pretrained/bigvgan
2023-12-28 18:15:35 | DEBUG | inference | Setting random seed done in 0.38ms
2023-12-28 18:15:35 | DEBUG | inference | Random seed: 10086
2023-12-28 18:15:35 | INFO | inference | Building dataset...
----------get_metadata: path---------
ckpts/svc/vocalist_l1_contentvec+whisper/data/tmp63uozntufemale_vocal/eval.json
----------get_metadata: path---------
2023-12-28 18:15:35 | INFO | inference | Building dataset done in 5.86ms
2023-12-28 18:15:35 | INFO | inference | Building model...
2023-12-28 18:15:35 | INFO | inference | Building model done in 275.970ms
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate...
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate done in 23.438ms
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint...
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading states from ckpts/svc/vocalist_l1_contentvec+whisper/checkpoint/epoch-6852_step-0678447_loss-1.946773
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All model weights loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All optimizer states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All scheduler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All dataloader sampler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All random states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading in 0 custom states
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint done in 53.397ms
2023-12-28 18:15:35 | INFO | inference | Using PNDM scheduler.
Model Init: 0.4s
Auto transposing: source f0 median = 232.4, target f0 median = 333.0, factor = 1.43
100%|██████████| 1009/1009 [00:05<00:00, 173.89it/s]
Synthesis audios using bigvgan vocoder...
Loading Vocoder from Weights file: /src/pretrained/bigvgan/400000.pt
For predicted mels, #sample = 1...
Model inference: 15.6s
100%|██████████| 1/1 [00:34<00:00, 34.51s/it]
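The "Auto transposing" line in the logs reports a source f0 median of 232.4, a target f0 median of 333.0, and a factor of 1.43, which matches the ratio of target to source. The sketch below reproduces that ratio and converts it to semitones with the standard equal-temperament formula. The function name `transposition` is hypothetical, the Hz unit is an assumption, and the logs do not show whether the model rounds the shift to a whole number of semitones.

```python
import math


def transposition(source_f0_median: float, target_f0_median: float):
    """Return (frequency ratio, shift in semitones) between two f0 medians."""
    if source_f0_median <= 0 or target_f0_median <= 0:
        raise ValueError("f0 medians must be positive")
    factor = target_f0_median / source_f0_median
    # One octave (a ratio of 2) spans 12 semitones in equal temperament.
    semitones = 12 * math.log2(factor)
    return factor, semitones


# Values copied from the "Auto transposing" log line (unit assumed to be Hz).
factor, semitones = transposition(232.4, 333.0)
print(f"factor = {factor:.2f}, shift = {semitones:.2f} semitones")
```

A ratio of about 1.43 corresponds to an upward shift of a little over six semitones, so the source vocal is raised by roughly half an octave to sit in the target singer's range.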
Version Details
- Version ID: 20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10
- Version Created: December 27, 2023