musicly-ai/singing_voice_conversion
About
This is the Replicate version of singing_voice_conversion from Amphion.
Example Output
Output
Performance Metrics
- Prediction Time: 35.15s
- Total Time: 59.18s
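As a rough cross-check, the stage timings printed in the execution logs further down this page can be added up and compared with the metrics above. The sketch below assumes the five logged stages run one after another without overlap, which the logs do not state. `parse_stage_seconds` and `LOG_EXCERPT` are illustrative names, and the excerpt is copied from the logs.

```python
import re

# Stage-timing lines copied from the execution logs on this page.
LOG_EXCERPT = """\
Prepare for meta eval data: 0.0s
Prepare for acoustic features: 2.0s
Prepare for content features: 16.4s
Model Init: 0.4s
Model inference: 15.6s
"""

# Values from the Performance Metrics section.
PREDICTION_TIME_S = 35.15
TOTAL_TIME_S = 59.18

# Matches lines of the form "<stage name>: <seconds>s".
STAGE_RE = re.compile(r"^(?P<stage>[A-Za-z][A-Za-z ]*): (?P<seconds>\d+(?:\.\d+)?)s$")


def parse_stage_seconds(log_text):
    """Return {stage name: seconds} for every stage-timing line found."""
    stages = {}
    for line in log_text.splitlines():
        match = STAGE_RE.match(line.strip())
        if match:
            stages[match.group("stage")] = float(match.group("seconds"))
    return stages


stages = parse_stage_seconds(LOG_EXCERPT)

# Assumption: stages are sequential, so their durations can simply be summed.
logged_stage_total = round(sum(stages.values()), 2)

# Time that is part of the total but not of the prediction itself.
outside_prediction = round(TOTAL_TIME_S - PREDICTION_TIME_S, 2)

print(f"Sum of logged stages: {logged_stage_total}s")
print(f"Total minus prediction time: {outside_prediction}s")
```

Under that assumption the logged stages add up to about 34.4s, close to the 34.51s shown on the final progress bar, and about 24.03s of the total time falls outside the prediction. The page does not break that remainder down further.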
All Input Parameters
{
"target_singer": "Adele",
"source_audio_path": "https://replicate.delivery/pbxt/K82snUEMoacRPUe9mkP5jRWZE2Z6tZcMy82T1fnXOCa4TXgU/female_vocal.wav"
}
Input Parameters
- target_singer: The target singer, e.g. "Adele".
- source_audio_path (required): The path to the source audio file.
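The execution logs further down show how these two inputs end up as command-line arguments: the `target_singer` value "Adele" appears as `vocalist_l1_Adele`, and the downloaded audio is referenced through its directory (`/tmp/`). The sketch below rebuilds that logged argument list. `build_args_list` is a hypothetical helper, the `vocalist_l1_` prefix rule is inferred from a single log line, and the actual wrapper code may work differently.

```python
import posixpath

# Constant values copied from the logged args_list on this page.
ACOUSTICS_DIR = "ckpts/svc/vocalist_l1_contentvec+whisper"
VOCODER_DIR = "pretrained/bigvgan"


def build_args_list(target_singer, local_audio_path):
    """Hypothetical reconstruction of the args_list printed in the logs."""
    # The log passes the directory holding the audio file, not the file itself.
    source_dir = posixpath.dirname(local_audio_path) + "/"
    return [
        "--config", f"{ACOUSTICS_DIR}/args.json",
        "--acoustics_dir", ACOUSTICS_DIR,
        "--vocoder_dir", VOCODER_DIR,
        # Assumption: the singer name is prefixed with the dataset name.
        "--target_singer", f"vocalist_l1_{target_singer}",
        "--trans_key", "autoshift",
        "--diffusion_inference_steps", "1000",
        "--source", source_dir,
        "--output_dir", "result",
        "--log_level", "debug",
    ]


args_list = build_args_list("Adele", "/tmp/tmp63uozntufemale_vocal.wav")
print(args_list)
```

Only `--target_singer` and `--source` vary with the inputs in this sketch. The remaining values, including `autoshift` and the 1000 diffusion inference steps, are fixed to what the example run logged.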
Output Schema
Output
Example Execution Logs
the target_singer is : Adele
source_audio_path: /tmp/tmp63uozntufemale_vocal.wav
------------args_list-------------------
['--config', 'ckpts/svc/vocalist_l1_contentvec+whisper/args.json', '--acoustics_dir', 'ckpts/svc/vocalist_l1_contentvec+whisper', '--vocoder_dir', 'pretrained/bigvgan', '--target_singer', 'vocalist_l1_Adele', '--trans_key', 'autoshift', '--diffusion_inference_steps', '1000', '--source', '/tmp/', '--output_dir', 'result', '--log_level', 'debug']
------------args_list--------------
There are 1 source audios:
**********
Conversion for tmp63uozntufemale_vocal...
Prepare for meta eval data: 0.0s
100%|██████████| 1/1 [00:02<00:00, 2.03s/it]
Loading F0...: 100%|██████████| 1/1 [00:00<00:00, 2794.34it/s]
Singers statistics: 100%|██████████| 1/1 [00:00<00:00, 1478.95it/s]
Prepare for acoustic features: 2.0s
Loading Whisper Model...
Using GPU...
100%|██████████| 1/1 [00:04<00:00, 4.12s/it]
100%|██████████| 1/1 [00:04<00:00, 4.38s/it]
Load Contentvec Model...
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | current directory is /src
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-28 18:15:30 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
100%|██████████| 1/1 [00:03<00:00, 3.23s/it]
100%|██████████| 1/1 [00:03<00:00, 3.63s/it]
Prepare for content features: 16.4s
2023-12-28 18:15:35 | INFO | inference | ========================================================
2023-12-28 18:15:35 | INFO | inference | || New inference process started. ||
2023-12-28 18:15:35 | INFO | inference | ========================================================
2023-12-28 18:15:35 | INFO | inference |
2023-12-28 18:15:35 | DEBUG | inference | Using DEBUG logging level.
2023-12-28 18:15:35 | DEBUG | inference | Acoustic dir: ckpts/svc/vocalist_l1_contentvec+whisper
2023-12-28 18:15:35 | DEBUG | inference | Vocoder dir: pretrained/bigvgan
2023-12-28 18:15:35 | DEBUG | inference | Setting random seed done in 0.38ms
2023-12-28 18:15:35 | DEBUG | inference | Random seed: 10086
2023-12-28 18:15:35 | INFO | inference | Building dataset...
----------get_metadata: path---------
ckpts/svc/vocalist_l1_contentvec+whisper/data/tmp63uozntufemale_vocal/eval.json
----------get_metadata: path---------
2023-12-28 18:15:35 | INFO | inference | Building dataset done in 5.86ms
2023-12-28 18:15:35 | INFO | inference | Building model...
2023-12-28 18:15:35 | INFO | inference | Building model done in 275.970ms
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate...
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate done in 23.438ms
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint...
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading states from ckpts/svc/vocalist_l1_contentvec+whisper/checkpoint/epoch-6852_step-0678447_loss-1.946773
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All model weights loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All optimizer states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All scheduler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All dataloader sampler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All random states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading in 0 custom states
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint done in 53.397ms
2023-12-28 18:15:35 | INFO | inference | Using PNDM scheduler.
Model Init: 0.4s
Auto transposing: source f0 median = 232.4, target f0 median = 333.0, factor = 1.43
100%|██████████| 1009/1009 [00:05<00:00, 173.89it/s]
Synthesis audios using bigvgan vocoder...
Loading Vocoder from Weights file: /src/pretrained/bigvgan/400000.pt
For predicted mels, #sample = 1...
Model inference: 15.6s
100%|██████████| 1/1 [00:34<00:00, 34.51s/it]
100%|██████████| 1/1 [00:34<00:00, 34.51s/it]
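The `Auto transposing` line in the logs above reports a factor of 1.43, which matches the ratio of the target F0 median to the source F0 median. The sketch below redoes that arithmetic and also expresses the ratio in semitones. The semitone conversion and the function names are additions for illustration; how the model applies the factor internally is not shown in the logs.

```python
import math


def transpose_factor(source_f0_median_hz, target_f0_median_hz):
    """Ratio by which the source pitch would be scaled to reach the target median."""
    return target_f0_median_hz / source_f0_median_hz


def factor_to_semitones(factor):
    """Convert a frequency ratio to a shift in equal-tempered semitones."""
    return 12.0 * math.log2(factor)


# Medians taken from the log line:
# "source f0 median = 232.4, target f0 median = 333.0, factor = 1.43"
factor = transpose_factor(232.4, 333.0)
semitones = factor_to_semitones(factor)

print(f"factor = {factor:.2f}")
print(f"shift  = {semitones:.2f} semitones")
```

For the example run this gives a factor of about 1.43, as logged, which corresponds to an upward shift of roughly 6.2 semitones.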
Version Details
- Version ID: 20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10
- Version Created: December 27, 2023
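To reproduce the example run programmatically, the inputs and the version ID from this page can be placed in a prediction request. The sketch below only builds the request object and does not send it. It assumes Replicate's HTTP predictions endpoint (`https://api.replicate.com/v1/predictions`) with bearer-token authentication; `build_prediction_request` and the placeholder token are illustrative names that do not come from this page.

```python
import json
import urllib.request

# Assumed endpoint of Replicate's HTTP API for creating predictions.
PREDICTIONS_URL = "https://api.replicate.com/v1/predictions"

# Version ID and example inputs copied from this page.
VERSION_ID = "20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10"
EXAMPLE_INPUT = {
    "target_singer": "Adele",
    "source_audio_path": "https://replicate.delivery/pbxt/K82snUEMoacRPUe9mkP5jRWZE2Z6tZcMy82T1fnXOCa4TXgU/female_vocal.wav",
}


def build_prediction_request(api_token, version_id, model_input):
    """Build (but do not send) a POST request that would create a prediction."""
    if "source_audio_path" not in model_input:
        # source_audio_path is the only input marked as required on this page.
        raise ValueError("source_audio_path is required")
    body = json.dumps({"version": version_id, "input": model_input}).encode("utf-8")
    return urllib.request.Request(
        PREDICTIONS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "YOUR_API_TOKEN" is a placeholder; a real token would be needed to send this.
request = build_prediction_request("YOUR_API_TOKEN", VERSION_ID, EXAMPLE_INPUT)
print(request.get_method(), request.full_url)
print(json.loads(request.data)["input"]["target_singer"])
```

Sending the request would be a separate step, for example with `urllib.request.urlopen(request)`, and requires a valid token.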