musicly-ai/singing_voice_conversion

▶️ 579 runs 📅 Dec 2023 ⚙️ Cog 0.8.6 🔗 GitHub ⚖️ License
audio-to-audio voice-conversion

About

This is the Replicate version of singing_voice_conversion from Amphion.

Example Output

Performance Metrics

35.15s Prediction Time
59.18s Total Time
All Input Parameters
{
  "target_singer": "Adele",
  "source_audio_path": "https://replicate.delivery/pbxt/K82snUEMoacRPUe9mkP5jRWZE2Z6tZcMy82T1fnXOCa4TXgU/female_vocal.wav"
}
Input Parameters
target_singer (optional) Default: Adele
The singer whose voice the source audio is converted to.
source_audio_path (required) Type: string
Path or URL of the source audio file to convert.
Output Schema

Output

Type: string, Format: uri
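
The input parameters and output schema above map onto a call with the Replicate Python client. The following is a minimal sketch, not the model author's code: it assumes the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment. The helper names `build_input` and `run_conversion` are hypothetical; the model reference combines the model name with the version ID listed on this page.

```python
# Minimal sketch, assuming the `replicate` client package is installed and
# REPLICATE_API_TOKEN is set. Helper names here are hypothetical.

# Model name plus the version ID shown under "Version Details".
MODEL_REF = (
    "musicly-ai/singing_voice_conversion:"
    "20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10"
)


def build_input(source_audio_path, target_singer="Adele"):
    """Build the input payload using the two documented parameters."""
    if not source_audio_path:
        # source_audio_path is marked as required on the model page.
        raise ValueError("source_audio_path is required")
    return {
        "target_singer": target_singer,
        "source_audio_path": source_audio_path,
    }


def run_conversion(source_audio_path, target_singer="Adele"):
    """Run the model and return its output (documented as a URI string)."""
    import replicate  # imported lazily so build_input works without it

    return replicate.run(
        MODEL_REF,
        input=build_input(source_audio_path, target_singer),
    )
```

Calling `run_conversion("https://replicate.delivery/pbxt/K82snUEMoacRPUe9mkP5jRWZE2Z6tZcMy82T1fnXOCa4TXgU/female_vocal.wav")` would reproduce the example input shown above. Depending on the client version, the returned value may be a plain URL string or a file-like wrapper around it.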

Example Execution Logs
the target_singer is : Adele
source_audio_path: /tmp/tmp63uozntufemale_vocal.wav
------------args_list-------------------
['--config', 'ckpts/svc/vocalist_l1_contentvec+whisper/args.json', '--acoustics_dir', 'ckpts/svc/vocalist_l1_contentvec+whisper', '--vocoder_dir', 'pretrained/bigvgan', '--target_singer', 'vocalist_l1_Adele', '--trans_key', 'autoshift', '--diffusion_inference_steps', '1000', '--source', '/tmp/', '--output_dir', 'result', '--log_level', 'debug']
------------args_list--------------
There are 1 source audios:
**********
Conversion for tmp63uozntufemale_vocal...
Prepare for meta eval data: 0.0s
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:02<00:00,  2.03s/it]
Loading F0...:   0%|          | 0/1 [00:00<?, ?it/s]
Loading F0...: 100%|██████████| 1/1 [00:00<00:00, 2794.34it/s]
Singers statistics:   0%|          | 0/1 [00:00<?, ?it/s]
Singers statistics: 100%|██████████| 1/1 [00:00<00:00, 1478.95it/s]
Prepare for acoustic features: 2.0s
Loading Whisper Model...
Using GPU...
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:04<00:00,  4.12s/it]
100%|██████████| 1/1 [00:04<00:00,  4.38s/it]
Load Contentvec Model...
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | current directory is /src
2023-12-28 18:15:30 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-28 18:15:30 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:03<00:00,  3.23s/it]
100%|██████████| 1/1 [00:03<00:00,  3.63s/it]
Prepare for content features: 16.4s
2023-12-28 18:15:35 | INFO | inference | ========================================================
2023-12-28 18:15:35 | INFO | inference | ||		New inference process started.		||
2023-12-28 18:15:35 | INFO | inference | ========================================================
2023-12-28 18:15:35 | INFO | inference |
2023-12-28 18:15:35 | DEBUG | inference | Using DEBUG logging level.
2023-12-28 18:15:35 | DEBUG | inference | Acoustic dir: ckpts/svc/vocalist_l1_contentvec+whisper
2023-12-28 18:15:35 | DEBUG | inference | Vocoder dir: pretrained/bigvgan
2023-12-28 18:15:35 | DEBUG | inference | Setting random seed done in 0.38ms
2023-12-28 18:15:35 | DEBUG | inference | Random seed: 10086
2023-12-28 18:15:35 | INFO | inference | Building dataset...
----------get_metadata: path---------
ckpts/svc/vocalist_l1_contentvec+whisper/data/tmp63uozntufemale_vocal/eval.json
----------get_metadata: path---------
2023-12-28 18:15:35 | INFO | inference | Building dataset done in 5.86ms
2023-12-28 18:15:35 | INFO | inference | Building model...
2023-12-28 18:15:35 | INFO | inference | Building model done in 275.970ms
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate...
2023-12-28 18:15:35 | INFO | inference | Initializing accelerate done in 23.438ms
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint...
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading states from ckpts/svc/vocalist_l1_contentvec+whisper/checkpoint/epoch-6852_step-0678447_loss-1.946773
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All model weights loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All optimizer states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All scheduler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All dataloader sampler states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.checkpointing | All random states loaded successfully
2023-12-28 18:15:35 | INFO | accelerate.accelerator | Loading in 0 custom states
2023-12-28 18:15:35 | INFO | inference | Loading checkpoint done in 53.397ms
2023-12-28 18:15:35 | INFO | inference | Using PNDM scheduler.
Model Init: 0.4s
Auto transposing: source f0 median = 232.4, target f0 median = 333.0, factor = 1.43
  0%|          | 0/1009 [00:00<?, ?it/s]
100%|██████████| 1009/1009 [00:05<00:00, 173.89it/s]
Synthesis audios using bigvgan vocoder...
Loading Vocoder from Weights file: /src/pretrained/bigvgan/400000.pt
For predicted mels, #sample = 1...
Model inference: 15.6s
100%|██████████| 1/1 [00:34<00:00, 34.51s/it]
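
The "Auto transposing" line in the logs above reports a factor of 1.43 for a source F0 median of 232.4 and a target F0 median of 333.0. The sketch below checks that arithmetic under the assumption that the factor is simply the ratio of the two medians, which matches the logged numbers. The conversion to semitones is an added illustration and does not appear in the logs, and the function names are hypothetical.

```python
import math


def autoshift_factor(source_f0_median, target_f0_median):
    """Pitch scaling factor, assumed here to be the ratio of F0 medians."""
    if source_f0_median <= 0 or target_f0_median <= 0:
        raise ValueError("F0 medians must be positive")
    return target_f0_median / source_f0_median


def factor_to_semitones(factor):
    """Express a frequency ratio as a shift in equal-tempered semitones."""
    return 12.0 * math.log2(factor)


# Values taken from the logged "Auto transposing" line.
factor = autoshift_factor(232.4, 333.0)
semitones = factor_to_semitones(factor)
print(f"factor = {factor:.2f}, shift = {semitones:.1f} semitones")
```

Under this assumption, the logged factor corresponds to raising the source pitch by a little over six semitones to match the target singer's typical range.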
Version Details
Version ID
20a05a5868e4f0b908abddcd46608df229e60e7fe2a74139fdd45f7a8d232e10
Version Created
December 27, 2023