lucataco/singing_voice_conversion 🖼️❓🔢 → 🖼️

▶️ 1.0K runs 📅 Dec 2023 ⚙️ Cog 0.8.6 🔗 GitHub 📄 Paper ⚖️ License
audio-to-audio singing-voice-conversion voice-cloning

About

Amphion Singing Voice Conversion: DiffWaveNetSVC

Example Output

Performance Metrics

31.51s Prediction Time
156.53s Total Time
All Input Parameters
{
  "source_audio": "https://replicate.delivery/pbxt/K5coMzCs7mnhljhRVhdhN29I3RlHPkneVxrbPtyArzxvAVtI/adele.wav",
  "target_singer": "Taylor Swift",
  "key_shift_mode": 0,
  "pitch_shift_control": "Auto Shift",
  "diffusion_inference_steps": 1000
}
Input Parameters
source_audio (required) Type: string
Input source audio file
target_singer Default: Taylor Swift
Target singer to convert audio to
key_shift_mode Type: integer, Default: 0, Range: -6 to 6
Key shift values
pitch_shift_control Default: Auto Shift
Pitch shift control
diffusion_inference_steps Type: integer, Default: 1000, Range: 0 to 1000
Diffusion inference steps
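
The defaults and ranges listed above can be checked on the client side before a prediction is submitted. What follows is a minimal sketch, not part of the model's own code: `build_input` and `_check_int_range` are hypothetical helper names, the ranges are assumed to be inclusive, and the example URL is a placeholder.

```python
# Hypothetical helpers: assemble the input payload and check it against the
# documented ranges. The ranges are assumed to be inclusive at both ends.

def _check_int_range(name, value, low, high):
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_input(source_audio,
                target_singer="Taylor Swift",
                key_shift_mode=0,
                pitch_shift_control="Auto Shift",
                diffusion_inference_steps=1000):
    # source_audio is the only required parameter; it is passed as a string.
    if not isinstance(source_audio, str) or not source_audio:
        raise ValueError("source_audio is required and must be a non-empty string")
    _check_int_range("key_shift_mode", key_shift_mode, -6, 6)
    _check_int_range("diffusion_inference_steps", diffusion_inference_steps, 0, 1000)
    return {
        "source_audio": source_audio,
        "target_singer": target_singer,
        "key_shift_mode": key_shift_mode,
        "pitch_shift_control": pitch_shift_control,
        "diffusion_inference_steps": diffusion_inference_steps,
    }


# Placeholder URL for illustration only.
payload = build_input("https://example.com/vocals.wav", key_shift_mode=-2)
print(payload["key_shift_mode"])
```

The returned dictionary has the same shape as the "All Input Parameters" example, so it could be handed to whichever client is used to run the model.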
Output Schema

Output

Type: string, Format: uri

Example Execution Logs
/tmp/input_audio
vocalist_l1_TaylorSwift
autoshift
getopt: unrecognized option '--diffusion_inference_steps'
Exprimental Configuration File: ckpts/svc/vocalist_l1_contentvec+whisper/args.json
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Monotonic align not found. Please make sure you have compiled it.
There are 1 source audios:
**********
Conversion for source...
Prepare for meta eval data: 0.0s
  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:01<00:00,  1.98s/it]
100%|██████████| 1/1 [00:01<00:00,  1.98s/it]
Prepare for acoustic features: 2.0s
Prepare for content features: 0.0s
2023-12-21 22:37:31 | INFO | inference | ========================================================
2023-12-21 22:37:31 | INFO | inference | ||		New inference process started.		||
2023-12-21 22:37:31 | INFO | inference | ========================================================
2023-12-21 22:37:31 | INFO | inference |
2023-12-21 22:37:31 | DEBUG | inference | Using DEBUG logging level.
2023-12-21 22:37:31 | DEBUG | inference | Acoustic dir: ckpts/svc/vocalist_l1_contentvec+whisper
2023-12-21 22:37:31 | DEBUG | inference | Vocoder dir: pretrained/bigvgan
2023-12-21 22:37:31 | DEBUG | inference | Setting random seed done in 0.83ms
2023-12-21 22:37:31 | DEBUG | inference | Random seed: 10086
2023-12-21 22:37:31 | INFO | inference | Building dataset...
2023-12-21 22:37:31 | INFO | inference | Building dataset done in 4.60ms
2023-12-21 22:37:31 | INFO | inference | Building model...
2023-12-21 22:37:31 | INFO | inference | Building model done in 276.183ms
2023-12-21 22:37:31 | INFO | inference | Initializing accelerate...
2023-12-21 22:37:32 | INFO | inference | Initializing accelerate done in 1057.268ms
2023-12-21 22:37:32 | INFO | inference | Loading checkpoint...
2023-12-21 22:37:32 | INFO | accelerate.accelerator | Loading states from ckpts/svc/vocalist_l1_contentvec+whisper/checkpoint/epoch-6852_step-0678447_loss-1.946773
2023-12-21 22:37:32 | INFO | accelerate.checkpointing | All model weights loaded successfully
2023-12-21 22:37:32 | INFO | accelerate.checkpointing | All optimizer states loaded successfully
2023-12-21 22:37:32 | INFO | accelerate.checkpointing | All scheduler states loaded successfully
2023-12-21 22:37:32 | INFO | accelerate.checkpointing | All dataloader sampler states loaded successfully
2023-12-21 22:37:32 | INFO | accelerate.checkpointing | All random states loaded successfully
2023-12-21 22:37:32 | INFO | accelerate.accelerator | Loading in 0 custom states
2023-12-21 22:37:32 | INFO | inference | Loading checkpoint done in 106.015ms
2023-12-21 22:37:32 | INFO | inference | Using PNDM scheduler.
Model Init: 1.5s
Auto transposing: source f0 median = 372.9, target f0 median = 286.9, factor = 0.77
  0%|          | 0/1009 [00:00<?, ?it/s]
  0%|          | 1/1009 [00:02<39:02,  2.32s/it]
  2%|▏         | 20/1009 [00:02<01:26, 11.39it/s]
  4%|▍         | 39/1009 [00:02<00:38, 25.02it/s]
  6%|▌         | 58/1009 [00:02<00:23, 41.23it/s]
  8%|▊         | 77/1009 [00:02<00:15, 59.36it/s]
 10%|▉         | 96/1009 [00:02<00:11, 78.69it/s]
 11%|█▏        | 115/1009 [00:02<00:09, 98.15it/s]
 13%|█▎        | 134/1009 [00:03<00:07, 116.43it/s]
 15%|█▌        | 153/1009 [00:03<00:06, 132.68it/s]
 17%|█▋        | 172/1009 [00:03<00:05, 146.29it/s]
 19%|█▉        | 191/1009 [00:03<00:05, 157.09it/s]
 21%|██        | 210/1009 [00:03<00:04, 164.64it/s]
 23%|██▎       | 229/1009 [00:03<00:04, 170.24it/s]
 25%|██▍       | 248/1009 [00:03<00:04, 174.54it/s]
 26%|██▋       | 267/1009 [00:03<00:04, 176.83it/s]
 28%|██▊       | 286/1009 [00:03<00:04, 178.76it/s]
 30%|███       | 305/1009 [00:03<00:03, 180.21it/s]
 32%|███▏      | 324/1009 [00:04<00:03, 179.83it/s]
 34%|███▍      | 343/1009 [00:04<00:03, 179.98it/s]
 36%|███▌      | 362/1009 [00:04<00:03, 181.63it/s]
 38%|███▊      | 381/1009 [00:04<00:03, 181.04it/s]
 40%|███▉      | 400/1009 [00:04<00:03, 182.13it/s]
 42%|████▏     | 419/1009 [00:04<00:03, 182.37it/s]
 43%|████▎     | 438/1009 [00:04<00:03, 183.85it/s]
 45%|████▌     | 457/1009 [00:04<00:02, 184.98it/s]
 47%|████▋     | 476/1009 [00:04<00:02, 185.91it/s]
 49%|████▉     | 495/1009 [00:04<00:02, 186.25it/s]
 51%|█████     | 514/1009 [00:05<00:02, 187.04it/s]
 53%|█████▎    | 533/1009 [00:05<00:02, 187.69it/s]
 55%|█████▍    | 552/1009 [00:05<00:02, 188.32it/s]
 57%|█████▋    | 571/1009 [00:05<00:02, 188.13it/s]
 58%|█████▊    | 590/1009 [00:05<00:02, 188.37it/s]
 60%|██████    | 609/1009 [00:05<00:02, 188.66it/s]
 62%|██████▏   | 628/1009 [00:05<00:02, 188.88it/s]
 64%|██████▍   | 647/1009 [00:05<00:01, 188.97it/s]
 66%|██████▌   | 666/1009 [00:05<00:01, 188.77it/s]
 68%|██████▊   | 685/1009 [00:05<00:01, 188.38it/s]
 70%|██████▉   | 704/1009 [00:06<00:01, 188.63it/s]
 72%|███████▏  | 723/1009 [00:06<00:01, 188.84it/s]
 74%|███████▎  | 742/1009 [00:06<00:01, 189.15it/s]
 75%|███████▌  | 761/1009 [00:06<00:01, 188.98it/s]
 77%|███████▋  | 780/1009 [00:06<00:01, 189.16it/s]
 79%|███████▉  | 799/1009 [00:06<00:01, 186.62it/s]
 81%|████████  | 819/1009 [00:06<00:01, 188.31it/s]
 83%|████████▎ | 838/1009 [00:06<00:00, 185.22it/s]
 85%|████████▍ | 857/1009 [00:06<00:00, 186.46it/s]
 87%|████████▋ | 877/1009 [00:07<00:00, 188.15it/s]
 89%|████████▉ | 897/1009 [00:07<00:00, 188.85it/s]
 91%|█████████ | 917/1009 [00:07<00:00, 189.80it/s]
 93%|█████████▎| 937/1009 [00:07<00:00, 190.56it/s]
 95%|█████████▍| 957/1009 [00:07<00:00, 190.87it/s]
 97%|█████████▋| 977/1009 [00:07<00:00, 191.29it/s]
 99%|█████████▉| 997/1009 [00:07<00:00, 190.46it/s]
100%|██████████| 1009/1009 [00:07<00:00, 130.99it/s]
Synthesis audios using bigvgan vocoder...
Loading Vocoder from Weights file: /src/Amphion/pretrained/bigvgan/400000.pt
For predicted mels, #sample = 1...
Model inference: 14.1s
100%|██████████| 1/1 [00:17<00:00, 17.56s/it]
100%|██████████| 1/1 [00:17<00:00, 17.56s/it]
/src/Amphion/result/source/source_vocalist_l1_TaylorSwift.wav
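
The `Auto transposing` line in the logs above reports a factor of 0.77, which matches the ratio of the target f0 median to the source f0 median (286.9 / 372.9). As a rough illustration, the same ratio can be expressed in semitones with the standard 12·log2 conversion, which gives a shift of roughly 4.5 semitones downward. This sketch only reproduces the arithmetic from the logged numbers. The function names are hypothetical, and how Amphion applies the factor internally is not visible in the logs.

```python
import math


def transpose_factor(source_f0_median, target_f0_median):
    # Ratio of target to source median pitch (both in Hz).
    return target_f0_median / source_f0_median


def factor_to_semitones(factor):
    # Standard conversion: one octave (a factor of 2) equals 12 semitones.
    return 12.0 * math.log2(factor)


# Values taken from the "Auto transposing" log line.
factor = transpose_factor(372.9, 286.9)
semitones = factor_to_semitones(factor)
print(f"factor = {factor:.2f}")
print(f"shift in semitones = {semitones:.2f}")
```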
Version Details
Version ID
f29872ee3557e0186735048f1d6de98a52518ae5c49e19453b3fdaad710bdc2b
Version Created
December 21, 2023