fofr/yue

2.6K runs · Feb 2025 · Cog 0.13.6 · GitHub · Paper · License
multilingual music-generation text-to-music

About

Generate music with YuE-s1-7B (English, chain-of-thought model).

Example Output


Performance Metrics

716.23s Prediction Time
765.83s Total Time
All Input Parameters
{
  "lyrics": "[verse]\nStaring at the sunset, colors paint the sky\nThoughts of you keep swirling, can't deny\nI know I let you down, I made mistakes\nBut I'm here to mend the heart I didn't break\n\n[chorus]\nEvery road you take, I'll be one step behind\nEvery dream you chase, I'm reaching for the light\nYou can't fight this feeling now\nI won't back down\nYou know you can't deny it now\nI won't back down\n\n[verse]\nThey might say I'm foolish, chasing after you\nBut they don't feel this love the way we do\nMy heart beats only for you, can't you see?\nI won't let you slip away from me\n\n[chorus]\nEvery road you take, I'll be one step behind\nEvery dream you chase, I'm reaching for the light\nYou can't fight this feeling now\nI won't back down\nYou know you can't deny it now\nI won't back down\n\n[bridge]\nNo, I won't back down, won't turn around\nUntil you're back where you belong\nI'll cross the oceans wide, stand by your side\nTogether we are strong\n\n[outro]\nEvery road you take, I'll be one step behind\nEvery dream you chase, love's the tie that binds\nYou can't fight this feeling now\nI won't back down",
  "num_segments": 2,
  "max_new_tokens": 1500,
  "genre_description": "inspiring female uplifting pop airy vocal electronic bright vocal vocal"
}
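The `lyrics` value in this example is organised into tagged sections ([verse], [chorus], [bridge], [outro]). The following is a minimal sketch of how such a string could be split into sections before submitting it. The function name `split_lyrics` and the regular expression are illustrative assumptions, not the model's own parsing code.

```python
import re

# Assumption: a section tag is a line holding only a bracketed word, e.g. "[verse]".
SECTION_TAG = re.compile(r"^\[(\w+)\][ \t]*$", re.MULTILINE)


def split_lyrics(lyrics: str) -> list[tuple[str, str]]:
    """Hypothetical helper: return (tag, text) pairs in order of appearance."""
    # re.split with one capture group yields:
    # [text before first tag, tag1, body1, tag2, body2, ...]
    parts = SECTION_TAG.split(lyrics)
    return [(tag, body.strip()) for tag, body in zip(parts[1::2], parts[2::2])]


if __name__ == "__main__":
    demo = "[verse]\nOh yeah, oh yeah\n\n[chorus]\nOh yeah, oh yeah"
    for tag, text in split_lyrics(demo):
        print(tag, "->", text)
```

Counting the returned pairs gives a quick check that the number of sections is at least the `num_segments` value you plan to request.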
Input Parameters
seed Type: integer
Set a seed for reproducibility. Random by default.
lyrics Type: string, Default: [verse] Oh yeah, oh yeah, oh yeah [chorus] Oh yeah, oh yeah, oh yeah
Lyrics for music generation. Must be structured in segments with [verse], [chorus], [bridge], or [outro] tags
num_segments Type: integer, Default: 2, Range: 1 - 10
Number of segments to generate
max_new_tokens Type: integer, Default: 1500, Range: 500 - 3000
Maximum number of new tokens to generate
genre_description Type: string, Default: inspiring female uplifting pop airy vocal electronic bright vocal vocal
Text containing genre tags that describe the musical style (e.g. instrumental, genre, mood, vocal timbre, vocal gender)
quantization_stage1 Default: bf16
Quantization stage 1
quantization_stage2 Default: bf16
Quantization stage 2
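The ranges and defaults listed above can be checked locally before a prediction is submitted. This is a minimal sketch under stated assumptions: `build_input` is a hypothetical helper, and the quantization parameters are left out because their allowed values are not listed here.

```python
from typing import Any, Optional

# Defaults and ranges copied from the parameter list above.
DEFAULT_GENRE = "inspiring female uplifting pop airy vocal electronic bright vocal vocal"
LIMITS = {"num_segments": (1, 10), "max_new_tokens": (500, 3000)}


def build_input(
    lyrics: str,
    genre_description: str = DEFAULT_GENRE,
    num_segments: int = 2,
    max_new_tokens: int = 1500,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate documented ranges and return an input dict."""
    values = {"num_segments": num_segments, "max_new_tokens": max_new_tokens}
    for name, value in values.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the documented range {low}-{high}")
    payload: dict[str, Any] = {
        "lyrics": lyrics,
        "genre_description": genre_description,
        **values,
    }
    if seed is not None:  # leave the seed out to keep the random default
        payload["seed"] = seed
    return payload
```

The resulting dictionary has the same shape as the "All Input Parameters" example and could be passed as the `input` argument of `replicate.run` in the Replicate Python client, which requires an API token and is not exercised in this sketch.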
Output Schema

Output

Type: array, Items Type: string, Items Format: uri
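The schema describes the output as an array of URI strings. As a rough sketch, a caller might derive local file names from such a list as shown below. The helper name `output_filenames` and the example URL are hypothetical, and client libraries may wrap outputs in their own types instead of plain strings.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_filenames(output: list[str]) -> list[str]:
    """Hypothetical helper: take the last path component of each output URI."""
    names = []
    for uri in output:
        parsed = urlparse(uri)
        if not parsed.scheme:
            raise ValueError(f"not a URI: {uri!r}")
        names.append(PurePosixPath(parsed.path).name)
    return names


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    print(output_filenames(["https://example.com/files/abc/mixed.mp3"]))
```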

Example Execution Logs
Using seed: 1859145149
Starting music generation pipeline...
Parsed command line arguments
Created output directories: /src/output/stage1, /src/output/stage2
Set random seed to 1859145149
Using device: cuda:0
Loading tokenizer...
Loading Stage 1 model from m-a-p/YuE-s1-7B-anneal-en-cot...
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards:  33%|███▎      | 1/3 [00:00<00:00,  7.72it/s]
Loading checkpoint shards:  67%|██████▋   | 2/3 [00:00<00:00,  8.45it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  9.00it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  8.75it/s]
Compiling model with torch.compile()...
Loading codec tools and models...
/root/.pyenv/versions/3.12.6/lib/python3.12/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
WeightNorm.apply(module, name, dim)
Loading genre tags and lyrics...
Splitting lyrics into segments...
Found 6 lyric segments
Genre tags: inspiring female uplifting pop airy vocal electronic bright vocal vocal
Number of lyric segments: 6
Will process 2 segments
Processing segment 0/2
Processing segment 1/2
Generating tokens for segment 1...
Stage1 inference...:   0%|          | 0/3 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Processing segment 2/2
Generating tokens for segment 2...
Stage1 inference...:  67%|██████▋   | 2/3 [01:27<00:43, 43.50s/it]
Stage1 inference...: 100%|██████████| 3/3 [02:53<00:00, 61.31s/it]
Stage1 inference...: 100%|██████████| 3/3 [02:53<00:00, 57.75s/it]
Stage 1 generation complete
Processing Stage 1 outputs...
Processing segment 0 of 2
Processing segment 1 of 2
Saving Stage 1 outputs to:
/src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
/src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Offloading Stage 1 model from GPU...
Starting Stage 2 inference...
Loading Stage 2 model from m-a-p/YuE-s2-1B-general...
Compiling Stage 2 model...
Starting Stage 2 inference with batch size 4
Processing file 1/2: /src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Output duration: 30s, Number of batches: 5
Processing in 2 segments
Processing segment 1/2
Stage 2 generation with batch size 4
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Processing segment 2/2
Stage 2 generation with batch size 1
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Fixing invalid codes...
Saving Stage 2 output to /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
  0%|          | 0/2 [00:00<?, ?it/s]
Processing file 2/2: /src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Output duration: 30s, Number of batches: 5
Processing in 2 segments
Processing segment 1/2
Stage 2 generation with batch size 4
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Processing segment 2/2
Stage 2 generation with batch size 1
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Fixing invalid codes...
Saving Stage 2 output to /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
 50%|█████     | 1/2 [03:25<03:25, 205.97s/it]
100%|██████████| 2/2 [07:00<00:00, 210.78s/it]
100%|██████████| 2/2 [07:00<00:00, 210.06s/it]
Stage 2 outputs: ['/src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy', '/src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy']
Stage 2 DONE.
Reconstructing audio tracks...
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Saving reconstructed audio to /src/output/recons/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Saving reconstructed audio to /src/output/recons/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.mp3
Mixing tracks...
Creating mix: /src/output/recons/mix/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Compressed shape: (8, 1500)
/src/inference/xcodec_mini_infer/vocoder.py:45: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
compressed = torch.tensor(compressed).to(f"cuda:{args.cuda_idx}")
Decoded in 0.36s (82.86x RTF)
Saved: /src/output/vocoder/stems/vtrack.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Compressed shape: (8, 1500)
Decoded in 0.02s (1879.81x RTF)
Saved: /src/output/vocoder/stems/itrack.mp3
Created mix: /src/output/vocoder/mix/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3
Successfully created 'inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3' with matched low-frequency energy.
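The Stage 2 numbers in these logs are mutually consistent: a compressed shape of (8, 1500) for a 30s output suggests 50 frames per second, 300 frames per window corresponds to 6s, and 5 windows processed with a batch size of 4 split into groups of 4 and 1. The sketch below reproduces that arithmetic. All constants are inferred from this one log, not taken from the model's source code, and `plan_stage2` is a hypothetical name.

```python
# Constants inferred from the logs above (assumptions, not confirmed by the source code).
FRAMES_PER_SECOND = 50  # 1500 frames for "Output duration: 30s"
WINDOW_SECONDS = 6      # "Processing frame 0/300" -> 300 frames per window
MAX_BATCH_SIZE = 4      # "Starting Stage 2 inference with batch size 4"


def plan_stage2(num_frames: int) -> tuple[int, int, list[int]]:
    """Return (duration in seconds, number of windows, batch sizes per group)."""
    duration = num_frames // FRAMES_PER_SECOND
    windows = duration // WINDOW_SECONDS
    groups = []
    remaining = windows
    while remaining > 0:
        size = min(MAX_BATCH_SIZE, remaining)
        groups.append(size)
        remaining -= size
    return duration, windows, groups


if __name__ == "__main__":
    print(plan_stage2(1500))
```

For the 1500-frame tracks in this run, the plan matches the logged "Number of batches: 5", "Processing in 2 segments", and the batch sizes 4 and 1.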
Version Details
Version ID
f45da0cfbe372eb9116e87a1e3519aceb008fd03b0d771d21fb8627bee2b4117
Version Created
February 3, 2025