fofr/yue
About
Generate music with YuE-s1-7B (English, chain-of-thought model).
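Requests to this model take a JSON payload whose `input` fields match the parameters documented below. A minimal stdlib-only sketch of assembling that payload (the truncated lyrics and the exact HTTP call are illustrative assumptions; with the official `replicate` Python client you would pass the `input` dict to its run method instead):

```python
import json

# Input payload for the fofr/yue model, mirroring the parameters
# documented on this page. Lyrics must use [verse]/[chorus]/[bridge]/[outro]
# section tags; genre_description is a space-separated list of style tags.
payload = {
    "input": {
        "lyrics": "[verse]\nStaring at the sunset, colors paint the sky\n\n[chorus]\nI won't back down",
        "num_segments": 2,
        "max_new_tokens": 1500,
        "genre_description": "inspiring female uplifting pop airy vocal electronic bright vocal vocal",
    }
}

# Serialize for an HTTP POST to the predictions endpoint.
body = json.dumps(payload)
print(body[:40])
```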

Example Output
Performance Metrics
- Prediction Time: 716.23s
- Total Time: 765.83s
All Input Parameters
{
  "lyrics": "[verse]\nStaring at the sunset, colors paint the sky\nThoughts of you keep swirling, can't deny\nI know I let you down, I made mistakes\nBut I'm here to mend the heart I didn't break\n\n[chorus]\nEvery road you take, I'll be one step behind\nEvery dream you chase, I'm reaching for the light\nYou can't fight this feeling now\nI won't back down\nYou know you can't deny it now\nI won't back down\n\n[verse]\nThey might say I'm foolish, chasing after you\nBut they don't feel this love the way we do\nMy heart beats only for you, can't you see?\nI won't let you slip away from me\n\n[chorus]\nEvery road you take, I'll be one step behind\nEvery dream you chase, I'm reaching for the light\nYou can't fight this feeling now\nI won't back down\nYou know you can't deny it now\nI won't back down\n\n[bridge]\nNo, I won't back down, won't turn around\nUntil you're back where you belong\nI'll cross the oceans wide, stand by your side\nTogether we are strong\n\n[outro]\nEvery road you take, I'll be one step behind\nEvery dream you chase, love's the tie that binds\nYou can't fight this feeling now\nI won't back down",
  "num_segments": 2,
  "max_new_tokens": 1500,
  "genre_description": "inspiring female uplifting pop airy vocal electronic bright vocal vocal"
}
Input Parameters
- seed: Set a seed for reproducibility. Random by default.
- lyrics: Lyrics for music generation. Must be structured in segments with [verse], [chorus], [bridge], or [outro] tags.
- num_segments: Number of segments to generate.
- max_new_tokens: Maximum number of new tokens to generate.
- genre_description: Text containing genre tags that describe the musical style (e.g. instrumental, genre, mood, vocal timbre, vocal gender).
- quantization_stage1: Quantization setting for the Stage 1 model.
- quantization_stage2: Quantization setting for the Stage 2 model.
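The pipeline splits the lyrics on these section tags before generation (the example run below logs "Found 6 lyric segments"). A minimal sketch of that splitting step, written for illustration rather than taken from the model's source:

```python
import re

def split_lyrics(lyrics: str) -> list[tuple[str, str]]:
    """Split lyrics into (tag, text) segments on the section markers.

    A hypothetical helper showing the segment structure the model expects;
    the real pipeline's implementation may differ.
    """
    pattern = re.compile(r"\[(verse|chorus|bridge|outro)\]\n?")
    pieces = pattern.split(lyrics)
    # re.split with one capture group yields [prefix, tag, text, tag, text, ...]
    tags, texts = pieces[1::2], pieces[2::2]
    return [(tag, text.strip()) for tag, text in zip(tags, texts)]

segments = split_lyrics(
    "[verse]\nStaring at the sunset\n\n[chorus]\nI won't back down\n\n[outro]\nEvery road you take"
)
print(len(segments))               # 3
print([tag for tag, _ in segments])  # ['verse', 'chorus', 'outro']
```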
Output Schema
Example Execution Logs
Using seed: 1859145149
Starting music generation pipeline...
Parsed command line arguments
Created output directories: /src/output/stage1, /src/output/stage2
Set random seed to 1859145149
Using device: cuda:0
Loading tokenizer...
Loading Stage 1 model from m-a-p/YuE-s1-7B-anneal-en-cot...
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 8.75it/s]
Compiling model with torch.compile()...
Loading codec tools and models...
/root/.pyenv/versions/3.12.6/lib/python3.12/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`. WeightNorm.apply(module, name, dim)
Loading genre tags and lyrics...
Splitting lyrics into segments...
Found 6 lyric segments
Genre tags: inspiring female uplifting pop airy vocal electronic bright vocal vocal
Number of lyric segments: 6
Will process 2 segments
Processing segment 0/2
Processing segment 1/2
Generating tokens for segment 1...
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Processing segment 2/2
Generating tokens for segment 2...
Stage1 inference...: 100%|██████████| 3/3 [02:53<00:00, 57.75s/it]
Stage 1 generation complete
Processing Stage 1 outputs...
Processing segment 0 of 2
Processing segment 1 of 2
Saving Stage 1 outputs to:
/src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
/src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Offloading Stage 1 model from GPU...
Starting Stage 2 inference...
Loading Stage 2 model from m-a-p/YuE-s2-1B-general...
Compiling Stage 2 model...
Starting Stage 2 inference with batch size 4
Processing file 1/2: /src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Output duration: 30s, Number of batches: 5
Processing in 2 segments
Processing segment 1/2
Stage 2 generation with batch size 4
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Processing segment 2/2
Stage 2 generation with batch size 1
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Fixing invalid codes...
Saving Stage 2 output to /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Processing file 2/2: /src/output/stage1/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Output duration: 30s, Number of batches: 5
Processing in 2 segments
Processing segment 1/2
Stage 2 generation with batch size 4
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Processing segment 2/2
Stage 2 generation with batch size 1
Starting teacher forcing generation loop...
Processing frame 0/300
Processing frame 100/300
Processing frame 200/300
Fixing invalid codes...
Saving Stage 2 output to /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
100%|██████████| 2/2 [07:00<00:00, 210.06s/it]
Stage 2 outputs: ['/src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy', '/src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy']
Stage 2 DONE.
Reconstructing audio tracks...
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Saving reconstructed audio to /src/output/recons/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Saving reconstructed audio to /src/output/recons/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.mp3
Mixing tracks...
Creating mix: /src/output/recons/mix/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_vtrack.npy
Compressed shape: (8, 1500)
/src/inference/xcodec_mini_infer/vocoder.py:45: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  compressed = torch.tensor(compressed).to(f"cuda:{args.cuda_idx}")
Decoded in 0.36s (82.86x RTF)
Saved: /src/output/vocoder/stems/vtrack.mp3
Processing /src/output/stage2/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_itrack.npy
Compressed shape: (8, 1500)
Decoded in 0.02s (1879.81x RTF)
Saved: /src/output/vocoder/stems/itrack.mp3
Created mix: /src/output/vocoder/mix/inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3
Successfully created 'inspiring-female-uplifting-pop-airy-vocal-electronic-bright-vocal-vocal_tp0@93_T1@0_rp1@0_maxtk1500_1d9d7bd6-7754-4397-ad7f-32f66868171d_mixed.mp3' with matched low-frequency energy.
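The run ends by mixing the vocal and instrumental stems into a single mp3. At its core such a mix is a sample-wise sum with clipping; a stdlib-only sketch of that idea (the "matched low-frequency energy" step from the logs is deliberately omitted, and the function name is an illustrative assumption, not the pipeline's API):

```python
import array

def mix_stems(vocal: array.array, instrumental: array.array) -> array.array:
    """Sum two 16-bit PCM stems sample-by-sample, clipping to the int16 range.

    A simplified sketch of a stem-mixing step; the actual pipeline also
    matches low-frequency energy between stems, which is not modeled here.
    """
    n = min(len(vocal), len(instrumental))
    mixed = array.array("h", bytes(2 * n))
    for i in range(n):
        s = vocal[i] + instrumental[i]
        mixed[i] = max(-32768, min(32767, s))  # clamp to int16
    return mixed

v = array.array("h", [1000, 20000, -30000])
inst = array.array("h", [500, 20000, -10000])
out = mix_stems(v, inst)
print(list(out))  # [1500, 32767, -32768]
```

In practice the decoded stems would be read from the generated mp3/wav files with an audio library before mixing; this sketch only shows the arithmetic.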
Version Details
- Version ID: f45da0cfbe372eb9116e87a1e3519aceb008fd03b0d771d21fb8627bee2b4117
- Version Created: February 3, 2025