camenduru/emage

▶️ 110 runs 📅 Apr 2024 ⚙️ Cog 0.9.4 🔗 GitHub 📄 Paper ⚖️ License
animation audio-to-video co-speech-gestures gesture-generation

About

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Given a speech audio clip, EMAGE generates holistic co-speech gestures (face, upper body, hands, and lower body) and renders them as a video synchronized with the input audio.

Example Output

[Example output video: co-speech gestures rendered from the input audio]

Performance Metrics

21.50s Prediction Time
21.53s Total Time
Input Parameters
audio_path (required)
Type: string
Input audio
Output Schema

Output

Type: array
Items Type: string
Items Format: uri
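
The model can be invoked through the Replicate API. Below is a minimal sketch using the official replicate Python client; the version hash comes from the Version Details section at the bottom of this page, and speech.wav stands in for any local audio file.

import replicate

# Run camenduru/emage on a local audio file. File inputs can be passed
# as open file handles; Replicate uploads them automatically.
output = replicate.run(
    "camenduru/emage:80048634fe2d4edf63687229c69d7b868204a569e91df880ba418eb8a7485e73",
    input={"audio_path": open("speech.wav", "rb")},
)

# Per the output schema, the result is an array of URI strings
# pointing at the rendered video file(s).
for uri in output:
    print(uri)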

Example Execution Logs
2024-04-09 08:01:51.795 | INFO     | utils.other_tools_hf:print_exp_info:877 - {'a_encoder': None,
'a_fix_pre': False,
'a_pre_encoder': None,
'acc_weight': 0.0,
'additional_data': False,
'adv_weight': 20.0,
'ali_weight': 0.0,
'amsgrad': False,
'apex': False,
'asmr': 0.0,
'atcont': 0.0,
'atmr': 0.0,
'aud_prob': 1.0,
'audio_dims': 1,
'audio_f': 256,
'audio_fps': 16000,
'audio_norm': False,
'audio_rep': 'wave16k',
'audio_sr': 16000,
'batch_size': 64,
'beat_align': True,
'benchmark': True,
'cache_only': False,
'cache_path': './datasets/beat_cache/beat_smplx_en_emage_test/',
'cf': 0.0,
'ch': 1.0,
'cl': 1.0,
'clean_final_seconds': 0,
'clean_first_seconds': 0,
'config': './configs/emage_test_hf.yaml',
'csv_name': 'a2g_0',
'cu': 1.0,
'cudnn_enabled': True,
'd_lr_weight': 0.2,
'd_name': None,
'data_path': './EMAGE/test_sequences/',
'data_path_1': './EMAGE/',
'dataset': 'beat_testonly_hf',
'ddp': False,
'debug': False,
'decay_epochs': 9999,
'decay_rate': 0.1,
'decode_fusion': None,
'deterministic': True,
'disable_filtering': False,
'div_reg_weight': 0.0,
'dropout_prob': 0.3,
'e_name': 'VAESKConv',
'e_path': 'weights/AESKConv_240_100.bin',
'emo_rep': None,
'emotion_dims': 8,
'emotion_f': 0,
'epoch_stage': 0,
'epochs': 400,
'eval_model': 'motion_representation',
'f_encoder': 'null',
'f_fix_pre': False,
'f_pre_encoder': 'null',
'fac_prob': 1.0,
'facial_dims': 100,
'facial_f': 0,
'facial_fps': 15,
'facial_norm': False,
'facial_rep': 'smplxflame_30',
'fid_weight': 0.0,
'finger_net': 'original',
'freeze_wordembed': True,
'fsmr': 0.0,
'ftmr': 0.0,
'fusion_mode': 'sum',
'g_name': 'MAGE_Transformer',
'gap_weight': 0.0,
'gpus': [0],
'grad_norm': 0.99,
'hidden_size': 768,
'id_rep': 'onehot',
'input_context': 'both',
'is_train': True,
'ita_weight': 0.0,
'iwa_weight': 0.0,
'kld_aud_weight': 0.0,
'kld_fac_weight': 0.0,
'kld_weight': 0.0,
'l': 4,
'lf': 3.0,
'lh': 3.0,
'll': 3.0,
'loader_workers': 0,
'log_period': 10,
'loss_contrastive_neg_weight': 0.005,
'loss_contrastive_pos_weight': 0.2,
'loss_gan_weight': 5.0,
'loss_kld_weight': 0.1,
'loss_physical_weight': 0.0,
'loss_reg_weight': 0.05,
'loss_regression_weight': 70.0,
'lr_base': 0.0005,
'lr_min': 1e-07,
'lr_policy': 'step',
'lu': 3.0,
'm_decoder': None,
'm_encoder': 'null',
'm_fix_pre': False,
'm_pre_encoder': 'null',
'mean_pose_path': '/datasets/trinity/train/',
'model': 'emage_audio',
'momentum': 0.8,
'motion_f': 256,
'msmr': 0.0,
'mtmr': 0.0,
'multi_length_training': [1.0],
'n_layer': 1,
'n_poses': 34,
'n_pre_poses': 4,
'name': '0409_080151_emage_test_hf',
'nesterov': True,
'new_cache': True,
'no_adv_epoch': 999,
'notes': '',
'opt': 'adam',
'opt_betas': [0.5, 0.999],
'ori_joints': 'beat_smplx_joints',
'out_path': './outputs/audio2pose/',
'pos_encoding_type': 'sin',
'pos_prob': 1.0,
'pose_dims': 330,
'pose_fps': 30,
'pose_length': 64,
'pose_norm': False,
'pose_rep': 'smplxflame_30',
'pre_frames': 4,
'pre_type': 'zero',
'pretrain': False,
'project': 's2g',
'queue_size': 1024,
'random_seed': 2021,
'rec_aud_weight': 0.0,
'rec_fac_weight': 0.0,
'rec_pos_weight': 0.0,
'rec_txt_weight': 0.0,
'rec_ver_weight': 0.0,
'rec_weight': 1.0,
'render_concurrent_num': 1,
'render_tmp_img_filetype': 'bmp',
'render_video_fps': 30,
'render_video_height': 720,
'render_video_width': 1920,
'root_path': './',
'rot6d': True,
'sem_rep': None,
'sparse': 1,
'speaker_dims': 4,
'speaker_f': 0,
'speaker_id': 'onehot',
'stat': 'ts',
'std_pose_path': '/datasets/trinity/train/',
'stride': 20,
't_encoder': None,
't_fix_pre': False,
't_pre_encoder': None,
'tar_joints': 'beat_smplx_full',
'test_ckpt': './EMAGE/emage_audio_175.bin',
'test_data_path': '/datasets/trinity/test/',
'test_length': 64,
'test_period': 20,
'train_data_path': '/datasets/trinity/train/',
'train_trans': True,
'trainer': 'emage',
'training_speakers': [2],
'tsmr': 0.0,
'ttmr': 0.0,
'txt_prob': 1.0,
'use_aug': False,
'vae_codebook_size': 256,
'vae_grow': [1, 1, 2, 1],
'vae_layer': 4,
'vae_length': 240,
'vae_quantizer_lambda': 1.0,
'vae_test_dim': 330,
'vae_test_len': 32,
'vae_test_stride': 20,
'val_data_path': '/datasets/trinity/val/',
'variational': False,
'vel_weight': 0.0,
'warmup_epochs': 0,
'warmup_lr': 0.0005,
'wei_weight': 0.0,
'weight_decay': 0.0,
'word_cache': False,
'word_dims': 300,
'word_f': 0,
'word_index_num': 5793,
'word_rep': None,
'z_type': 'speaker'}
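
Everything above is the experiment configuration loaded from ./configs/emage_test_hf.yaml and printed at startup; the keys most relevant to inference are audio_sr (16000), pose_fps (30), pose_length (64) and test_ckpt (./EMAGE/emage_audio_175.bin). A minimal sketch of how such a YAML config might be read and overridden (the loader and its behaviour here are illustrative, not the repository's actual code):

import argparse
import yaml

def load_config(path="./configs/emage_test_hf.yaml", **overrides):
    # Read the YAML file into a flat dict and apply keyword overrides,
    # then expose the keys as attributes like the args object logged above.
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg.update(overrides)
    return argparse.Namespace(**cfg)

args = load_config(test_ckpt="./EMAGE/emage_audio_175.bin")
print(args.audio_sr, args.pose_fps, args.pose_length)  # 16000 30 64
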
2024-04-09 08:01:51.795 | INFO     | utils.other_tools_hf:print_exp_info:878 - # ------------ 0409_080151_emage_test_hf ----------- #
2024-04-09 08:01:51.796 | INFO     | utils.other_tools_hf:print_exp_info:879 - PyTorch version: 2.2.0+cu121
2024-04-09 08:01:51.796 | INFO     | utils.other_tools_hf:print_exp_info:880 - CUDA version: 12.1
2024-04-09 08:01:51.796 | INFO     | utils.other_tools_hf:print_exp_info:881 - 1 GPUs
2024-04-09 08:01:51.796 | INFO     | utils.other_tools_hf:print_exp_info:882 - Random Seed: 2021
/tmp/tmpvn62w1e0ash.wav
2024-04-09 08:01:51.958 | INFO     | dataloaders.beat_testonly_hf:build_cache:90 - Audio bit rate: 16000
2024-04-09 08:01:51.958 | INFO     | dataloaders.beat_testonly_hf:build_cache:91 - Reading data './EMAGE/test_sequences/'...
2024-04-09 08:01:51.958 | INFO     | dataloaders.beat_testonly_hf:build_cache:92 - Creating the dataset cache...
2024-04-09 08:01:51.960 | INFO     | dataloaders.beat_testonly_hf:cache_generation:140 - # ---- Building cache for Pose   dummy 2nd ---- #
2024-04-09 08:01:54.684 | INFO     | dataloaders.beat_testonly_hf:cache_generation:214 - # ---- Building cache for Facial dummy 2nd and Pose dummy 2nd ---- #
2024-04-09 08:01:54.686 | INFO     | dataloaders.beat_testonly_hf:cache_generation:260 - # ---- Building cache for Audio  dummy 2nd and Pose dummy 2nd ---- #
22050
(169201,)
(122777,)
1906
2024-04-09 08:01:54.689 | INFO     | dataloaders.beat_testonly_hf:_sample_from_clip:517 - audio: 7s, pose: 63s, facial: 63s
2024-04-09 08:01:54.689 | WARNING  | dataloaders.beat_testonly_hf:_sample_from_clip:521 - reduce to 7s, ignore 56s
2024-04-09 08:01:54.689 | INFO     | dataloaders.beat_testonly_hf:_sample_from_clip:544 - pose from frame 0 to 210, length 210
2024-04-09 08:01:54.689 | INFO     | dataloaders.beat_testonly_hf:_sample_from_clip:545 - 1 clips is expected with stride 210
2024-04-09 08:01:54.689 | INFO     | dataloaders.beat_testonly_hf:_sample_from_clip:549 - audio from frame 0 to 112000, length 112000
2024-04-09 08:01:54.694 | INFO     | dataloaders.beat_testonly_hf:cache_generation:478 - no. of samples: 1
2024-04-09 08:01:54.694 | INFO     | dataloaders.beat_testonly_hf:cache_generation:483 - no. of excluded samples: 0 (0.0%)
2024-04-09 08:01:54.699 | INFO     | predict:__init__:610 - Init test dataloader success
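
The clip lengths logged above follow directly from the configured rates: at pose_fps = 30 and audio_sr = 16000, the 7 s of usable audio correspond to 210 pose frames and 112000 audio samples, which is exactly what the dataloader reports. A quick sanity check:

# Sanity check of the clip lengths reported by the dataloader above.
pose_fps = 30        # 'pose_fps' in the config
audio_sr = 16000     # 'audio_sr' / 'audio_fps' in the config
clip_seconds = 7     # audio truncated to 7 s ("reduce to 7s, ignore 56s")

pose_frames = clip_seconds * pose_fps    # 210    -> "pose from frame 0 to 210"
audio_samples = clip_seconds * audio_sr  # 112000 -> "audio from frame 0 to 112000"
print(pose_frames, audio_samples)
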
2024-04-09 08:01:55.163 | INFO     | predict:__init__:623 - DataParallel(
(module): MAGE_Transformer(
(audio_pre_encoder_face): WavEncoder(
(feat_extractor): Sequential(
(0): BasicBlock(
(conv1): Conv1d(1, 64, kernel_size=(15,), stride=(5,), padding=(1600,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(1, 64, kernel_size=(15,), stride=(5,), padding=(1600,))
(1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv1d(64, 64, kernel_size=(15,), stride=(6,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(64, 64, kernel_size=(15,), stride=(6,))
(1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): BasicBlock(
(conv1): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
)
(3): BasicBlock(
(conv1): Conv1d(64, 128, kernel_size=(15,), stride=(6,))
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(64, 128, kernel_size=(15,), stride=(6,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(4): BasicBlock(
(conv1): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
)
(5): BasicBlock(
(conv1): Conv1d(128, 256, kernel_size=(15,), stride=(3,))
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(256, 256, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(128, 256, kernel_size=(15,), stride=(3,))
(1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
)
(audio_pre_encoder_body): WavEncoder(
(feat_extractor): Sequential(
(0): BasicBlock(
(conv1): Conv1d(1, 64, kernel_size=(15,), stride=(5,), padding=(1600,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(1, 64, kernel_size=(15,), stride=(5,), padding=(1600,))
(1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv1d(64, 64, kernel_size=(15,), stride=(6,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(64, 64, kernel_size=(15,), stride=(6,))
(1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(2): BasicBlock(
(conv1): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn1): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(64, 64, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
)
(3): BasicBlock(
(conv1): Conv1d(64, 128, kernel_size=(15,), stride=(6,))
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(64, 128, kernel_size=(15,), stride=(6,))
(1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(4): BasicBlock(
(conv1): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
)
(5): BasicBlock(
(conv1): Conv1d(128, 256, kernel_size=(15,), stride=(3,))
(bn1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act1): LeakyReLU(negative_slope=0.01, inplace=True)
(conv2): Conv1d(256, 256, kernel_size=(15,), stride=(1,), padding=(7,))
(bn2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act2): LeakyReLU(negative_slope=0.01, inplace=True)
(downsample): Sequential(
(0): Conv1d(128, 256, kernel_size=(15,), stride=(3,))
(1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
)
(motion_encoder): VQEncoderV6(
(main): Sequential(
(0): Conv1d(337, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): ResBlock(
(model): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
)
)
(3): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(4): LeakyReLU(negative_slope=0.2, inplace=True)
(5): ResBlock(
(model): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
)
)
(6): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(7): LeakyReLU(negative_slope=0.2, inplace=True)
(8): ResBlock(
(model): Sequential(
(0): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
)
)
)
)
(feature2face): Linear(in_features=512, out_features=768, bias=True)
(face2latent): Linear(in_features=768, out_features=256, bias=True)
(transformer_de_layer): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(face_decoder): TransformerDecoder(
(layers): ModuleList(
(0-3): 4 x TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(position_embeddings): PeriodicPositionalEncoding(
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer_en_layer): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
(motion_self_encoder): TransformerEncoder(
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
)
)
(audio_feature2motion): Linear(in_features=256, out_features=768, bias=True)
(feature2motion): Linear(in_features=256, out_features=768, bias=True)
(bodyhints_face): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(bodyhints_body): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(motion2latent_upper): MLP(
(mlp): Sequential(
(0): Linear(in_features=768, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=768, bias=True)
)
)
(motion2latent_hands): MLP(
(mlp): Sequential(
(0): Linear(in_features=768, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=768, bias=True)
)
)
(motion2latent_lower): MLP(
(mlp): Sequential(
(0): Linear(in_features=768, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=768, bias=True)
)
)
(wordhints_decoder): TransformerDecoder(
(layers): ModuleList(
(0-7): 8 x TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(upper_decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(hands_decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(lower_decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=1536, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1536, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(face_classifier): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(upper_classifier): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(hands_classifier): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(lower_classifier): MLP(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): LeakyReLU(negative_slope=0.2, inplace=True)
(2): Linear(in_features=768, out_features=256, bias=True)
)
)
(motion_down_upper): Linear(in_features=768, out_features=256, bias=True)
(motion_down_hands): Linear(in_features=768, out_features=256, bias=True)
(motion_down_lower): Linear(in_features=768, out_features=256, bias=True)
(spearker_encoder_body): Embedding(25, 768)
(spearker_encoder_face): Embedding(25, 768)
)
)
2024-04-09 08:01:55.165 | INFO     | predict:__init__:624 - init MAGE_Transformer success
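
One detail worth noting in the module printout: the two WavEncoder branches apply Conv1d strides of 5, 6, 1, 6, 1 and 3 across their six BasicBlocks, an overall temporal downsampling of 540x, so 16 kHz audio comes out at roughly 29.6 feature frames per second, close to the 30 fps pose stream it is fused with. A quick check of that factor:

# Temporal downsampling of the WavEncoder branches, with strides read
# off the Conv1d layers in the printout above.
strides = [5, 6, 1, 6, 1, 3]
factor = 1
for s in strides:
    factor *= s
print(factor)          # 540
print(16000 / factor)  # ~29.6 audio feature frames per second (~ pose_fps of 30)
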
2024-04-09 08:01:55.454 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for VAESKConv
2024-04-09 08:01:55.461 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for VAESKConv
2024-04-09 08:01:55.468 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for VAESKConv
2024-04-09 08:01:55.484 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for VAESKConv
2024-04-09 08:01:55.495 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for VAESKConv
2024-04-09 08:01:55.826 | INFO     | utils.other_tools_hf:load_checkpoints:1042 - load self-pretrained checkpoints for MAGE_Transformer
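
The lines above restore the pretrained VAESKConv evaluators (weights/AESKConv_240_100.bin) and the main MAGE_Transformer weights (./EMAGE/emage_audio_175.bin) before inference. The actual helper lives in utils.other_tools_hf; a minimal sketch of what a load like this typically looks like for a DataParallel-wrapped module (details such as the checkpoint key are assumptions):

import torch

def load_checkpoints(model, ckpt_path, name):
    # Load weights onto the CPU first, then copy them into the wrapped module.
    states = torch.load(ckpt_path, map_location="cpu")
    # Checkpoints are often saved as {"model_state": ...}; fall back to a raw state_dict.
    state_dict = states["model_state"] if "model_state" in states else states
    target = model.module if hasattr(model, "module") else model
    target.load_state_dict(state_dict, strict=False)
    print(f"load self-pretrained checkpoints for {name}")

# e.g. load_checkpoints(model, "./EMAGE/emage_audio_175.bin", "MAGE_Transformer")
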
generate_silent_videos concurrentNum=1 time=1712649716.3468359
subprocess_index=0 begin_ts=1712649717.0764742
processed 0 frames
processed 100 frames
subprocess_index=0 render=10.95 all=11.61 begin_ts=1712649717.08 render_end_ts=1712649728.02 write_end_ts=1712649728.69
ffmpeg version 5.1.4-0+deb12u1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12 (Debian 12.2.0-14)
configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared
libavutil      57. 28.100 / 57. 28.100
libavcodec     59. 37.100 / 59. 37.100
libavformat    59. 27.100 / 59. 27.100
libavdevice    59.  7.100 / 59.  7.100
libavfilter     8. 44.100 /  8. 44.100
libswscale      6.  7.100 /  6.  7.100
libswresample   4.  7.100 /  4.  7.100
libpostproc    56.  6.100 / 56.  6.100
Input #0, image2, from './outputs/audio2pose/custom/hf//999/frame_%d.bmp':
Duration: 00:00:06.00, start: 0.000000, bitrate: N/A
Stream #0:0: Video: bmp, bgr24, 1920x720, 30 fps, 30 tbr, 30 tbn
Stream mapping:
Stream #0:0 -> #0:0 (bmp (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0x55eaff5e2540] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
[libx264 @ 0x55eaff5e2540] profile High, level 4.0, 4:2:0, 8-bit
[libx264 @ 0x55eaff5e2540] 264 - core 164 r3095 baee400 - H.264/MPEG-4 AVC codec - Copyleft 2003-2022 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=6 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to './outputs/audio2pose/custom/hf//999/silence_video.mp4':
Metadata:
encoder         : Lavf59.27.100
Stream #0:0: Video: h264 (avc1 / 0x31637661), yuv420p(tv, progressive), 1920x720, q=2-31, 30 fps, 15360 tbn
Metadata:
encoder         : Lavc59.37.100 libx264
Side data:
cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
frame=    1 fps=0.0 q=0.0 size=       0kB time=00:00:00.00 bitrate=N/A speed=N/A
frame=   78 fps=0.0 q=29.0 size=       0kB time=00:00:00.83 bitrate=   0.5kbits/s speed=1.66x
frame=  142 fps=140 q=29.0 size=     256kB time=00:00:02.96 bitrate= 707.0kbits/s speed=2.93x
frame=  180 fps=115 q=-1.0 Lsize=     544kB time=00:00:05.90 bitrate= 755.9kbits/s speed=3.78x
video:541kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.543236%
[libx264 @ 0x55eaff5e2540] frame I:1     Avg QP:23.34  size: 19447
[libx264 @ 0x55eaff5e2540] frame P:45    Avg QP:24.95  size:  6561
[libx264 @ 0x55eaff5e2540] frame B:134   Avg QP:29.79  size:  1784
[libx264 @ 0x55eaff5e2540] consecutive B-frames:  0.6%  0.0%  1.7% 97.8%
[libx264 @ 0x55eaff5e2540] mb I  I16..4: 10.2% 81.3%  8.5%
[libx264 @ 0x55eaff5e2540] mb P  I16..4:  0.4%  1.3%  0.5%  P16..4: 10.8%  3.7%  1.7%  0.0%  0.0%    skip:81.5%
[libx264 @ 0x55eaff5e2540] mb B  I16..4:  0.0%  0.1%  0.0%  B16..8:  7.2%  1.6%  0.3%  direct: 0.2%  skip:90.4%  L0:44.8% L1:45.8% BI: 9.3%
[libx264 @ 0x55eaff5e2540] 8x8 transform intra:69.7% inter:44.7%
[libx264 @ 0x55eaff5e2540] coded y,uvDC,uvAC intra: 25.7% 0.0% 0.0% inter: 1.6% 0.0% 0.0%
[libx264 @ 0x55eaff5e2540] i16 v,h,dc,p: 25% 23% 11% 41%
[libx264 @ 0x55eaff5e2540] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 45%  8% 34%  2%  2%  2%  3%  3%  2%
[libx264 @ 0x55eaff5e2540] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 35% 15% 17%  5%  5%  6%  5%  8%  4%
[libx264 @ 0x55eaff5e2540] i8c dc,h,v,p: 100%  0%  0%  0%
[libx264 @ 0x55eaff5e2540] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0x55eaff5e2540] ref P L0: 56.0%  8.9% 22.0% 13.1%
[libx264 @ 0x55eaff5e2540] ref B L0: 82.5% 13.2%  4.3%
[libx264 @ 0x55eaff5e2540] ref B L1: 95.0%  5.0%
[libx264 @ 0x55eaff5e2540] kb/s:738.35
Video conversion successful. Output file: ./outputs/audio2pose/custom/hf//999/silence_video.mp4
ffmpeg version 5.1.4-0+deb12u1 Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 12 (Debian 12.2.0-14)
configuration: --prefix=/usr --extra-version=0+deb12u1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librist --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --disable-sndio --enable-libjxl --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-libplacebo --enable-librav1e --enable-shared
libavutil      57. 28.100 / 57. 28.100
libavcodec     59. 37.100 / 59. 37.100
libavformat    59. 27.100 / 59. 27.100
libavdevice    59.  7.100 / 59.  7.100
libavfilter     8. 44.100 /  8. 44.100
libswscale      6.  7.100 /  6.  7.100
libswresample   4.  7.100 /  4.  7.100
libpostproc    56.  6.100 / 56.  6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from './outputs/audio2pose/custom/hf//999/silence_video.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf59.27.100
Duration: 00:00:06.00, start: 0.000000, bitrate: 743 kb/s
Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1920x720, 739 kb/s, 30 fps, 30 tbr, 15360 tbn (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
encoder         : Lavc59.37.100 libx264
Guessed Channel Layout for Input Stream #1.0 : stereo
Input #1, wav, from '/tmp/tmpvn62w1e0ash.wav':
Metadata:
encoder         : Lavf58.76.100
Duration: 00:00:07.67, bitrate: 1536 kb/s
Stream #1:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, stereo, s16, 1536 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Stream #1:0 -> #0:1 (pcm_s16le (native) -> aac (native))
Press [q] to stop, [?] for help
Output #0, mp4, to './outputs/audio2pose/custom/hf//999/res_2_scott_0_3_3.mp4':
Metadata:
major_brand     : isom
minor_version   : 512
compatible_brands: isomiso2avc1mp41
encoder         : Lavf59.27.100
Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1920x720, q=2-31, 739 kb/s, 30 fps, 30 tbr, 15360 tbn (default)
Metadata:
handler_name    : VideoHandler
vendor_id       : [0][0][0][0]
encoder         : Lavc59.37.100 libx264
Stream #0:1: Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 128 kb/s
Metadata:
encoder         : Lavc59.37.100 aac
frame=    0 fps=0.0 q=-1.0 size=       0kB time=00:00:00.00 bitrate=N/A speed=   0x
frame=  180 fps=0.0 q=-1.0 Lsize=     643kB time=00:00:06.01 bitrate= 876.2kbits/s speed=36.6x
video:541kB audio:94kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.230881%
[aac @ 0x564cefe44180] Qavg: 1877.850
Video with audio generated successfully: ./outputs/audio2pose/custom/hf//999/res_2_scott_0_3_3.mp4
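
The two ffmpeg passes above first encode the rendered BMP frames into a silent H.264 video and then mux that video with the input audio, re-encoded to AAC, while the video stream is copied unchanged. The exact command lines are not printed in the log; a plausible reconstruction, assuming standard libx264/aac defaults, is:

import subprocess

out_dir = "./outputs/audio2pose/custom/hf//999"
frames = f"{out_dir}/frame_%d.bmp"
silent = f"{out_dir}/silence_video.mp4"
audio  = "/tmp/tmpvn62w1e0ash.wav"
final  = f"{out_dir}/res_2_scott_0_3_3.mp4"

# Pass 1: 30 fps BMP frame sequence -> silent H.264 video.
subprocess.run(["ffmpeg", "-y", "-framerate", "30", "-i", frames,
                "-c:v", "libx264", "-pix_fmt", "yuv420p", silent], check=True)

# Pass 2: copy the video stream and add the audio track as AAC.
subprocess.run(["ffmpeg", "-y", "-i", silent, "-i", audio,
                "-c:v", "copy", "-c:a", "aac", "-shortest", final], check=True)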
Version Details
Version ID: 80048634fe2d4edf63687229c69d7b868204a569e91df880ba418eb8a7485e73
Version Created: April 9, 2024
Run on Replicate →