vufinder/vggt-1b

9.2K runs | Oct 2025 | Cog 0.16.12

3d-reconstruction camera-pose-estimation depth-estimation image-to-3d video-to-3d

About

Feed-forward neural network that directly infers the key 3D attributes of a scene, including camera poses, depth, and point clouds, from input images or sampled video frames.

Performance Metrics

3.68s Prediction Time

201.71s Total Time

All Input Parameters

{
  "inputs": [
    "https://replicate.delivery/pbxt/NttVVVnd5x4l2kwGctcHSV4MmWztBlPoPeu4FxYR7d15uUH0/001.mp4"
  ],
  "to_base64": true,
  "pcd_source": "point_head",
  "return_pcd": true,
  "sampling_rate": 24,
  "keys_to_exclude": "",
  "alpha_blend_onto": "keep",
  "weighted_pose_transform": false,
  "enable_pose_postprocessing": false
}
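
The parameter set above can also be submitted programmatically. The following is a minimal sketch, assuming the third-party `replicate` Python client is installed and a `REPLICATE_API_TOKEN` environment variable is set. The helper names `build_input` and `run_prediction` are hypothetical, the example URL is a placeholder, and the extension check simply mirrors the accepted formats listed on this page.

```python
import json
import os
from urllib.parse import urlparse

# Model reference assembled from the model name and version ID on this page.
MODEL_REF = (
    "vufinder/vggt-1b:"
    "b536cad3bc18a6eb701336a6d0318a8ea9f241208cc1f84050ae43e544960f32"
)

# File extensions the page lists as accepted.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".webp")
VIDEO_EXTS = (".mp4", ".avi", ".mov")


def build_input(file_urls, sampling_rate=24, return_pcd=True,
                pcd_source="point_head", to_base64=True):
    """Hypothetical helper: check the files and assemble the input payload."""
    if not file_urls:
        raise ValueError("at least one input file is required")
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be a positive integer")
    for url in file_urls:
        # Look only at the URL path so query strings do not hide the extension.
        path = urlparse(url).path.lower()
        if not path.endswith(IMAGE_EXTS + VIDEO_EXTS):
            raise ValueError(f"unsupported file extension: {url}")
    return {
        "inputs": list(file_urls),
        "to_base64": to_base64,
        "pcd_source": pcd_source,
        "return_pcd": return_pcd,
        "sampling_rate": sampling_rate,
    }


def run_prediction(payload):
    """Submit the payload; needs network access and an API token."""
    import replicate  # third-party client, imported lazily

    return replicate.run(MODEL_REF, input=payload)


if __name__ == "__main__":
    payload = build_input(["https://example.com/clip.mp4"])  # placeholder URL
    if os.environ.get("REPLICATE_API_TOKEN"):
        print(run_prediction(payload))
    else:
        # Without a token, just show what would be sent.
        print(json.dumps(payload, indent=2))
```

Parameters left out of the payload, such as `alpha_blend_onto` or `enable_pose_postprocessing`, fall back to the defaults documented below.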

Input Parameters

inputs (required, type: array): Input files. Accepts JPG, JPEG, PNG, and WEBP files for images and MP4, AVI, and MOV files for video. Input images and sampled video frames are padded to a single aspect ratio and resized to a maximum dimension of 518 pixels.
to_base64 (type: boolean, default: true): Whether to return the arrays in the JSON files as base64 strings with shape and dtype.
pcd_source (default: point_head): Whether the point cloud is generated from the output of the point head or the depth head. Has no effect if return_pcd is false.
return_pcd (type: boolean, default: true): Whether to return a point cloud file.
sampling_rate (type: integer, default: 24): Sampling rate for video input, taken as every n-th frame. Applies only to video inputs. The first and last frames of the video are always included.
keys_to_exclude (type: string, default: empty string): Comma-separated list of keys to exclude from the output JSON files.
alpha_blend_onto (default: white): Blend mode for images with alpha channels. The 'mean' mode blends the image onto ImageNet mean RGB values. The 'keep' mode keeps the original pixel values.
weighted_pose_transform (type: boolean, default: false): Whether to apply a weighted transformation to the predicted camera poses. Applies when `enable_pose_postprocessing` is true.
enable_pose_postprocessing (type: boolean, default: false): Whether to postprocess the predicted camera poses. If true, the poses are transformed by fitting unprojected depth to world points.
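
The `sampling_rate` rule (every n-th frame, with the first and last frames always included) can be illustrated with a short sketch. The model's own frame-extraction code is not shown on this page, so the function below is one plausible reading of the description, and the name `sampled_frame_indices` is hypothetical.

```python
def sampled_frame_indices(total_frames: int, sampling_rate: int = 24) -> list[int]:
    """Return the 0-based frame indices that would be kept for a video.

    Assumption: "every n-th frame" starts counting at frame 0, and the
    last frame is added if the stride does not already land on it.
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be a positive integer")
    if total_frames <= 0:
        return []
    indices = set(range(0, total_frames, sampling_rate))
    indices.add(total_frames - 1)  # the last frame is always included
    return sorted(indices)


if __name__ == "__main__":
    # A hypothetical 180-frame clip sampled at the default rate.
    print(sampled_frame_indices(180, 24))
```

Under this reading, a hypothetical 180-frame clip at the default rate keeps frames 0, 24, 48, 72, 96, 120, 144, 168, and 179, so a larger `sampling_rate` reduces the number of frames passed to the model.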

Example Execution Logs

2026-01-14 09:26:30 [INFO] predict: Received 0 valid images.
2026-01-14 09:26:30 [INFO] predict: Received 1 valid videos.
2026-01-14 09:26:30 [WARNING] predict: Some of the input files were ignored!
2026-01-14 09:26:30 [WARNING] predict: Accepted image extensions: .jpg, .jpeg, .png, .webp
2026-01-14 09:26:30 [WARNING] predict: Accepted video extensions: .mp4, .avi, .mov
2026-01-14 09:26:30 [INFO] utils.load: Extracting frames from video 1 of 1...
2026-01-14 09:26:31 [INFO] utils.load: Extracted 9 frames from video 1 of 1!
2026-01-14 09:26:31 [INFO] predict: Running inference...
2026-01-14 09:26:31 [INFO] predict: Inference done!
2026-01-14 09:26:31 [INFO] predict: Postprocessing the results...
2026-01-14 09:26:31 [INFO] utils.glb: Building GLB scene...
2026-01-14 09:26:32 [INFO] utils.glb: GLB Scene built!
2026-01-14 09:26:32 [INFO] utils.output: Dumping data to 9 JSON files...
2026-01-14 09:26:32 [INFO] utils.output: Dumped 9 JSON files!
2026-01-14 09:26:32 [INFO] predict: Postprocessed the results!
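
When `to_base64` is true, the arrays in the dumped JSON files are stored as base64 strings together with their shape and dtype. The exact JSON layout is not shown on this page, so the sketch below assumes each array is an object with the hypothetical keys `data`, `shape`, and `dtype`, and that the bytes are in C order and native byte order. It uses NumPy, and `encode_array` exists only to make the round trip demonstrable.

```python
import base64

import numpy as np


def decode_array(entry: dict) -> np.ndarray:
    """Rebuild an array from an assumed {"data", "shape", "dtype"} entry."""
    raw = base64.b64decode(entry["data"])
    flat = np.frombuffer(raw, dtype=np.dtype(entry["dtype"]))
    return flat.reshape(entry["shape"])


def encode_array(arr: np.ndarray) -> dict:
    """Inverse of decode_array, used here only for the demonstration."""
    arr = np.ascontiguousarray(arr)
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "shape": list(arr.shape),
        "dtype": str(arr.dtype),
    }


if __name__ == "__main__":
    # Stand-in for a small depth map; real outputs come from the JSON files.
    depth = np.arange(6, dtype=np.float32).reshape(2, 3)
    restored = decode_array(encode_array(depth))
    print(restored.shape, restored.dtype)
```

Arrays returned by `np.frombuffer` are read-only views of the decoded bytes; call `.copy()` on the result if it needs to be modified.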

Version Details

Version ID: b536cad3bc18a6eb701336a6d0318a8ea9f241208cc1f84050ae43e544960f32
Version Created: February 27, 2026
