geopti/sam-audio-large 🖼️📝✓ → 🖼️

▶️ 63.0K runs 📅 Feb 2026 ⚙️ Cog 0.16.11 🔗 GitHub 📄 Paper ⚖️ License

audio-to-audio music-source-separation

Performance

3.8s Typical run time

63.0K Total runs

About

SAM-Audio is a foundation model for isolating any sound in audio using a text description.

Example Output

Output

Performance Metrics

3.78s Prediction Time

3.79s Total Time

All Input Parameters

{
  "audio": "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a",
  "description": "speech",
  "span_anchors": "[]",
  "predict_spans": false,
  "output_residual": false,
  "use_span_prompting": false
}
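The inputs above can be submitted programmatically. The following is a minimal sketch, not the model author's client code: it assumes Replicate's public HTTP endpoint for creating predictions (`https://api.replicate.com/v1/predictions`) and a token in the `REPLICATE_API_TOKEN` environment variable. The helper name `build_prediction_request` is hypothetical. The function only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import os
import urllib.request

# Version ID copied from the "Version Details" section of this page.
VERSION_ID = "d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa"

# Assumption: Replicate's public HTTP endpoint for creating predictions.
API_URL = "https://api.replicate.com/v1/predictions"


def build_prediction_request(audio_url, description="speech", token=None):
    """Build (but do not send) a POST request mirroring the example inputs."""
    token = token or os.environ.get("REPLICATE_API_TOKEN", "")
    payload = {
        "version": VERSION_ID,
        "input": {
            "audio": audio_url,
            "description": description,
            # The remaining fields repeat the defaults listed on this page.
            "span_anchors": "[]",
            "predict_spans": False,
            "output_residual": False,
            "use_span_prompting": False,
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_prediction_request(
        "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a"
    )
    # Inspect the request; send it with urllib.request.urlopen(req) when ready.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Pinning the version ID keeps results reproducible if the model is later updated.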

Input Parameters

audio (required) • Type: string • Input audio or video file (WAV, MP3, MP4, etc.)
description • Type: string • Default: speech • Text description of the sound to isolate. Use simple noun phrases such as 'speech', 'man speaking', 'dog barking', 'piano', 'guitar playing', or 'birds chirping'.
span_anchors • Type: string • Default: [] • Used only when use_span_prompting=True. Time ranges as a JSON array in the format [['+', start_sec, end_sec], ...], where '+' marks a range in which the target sound is present and '-' a range in which it is absent. Example: [['+', 2.0, 4.0]] or [['+', 1.0, 3.0], ['-', 5.0, 6.0]]
predict_spans • Type: boolean • Default: false • Auto-detect the time spans where the target sound occurs. Improves quality for non-ambient sounds but is slower.
output_residual • Type: boolean • Default: false • Also output the residual audio (everything except the target sound).
use_span_prompting • Type: boolean • Default: false • Enable span prompting to specify the time ranges where the target sound occurs. More precise, but requires knowing the timestamps.
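Because span_anchors is passed as a string rather than a list, it is easy to get the format wrong. The sketch below shows one way to build and validate that string before submitting it. The helper names `build_span_anchors` and `span_prompt_inputs` are hypothetical. Note one assumption: the examples on this page use single quotes, while `json.dumps` emits strict JSON with double quotes; whether the model accepts both forms is not stated here.

```python
import json


def build_span_anchors(spans):
    """Serialize (sign, start_sec, end_sec) tuples into a span_anchors string."""
    anchors = []
    for sign, start, end in spans:
        # '+' marks a range where the sound is present, '-' where it is absent.
        if sign not in ("+", "-"):
            raise ValueError(f"sign must be '+' or '-', got {sign!r}")
        if start < 0 or end <= start:
            raise ValueError(f"invalid time range: {start}..{end}")
        anchors.append([sign, float(start), float(end)])
    return json.dumps(anchors)


def span_prompt_inputs(audio_url, description, spans):
    """Build an input dict with span prompting switched on."""
    return {
        "audio": audio_url,
        "description": description,
        "span_anchors": build_span_anchors(spans),
        # span_anchors is documented as used only when this flag is true.
        "use_span_prompting": True,
    }


if __name__ == "__main__":
    inputs = span_prompt_inputs(
        "https://example.com/clip.wav",  # placeholder URL
        "dog barking",
        [("+", 1.0, 3.0), ("-", 5.0, 6.0)],
    )
    print(inputs["span_anchors"])
```

Validating the ranges locally avoids spending a prediction run on a malformed prompt.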

Output Schema

Output

Type: array • Items Type: string • Items Format: uri
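Since the output is an array of URI strings, a caller typically needs to check that shape and decide where each file goes. The following sketch does only that and downloads nothing. The helper name `plan_downloads` and the `separated` directory are hypothetical, and the order of items when output_residual is enabled is not documented on this page, so the code indexes files rather than labelling them as target or residual.

```python
import os
from urllib.parse import urlparse


def plan_downloads(output, dest_dir="separated"):
    """Map each output URI to a local file path, without downloading anything."""
    if not isinstance(output, list) or not all(isinstance(u, str) for u in output):
        raise TypeError("expected a list of URI strings")
    plan = []
    for index, uri in enumerate(output):
        parsed = urlparse(uri)
        if parsed.scheme not in ("http", "https"):
            raise ValueError(f"unsupported URI scheme in {uri!r}")
        # Fall back to a generic name if the URI path has no file component.
        name = os.path.basename(parsed.path) or "output"
        plan.append((uri, os.path.join(dest_dir, f"{index:02d}_{name}")))
    return plan


if __name__ == "__main__":
    # Placeholder URIs standing in for a real prediction's output.
    for uri, path in plan_downloads(["https://example.com/out/target.wav"]):
        print(uri, "->", path)
```

Each planned pair can then be fetched with `urllib.request.urlretrieve(uri, path)` after creating the destination directory.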

Version Details

Version ID: d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa
Version Created: February 4, 2026
