geopti/sam-audio-large

▶️ 15.1K runs 📅 Feb 2026 ⚙️ Cog 0.16.11
audio-to-audio music-source-separation

About

SAM-Audio is a foundation model for isolating any sound in an audio recording from a text description.

Example Output

Performance Metrics

Prediction Time: 3.78s
Total Time: 3.79s

All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a",
  "description": "speech",
  "span_anchors": "[]",
  "predict_spans": false,
  "output_residual": false,
  "use_span_prompting": false
}
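
The inputs above could be submitted from Python roughly as follows. This is a minimal sketch, assuming the third-party `replicate` client package is installed and the `REPLICATE_API_TOKEN` environment variable is set. `build_input` and `run_prediction` are hypothetical helper names, not part of the model or the client, and the model reference string is assembled from the model name and version ID shown on this page.

```python
import json

MODEL = "geopti/sam-audio-large"
VERSION = "d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa"


def build_input(audio_url, description="speech", span_anchors=None,
                predict_spans=False, output_residual=False,
                use_span_prompting=False):
    """Assemble the input dict (hypothetical helper).

    span_anchors is sent as a JSON string because the parameter type is string.
    """
    return {
        "audio": audio_url,
        "description": description,
        "span_anchors": json.dumps(span_anchors or []),
        "predict_spans": predict_spans,
        "output_residual": output_residual,
        "use_span_prompting": use_span_prompting,
    }


def run_prediction(payload):
    """Call the model; needs network access and an API token."""
    import replicate  # third-party client, imported lazily
    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


if __name__ == "__main__":
    payload = build_input(
        "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a"
    )
    # Print the payload only; call run_prediction(payload) to actually run it.
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes the inputs easy to inspect before spending a prediction.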

Input Parameters

audio (required) Type: string
Input audio or video file (WAV, MP3, MP4, etc.)
description Type: string, Default: speech
Text description of the sound to isolate. Use simple noun phrases like 'speech', 'man speaking', 'dog barking', 'piano', 'guitar playing', 'birds chirping'.
span_anchors Type: string, Default: []
[Only if use_span_prompting=True] Time ranges as a JSON array. Format: [['+', start_sec, end_sec], ...]. '+' means the sound is present, '-' means it is absent. Example: [['+', 2.0, 4.0]] or [['+', 1.0, 3.0], ['-', 5.0, 6.0]]
predict_spans Type: boolean, Default: false
Auto-detect the time spans where the target sound occurs. Improves quality for non-ambient sounds but is slower.
output_residual Type: boolean, Default: false
Also output the residual audio (everything except the target sound).
use_span_prompting Type: boolean, Default: false
Enable span prompting to specify the time ranges where the target sound occurs. More precise, but requires knowing the timestamps.
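
Because span_anchors is passed as a string, it may help to build it programmatically. The following is a minimal sketch with a hypothetical helper, `make_span_anchors`, that validates each triple and serializes the list. Note one assumption: the examples above are written with single quotes, whereas `json.dumps` emits standard double-quoted JSON, and whether the model accepts both forms is not confirmed here.

```python
import json


def make_span_anchors(spans):
    """Validate (sign, start_sec, end_sec) triples and serialize them.

    Hypothetical helper; the validation rules are assumptions drawn from the
    parameter description, not from the model's source code.
    """
    cleaned = []
    for sign, start, end in spans:
        if sign not in ("+", "-"):
            raise ValueError(f"sign must be '+' or '-', got {sign!r}")
        if start < 0 or end <= start:
            raise ValueError(f"invalid range: {start}..{end}")
        cleaned.append([sign, float(start), float(end)])
    return json.dumps(cleaned)


# Sound present from 1 s to 3 s, absent from 5 s to 6 s.
anchors = make_span_anchors([("+", 1.0, 3.0), ("-", 5.0, 6.0)])
span_inputs = {"use_span_prompting": True, "span_anchors": anchors}
print(span_inputs)
```

Setting use_span_prompting to true alongside the anchors matters, since the description states that span_anchors is only used in that mode.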
Output Schema

Output

Type: array, Items Type: string, Items Format: uri
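
Since the output is an array of URIs, a caller typically needs to save each item locally. The sketch below assumes the items arrive as plain URI strings, as the schema states. `plan_downloads` and `download_all` are hypothetical helpers. The order of items when output_residual is enabled is not documented on this page, so the sketch keeps the returned order and prefixes each filename with its index rather than guessing which file is the target.

```python
import os
from urllib.parse import urlparse


def plan_downloads(uris, out_dir="outputs"):
    """Pair each output URI with a local path, keeping the returned order."""
    plan = []
    for index, uri in enumerate(uris):
        # Fall back to a generic name if the URI path has no filename.
        name = os.path.basename(urlparse(uri).path) or "output"
        plan.append((uri, os.path.join(out_dir, f"{index}_{name}")))
    return plan


def download_all(plan):
    """Fetch every planned file; needs network access."""
    from urllib.request import urlretrieve
    for uri, path in plan:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urlretrieve(uri, path)
```

Separating the planning step from the download keeps the naming logic testable without network access.
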

Version Details

Version ID: d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa
Version Created: February 4, 2026