geopti/sam-audio-large
About
SAM-Audio is a foundation model for isolating any sound in an audio track using a text description.
Example Output
Performance Metrics
- Prediction Time
- 3.78s
- Total Time
- 3.79s
All Input Parameters
{
  "audio": "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a",
  "description": "speech",
  "span_anchors": "[]",
  "predict_spans": false,
  "output_residual": false,
  "use_span_prompting": false
}
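The parameters above can be assembled programmatically. The following is a minimal sketch, not the hosting platform's official client: it only builds a JSON request body and sends nothing. The `{"version": ..., "input": ...}` envelope and the helper name `build_prediction_body` are assumptions made for illustration; the version ID is the one listed under Version Details on this page.

```python
import json

# Version ID copied from the "Version Details" section of this page.
VERSION_ID = "d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa"


def build_prediction_body(audio_url, description="speech"):
    """Return a JSON request body (hypothetical helper; envelope shape is assumed)."""
    model_input = {
        "audio": audio_url,
        "description": description,
        # span_anchors is passed as a JSON *string*, as in the example above.
        "span_anchors": "[]",
        "predict_spans": False,
        "output_residual": False,
        "use_span_prompting": False,
    }
    return json.dumps({"version": VERSION_ID, "input": model_input}, indent=2)


body = build_prediction_body(
    "https://replicate.delivery/pbxt/OXLsi8a21QNpeJBXECdlqLvUyseiDyXPG7vdk0ZxiXzLnQdt/test.m4a"
)
print(body)
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the exact input before submitting it with whatever client you use.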
Input Parameters
- audio (required)
- Input audio or video file (WAV, MP3, MP4, etc.)
- description
- Text description of the sound to isolate. Use simple noun phrases like 'speech', 'man speaking', 'dog barking', 'piano', 'guitar playing', 'birds chirping'
- span_anchors
- [Only if use_span_prompting=True] Time ranges as JSON array. Format: [['+', start_sec, end_sec], ...]. '+' means sound present, '-' means absent. Example: [['+', 2.0, 4.0]] or [['+', 1.0, 3.0], ['-', 5.0, 6.0]]
- predict_spans
- Auto-detect the time spans where the target sound occurs. Improves quality for non-ambient sounds but is slower.
- output_residual
- Also output the residual audio (everything except the target sound)
- use_span_prompting
- Enable span prompting to specify time ranges where the target sound occurs. More precise but requires knowing timestamps.
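Because `span_anchors` is passed as a string, it can be convenient to build it from structured data. The sketch below is one possible approach under stated assumptions: `make_span_anchors` is a hypothetical helper, the checks (non-negative start, end after start) are not documented requirements, and it assumes the model accepts standard double-quoted JSON even though the examples above are written with single quotes.

```python
import json


def make_span_anchors(spans):
    """Serialize (sign, start_sec, end_sec) tuples into a span_anchors string.

    Hypothetical helper: the validation rules are assumptions, not
    documented behavior of the model.
    """
    cleaned = []
    for sign, start, end in spans:
        if sign not in ("+", "-"):
            raise ValueError(f"sign must be '+' or '-', got {sign!r}")
        start, end = float(start), float(end)
        if start < 0 or end <= start:
            raise ValueError(f"invalid span: {start}..{end}")
        cleaned.append([sign, start, end])
    return json.dumps(cleaned)


# Target sound present from 1 s to 3 s, absent from 5 s to 6 s.
anchors = make_span_anchors([("+", 1.0, 3.0), ("-", 5.0, 6.0)])

model_input = {
    "audio": "https://example.com/clip.wav",  # placeholder URL, not a real file
    "description": "dog barking",
    "span_anchors": anchors,
    "use_span_prompting": True,  # span_anchors only applies when this is set
}
print(model_input["span_anchors"])
```

Validating spans before submission catches malformed ranges locally instead of after a failed or wasted prediction run.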
Output Schema
Output
Version Details
- Version ID
- d8a8a4fcdcbf0bdc863f6d98cd2117ec0bc02224b576c7b98b2a009a8a1f83fa
- Version Created
- February 4, 2026