AI Stem Separation (Demucs)

Separates a mixed audio file into individual stems (vocals, drums, bass, other instruments) using Meta’s Demucs neural network. Useful for analyzing individual elements of a mixed file, or preparing stems for masking analysis.

Requires the phantom-audio[separation] extra. Install with: pip install "phantom-audio[separation]" (the quotes stop shells such as zsh from treating the square brackets as a glob pattern).
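After installing, a quick import check can catch a missing dependency before a long separation run. The sketch below assumes the separation extra installs the Demucs package under the import name demucs; this page does not confirm that, and the helper name separation_available is hypothetical.

```python
import importlib.util


def separation_available() -> bool:
    """Return True if a module named `demucs` can be found.

    Assumption: the [separation] extra pulls in the `demucs` package.
    """
    return importlib.util.find_spec("demucs") is not None


# Report the result without importing the (large) package itself.
if separation_available():
    print("Stem separation dependencies found.")
else:
    print('Not found. Try: pip install "phantom-audio[separation]"')
```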

Parameters

Parameter    Type     Default    Description
file_path    string   required   Path to mixed audio file
output_dir   string   required   Directory to write separated stems

Example Output

$ separate_stems full-mix.wav

Stem Separation: full-mix.wav
Model: Demucs v4 (htdemucs)
Duration: 3:42

Processing… done (18.4s)

Output files:
  ./stems/vocals.wav (24-bit WAV)
  ./stems/drums.wav (24-bit WAV)
  ./stems/bass.wav (24-bit WAV)
  ./stems/other.wav (24-bit WAV)

Quality estimate:
  Vocals: high (clear separation)
  Drums: high (clean transients)
  Bass: medium (some bleed from kick)
  Other: medium (residual content)
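For comparison, a broadly similar separation can be run with Demucs's own command-line tool. The sketch below only builds the command. It is not how Phantom invokes Demucs internally (that is not documented here), and the helper name build_demucs_command is hypothetical. The flags used (-n, -o, --int24, --two-stems) are from the Demucs CLI, and the model name and bit depth mirror the example output above.

```python
from typing import List, Optional


def build_demucs_command(file_path: str,
                         output_dir: str,
                         two_stems: Optional[str] = None) -> List[str]:
    """Build an argument list for the Demucs CLI.

    Assumption: mirrors the example output (htdemucs model, 24-bit WAV);
    Phantom's real invocation may differ.
    """
    cmd = ["demucs", "-n", "htdemucs", "--int24", "-o", output_dir]
    if two_stems is not None:
        # Split into the named stem plus everything else.
        cmd += ["--two-stems", two_stems]
    cmd.append(file_path)
    return cmd


# Full four-stem separation.
print(" ".join(build_demucs_command("full-mix.wav", "./stems")))
# Vocals plus a single accompaniment stem.
print(" ".join(build_demucs_command("song.wav", "./stems", two_stems="vocals")))

# To actually run it (requires Demucs to be installed):
#   import subprocess
#   subprocess.run(build_demucs_command("full-mix.wav", "./stems"), check=True)
```

Run directly, Demucs nests its results under the output directory by model and track name (for example ./stems/htdemucs/full-mix/vocals.wav), so the layout differs from the flat ./stems/ listing shown above. With --two-stems vocals it writes vocals.wav and no_vocals.wav.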

What the Numbers Mean

  • Quality estimate — Phantom’s confidence in the separation quality for each stem. “High” means clean isolation. “Medium” means some bleed from other sources. Bleed is more common between bass/kick and in dense arrangements.

  • Processing time — Stem separation uses neural networks and is CPU-intensive. Expect 5-20 seconds per minute of audio depending on your machine.
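The processing-time guideline is simple arithmetic, sketched below. The function names are hypothetical and the 5 and 20 seconds-per-minute rates are just the bounds quoted above, not measured values for any particular machine.

```python
from typing import Tuple


def parse_duration(text: str) -> int:
    """Convert an 'M:SS' duration string such as '3:42' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def estimate_processing_seconds(duration_seconds: float,
                                low_rate: float = 5.0,
                                high_rate: float = 20.0) -> Tuple[float, float]:
    """Return (fastest, slowest) expected processing time in seconds.

    Rates are seconds of processing per minute of audio.
    """
    minutes = duration_seconds / 60.0
    return minutes * low_rate, minutes * high_rate


fast, slow = estimate_processing_seconds(parse_duration("3:42"))
print(f"Expect roughly {fast:.1f} to {slow:.1f} seconds")
```

For the 3:42 track in the example output this gives a range of about 18.5 to 74 seconds, so the 18.4 s shown there sits at the fast end of the guideline.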

Example Prompts

Full separation

Separate my mix into stems — I want to analyze each element individually

Vocals only

Extract just the vocals from song.wav so I can analyze them

Analysis pipeline

Separate this reference track into stems, then run masking analysis between my vocals and the reference vocals

Pro tip

Stem separation quality drops when the source has been heavily compressed or limited, because that dynamics processing leaves the model less transient information to work with. For best results, use the highest-quality source available: a lossless WAV or FLAC file, ideally from before any mastering processing.
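One rough way to gauge how heavily a source has been compressed or limited is its crest factor, the ratio of peak level to RMS level. The sketch below is a generic signal-processing calculation, not something Phantom is documented to do, and the function name crest_factor_db is hypothetical. It expects samples already decoded to floats, and no particular threshold is implied.

```python
import math
from typing import Sequence


def crest_factor_db(samples: Sequence[float]) -> float:
    """Return the peak-to-RMS ratio of a signal in decibels."""
    if len(samples) == 0:
        raise ValueError("no samples")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("signal is silent")
    return 20.0 * math.log10(peak / rms)


# Sanity check on one full cycle of a sine wave (peak / RMS = sqrt(2)).
sine = [math.sin(2.0 * math.pi * k / 1000) for k in range(1000)]
print(f"Sine crest factor: {crest_factor_db(sine):.2f} dB")
```

A lower crest factor means peaks sit closer to the average level, which is what heavy limiting produces. Comparing the value for a mastered file against a pre-master of the same song gives a relative indication of how much transient information has been reduced.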