Audio models turn text into speech, clone or style voices, generate music, or edit audio. Capabilities and latency profiles differ by provider and tier.

Technical topics

  • Text-to-speech (TTS): natural-sounding speech from text, with variants tuned for quality or for speed.
  • Voice cloning: uses reference audio to match a voice's timbre or style, where supported.
  • Real-time / low-latency: models optimized for interactive or live use cases.
  • Languages & prosody: multilingual support, emotion, and pacing depend on the specific model.
  • Music generation: lyrics- or prompt-driven music, where available.

Model API & platform docs

For serving modes (serverless vs dedicated), billing, rate limits, task polling, and unified API patterns, see the API Reference section.

Model list

| Model | Model ID | Organization |
| --- | --- | --- |
| minimax-audio-voice-clone-speech-2.6-hd | minimax-audio-voice-clone-speech-2.6-hd | minimax |
| minimax-audio-voice-clone-speech-2.6-turbo | minimax-audio-voice-clone-speech-2.6-turbo | minimax |
| minimax-music-2.5 | minimax-music-2.5 | minimax |
| minimax-tts-speech-01-hd | minimax-tts-speech-01-hd | minimax |
| minimax-tts-speech-01-turbo | minimax-tts-speech-01-turbo | minimax |
| minimax-tts-speech-02-hd | minimax-tts-speech-02-hd | minimax |
| minimax-tts-speech-02-turbo | minimax-tts-speech-02-turbo | minimax |
| minimax-tts-speech-2.5-hd-preview | minimax-tts-speech-2.5-hd-preview | minimax |
| minimax-tts-speech-2.5-turbo-preview | minimax-tts-speech-2.5-turbo-preview | minimax |
| minimax-tts-speech-2.6-hd | minimax-tts-speech-2.6-hd | minimax |
| minimax-tts-speech-2.6-turbo | minimax-tts-speech-2.6-turbo | minimax |
| Realtime-tts-1.5-max | inworld-tts-1.5-max | inworld |
| Realtime-tts-1.5-mini | inworld-tts-1.5-mini | inworld |
| Realtime-tts-2 | inworld-tts-2 | inworld |
| Step-Audio-EditX | Step-Audio-EditX | stepfun-ai |
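
Many of the model IDs above end in a tier-like suffix (`hd`, `turbo`, `max`, `mini`), sometimes followed by `-preview`. The sketch below groups the listed IDs by organization and by that suffix, which can help when choosing between a quality-oriented and a speed-oriented variant. Reading the suffix as a tier is an assumption drawn from the naming pattern, not a documented rule, and the helper names `tier_of` and `group_models` are hypothetical.

```python
from collections import defaultdict

# (model ID, organization) pairs copied from the model list above.
MODELS = [
    ("minimax-audio-voice-clone-speech-2.6-hd", "minimax"),
    ("minimax-audio-voice-clone-speech-2.6-turbo", "minimax"),
    ("minimax-music-2.5", "minimax"),
    ("minimax-tts-speech-01-hd", "minimax"),
    ("minimax-tts-speech-01-turbo", "minimax"),
    ("minimax-tts-speech-02-hd", "minimax"),
    ("minimax-tts-speech-02-turbo", "minimax"),
    ("minimax-tts-speech-2.5-hd-preview", "minimax"),
    ("minimax-tts-speech-2.5-turbo-preview", "minimax"),
    ("minimax-tts-speech-2.6-hd", "minimax"),
    ("minimax-tts-speech-2.6-turbo", "minimax"),
    ("inworld-tts-1.5-max", "inworld"),
    ("inworld-tts-1.5-mini", "inworld"),
    ("inworld-tts-2", "inworld"),
    ("Step-Audio-EditX", "stepfun-ai"),
]

# Assumption: these trailing tokens mark a tier; inferred from the IDs only.
KNOWN_TIERS = {"hd", "turbo", "max", "mini"}
PREVIEW_SUFFIX = "-preview"


def tier_of(model_id):
    """Return the tier-like suffix of a model ID, or 'unspecified'."""
    name = model_id
    if name.endswith(PREVIEW_SUFFIX):
        # Ignore the preview marker so the tier token becomes the last one.
        name = name[: -len(PREVIEW_SUFFIX)]
    last_token = name.rsplit("-", 1)[-1]
    return last_token if last_token in KNOWN_TIERS else "unspecified"


def group_models(models):
    """Group model IDs under (organization, tier) keys."""
    groups = defaultdict(list)
    for model_id, organization in models:
        groups[(organization, tier_of(model_id))].append(model_id)
    return dict(groups)


if __name__ == "__main__":
    for (organization, tier), ids in sorted(group_models(MODELS).items()):
        print(f"{organization} / {tier}: {', '.join(ids)}")
```

IDs without a recognized suffix, such as `minimax-music-2.5` or `inworld-tts-2`, fall into the `unspecified` group rather than being guessed at.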