On GMI Cloud, multimodal models are the media-focused APIs: they accept and return non-text primary artifacts such as audio waveforms, video files, images, or 3D assets. They sit alongside the LLM models, which center on language (plus some vision-language understanding). Use this page as the section overview for those media modalities, then open the category that matches your target output.
Documentation Index
Fetch the complete documentation index at: https://docs.gmicloud.ai/llms.txt
Use this file to discover all available pages before exploring further.
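As a sketch of that discovery step, the helper below fetches the index and pulls out markdown-style links. This assumes the llms.txt file uses standard `[title](url)` links; the function names here are illustrative, not part of any GMI Cloud SDK.

```python
import re
import urllib.request


def fetch_index(url: str = "https://docs.gmicloud.ai/llms.txt") -> str:
    """Download the raw llms.txt documentation index."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def list_pages(index_text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from markdown-style links in the index."""
    return re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", index_text)


# Demonstrated on a small inline sample (swap in fetch_index() for the live file):
sample = (
    "- [Audio models](https://docs.gmicloud.ai/audio)\n"
    "- [Video models](https://docs.gmicloud.ai/video)"
)
print(list_pages(sample))
```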
What is covered here
| Output | Typical tasks | Full catalog |
|---|---|---|
| Audio | TTS, voice cloning, music | Audio models |
| Video | T2V, I2V, editing, avatars | Video models |
| Image | T2I, editing, batch jobs | Image models |
Technical topics (shared across media models)
- Asynchronous jobs — Many video and image pipelines return a task you poll until completion; see tasks and artifacts in API References.
- Inputs and controls — Prompts, reference images, first/last frames, masks, and duration or resolution caps are model-specific; each model page documents the supported schema.
- Providers and tiers — Quality, latency, and cost vary by provider; marketplace and billing docs apply alongside per-model examples.
- SDKs — Video and related flows may use a dedicated SDK where one is documented under API References.
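The asynchronous-job pattern above can be sketched as a generic polling loop. The endpoint shape, the `status` and `result_url` fields, and the terminal state names below are assumptions for illustration only; consult the tasks and artifacts pages in API References for the real contract.

```python
import time
from typing import Callable

# Hypothetical terminal states; the actual API's state names may differ.
TERMINAL = {"succeeded", "failed", "cancelled"}


def poll_task(
    get_status: Callable[[str], dict],  # wraps an assumed GET /tasks/{id} call
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> dict:
    """Poll a media-generation task until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        task = get_status(task_id)
        if task.get("status") in TERMINAL:
            return task
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish within {timeout_s}s")


# Usage with a fake status fetcher standing in for the HTTP call:
_states = iter(["queued", "running", "succeeded"])
fake = lambda tid: {
    "id": tid,
    "status": next(_states),
    "result_url": "https://example.com/video.mp4",  # placeholder artifact URL
}
done = poll_task(fake, "task-123", interval_s=0.01)
print(done["status"])  # → succeeded
```

In production you would also back off the polling interval and handle transient HTTP errors inside `get_status`; injecting the fetcher as a callable keeps the loop testable without a live endpoint.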
LLM vs multimodal
- LLM pages focus on chat/completions-style usage (messages, sampling, optional tools or images in the request).
- Multimodal (media) pages focus on file- or asset-centric flows: generate or edit media, download results, and handle longer-running jobs.
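The contrast between the two flows can be sketched with illustrative request payloads. Every field and model name below is an assumption modeled on common API conventions, not GMI Cloud's documented schema; each model page defines the real one.

```python
# LLM flow: synchronous chat/completions-style request whose response
# carries the generated text inline.
chat_request = {
    "model": "example-llm",  # hypothetical model name
    "messages": [{"role": "user", "content": "Summarize this paragraph."}],
    "temperature": 0.7,
}

# Multimodal flow: asset-centric job that is created, then polled, then
# downloaded; controls like duration are model-specific.
video_job_request = {
    "model": "example-t2v",  # hypothetical model name
    "prompt": "A sailboat at sunset",
    "duration_s": 5,
    "reference_image_url": None,  # optional image-conditioned control
}

print(sorted(chat_request))
print(sorted(video_job_request))
```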