On GMI Cloud, multimodal models are the media-focused APIs: they accept and return non-text primary artifacts such as audio waveforms, video files, images, or 3D assets. They sit alongside the LLM models, which center on language (plus some vision-language understanding). Use this page as the section overview for those media modalities, then open the category that matches your target output.
Documentation Index
Fetch the complete documentation index at: https://docs.gmicloud.ai/llms.txt
Use this file to discover all available pages before exploring further.
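As a sketch of that discovery step, the helper below fetches the index and pulls out markdown-style links. This assumes the llms.txt file uses standard `[title](url)` links; the function names here are illustrative, not part of any GMI Cloud SDK.

```python
import re
import urllib.request


def fetch_index(url: str = "https://docs.gmicloud.ai/llms.txt") -> str:
    """Download the raw llms.txt documentation index."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def list_pages(index_text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from markdown-style links in the index."""
    return re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", index_text)


# Demonstrated on a small inline sample (swap in fetch_index() for the live file):
sample = (
    "- [Audio models](https://docs.gmicloud.ai/audio)\n"
    "- [Video models](https://docs.gmicloud.ai/video)"
)
print(list_pages(sample))
```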
What is covered here
| Output | Typical tasks | Full catalog |
|---|---|---|
| Audio | TTS, voice cloning, music | Audio models |
| Video | T2V, I2V, editing, avatars | Video models |
| Image | T2I, editing, batch jobs | Image models |
Technical topics (shared across media models)
- Asynchronous jobs — Many video and image pipelines return a task you poll until completion; see tasks and artifacts in API References.
- Inputs and controls — Prompts, reference images, first/last frames, masks, and duration or resolution caps are model-specific; each model page documents the supported schema.
- Providers and tiers — Quality, latency, and cost vary by provider; marketplace and billing docs apply alongside per-model examples.
- SDKs — Video and related flows may use a dedicated SDK where one is documented under API References.
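The asynchronous-job pattern above can be sketched as a generic polling loop. The endpoint shape, the `status` and `result_url` fields, and the terminal state names below are assumptions for illustration only; consult the tasks and artifacts pages in API References for the real contract.

```python
import time
from typing import Callable

# Hypothetical terminal states; the actual API's state names may differ.
TERMINAL = {"succeeded", "failed", "cancelled"}


def poll_task(
    get_status: Callable[[str], dict],  # wraps an assumed GET /tasks/{id} call
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
) -> dict:
    """Poll a media-generation task until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        task = get_status(task_id)
        if task.get("status") in TERMINAL:
            return task
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish within {timeout_s}s")


# Usage with a fake status fetcher standing in for the HTTP call:
_states = iter(["queued", "running", "succeeded"])
fake = lambda tid: {
    "id": tid,
    "status": next(_states),
    "result_url": "https://example.com/video.mp4",  # placeholder artifact URL
}
done = poll_task(fake, "task-123", interval_s=0.01)
print(done["status"])  # → succeeded
```

In production you would also back off the polling interval and handle transient HTTP errors inside `get_status`; injecting the fetcher as a callable keeps the loop testable without a live endpoint.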
LLM vs multimodal
- LLM pages focus on chat/completions-style usage (messages, sampling, optional tools or images in the request).
- Multimodal (media) pages focus on file- or asset-centric flows: generate or edit media, download results, and handle longer-running jobs.
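The contrast between the two flows can be sketched with illustrative request payloads. Every field and model name below is an assumption modeled on common API conventions, not GMI Cloud's documented schema; each model page defines the real one.

```python
# LLM flow: synchronous chat/completions-style request whose response
# carries the generated text inline.
chat_request = {
    "model": "example-llm",  # hypothetical model name
    "messages": [{"role": "user", "content": "Summarize this paragraph."}],
    "temperature": 0.7,
}

# Multimodal flow: asset-centric job that is created, then polled, then
# downloaded; controls like duration are model-specific.
video_job_request = {
    "model": "example-t2v",  # hypothetical model name
    "prompt": "A sailboat at sunset",
    "duration_s": 5,
    "reference_image_url": None,  # optional image-conditioned control
}

print(sorted(chat_request))
print(sorted(video_job_request))
```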