On GMI Cloud, multimodal models are the media-focused APIs: their primary inputs and outputs are non-text artifacts such as audio waveforms, video files, images, or 3D assets. They sit alongside LLM models, which are centered on language (and some vision-language understanding). Use this page as the section overview for those media modalities, then open the category that matches your output.

What is covered here

| Output | Role | Full catalog |
| --- | --- | --- |
| Audio | TTS, voice cloning, music | Audio models |
| Video | T2V, I2V, editing, avatars | Video models |
| Image | T2I, editing, batch jobs | Image models |

Technical topics (shared across media models)

  • Asynchronous jobs — Many video and image pipelines return a task you poll until completion; see tasks and artifacts in API References, and the polling sketch after this list.
  • Inputs and controls — Prompts, reference images, first/last frames, masks, and duration or resolution caps are model-specific; each model page documents the supported schema.
  • Providers and tiers — Quality, latency, and cost vary by provider; marketplace and billing docs apply alongside per-model examples.
  • SDKs — Video and related flows may use the dedicated SDK, where documented under API References.
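
Many of these flows reduce to "submit a job, poll a task, download the artifact." The sketch below shows the polling half of that pattern. It is only an illustration: the `GET /v1/tasks/{task_id}` path, the `status` field, and its terminal values are assumptions, not the documented schema; see tasks and artifacts in API References for the real one.

```python
import time
import requests

API_BASE = "https://api.gmicloud.ai/v1"  # hypothetical base URL, for illustration only
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

def wait_for_task(task_id: str, interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll a task endpoint until the job reaches a terminal state or we time out.

    The path and the status names below are assumptions; check the tasks and
    artifacts pages in API References for the actual schema.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(f"{API_BASE}/tasks/{task_id}", headers=HEADERS, timeout=30)
        resp.raise_for_status()
        task = resp.json()
        if task.get("status") in ("succeeded", "failed"):  # assumed terminal states
            return task
        time.sleep(interval)  # simple fixed-interval polling; backoff also works
    raise TimeoutError(f"Task {task_id} did not finish within {timeout}s")
```

A fixed interval is fine for minute-scale media jobs; exponential backoff reduces request volume if you poll many tasks at once.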

LLM vs multimodal

  • LLM pages focus on chat/completions-style usage (messages, sampling, optional tools or images in the request).
  • Multimodal (media) pages focus on file- or asset-centric flows: generate or edit media, download results, and handle longer-running jobs.

Some LLMs are vision-language models; they remain listed under LLM models because the primary integration pattern is still the language API.
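
To make the contrast concrete, here is a minimal sketch of both flows in Python, assuming an OpenAI-style `/chat/completions` path and a hypothetical `/videos/generations` endpoint. Model IDs, paths, and field names are placeholders rather than the documented API; consult the per-model pages for the real request schemas.

```python
import requests

API_BASE = "https://api.gmicloud.ai/v1"  # hypothetical base URL, as in the sketch above
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

# Language flow: one synchronous request; messages in, a text completion out.
# The OpenAI-style path and response shape are assumptions.
chat = requests.post(
    f"{API_BASE}/chat/completions",
    headers=HEADERS,
    json={
        "model": "<llm-model-id>",
        "messages": [{"role": "user", "content": "Write a one-line product blurb."}],
    },
    timeout=60,
)
print(chat.json()["choices"][0]["message"]["content"])

# Media flow: submit a generation job, poll the returned task, download the artifact.
# The endpoint, field names, and task shape are hypothetical.
job = requests.post(
    f"{API_BASE}/videos/generations",
    headers=HEADERS,
    json={"model": "<video-model-id>", "prompt": "A drone shot over a coastline."},
    timeout=60,
)
task = wait_for_task(job.json()["task_id"])  # helper from the polling sketch above
video_bytes = requests.get(task["output"]["url"], timeout=120).content
```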