ai-multimodal

7.8

153

162

Process and generate multimedia content using the Google Gemini API. The skill can analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract data from documents (PDF tables, forms, charts, diagrams, multi-page files), and generate images (text-to-image, editing, composition, refinement). Use it when working with audio or video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
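As a rough illustration of the "use when" guidance and the duration ceilings quoted in the description, the sketch below routes a file to a media category and checks its length before any API call is made. It is not part of the skill: the function names `classify_media` and `within_limit` and the `MAX_HOURS` table are hypothetical, and it assumes the file extension reflects the real content type.

```python
import mimetypes
from typing import Optional

# Duration ceilings quoted in the description, in hours.
# Images and documents have no duration limit in this sketch.
MAX_HOURS = {"audio": 9.5, "video": 6.0}


def classify_media(path: str) -> Optional[str]:
    """Return 'audio', 'video', 'image', or 'document'; None if unrecognized.

    Hypothetical helper: relies on the filename extension only.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return None
    if mime == "application/pdf":
        return "document"
    top_level = mime.split("/", 1)[0]
    return top_level if top_level in ("audio", "video", "image") else None


def within_limit(path: str, duration_hours: float = 0.0) -> bool:
    """Check a file against the stated audio/video duration ceilings."""
    kind = classify_media(path)
    if kind is None:
        return False  # unknown type: do not send it for processing
    limit = MAX_HOURS.get(kind)
    return limit is None or duration_hours <= limit
```

For example, `within_limit("clip.mp4", 6.5)` is false because the description caps video at 6 hours, while a PDF passes regardless of the duration argument.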

multimodal

Rating: 7.8

Installs

AI & LLM

Quick Review

An excellent multimodal AI skill with comprehensive coverage of Google Gemini API capabilities. The description clearly states when to use the skill across audio, video, image, document, and generation tasks. SKILL.md provides strong task knowledge, with clear quick-start examples, cost optimization guidance, and model selection criteria. The structure is well organized, with a concise overview and appropriate delegation to reference files. The skill offers significant novelty by unifying complex multimodal processing (transcription, segmentation, video analysis, PDF extraction, image generation) that would otherwise require extensive token usage and multiple API calls if handled by a CLI agent alone. One minor improvement would be to make capability boundaries more explicit for edge cases.