prunaai/gemma-4-26b-a4b-fast 🔢🖼️📝✓ → 📝

▶️ 15.0K runs 📅 Apr 2026 ⚙️ Cog 0.16.12

code-generation document-analysis image-to-text multilingual ocr text-generation video-to-text visual-understanding

Performance

Typical run time: 0.3s

Total runs: 15.0K

About

This is a version of the mixture-of-experts (MoE) model Gemma 4 26B, optimised by Pruna AI.

Example Output

Output

vLLM is a high-throughput, memory-efficient serving engine for Large Language Models that uses a technique called PagedAttention to manage KV cache memory, significantly accelerating inference speeds and increasing serving capacity.

Performance Metrics

0.29s Prediction Time

0.30s Total Time

All Input Parameters

{
  "top_p": 0.95,
  "images": [],
  "message": "Explain vLLM in one sentence",
  "video_fps": 2,
  "max_tokens": 2048,
  "temperature": 0.6,
  "enable_thinking": false,
  "presence_penalty": 1,
  "max_visual_tokens": 280
}
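
The parameters above map onto the `input` dictionary of a prediction request. The following is a minimal sketch, assuming the `replicate` Python client is installed and a `REPLICATE_API_TOKEN` environment variable is set. `build_input` and `run_prediction` are hypothetical helper names introduced here for illustration; the model reference combines the model name with the version ID listed under Version Details.

```python
# Defaults copied from the example request above.
DEFAULT_INPUT = {
    "top_p": 0.95,
    "images": [],
    "message": "Explain vLLM in one sentence",
    "video_fps": 2,
    "max_tokens": 2048,
    "temperature": 0.6,
    "enable_thinking": False,
    "presence_penalty": 1,
    "max_visual_tokens": 280,
}

# Optional parameters that have no default in the example request.
OPTIONAL_KEYS = {"video", "system_prompt"}

# "owner/name:version" reference built from the version ID on this page.
MODEL_REF = (
    "prunaai/gemma-4-26b-a4b-fast:"
    "007dac3717afbfb1ddade995c5fbff003a5a365660b8f8155b6bbb053eada6e4"
)


def build_input(message, **overrides):
    """Return a copy of the example input with the message and overrides applied."""
    unknown = set(overrides) - (set(DEFAULT_INPUT) | OPTIONAL_KEYS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = dict(DEFAULT_INPUT)
    payload["images"] = list(DEFAULT_INPUT["images"])  # do not share the default list
    payload["message"] = message
    payload.update(overrides)
    return payload


def run_prediction(message, **overrides):
    """Send one request. Assumes the replicate client and an API token are available."""
    import replicate  # imported lazily so build_input works without the client

    return replicate.run(MODEL_REF, input=build_input(message, **overrides))
```

For example, `build_input("Summarise this page", temperature=0.2)` keeps every default from the example request except the message and the temperature.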

Input Parameters

top_p (number, default: 0.95, range: 0 - 1): Nucleus sampling: only consider tokens with cumulative probability up to this value
video (string): Video file to send to the model
images (array, default: []): Images to send to the model
message (string, default: "Explain vLLM in one sentence"): The user message to send to the model
video_fps (number, default: 2, range: 0.1 - 30): Frames per second to sample from the video
max_tokens (integer, default: 2048, range: 1 - 16384): Maximum number of tokens to generate
temperature (number, default: 0.6, range: 0 - 2): Sampling temperature (higher = more creative, lower = more deterministic)
system_prompt (string): Optional system prompt to set the model's behavior
enable_thinking (boolean, default: false): Enable thinking mode (model reasons internally before answering)
presence_penalty (number, default: 1, range: 1 - 1.5): Presence penalty: penalize new tokens based on their presence in the text
max_visual_tokens (integer, default: 280, range: 70 - 1120): Vision token budget per image (higher = more detail, more compute). Supported: 70, 140, 280, 560, 1120
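
The documented ranges can be checked on the client before a request is sent. The sketch below is one way to do that; `check_input` is a hypothetical helper, and the page does not say whether the server rejects or clamps out-of-range values, so this check is an assumption about what is useful rather than a description of the model's behaviour.

```python
# Ranges copied from the parameter list above: name -> (minimum, maximum).
RANGES = {
    "top_p": (0, 1),
    "video_fps": (0.1, 30),
    "max_tokens": (1, 16384),
    "temperature": (0, 2),
    "presence_penalty": (1, 1.5),
    "max_visual_tokens": (70, 1120),
}

# max_visual_tokens is further restricted to these listed values.
SUPPORTED_VISUAL_TOKENS = (70, 140, 280, 560, 1120)


def check_input(payload):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in payload and not (low <= payload[name] <= high):
            problems.append(f"{name}={payload[name]} is outside {low} - {high}")
    budget = payload.get("max_visual_tokens")
    if budget is not None and budget not in SUPPORTED_VISUAL_TOKENS:
        problems.append(
            f"max_visual_tokens={budget} is not one of {SUPPORTED_VISUAL_TOKENS}"
        )
    return problems
```

Note that a value such as `max_visual_tokens=300` lies inside the 70 - 1120 range but is still reported, because only the five listed budgets are supported.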

Output Schema

Output

Type: array • Items Type: string
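
Because the output is an array of strings rather than a single string, a caller typically needs to combine the items. The sketch below assumes the items are consecutive text fragments that should be concatenated without a separator; the page does not state this, and `join_output` is a hypothetical helper name.

```python
def join_output(chunks):
    """Concatenate the string items of the output array into one reply."""
    return "".join(chunks)


# Illustrative fragments only, not real model output.
reply = join_output(["vLLM is ", "a serving engine", "."])
print(reply)
```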

Version Details

Version ID: 007dac3717afbfb1ddade995c5fbff003a5a365660b8f8155b6bbb053eada6e4
Version Created: April 8, 2026
