prunaai/gemma-4-26b-a4b-fast
About
This is a version of the Gemma 4 26B mixture-of-experts (MoE) model, optimised by Pruna AI.
Example Output
Output
vLLM is a high-throughput, memory-efficient serving engine for Large Language Models that uses a technique called PagedAttention to manage KV cache memory, significantly accelerating inference speeds and increasing serving capacity.
Performance Metrics
Prediction Time: 0.29s
Total Time: 0.30s
All Input Parameters
{
  "top_p": 0.95,
  "images": [],
  "message": "Explain vLLM in one sentence",
  "video_fps": 2,
  "max_tokens": 2048,
  "temperature": 0.6,
  "enable_thinking": false,
  "presence_penalty": 1,
  "max_visual_tokens": 280
}
Input Parameters
- top_p
  - Nucleus sampling: only consider tokens with cumulative probability up to this value
- video
  - Video file to send to the model
- images
  - Images to send to the model
- message
  - The user message to send to the model
- video_fps
  - Frames per second to sample from the video
- max_tokens
  - Maximum number of tokens to generate
- temperature
  - Sampling temperature (higher = more creative, lower = more deterministic)
- system_prompt
  - Optional system prompt to set the model's behavior
- enable_thinking
  - Enable thinking mode (model reasons internally before answering)
- presence_penalty
  - Presence penalty: penalize new tokens based on their presence in the text
- max_visual_tokens
  - Vision token budget per image (higher = more detail, more compute). Supported: 70, 140, 280, 560, 1120
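The parameters above can be assembled into a request payload before calling the model. The following is a minimal sketch, not the provider's client code: `build_input` is a hypothetical helper, its default values simply mirror the example request shown earlier (they are not confirmed schema defaults), and the range checks on `top_p`, `temperature`, and `max_tokens` are assumptions. Only the list of supported `max_visual_tokens` values comes from this page.

```python
# Vision token budgets listed as supported in the parameter descriptions.
SUPPORTED_VISUAL_TOKENS = {70, 140, 280, 560, 1120}


def build_input(message, *, images=None, video=None, system_prompt=None,
                top_p=0.95, temperature=0.6, max_tokens=2048,
                presence_penalty=1, video_fps=2, enable_thinking=False,
                max_visual_tokens=280):
    """Hypothetical helper: build an input dict from the documented parameter names.

    Defaults copy the example request on this page. The numeric range
    checks are assumptions, not documented constraints.
    """
    if not message:
        raise ValueError("message must be a non-empty string")
    if max_visual_tokens not in SUPPORTED_VISUAL_TOKENS:
        raise ValueError(
            f"max_visual_tokens must be one of {sorted(SUPPORTED_VISUAL_TOKENS)}"
        )
    if not 0 < top_p <= 1:  # assumed range for nucleus sampling
        raise ValueError("top_p must be in (0, 1]")
    if temperature < 0:  # assumed lower bound
        raise ValueError("temperature must be non-negative")
    if max_tokens < 1:  # assumed lower bound
        raise ValueError("max_tokens must be at least 1")

    payload = {
        "message": message,
        "images": list(images or []),
        "top_p": top_p,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "presence_penalty": presence_penalty,
        "video_fps": video_fps,
        "enable_thinking": enable_thinking,
        "max_visual_tokens": max_visual_tokens,
    }
    # Optional fields are only included when the caller supplies them.
    if video is not None:
        payload["video"] = video
    if system_prompt is not None:
        payload["system_prompt"] = system_prompt
    return payload


payload = build_input("Explain vLLM in one sentence")
```

If the model is served through Replicate (an assumption, since this page does not name the host), the resulting dict could be passed as the `input` argument of the Python client, for example `replicate.run("prunaai/gemma-4-26b-a4b-fast", input=payload)`, which requires an API token to be configured.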
Output Schema
Output
Version Details
- Version ID
  - 007dac3717afbfb1ddade995c5fbff003a5a365660b8f8155b6bbb053eada6e4
- Version Created
  - April 8, 2026