zsxkib/kimi-vl-a3b-thinking
About
Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates responses that include an explicit thinking (reasoning) trace.

Example Output
"Where am I?"
Output
◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game.
First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷You are in Cyberpunk 2077, a role-playing video game set in a dystopian future version of Night City. The image shows you inside The Heist, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>
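As the example shows, the reply wraps the model's chain of thought in ◁think▷ … ◁/think▷ markers and may end with the <|im_end|> stop token. If you only want the final answer, a small post-processing step like the sketch below works; this is an illustration rather than part of the model's API, and the `split_thinking` helper name is made up here.

```python
import re

def split_thinking(output_text: str):
    """Separate the ◁think▷...◁/think▷ reasoning block from the final answer.

    Assumes the delimiters shown in the example output above; adjust them if
    the model version you run emits different markers.
    """
    # Drop a trailing <|im_end|> stop token if the raw output still contains it.
    text = output_text.replace("<|im_end|>", "").strip()
    match = re.search(r"◁think▷(.*?)◁/think▷", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block found; treat the whole output as the answer.
        return None, text
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer

# Usage:
# thinking, answer = split_thinking(output)
# print(answer)
```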
Performance Metrics
All Input Parameters
{ "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg", "top_p": 1, "prompt": "Where am I?", "temperature": 0.6, "max_length_tokens": 2048 }
Input Parameters
- image: Optional image input
- top_p: Top-p sampling probability
- prompt (required): Text prompt for the model
- temperature: Sampling temperature
- max_length_tokens: Maximum number of tokens to generate
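A minimal way to call the model with these parameters from Python is the official `replicate` client, sketched below with the same inputs as the example above. The version pin comes from the Version Details section; whether you pin to that exact version is up to you, and the sketch assumes the output is a single string, as in the example output.

```python
import replicate  # requires REPLICATE_API_TOKEN in the environment

# Pin the model to the version listed under Version Details below.
model = (
    "zsxkib/kimi-vl-a3b-thinking:"
    "c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d"
)

output = replicate.run(
    model,
    input={
        "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
        "prompt": "Where am I?",
        "top_p": 1,
        "temperature": 0.6,
        "max_length_tokens": 2048,
    },
)

print(output)  # one string: the ◁think▷ block followed by the final answer
```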
Output Schema
A single string containing the full response: the ◁think▷ reasoning block followed by the final answer.
Example Execution Logs
Loaded image: /tmp/tmplw0gw75ldemo1.jpeg
Image token added to prompt and image object included.
Processing conversation and image (if any)...
Stop words: ['<|im_end|>']
Starting generation...
conversations = [{'role': 'user', 'content': '<image>\nWhere am I?', 'images': [<PIL.Image.Image image mode=RGB size=2516x1410 at 0x7BA35CF966D0>]}, {'role': 'assistant', 'content': None}]
raw text = <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|><|media_start|>image<|media_content|><|media_pad|><|media_end|><image> Where am I?<|im_end|><|im_assistant|>assistant<|im_middle|>
Generation complete.
Finished prediction.
Params: temp=0.6, top_p=1.0, max_len=2048
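The "raw text" log line above shows how the conversation is rendered into the model's chat template before generation. The sketch below reconstructs that rendering for illustration only; it is inferred from the logged string, not taken from the predictor code, which uses the model's own processor to build the prompt.

```python
# Rough reconstruction of the chat template visible in the "raw text" log line.
def render_prompt(user_text: str, has_image: bool = True,
                  system: str = "You are a helpful assistant") -> str:
    media = (
        "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"
        if has_image else ""
    )
    image_token = "<image>\n" if has_image else ""
    return (
        f"<|im_system|>system<|im_middle|>{system}<|im_end|>"
        f"<|im_user|>user<|im_middle|>{media}{image_token}{user_text}<|im_end|>"
        f"<|im_assistant|>assistant<|im_middle|>"
    )

print(render_prompt("Where am I?"))
```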
Version Details
- Version ID: c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d
- Version Created: April 17, 2025