zsxkib/kimi-vl-a3b-thinking
About
Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates responses that include an explicit thinking (reasoning) trace.

Example Output
"Where am I?"
Output
◁think▷Okay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game.
First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.◁/think▷You are in Cyberpunk 2077, a role-playing video game set in a dystopian future version of Night City. The image shows you inside The Heist, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>
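As the example shows, the reply wraps the model's chain of thought in ◁think▷ … ◁/think▷ markers and may end with the <|im_end|> stop token. If you only want the final answer, a small post-processing step like the sketch below works; this is an illustration rather than part of the model's API, and the `split_thinking` helper name is made up here.

```python
import re

def split_thinking(output_text: str):
    """Separate the ◁think▷...◁/think▷ reasoning block from the final answer.

    Assumes the delimiters shown in the example output above; adjust them if
    the model version you run emits different markers.
    """
    # Drop a trailing <|im_end|> stop token if the raw output still contains it.
    text = output_text.replace("<|im_end|>", "").strip()
    match = re.search(r"◁think▷(.*?)◁/think▷", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block found; treat the whole output as the answer.
        return None, text
    thinking = match.group(1).strip()
    answer = text[match.end():].strip()
    return thinking, answer

# Usage:
# thinking, answer = split_thinking(output)
# print(answer)
```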
Performance Metrics
All Input Parameters
{ "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg", "top_p": 1, "prompt": "Where am I?", "temperature": 0.6, "max_length_tokens": 2048 }
Input Parameters
- image: Optional image input
- top_p: Top-p sampling probability
- prompt (required): Text prompt for the model
- temperature: Sampling temperature
- max_length_tokens: Maximum number of tokens to generate
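A minimal way to call the model with these parameters from Python is the official `replicate` client, sketched below with the same inputs as the example above. The version pin comes from the Version Details section; whether you pin to that exact version is up to you, and the sketch assumes the output is a single string, as in the example output.

```python
import replicate  # requires REPLICATE_API_TOKEN in the environment

# Pin the model to the version listed under Version Details below.
model = (
    "zsxkib/kimi-vl-a3b-thinking:"
    "c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d"
)

output = replicate.run(
    model,
    input={
        "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
        "prompt": "Where am I?",
        "top_p": 1,
        "temperature": 0.6,
        "max_length_tokens": 2048,
    },
)

print(output)  # one string: the ◁think▷ block followed by the final answer
```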
Output Schema
A single string containing the full response: the ◁think▷ reasoning block followed by the final answer.
Example Execution Logs
Loaded image: /tmp/tmplw0gw75ldemo1.jpeg
Image token added to prompt and image object included.
Processing conversation and image (if any)...
Stop words: ['<|im_end|>']
Starting generation...
conversations = [{'role': 'user', 'content': '<image>\nWhere am I?', 'images': [<PIL.Image.Image image mode=RGB size=2516x1410 at 0x7BA35CF966D0>]}, {'role': 'assistant', 'content': None}]
raw text = <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|><|media_start|>image<|media_content|><|media_pad|><|media_end|><image> Where am I?<|im_end|><|im_assistant|>assistant<|im_middle|>
Generation complete.
Finished prediction.
Params: temp=0.6, top_p=1.0, max_len=2048
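The "raw text" log line above shows how the conversation is rendered into the model's chat template before generation. The sketch below reconstructs that rendering for illustration only; it is inferred from the logged string, not taken from the predictor code, which uses the model's own processor to build the prompt.

```python
# Rough reconstruction of the chat template visible in the "raw text" log line.
def render_prompt(user_text: str, has_image: bool = True,
                  system: str = "You are a helpful assistant") -> str:
    media = (
        "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"
        if has_image else ""
    )
    image_token = "<image>\n" if has_image else ""
    return (
        f"<|im_system|>system<|im_middle|>{system}<|im_end|>"
        f"<|im_user|>user<|im_middle|>{media}{image_token}{user_text}<|im_end|>"
        f"<|im_assistant|>assistant<|im_middle|>"
    )

print(render_prompt("Where am I?"))
```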
Version Details
- Version ID: c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d
- Version Created: April 17, 2025