zsxkib/kimi-vl-a3b-thinking 🖼️🔢📝 → 📝

โ–ถ๏ธ 983 runs ๐Ÿ“… Apr 2025 โš™๏ธ Cog 0.14.4 ๐Ÿ”— GitHub ๐Ÿ“„ Paper โš–๏ธ License
image-analysis image-to-text text-generation visual-understanding

About

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text that includes its thinking process.

Example Output

Prompt:

"Where am I?"

Output

โ—thinkโ–ทOkay, let's see. The user is asking "Where am I?" and provided an image of a cyberpunk bar scene. The image has a lot of neon lights, futuristic elements, and people sitting at a bar. There's a sign that says "The Heist" and some text about sitting next to Jackie. Also, there's a mini-map in the corner, which might indicate it's from a video game.

First, I need to figure out which game this is. The cyberpunk aesthetic with the bar setting and the mini-map suggests it could be Cyberpunk 2077. The Heist as a location is a known area in that game. The characters' outfits and the overall environment match the game's style. The user might be in-game, so the answer should confirm that they're in Cyberpunk 2077, specifically at The Heist bar. I should mention the game's title and the location's name. Also, note the HUD elements as part of the game's interface. Make sure to keep it concise but informative.โ—/thinkโ–ทYou are in Cyberpunk 2077, a role-playing video game set in a dystopian future version of Night City. The image shows you inside The Heist, a bar or nightclub popular in the game. The cyberpunk aesthetic, neon lighting, and HUD elements (mini-map, objectives like "Sit next to Jackie") align with the game's signature style. This location is known for its vibrant atmosphere and interactions with characters like Jackie Welles, a key figure in the early story.<|im_end|>

Performance Metrics

15.24s Prediction Time
115.81s Total Time
All Input Parameters
{
  "image": "https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
  "top_p": 1,
  "prompt": "Where am I?",
  "temperature": 0.6,
  "max_length_tokens": 2048
}
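The parameters above can be sent from Python. The following is a minimal sketch, assuming the third-party replicate client package is installed and a REPLICATE_API_TOKEN is set in the environment; the helper names build_input and run_prediction are hypothetical and not part of this model's code. The model name, version ID, and input values are taken from this page.

```python
import os

MODEL = "zsxkib/kimi-vl-a3b-thinking"
VERSION = "c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d"


def build_input(prompt, image=None, temperature=0.6, top_p=1.0, max_length_tokens=2048):
    """Assemble the input payload using the parameter names listed on this page."""
    payload = {
        "prompt": prompt,
        "top_p": top_p,
        "temperature": temperature,
        "max_length_tokens": max_length_tokens,
    }
    if image is not None:  # the image input is optional
        payload["image"] = image
    return payload


def run_prediction(payload):
    """Send the payload to Replicate (needs network access and an API token)."""
    import replicate  # imported lazily; assumed installed via `pip install replicate`

    return replicate.run(f"{MODEL}:{VERSION}", input=payload)


example_input = build_input(
    "Where am I?",
    image="https://raw.githubusercontent.com/zsxkib/cog-kimi-vl-a3b-thinking/main/images/demo1.jpeg",
)

# Only contact the API when run as a script with a token available.
if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    output = run_prediction(example_input)
    print("".join(output))  # the output schema is an array of strings
```

Keeping payload construction separate from the network call makes the inputs easy to inspect or log before a prediction is started.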
Input Parameters
image Type: string
Optional image input
top_p Type: number, Default: 1, Range: 0 - 1
Top-p sampling probability
prompt (required) Type: string
Text prompt for the model
temperature Type: number, Default: 0.6, Range: 0 - 2
Sampling temperature
max_length_tokens Type: integer, Default: 2048, Range: 1 - ∞
Maximum number of tokens to generate
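The documented types, defaults, and ranges can be checked on the client side before a request is made. This is a rough sketch only: validate_params is a hypothetical helper, and rejecting an empty prompt is an assumption (the page says only that prompt is required).

```python
def validate_params(prompt, temperature=0.6, top_p=1.0, max_length_tokens=2048):
    """Check values against the ranges listed above and return them as a dict."""
    # Assumption: "required" is treated here as "non-empty string".
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt is required and must be a non-empty string")
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must be in the range 0 - 1")
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be in the range 0 - 2")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_length_tokens, bool) or not isinstance(max_length_tokens, int):
        raise ValueError("max_length_tokens must be an integer")
    if max_length_tokens < 1:
        raise ValueError("max_length_tokens must be at least 1")
    return {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_length_tokens": max_length_tokens,
    }
```

For example, validate_params("Where am I?") returns the defaults from this page, while validate_params("Where am I?", temperature=2.5) raises a ValueError.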
Output Schema

Output

Type: array • Items Type: string
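Because the output is an array of strings, a caller usually joins the items and then separates the reasoning from the final answer. The sketch below assumes the reasoning is wrapped in ◁think▷ and ◁/think▷ markers and that the text may end with the <|im_end|> stop word, as in the example output on this page; split_output is a hypothetical helper name.

```python
THINK_OPEN = "◁think▷"
THINK_CLOSE = "◁/think▷"
STOP_WORD = "<|im_end|>"


def split_output(chunks):
    """Join the output strings and return a (thinking, answer) tuple."""
    text = "".join(chunks).strip()
    # Drop a trailing stop word if the model emitted one.
    if text.endswith(STOP_WORD):
        text = text[: -len(STOP_WORD)].rstrip()
    start = text.find(THINK_OPEN)
    end = text.find(THINK_CLOSE)
    # Without a complete marker pair, treat everything as the answer.
    if start == -1 or end == -1 or end < start:
        return "", text
    thinking = text[start + len(THINK_OPEN):end].strip()
    answer = text[end + len(THINK_CLOSE):].strip()
    return thinking, answer
```

Returning the two parts separately lets an application show only the answer while keeping the reasoning available for debugging.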

Example Execution Logs
Loaded image: /tmp/tmplw0gw75ldemo1.jpeg
Image token added to prompt and image object included.
Processing conversation and image (if any)...
Stop words: ['<|im_end|>']
Starting generation...
conversations = [{'role': 'user', 'content': '<image>\nWhere am I?', 'images': [<PIL.Image.Image image mode=RGB size=2516x1410 at 0x7BA35CF966D0>]}, {'role': 'assistant', 'content': None}]
raw text = <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_user|>user<|im_middle|><|media_start|>image<|media_content|><|media_pad|><|media_end|><image>
Where am I?<|im_end|><|im_assistant|>assistant<|im_middle|>
Generation complete.
Finished prediction. Params: temp=0.6, top_p=1.0, max_len=2048
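The raw text line in the logs shows how the conversation is flattened into a single prompt string. The sketch below reproduces that logged layout for illustration only; it is not the model's actual template code. The function name render_prompt and the choice to emit one media block per attached image are assumptions.

```python
MEDIA_BLOCK = "<|media_start|>image<|media_content|><|media_pad|><|media_end|>"


def render_prompt(conversations, system="You are a helpful assistant"):
    """Render a list of turns into the prompt layout seen in the logs."""
    parts = [f"<|im_system|>system<|im_middle|>{system}<|im_end|>"]
    for turn in conversations:
        role = turn["role"]
        header = f"<|im_{role}|>{role}<|im_middle|>"
        if turn.get("content") is None:
            # An empty assistant turn is left open for the model to complete.
            parts.append(header)
            continue
        # Assumption: one media block is emitted per attached image.
        media = MEDIA_BLOCK * len(turn.get("images", []))
        parts.append(f"{header}{media}{turn['content']}<|im_end|>")
    return "".join(parts)


conversations = [
    # "demo-image" is a placeholder standing in for the PIL image object.
    {"role": "user", "content": "<image>\nWhere am I?", "images": ["demo-image"]},
    {"role": "assistant", "content": None},
]
prompt_text = render_prompt(conversations)
```

With the conversation from the logs, prompt_text ends with the open assistant header, which is where generation begins.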
Version Details
Version ID
c44e6d137b421d27afa9bf4f2360833d4f05a697b549356d44c3a56760a3164d
Version Created
April 17, 2025