google/gemini-3-flash 🖼️🔢📝❓ → 📝

⭐ Official ▶️ 331.6K runs 📅 Jan 2026 ⚙️ Cog 0.16.9
image-to-text text-generation video-to-text

About

Google's most intelligent model built for speed with frontier intelligence, superior search, and grounding

Example Output

Prompt:

"Based on the video content (as there is no audio), come up with your own transcript. Imagine the audience is incredibly technical. Include timestamps."

Output

Transcript: The Architecture and Training Dynamics of Large Language Models
Audience Profile: Machine Learning Engineers, Data Scientists, and AI Researchers.


00:00
[Title Slide: Large Language Models for the curious beginner]
Welcome to a deep dive into the underlying mechanics of Large Language Models (LLMs). While often perceived as intuitive agents, these systems are fundamentally high-dimensional statistical engines designed for next-token prediction.

00:14
Consider a standard prompt involving the history and physical properties of transistors. At its core, the model is performing an iterative inference task. Given an input sequence, it calculates a probability distribution over the entire vocabulary to select the most likely subsequent token.

00:39
Whether completing a Shakespearean quote or responding as a technical assistant, the process remains consistent: the model consumes a context window and outputs a token. This is not a simple lookup; it is a calculation based on weighted probabilities. For the phrase "Paris is a city in...", the model doesn't just "know" the answer; it assigns a 17% probability to "France," 15% to "and," and so on, based on its training distribution.
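The weighted-probability claim above can be sketched directly. A minimal sampler, assuming the quoted 17%/15% figures with all remaining probability mass lumped into a placeholder token:

```python
# Hypothetical next-token distribution for "Paris is a city in ...",
# using the probabilities quoted above; the rest of the vocabulary is
# collapsed into a single "<other>" bucket for illustration.
import random

next_token_probs = {"France": 0.17, "and": 0.15, "<other>": 0.68}

def sample_next_token(probs, rng):
    # random.choices draws tokens proportionally to the given weights
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {t: 0 for t in next_token_probs}
for _ in range(10_000):
    counts[sample_next_token(next_token_probs, rng)] += 1

# Empirical frequencies approximate the stated distribution,
# e.g. counts["France"] / 10_000 is close to 0.17
```

Sampling (rather than always taking the argmax) is what `temperature` and `top_p` modulate in the API parameters further down this page.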

00:52
In an "AI Assistant" mode, we see this autoregressive generation in action. The model maintains the conversation state, predicting "Sure," then "there," then "are," effectively traversing a path through a massive probability tree to construct a coherent response.
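The autoregressive loop described here can be sketched with a toy lookup table standing in for the network; the table entries are illustrative, not real model output:

```python
# Minimal sketch of autoregressive (greedy) decoding. The "model" is a
# toy context-to-token table standing in for a real network.
def toy_model(context):
    # Returns the most likely next token for the current context.
    table = {
        (): "Sure",
        ("Sure",): "there",
        ("Sure", "there"): "are",
    }
    return table.get(tuple(context), "<eos>")

def generate(model, max_tokens=10):
    context = []
    while len(context) < max_tokens:
        token = model(context)
        if token == "<eos>":
            break
        context.append(token)  # feed the prediction back in as context
    return context

print(generate(toy_model))  # ['Sure', 'there', 'are']
```

The essential point is the feedback: each predicted token is appended to the context before the next prediction, which is what "traversing a path through a probability tree" means operationally.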

01:28
This predictive capability is derived from an enormous corpus of human-generated text. To move beyond random output, the model must "ingest" billions of lines of code, literature, and discourse. However, the information isn't stored as text; it is encoded within the model’s internal state.

01:49
If we look "under the hood" of the LLM block, we find a massive array of tunable parameters, or "weights." These weights define the strength of connections between neurons in the network. A "Large" Language Model is defined by the sheer scale of these parameters—often numbering in the hundreds of billions. Each weight acts as a dial that influences the signal as it passes through the network.

02:18
Initially, these weights are randomized, resulting in gibberish. The training process involves feeding the model a snippet of text, hiding the next word, and asking the model to predict it.

02:33
When the model predicts incorrectly—say, guessing "age" instead of "worst" in a Dickens quote—an optimization algorithm (typically stochastic gradient descent) adjusts the weights across the entire network to minimize the error.
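One such weight update can be sketched in miniature, shrunk to a single scalar weight and a squared-error loss; real training averages gradients over batches and updates billions of weights at once:

```python
# One stochastic-gradient-descent step on a single scalar weight.
# Loss is the squared error between the prediction w * x and target y.
def sgd_step(w, x, y, lr=0.1):
    pred = w * x
    error = pred - y
    grad = 2 * error * x      # d/dw of (w*x - y)^2
    return w - lr * grad      # move the weight against the gradient

w = 0.0                        # "randomized" (here: zero) initial weight
for _ in range(50):
    w = sgd_step(w, x=1.0, y=3.0)

# w converges toward 3.0, driving the prediction error toward zero
```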

03:10
The scale of this operation is staggering. To achieve state-of-the-art performance, a model requires an amount of computation equivalent to one billion operations per second sustained for over 100 million years. Only massive parallelization makes this feasible, and it is this scale of computation that allows the model to capture the nuances of human language.
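Taking the quoted figure literally, the total operation count works out as:

```python
# Back-of-envelope conversion of "one billion operations per second for
# 100 million years" into a total operation count.
ops_per_second = 1e9
seconds_per_year = 365.25 * 24 * 3600      # about 3.156e7
years = 100e6
total_ops = ops_per_second * seconds_per_year * years
print(f"{total_ops:.2e}")  # about 3.16e+24 operations
```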

03:45
The lifecycle of an LLM involves two distinct phases. Step 1: Pretraining, where the model learns the statistical structure of language through unsupervised learning on a massive dataset. Step 2: RLHF (Reinforcement Learning from Human Feedback). This is the alignment phase, where human graders rank model outputs to fine-tune the model's behavior, transforming it from a simple text-completer into a helpful, safe assistant.

04:22
Hardware is the silent enabler here. Graphics Processing Units (GPUs) are utilized not for rendering, but for their ability to perform massive matrix multiplications in parallel. Input vectors are transformed by weight matrices to produce output vectors at incredible speeds.
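The matrix transform described here, reduced to a 2x3 toy example in pure Python; a GPU runs the same arithmetic across thousands of cores in parallel:

```python
# The core GPU workload: a weight matrix transforming an input vector
# into an output vector. Each output element is the dot product of one
# weight row with the input.
def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
x = [3.0, 4.0, 5.0]
print(matvec(W, x))  # [13.0, -1.0]
```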

04:29
Before 2017, NLP models processed text sequentially, which was computationally inefficient and struggled with long-range dependencies. This changed with the seminal paper "Attention Is All You Need," which introduced the Transformer architecture.

04:51
The "Secret Sauce" of the Transformer is the Attention Mechanism. In this framework, tokens are represented as high-dimensional vectors (embeddings). The attention mechanism allows these vectors to "communicate" with each other.

05:15
For example, the word "bank" has a different vector representation in the context of "river" than it does in the context of "check" or "deposit." Attention allows the model to update the vector of a word based on the semantic meaning of the surrounding tokens.
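Scaled dot-product attention, the mechanism referenced here, can be sketched on 2-D toy vectors. The query/key/value numbers below are illustrative only, chosen so the first key dominates:

```python
# Minimal scaled dot-product attention: each token's output vector is a
# weighted average of all value vectors, weighted by
# softmax(q . k / sqrt(d)).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Toy example: both queries align with the first key, so both output
# vectors are pulled strongly toward the first token's value vector.
Q = [[1.0, 0.0], [1.0, 0.0]]
K = [[4.0, 0.0], [0.0, 4.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

This weighted averaging is how the vector for "bank" can absorb information from a nearby "river" or "deposit".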

05:38
A Transformer block consists of an Attention layer followed by a Feedforward Neural Network. These blocks are stacked dozens of times. As data passes through these layers, the representation of each token becomes increasingly sophisticated, moving from a literal word to a complex bundle of semantic and grammatical concepts.

06:05
By the time the signal reaches the final layer, the vector for the final token contains a rich summary of the entire context, allowing the model to project that vector back into vocabulary space and pick the next word with high precision.
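The final projection step, sketched on a toy 4-word vocabulary; the hidden vector and unembedding rows are made up for illustration:

```python
# Project the last token's vector back into vocabulary space (logits),
# then pick the highest-scoring word. One logit per vocabulary entry:
# a dot product with that word's unembedding row.
vocab = ["France", "and", "Paris", "the"]

def predict_next(hidden, W_unembed):
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_unembed]
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

hidden = [0.9, -0.2]              # hypothetical final-layer vector
W_unembed = [[1.0, 0.5],          # row for "France"
             [0.2, 1.0],          # row for "and"
             [-0.5, 0.3],         # row for "Paris"
             [0.1, 0.1]]          # row for "the"
print(predict_next(hidden, W_unembed))  # "France"
```

In practice the logits are passed through a softmax (as in the attention sketch) to recover the probability distribution from which the next token is sampled.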

06:47
What we perceive as intelligence is, in fact, the result of billions of parameters perfectly tuned through massive computation to navigate the complexities of human communication.

07:05
[Credits and acknowledgments to the Computer History Museum and 3Blue1Brown's Deep Learning series.]

Performance Metrics

29.71s Prediction Time
29.94s Total Time
All Input Parameters
{
  "top_p": 0.95,
  "images": [],
  "prompt": "Based on the video content (as there is no audio), come up with your own transcript. Imagine the audience is incredibly technical. Include timestamps.",
  "videos": [
    "https://replicate.delivery/pbxt/OTslXEbnYxZRWDuYXCG4oO0SIcriyQdn4FDWLzJhiVVMTAUF/videoplayback%20%281%29.mp4"
  ],
  "temperature": 1,
  "thinking_level": "low",
  "max_output_tokens": 65535
}
Input Parameters
audio · Type: string
Input audio to send with the prompt (max 1 audio file, up to 8.4 hours)
top_p · Type: number · Default: 0.95 · Range: 0–1
Nucleus sampling parameter: the model samples from the smallest set of tokens whose cumulative probability mass reaches top_p
images · Type: array · Default: []
Input images to send with the prompt (max 10 images, each up to 7MB)
prompt (required) · Type: string
The text prompt to send to the model
videos · Type: array · Default: []
Input videos to send with the prompt (max 10 videos, each up to 45 minutes)
temperature · Type: number · Default: 1 · Range: 0–2
Sampling temperature between 0 and 2
thinking_level · Type: string
Thinking level for reasoning ("low" or "high"). Replaces thinking_budget for Gemini 3 models.
max_output_tokens · Type: integer · Default: 65535 · Range: 1–65535
Maximum number of tokens to generate
system_instruction · Type: string
System instruction to guide the model's behavior
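A sketch of building an input payload against the parameters documented above. The payload keys mirror this page's schema; the commented-out `replicate.run` call pattern and model identifier follow Replicate's standard Python client and are an assumption, not part of this page:

```python
# Build and validate an input payload for this model. The ranges checked
# below come from the parameter documentation on this page.
payload = {
    "prompt": "Describe the attention mechanism for a technical audience.",
    "top_p": 0.95,
    "temperature": 1,
    "thinking_level": "low",
    "max_output_tokens": 65535,
}

assert 0 <= payload["top_p"] <= 1
assert 0 <= payload["temperature"] <= 2
assert payload["thinking_level"] in ("low", "high")
assert 1 <= payload["max_output_tokens"] <= 65535

# Assumed call pattern via Replicate's Python client (pip install replicate):
# import replicate
# output = replicate.run("google/gemini-3-flash", input=payload)
# print("".join(output))  # output schema: array of strings
```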
Output Schema

Output

Type: array · Items: string

Example Execution Logs
Generating text with gemini-3-flash-preview...
Uploading 1 videos to Gemini Files API...
Uploaded cog-input-1482576100.mp4, waiting for processing...
Processing video... (state: PROCESSING)
Processing video... (state: PROCESSING)
Video ready (took 10s)
Warning: there are non-text parts in the response: ['thought_signature'], returning concatenated text result from text parts. Check the full candidates.content.parts accessor to get the full model response.
Warning: there are non-text parts in the response: ['thought_signature'], returning concatenated text result from text parts. Check the full candidates.content.parts accessor to get the full model response.
Text generation completed in 29.57s
Input token count: 30557
Output token count: 1160
Version Details
Version ID
12917939800a325e127c528db67c32fe8a23a51c0400690e68c8731c2508c553
Version Created
January 26, 2026
Run on Replicate →