ibm-granite/granite-3.3-8b-instruct

⭐ Official ▶️ 1.4M runs 📅 Apr 2025 ⚙️ Cog 0.16.7 🔗 GitHub ⚖️ License
code-generation document-summarization multilingual text-generation text-translation

About

Granite-3.3-8B-Instruct is an 8-billion-parameter language model with a 128K-token context length, fine-tuned for improved reasoning and instruction-following capabilities.

Example Output

Prompt:

"How is perplexity measured for LLMs and why is it useful?"

Output

Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It measures how well a language model predicts a sample. A lower perplexity score indicates better performance, as it suggests the model is more confident in its predictions.

Perplexity is calculated by raising 2 to the average negative log-likelihood of the test set. In simpler terms, it quantifies the uncertainty of the model in predicting the next word in a sequence. A perplexity of 2^n means that, on average, the model is uncertain about the next word, assigning a probability of 1/n to the correct word.

Here's a step-by-step breakdown of the calculation:

  1. For each word in the test set, the language model computes a probability distribution over the vocabulary, predicting the likelihood of the next word.
  2. The negative logarithm (base 2) of these probabilities is calculated for each word.
  3. The negative log-likelihood is averaged across all words in the test set.
  4. 2 raised to the power of this average negative log-likelihood gives the perplexity score.

Perplexity is useful for several reasons:

  1. Model comparison: It allows for comparing different language models or model configurations. A model with lower perplexity on a given test set is generally considered to have better performance.

  2. Training optimization: Perplexity can be used as a loss function during training, guiding the model to minimize uncertainty in its predictions. Lower perplexity during training indicates that the model is learning to make more accurate predictions.

  3. Transfer learning and fine-tuning: Perplexity can help assess the impact of transfer learning or fine-tuning on pre-trained language models. A decrease in perplexity after applying these techniques suggests improved performance.

  4. Understanding model behavior: Perplexity can offer insights into how well the model captures the underlying patterns and structures of the language.

However, it's essential to consider the limitations of perplexity. It doesn't always correlate with human-evaluated quality, especially for specific tasks or domains. Moreover, it tends to favor models that assign very low probabilities to a small subset of words, which might not necessarily indicate better performance in practical applications.

In summary, perplexity is a valuable
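Aside (not part of the model output above): the calculation the model describes can be sketched in a few lines of Python. One clarification of the math: a corpus perplexity of P means the model assigned, on average, probability 1/P to each correct next token. The snippet below is an illustrative sketch, not code that ships with this model; it assumes the per-token probabilities the model assigned to the correct words are already available.

import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each correct token.

    token_probs: list of P(correct next token | context), one value per position.
    """
    # Average negative log-likelihood (base 2) over the test set.
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    # Perplexity is 2 raised to that average.
    return 2 ** avg_neg_log2

# Example: a model that assigns probability 0.25 to every correct token has
# perplexity 4, i.e. it is "choosing among 4 equally likely words" on average.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0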

Performance Metrics

3.98s Prediction Time
268.92s Total Time
All Input Parameters
{
  "top_k": 50,
  "top_p": 0.9,
  "prompt": "How is perplexity measured for LLMs and why is it useful?",
  "max_tokens": 512,
  "min_tokens": 0,
  "temperature": 0.6,
  "presence_penalty": 0,
  "frequency_penalty": 0
}
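For reference, the same input can be sent through the Replicate Python client roughly as follows. This is a minimal sketch: the client must be installed and authenticated separately, and the exact model or version string you run against may differ.

import replicate  # pip install replicate; requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "ibm-granite/granite-3.3-8b-instruct",
    input={
        "prompt": "How is perplexity measured for LLMs and why is it useful?",
        "top_k": 50,
        "top_p": 0.9,
        "max_tokens": 512,
        "min_tokens": 0,
        "temperature": 0.6,
        "presence_penalty": 0,
        "frequency_penalty": 0,
    },
)

# The output schema is an array of strings, so join the chunks into one response.
print("".join(output))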
Input Parameters
seed Type: integer
Random seed. Leave unspecified to randomize the seed.
stop Type: array
A list of sequences to stop generation at. For example, ["<end>","<stop>"] will stop generation at the first instance of "<end>" or "<stop>".
tools Type: array
Tools for request. Passed to the chat template.
top_k Type: integer, Default: 50
The number of highest probability tokens to consider for generating the output. If > 0, only keep the top k tokens with highest probability (top-k filtering).
top_p Type: number, Default: 0.9
A probability threshold for generating the output. If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751). See the sketch after this parameter list for an illustration of top-k and top-p filtering.
prompt Type: string
Completion API user prompt.
stream Type: boolean, Default: false
Request streaming response. Defaults to False.
messages Type: array
Chat completion API messages.
documents Type: array
Documents for request. Passed to the chat template.
max_tokens Type: integer
max_tokens is deprecated in favor of the max_completion_tokens field.
min_tokens Type: integer, Default: 0
The minimum number of tokens the model should generate as output.
temperature Type: number, Default: 0.6
The value used to modulate the next token probabilities.
tool_choice Type: string
Tool choice for request. If the choice is a specific function, this should be specified as a JSON string.
chat_template Type: string
A template to format the prompt with. If not specified, the chat template provided by the model will be used.
system_prompt Type: string
Completion API system prompt. The chat template provides a good default.
response_format Type: object
An object specifying the format that the model must output.
presence_penalty Type: number, Default: 0
Presence penalty.
frequency_penalty Type: number, Default: 0
Frequency penalty.
chat_template_kwargs Type: object
Additional arguments to be passed to the chat template.
add_generation_prompt Type: boolean, Default: true
Add generation prompt. Passed to the chat template. Defaults to True.
max_completion_tokens Type: integer
An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens.
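To make the top_k and top_p descriptions above concrete, here is a rough sketch of top-k plus nucleus (top-p) filtering applied to a single next-token distribution. It is illustrative only: the function name is hypothetical, it assumes NumPy, and the actual sampling for this model is handled internally by vLLM (see the execution logs below).

import numpy as np

def filter_next_token_probs(probs, top_k=50, top_p=0.9):
    """Illustrative top-k + nucleus (top-p) filtering of one next-token distribution."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # token ids, most probable first
    sorted_probs = probs[order]

    keep = np.ones_like(sorted_probs, dtype=bool)
    if top_k > 0:
        keep[top_k:] = False                     # top-k: drop everything past the k best
    cumulative = np.cumsum(sorted_probs)
    keep &= (cumulative - sorted_probs) < top_p  # top-p: keep the smallest set reaching top_p

    filtered = np.zeros_like(probs)
    filtered[order[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()             # renormalize before sampling

# Toy vocabulary of 5 tokens: with top_p=0.9, only the first three survive.
print(filter_next_token_probs([0.5, 0.3, 0.1, 0.06, 0.04]))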
Output Schema

Output

Type: array, Items Type: string
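Because the output is returned as an array of string chunks, streaming consumers typically print the pieces as they arrive rather than waiting for the joined text. A minimal sketch, assuming a recent version of the Replicate Python client that provides replicate.stream:

import replicate

# Print the chunks as they are generated instead of waiting for the full completion.
for event in replicate.stream(
    "ibm-granite/granite-3.3-8b-instruct",
    input={"prompt": "How is perplexity measured for LLMs and why is it useful?"},
):
    print(str(event), end="")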

Example Execution Logs
2025-04-15 17:15:15 [info     ] predict() commencing           request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?
2025-04-15 17:15:15 [debug    ] Formatted prompt using chat template formatted_prompt=<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.
 Today's Date: April 15, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How is perplexity measured for LLMs and why is it useful?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|> request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?
2025-04-15 17:15:15 [debug    ] SamplingParams                 request_id=c23a193892654772b688b7fd97f894b8 sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.9, top_k=50, min_p=0.0, seed=22349, stop=[], stop_token_ids=[0], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None) user_prompt=How is perplexity measured for LLMs and why is it useful?
INFO 04-15 17:15:15 [async_llm.py:228] Added request c23a193892654772b688b7fd97f894b8.
2025-04-15 17:15:19 [debug    ] result                         finish_reason=length request_id=c23a193892654772b688b7fd97f894b8 text=Perplexity is a common metric used to evaluate the performance of language models, including large language models (LLMs). It measures how well a language model predicts a sample. A lower perplexity score indicates better performance, as it suggests the model is more confident in its predictions.

Perplexity is calculated by raising 2 to the average negative log-likelihood of the test set. In simpler terms, it quantifies the uncertainty of the model in predicting the next word in a sequence. A perplexity of 2^n means that, on average, the model is uncertain about the next word, assigning a probability of 1/n to the correct word.

Here's a step-by-step breakdown of the calculation:

1. For each word in the test set, the language model computes a probability distribution over the vocabulary, predicting the likelihood of the next word.
2. The negative logarithm (base 2) of these probabilities is calculated for each word.
3. The negative log-likelihood is averaged across all words in the test set.
4. 2 raised to the power of this average negative log-likelihood gives the perplexity score.

Perplexity is useful for several reasons:

1. **Model comparison**: It allows for comparing different language models or model configurations. A model with lower perplexity on a given test set is generally considered to have better performance.

2. **Training optimization**: Perplexity can be used as a loss function during training, guiding the model to minimize uncertainty in its predictions. Lower perplexity during training indicates that the model is learning to make more accurate predictions.

3. **Transfer learning and fine-tuning**: Perplexity can help assess the impact of transfer learning or fine-tuning on pre-trained language models. A decrease in perplexity after applying these techniques suggests improved performance.

4. **Understanding model behavior**: Perplexity can offer insights into how well the model captures the underlying patterns and structures of the language.

However, it's essential to consider the limitations of perplexity. It doesn't always correlate with human-evaluated quality, especially for specific tasks or domains. Moreover, it tends to favor models that assign very low probabilities to a small subset of words, which might not necessarily indicate better performance in practical applications.

In summary, perplexity is a valuable user_prompt=How is perplexity measured for LLMs and why is it useful?
2025-04-15 17:15:19 [info     ] Generation took 3.98s          request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?
/root/.pyenv/versions/3.11.12/lib/python3.11/site-packages/cog/server/scope.py:21: ExperimentalFeatureWarning: current_scope is an experimental internal function. It may change or be removed without warning.
  warnings.warn(
2025-04-15 17:15:19 [info     ] predict() complete             request_id=c23a193892654772b688b7fd97f894b8 user_prompt=How is perplexity measured for LLMs and why is it useful?
Version Details
Version ID
618ecbe80773609e96ea19d8c96e708f6f2b368bb89be8fad509983194466bf8
Version Created
September 19, 2025