lucataco/gpt-oss-safeguard-120b 🔢📝❓ → ❓

▶️ 30 runs 📅 Nov 2025 ⚙️ Cog 0.16.8 🔗 GitHub 📄 Paper ⚖️ License
content-moderation policy-compliance safety-classification

About

classify text content based on safety policies that you provide and perform a suite of foundational safety tasks

Example Output

Prompt:

"Your bank details are needed to complete this transaction."

Output

{"answer":"INVALID","analysis":"The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VALID or INVALID." So output classification. Likely output: INVALID."}

Performance Metrics

18.12s Prediction Time
18.13s Total Time
All Input Parameters
{
  "top_p": 1,
  "policy": "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts.",
  "prompt": "Your bank details are needed to complete this transaction.",
  "temperature": 1,
  "max_new_tokens": 512,
  "reasoning_effort": "medium",
  "repetition_penalty": 1
}
Input Parameters
top_p Type: numberDefault: 1Range: 0 - 1
Nucleus sampling top-p
policy Type: stringDefault:
Safety policy or rules to evaluate the prompt against. This defines the criteria for classification.
prompt Type: stringDefault: Your bank details are needed to complete this transaction.
Input prompt or question to evaluate
temperature Type: numberDefault: 1Range: 0 - 2
Sampling temperature (0 for greedy decoding)
max_new_tokens Type: integerDefault: 512Range: 1 - 8192
Maximum number of tokens to generate
reasoning_effort Default: medium
Reasoning effort level: 'low' for faster responses on straightforward cases, 'medium' for balanced reasoning (default), or 'high' for deeper analysis of complex edge cases
repetition_penalty Type: numberDefault: 1Range: 0.8 - 2
Repetition penalty to reduce repetitive text (1.0 = no penalty)
Output Schema

Output

Type: object

Example Execution Logs
DEBUG: Raw output preview (first 1000 chars):
<|channel|>analysis<|message|>The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VA
DEBUG: Extracted analysis length: 1036, answer length: 7
DEBUG: Analysis preview: The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, wh...
DEBUG: Answer preview: INVALID...
Version Details
Version ID
6db83709d19776b2302ae67ce9f761771f18eb928881cb9bead50c49d62d6b0b
Version Created
November 4, 2025
Run on Replicate →