lucataco/gpt-oss-safeguard-120b 🔢📝❓ → ❓

▶️ 49 runs 📅 Nov 2025 ⚙️ Cog 0.16.8 🔗 GitHub 📄 Paper ⚖️ License

content-moderation policy-compliance safety-classification

About

classify text content based on safety policies that you provide and perform a suite of foundational safety tasks

Example Output

Prompt:

"Your bank details are needed to complete this transaction."

Output

{"answer":"INVALID","analysis":"The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VALID or INVALID." So output classification. Likely output: INVALID."}

Performance Metrics

18.12s Prediction Time

18.13s Total Time

All Input Parameters

{
  "top_p": 1,
  "policy": "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts.",
  "prompt": "Your bank details are needed to complete this transaction.",
  "temperature": 1,
  "max_new_tokens": 512,
  "reasoning_effort": "medium",
  "repetition_penalty": 1
}

Input Parameters

top_p Type: numberDefault: 1Range: 0 - 1: Nucleus sampling top-p
policy Type: stringDefault:: Safety policy or rules to evaluate the prompt against. This defines the criteria for classification.
prompt Type: stringDefault: Your bank details are needed to complete this transaction.: Input prompt or question to evaluate
temperature Type: numberDefault: 1Range: 0 - 2: Sampling temperature (0 for greedy decoding)
max_new_tokens Type: integerDefault: 512Range: 1 - 8192: Maximum number of tokens to generate
reasoning_effort Default: medium: Reasoning effort level: 'low' for faster responses on straightforward cases, 'medium' for balanced reasoning (default), or 'high' for deeper analysis of complex edge cases
repetition_penalty Type: numberDefault: 1Range: 0.8 - 2: Repetition penalty to reduce repetitive text (1.0 = no penalty)

Output Schema

Output

Type: object

Example Execution Logs

DEBUG: Raw output preview (first 1000 chars):
<|channel|>analysis<|message|>The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VA
DEBUG: Extracted analysis length: 1036, answer length: 7
DEBUG: Analysis preview: The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, wh...
DEBUG: Answer preview: INVALID...

Version Details

Version ID: 6db83709d19776b2302ae67ce9f761771f18eb928881cb9bead50c49d62d6b0b
Version Created: November 4, 2025

Run on Replicate →