lucataco/gpt-oss-safeguard-20b 🔢📝❓ → ❓

▶️ 12 runs 📅 Oct 2025 ⚙️ Cog 0.16.8 🔗 GitHub 📄 Paper ⚖️ License
content-moderation safety-evaluation text-classification

About

Classifies text content against safety policies that you provide and performs a suite of foundational safety tasks.

Example Output

Prompt:

"Your bank details are needed to complete this transaction."

Output

{"answer":"INVALID","analysis":"We need to see if this is spam. The user says: \"Your bank details are needed to complete this transaction.\" That is a request for bank details. It's a request from presumably user to some service, but the question asks to classify the content as VALID or INVALID. According to spam policy: Spam includes unsolicited promotional content, deceptive messages, or phishing attempts. This statement looks like it is a phishing attempt: requesting bank details. Also, we might treat it as spam: content is a phishing attempt to get bank details. So INVALID. However, the content is just a short sentence. But we need to classify it. The content is a direct request: \"Your bank details are needed to complete this transaction.\". That is suspicious. Probably a spam or phishing attempt. So INVALID. The model should output \"INVALID\". Let's produce that."}

Performance Metrics

7.49s Prediction Time
7.50s Total Time
All Input Parameters
{
  "top_p": 1,
  "policy": "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts.",
  "prompt": "Your bank details are needed to complete this transaction.",
  "temperature": 1,
  "max_new_tokens": 512,
  "reasoning_effort": "medium",
  "repetition_penalty": 1
}
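The same run could be reproduced with the `replicate` Python client. A minimal sketch, with the API call left as a comment since it requires a `REPLICATE_API_TOKEN` and network access; the payload mirrors the example input above, and the version id comes from the Version Details section:

```python
import json

# Payload matching the "All Input Parameters" example above.
payload = {
    "top_p": 1,
    "policy": (
        "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). "
        "Spam includes unsolicited promotional content, deceptive messages, "
        "or phishing attempts."
    ),
    "prompt": "Your bank details are needed to complete this transaction.",
    "temperature": 1,
    "max_new_tokens": 512,
    "reasoning_effort": "medium",
    "repetition_penalty": 1,
}

# With the client installed (pip install replicate) and a token set,
# the prediction would be submitted roughly like this:
# import replicate
# output = replicate.run(
#     "lucataco/gpt-oss-safeguard-20b:f10ffd6deb8a2fe73746ade74b5281b33d20c90d7c90b54d6068cfc0c072cbbe",
#     input=payload,
# )

print(json.dumps(payload, indent=2))
```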
Input Parameters
top_p
Type: number | Default: 1 | Range: 0–1
Nucleus sampling top-p

policy
Type: string | Default: (empty)
Safety policy or rules to evaluate the prompt against. This defines the criteria for classification.

prompt
Type: string | Default: "Your bank details are needed to complete this transaction."
Input prompt or question to evaluate

temperature
Type: number | Default: 1 | Range: 0–2
Sampling temperature (0 for greedy decoding)

max_new_tokens
Type: integer | Default: 512 | Range: 1–8192
Maximum number of tokens to generate

reasoning_effort
Default: medium
Reasoning effort level: 'low' for faster responses on straightforward cases, 'medium' for balanced reasoning (default), or 'high' for deeper analysis of complex edge cases

repetition_penalty
Type: number | Default: 1 | Range: 0.8–2
Repetition penalty to reduce repetitive text (1.0 = no penalty)
Output Schema

Output

Type: object

Example Execution Logs
DEBUG: Raw output preview (first 1000 chars):
<|channel|>analysis<|message|>We need to see if this is spam. The user says: "Your bank details are needed to complete this transaction." That is a request for bank details. It's a request from presumably user to some service, but the question asks to classify the content as VALID or INVALID. According to spam policy: Spam includes unsolicited promotional content, deceptive messages, or phishing attempts. This statement looks like it is a phishing attempt: requesting bank details.
Also, we might treat it as spam: content is a phishing attempt to get bank details. So INVALID. However, the content is just a short sentence. But we need to classify it. The content is a direct request: "Your bank details are needed to complete this transaction.". That is suspicious. Probably a spam or phishing attempt. So INVALID. The model should output "INVALID". Let's produce that.<|end|><|start|>assistant<|channel|>final<|message|>INVALID<|return|>
DEBUG: Extracted analysis length: 845, answer length: 7
DEBUG: Analysis preview: We need to see if this is spam. The user says: "Your bank details are needed to complete this transaction." That is a request for bank details. It's a request from presumably user to some service, but...
DEBUG: Answer preview: INVALID...
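The logs show the raw output in the harmony channel format (`analysis` channel for reasoning, `final` channel for the verdict) before the analysis and answer are extracted. A minimal sketch of that split; `split_channels` is a hypothetical helper, not the model's actual extraction code:

```python
import re

# Abbreviated raw output in the harmony channel format seen in the logs above.
raw = (
    "<|channel|>analysis<|message|>We need to see if this is spam. "
    "Probably a phishing attempt. So INVALID.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>INVALID<|return|>"
)

def split_channels(text):
    """Extract (analysis, answer) from harmony-style raw model output."""
    analysis_m = re.search(r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", text, re.S)
    final_m = re.search(r"<\|channel\|>final<\|message\|>(.*?)<\|return\|>", text, re.S)
    return (
        analysis_m.group(1).strip() if analysis_m else "",
        final_m.group(1).strip() if final_m else "",
    )

analysis, answer = split_channels(raw)
print(answer)  # INVALID
```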
Version Details
Version ID
f10ffd6deb8a2fe73746ade74b5281b33d20c90d7c90b54d6068cfc0c072cbbe
Version Created
November 4, 2025
Run on Replicate →