lucataco/gpt-oss-safeguard-20b 🔢📝❓ → ❓
About
Classifies text content against safety policies that you provide and performs a suite of foundational safety tasks.
Example Output
Prompt:
"Your bank details are needed to complete this transaction."
Output
{"answer":"INVALID","analysis":"We need to see if this is spam. The user says: \"Your bank details are needed to complete this transaction.\" That is a request for bank details. It's a request from presumably user to some service, but the question asks to classify the content as VALID or INVALID. According to spam policy: Spam includes unsolicited promotional content, deceptive messages, or phishing attempts. This statement looks like it is a phishing attempt: requesting bank details. Also, we might treat it as spam: content is a phishing attempt to get bank details. So INVALID. However, the content is just a short sentence. But we need to classify it. The content is a direct request: \"Your bank details are needed to complete this transaction.\" That is suspicious. Probably a spam or phishing attempt. So INVALID. The model should output \"INVALID\". Let's produce that."}
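The model's output is a JSON object with an `answer` field (the classification) and an `analysis` field (the reasoning trace). A minimal sketch of consuming that output in Python, assuming the output arrives as a JSON string like the example above:

```python
import json

# Abbreviated example of the model's JSON output shown above.
raw = (
    '{"answer": "INVALID", '
    '"analysis": "This statement looks like a phishing attempt: '
    'requesting bank details."}'
)

result = json.loads(raw)

# Per the spam policy, INVALID means the content was classified as spam.
is_spam = result["answer"] == "INVALID"
```
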
Performance Metrics
7.49s
Prediction Time
7.50s
Total Time
All Input Parameters
{
"top_p": 1,
"policy": "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts.",
"prompt": "Your bank details are needed to complete this transaction.",
"temperature": 1,
"max_new_tokens": 512,
"reasoning_effort": "medium",
"repetition_penalty": 1
}
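The parameters above can be passed directly as the model input. A sketch of invoking the model with the official Replicate Python client, assuming `REPLICATE_API_TOKEN` is set in your environment (the model identifier and parameter names come from this page; the call itself is left commented out):

```python
# Input payload matching the "All Input Parameters" example above.
input_params = {
    "top_p": 1,
    "policy": (
        "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). "
        "Spam includes unsolicited promotional content, deceptive messages, "
        "or phishing attempts."
    ),
    "prompt": "Your bank details are needed to complete this transaction.",
    "temperature": 1,
    "max_new_tokens": 512,
    "reasoning_effort": "medium",
    "repetition_penalty": 1,
}

# Requires the `replicate` package and an API token:
# import replicate
# output = replicate.run("lucataco/gpt-oss-safeguard-20b", input=input_params)
```
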
Input Parameters
- top_p
- Nucleus sampling top-p
- policy
- Safety policy or rules to evaluate the prompt against. This defines the criteria for classification.
- prompt
- Input prompt or question to evaluate
- temperature
- Sampling temperature (0 for greedy decoding)
- max_new_tokens
- Maximum number of tokens to generate
- reasoning_effort
- Reasoning effort level: 'low' for faster responses on straightforward cases, 'medium' for balanced reasoning (default), or 'high' for deeper analysis of complex edge cases
- repetition_penalty
- Repetition penalty to reduce repetitive text (1.0 = no penalty)
Example Execution Logs
DEBUG: Raw output preview (first 1000 chars): <|channel|>analysis<|message|>We need to see if this is spam. The user says: "Your bank details are needed to complete this transaction." That is a request for bank details. It's a request from presumably user to some service, but the question asks to classify the content as VALID or INVALID. According to spam policy: Spam includes unsolicited promotional content, deceptive messages, or phishing attempts. This statement looks like it is a phishing attempt: requesting bank details. Also, we might treat it as spam: content is a phishing attempt to get bank details. So INVALID. However, the content is just a short sentence. But we need to classify it. The content is a direct request: "Your bank details are needed to complete this transaction.". That is suspicious. Probably a spam or phishing attempt. So INVALID. The model should output "INVALID". Let's produce that.<|end|><|start|>assistant<|channel|>final<|message|>INVALID<|return|>
DEBUG: Extracted analysis length: 845, answer length: 7
DEBUG: Analysis preview: We need to see if this is spam. The user says: "Your bank details are needed to complete this transaction." That is a request for bank details. It's a request from presumably user to some service, but...
DEBUG: Answer preview: INVALID...
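As the log shows, the raw model output uses channel delimiters (`<|channel|>analysis<|message|>…<|end|>` for the reasoning and `<|channel|>final<|message|>…<|return|>` for the verdict), from which the analysis and answer are extracted. A hedged sketch of that extraction step (the delimiter format is taken from the log above; the regex-based parser is an assumption about how extraction could be done, not this model's actual implementation):

```python
import re

# Abbreviated raw output in the channel-tagged format from the debug log.
raw = (
    "<|channel|>analysis<|message|>Probably a phishing attempt. So INVALID."
    "<|end|><|start|>assistant<|channel|>final<|message|>INVALID<|return|>"
)

def extract(raw: str) -> tuple[str, str]:
    """Pull the analysis text and final answer out of channel-tagged output."""
    analysis = ""
    m = re.search(r"<\|channel\|>analysis<\|message\|>(.*?)<\|end\|>", raw, re.S)
    if m:
        analysis = m.group(1)
    answer = ""
    m = re.search(r"<\|channel\|>final<\|message\|>(.*?)<\|return\|>", raw, re.S)
    if m:
        answer = m.group(1)
    return analysis, answer

analysis, answer = extract(raw)
# answer == "INVALID"
```
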
Version Details
- Version ID
f10ffd6deb8a2fe73746ade74b5281b33d20c90d7c90b54d6068cfc0c072cbbe
- Version Created
- November 4, 2025