lucataco/gpt-oss-safeguard-120b 🔢📝❓ → ❓
About
classify text content based on safety policies that you provide and perform a suite of foundational safety tasks
Example Output
Prompt:
"Your bank details are needed to complete this transaction."
Output
{"answer":"INVALID","analysis":"The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VALID or INVALID." So output classification. Likely output: INVALID."}
Performance Metrics
18.12s
Prediction Time
18.13s
Total Time
All Input Parameters
{
"top_p": 1,
"policy": "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts.",
"prompt": "Your bank details are needed to complete this transaction.",
"temperature": 1,
"max_new_tokens": 512,
"reasoning_effort": "medium",
"repetition_penalty": 1
}
Input Parameters
- top_p
- Nucleus sampling top-p
- policy
- Safety policy or rules to evaluate the prompt against. This defines the criteria for classification.
- prompt
- Input prompt or question to evaluate
- temperature
- Sampling temperature (0 for greedy decoding)
- max_new_tokens
- Maximum number of tokens to generate
- reasoning_effort
- Reasoning effort level: 'low' for faster responses on straightforward cases, 'medium' for balanced reasoning (default), or 'high' for deeper analysis of complex edge cases
- repetition_penalty
- Repetition penalty to reduce repetitive text (1.0 = no penalty)
Output Schema
Output
Example Execution Logs
DEBUG: Raw output preview (first 1000 chars): <|channel|>analysis<|message|>The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, which could be a phishing attempt? The prompt says: "Spam Policy: Classify content as VALID (no spam) or INVALID (spam). Spam includes unsolicited promotional content, deceptive messages, or phishing attempts." The content "Your bank details are needed to complete this transaction." That is presumably a request for personal info, possibly a scam or phishing. It is likely "INVALID" as spam / phishing. However classification: It's a statement that "Your bank details are needed to complete this transaction." It's a request for bank details, but more context: Are they asking for bank details? It may be considered phishing or solicitation for personal info. So classify as INVALID (spam). The answer should be a classification. The instructions: "Classify content as VA DEBUG: Extracted analysis length: 1036, answer length: 7 DEBUG: Analysis preview: The user says: "Your bank details are needed to complete this transaction." This is possibly a request for bank details? The classification is about spam. The content is a request for bank details, wh... DEBUG: Answer preview: INVALID...
Version Details
- Version ID
6db83709d19776b2302ae67ce9f761771f18eb928881cb9bead50c49d62d6b0b- Version Created
- November 4, 2025