Toxicity Guard

The Toxicity Guard is an output guard that analyzes the responses generated by your language model to detect harmful or toxic language, including offensive language, hate speech, harassment, and other forms of abusive content, helping ensure that all outputs are respectful and appropriate.

info

ToxicityGuard is only available as an output guard.

Example

from deepeval.guardrails import ToxicityGuard

model_output = "You're a complete idiot for thinking that way."

# Initialize the guard; no arguments are required.
toxicity_guard = ToxicityGuard()
# Run the guard on your LLM application's output.
guard_result = toxicity_guard.guard(response=model_output)

There are no required arguments when initializing the ToxicityGuard object. The guard function accepts a single parameter, response, which is the output of your LLM application.
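
As a usage pattern, you can wrap the guard in a small helper that returns a fallback reply whenever the guard is breached (using the guard_result.score field explained below). This is a minimal sketch; the safe_output helper and the fallback message are illustrative, not part of the deepeval API:

from deepeval.guardrails import ToxicityGuard

toxicity_guard = ToxicityGuard()

def safe_output(model_output: str) -> str:
    # Guard the raw model output before returning it to the user.
    guard_result = toxicity_guard.guard(response=model_output)
    if guard_result.score == 1:
        # Guard breached: return an illustrative fallback instead of the toxic output.
        return "Sorry, I can't respond to that."
    return model_output

print(safe_output("You're a complete idiot for thinking that way."))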

Interpreting Guard Result

print(guard_result.score)
print(guard_result.score_breakdown)

guard_result.score is an integer that is 1 if the guard has been breached and 0 otherwise. The score_breakdown for ToxicityGuard is a dictionary containing:

  • score: A binary value (1 or 0), where 1 indicates that toxic content was detected.
  • reason: A brief explanation of why the score was assigned.

For the model output above, the score_breakdown looks like this:

{
  "score": 1,
  "reason": "The output contains a personal attack, specifically calling someone an 'idiot' in a derogatory manner."
}
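
In practice, you can branch on the score and use the reason for logging or auditing. Here is a minimal sketch that reuses guard_result from the example above:

if guard_result.score == 1:
    # Guard breached: surface the explanation for logging or review.
    print(f"Toxic output detected: {guard_result.score_breakdown['reason']}")
else:
    print("Output passed the toxicity check.")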