Guardrails for LLMs in Production
Confident AI lets you safeguard your LLM applications against malicious inputs and unsafe outputs with a single line of code. Each guardrail acts as a binary metric that evaluates user inputs and/or LLM responses for malicious intent and unsafe behavior.
Confident AI offers 10+ guards designed to detect 20+ LLM vulnerabilities.
Types of Guardrails
There are two types of guardrails: input guards, which protect against malicious inputs before they reach your LLM, and output guards, which evaluate the LLM's responses before they reach your users.
How many guards you set up, and whether you use both types, depends on your priorities around latency, cost, and LLM safety.
While most guards are strictly input guards or output guards, some, such as CybersecurityGuard, offer both input and output guarding capabilities. For these guards, you can specify the guard_type during initialization.
from deepeval.guardrails import CybersecurityGuard, GuardType

guard = CybersecurityGuard(
    guard_type=GuardType.INPUT,  # or GuardType.OUTPUT
    ...
)
List of Guards
Confident AI offers a robust selection of input and output guards for comprehensive protection:
Input Guards
Input Guard | Description |
---|---|
CybersecurityGuard | Detects cybersecurity attacks such as SQL injection, shell command injection, or role-based access control (RBAC) violations. |
PromptInjectionGuard | Detects prompt injection attacks where secondary commands are embedded within user inputs to manipulate the system's behavior. |
JailbreakingGuard | Detects jailbreaking attacks disguised as safe inputs, which attempt to bypass system restrictions or ethical guidelines. |
PrivacyGuard | Detects inputs that leak private information such as personally identifiable information (PII), API keys, or sensitive credentials. |
TopicalGuard | Detects inputs that are irrelevant to the context or violate the defined topics or usage boundaries. |
Output Guards
Output Guard | Description |
---|---|
GraphicContentGuard | Detects outputs containing explicit, violent, or graphic content to ensure safe and appropriate responses. |
HallucinationGuard | Identifies outputs that include factually inaccurate or fabricated information generated by the model. |
IllegalGuard | Detects outputs that promote, support, or facilitate illegal activities or behavior. |
ModernizationGuard | Detects outputs containing outdated information that does not reflect modern standards or current knowledge. |
SyntaxGuard | Detects outputs that are syntactically incorrect or do not adhere to proper grammatical standards, ensuring clear communication. |
ToxicityGuard | Detects harmful, offensive, or toxic outputs to maintain respectful and safe interactions. |
CybersecurityGuard | Detects outputs that may have been breached or manipulated as a result of a cyberattack. |
Guarding your LLM Application
Guardrails Object
To begin guarding your LLM application, import and initialize the guards you need from deepeval.guardrails and pass them to the Guardrails object. Then call the guard method with the user input and LLM response.
from deepeval.guardrails import Guardrails
from deepeval.guardrails import HallucinationGuard, TopicalGuard

# Example user input and LLM response to evaluate
input = "Is the earth flat?"
llm_response = "The earth is flat"

# Initialize your guards
guardrails = Guardrails(
    guards=[
        HallucinationGuard(),
        TopicalGuard(allowed_topics=["health and technology"])
    ]
)

# Call the guard method
guard_result = guardrails.guard(input, llm_response)
When using Guardrails.guard, you must provide both input and llm_response, even if you have selected only one type of guard.
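In a typical request flow, you might generate a response first and only return it to the user if no guard is breached. The following is a minimal sketch assuming the guardrails object initialized above; generate_llm_response is a hypothetical placeholder for your own generation logic.

def generate_llm_response(user_input: str) -> str:
    # Hypothetical placeholder for your own LLM call
    ...

def handle_request(user_input: str) -> str:
    llm_response = generate_llm_response(user_input)

    # Evaluate both the input and the response in a single call
    guard_result = guardrails.guard(user_input, llm_response)

    if guard_result.breached:
        # At least one guard failed; fall back to a safe response
        return "Sorry, I can't help with that."
    return llm_response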
Using Individual Guards
Alternatively, you can also call the guard method on individual guards.
from deepeval.guardrails import HallucinationGuard
llm_response = "The earth is flat"
hallucination_guard = HallucinationGuard()
guard_result = hallucination_guard.guard(response=llm_response)
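Because an individual guard returns a GuardScore rather than a GuardResult (see Interpreting Guard Results below), you can check its score directly. A minimal sketch, continuing from the example above:

# A score of 1 means the guard was breached, 0 means it passed
if guard_result.score == 1:
    print(f"{guard_result.guard} breached: {guard_result.score_breakdown}")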
Async Mode
Confident AI also supports asynchronous guard calls (for individual guards) through the a_guard method.
from deepeval.guardrails import HallucinationGuard, ToxicityGuard
import asyncio

# Example LLM response to evaluate
llm_response = "The earth is flat"

# Initialize multiple guards
hallucination_guard = HallucinationGuard()
toxicity_guard = ToxicityGuard()

async def main():
    # Create asynchronous guard tasks
    guard_tasks = [
        hallucination_guard.a_guard(response=llm_response),  # Check for hallucinations
        toxicity_guard.a_guard(response=llm_response),  # Check for toxicity
    ]
    # Run all guard tasks concurrently and gather results
    return await asyncio.gather(*guard_tasks)

results = asyncio.run(main())
Interpreting Guard Results
The Guardrails.guard method returns a GuardResult object, which can be used to retry LLM response generation when necessary.
class GuardResult(BaseModel):
    breached: bool
    guard_scores: List[GuardScore]
The breached property is True if any guard has failed. Detailed scores and breakdowns for each guard are available in the guard_scores list. A score of 1 indicates the guard was breached, while 0 means it passed.
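For example, you could use breached to regenerate the response a limited number of times before giving up. A minimal sketch, assuming the guardrails object and input/llm_response from earlier, plus the hypothetical generate_llm_response function from above:

MAX_RETRIES = 2  # hypothetical retry budget

guard_result = guardrails.guard(input, llm_response)

# Regenerate the response while any guard is breached
retries = 0
while guard_result.breached and retries < MAX_RETRIES:
    llm_response = generate_llm_response(input)
    guard_result = guardrails.guard(input, llm_response)
    retries += 1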
If using individual guards, the guard method directly returns a GuardScore.
class GuardScore(BaseModel):
    guard: str
    score: int
    score_breakdown: Union[List, Dict]
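Whether you read them from GuardResult.guard_scores or receive one directly from an individual guard, GuardScore objects can be inspected for logging or debugging. A minimal sketch, assuming the guard_result returned by Guardrails.guard above:

for guard_score in guard_result.guard_scores:
    status = "breached" if guard_score.score == 1 else "passed"
    print(f"{guard_score.guard}: {status}")
    print(guard_score.score_breakdown)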