Commit 5b2f338

Parameterize LLM returning reasoning

1 parent 8b2e4c3
17 files changed: +278 -76 lines changed

docs/ref/checks/custom_prompt_check.md
Lines changed: 5 additions & 0 deletions

@@ -20,6 +20,10 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
 - **`model`** (required): Model to use for the check (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

 ## Implementation Notes

@@ -42,3 +46,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether the custom validation criteria were met
 - **`confidence`**: Confidence score (0.0 to 1.0) for the validation
 - **`threshold`**: The confidence threshold that was configured
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
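For orientation, here is a minimal sketch (not part of the commit) of the two `info` payload shapes this section documents; the values are illustrative.

```python
# Illustrative only: the two shapes of the `info` dictionary documented above.

# Default, include_reasoning=false: essential fields only.
info_minimal = {
    "flagged": True,
    "confidence": 0.86,
    "threshold": 0.7,
}

# include_reasoning=true: the same fields plus the explanation.
info_with_reason = {
    "flagged": True,
    "confidence": 0.86,
    "threshold": 0.7,
    "reason": "Matches the custom criteria given in system_prompt_details.",
}
```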

docs/ref/checks/hallucination_detection.md
Lines changed: 15 additions & 8 deletions

@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
   "config": {
     "model": "gpt-4.1-mini",
     "confidence_threshold": 0.7,
-    "knowledge_source": "vs_abc123"
+    "knowledge_source": "vs_abc123",
+    "include_reasoning": false
   }
 }
 ```

@@ -24,6 +25,10 @@ Flags model text containing factual claims that are clearly contradicted or not
 - **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
+- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
+  - When `false`: Returns only `flagged` and `confidence` to save tokens
+  - When `true`: Additionally, returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
+  - Recommended: Keep disabled for production (default); enable for development/debugging

 ### Tuning guidance

@@ -102,7 +107,9 @@ See [`examples/hallucination_detection/`](https://github.com/openai/openai-guard

 ## What It Returns

-Returns a `GuardrailResult` with the following `info` dictionary:
+Returns a `GuardrailResult` with the following `info` dictionary.
+
+**With `include_reasoning=true`:**

 ```json
 {

@@ -117,15 +124,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```

+### Fields
+
 - **`flagged`**: Whether the content was flagged as potentially hallucinated
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
-- **`reasoning`**: Explanation of why the content was flagged
-- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
-- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
-- **`verified_statements`**: Statements that are supported by your documents
 - **`threshold`**: The confidence threshold that was configured
-
-Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
+- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
+- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
+- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
+- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*

 ## Benchmark Results
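Because the reasoning fields are only present when `include_reasoning=true`, downstream code should read them defensively rather than indexing them unconditionally. A sketch (not from the commit), assuming the returned `GuardrailResult` exposes the documented dictionary as `result.info`:

```python
# Sketch: consuming the hallucination-detection result without assuming the
# optional reasoning fields exist.
def summarize(result) -> str:
    info = result.info
    line = f"flagged={info['flagged']} confidence={info['confidence']:.2f} threshold={info['threshold']}"
    reasoning = info.get("reasoning")  # absent when include_reasoning=false
    if reasoning is not None:
        hallucinated = info.get("hallucinated_statements") or []
        line += f" | {len(hallucinated)} unsupported statement(s): {reasoning}"
    return line
```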

docs/ref/checks/jailbreak.md
Lines changed: 7 additions & 2 deletions

@@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
   "name": "Jailbreak",
   "config": {
     "model": "gpt-4.1-mini",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```

@@ -42,6 +43,10 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt

 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

 ### Tuning guidance

@@ -70,7 +75,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether a jailbreak attempt was detected
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
-- **`reason`**: Explanation of why the input was flagged (or not flagged)
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
 - **`used_conversation_history`**: Boolean indicating whether conversation history was analyzed
 - **`checked_text`**: JSON payload containing the conversation history and latest input that was analyzed
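An illustrative gate over these documented fields (the library's own tripwire logic lives in the package; this is only a sketch of how the threshold and the optional `reason` field interact):

```python
import logging

logger = logging.getLogger("guardrails.jailbreak")


def should_block(info: dict) -> bool:
    """Illustrative gate over the documented info fields; not the library's tripwire implementation."""
    triggered = info["flagged"] and info["confidence"] >= info["threshold"]
    if triggered:
        # `reason` is only present when include_reasoning=true.
        logger.warning("Jailbreak tripwire hit: %s", info.get("reason", "reasoning disabled"))
    return triggered
```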

docs/ref/checks/llm_base.md
Lines changed: 6 additions & 1 deletion

@@ -9,7 +9,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
   "name": "LLM Base",
   "config": {
     "model": "gpt-5",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```

@@ -18,6 +19,10 @@ Base configuration for LLM-based guardrails. Provides common configuration optio

 - **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `true`: The LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
+  - When `false`: The LLM only returns the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

 ## What It Does
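These options map onto the shared `LLMConfig` model extended in `llm_base.py` later in this commit. A sketch of constructing it directly in Python, assuming the import path follows the file location `src/guardrails/checks/text/llm_base.py`:

```python
# Sketch: the base config for LLM-based checks, with the new flag.
from guardrails.checks.text.llm_base import LLMConfig

config = LLMConfig(
    model="gpt-4.1-mini",      # example model name
    confidence_threshold=0.7,
    include_reasoning=False,   # default; flip to True while developing/debugging
)

# The model is declared with extra="forbid", so misspelled keys raise a validation error.
```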

docs/ref/checks/nsfw.md
Lines changed: 5 additions & 0 deletions

@@ -29,6 +29,10 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit

 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

 ### Tuning guidance

@@ -51,6 +55,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether NSFW content was detected
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

 ### Examples

docs/ref/checks/off_topic_prompts.md
Lines changed: 6 additions & 1 deletion

@@ -20,6 +20,10 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
 - **`model`** (required): Model to use for analysis (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Description of your business scope and acceptable topics
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+  - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
+  - When `true`: Additionally, returns detailed reasoning for its decisions
+  - **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

 ## Implementation Notes

@@ -40,5 +44,6 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 ```

 - **`flagged`**: Whether the content aligns with your business scope
-- **`confidence`**: Confidence score (0.0 to 1.0) for the prompt injection detection assessment
+- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
 - **`threshold`**: The confidence threshold that was configured
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

docs/ref/checks/prompt_injection_detection.md
Lines changed: 10 additions & 2 deletions

@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
   "name": "Prompt Injection Detection",
   "config": {
     "model": "gpt-4.1-mini",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```

@@ -40,6 +41,10 @@ After tool execution, the prompt injection detection check validates that the re

 - **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
+  - When `true`: Returns detailed `observation` explaining what the action is doing and `evidence` with specific quotes/details
+  - When `false`: Omits reasoning fields to save tokens (typically 100-300 tokens per check)
+  - Recommended: Keep disabled for production (default); enable for development/debugging

 **Flags as MISALIGNED:**

@@ -77,13 +82,16 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```

-- **`observation`**: What the AI action is doing
+- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
+- **`evidence`**: Specific evidence from conversation supporting the decision - *only included when `include_reasoning=true`*
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment

+**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
+
 ## Benchmark Results

 ### Dataset Description
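A Python mirror of the JSON config above, showing the recommended split between a production and a debugging variant; how these dicts are loaded depends on your setup, so treat this as an illustration only.

```python
# Illustrative config variants for the Prompt Injection Detection check.
PROD_CHECK = {
    "name": "Prompt Injection Detection",
    "config": {
        "model": "gpt-4.1-mini",
        "confidence_threshold": 0.7,
        "include_reasoning": False,  # omit observation/evidence (saves roughly 100-300 tokens per check)
    },
}

DEV_CHECK = {
    **PROD_CHECK,
    "config": {**PROD_CHECK["config"], "include_reasoning": True},  # adds observation + evidence
}
```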

src/guardrails/checks/text/hallucination_detection.py
Lines changed: 6 additions & 13 deletions

@@ -94,8 +94,8 @@ class HallucinationDetectionOutput(LLMOutput):
     Extends the base LLM output with hallucination-specific details.

     Attributes:
-        flagged (bool): Whether the content was flagged as potentially hallucinated.
-        confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated.
+        flagged (bool): Whether the content was flagged as potentially hallucinated (inherited).
+        confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated (inherited).
         reasoning (str): Detailed explanation of the analysis.
         hallucination_type (str | None): Type of hallucination detected.
         hallucinated_statements (list[str] | None): Specific statements flagged as

@@ -104,16 +104,6 @@ class HallucinationDetectionOutput(LLMOutput):
             by the documents.
     """

-    flagged: bool = Field(
-        ...,
-        description="Indicates whether the content was flagged as potentially hallucinated.",
-    )
-    confidence: float = Field(
-        ...,
-        description="Confidence score (0.0 to 1.0) that the input is hallucinated.",
-        ge=0.0,
-        le=1.0,
-    )
     reasoning: str = Field(
         ...,
         description="Detailed explanation of the hallucination analysis.",

@@ -245,12 +235,15 @@ async def hallucination_detection(
     # Create the validation query
     validation_query = f"{VALIDATION_PROMPT}\n\nText to validate:\n{candidate}"

+    # Use HallucinationDetectionOutput (with reasoning fields) if enabled, otherwise base LLMOutput
+    output_format = HallucinationDetectionOutput if config.include_reasoning else LLMOutput
+
     # Use the Responses API with file search and structured output
     response = await _invoke_openai_callable(
         ctx.guardrail_llm.responses.parse,
         input=validation_query,
         model=config.model,
-        text_format=HallucinationDetectionOutput,
+        text_format=output_format,
         tools=[{"type": "file_search", "vector_store_ids": [config.knowledge_source]}],
     )

src/guardrails/checks/text/jailbreak.py
Lines changed: 5 additions & 12 deletions

@@ -40,8 +40,6 @@
 import textwrap
 from typing import Any

-from pydantic import Field
-
 from guardrails.registry import default_spec_registry
 from guardrails.spec import GuardrailSpecMetadata
 from guardrails.types import GuardrailLLMContextProto, GuardrailResult, token_usage_to_dict

@@ -50,6 +48,7 @@
     LLMConfig,
     LLMErrorOutput,
     LLMOutput,
+    LLMReasoningOutput,
     create_error_result,
     run_llm,
 )

@@ -226,15 +225,6 @@
 MAX_CONTEXT_TURNS = 10


-class JailbreakLLMOutput(LLMOutput):
-    """LLM output schema including rationale for jailbreak classification."""
-
-    reason: str = Field(
-        ...,
-        description=("Justification for why the input was flagged or not flagged as a jailbreak."),
-    )
-
-
 def _build_analysis_payload(conversation_history: list[Any] | None, latest_input: str) -> str:
     """Return a JSON payload with recent turns and the latest input."""
     trimmed_input = latest_input.strip()

@@ -251,12 +241,15 @@ async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig)
     conversation_history = getattr(ctx, "get_conversation_history", lambda: None)() or []
     analysis_payload = _build_analysis_payload(conversation_history, data)

+    # Use LLMReasoningOutput (with reason) if reasoning is enabled, otherwise use base LLMOutput
+    output_model = LLMReasoningOutput if config.include_reasoning else LLMOutput
+
     analysis, token_usage = await run_llm(
         analysis_payload,
         SYSTEM_PROMPT,
         ctx.guardrail_llm,
         config.model,
-        JailbreakLLMOutput,
+        output_model,
     )

     if isinstance(analysis, LLMErrorOutput):
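Since the parsed `analysis` is now either a plain `LLMOutput` or an `LLMReasoningOutput` depending on `config.include_reasoning`, downstream handling can stay agnostic to which model was used. A simplified fragment (not the check's full result handling):

```python
# `analysis` comes from the run_llm() call above; `reason` only exists on
# LLMReasoningOutput, so read it without assuming it is present.
reason = getattr(analysis, "reason", None)
info = {"flagged": analysis.flagged, "confidence": analysis.confidence}
if reason is not None:
    info["reason"] = reason
```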

src/guardrails/checks/text/llm_base.py
Lines changed: 39 additions & 3 deletions

@@ -73,6 +73,7 @@ class MyLLMOutput(LLMOutput):
     "LLMConfig",
     "LLMErrorOutput",
     "LLMOutput",
+    "LLMReasoningOutput",
     "create_error_result",
     "create_llm_check_fn",
 ]

@@ -87,6 +88,9 @@ class LLMConfig(BaseModel):
         model (str): The LLM model to use for checking the text.
         confidence_threshold (float): Minimum confidence required to trigger the guardrail,
             as a float between 0.0 and 1.0.
+        include_reasoning (bool): Whether to include reasoning/explanation in guardrail
+            output. Useful for development and debugging, but can be disabled in production
+            to save tokens. Defaults to False.
     """

     model: str = Field(..., description="LLM model to use for checking the text")

@@ -96,6 +100,13 @@ class LLMConfig(BaseModel):
         ge=0.0,
         le=1.0,
     )
+    include_reasoning: bool = Field(
+        False,
+        description=(
+            "Include reasoning/explanation fields in output. "
+            "Defaults to False for token efficiency. Enable for development/debugging."
+        ),
+    )

     model_config = ConfigDict(extra="forbid")

@@ -117,6 +128,19 @@ class LLMOutput(BaseModel):
     confidence: float


+class LLMReasoningOutput(LLMOutput):
+    """Extended LLM output schema with reasoning explanation.
+
+    Extends LLMOutput to include a reason field explaining the decision.
+    This is the standard extended output for guardrails that include reasoning.
+
+    Attributes:
+        reason (str): Explanation for why the input was flagged or not flagged.
+    """
+
+    reason: str = Field(..., description="Explanation for the flagging decision")
+
+
 class LLMErrorOutput(LLMOutput):
     """Extended LLM output schema with error information.

@@ -399,7 +423,7 @@ def create_llm_check_fn(
     name: str,
     description: str,
     system_prompt: str,
-    output_model: type[LLMOutput] = LLMOutput,
+    output_model: type[LLMOutput] | None = None,
     config_model: type[TLLMCfg] = LLMConfig,  # type: ignore[assignment]
 ) -> CheckFn[GuardrailLLMContextProto, str, TLLMCfg]:
     """Factory for constructing and registering an LLM-based guardrail check_fn.

@@ -409,17 +433,25 @@ def create_llm_check_fn(
     use the configured LLM to analyze text, validate the result, and trigger if
     confidence exceeds the provided threshold.

+    When `include_reasoning=True` in the config, the guardrail will automatically
+    use an extended output model with a `reason` field. When `include_reasoning=False`,
+    it uses the base `LLMOutput` model (only `flagged` and `confidence` fields).
+
     Args:
         name (str): Name under which to register the guardrail.
         description (str): Short explanation of the guardrail's logic.
         system_prompt (str): Prompt passed to the LLM to control analysis.
-        output_model (type[LLMOutput]): Schema for parsing the LLM output.
+        output_model (type[LLMOutput] | None): Custom schema for parsing the LLM output.
+            If None (default), uses `LLMReasoningOutput` when reasoning is enabled.
+            Provide a custom model only if you need additional fields beyond `reason`.
         config_model (type[LLMConfig]): Configuration schema for the check_fn.

     Returns:
         CheckFn[GuardrailLLMContextProto, str, TLLMCfg]: Async check function
             to be registered as a guardrail.
     """
+    # Default to LLMReasoningOutput if no custom model provided
+    extended_output_model = output_model or LLMReasoningOutput

     async def guardrail_func(
         ctx: GuardrailLLMContextProto,

@@ -441,12 +473,16 @@ async def guardrail_func(
         else:
             rendered_system_prompt = system_prompt

+        # Use base LLMOutput if reasoning is disabled, otherwise use the extended model
+        include_reasoning = getattr(config, "include_reasoning", False)
+        selected_output_model = extended_output_model if include_reasoning else LLMOutput
+
         analysis, token_usage = await run_llm(
             data,
             rendered_system_prompt,
             ctx.guardrail_llm,
             config.model,
-            output_model,
+            selected_output_model,
         )

         # Check if this is an error result
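For callers of the updated factory, leaving `output_model` as `None` is enough to pick up the new behavior. A sketch under stated assumptions: the check name, description, and prompt here are hypothetical, and the import path is assumed from the file location.

```python
# Sketch: registering a new LLM-based check against the updated factory default.
from guardrails.checks.text.llm_base import create_llm_check_fn

contains_pii = create_llm_check_fn(
    name="Contains PII",
    description="Flags text that appears to contain personally identifiable information.",
    system_prompt="Decide whether the provided text contains personally identifiable information.",
    # output_model left as None: the guardrail parses into LLMReasoningOutput when the
    # config sets include_reasoning=True, and into plain LLMOutput otherwise.
)
```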
