Handling Hallucinations — Ensuring Agent Accuracy

Why Hallucination Handling Matters

All LLMs sometimes generate plausible-sounding but incorrect information — this is called hallucination. In a multi-agent system, a hallucination in one agent's output can cascade through the entire crew, producing completely wrong results.

Why this matters for your career:

- Hallucination handling separates demo-quality agents from production-quality ones
- Building trust in AI systems requires managing errors transparently
- Regulatory compliance (GDPR, HIPAA) may demand accuracy guarantees
- Clients and users expect reliable outputs from AI systems

What Are Hallucinations?

| Type | Description | Example |
|------|-------------|---------|
| Factual error | Incorrect fact presented as true | "Taiwan's highest peak is 2,500m" (correct: 3,952m) |
| Fabricated data | Making up entities or events | Describing a campsite that doesn't exist |
| Calculation error | Wrong arithmetic or logic | "3 nights × $45 = $120" (correct: $135) |
| False citation | Referencing non-existent sources | "According to a 2025 Taiwan Camping Survey..." |
| Location error | Incorrect geographic information | "Sun Moon Lake is in Taipei" (it's in Nantou) |
| Temporal error | Wrong dates, seasons, or hours | "Open in January" when actually closed in winter |
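
Calculation errors are the one type in the table that can be checked mechanically: recompute the figure in code instead of trusting the model's arithmetic. The following is a minimal sketch, assuming the agent reports the number of nights, the nightly rate, and its claimed total as separate values. `check_total` is a hypothetical helper, not part of any library.

```python
from decimal import Decimal

def check_total(nights: int, price_per_night: str, claimed_total: str) -> bool:
    """Return True if the claimed total equals nights × price_per_night."""
    # Decimal avoids floating-point rounding surprises with money values
    expected = Decimal(nights) * Decimal(price_per_night)
    return expected == Decimal(claimed_total)

print(check_total(3, "45", "120"))  # False: the hallucinated total from the table
print(check_total(3, "45", "135"))  # True: the correct total
```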

Why Agents Hallucinate

| Cause | Explanation | Mitigation |
|-------|-------------|------------|
| Ambiguous instructions | Agent doesn't know what's real vs. generated | Be specific about data sources |
| No data access | Agent invents facts when lacking information | Provide lookup tools for real data |
| Overconfidence | LLMs are trained to be helpful, not to say "I don't know" | Instruct agents to acknowledge uncertainty |
| Context limits | Agent forgets early parts of long conversations | Summarize and pass only relevant context |
| Model bias | Training data contains outdated or incorrect info | Use up-to-date models with retrieval augmentation |

Detection Strategies

1. Confidence Scoring

Ask agents to rate their own confidence after each response:

After providing your answer, add a confidence score:

CONFIDENCE: HIGH / MEDIUM / LOW

- HIGH: You are certain — the information comes from your tools or reliable data sources
- MEDIUM: You are reasonably sure but some details might vary
- LOW: You are not sure — this may be incorrect or based on incomplete information

If your confidence is LOW, clearly state that the user should verify the information.
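
A confidence line is only useful if downstream code reads it. The sketch below shows one way to extract the score from an agent's text output, assuming the agent follows the `CONFIDENCE: ...` format from the prompt above. `extract_confidence` and `needs_review` are hypothetical helper names, and a missing score is treated the same as a LOW one.

```python
import re
from typing import Optional

# Matches a line such as "CONFIDENCE: HIGH", but not the echoed template
# "CONFIDENCE: HIGH / MEDIUM / LOW"
CONFIDENCE_RE = re.compile(
    r"^\s*CONFIDENCE:\s*(HIGH|MEDIUM|LOW)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_confidence(output: str) -> Optional[str]:
    """Return 'HIGH', 'MEDIUM', or 'LOW', or None if no valid score line exists."""
    matches = CONFIDENCE_RE.findall(output)
    if not matches:
        return None
    return matches[-1].upper()  # use the last score if the agent repeats it

def needs_review(output: str) -> bool:
    """Send missing and LOW scores to a human."""
    return extract_confidence(output) in (None, "LOW")
```

For example, `needs_review("Yu Shan is 3,952 meters tall.\nCONFIDENCE: HIGH")` returns `False`, while an answer with no score line returns `True`.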

2. Tool-Based Fact-Checking

Require agents to use tools for factual claims instead of guessing:

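import requests
from crewai import Agent
from crewai.tools import tool  # older releases: from crewai_tools import tool

@tool("Lookup Campsite")
def lookup_campsite(name: str) -> str:
    """Look up a campsite by name in the database.
    Returns NOT_FOUND if it doesn't exist, UNAVAILABLE if the lookup failed."""
    # Pass the name via params so requests URL-encodes spaces and symbols
    response = requests.get(
        "https://api.example.com/campsites", params={"name": name}, timeout=10
    )
    if response.status_code != 200:
        # A failed lookup is not evidence that the campsite doesn't exist
        return "UNAVAILABLE"
    data = response.json()
    if data:
        return str(data[0])
    return "NOT_FOUND"

@tool("Get Elevation")
def get_elevation(lat: float, lng: float) -> str:
    """Get elevation in meters for a coordinate location."""
    response = requests.get(
        f"https://api.open-elevation.com/api/v1/lookup?locations={lat},{lng}",
        timeout=10,
    )
    if response.status_code == 200:
        return str(response.json()['results'][0]['elevation'])
    return "UNAVAILABLE"

# Providing tools does not force the agent to use them,
# so the goal and backstory tell it to rely on tool results only
agent = Agent(
    role='Campsite Researcher',
    goal='Provide accurate campsite information using database lookups.',
    backstory=(
        'You state a fact only if a tool returned it. If a tool returns '
        'NOT_FOUND or UNAVAILABLE, you say so instead of guessing.'
    ),
    tools=[lookup_campsite, get_elevation],
    verbose=True
)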
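
Giving an agent tools does not guarantee that its final answer reflects what the tools returned. One inexpensive check is to compare the numbers in the answer with the numbers in the raw tool outputs. The sketch below assumes you have collected those tool outputs as strings. `ungrounded_numbers` is a hypothetical helper that covers numeric claims only, so it is a partial check at best.

```python
import re
from typing import Iterable, Set

NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_numbers(text: str) -> Set[float]:
    """Collect every number in the text, ignoring thousands separators."""
    return {float(m.replace(",", "")) for m in NUMBER_RE.findall(text)}

def ungrounded_numbers(answer: str, tool_outputs: Iterable[str]) -> Set[float]:
    """Return numbers that appear in the answer but in none of the tool outputs."""
    grounded: Set[float] = set()
    for output in tool_outputs:
        grounded |= extract_numbers(output)
    return extract_numbers(answer) - grounded

# The tool returned 3952, so a claim of 2,500 is flagged
print(ungrounded_numbers("Taiwan's highest peak is 2,500m.", ["3952"]))
```

A non-empty result means the answer contains at least one figure that no tool supplied, which is a reasonable trigger for a retry or human review.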

3. Cross-Verification

Call the same model twice independently for critical claims:

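from crewai import Agent, Task, Crew

checker_1 = Agent(
    role='Fact Checker 1',
    goal='Independently verify a factual claim.',
    backstory='You are a careful fact checker who answers only YES or NO.',
    llm='gpt-4o',
    verbose=True
)

checker_2 = Agent(
    role='Fact Checker 2',
    goal='Independently verify the same factual claim.',
    backstory='You are a careful fact checker who answers only YES or NO.',
    llm='gpt-4o',
    verbose=True
)

verify_task_1 = Task(
    description='Is this claim correct: "Yu Shan is 3,952 meters tall." Answer only YES or NO.',
    agent=checker_1,
    expected_output='YES or NO'
)

verify_task_2 = Task(
    description='Is this claim correct: "Yu Shan is 3,952 meters tall." Answer only YES or NO.',
    agent=checker_2,
    expected_output='YES or NO'
)

# Run each checker in its own crew. In a single sequential crew, the second
# task receives the first task's output as context by default, and kickoff()
# returns only the final task's answer, so the checks would not be independent.
answers = []
for checker, task in [(checker_1, verify_task_1), (checker_2, verify_task_2)]:
    crew = Crew(agents=[checker], tasks=[task])
    answers.append(str(crew.kickoff()).strip().upper())

if all(answer.startswith('YES') for answer in answers):
    print('Both say YES: higher confidence')
elif all(answer.startswith('NO') for answer in answers):
    print('Both say NO: treat the claim as false')
else:
    print('Checkers disagree: flag for human review')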
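
Once each checker's answer is available as a string, the decision rule can be kept in plain Python so it is easy to test. The sketch below assumes the answers have already been collected into a list. `normalize_vote` and `aggregate_votes` are hypothetical helpers, and any answer other than a clean YES or NO is routed to review rather than guessed at.

```python
from collections import Counter
from typing import Iterable

def normalize_vote(answer: str) -> str:
    """Map a raw checker answer to 'YES', 'NO', or 'INVALID'."""
    cleaned = answer.strip().strip('."\'').upper()
    return cleaned if cleaned in ("YES", "NO") else "INVALID"

def aggregate_votes(answers: Iterable[str]) -> str:
    """Return 'accept', 'reject', or 'needs_review'."""
    votes = Counter(normalize_vote(a) for a in answers)
    total = sum(votes.values())
    if total == 0 or votes["INVALID"]:
        return "needs_review"  # no votes, or a checker ignored the format
    if votes["YES"] == total:
        return "accept"
    if votes["NO"] == total:
        return "reject"
    return "needs_review"  # checkers disagree
```

Unanimous agreement between two calls to the same model raises confidence but does not prove the claim, since both calls can repeat the same error.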

4. Pydantic Validation

Use Pydantic models to validate structured outputs:

from typing import Optional

from crewai import Agent, Task, Crew
from pydantic import BaseModel, Field, field_validator

class CampsiteInfo(BaseModel):
    name: str = Field(..., min_length=2, max_length=100)
    elevation: Optional[int] = Field(None, ge=0, le=10000)  # meters
    has_water: bool
    has_toilet: bool
    price_per_night: Optional[float] = Field(None, ge=0, le=10000)
    region: str = Field(..., pattern='^(northern|central|southern|eastern)$')

    # Pydantic v2 validator syntax, matching the v2-only `pattern=` above
    @field_validator('name')
    @classmethod
    def must_be_known_campsite(cls, v: str) -> str:
        known_campsites = ['Sunset Ridge', 'Forest Creek', 'High Peak', 'Lakeside Haven']
        if v not in known_campsites:
            raise ValueError(f'Unknown campsite: {v}')
        return v

# Use the model in a task
camping_expert = Agent(
    role='Camping Expert',
    goal='Provide accurate, structured campsite information.',
    backstory='You are a camping guide who reports only facts you can verify.'
)

task = Task(
    description='Provide information about the campsite: Sunset Ridge',
    expected_output='Campsite name, elevation, water and toilet availability, price per night, and region',
    agent=camping_expert,
    output_pydantic=CampsiteInfo  # CrewAI parses the output into this model
)

crew = Crew(
    agents=[camping_expert],
    tasks=[task]
)

result = crew.kickoff()
# kickoff() returns a CrewOutput; the validated CampsiteInfo is on .pydantic
info = result.pydantic
print(f"Name: {info.name}")
print(f"Elevation: {info.elevation}m")
print(f"Has water: {info.has_water}")
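
It helps to see what validation does when the output is wrong. The sketch below uses Pydantic v2 on its own, without CrewAI, to parse two JSON strings standing in for agent output. It assumes a trimmed-down version of the model above; `parse_campsite` is a hypothetical helper, and "Moonlight Meadow" and the elevation figures are made-up sample values.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, field_validator

KNOWN_CAMPSITES = {"Sunset Ridge", "Forest Creek", "High Peak", "Lakeside Haven"}

class CampsiteInfo(BaseModel):
    name: str
    elevation: Optional[int] = Field(None, ge=0, le=10000)  # meters
    has_water: bool

    @field_validator("name")
    @classmethod
    def must_be_known_campsite(cls, v: str) -> str:
        if v not in KNOWN_CAMPSITES:
            raise ValueError(f"Unknown campsite: {v}")
        return v

def parse_campsite(raw_json: str) -> Optional[CampsiteInfo]:
    """Return a validated CampsiteInfo, or None if the output fails validation."""
    try:
        return CampsiteInfo.model_validate_json(raw_json)
    except ValidationError as exc:
        # Each error names the offending field, which is useful for logs or retries
        for err in exc.errors():
            print(f"Rejected field {err['loc']}: {err['msg']}")
        return None

# Made-up sample outputs: one valid, one with an unknown name and an
# out-of-range elevation
good = parse_campsite('{"name": "Sunset Ridge", "elevation": 1200, "has_water": true}')
bad = parse_campsite('{"name": "Moonlight Meadow", "elevation": 12000, "has_water": true}')
```

Validation of this kind catches unknown names and out-of-range values, but a wrong value that falls inside the allowed range still passes, so it complements tool lookups and does not replace them.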

5. Graceful Fallbacks

When an agent produces low-confidence output, fall back gracefully:

def safe_execute_crew(crew, inputs):
    """Execute a crew with hallucination detection and graceful fallback."""
    try:
        result = crew.kickoff(inputs=inputs)
        # Lowercase the output once; every marker below must be lowercase too
        result_str = str(result).lower()

        # Check for low-confidence indicators
        uncertainty_markers = ['i think', 'maybe', 'not sure', 'approximately',
                               'could be', 'possibly', 'might be', 'confidence: low']

        if any(marker in result_str for marker in uncertainty_markers):
            # Flag for human review
            print("Low confidence detected: flagging for review")
            return {
                'status': 'needs_review',
                'output': result,
                'message': 'The agent was uncertain. A human should verify this output.'
            }

        # Check for refusal indicators
        refusal_markers = ['i cannot', "i can't", 'unable to', 'do not have enough']
        if any(marker in result_str for marker in refusal_markers):
            print("Agent could not complete the task")
            return {
                'status': 'incomplete',
                'output': result,
                'message': 'The agent could not fully complete this task.'
            }

        return {'status': 'success', 'output': result}

    except Exception as e:
        # Covers API failures as well as output that fails schema validation
        print(f"Crew execution failed: {e}")
        return {
            'status': 'error',
            'output': None,
            'error': str(e),
            'message': 'An error occurred. Please try again or contact support.'
        }
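
A fallback often works better with a bounded retry in front of it, since many failures are transient. The sketch below is a generic wrapper, assuming the work to retry is passed in as a zero-argument callable. `run_with_fallback` and its parameters are hypothetical names, not part of CrewAI.

```python
import time
from typing import Any, Callable, Dict

def run_with_fallback(
    run: Callable[[], Any],
    fallback: Any,
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Dict[str, Any]:
    """Call run() up to `attempts` times, then return the fallback value."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return {"status": "success", "output": run(), "attempts": attempt}
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)  # pause before the next try
    return {
        "status": "fallback",
        "output": fallback,
        "attempts": attempts,
        "error": str(last_error),
    }
```

With a crew, the callable could be `lambda: crew.kickoff(inputs=inputs)`, and the fallback could be a fixed message telling the user that no verified answer is available.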

Putting It All Together

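# Complete production pattern:
# 1. Tools provide real data
# 2. Task asks for a confidence rating
# 3. Pydantic model validates output
# 4. Cross-verify critical claims (run the checkers from strategy 3 on the result)
# 5. Graceful fallback on failure

from crewai import Agent, Task, Crew, Process
from pydantic import BaseModel, Field

# lookup_campsite and safe_execute_crew are the functions defined earlier in this chapter

class Recommendation(BaseModel):
    campsite_name: str = Field(..., min_length=2)
    confidence: str = Field(..., pattern='^(HIGH|MEDIUM|LOW)$')
    reason: str = Field(..., min_length=10)

camping_expert = Agent(
    role='Camping Expert',
    goal='Recommend only campsites that the lookup tool confirms exist.',
    backstory='You are a camping guide who verifies every campsite before recommending it.',
    tools=[lookup_campsite]
)

research_task = Task(
    description='''
    Recommend a campsite for this request: {query}
    Use the lookup tool to verify the campsite exists.
    Set confidence to HIGH, MEDIUM, or LOW.
    ''',
    agent=camping_expert,
    output_pydantic=Recommendation,
    # Don't demand HIGH here, or the agent will claim it regardless
    expected_output='A Recommendation with campsite_name, confidence, and reason'
)

crew = Crew(
    agents=[camping_expert],
    tasks=[research_task],
    process=Process.sequential,
    verbose=True
)

# The query value fills the {query} placeholder in the task description
result = safe_execute_crew(crew, inputs={'query': 'camping near Taipei with water and toilet'})

if result['status'] == 'success':
    rec = result['output'].pydantic  # validated Recommendation
    if rec.confidence == 'HIGH':
        print(f"✅ Recommendation: {rec.campsite_name} ({rec.reason})")
    else:
        print(f"⚠️ {rec.confidence} confidence, needs human review: {rec.campsite_name}")
elif result['status'] == 'needs_review':
    print(f"⚠️ Needs human review: {result['output']}")
else:
    print(f"❌ Not completed: {result['message']}")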

Best Practices Summary

| Practice | Why |
|----------|-----|
| Always provide tools for real data | Agents should look up facts, never guess |
| Include confidence scoring in tasks | Users and downstream systems know how much to trust the output |
| Cross-verify critical claims | Two independent calls reduce hallucination risk |
| Validate outputs with Pydantic schemas | Catch malformed or invalid responses early |
| Implement graceful fallbacks | System should degrade gracefully, not crash |
| Log all agent outputs | Detect hallucination patterns over time |
| Add human review gates for critical decisions | Some decisions need human judgment |
| Use up-to-date models | Newer models tend to hallucinate less, though none are immune |
| Keep prompts specific | Vague prompts increase hallucination probability |
| Limit each agent's scope | Narrower expertise means fewer opportunities to hallucinate |
| Include relevant context from previous tasks | Agents stay aware of earlier results |
| Test with adversarial inputs | Verify your hallucination detection works |
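
Logging is the practice in this table that the earlier sections do not show. A minimal sketch follows, assuming one JSON record per line in a local file. `log_agent_output` and `review_rate` are hypothetical helpers, and the status values follow the `success` / `needs_review` / `incomplete` / `error` convention used in the fallback example.

```python
import json
import time
from pathlib import Path

def log_agent_output(log_path: str, task_name: str, output: str, status: str) -> dict:
    """Append one agent output to a JSON Lines file and return the record."""
    record = {
        "timestamp": time.time(),
        "task": task_name,
        "status": status,
        "output": output,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

def review_rate(log_path: str) -> float:
    """Fraction of logged outputs whose status is anything other than 'success'."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r["status"] != "success")
    return flagged / len(records)
```

Tracking `review_rate` over time gives a rough signal of whether a prompt, tool, or model change made flagged outputs more or less frequent.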

Summary

Hallucinations are a fundamental challenge in LLM-based systems. Mitigate them by providing real data tools, requiring confidence scores, cross-verifying claims, validating with schemas, and implementing graceful fallbacks. No single check catches every hallucination, so production systems should combine automated validation with human oversight.

Key takeaways:

- Hallucinations = plausible-sounding false information generated by LLMs
- Causes: ambiguous instructions, no data access, overconfidence, context limits, model bias
- Detection strategies: confidence scoring, tool-based fact-checking, cross-verification
- Validation: Pydantic schemas catch structural errors in agent outputs
- Fallbacks: degrade gracefully when confidence is low or execution fails
- Always provide real data tools and never let agents guess facts
- Log all outputs to detect hallucination patterns over time
- Combine automated checks with human review for critical decisions
- Use up-to-date models and keep prompts specific
- Limit each agent's scope to reduce hallucination opportunities

What's Next: Output Parsers

The next chapter covers output parsers — using Pydantic models to validate, structure, and parse agent outputs for reliable downstream processing.