How to Build an AI Analytics Agent That Doesn't Hallucinate

Jake McCluskey

You build an AI analytics agent by separating what LLMs do well (understanding intent, explaining results) from what they do poorly (executing precise calculations on real data). The solution is a hybrid architecture where an LLM translates natural language queries into strict JSON specifications, a deterministic execution engine runs pre-written code against your data, and the LLM only touches the results for interpretation. This prevents the agent from inventing numbers or improvising data transformations, which is the core failure mode when LLMs handle analytics end to end.

Why LLMs Fail at Enterprise Analytics

LLMs are trained to generate plausible text, not accurate calculations. When you ask GPT-4 or Claude to "analyze sales data and find the top performing regions," it's optimized to produce an answer that sounds reasonable, not one that's mathematically verifiable. In testing, LLM-generated analytics code produces incorrect results roughly 30-40% of the time when handling real-world data edge cases like null values, date parsing, or multi-step aggregations.

The problem gets worse with scale. A single hallucinated number in a quarterly report can cascade into bad strategic decisions. When LLMs write pandas code on the fly, they might use deprecated methods, mishandle data types, or create logic errors that only surface with specific data patterns. You can't reproduce the analysis because the LLM might generate different code next time.

Traditional approaches like RAG pipelines help with factual grounding but don't solve the execution problem. RAG can provide context about your data schema, but it doesn't prevent the LLM from writing buggy transformation logic or making arithmetic mistakes.

What Is a Hybrid AI Architecture for Analytics Agents

A hybrid AI architecture splits your analytics agent into three distinct components with clear boundaries. The Analysis Planner uses an LLM to interpret user intent and translate it into a structured specification. The Analysis Engine executes pre-written, tested code based on that specification without any LLM involvement. The Results Interpreter uses an LLM to explain findings in natural language.

Here's the critical insight: the LLM never sees or processes your raw data directly. It only works with metadata (column names, data types, available operations) and final results (aggregated numbers, not individual records). All actual data manipulation happens in deterministic code that you've written, tested, and version-controlled.
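
As a rough illustration of that boundary, the sketch below builds the kind of metadata payload a planner could be given. It assumes the dataset is a list of plain dict records (a stand-in for a DataFrame); `build_schema_metadata` and `OPERATIONS` are hypothetical names, not part of any library.

```python
from typing import Any

# Hypothetical list of operation names exposed to the planner
OPERATIONS = ["calculate_growth_rate", "segment_by_quartile"]


def build_schema_metadata(rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Describe the dataset without including any record values."""
    columns: dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                # Nulls carry no type information; wait for a non-null value
                columns.setdefault(name, "unknown")
            elif columns.get(name, "unknown") == "unknown":
                columns[name] = type(value).__name__
    return {
        "columns": columns,
        "row_count": len(rows),
        "available_operations": OPERATIONS,
    }


rows = [
    {"region": "EMEA", "revenue": 1200.0, "order_date": "2024-01-15"},
    {"region": "APAC", "revenue": None, "order_date": "2024-01-16"},
]
metadata = build_schema_metadata(rows)
```

Only column names, inferred types, the row count, and the operation list leave this function, so nothing from an individual record can end up in a prompt.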

This pattern mirrors how senior data analysts actually work. They listen to stakeholder questions, mentally map them to known analytical patterns, execute familiar code, then explain what the numbers mean. The difference is you're automating the pattern-matching and explanation steps while keeping the execution step fully deterministic.

How to Build a Hybrid Analytics Agent Step by Step

Start by defining your analytical operations as a library of functions. Don't let the LLM write pandas code dynamically. Instead, create pre-written functions for common operations: calculate_growth_rate(), segment_by_quartile(), compare_time_periods(), find_outliers(). Each function should handle edge cases, validate inputs, and return structured error messages when data doesn't meet requirements.

Create Your Operation Library

Your operation library is the foundation. Write functions that accept parameters but execute fixed logic. Here's an example structure:


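def calculate_segment_performance(df, segment_col, metric_col, time_col, period):
    """
    Pre-written function with fixed logic, parameterized inputs.
    Returns a list of row dicts or raises specific exceptions.
    Note: `period` is accepted but not used in this simplified example.
    """
    # Validate every column the aggregation depends on, not just the segment
    for col in (segment_col, metric_col, time_col):
        if col not in df.columns:
            raise ValueError(f"Column {col} not found in dataset")

    if df[metric_col].isna().sum() > len(df) * 0.3:
        raise ValueError("Metric column has >30% null values, analysis unreliable")

    # Fixed, tested aggregation logic
    results = df.groupby([segment_col, time_col])[metric_col].agg(['sum', 'mean', 'count'])
    # reset_index() keeps the segment and time keys in each returned record;
    # without it, to_dict('records') would drop them along with the index
    return results.reset_index().to_dict('records')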

Build 15-20 of these functions covering your most common analytical patterns. The goal is to handle 80% of requests with pre-written code. You're not trying to cover every possible analysis, just the repeatable ones.

Build the Analysis Planner

The Analysis Planner is where your LLM operates. It takes natural language input and outputs a JSON specification that maps to your operation library. Use function calling (available in GPT-4, Claude 3.5, and other modern models) to constrain the output format.
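
One way to keep the planner's options in sync with the operation library is to derive the parameter descriptions from the function signatures themselves. The sketch below uses the standard `inspect` module to do that. The exact payload each provider expects for function calling differs, so the dict layout here is an assumption, and the two operations are placeholders.

```python
import inspect
import json


def calculate_growth_rate(df, metric_col, time_col):
    """Placeholder body; a real operation would hold tested logic."""


def find_outliers(df, metric_col, threshold=3.0):
    """Placeholder body; a real operation would hold tested logic."""


def describe_operation(func):
    """Build a JSON-Schema-style description from a function signature."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        if name == "df":
            # The engine injects the data; the planner never supplies it
            continue
        is_number = isinstance(param.default, (int, float))
        properties[name] = {"type": "number" if is_number else "string"}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
            "additionalProperties": False,
        },
    }


tool_specs = [describe_operation(f) for f in (calculate_growth_rate, find_outliers)]
print(json.dumps(tool_specs[1], indent=2))
```

Because the descriptions are generated from the code, adding a function to the library updates what the planner is allowed to request without hand-editing a prompt.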


analysis_planner_prompt = """
You are an analysis planner. Translate user requests into JSON specifications.

Available operations: {operation_list}
Available columns: {schema_info}

User request: {user_query}

Output a JSON specification with:
- operation: function name from available operations
- parameters: dict of parameter names and values
- validation_rules: list of data quality checks to run first

If the request cannot be fulfilled with available operations, return an error specification.
Do not improvise new operations or parameters.
"""

The planner's job is pattern matching, not execution. It should fail gracefully when it can't map a request to available operations rather than trying to improvise. This "return error not guess" principle is what makes the system reliable.
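
The "return error not guess" idea can also be enforced in code on the planner's output, before anything reaches the engine. A minimal sketch, assuming the planner returns a JSON string; `ALLOWED` and `parse_plan` are hypothetical names, and the rule that parameters ending in `_col` must name real columns is a convention invented for this example.

```python
import json

# Hypothetical allowlist: operation name -> permitted parameter names
ALLOWED = {
    "calculate_growth_rate": {"metric_col", "time_col"},
    "find_outliers": {"metric_col", "threshold"},
}


def parse_plan(raw_text, known_columns):
    """Turn raw planner output into a checked spec or an error dict."""
    try:
        spec = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"error": "Planner output was not valid JSON"}
    if not isinstance(spec, dict):
        return {"error": "Planner output must be a JSON object"}

    operation = spec.get("operation")
    if not isinstance(operation, str) or operation not in ALLOWED:
        return {"error": f"Unknown operation: {operation}"}

    params = spec.get("parameters", {})
    if not isinstance(params, dict):
        return {"error": "Parameters must be a JSON object"}
    extra = set(params) - ALLOWED[operation]
    if extra:
        return {"error": f"Unexpected parameters: {sorted(extra)}"}

    # Assumed convention: any parameter ending in _col must name a real column
    for name, value in params.items():
        if name.endswith("_col"):
            if not isinstance(value, str) or value not in known_columns:
                return {"error": f"Unknown column: {value}"}
    return spec
```

Anything that fails a check comes back as an error dict, so an invented operation, parameter, or column is rejected instead of silently executed.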

Implement the Deterministic Execution Engine

Your execution engine receives the JSON specification from the planner and runs the corresponding function. It never calls an LLM. It validates inputs, executes the function, catches exceptions, and returns structured results or error messages.


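class AnalysisEngine:
    def __init__(self, operations_library, df):
        # operations_library maps operation names to pre-written functions
        self.operations = operations_library
        # The dataset lives in the engine; the planner never supplies or sees it
        self.df = df

    def validate(self, rule, params):
        """
        Run one named data quality check in deterministic code.
        Only a hypothetical 'columns_exist' rule is sketched here.
        """
        if rule == 'columns_exist':
            missing = [value for name, value in params.items()
                       if name.endswith('_col') and value not in self.df.columns]
            if missing:
                raise ValueError(f'Columns not found: {missing}')
        else:
            raise ValueError(f'Unknown validation rule: {rule}')

    def execute(self, specification):
        """
        Execute analysis based on JSON spec.
        No LLM calls happen here.
        """
        operation_name = specification.get('operation')
        params = specification.get('parameters', {})

        if operation_name not in self.operations:
            return {'error': f'Unknown operation: {operation_name}'}

        try:
            # Run validation rules first
            for rule in specification.get('validation_rules', []):
                self.validate(rule, params)

            # Execute the pre-written function against the engine's own data
            result = self.operations[operation_name](self.df, **params)
            return {'success': True, 'data': result}

        except Exception as e:
            return {'error': str(e), 'operation': operation_name}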

This layer is entirely reproducible. Given the same specification and data, you get identical results every time. You can log specifications, replay analyses, and debug failures without dealing with LLM non-determinism.

Add the Results Interpreter

After execution, send the results (not raw data) back to an LLM for interpretation. The LLM sees aggregated numbers and metadata about what operation was performed. It generates natural language explanations, highlights notable patterns, and suggests follow-up questions.


interpreter_prompt = """
Analysis performed: {operation_name}
Parameters: {parameters}
Results: {aggregated_results}

Explain these results in 2-3 sentences for a business stakeholder.
Highlight the most significant finding.
Suggest one relevant follow-up question.
"""

The interpreter adds value through context and communication, not calculation. Here's where the LLM's language capabilities shine without the risk of computational errors.
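
To make the "results, not raw data" rule concrete, the sketch below renders an interpreter prompt from an engine result. It assumes the engine returns a dict with either an `error` key or a `data` list, and it uses a shortened copy of the template. `MAX_RESULT_ROWS` is an invented safeguard, not something the architecture prescribes.

```python
import json

INTERPRETER_TEMPLATE = (
    "Analysis performed: {operation_name}\n"
    "Parameters: {parameters}\n"
    "Results: {aggregated_results}\n\n"
    "Explain these results in 2-3 sentences for a business stakeholder."
)

# Assumed cap: a payload larger than this looks like raw records, not an aggregate
MAX_RESULT_ROWS = 50


def build_interpreter_prompt(spec, engine_output):
    """Render the prompt from an engine result, or return None on failure."""
    if "error" in engine_output:
        # Nothing to interpret; surface the engine error to the user instead
        return None
    rows = engine_output["data"]
    if len(rows) > MAX_RESULT_ROWS:
        raise ValueError("Result set too large to be an aggregate")
    return INTERPRETER_TEMPLATE.format(
        operation_name=spec["operation"],
        parameters=json.dumps(spec["parameters"], sort_keys=True),
        aggregated_results=json.dumps(rows, sort_keys=True),
    )
```

Returning `None` on an engine error keeps the interpreter from being asked to explain numbers that were never computed.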

How to Prevent AI from Making Up Data in Analytics

The "return error not guess" principle is your primary defense against hallucination. When the Analysis Planner can't map a request to available operations, it should explicitly say so rather than attempting to improvise. When the Execution Engine encounters data quality issues, it should halt and report the problem rather than making assumptions.

Implement validation checks before every operation. Test for null percentages above acceptable thresholds (typically 20-30%), verify date ranges are sensible, confirm numeric columns don't contain string values, and check that join keys exist in both datasets. These checks run in deterministic code, not LLM-generated logic.
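
The checks listed above might look something like the following. This is a dependency-free sketch that assumes the data is a list of dict records; with pandas the same checks would be written against DataFrame columns. All function names are hypothetical, and the 30% default is taken from the upper end of the range mentioned above.

```python
from datetime import date


def check_null_share(rows, column, max_share=0.3):
    """Fail when too many values in `column` are missing."""
    if not rows:
        raise ValueError("Dataset is empty")
    nulls = sum(1 for row in rows if row.get(column) is None)
    if nulls / len(rows) > max_share:
        raise ValueError(f"{column}: {nulls}/{len(rows)} values are null")


def check_numeric(rows, column):
    """Fail when a numeric column contains non-numeric values."""
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{column}: non-numeric value {value!r}")


def check_date_range(rows, column, earliest, latest):
    """Fail when any date falls outside the expected window."""
    for row in rows:
        value = row.get(column)
        if value is not None and not (earliest <= value <= latest):
            raise ValueError(f"{column}: {value} outside {earliest}..{latest}")


def check_join_key(left_rows, right_rows, key):
    """Fail when the join key is absent from either dataset."""
    for label, rows in (("left", left_rows), ("right", right_rows)):
        if any(key not in row for row in rows):
            raise ValueError(f"{key}: missing from {label} dataset")
```

Each check raises a `ValueError` with a specific message, which the execution engine can catch and return as a structured error rather than continuing with questionable data.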

Version control your operation library like any production code. When you need to add a new analytical pattern, write the function, test it thoroughly with edge cases, and deploy it through your normal code review process. Don't let the LLM extend its own capabilities at runtime. This is probably the single most important architectural decision you'll make.

Log every specification that the Analysis Planner generates. Store them with timestamps, user identifiers, and execution results. This creates an audit trail and lets you identify when the planner is consistently failing to handle certain request types, which tells you where to add new operations to your library.
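
A simple way to start is an append-only JSON Lines file. The sketch below is one possible shape, assuming specifications and outcomes are JSON-serializable dicts; the field names and both function names are invented for illustration.

```python
import json
import time
import uuid


def log_specification(log_path, user_id, user_query, specification, outcome):
    """Append one planner decision and its outcome to a JSON Lines audit log."""
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "query": user_query,
        "specification": specification,
        "outcome": outcome,
        "status": "error" if "error" in outcome else "success",
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def failure_counts(log_path):
    """Count failed requests per error message to spot gaps in the library."""
    counts = {}
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            entry = json.loads(line)
            if entry["status"] == "error":
                message = entry["outcome"]["error"]
                counts[message] = counts.get(message, 0) + 1
    return counts
```

Grouping failures by message is a crude but quick way to see which request types the planner keeps rejecting.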

Building Reliable Agentic AI Systems for Enterprise

Enterprise reliability requires more than just preventing hallucinations. You need observability, error handling, and graceful degradation. Use tools like LangSmith for tracing to monitor how your Analysis Planner interprets requests and where it fails.

Implement circuit breakers for your LLM calls. If the Analysis Planner fails to generate valid specifications more than 3 times in 10 requests, fall back to a simpler pattern-matching system or queue requests for human review. This prevents cascading failures when the LLM service has issues or when you encounter a cluster of edge-case requests.
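
The "more than 3 failures in 10 requests" rule maps naturally onto a sliding window. A minimal sketch, assuming the planner returns a dict that contains an `error` key on failure; the class and function names are hypothetical, and this version needs an explicit `reset()` where a production breaker would more likely use a cooldown timer.

```python
from collections import deque


class PlannerCircuitBreaker:
    """Trip when too many recent planner calls produced invalid specs."""

    def __init__(self, window=10, max_failures=3):
        self.max_failures = max_failures
        # Sliding window of the most recent outcomes (True = failure)
        self.recent = deque(maxlen=window)

    def record(self, failed):
        self.recent.append(bool(failed))

    def reset(self):
        """Close the breaker again, for example after a manual check."""
        self.recent.clear()

    @property
    def is_open(self):
        # Open means: stop calling the planner and use the fallback path
        return sum(self.recent) > self.max_failures


def route_request(breaker, plan_fn, fallback_fn, query):
    """Send the query to the planner unless the breaker has tripped."""
    if breaker.is_open:
        return fallback_fn(query)
    spec = plan_fn(query)
    breaker.record("error" in spec)
    return spec
```

Here `plan_fn` would wrap the LLM call and `fallback_fn` the simpler pattern matcher or human-review queue.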

Set up automated testing for your operation library with realistic data samples. Test each function with normal cases, edge cases (empty datasets, single-row datasets, all-null columns), and adversarial cases (mismatched data types, inverted date ranges where the end precedes the start). Aim for 90%+ code coverage on your execution layer since that's where correctness matters most.
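
As an illustration of what those test categories look like in practice, here is a small `unittest` suite for a deliberately simple operation. The `calculate_growth_rate` body below is a hypothetical stand-in that works on a plain list of values, not the implementation from any real library.

```python
import unittest


def calculate_growth_rate(values):
    """Hypothetical operation: percent change from first to last value."""
    clean = [v for v in values if v is not None]
    if not all(isinstance(v, (int, float)) for v in clean):
        raise ValueError("Non-numeric value in input")
    if len(clean) < 2:
        raise ValueError("Need at least two non-null values")
    if clean[0] == 0:
        raise ValueError("Growth rate undefined when the starting value is 0")
    return (clean[-1] - clean[0]) / clean[0] * 100


class GrowthRateTests(unittest.TestCase):
    def test_normal_case(self):
        self.assertAlmostEqual(calculate_growth_rate([100, 110, 125]), 25.0)

    def test_empty_dataset(self):
        with self.assertRaises(ValueError):
            calculate_growth_rate([])

    def test_single_row(self):
        with self.assertRaises(ValueError):
            calculate_growth_rate([100])

    def test_all_null_column(self):
        with self.assertRaises(ValueError):
            calculate_growth_rate([None, None, None])

    def test_zero_baseline(self):
        with self.assertRaises(ValueError):
            calculate_growth_rate([0, 50])

    def test_mismatched_types(self):
        with self.assertRaises(ValueError):
            calculate_growth_rate(["100", "125"])
```

Saved in a test module, this runs with `python -m unittest`. Every edge case asserts a specific exception, so a silent wrong answer counts as a test failure.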

Create a feedback loop where users can flag incorrect analyses. When someone reports a problem, determine whether it was a planning failure (LLM chose wrong operation), execution failure (bug in your code), or interpretation failure (LLM misexplained correct results). Each category requires different fixes.

For organizations handling sensitive data, consider running models locally for the planning and interpretation steps. You can use smaller models like Llama 3 8B or Mistral 7B for the planner since you're constraining outputs to structured JSON anyway. The execution engine never needs an external API call.

When to Use AI vs Traditional Code in Your Agent

Use LLMs for intent understanding, ambiguity resolution, and natural language generation. They excel at mapping messy human requests to structured specifications and explaining complex results in accessible language. Don't use them for arithmetic, data transformation, or any operation where correctness is binary.

Use deterministic code for all data processing, calculations, and business logic. This includes aggregations, joins, filtering, sorting, and statistical operations. These operations have objectively correct answers that you can test and verify.

The boundary between AI and code should be explicit in your architecture. Data flows from user input through the LLM planner, into the code execution engine, and back through the LLM interpreter. At no point should the LLM have the ability to modify, filter, or transform your actual data records.

For complex multi-step analyses, chain multiple operations at the specification level, not the execution level. Let the Analysis Planner generate a sequence of operations, then execute them one by one with validation between each step. This keeps the execution path predictable while allowing the AI to handle analytical complexity.
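
A sketch of that step-by-step execution follows. It assumes each operation takes the current data as its first argument and returns the data for the next step, and it treats an empty intermediate result as a validation failure. Those are simplifying assumptions for the example, and `run_plan`, `filter_min`, and `top_n` are hypothetical names.

```python
def run_plan(steps, operations, data):
    """Execute planner-ordered steps one by one, halting on the first failure."""
    completed = []
    for index, step in enumerate(steps):
        operation = operations.get(step.get("operation"))
        if operation is None:
            return {"error": f"Step {index}: unknown operation", "completed": completed}
        try:
            data = operation(data, **step.get("parameters", {}))
        except (ValueError, TypeError, KeyError) as exc:
            return {"error": f"Step {index}: {exc}", "completed": completed}
        # Validation between steps: an empty intermediate result stops the chain
        if not data:
            return {"error": f"Step {index}: produced no rows", "completed": completed}
        completed.append(step["operation"])
    return {"success": True, "data": data, "completed": completed}


# Hypothetical operations working on lists of dict records
def filter_min(rows, column, minimum):
    return [row for row in rows if row[column] >= minimum]


def top_n(rows, column, n):
    return sorted(rows, key=lambda row: row[column], reverse=True)[:n]


OPERATIONS = {"filter_min": filter_min, "top_n": top_n}
rows = [
    {"region": "EMEA", "revenue": 120},
    {"region": "APAC", "revenue": 90},
    {"region": "LATAM", "revenue": 150},
]
plan = [
    {"operation": "filter_min", "parameters": {"column": "revenue", "minimum": 100}},
    {"operation": "top_n", "parameters": {"column": "revenue", "n": 1}},
]
outcome = run_plan(plan, OPERATIONS, rows)
```

The planner decides the order of the steps, but each step is still a pre-written function, and the `completed` list shows exactly how far a failed chain got.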

When you need to add genuinely novel analytical capabilities, the right approach is to have a human write the new operation, test it, and add it to the library. The system should recognize its limitations and request human intervention rather than attempting to synthesize new capabilities on the fly. This constraint is a feature, not a limitation. And honestly, most teams skip this part.

Building analytics agents this way takes more upfront work than letting an LLM generate code dynamically, but you end up with a system that's actually deployable in production. You get reproducible results, clear audit trails, and the ability to iterate on your analytical capabilities through normal software development practices. The hybrid architecture trades some flexibility for reliability, which is exactly the right tradeoff when accuracy matters more than plausibility.
