Text-2-SQL Agent

Most business questions need answers from data, but most people who ask them can’t write SQL, and the analysts who can are a bottleneck. The Text-2-SQL Agent decomposes a plain-English question, writes the query, scores its own answer across seven quality dimensions, and retries when quality is low. It is a Gies research project competing on the AgentBeats benchmark.

Project Lead: Ash Castelino

7 Scoring Dimensions
A2A AgentBeats Compatible
SSE Streaming API

The Query Pipeline

Each node does one job and hands state to the next—a LangGraph workflow with a quality-gated retry loop

1. Schema Analyzer

Introspects the database via SQLite PRAGMA statements, with no LLM call. Hashes the schema (SHA-256) and caches it with a TTL, so repeat questions against the same DB skip the roundtrip entirely.
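A minimal sketch of this step, assuming a SQLite database and an in-memory cache keyed by a caller-supplied database name. The names `get_schema`, `_SCHEMA_CACHE`, and `CACHE_TTL_SECONDS` and the 300-second TTL are illustrative assumptions, not the project's actual code. How the project uses the hash is not stated; here it is simply stored alongside the cached schema.

```python
import hashlib
import sqlite3
import time

# Hypothetical cache: db key -> (expiry timestamp, schema hash, schema dict)
_SCHEMA_CACHE = {}
CACHE_TTL_SECONDS = 300  # assumed value; the project's real TTL is not stated


def introspect_schema(conn):
    """Read table and column metadata with PRAGMA, without any LLM call."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        quoted = table.replace('"', '""')
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


def schema_fingerprint(schema):
    """SHA-256 over a canonical text rendering of the schema."""
    canonical = repr(sorted(schema.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_schema(db_key, conn, now=None):
    """Return (schema, fingerprint), reusing a cached entry until it expires."""
    now = time.time() if now is None else now
    cached = _SCHEMA_CACHE.get(db_key)
    if cached and cached[0] > now:
        return cached[2], cached[1]
    schema = introspect_schema(conn)
    digest = schema_fingerprint(schema)
    _SCHEMA_CACHE[db_key] = (now + CACHE_TTL_SECONDS, digest, schema)
    return schema, digest
```

Within the TTL window a second call returns the cached schema without touching the database, which is the "skip the roundtrip" behavior described above. The trade-off is that schema changes are not seen until the entry expires.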

2. Planner

GPT-5 produces a structured QueryPlan using JSON Schema mode, so the output is guaranteed to be parseable. The plan states whether one query or a multi-step chain is needed; predecessor results are injected into later steps.
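The plan shape is not documented on this page, so the following is only a sketch of what a schema-constrained plan could look like. The field names (`steps`, `id`, `goal`, `depends_on`) and the `parse_plan` helper are hypothetical, and the model call itself is omitted; the string passed to `parse_plan` stands in for the model's JSON output.

```python
import json
from dataclasses import dataclass, field

# Hypothetical JSON Schema for a plan; the real QueryPlan fields may differ.
QUERY_PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "goal": {"type": "string"},
                    "depends_on": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["id", "goal", "depends_on"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["steps"],
    "additionalProperties": False,
}


@dataclass
class PlanStep:
    id: str
    goal: str
    depends_on: list = field(default_factory=list)


@dataclass
class QueryPlan:
    steps: list

    @property
    def is_multi_step(self):
        """True when the plan needs a chain of queries rather than one."""
        return len(self.steps) > 1


def parse_plan(raw_json):
    """Turn schema-conforming JSON text into a typed QueryPlan."""
    data = json.loads(raw_json)
    return QueryPlan(steps=[PlanStep(**step) for step in data["steps"]])
```

Because the output is constrained to the schema, the parsing code can index required fields directly instead of defending against free-form text.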

3. Query Generator

Writes SQL for each sub-task. On retry, the previous attempt’s targeted feedback (what was wrong, which dimension failed) is injected so the model can correct specifically rather than guess.
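One way to picture the feedback injection is as an extra prompt section that appears only on retries. This is a sketch under assumptions: the `RetryFeedback` fields, the prompt wording, and the example dimension name are invented for illustration and are not taken from the project.

```python
from dataclasses import dataclass


@dataclass
class RetryFeedback:
    """What the evaluator reported about the previous attempt (hypothetical shape)."""
    previous_sql: str
    failed_dimension: str
    reason: str


def build_generator_prompt(question, schema_text, feedback=None):
    """Assemble the generator prompt; add a correction section on retries."""
    sections = [
        "Write one SQLite query that answers the question.",
        "Schema:\n" + schema_text,
        "Question: " + question,
    ]
    if feedback is not None:
        # Targeted feedback: name the failed dimension and the reason,
        # so the model corrects that issue instead of starting over.
        sections.append(
            "Your previous query was rejected.\n"
            f"Previous SQL: {feedback.previous_sql}\n"
            f"Failed dimension: {feedback.failed_dimension}\n"
            f"Reason: {feedback.reason}\n"
            "Fix this specific problem and keep everything that was correct."
        )
    return "\n\n".join(sections)
```

On the first attempt `feedback` is `None` and the prompt contains only the task, schema, and question.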

4. Executor & Evaluator

Runs the SQL, scores it across 7 dimensions, then runs an independent LLM relevance check. Blends the scores (85% eval + 15% relevance) into a final quality number.
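The 85/15 blend can be written out as a small function. Only the two weights come from the description above; everything else here is an assumption: scores are taken to lie in [0, 1], the seven dimensions are averaged with equal weight, and the dimension names are placeholders because the page does not list them.

```python
EVAL_WEIGHT = 0.85       # from the description above
RELEVANCE_WEIGHT = 0.15  # from the description above


def blend_quality(dimension_scores, relevance_score):
    """Combine per-dimension scores and the LLM relevance score.

    Assumes every score is in [0, 1] and that dimensions are weighted
    equally; the project may aggregate them differently.
    """
    eval_score = sum(dimension_scores.values()) / len(dimension_scores)
    return EVAL_WEIGHT * eval_score + RELEVANCE_WEIGHT * relevance_score


# Placeholder dimension names, all scored 0.8, with a perfect relevance check.
example_scores = {f"dimension_{i}": 0.8 for i in range(1, 8)}
final_quality = blend_quality(example_scores, relevance_score=1.0)
```

Under these assumptions the example works out to 0.85 × 0.8 + 0.15 × 1.0 = 0.83, so even a perfect relevance check can lift the final number by at most 0.15.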

5. Quality Gate & Retry

If the score falls below threshold, the pipeline loops back to the generator with category-specific feedback. If it passes, the task completes and the next sub-task begins. Final results are synthesized into a human-readable answer.
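The gate-and-retry control flow for one sub-task can be sketched in plain Python, leaving out the LangGraph wiring. The threshold of 0.7 and the cap of three attempts are assumptions, as are the `generate` and `evaluate` callables, which stand in for the generator and evaluator nodes.

```python
def run_with_quality_gate(generate, evaluate, threshold=0.7, max_attempts=3):
    """Retry generation until the score clears the threshold or attempts run out.

    generate(feedback) -> sql string; feedback is None on the first attempt.
    evaluate(sql) -> (score, feedback) where feedback describes what failed.
    Assumes max_attempts >= 1.
    """
    feedback = None
    best_sql, best_score = None, float("-inf")
    for attempt in range(1, max_attempts + 1):
        sql = generate(feedback)
        score, feedback = evaluate(sql)
        if score > best_score:
            best_sql, best_score = sql, score
        if score >= threshold:
            # Gate passed: this sub-task is done.
            return {"sql": sql, "score": score, "attempts": attempt, "passed": True}
    # Gate never passed: fall back to the best attempt seen.
    return {"sql": best_sql, "score": best_score,
            "attempts": max_attempts, "passed": False}
```

Capping the attempts keeps a question that the model cannot answer well from looping forever; whether the real pipeline returns the best attempt or reports a failure in that case is not stated on this page.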

Why This Architecture

Text-to-SQL is easy to prototype and hard to productionize

Self-Evaluating

A single LLM call that writes SQL is fragile—bad joins, wrong aggregations, missing filters. Scoring each result across 7 dimensions gives the agent a clear signal about whether its own output is trustworthy.

Targeted Retries

Generic “try again” loops waste tokens. This agent tells the model which dimension failed and why, so the retry is corrective—not a random re-roll.

Multi-Step by Default

Real analytical questions rarely map to one SQL statement. The planner decomposes them, runs queries in sequence, and feeds predecessor results into later steps—just like a human analyst would.
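To make the predecessor-result handoff concrete, here is a small sketch of a two-step chain against SQLite. In the agent the later step's SQL would be written by the model with earlier results injected into its prompt; in this sketch a hypothetical `build_sql` callable per step plays that role, and the step dictionary layout is an assumption.

```python
import sqlite3


def run_chain(conn, steps):
    """Run steps in order, passing each one the results it depends on."""
    results = {}
    for step in steps:
        # Collect the outputs of predecessor steps for this step.
        inputs = {dep: results[dep] for dep in step["depends_on"]}
        sql, params = step["build_sql"](inputs)
        results[step["id"]] = conn.execute(sql, params).fetchall()
    return results


# Example question: "Which region sold the most, and what were its orders?"
example_steps = [
    {
        "id": "top_region",
        "depends_on": [],
        "build_sql": lambda inputs: (
            "SELECT region FROM sales GROUP BY region "
            "ORDER BY SUM(amount) DESC LIMIT 1",
            (),
        ),
    },
    {
        "id": "orders",
        "depends_on": ["top_region"],
        # Uses the first step's answer as a bound parameter.
        "build_sql": lambda inputs: (
            "SELECT amount FROM sales WHERE region = ? ORDER BY amount",
            (inputs["top_region"][0][0],),
        ),
    },
]
```

The second query cannot be written until the first one has run, which is why a single SQL-writing call is not enough for this kind of question.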

Stack

Python 3.10+
LangGraph
GPT-5
JSON Schema Mode
SQLite
Server-Sent Events
A2A Protocol