Boosting Text2SQL Performance with Human-in-the-Loop Synthetic Data

Turing Staff

To train and fine-tune Large Language Models (LLMs) effectively, high-quality datasets are essential. It has been shown that even a small number of fine-tuning examples can lead to significant performance improvements. In this post, we’ll walk through our recent proof-of-concept study at Turing, focusing on Text2SQL—a task that translates natural language questions into SQL queries.

Our approach uses a unique combination of synthetic data and human expertise, underscoring the critical role of human-in-the-loop methods in generating and validating data. Synthetic data scales up the data generation protocols established by humans, who are uniquely equipped to define how Natural Language Questions (NLQs) should look and to ensure they authentically represent real-world usage. This methodology has not only improved our models' performance but also laid a foundation for future work on optimally combining synthetic and human-generated data to accelerate model fine-tuning.

Goal

This study aims to show that synthetic data, enhanced by a human-in-the-loop process, can significantly improve model performance on Text2SQL tasks. By combining outputs from a state-of-the-art LLM with human expertise, we created a dataset of NLQs paired with SQL queries to fine-tune a language model. This hybrid approach delivered notable performance gains. Our long-term goal is to apply the methodology to other coding and reasoning tasks.

Overview

Our approach uses a structured, human-in-the-loop workflow to produce high-quality Text2SQL datasets with NLQ-SQL pairs. This hybrid process leverages generative AI for large-scale data augmentation while incorporating human review and refinement to ensure precision and alignment with complex SQL patterns.

[Figure: Overview of the human-in-the-loop workflow]

Key steps

  1. NLQ creation: Initial NLQs are generated by human labelers following standard practices.
  2. Synthetic augmentation: The initial set of human-generated NLQs is expanded using AI-based augmentation to improve coverage across tables, columns, SQL syntax, and NLQ patterns. This step increases dataset size and diversity, allowing the model to learn a broader range of SQL-related tasks.
  3. Guideline development: Reviewers develop specific guidelines for SQL generation, ensuring consistency in structure and relevance across tasks.
  4. Iterative Text2SQL generation: Both human annotators and AI agents collaborate to generate SQL queries for a subset of the augmented NLQs. Reviewers intervene in cases of disagreement between agents and annotators, refining guidelines and reducing discrepancies in subsequent iterations.
  5. Model fine-tuning: The final dataset, consisting of NLQ-SQL pairs validated through this iterative, human-in-the-loop process, is used to fine-tune the model. This workflow ensures high-quality training data, optimizing the model’s performance on Text2SQL tasks.

Through this iterative feedback loop, disagreement rates are minimized with each cycle, leading to an enhanced dataset ideal for supervised fine-tuning (SFT).
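
As an illustration of how the feedback loop in step 4 can be organized, here is a minimal Python sketch of a round-based process. The agent, annotate, and review callables and the disagreement threshold are hypothetical stand-ins for the AI agent, the human annotators, and the reviewer pass; this is a sketch of the protocol, not our production tooling.

```python
# Minimal sketch of the iterative Text2SQL generation loop (step 4).
# `agent`, `annotate`, and `review` are hypothetical callables standing in for the
# AI agent, the human annotators, and the reviewer pass described above.
def iterate_until_agreement(nlqs, agent, annotate, review, guidelines,
                            max_rounds=3, target_disagreement=0.05):
    dataset = {}
    for _ in range(max_rounds):
        disagreements = []
        for nlq in nlqs:
            agent_sql = agent(nlq, guidelines)
            human_sql = annotate(nlq, guidelines)
            if agent_sql == human_sql:          # in practice, compare execution results
                dataset[nlq] = agent_sql
            else:
                disagreements.append((nlq, agent_sql, human_sql))
        if len(disagreements) / len(nlqs) <= target_disagreement:
            break
        # Reviewers resolve the conflicts and refine the guidelines for the next cycle.
        guidelines, resolved = review(disagreements, guidelines)
        dataset.update(resolved)
    return dataset, guidelines
```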

Data generation process

The data generation process involved multiple phases to create a robust dataset that supports a wide range of SQL patterns and NLQ complexities. This process included augmenting NLQs across various dimensions—complexity, semantics, and syntax—to ensure comprehensive model training. We used a state-of-the-art large language model (henceforth SOTA LLM) in all data generation pipelines.

a. Human-generated NLQs

Our initial NLQ set was created by human annotators using publicly available datasets from UNITE. Annotators crafted NLQs with dataset-specific instructions, detailed table information, and access to an SQL runner. They were encouraged to explore the entire database, allowing them to write NLQs that reflect a comprehensive understanding of the data. We focused exclusively on databases within UNITE's training split, limiting datasets to Criteria2SQL, WikiSQL, CoSQL, Spider, SQUALL, and SParC. To maintain a balanced representation, we sampled equally from each dataset.

b. NLQ augmentation

We then augmented the human-generated NLQs using three distinct pipelines, each aimed at increasing diversity along one of three dimensions:

  1. Complexity augmentation: Starting with 300 NLQs crafted by human annotators, we applied techniques inspired by WizardCoder to increase question complexity. This involved replacing common requirements with more specific conditions (e.g., “patients with diabetes" → “patients with type 2 diabetes") and adding constraints (e.g., “list 2-door cars" → “list 2-door, 45-HP cars"). These changes expanded the dataset’s scope and updated 150 NLQs.
  2. Semantic variety: To enhance the dataset’s adaptability to different database schemas, synthetic NLQs were generated to cover new tables and columns in the database. Since the SOTA LLM struggled to naturally replicate human phrasing, we provided a human-generated NLQ as a style reference and instructed the SOTA LLM to imitate it. Specifically, we asked the SOTA LLM to create NLQs that included randomly selected columns and tables, ensuring alignment with the style of the reference NLQ. The column and table names were chosen randomly from all possible database combinations, while the reference NLQs were sampled from those crafted by our annotators. This approach produced an additional 204 NLQs, increasing the dataset size to 504.
  3. Syntactic variation: Paraphrasing techniques were employed to introduce syntactic diversity while preserving the original meaning of the questions. For example, “Which devs know Java?” was rephrased as “List the devs that know Java.” A variety of styles—including imperative, morphological, and lexical changes—were applied to enhance the dataset’s robustness, resulting in 174 paraphrased NLQs and a final total of 504 NLQs with diverse syntactic coverage.

This process yielded a sequence of datasets, each produced by applying one of the pipelines above to the output of the previous stage.
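
To make the augmentation pipelines concrete, the sketch below shows how the complexity and semantic-variety steps can be driven by prompting a chat model; the syntactic paraphrasing pipeline follows the same pattern. The prompt wording, the gpt-4o model name, and the augment helper are illustrative assumptions on our part; the study only states that a SOTA LLM was used.

```python
from openai import OpenAI

client = OpenAI()

COMPLEXITY_PROMPT = """Rewrite the question so it is harder to answer with SQL:
replace a generic requirement with a more specific condition, or add one extra constraint.
Keep it answerable from the same database. Return only the rewritten question.

Question: {nlq}"""

STYLE_PROMPT = """Write a new question that must use table "{table}" and column "{column}".
Imitate the tone and phrasing of this reference question: "{reference}"
Return only the new question."""

def augment(prompt_template: str, **fields) -> str:
    # Model name is an assumption; the study only says a SOTA LLM was used.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(**fields)}],
        temperature=0.9,  # favor diverse rewrites
    )
    return response.choices[0].message.content.strip()

# Example calls (the table and column names below are made-up placeholders):
# augment(COMPLEXITY_PROMPT, nlq="List 2-door cars")
# augment(STYLE_PROMPT, table="cars", column="horsepower",
#         reference="Which devs know Java?")
```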

[Figure: Data generation process]

c. SQL generation

We developed a specialized Text2SQL agent equipped with a set of tools to autonomously generate SQL queries from the augmented NLQs. This agentic system was integral to our goal of producing accurate and diverse SQL queries while reducing the amount of manual labor required.

Agent architecture and workflow

The Text2SQL agent operates within a structured workflow similar to ReAct, where it iteratively interprets NLQs, interacts with database schema information, and generates SQL queries.

[Figure: SQL generation agent architecture]

The agent has access to four primary tools designed to aid it in building and validating SQL queries (a minimal sketch of the tool loop follows this list):

  1. explore_db: This tool allows the agent to examine the database schema, retrieving table structures and sample rows to understand the data better. The agent uses this information to make informed decisions on which columns and tables to query.
  2. run_sql_query: After formulating a query, the agent executes it using this tool. This step enables the agent to check for errors, verify the query’s validity, and assess its alignment with the NLQ’s intent. In cases of query failure or timeouts, the agent can re-evaluate and adjust its approach.
  3. scratchpad: This serves as a workspace where the agent records its observations, reflections, and next steps. The scratchpad enables a “chain-of-thought” process, allowing the agent to break down complex NLQs into smaller components and build towards a complete query iteratively.
  4. give_final_answer: Once the agent arrives at a final SQL query, it uses this tool to structure the output. This step involves ensuring the query is valid, efficient, and free from common errors, such as over-reliance on SELECT *. The agent also provides an explanation of the query’s logic, which enhances interpretability and transparency.
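
The sketch below shows how such a tool loop can be wired up for a SQLite database. The function bodies and the llm_step callable, which stands in for the SOTA LLM choosing the next tool, are simplified assumptions rather than the production agent.

```python
import json
import sqlite3

def explore_db(conn, table=None, sample_rows=3):
    """Tool 1: return the schema and a few sample rows so the agent can ground its query."""
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    info = {}
    for t in tables if table is None else [table]:
        cols = [r[1] for r in cur.execute(f'PRAGMA table_info("{t}")')]
        rows = cur.execute(f'SELECT * FROM "{t}" LIMIT {sample_rows}').fetchall()
        info[t] = {"columns": cols, "sample_rows": rows}
    return info

def run_sql_query(conn, sql):
    """Tool 2: execute a candidate query and surface errors instead of raising."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchmany(20)}
    except sqlite3.Error as e:
        return {"ok": False, "error": str(e)}

def text2sql_agent(conn, nlq, llm_step, max_steps=10):
    """ReAct-style loop: the LLM picks a tool, observes the result, and iterates."""
    scratchpad = []                                    # Tool 3: the agent's working notes
    history = [{"role": "user", "content": nlq}]
    for _ in range(max_steps):
        action = llm_step(history)                     # hypothetical: {"tool": ..., "args": {...}}
        tool, args = action["tool"], action.get("args", {})
        if tool == "explore_db":
            observation = explore_db(conn, **args)
        elif tool == "run_sql_query":
            observation = run_sql_query(conn, **args)
        elif tool == "scratchpad":
            scratchpad.append(args.get("note", ""))
            observation = {"ok": True}
        elif tool == "give_final_answer":              # Tool 4: structured final output
            return args                                # e.g. {"sql": ..., "explanation": ...}
        else:
            observation = {"ok": False, "error": f"unknown tool {tool}"}
        history.append({"role": "tool", "content": json.dumps(observation, default=str)})
    return None                                        # unresolved: route to human review
```
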
Iterative SQL generation and validation

The SQL generation process is iterative, with the agent continually revisiting the scratchpad to refine its approach based on feedback from previous query executions. This iterative cycle continues until the agent produces a valid SQL query that aligns with the NLQ’s intent. Key criteria for a successful SQL query include:

  • Validity: The query must execute without errors. Any runtime errors are logged in the chat history, allowing the agent to use this information to debug and correct the query. Additionally, the agent evaluates the query results to ensure they align with the NLQ’s intent. For example, the agent often performs additional checks when queries return empty results.
  • Generalizability: The SQL syntax should be flexible enough to handle similar questions without relying on hard-coded values or overly specific patterns. This is ensured through a rules-based checker that validates the query’s adaptability (a minimal sketch of such checks follows this list).
  • Efficiency: The agent strives to create optimized queries by minimizing the use of unnecessary columns or aliases. While these aspects are not strictly enforced, they are included in the agent’s best practice guidelines.
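
For illustration, a rules-based post-check of this kind can be as simple as a few pattern tests. The specific heuristics below are examples we chose for the sketch, not the full rule set used in the study.

```python
import re

def check_query(sql: str) -> list[str]:
    """Minimal rules-based post-check; returns a list of issues (empty means it passed)."""
    issues = []
    if re.search(r"\bselect\s+\*", sql, re.IGNORECASE):
        issues.append("avoid SELECT *: name only the columns the NLQ asks for")
    # Hard-coded surrogate keys (e.g. WHERE id = 42) usually signal a query that
    # will not generalize to similar questions.
    if re.search(r"\bid\s*=\s*\d+", sql, re.IGNORECASE):
        issues.append("hard-coded id literal: filter on user-facing values instead")
    if re.search(r"\blimit\s+\d+\b", sql, re.IGNORECASE) and "order by" not in sql.lower():
        issues.append("LIMIT without ORDER BY returns an arbitrary subset")
    return issues
```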

This architecture enables the agent to achieve high accuracy in SQL generation with minimal human intervention. In cases where the agent struggles to generate an appropriate query, human annotators step in to review and adjust the guidelines, providing the agent with feedback for future iterations.

Results

Our results are divided into two key phases: Pre-training validation and Post-training analysis. This breakdown allows us to assess both the quality of the synthetic data generated and the impact of fine-tuning on model performance.

Phase 1: Pre-training validation

In this phase, we analyzed the quality and diversity of the generated NLQs and SQL queries through several key metrics, including NLQ embedding cosine similarity, table and column coverage, and SQL pattern Shannon entropy scores.

Metric definitions

  1. NLQ cosine similarity: We converted NLQs to vectors using OpenAI’s vector embeddings and calculated the mean cosine similarity between all NLQ pairs to monitor semantic diversity at each augmentation stage. This metric serves as a guardrail: as the dataset grows, the density of vectors in the embedding space naturally increases, so some rise in mean similarity is expected; we track it to confirm that augmentation does not collapse into near-duplicate questions.
  2. Table and column coverage: As another assessment of semantic variety, we measured the number of unique tables and columns used in agent-generated SQLs required to answer the NLQs.
  3. SQL unique patterns: To assess complexity, we analyzed the number of unique SQL patterns generated by the agent within the dataset. This process involved parsing queries, abstracting their structure by identifying key clauses, recursively analyzing subqueries, and replacing specific column names, table names, and values with placeholders. By generalizing these elements, we derived unique query structures, providing a schema-independent metric that effectively captures the dataset's complexity and diversity.
  4. SQL pattern Shannon entropy: Since the count of unique SQL patterns increases as more data points are added, we used the Shannon entropy of SQL patterns as a dataset-size-independent and more conservative measure of complexity (a sketch of these metric computations follows this list).
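
The sketch below shows one way to compute metrics 1, 3, and 4, assuming sqlglot for SQL parsing and OpenAI's text-embedding-3-small model for the NLQ embeddings. The post does not name the exact parser or embedding model, so treat both as illustrative choices.

```python
import math
from collections import Counter

import numpy as np
import sqlglot
from openai import OpenAI
from sqlglot import exp

def mean_pairwise_cosine(nlqs: list[str]) -> float:
    """Metric 1: mean cosine similarity between all NLQ embedding pairs."""
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=nlqs)
    vecs = np.array([d.embedding for d in resp.data])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    n = len(nlqs)
    return float((sims.sum() - n) / (n * (n - 1)))  # exclude self-similarity

def sql_pattern(sql: str, dialect: str = "sqlite") -> str:
    """Metric 3: abstract a query to a schema-independent pattern by replacing
    literals, columns, and tables with placeholders."""
    tree = sqlglot.parse_one(sql, read=dialect)
    for node in list(tree.find_all(exp.Literal)):
        node.replace(exp.Literal.string("?"))
    for node in list(tree.find_all(exp.Column)):
        node.replace(exp.to_column("col"))
    for node in list(tree.find_all(exp.Table)):
        node.set("this", exp.to_identifier("tbl"))
    return tree.sql()

def pattern_entropy(sqls: list[str]) -> float:
    """Metric 4: Shannon entropy (in bits) of the SQL pattern distribution."""
    counts = Counter(sql_pattern(s) for s in sqls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```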

Results and discussion

Our results indicate that the augmentation process maintained high semantic diversity and complexity across the dataset (see table below).

[Table: Pre-training validation results]

The mean cosine similarity between NLQs remained consistent, with only a 1.67% variation across stages, confirming that the additional NLQs preserved semantic variety without clustering. Table and column coverage expanded by 16.37% and 44.28%, respectively, as new NLQs were introduced, reflecting broad schema engagement that strengthened the dataset's representational range. A slightly larger table coverage in stage 2 compared to stage 3 is attributed to random fluctuations.

Moreover, the number of unique SQL patterns increased by 49.21%, while Shannon entropy rose by up to 4.30%, indicating that each augmentation step enriched query complexity. These metrics together affirm that our human-in-the-loop augmentation strategy successfully produced a diverse, semantically rich, and complex dataset, setting a strong foundation for model fine-tuning.

Phase 2: Post-training analysis

In the post-training phase, we evaluated models fine-tuned on our augmented datasets against models fine-tuned on alternative datasets, using GPT-4o as a reference model. The results of this analysis illustrate the effectiveness of our human-in-the-loop synthetic data in enhancing model accuracy on Text2SQL tasks.

Training overview

For this phase, we fine-tuned multiple models on two versions of the dataset: (1) Dataset 3, the output of all three augmentation pipelines, with 504 NLQ-SQL pairs, and (2) an extended dataset of 5,000 NLQ-SQL pairs created in the same way. Both datasets contain the same number of human-generated seed NLQs, ensuring consistency in human input. Fine-tuning was conducted using LoRA (Low-Rank Adaptation) with hyperparameters optimized through Optuna on an A100 80GB GPU. The optimization process took approximately one day per model and covered the following models: Llama 2 7B Chat, Llama 3 8B Base, and Mistral 7B Instruct.
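
The sketch below shows the shape of such a setup with peft and Optuna. The search ranges, target modules, trial count, and the train_and_score helper are illustrative assumptions; the exact hyperparameter grid used in the study is not reproduced here.

```python
import optuna
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"  # one of the three bases evaluated

def objective(trial: optuna.Trial) -> float:
    lora_config = LoraConfig(
        r=trial.suggest_categorical("r", [8, 16, 32, 64]),
        lora_alpha=trial.suggest_categorical("lora_alpha", [16, 32, 64]),
        lora_dropout=trial.suggest_float("lora_dropout", 0.0, 0.2),
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(AutoModelForCausalLM.from_pretrained(BASE_MODEL), lora_config)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    # Fine-tune `model` on the NLQ-SQL pairs at `learning_rate`, then return
    # validation execution accuracy; `train_and_score` is a hypothetical helper.
    return train_and_score(model, learning_rate)

study = optuna.create_study(direction="maximize")
# Trial count is an illustrative choice; the full search took about a day per model
# on a single A100 80GB in our setting.
study.optimize(objective, n_trials=20)
```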

Test set and evaluation metric

To evaluate model performance, we created a test set of 825 samples drawn uniformly across UNITE’s datasets: 55 unique NLQ-SQL pairs from each of 15 datasets. Note that we only considered databases within UNITE’s test split. We used execution accuracy (EX) as the evaluation metric, following recent research practice in Text2SQL. Additionally, a manual evaluation of a subset of the test set revealed that approximately 23% of the NLQ-SQL pairs were incorrect, setting the theoretical maximum performance on this test set at 77%.
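
For reference, a minimal implementation of EX executes the predicted and gold queries against the target database and compares the returned rows. The order-insensitive multiset comparison below is one common convention; exact EX implementations vary across papers.

```python
import sqlite3
from collections import Counter

def execution_match(conn: sqlite3.Connection, gold_sql: str, pred_sql: str) -> bool:
    """A prediction is correct iff it executes and returns the same multiset of rows
    as the gold query (order-insensitive, duplicates preserved)."""
    def result_multiset(sql: str) -> Counter:
        return Counter(map(tuple, conn.execute(sql).fetchall()))
    try:
        return result_multiset(pred_sql) == result_multiset(gold_sql)
    except sqlite3.Error:
        return False  # queries that fail to execute score 0

def execution_accuracy(examples) -> float:
    """examples: iterable of (db_path, gold_sql, pred_sql) triples."""
    hits = total = 0
    for db_path, gold_sql, pred_sql in examples:
        with sqlite3.connect(db_path) as conn:
            hits += execution_match(conn, gold_sql, pred_sql)
        total += 1
    return hits / total
```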

Comparison with reference models

For context, we compared our fine-tuned models with three reference benchmarks:

  1. GPT-4o, which represents the performance of a high-performing large model in this task.
  2. Models trained using synthetic data from Gretel.
  3. Models fine-tuned using only 150 human-generated NLQ-SQL pairs.

For a fair comparison, we downsampled Gretel’s data to 504 and 5,000 examples, sampling uniformly across domains.
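
A minimal sketch of that downsampling with pandas follows; the file name and the domain column are assumptions about how the Gretel data is laid out locally.

```python
import pandas as pd

# Assumed local export of Gretel's synthetic Text2SQL data with a "domain" column.
gretel = pd.read_parquet("gretel_synthetic_text_to_sql.parquet")

def uniform_subset(df: pd.DataFrame, target_size: int, seed: int = 0) -> pd.DataFrame:
    """Sample roughly target_size rows, spread evenly across domains."""
    per_domain = max(1, target_size // df["domain"].nunique())
    return (
        df.groupby("domain", group_keys=False)
          .apply(lambda g: g.sample(n=min(per_domain, len(g)), random_state=seed))
          .head(target_size)
    )

subset_504 = uniform_subset(gretel, 504)
subset_5000 = uniform_subset(gretel, 5000)
```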

Results

The table below summarizes the performance of each model across various training configurations. Execution accuracy and gain in accuracy compared to the pre-trained model are reported for each case:

[Table: Execution accuracy and gains by model and training configuration]

We also report EX as a percentage of GPT-4o’s performance in the table below:

[Table: EX as a percentage of GPT-4o’s performance]

Discussion

The results show that our human-in-the-loop synthetic data significantly improved EX across models. Notably, Llama 2 7B Chat and Mistral 7B Instruct demonstrated marked accuracy gains compared to their pre-trained baselines. This underscores the efficacy of our fine-tuning data.

Our models consistently achieved higher EX than those trained on Gretel’s data, a fully synthetic dataset. For instance, fine-tuning Llama 2 7B Chat on our data yielded a 366.7% gain in EX, surpassing the 215.8% gain achieved with Gretel’s data. Even with smaller datasets, such as the 504-example set, our fine-tuned models outperformed those trained on Gretel’s data, affirming the advantage of human-in-the-loop data generation in enhancing data relevance and diversity.

Our synthetic data augmentation also improved EX relative to models trained only on human-generated examples, and adding more synthetic data consistently increased EX across all models. This underscores the scaling power of our synthetic data generation pipelines.

In terms of performance relative to GPT-4o, our models showed competitive results: Llama 3 8B Base, for example, achieved 84.92% of GPT-4o’s accuracy, reflecting the high quality of our data.

In summary, Phase 2 results highlight the potential of our synthetic data generation approach. It can produce datasets that close the gap with state-of-the-art models, even with a modest dataset size. This validation supports our methodology as an effective and scalable alternative for high-performance model fine-tuning, particularly in data-scarce scenarios.

Conclusion

This study demonstrates the power of human-in-the-loop synthetic data in elevating the performance of models on the Text2SQL task. By combining human insight with generative augmentation, we crafted a dataset that not only increased execution accuracy across multiple models but also maintained semantic diversity, schema coverage, and query complexity. The results reveal that our approach consistently outperforms models trained on Gretel’s synthetic data, and closely approaches the capabilities of state-of-the-art models like GPT-4o.

The integration of human expertise was pivotal, particularly in generating authentic NLQs that resemble real-world queries. This human input laid a strong foundation for meaningful augmentation, ensuring that the dataset remained realistic and relevant for model training. These findings validate human-in-the-loop data generation as an effective and scalable approach, particularly valuable in scenarios where access to large, labeled datasets is limited.

Looking ahead, our methodology shows promise not only for Text2SQL tasks but for other domains. This framework opens new avenues for Turing to offer high-quality, human-in-the-loop synthetic data services, helping clients bridge the gap between pre-trained models and their specific data needs.
