SQL Generation from Natural Language: A Sequence-to-Sequence Model Powered by the Transformers Architecture and Association Rules

: Using Natural Language (NL) to interacting with relational databases allows users from any background to easily query and analyze large amounts of data. This requires a system that understands user questions and automatically converts them into structured query language such as SQL. The best performing Text-to-SQL systems use supervised learning (usually formulated as a classification problem) by approaching this task as a sketch-based slot-filling problem, or by first converting questions into an Intermediate Logical Form (ILF) then convert it to the corresponding SQL query. However, non-supervised modeling that directly converts questions to SQL queries has proven more difficult. In this sense, we propose an approach to directly translate NL questions into SQL statements. In this study, we present a Sequence-to-Sequence (Seq2Seq) parsing model for the NL to SQL task, powered by the Transformers Architecture exploring the two Language Models (LM): Text-To-Text Transfer Transformer (T5) and the Multilingual pre-trained Text-To-Text Transformer (mT5). Besides, we adopt the transformation-based learning algorithm to update the aggregation predictions based on association rules. The resulting model achieves a new state-of-the-art on the WikiSQL DataSet, for the weakly supervised SQL generation.


Introduction
Semantic Parsing (SP) is one of the most important tasks in NLP, it requires both understanding the meaning of Natural Language (NL) sentences and mapping them to formal meaning representations (Zelle and Mooney, 1996;Panait and Luke, 2005;Clarke et al., 2010;Liang et al., 2011) often to machine-executable programs, for a range of tasks such as question-answering (Yih et al., 2014), robotic control (Matuszek et al., 2013) and intelligent tutoring systems (Graesser et al., 2005). As a sub-area of SP, we address the problem of mapping natural language utterances to executable relational DB queries, which is known to be difficult due to the flexibility and ambiguity in natural language and the complexity of relational databases.
In database areas (Androutsopoulos et al., 1995;Popescu et al., 2003;Affolter et al., 2019), the general problem was known as "Natural Language Interface to Databases (NLIDBs)", in particular, we are interested in translate natural language questions to SQL, due to the popularity of SQL as the domain-specific language used to query and manage data stored in most available relational databases (Ramakrsihnan et al., 1998). Despite the importance of the task, researchers have recently appeared to approach Deep Learning (DL) methods for the crucial problem of NLIDBs.
In this study, we treat the Text-to-SQL task with WikiSQL1 1 (Zhong et al., 2017). This DataSet is the first large-scale dataset for Text-to-SQL, with about 80 K human-annotated pairs of Natural Language question and SQL query. WikiSQL is very challenging because tables and questions are very diverse. This DataSet contains about 24K different tables.
There are two leaderboards for the WikiSQL challenge: Weakly supervised (without using logical form during training) and supervised (with logical form during training). On the supervised challenge, there are two results: Those with Execution Guided (EG) inference and those without EG inference.
The previous state-of-the-art weakly supervised model SeqGenSQL+EG (Li et al., 2020) achieved 90.5% execution accuracy on the test DataSet. On the supervised challenge, IE-SQL (Ma et al., 2020) achieves 87.8% execution accuracy without EG inference on the test dataset and 92.5% execution accuracy with EG inference. In this study, we are interested only in a weakly supervised challenge with execution accuracy as a metric.
In this study, we revisit the Seq2Seq approach, but this time powered by the transformers architecture, precisely using T5 (Raffel et al., 2019) and mT5 (Xue et al., 2020) language models. The preliminary results show that the prediction of the Aggregation function (AGG) decreases the performance of the model, which brings us to adopt the learning algorithm based on the transformation inspired by (Brill, 1995), to update the aggregation predictions based on association rules, which improve the AGG prediction and the whole model's results.
We organize our paper as follows: In section 2 we review the related work. In section 3 we formulate the problem of Text-to-SQL and describe the WikiSQL DataSet in more detail. In section 4, we describe our methods, then we present the obtained results in section 5. After that, in section 6 we analyze and discuss some errors to improve and finally, we draw the conclusion and future work.

Related Work
Building natural Language Interfaces for Databases (NLIDBs) has been a significant challenge in the SP area. Old works (Warren and Pereira, 1982;Androutsopoulos et al., 1995;Popescu et al., 2004) focused on rule-based approaches with handcrafted features, then later systems enabled users to query the databases with simple keywords (Simitsis et al., 2008;Blunschi et al., 2012;Bast and Haussmann, 2015). The next step was to enable the processing of more complex NL questions by applying a pattern-based approach (Popescu et al., 2004;Zheng et al., 2017).
Moreover, to improve the precision of natural language interfaces, grammar-based approaches were introduced by restricting users to formulate queries according to certain pre-defined rules (Song et al., 2015;Ferré, 2017).
In recent works, that operated on WikiSQL DataSet for training and evaluation; many approaches share a similar encoder-decoder architecture. In this case, information from both the NL and table schema is encoded into a hidden representation by the encoder. Some of those works encode the question with each column name separately (Xu et al., 2017;Yu et al., 2018;Hwang et al., 2019) and other choose encoding the concatenation of the question with columns name (Zhong et al., 2017;Dong and Lapata, 2018;Chang et al., 2020;Hwang et al., 2019;He et al., 2019).
There are also some works that do both at different layers (Hwang et al., 2019;He et al., 2019). Then the hidden representation is decoded with a decoder to a SQL query.
Some early work tried the seq2seq architecture, one step decoding (Zhong et al., 2017;Dong and Lapata, 2018), by using advanced Neural Network (NN) architectures to synthesize SQL queries given a user question, precisely the use of a classical encoder-decoder architecture based on Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). However, it is found challenging in output syntax.
Later more works treat the Text-to-SQL as a classification problem by approaching this task as a sketch-based slot-filling and predicting several parts of the SQL query like SELECT column, WHERE column, WHERE operator, WHERE value, etc. (Xu et al., 2017;Yu et al., 2018;Hwang et al., 2019;He et al., 2019). That way, the chance of output syntax problems is reduced.
"Hybrid Ranking Network" (Lyu et al., 2020), is an example of a model which also based on a BERT/RoBERTa pre-trained model and achieved 92% execution accuracy using an annotated logical form and EG inference predictions returning empty results will be dropped and the next most probable prediction is chosen.
The previous state-of-the-art model on the weaklysupervised challenge "SeqGenSQL" on which we are inspired in this study, also based on a pre-trained T5 model, exploring the use of increasing questions with table schema information (column name, type and database content) and the use of automatically augmented training data.

Task Definition
The specific semantic parsing problem we study in this study is to map a natural language question to a SQL query, which can be executed in a given DB to find the answer to the original question. In particular, we use the currently cross-domain largest natural language questions to SQL DataSet "WikiSQL" (described with more details in the next paragraph) to evaluate our model.

DataSet
We operate on WikiSQL (Zhong et al., 2017), a DataSet for Text-to-SQL task which contains a collection of questions, corresponding SQL queries and SQL tables. This is the largest hand-annotated Semantic Analysis DataSet to date, it is larger than other DataSets that have logical shapes, either in terms of the number of tables or number of examples. It provides tree sets: Train, dev and test set.
The WikiSQL dataset consists of 80,654 pairs of SQL questions and queries spread across 24,241 Wikipedia tables. Besides the question in NL, the entry also contains a unique table schema. Each table is present in a single set, train, development or test, which forces models to generalize to invisible tables. The SQL structure of the WikiSQL dataset queries is restricted and always follows the sketch in " Fig. 1".
$COLUMN is a single table column and $AGG is an aggregator function (empty, COUNT, SUM, MAX, MIN, AVG). The WHERE segment is a sequence of conditions. Each $OP is a filtering operator (=, <, >) and the filtering value $VALUE is mentioned in the question. Although the DataSet ships with a "standard" linear order of conditions, the order is irrelevant given the semantics of the WHERE clause (the semantic of "AND"). Figure 2 gives an example from the DataSet.

Methods
We use the original T5&mT5 pre-trained models as our Seq2Seq baseline model.
We notice that the following described methods are generic and can be used for any DataSet handling the Text-to-SQL task.

Input Representation
Given a question Q and a table schema T with columns name C1, C2, …, Cn, we form the input sequence as follow:

Preprocessing
We only transformed all DataSet text to lowercase format and store the input and the output sequences as Tab-Separated Values in TSV files, the format required for fine-tuning T5&mT5 (we notice that the original format of WikiSQL DataSet files is JSON).

T5&mT5 Fine-Tuning
In recent years, Transfer Learning (TL) has led to a new wave of cutting-edge results in Natural Language Processing (NLP). The power of TL comes from pretraining a model on abundantly available unlabeled text data with a self-supervised task. After that, the model can be refined on smaller labeled data sets, which often results in (much) better performance than training on the labeled data alone. The recent success of transfer learning was sparked in 2018 by ULMFiT (Howard and Ruder, 2018), ELMo (Peters et al., 2018) and BERT. The 2019 year saw the development of a wide variety of new methods like GPT (Radford et al., 2019), XLNet (Yang et al., 2019), RoBERTa, ALBERT (Lan et al., 2019), Reformer and MT-DNN (Liu et al., 2019a). The pace of progress on the ground has made it difficult to assess the most significant improvements and their effectiveness when combined.
In this sense, we used T5 (for Text-to-Text Transfert Transformer), a language model pre-trained on Colossal Clean Crawled Corpus (C4). This model achieves top results on many NLP benchmarks while being flexible enough to be fine-tuned for a variety of important downstream tasks. We fine-tune T5 on WikiSQL considering the input sequence (question + schema table) and the output SQL query as texts, which is illustrated in " Fig. 3".
We notice that WikiSQL DataSet treats another language besides English, However, T5 is pre-trained model only in English. In fact, it is unable to decode some no English questions. Also within this T5's limits over WikiSQL, it's unable to decode some special characters in the DataSet.  To leverage those limits, we fine-tune alternatively mT5 (m for multilingual), which is pre-trained on mC4 DataSet covering 101 languages.

T5&mT5
According to studies and statistics, T5 is more efficient than mT5 on English sequences. Since mT5 allows to decode and give results on non-English and as mentioned below, the WikiSQL dataset also contains non-English sentences with characters not managed by T5, but managed by mT5, we decided to use T5&mT5 together to benefit from the advantages of each pre-trained models. The following algorithm explains how we proceeded.

Gated Extraction Network
Seq2Seq models suffer from generation completely new words, or not generate all words completely. In our case, the majority of words to predict in the output sequence are present in the input sequence, so in most cases, it is better to copy them directly from the input rather than to try to generate them.
An example of generation new words by T5&mT5 is as follow: Question: In which country is the city of Netanya Predicted SQL query: Select country from 1-14937957-1 where city = "netheranya" Expected SQL query: Select country from 1-14937957-1 where city = "netanya" An example of not generate all complete words by T5&mT5 is as follow: Question: Are there registration notes on usek.edu.lb? Predicted SQL query: Select official registration notes from 1-1160660-1 where website = 'usk.edu.lb' Expected SQL query: Select official registration notes from 1-1160660-1 where website = 'usek.edu.lb' To handle this problem and encourage extraction from the input, we use a gated extraction T5&mT5 network, similar to (Li et al., 2020) and to a Pointer Generation Network (See et al., 2017). Between the encoder (Henc) and decoder (Hdec), we implement a cross attention layer, then we create a gate layer from the attention layer to control whether the output should be generated by the decoder or extracted from the encoder. The cross attention layer is implemented the same way as the T5&mT5 cross attention layer where the score is the product of Henc and Hdec, as illustrated in " Fig. 4 The final step is to merge both generation and extraction using element-wise operation:

AGG Prediction Reinforcement
The results of (Hwang et al., 2019) indicate that AGG annotations in WikiSQL contain up to 10% errors which negatively affects the prediction of the entire SQL query. In such a case, a learning model fails to train well. So, we decided to improve AGG results compared to the original model, using only simple association signals in the training data. Notably, we adopt a transformationbased learning algorithm (Brill, 1995) to update the aggregation function predictions based on association rules in the form of "change from X0 to X1", (example: Change from COUNT to SUM), considering some word occurrences. The algorithm mine and rank those rules from the training data " Fig. 5". For example, for the question: What is the total number of assists for players with under 55 games and over 6 assists per game average?
The predicted SQL is: Select count total assists where games inferior 55 and ast avg > 6, while the expected SQL is: Select sum total assists where games inferior 55 and ast avg > 6.
In this case, the rule "Change from COUNT to SUM" is triggered and the predicted SQL query is corrected to: Select sum total assists where games inferior 55 and ast avg > 6; which corresponds to the expected one.

Execution Guided Inference
It is always difficult to generate a perfect SQL query. To improve the prediction, Wang et al. (2018) introduced Execution Guided (EG) inference.
EG models send generated SQL queries to the SQL database engine and make adjustments if the database engine returns an empty result or run-time errors.
We experimented with beam search using run-time error or empty result during execution.
To apply EG inference, we used beam search by trying to execute each output sequence to the SQL database engine. If the SQL database engine returned a run-time error or an empty result, we drop the current prediction and return the next most efficient output sequence and so on. The algorithm of the EG is as following.

Implementation Details
We used Colab pro with TPU and Google Cloud Storage (GCS) as our environment and Pytorch with Tensorflow for codding.
240 is the length of input tokens and 75 is the length of the output sequence.
For tokenization, we used the default tokenizers T5Tokenizer and MT5Tokenizer of T5&mT5 respectively. And since the raw data splits are stored as JSON files, we first converted them to TSV format to make them parsable in TensorFlow. We also take the opportunity to do a bit of cleaning of the text.
For the loss function, we used CrossEntropyLoss and AdamW (Loshchilov and Hutter, 2017) to optimize the models with default hyperparameters.
We train both T5&mT5 on 35 epochs with the batch size set in 64.

Evaluation and Results
"Table 1" shows the performances of our work. Using T5-base and mT5-base, our baselines models achieve 84.1 and 86.2% respectively, on test data execution accuracy.
Using a gated extraction network, the precision is improved by about 1%, achieving 85.1 and 87.2% using T5&mT5 respectively.
Ensembling T5&mT5 and using them at the same time improve the performance by 5% compared to our baseline, achieving 90% on test data execution accuracy.
By adding association rules, the accuracy is augmented by 0.3% achieving 90.3%.
Finally, by using EG and replacing T5-base and mT5-base by T5-large and mT5-large respectively, the test data execution accuracy improved more and achieves 91.0%. " Figure 6" presents a bar chart for visualizing the improvements of experimentation results on WikiSQL DataSet.

Errors Analysis and Discussion
Even with association rules, analysis of the errors on the test data shows that most of them were in the prediction of bad aggregation functions.
Besides, some questions contain ambiguous words. For example, the word "total" can be interpreted as a SUM function in SQL and the word "sum" can be interpreted as a COUNT function.
Even with the gated extraction network, the model still predicts new words or predicts an SQL query with fewer characters or words.
We remarked also that some of the questions simply didn't provide sufficient information to compose an accurate SQL statement, particularly, questions that do not contain the name of columns, so in this case, the model predicts usually the wrong column in the SQL query (select column or condition column).
We also note that there are cases where the expected SQL query doesn't match the questions and in some of these cases, our model predicts the SQL query correctly in manual verification, even though they don't match the expected one.
Despite all these constraints, Sequence generation simplifies the process of generating SQL statements and T5 baseline T5-base + gated extraction network mT5 + baseline mT5 + gated extraction network T5-base&T5-base T5-base&T5-base + Association rules T5-large&T5-large + Association rules performs one-step prediction for this task with high accuracy. Besides, we still consider the following things for improving the accuracy of the model:  We assume larger base models could provide even more improvement. I fact, we can use T5-3B and mT5-xl (about * 4 larger than T5-large and mT5large), but this requires more resources precisely in terms of TPU and RAM  Improve the gated extraction network  Carefully designed question augmentation (with more information like data-type, POS), as well as accessing the content of the database and include some rows as input  Training the model on more data by doing data augmentation on train and dev set

Conclusion
In this study, we present our work for the generation of SQL queries from natural language. We used T5&mT5 for SQL sequence generation and we adopt a transformation-based learning algorithm to update the aggregation function predictions based on simple association rules, which helps to improve the general results. We trained and evaluated our model on WikiSQL DataSet, outperforming all previous works and achieving a new state-of-the-art on the weakly supervised challenge. As future work, we plan first to improve our model over WikiSQL Dataset, then use more complex DataSet like Spider, which treats multitables, containing nested SQL queries, with jointures and more components.