SQL Dialog: Everything you need to know about Training Data

Oracle Digital Assistant release 22.12 brings us a ground-breaking new capability – SQL Dialog – which can translate a user’s natural language input into SQL queries, send the queries to a backend data source, and display the response. All of these without the need to create intents, utterances, or custom components!

If you read the first article in this series, you should have (or know how to do it) a SQL Dialog Skill. This article will focus on how can we improve the baseline model.

Train the Skill to Convert Natural Language Utterances into SQL

After importing the model, the initial step is to enhance the SQL Dialog skill’s comprehension of natural language phrases and its ability to map them to the data structure.

We begin by examining the identified phrases – training corpus – that the skill is designed to handle. This analysis will reveal the various ways in which users refer to entities and attributes. Again, this is an exercise where the business / end-user input is critical.

Provide Training Data Through Names and Synonyms

What happens when the user uses “employee” in the query, but the table name is “emp”? Or if he asks to see an employee’s “salary”, but the attribute name is “sal”?

That is why we need to start by compiling a list of the ways people refer to entities and attributes and choose the term that best represents the primary name of each entity and attribute.

Selecting terms that closely match the most commonly used phrases is best.

But that alone is not enough as there are many other ways to refer to the same names. To address that, we use synonyms!

The primary names are also the names displayed in the result columns, so having a name that self-describes the content is good practice.

The same principle applies to all attributes, for which we can also provide synonyms.

You can also use other entities like value lists, regular expressions, and dynamic entities. This provides extra ability to recognize the user’s input and match it to a certain attribute.

To facilitate, you can make use of the auto-generation feature that will create a value list entity with just one click.

It generates the value list entity based on the existing values in the data source! Quite handy right?

Obviously, this also should be extended with appropriate synonyms – For example: Add NYC, NY, and Big Apple as synonyms of New York.

After this exercise, the model will better understand the natural language queries and associate them with SQL.

“can you show me all the employees in the big apple?”

Provide Custom Training Data

What happens when the skill is not able to convert the utterance into an OMRQL* query?

When that happens we can instruct the skill on how to accurately interpret and convert the utterance into OMRQL, by adding utterances with a mapping to an OMRQL query.

*If this is the first time hearing about OMRQL, please check this post first.

For example, “show me all the managers“ will not create a correct OMRQL as it does not filter by the job attribute.

We can manually correct the query.

This approach allows us to improve the model when it comes to converting natural language to OMRQL and subsequently to SQL.

Provide Autocomplete suggestions

How can the user know what kind of queries can he/she perform?

One way to assist users in understanding the types of database queries they can make is by offering to autocomplete suggestions. These suggestions give clues about the types of questions the logical model can answer.

These are only visible while using the WebSDK.

When the user starts typing, immediately the autocomplete suggestions appear.

Routing Data

If your skill has intents or is integrated into a DA, how can we make sure the skill directs SQL queries to the appropriate SQL Dialogs conversation?

The routing mechanism utilizes autocomplete suggestions, training data, generated and manually crafted routing phrases to learn how to identify SQL queries.

Accurate routing of utterances to the SQL conversation requires a lot of sample utterances with the translated query. On the Generated Routing Data tab, you can quickly generate 100 utterances that reflect questions the logical model can answer.

Those generated utterances are based on templates (buckets) and can be used to filter the data.

Approved utterances are included in the overall routing information. If a phrase has been modified, it will be labeled as Refined in the Combined Routing Data tab. If it has not been altered, it will be labeled as Synthetic.

If there are valid SQL queries that the DA or skill is not routing to the SQL conversation, then you need to add those utterances to the routing data from the Combined Routing Data tab. these are called Handcrafted Routing Data.