School of Informatics

Menu

Researchers propose improvements to SQL databases

Paolo Guagliardo and Leonid Libkin have recently published two papers that address flaws in SQL and relational databases.

They found examples showing unexpected and counterintuitive behaviour of even simple SQL queries that made them question analytics results obtained from relational DBMSs. Errors are especially likely to occur if some data is missing.

LinkSQL (Structured Query Language) is the most popular programming language used for relational databases. It is the most commonly used tool among data scientists. Various companies are relying on it to build database management systems (DMBS), whether commercial, or open source. Databases these days are used to handle vast amounts of data such as customer details for businesses, medical records at surgeries and hospitals and stock items in warehouses. They also power many websites that we use routinely on daily basis. It is crucial for the database to be easily managed and for the information to be accessed and updated quickly and efficiently. But researchers at the School of Informatics are questioning the reliability of SQL databases and proposing improvements. 

database icon

In their two recent papers Paolo Guagliardo and Leonid Libkin address flaws in SQL and relational databases. They found examples showing unexpected and counterintuitive behaviour of even simple SQL queries that made them question analytics results obtained from relational DBMSs. Errors are especially likely to occur if some data is missing.

Semantics of SQL

There are a few possible reasons for that. First of all, we do not understand the full semantics of SQL. While some formal semantics for SQL have been proposed, they were too simplified to describe the real language as used by programmers. The Standard that guides SQL use and development can be interpreted differently by different companies. Secondly, we do not know how to handle missing data and optimize common database queries in the presence of such data. That’s why transforming data from a raw form (process known as data wrangling) is crucial and research into it is significant.

Guagliardo and Libkin have been pursuing two lines of work: one concerns a formal semantics of SQL, and tries to at least eliminate the element of surprise in query results. The other presents a revised evaluation scheme that restores correctness to the notoriously unpredictable behaviour of SQL queries over databases with incomplete information.

Proposed solutions

Researchers are proposing formal semantics of a basic fragment of SQL that behaves exactly like the real-life SQL does, as opposed to its theoretical reconstructions with their many simplifications. From a practical point of view, this could be a useful tool for both users and implementers in understanding the behaviour of SQL queries. It is much more concise than the natural language specification of the Standard, as well as being very easy to implement and modify. They also advocate extending the formal semantics, and its experimental validation, to include more features of the language.

The findings also address multiple issues with SQL’s handling of queries where the data is missing. SQL disregards the standard notion of correctness on incomplete due to its high complexity. As a result, the evaluation of SQL queries on databases with missing data may produce answers that are just plain wrong. However, the scientists have proven that SQL evaluation can be modified, for a basic but rather expressive and very common fragment of the language in a way that queries return only correct answers.

Further reading

A Formal Semantics of SQL Queries, Its Validation, and Applications. PVLDB 11(1): 27-39 (2017)

Correctness of SQL Queries on Databases with Nulls. SIGMOD Record 46 (2017): 5-16.