3 Swaps to Improve Your SQL Query Performance
With SQL Code Examples
Whether you’re a data analyst, data engineer, data scientist, or analytics engineer, it’s imperative that you write clean, fast code. You shouldn’t be the only one that can understand what you are writing. It needs to be readable by others on your team so that it can be used and improved for years to come. It also needs to be optimized for query performance.
One of my biggest annoyances is when code takes forever to run because someone didn’t use the most efficient functions. This is a constant struggle in my current role where I am rewriting all of our core data models.
Believe it or not, there are good and bad ways of writing SQL code. Just because it runs and gets you the result you’re looking for, does NOT mean it is well-written code. It needs to be accurate, readable AND fast to be considered good code.
In this article I include SQL code examples, including the queries and their answers, as well as explanation of the important SQL commands used in them. We’ll go over the SQL row number function, SQL subqueries and CTEs, and the SQL between function. These swaps are guaranteed to improve the query performance of your SQL code.
SQL ROW NUMBER INSTEAD OF TOP
While rewriting old data models, I often came across subqueries using the SQL function TOP
. While this may get you the solution you are looking for, it is hard to read and can take a long time to run.
Here’s a piece of example code highlighting what I found:
SELECT customer_id, ( SELECT TOP order_id FROM orders WHERE orders.customer_id = customers.customer_id ORDER BY date DESC ) FROM customers
As engineers and analysts, we want to make sure we are writing the cleanest and most efficient code possible. So, if you’re using TOP
, I recommend using the SQL row number function instead.
ROW_NUMBER()
is a window function that allows you to assign a sequence number to each row that meets a certain condition. These conditions are specified using ORDER BY
and PARTITION BY
.
ORDER BY
specifies how you wish to order the values in your table. A date is typically used for this.
PARTITION BY
specifies the column you want to group your values by. This depends on the purpose of your query.
With this example, we would want to ORDER BY
date and PARTITION BY
customer_id. Since we are trying to find the most recent order for each customer, partitioning by the customer’s id will count a new sequence for each unique id.
SELECT customer_id, order_id, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY date) AS order_sequence_number FROM orders
Now, this will result in a table with the same number of rows as the original table, however, this time there is a new column order_sequence_number
that indicates the order sequence for each customer. In order to get a customer’s first order, we need to write another query that chooses only the sequence numbers that are equal to 1.
WITH order_row_number_calculated AS ( SELECT customer_id, order_id, ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY date) AS order_sequence_number FROM orders) SELECT customer_id, order_id FROM order_row_number_calculated WHERE order_sequence_number = 1
Now we have a CTE which finds the row number of each row in the first query and then filters that query by looking for only the row numbers that are first in their grouping. This is much easier for anyone to read and will typically run faster.
Subqueries → CTEs
That last SQL subquery example shown above can also be used for this swap! If you look above, we start with a subquery within a query and then finish by using CTEs.
What exactly is a CTE? CTE stands for common table expression. It creates a temporary set of results that you can use in your proceeding queries. It requires that you start it with WITH
and use table_name AS
before each individual CTE. Your last CTE in the sequence should be a simple SELECT statement without a table_name nor AS
.
Here’s a SQL CTE example: what it should look like:
WITH money_summed AS ( SELECT customer_id, date, SUM(balance) AS balance, SUM(debt) AS debt FROM bank_balance GROUP BY customer_id, date),money_available_calculated AS ( SELECT customer_id, date, (balance - debt) AS money_available FROM money_summed GROUP BY customer_id, date)SELECT customer_id, money_available FROM money_available_calculated WHERE date = '2022-01-01'
Notice that before the last SELECT
statement that we don’t use a table name or set of parentheses. There is also no comma after the previous set of parenthesis. The last query in a series of CTEs is always written as a normal query. Be sure to remember this because errors due to formatting are common when using CTEs!
SQL BETWEEN FUNCTION INSTEAD OF > or <
It is very common to see two tables being joined together using a date column. Most of the time, when I do see this, they are being joined by using ≥
and ≤
. While this works, it looks messy. Luckily, there is a better way to do this!
BETWEEN
is a SQL function that allows you to select a range between two given values. It returns the values between the values you feed it. It can be used with numbers, dates, and even strings.
However, the most important thing to note is that this SQL between function is inclusive. This means that the values specified will also be included in your result. So, it can only really replace ≤
or ≥
rather than <
or >
.
First, let’s look at a query that uses ≤
and ≥
signs to join two tables using a date.
SELECT customers.customer_id, customers.activated_at, campaigns.campaign_idFROM customersLEFT JOIN campaigns ON customers.campaign_id = campaigns.campaign_id WHERE customers.activated_at >= campaigns.started_at AND customers.activated_at <= campaigns.ended_at
Within the WHERE
clause, we need to specify two different statements using activated_at
. One where this column is ≥ started_at
and another where it is ≤ ended_at
.
Now, let’s use the BETWEEN function
.
SELECT customers.customer_id, customers.activated_at, campaigns.campaign_idFROM customersLEFT JOIN campaigns ON customers.campaign_id = campaigns.campaign_id WHERE customers.activated_at BETWEEN campaigns.started_at AND campaigns.ended_at
Here, we only have to specify activated_at
once rather than twice.
Which query is easier to read and understand? While the first may not be too complicated, it is small differences in your code that make a huge difference. It’s what separates the good from the great.
Conclusion
When first learning SQL, you want to focus on being able to solve the problems given to you and get the right answer. As you develop your skills and become more confident, you need to work on improving how you’re finding the solutions. This includes switching out less efficient functions for more efficient ones like SQL CTEs, the SQL row number function, and between SQL function.
You are no longer just writing working code, you are writing code to optimize your SQL query performance. You want your models to run as fast as possible!
Not only this, every engineer and analyst want their teammates to be able to read their code and easily understand it. Not only will you hate having to answer a million questions, but they’ll also hate having to ask you about it. Work on making your SQL code efficient and readable. If there’s a SQL function that acts as a shortcut, use it!
Also, don’t forget the importance of documentation. Commenting your code can make someone else’s job a lot easier and save the need for questions. If you think something you wrote may be hard to understand, explain what it means! I’m a firm believer that code can never have too many comments.