SQL Window Functions Introduction
Window functions perform a calculation over a set of rows that are connected in some way to the current row. It can be compared with GROUP BY
and aggregate functions. However with window functions, rows keep their separate identities in the output instead of being grouped into a single output row. The window function is able access more than just the current row of the query result under the hood. Window definition has the following syntax:
WINDOW window_name AS
(
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
)
The ORDER and PARTITION define what is called the "window", the ordered subset of data over which calculations are made.
PARTITION BY
Let's focus on the partitioning concept. When we partition by a field, the table is divided to the partitions/groups and each row individually can access to the items from its own partition. For example:
SELECT emp_no,
department,
salary,
AVG(salary) OVER w
FROM emp_salary
WINDOW w AS (PARTITION BY department)
The aggregate function AVG
is used OVER
the window w
and the result occurs to be:
Note that each color represents a partition.
The rows and the first 3 columns are the directly from the emp_salary
table and we would see that exact part with SELECT * FROM emp_salary
query, the last column comes with the aggregate function AVG
that operates on the partition of the current row and calculates the average of salary of the partition.
For each row, the window function is calculated over the rows that fall in the same partition as the current row.
As you can see from the following explain analyse result, it is important to notice that the PARTITION BY
statement first orders the table by the partitioned column:
+--------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------+
|WindowAgg (cost=83.37..104.37 rows=1200 width=72) (actual time=0.589..0.879 rows=10 loops=1) |
| -> Sort (cost=83.37..86.37 rows=1200 width=40) (actual time=0.371..0.477 rows=10 loops=1) |
| Sort Key: department |
| Sort Method: quicksort Memory: 25kB |
| -> Seq Scan on emp_salary (cost=0.00..22.00 rows=1200 width=40) (actual time=0.037..0.129 rows=10 loops=1)|
|Planning Time: 0.068 ms |
|Execution Time: 1.053 ms |
+--------------------------------------------------------------------------------------------------------------------+
ORDER BY
ORDER BY
statement can be omitted as it is optional, however there is another important concept to understand the behaviour of using it. For each row, there is a set of rows within its partition called its window frame.
- When the
ORDER BY
is supplied then the frame consists of all rows from the start of the partition up through the current row. - When
ORDER BY
is omitted the default frame consists of all rows in the partition.
If we compare result of the following query with the previous one we can see the AVG
is calculated within the frame, which means from start to the current row:
SELECT emp_no,
department,
salary,
AVG(salary) OVER w
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary)
As we can see in the example emp#5
has the average of the first 4 rows.
Usual Aggregates: SUM, COUNT, and AVG
We can use the aggregate functions that are used normally without windows.
SELECT emp_no,
department,
salary,
AVG(salary) OVER w,
SUM(salary) OVER w,
COUNT(salary) OVER w
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary)
Remember that the frame would be the whole partition if we did not use ORDER BY
statement.
ROW_NUMBER
As the name implies, it shows the number of the row within the partition, and it does not take any parameter.
SELECT emp_no,
department,
salary,
ROW_NUMBER() over w
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary)
RANK and DENSE_RANK
Rank
is similar to the row_number
however when 2 fields are having the same order (based on the order by clause) their rank is considered as the same, and the next rank is omitted, and +1 value is given for the upcoming row. dense_rank
function does not skip the next rank and assigns it to the upcoming row.
SELECT emp_no,
department,
salary,
RANK() over w,
DENSE_RANK() over w
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary)
LAG and LEAD
It can often be useful to compare rows with preceding or following rows, especially if you have the data in a meaningful order. LAG
function can access data of the previous rows, and LEAD
function can access the next rows relative to the current row. Both functions take offset as parameter to provide the number of rows to go backward of forward.
SELECT emp_no,
department,
salary,
LAG(salary, 1) over w as preceding_salary,
LEAD(salary, 1) over w as following_salary
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary ASC)
FIRST_VALUE and LAST VALUE
These functions return the first or last value of the window frame.
SELECT emp_no,
department,
salary,
first_value(salary) over w as smallest_salary,
last_value(salary) over w as biggest_salary
FROM emp_salary
WINDOW w AS (PARTITION BY department ORDER BY salary ASC)
Please note as we can use the named windows, we could define it directly in the projection column as well:
SELECT emp_no,
department,
salary,
first_value(salary) over (PARTITION BY department ORDER BY salary) as smallest_salary,
last_value(salary) over (PARTITION BY department ORDER BY salary) as biggest_salary
FROM emp_salary