There’s a common T-SQL need that involves computing for each current row a value from a previous or next row. For this purpose, T-SQL has the LAG and LEAD window functions. Things get trickier, though, when you need to add a condition. For example, suppose you need to compute the last col1 value that was greater than x; or, based on col1 order, compute the last col2 value that was greater than x. In this article I’ll explain how you can achieve such tasks.

The Challenge

To demonstrate computing previous and next with a condition, I’ll use daily rainfall data in tables called Locations and Precipitation. The Locations table holds locations where rainfall data is collected, and the Precipitation table holds daily precipitation per location in millimeters. Use the code in Listing 1 to create the tables and fill them with a small set of sample data.

Listing 1: DDL and small set of sample data

SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID(N'dbo.Precipitation', N'U') IS NOT NULL
  DROP TABLE dbo.Precipitation;
IF OBJECT_ID(N'dbo.Locations', N'U') IS NOT NULL
  DROP TABLE dbo.Locations;
GO

CREATE TABLE dbo.Locations
(
  locid INT  NOT NULL
    CONSTRAINT PK_Locations PRIMARY KEY
);

CREATE TABLE dbo.Precipitation
(
  locid INT  NOT NULL,
  dt    DATE NOT NULL,
  val   INT  NOT NULL,
  CONSTRAINT PK_Precipitation PRIMARY KEY(locid, dt, val),
  CONSTRAINT FK_Precipitation_Locations
    FOREIGN KEY(locid) REFERENCES dbo.Locations(locid)
);

INSERT INTO dbo.Locations(locid) VALUES(1),(2);

INSERT INTO dbo.Precipitation(locid, dt, val) VALUES
  (1, '20151214', 10),
  (1, '20151215', 0),
  (1, '20151216', 0),
  (1, '20151217', 100),
  (1, '20151218', 50),
  (1, '20151219', 20),
  (1, '20151220', 210),
  (1, '20151221', 20),
  (1, '20151222', 0),
  (1, '20151223', 0),
  (1, '20151224', 0),
  (1, '20151225', 40),
  (2, '20151214', 0),
  (2, '20151215', 140),
  (2, '20151216', 60),
  (2, '20151217', 40),
  (2, '20151218', 0),
  (2, '20151219', 20),
  (2, '20151220', 0),
  (2, '20151221', 0),
  (2, '20151222', 0),
  (2, '20151223', 45),
  (2, '20151224', 120),
  (2, '20151225', 130);

 

Suppose you get a task to compute, for each row in Precipitation, two things:

1. The number of days that passed since the last day the precipitation was more than 24 millimeters (not including today). Call the result column diffprev.

2. The number of days that will pass until the next day the precipitation will be more than 24 millimeters (not including today). Call the result column diffnext.

Table 1 has the desired result for the small set of sample data in Listing 1.

Table 1: Desired result

locid       dt         val         diffprev    diffnext
----------- ---------- ----------- ----------- -----------
1           2015-12-14 10          NULL        3
1           2015-12-15 0           NULL        2
1           2015-12-16 0           NULL        1
1           2015-12-17 100         NULL        1
1           2015-12-18 50          1           2
1           2015-12-19 20          1           1
1           2015-12-20 210         2           5
1           2015-12-21 20          1           4
1           2015-12-22 0           2           3
1           2015-12-23 0           3           2
1           2015-12-24 0           4           1
1           2015-12-25 40          5           NULL
2           2015-12-14 0           NULL        1
2           2015-12-15 140         NULL        1
2           2015-12-16 60          1           1
2           2015-12-17 40          1           6
2           2015-12-18 0           1           5
2           2015-12-19 20          2           4
2           2015-12-20 0           3           3
2           2015-12-21 0           4           2
2           2015-12-22 0           5           1
2           2015-12-23 45          6           1
2           2015-12-24 120         1           1
2           2015-12-25 130         1           NULL

Try and figure out what’s the most efficient solution you can come up with for this challenge.

Use the small set of sample data from Listing 1 to check the validity of your solution. To test your solution’s performance, you’ll need much more data. For this purpose, use the code in Listing 2 to create a helper function called GetNums, which generates a sequence of integers in a requested range.

Listing 2: Helper function GetNums

IF OBJECT_ID(N'dbo.GetNums', N'IF') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

Use the code below to populate the tables with 10,000 locations with three years’ worth of daily measurements for each (about 10,000,000 measurements in total).

Large set of sample data

-- 10,000 locations x 3 years ~= 10,000,000 measurements
TRUNCATE TABLE dbo.Precipitation;
DELETE FROM dbo.Locations;

INSERT INTO dbo.Locations(locid)
  SELECT n FROM dbo.GetNums(1, 10000);

INSERT INTO dbo.Precipitation(locid, dt, val)
  SELECT L.n AS locid, DATEADD(day, D.n, '20130101') AS dt,
    ABS(CHECKSUM(NEWID())) % 2 * ABS(CHECKSUM(NEWID())) % 100 AS val
  FROM dbo.GetNums(0, DATEDIFF(day, '20130101', '20151231')) AS D
    CROSS JOIN dbo.GetNums(1, 10000) AS L;

To work!

Solution with TOP

Probably the most obvious solution is to use a subquery with a TOP filter to get the desired previous or next value. For example, to get the previous date where the precipitation value was greater than 24 (call it prevdt), you’d use the following subquery (assuming you alias the outer instance of Precipitation P1):

( SELECT TOP (1) dt
    FROM dbo.Precipitation AS P2
    WHERE P2.locid = P1.locid
      AND P2.dt < P1.dt
      AND val > 24
    ORDER BY P2.dt DESC ) AS prevdt

Similarly, to get the next date where the precipitation value is greater than 24 (call it nextdt), you’d use the following subquery:

( SELECT TOP (1) dt
    FROM dbo.Precipitation AS P2
    WHERE P2.locid = P1.locid
      AND P2.dt > P1.dt
      AND val > 24
    ORDER BY P2.dt ) AS nextdt

Here the complete query computing both prevdt and nextdt:

SELECT locid, dt, val,
  ( SELECT TOP (1) dt
    FROM dbo.Precipitation AS P2
    WHERE P2.locid = P1.locid
      AND P2.dt < P1.dt
      AND val > 24
    ORDER BY P2.dt DESC ) AS prevdt,
  ( SELECT TOP (1) dt
    FROM dbo.Precipitation AS P2
    WHERE P2.locid = P1.locid
      AND P2.dt > P1.dt
      AND val > 24
    ORDER BY P2.dt ) AS nextdt
FROM dbo.Precipitation AS P1
/* ORDER BY locid, dt */ ; -- uncomment to present ordered

This query generates the output shown in Table 2 (presented sorted by location and date).

Table 2: Result of Step1

locid       dt         val         prevdt     nextdt
----------- ---------- ----------- ---------- ----------
1           2015-12-14 10          NULL       2015-12-17
1           2015-12-15 0           NULL       2015-12-17
1           2015-12-16 0           NULL       2015-12-17
1           2015-12-17 100         NULL       2015-12-18
1           2015-12-18 50          2015-12-17 2015-12-20
1           2015-12-19 20          2015-12-18 2015-12-20
1           2015-12-20 210         2015-12-18 2015-12-25
1           2015-12-21 20          2015-12-20 2015-12-25
1           2015-12-22 0           2015-12-20 2015-12-25
1           2015-12-23 0           2015-12-20 2015-12-25
1           2015-12-24 0           2015-12-20 2015-12-25
1           2015-12-25 40          2015-12-20 NULL
2           2015-12-14 0           NULL       2015-12-15
2           2015-12-15 140         NULL       2015-12-16
2           2015-12-16 60          2015-12-15 2015-12-17
2           2015-12-17 40          2015-12-16 2015-12-23
2           2015-12-18 0           2015-12-17 2015-12-23
2           2015-12-19 20          2015-12-17 2015-12-23
2           2015-12-20 0           2015-12-17 2015-12-23
2           2015-12-21 0           2015-12-17 2015-12-23
2           2015-12-22 0           2015-12-17 2015-12-23
2           2015-12-23 45          2015-12-17 2015-12-24
2           2015-12-24 120         2015-12-23 2015-12-25
2           2015-12-25 130         2015-12-24 NULL

Remember that you need to compute the difference in terms of days between prevdt and dt (call it diffprev), and between dt and nextdt (call it diffnext). For this, you need to use the DATEDIFF function with inputs based on the current date and the results of the TOP subqueries. If you want to avoid providing the subqueries directly as inputs to the DATEDIFF function, you can use the CROSS APPLY operator and a derived table based on the VALUES clause to define the prevdt and nextdt aliases in the FROM clause. Since the FROM clause is logically evaluated before the SELECT clause, aliases that you define in the FROM clause are made available to computations in the SELECT clause. Here’s the completer solution query applying this approach:

SELECT locid, dt, val,
  DATEDIFF(day, prevdt, dt) AS diffprev,
  DATEDIFF(day, dt, nextdt) AS diffnext
FROM dbo.Precipitation AS P1
  CROSS APPLY
    ( VALUES(
        ( SELECT TOP (1) dt
          FROM dbo.Precipitation AS P2
          WHERE P2.locid = P1.locid
            AND P2.dt < P1.dt
            AND val > 24
          ORDER BY P2.dt DESC ),
        ( SELECT TOP (1) dt
          FROM dbo.Precipitation AS P2
          WHERE P2.locid = P1.locid
            AND P2.dt > P1.dt
            AND val > 24
          ORDER BY P2.dt )
      ) ) AS A(prevdt, nextdt)
/* ORDER BY locid, dt */ ; -- uncomment to present ordered

The query plan I got for this query is shown in Figure 1.

Figure 1: Plan for solution with TOP

Notice that for each row in Precipitation you get two index seek operations—one to compute prvdt and another to compute nextdt. This results in a very high number of logical reads. The execution of this query took 71 seconds to complete on my system, preforming 65,844,026 logical reads (reported by an Extended Events session). Note that this is the only query that I ran on my laptop. In reality, if this query runs in a system where lots of other queries are executed, there’re will be contention due to the excessive I/O operations it performs. It’s probably a good idea to look for a solution that performs fewer reads.

By the way: Often you only need to compute a previous or next value, but not both. When I ran this solution computing only diffprev, removing the computation of diffnext, it took 42 seconds to complete. Still, the number of reads is in the dozens of millions.

Solution with window functions

You can achieve the same task using window functions, resulting in a plan that performs significantly fewer reads. First, compute a value called gooddt representing the current date when the precipitation value is greater than 24 and NULL otherwise. Then, compute prevdt with the MAX window function applied to gooddt, with the frame ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. Similarly, compute nextdt with the MIN window function applied to gooddt, with the frame ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING. The rest is the same as in the previous solution. Here’s the complete solution query:

WITH C AS
(
  SELECT *,
    MAX(gooddt) OVER(PARTITION BY locid
                     ORDER BY dt
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevdt,
    MIN(gooddt) OVER(PARTITION BY locid
                     ORDER BY dt
                     ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextdt
  FROM dbo.Precipitation
    CROSS APPLY ( VALUES( CASE WHEN val > 24 THEN dt END ) ) AS A(gooddt)
)
SELECT locid, dt, val,
  DATEDIFF(day, prevdt, dt) AS diffprev,
  DATEDIFF(day, dt, nextdt) AS diffnext
FROM C
/* ORDER BY locid, dt */ ; -- uncomment to present ordered

I got the plan shown in Figure 2 for this query.

Figure 2: Plan for solution with window functions

It took this query 51 seconds to complete on my system (and 33 seconds when computing only diffprev). The important thing, though, is that the number of reads dropped by three orders of magnitude to 64,769 (including for spilling the sort to tempdb). This means that this query will generate much less contention on I/O resources in an environment with multiple queries running concurrently.

You can achieve further optimization using Adam Machanic’s parallel CROSS APPLY trick where you query the Locations table, and with the CROSS APPLY operator, apply the solution’s logic to a single location. This trick tends to get better treatment of parallelism, and also breaks operations like sorting that scale in an extra linear fashion to multiple smaller operations that in total perform better.

To implement this technique first use the code in Listing 3 to encapsulate the solution’s logic for a single location in an inline table-valued function.

Listing 3: Encapsulate logic for single location in inline TVF

IF OBJECT_ID(N'dbo.GetDiff', N'IF') IS NOT NULL DROP FUNCTION dbo.GetDiff;
GO
CREATE FUNCTION dbo.GetDiff(@locid AS INT, @minprecip AS INT) RETURNS TABLE
AS
RETURN
  WITH C AS
  (
    SELECT dt, val,
      MAX(gooddt) OVER(ORDER BY dt
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevdt,
      MIN(gooddt) OVER(ORDER BY dt
                       ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextdt
    FROM dbo.Precipitation
      CROSS APPLY ( VALUES( CASE WHEN val > @minprecip THEN dt END ) ) AS A(gooddt)
    WHERE locid = @locid
  )
  SELECT dt, val,
    DATEDIFF(day, prevdt, dt) AS diffprev,
    DATEDIFF(day, dt, nextdt) AS diffnext
  FROM C;
GO

Use the following query to apply the function to each location:

SELECT L.locid, A.dt, A.val, A.diffprev, A.diffnext
FROM dbo.Locations AS L
  CROSS APPLY dbo.GetDiff( L.locid, 24 ) AS A
/* ORDER BY locid, dt */; -- uncomment to present ordered

 

I got the plan shown in Figure 3 for this query.

Figure 3: Plan for solution with APPLY

Notice also that the sort spill disappeared in this plan. It took this query 41 seconds to complete on my system (20 seconds when computing only diffprev, since there’s no sort required at all in such a case). This query performed 67,375 logical reads.

Return a different value than the ordering element

The last challenge involved returning the same value that is considered also the ordering element (the date in our example). But what if you need to return a different value than the ordering element? For example, what if the task was: Return the precipitation values of the previous and next days where the value was greater than 24? To achieve this, when you compute goodval, instead of holding just the date, have it hold a concatenated string made of the date plus the value, using the following expression which preserve correct ordering behavior:

CASE WHEN val > 24 THEN CONVERT(CHAR(8), dt, 112) + STR(val, 10) END

Use the MIN and MAX window functions like before (call the result columns this time prevgoodval and nextgoodval). Then, in the outer query, extract from each of the result columns the 10 rightmost characters, and convert them to integers. Here’s the complete solution query:

WITH C AS
(
  SELECT *,
    MAX(goodval)
      OVER(PARTITION BY locid
           ORDER BY dt
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevgoodval,
    MIN(goodval)
      OVER(PARTITION BY locid
           ORDER BY dt
           ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) AS nextgoodval
  FROM dbo.Precipitation
    CROSS APPLY ( VALUES( CASE WHEN val > 24 THEN
                    CONVERT(CHAR(8), dt, 112) + STR(val, 10) END ) )
    AS A(goodval)
)
SELECT locid, dt, val, prevval, nextval
FROM C
  CROSS APPLY ( VALUES( CAST(RIGHT(prevgoodval, 10) AS INT),
                        CAST(RIGHT(nextgoodval, 10) AS INT) ) )
    AS A(prevval, nextval)
/* ORDER BY locid, dt */ ; -- uncomment to present ordered

This query generates the output shown in Table 3 for the small set of sample data.

Table 3: Returning different value than ordering element

locid       dt         val         prevval     nextval
----------- ---------- ----------- ----------- -----------
1           2015-12-14 10          NULL        100
1           2015-12-15 0           NULL        100
1           2015-12-16 0           NULL        100
1           2015-12-17 100         NULL        50
1           2015-12-18 50          100         210
1           2015-12-19 20          50          210
1           2015-12-20 210         50          40
1           2015-12-21 20          210         40
1           2015-12-22 0           210         40
1           2015-12-23 0           210         40
1           2015-12-24 0           210         40
1           2015-12-25 40          210         NULL
2           2015-12-14 0           NULL        140
2           2015-12-15 140         NULL        60
2           2015-12-16 60          140         40
2           2015-12-17 40          60          45
2           2015-12-18 0           40          45
2           2015-12-19 20          40          45
2           2015-12-20 0           40          45
2           2015-12-21 0           40          45
2           2015-12-22 0           40          45
2           2015-12-23 45          40          120
2           2015-12-24 120         45          130
2           2015-12-25 130         120         NULL

More Windowing Power

Window functions are the best thing that happened since sliced bread. And I’m not talking about just T-SQL features, but generally. Seriously--if you asked me which of the two I’d rather give up if I had to, I’d probably give up bread. It keeps amazing me what wide range of problems can be solved so elegantly and efficiently with window functions. There’s so much more in standard SQL involving window functions, like nested window functions, more powerful RANGE capabilities and other features. Hopefully, some folks from Microsoft are reading this and will continue the investment to add the significant functionality that is still missing.