“Lara walked along the tracks following a path worn by pilgrims and then turned into the fields. Here she stopped and, closing her eyes, took a deep breath of the flower-scented air of the broad expanse around her. It was dearer to her than her kin, better than a lover, wiser than a book. For a moment she rediscovered the purpose of her life. She was here on earth to grasp the meaning of its wild enchantment and to call each thing by its right name …”

--Doctor Zhivago, Boris Pasternak, 1957

It's probably not what Boris Pasternak had in mind when he wrote these words, but, to many, mathematics captures nature’s wild enchantment. People stumble into logical problems, and then figure out mathematical patterns that enable them to solve those problems. Humanity can then save time by reusing those solutions and continuously improving them. Key to this, though, is naming things—the problems and the solutions. Later, by calling things by their right names, your path to an efficient solution is shorter.

A few years ago, I stumbled into a T-SQL puzzle that went like this (names of tables and columns have been changed to protect the innocent):

You are given two tables. Table G with columns x, y, and a unique key defined on the combination (x, y). Table M with columns x, y, and two unique keys, one defined on x and another on y. G is already populated with some rows and M is empty. Your task is to copy as many rows as you can from G to M.

When I stumbled into this puzzle for the first time, I didn’t really give much thought to its practical use cases. It just seemed interesting intellectually, so I went ahead and tried to solve it. After spending a few hours on the puzzle, I didn’t manage to find a solution. So, as I often do with puzzles that I don’t have a good solution for, I stashed it in a drawer and planned to revisit it once in a while.

Just recently, it occurred to me that there must be interesting practical use cases for this puzzle, as well as a more formal definition of the problem. If I managed to find those, I could probably also find existing optimized algorithmic solutions--and, based on those, work on T-SQL adaptions. And I was right …

Let’s start with a practical use case. Suppose that table G holds pairs of job applicants (column x) and jobs/tasks they are qualified to fulfill (column y). In a given timeframe (say, a shift), each applicant can handle only one job, and each job requires only one applicant to handle it. The challenge is to write to M a maximum number of applicant-to-job pairs. As you can imagine, there are many other examples, like scheduling teachers to classes, pilots to flights, and so on.

In the mathematical field of graph theory this problem is known as finding a maximum-cardinality matching (M) in an unweighted bipartite graph (G). There are existing algorithmic solutions--for example, the Ford–Fulkerson algorithm, which involves repeatedly finding an M-augmenting path in G and updating M with the symmetric difference of that path and M.

If you’re currently glaring at the text with glazing eyes and scratching your head, worry not. Soon enough you will own this topic. Then, when you reread the last paragraph, your reaction will be, “But of course!”

It will take a few articles, though. In this article, I’ll explain the concepts and terms involved in graph matching problems. I will also describe classic problems like maximal matching and maximum matching, and provide conceptual descriptions of algorithms to solve them. In future articles, I’ll discuss T-SQL adaptions for those algorithms.

Concepts and Terms

We’re going to have a much easier time discussing graph matching problems once the terms involved are clear. Let’s start with the most basic: nodes, edges and graphs.

Node: In graph theory, a node (a.k.a., vertex) is a term for an object that is a member of a set. In my earlier example, a specific applicant (say, with ID A) is a node that is a member of the set of applicants, and a specific job (say, with ID 1) is a node that is a member of the set of jobs.

Edge: An edge represents a relationship between two nodes. For example, given applicant A and job 1, the edge (A, 1) conveys the fact that applicant A is qualified to perform job 1.

Graph  A graph is a set of edges connecting pairs of nodes. In our example, graph G represents the set of all possible (applicant, job) edges where the applicant is qualified to perform the job.

Graphs can have different properties, some of which are of particular interest to our discussion: weighted, unweighted and bipartite.

  • Weighted/unweighted graph:  A graph is weighted when you assign a weight to each edge. For example, suppose that you need to take into account the fee that an applicant will charge for the job or the time it will take to complete it. If there’s no weight associated with the edges, the graph is unweighted. Our graph G is unweighted since we don’t currently have weights associated with the edges.
  • Bipartite graph: A graph is a bipartite kind when the edges connect nodes from two disjoint sets. Our graph G is indeed bipartite since the edges always connect a member of the set of applicants with a member of the set of jobs. Our graph never connects applicants with applicants or jobs with jobs.

Our bipartite graph G can be expressed as:

G = (X, Y, E)

X denotes the set of applicants, Y the set of jobs, and E the set of edges representing applicant-to-job relationships.

Figure 1 has an illustration of our graph with some sample data.

Figure 1 - Graph G

Implementation of Nodes and Edges in SQL Server

To implement the graph in SQL Server with traditional tables, create tables X and Y for the nodes and G for the graph edges:

SET NOCOUNT ON;
USE tempdb;

-- Cleanup
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
GO

-- Set X, e.g., Applicants
CREATE TABLE dbo.X
(
  x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
  moreonx VARCHAR(20) NOT NULL
);

INSERT INTO dbo.X(x, moreonx) VALUES
  ('A', 'Applicant A'),
  ('B', 'Applicant B'),
  ('C', 'Applicant C'),
  ('D', 'Applicant D');

-- Set Y, e.g., Jobs
CREATE TABLE dbo.Y
(
  y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
  moreony VARCHAR(20) NOT NULL
);

INSERT INTO dbo.Y(y, moreony) VALUES
  (1, 'Job 1'),
  (2, 'Job 2'),
  (3, 'Job 3'),
  (4, 'Job 4');

-- Graph G = (X, Y, E), e.g., possible applicant-job connections
CREATE TABLE dbo.G
(
  x VARCHAR(10) NOT NULL
    CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
  y INT NOT NULL
    CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
  CONSTRAINT PK_G PRIMARY KEY (x, y)
);

INSERT INTO dbo.G(x, y) VALUES
  ('A', 1),
  ('A', 2),
  ('B', 1),
  ('B', 3),
  ('C', 3),
  ('C', 4),
  ('D', 3);

To show all existing applicant-to-job relationships, you can naturally use basic inner joins, like so:

SELECT
  X.x AS applicantid, X.moreonx AS applicantname,
  Y.y AS jobid, Y.moreony AS jobname
FROM dbo.X
  INNER JOIN dbo.G
    ON X.x = G.x
  INNER JOIN dbo.Y
    ON G.y = Y.y;

You get the following output:

applicantid applicantname        jobid       jobname
----------- -------------------- ----------- --------------------
A           Applicant A          1           Job 1
A           Applicant A          2           Job 2
B           Applicant B          1           Job 1
B           Applicant B          3           Job 3
C           Applicant C          3           Job 3
C           Applicant C          4           Job 4
D           Applicant D          3           Job 3

You may have heard that SQL Server 2017 introduces native graph support via a feature called SQL Graph. The initial implementation allows you to explicitly define node and edge tables (with clauses AS NODE and AS EDGE, respectively), and instead of using traditional joins to match nodes via edges, as shown above, you use a clause called MATCH in the WHERE clause.

With a node table, you simply add the clause AS NODE at the end of the table definition. For each node row that you insert, SQL Server automatically creates an internal unique id (column called $node_id), to be later used in an edge table to connect nodes. Let’s recreate our node tables as native SQL Graph ones:

-- Cleanup
DROP TABLE IF EXISTS dbo.G, dbo.X, dbo.Y;
GO

-- Set X, e.g., Applicants
CREATE TABLE dbo.X
(
  x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
  moreonx VARCHAR(20) NOT NULL
) AS NODE;

INSERT INTO dbo.X(x, moreonx) VALUES
  ('A', 'Applicant A'),
  ('B', 'Applicant B'),
  ('C', 'Applicant C'),
  ('D', 'Applicant D');

SELECT $node_id, x, moreonx FROM dbo.X;

-- Set Y, e.g., Jobs
CREATE TABLE dbo.Y
(
  y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
  moreony VARCHAR(20) NOT NULL
) AS NODE;

INSERT INTO dbo.Y(y, moreony) VALUES
  (1, 'Job 1'),
  (2, 'Job 2'),
  (3, 'Job 3'),
  (4, 'Job 4');

SELECT $node_id, y, moreony FROM dbo.Y;

This code generates the following output:

$node_id_7B0758FCBD5F4EC49316BCA4A4DD9000          x  moreonx
-------------------------------------------------- -- ------------
{"type":"node","schema":"dbo","table":"X","id":0}  A  Applicant A
{"type":"node","schema":"dbo","table":"X","id":1}  B  Applicant B
{"type":"node","schema":"dbo","table":"X","id":2}  C  Applicant C
{"type":"node","schema":"dbo","table":"X","id":3}  D  Applicant D

$node_id_29698B2A330F41ED8B7812BABB71701E          y   moreony
-------------------------------------------------- --- -----------
{"type":"node","schema":"dbo","table":"Y","id":0}  1   Job 1
{"type":"node","schema":"dbo","table":"Y","id":1}  2   Job 2
{"type":"node","schema":"dbo","table":"Y","id":2}  3   Job 3
{"type":"node","schema":"dbo","table":"Y","id":3}  4   Job 4

The $node_id column is a computed canonical representation as a JSON string of an internal graph_id column, which is naturally more economic (BIGINT).

As for the edge table, when you create it, you add a clause called AS EDGE at the end of the table definition. SQL Server implicitly creates three columns: one called $edge_id, which uniquely identifies the edge, and two called $from_id and $to_id representing the vertices of the edge. When you add edge rows to the table, you need to provide the $node_id values of the nodes that the edge connects for the $from_id and $to_id columns. SQL Server automatically generates the $edge_id column values. You don’t have to have any additional columns in the table, so the code can certainly be one with no explicit column definitions. If you do need to assign additional properties to the edge--e.g., a weight in case of a weighted graph--you certainly can. In our example, G has no additional properties, so you create and populate it like so:

-- Graph G = (V = (X, Y), E), e.g., possible applicant-job connections
CREATE TABLE dbo.G AS EDGE;

INSERT INTO dbo.G($from_id, $to_id)
  SELECT X.$node_id, Y.$node_id
  FROM ( VALUES('A', 1),
               ('A', 2),
               ('B', 1),
               ('B', 3),
               ('C', 3),
               ('C', 4),
               ('D', 3) ) AS XY(x, y)
    INNER JOIN dbo.X
      ON XY.x = X.x
    INNER JOIN dbo.Y
      ON XY.y = Y.y;

SELECT $edge_id, $from_id, $to_id
FROM dbo.G;

This code generates the following output:

$edge_id_C6C1428B31F24F52A47DDCB975B93DC2          $from_id_0E5FF12FB4FC4EE98F38127715EE0D62          $to_id_E8D645584C8A44D3A05AA6D13B9F9E02
-------------------------------------------------- -------------------------------------------------- --------------------------------------------------
{"type":"edge","schema":"dbo","table":"G","id":0}  {"type":"node","schema":"dbo","table":"X","id":0}  {"type":"node","schema":"dbo","table":"Y","id":0}
{"type":"edge","schema":"dbo","table":"G","id":1}  {"type":"node","schema":"dbo","table":"X","id":0}  {"type":"node","schema":"dbo","table":"Y","id":1}
{"type":"edge","schema":"dbo","table":"G","id":2}  {"type":"node","schema":"dbo","table":"X","id":1}  {"type":"node","schema":"dbo","table":"Y","id":0}
{"type":"edge","schema":"dbo","table":"G","id":3}  {"type":"node","schema":"dbo","table":"X","id":1}  {"type":"node","schema":"dbo","table":"Y","id":2}
{"type":"edge","schema":"dbo","table":"G","id":4}  {"type":"node","schema":"dbo","table":"X","id":2}  {"type":"node","schema":"dbo","table":"Y","id":2}
{"type":"edge","schema":"dbo","table":"G","id":5}  {"type":"node","schema":"dbo","table":"X","id":2}  {"type":"node","schema":"dbo","table":"Y","id":3}
{"type":"edge","schema":"dbo","table":"G","id":6}  {"type":"node","schema":"dbo","table":"X","id":3}  {"type":"node","schema":"dbo","table":"Y","id":2}

To show all applicant-to-job relationships, instead of using joins, specify the node and edge tables in the query’s FROM clause separated by commas; in the WHERE clause, specify the MATCH clause connecting X with Y via G using ASCII art-style symbols, like so:

SELECT
  X.x AS applicantid, X.moreonx AS applicantname,
  Y.y AS jobid, Y.moreony AS jobname
FROM dbo.X, dbo.G, dbo.Y
WHERE MATCH(X-(G)->Y);

There’s elegance to this form compared to the traditional join form, and you really appreciate the savings when you deal with more involved relationships.

The initial implementation of the SQL Graph feature in SQL Server 2017 doesn’t yet include more sophisticated capabilities, like polymorphism and transitive closure. It also doesn’t include functions like shortest and longest paths, with the ability to indicate breadth-first search (BFS) or depth-first search (DFS). Such capabilities will likely be added over time as the feature matures. As is common with new T-SQL features nowadays, you will probably see them appearing in Azure SQL Database first, and at a later point in the box product. Such capabilities could be very handy in tackling graph matching and assignment problems. In the meanwhile, we’ll use the more traditional modeling and tools.

More Concepts and Terms

Let’s continue by visiting more concepts and terms related to graphs—this time focused on graph matching aspects. Figure 2 illustrates some of the concepts through examples. M1, M2, M3 and M4 represent different subsets of edges from G. For some Mn, edges unique to G are represented by thin gray lines, and edges common to G and Mn are represented by thick green lines.

Figure 2 - Graph matching examples

 

Matching: A matching M in a graph G is a subset of the edges of G such that no two edges share the same node. In Figure 2, M1, M3 and M4 are matchings in G, but M2 isn’t since it has two edges that share node B.

Maximal matching: A matching M in graph G is a maximal matching if you cannot add more edges from G that are not already in M to M. In Figure 2, both M3 and M4 are maximal matchings. M1 is a matching, but it isn’t maximal since you can add the edge (C, 4).

Maximum matching: A matching M in graph G is a maximum matching, also known as a maximum-cardinality matching, if it has the largest possible number of edges. Every maximum matching is maximal, but not every maximal matching is maximum. In Figure 2, M4 is a maximum matching but M3 isn’t. Although not the case with the sample data that we used for graph G, for any given graph there may be multiple different maximum matchings.

At this point, the description of the matching problem I provided earlier should be completely clear to you:

“This problem is known as finding a maximum-cardinality matching (M) in an unweighted bipartite graph (G).”

But of course!

Figure 3 has additional illustrations to help you understand certain types of paths in graphs.

Figure 3 - Types of paths

Matched node/vertex: Given a matching M in graph G, a matched node is a node that is incident to an edge in M.

Free node/vertex: Given a matching M in graph G, a free node is a node in G that is not incident to any edge in M.

Alternating path: An alternating path is a path that alternates between edges in the matching and not in the matching. Both paths in Figure 3 are alternating paths.

Augmenting path: An augmenting path is an alternating path that starts and ends with free nodes. In Figure 3 the path in the right illustration is an augmenting path whereas the path in the left illustration isn’t. If P is an augmenting path for matching M in graph G, you can use the terminology P is an M-augmenting path in G.

Symmetric difference: The symmetric difference of sets U and V (also known as the disjunctive union) is the set of elements in either of the sets and not in their intersection. The symbols ⊕, ⊖ and △ are commonly used to denote a symmetric difference operation. For example, you can use the notation W = U ⊕ V to indicate that the set W is equal to the symmetric difference of U and V. For example, if U = {2, 3, 5} and V = {5, 7, 11), U ⊕ V = {2, 3, 7, 11}. Figure 4 has a Venn diagram of the symmetric difference of two sets.

Figure 4 – Symmetric Difference

At this point the description of the Ford-Fulkerson algorithm for finding a maximum-cardinality matching M in G, which I provided earlier, should be clear to you at the conceptual level:

Repeatedly find an M-augmenting path in G and update M with the symmetric difference of that path and M.”

But of course!

We will naturally delve into this problem and solution more deeply in a future article.

Your Mission

Your mission, should you choose to accept it, is to write a T-SQL solution in the form of a stored procedure called MaximalMatching that is as efficient as possible for finding a maximal (not maximum yet) matching M in G.

Concerning G: To check the correctness of your solution, use the small set of sample data that I provided earlier in this article. If you need to recreate G, X and Y as traditional tables and repopulate them, use the following code:

-- Traditional tables X, Y and G
SET NOCOUNT ON;
USE tempdb;

-- Cleanup
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
GO

-- Set X, e.g., Applicants
CREATE TABLE dbo.X
(
  x VARCHAR(10) NOT NULL CONSTRAINT PK_X PRIMARY KEY,
  moreonx VARCHAR(20) NOT NULL
);

INSERT INTO dbo.X(x, moreonx) VALUES
  ('A', 'Applicant A'),
  ('B', 'Applicant B'),
  ('C', 'Applicant C'),
  ('D', 'Applicant D');

-- Set Y, e.g., Jobs
CREATE TABLE dbo.Y
(
  y INT NOT NULL CONSTRAINT PK_Y PRIMARY KEY,
  moreony VARCHAR(20) NOT NULL
);

INSERT INTO dbo.Y(y, moreony) VALUES
  (1, 'Job 1'),
  (2, 'Job 2'),
  (3, 'Job 3'),
  (4, 'Job 4');

-- Graph G = (X, Y, E), e.g., possible applicant-job connections
CREATE TABLE dbo.G
(
  x VARCHAR(10) NOT NULL
    CONSTRAINT FK_G_X FOREIGN KEY REFERENCES dbo.X,
  y INT NOT NULL
    CONSTRAINT FK_G_Y FOREIGN KEY REFERENCES dbo.Y,
  CONSTRAINT PK_G PRIMARY KEY (x, y)
);

INSERT INTO dbo.G(x, y) VALUES
  ('A', 1),
  ('A', 2),
  ('B', 1),
  ('B', 3),
  ('C', 3),
  ('C', 4),
  ('D', 3);

As for M, use the following table definition:

-- M is a matching in G
DROP TABLE IF EXISTS dbo.M;

CREATE TABLE dbo.M
(
  x VARCHAR(10) NOT NULL
    CONSTRAINT UQ_M_x UNIQUE,
  y INT NOT NULL
    CONSTRAINT UQ_M_y UNIQUE,
  CONSTRAINT PK_M PRIMARY KEY (x, y),
  CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);

Your stored procedure should start by emptying M, and then populating it with a maximal matching in G.

To test the performance of your solution, use the following code to create a larger set of sample data (X and Y removed since they are not needed for our problem):

-- Helper function dbo.GetNums
DROP FUNCTION IF EXISTS dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (SELECT 1 UNION ALL SELECT 1) AS D(c)),
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO

-- Larger set of sample data to test performance
DROP TABLE IF EXISTS dbo.M, dbo.G, dbo.X, dbo.Y;
GO

CREATE TABLE dbo.G
(
  x VARCHAR(10) NOT NULL,
  y INT NOT NULL,
  CONSTRAINT PK_G PRIMARY KEY (x, y)
);

CREATE TABLE dbo.M
(
  x VARCHAR(10) NOT NULL
    CONSTRAINT UQ_M_x UNIQUE,
  y INT NOT NULL
    CONSTRAINT UQ_M_y UNIQUE,
  CONSTRAINT PK_M PRIMARY KEY (x, y),
  CONSTRAINT FK_M_G FOREIGN KEY (x, y) REFERENCES dbo.G(x, y)
);

DECLARE @n AS INT = 10000000; -- ~ total num rows
SET @n = SQRT(@n * 2); -- num members of arithmetic sequence

INSERT INTO dbo.G WITH (TABLOCK) (x, y)
  SELECT RIGHT('000000000' + CAST(X.n AS VARCHAR(10)), 10) AS x, Y.n AS y
  FROM dbo.GetNums(1, @n) AS X
    CROSS APPLY (SELECT TOP (@n - X.n + 1) n
                 FROM dbo.GetNums(1, @n)
                 ORDER BY NEWID()) AS Y;

If you make it out alive from this mission, we’ll meet next month and discuss solutions. Then we can talk about your next mission—finding maximum matching.