Trees in SQL: Nested Sets and Materialized Path

By: Vadim Tropashko
To read more DBA articles, visit http://dba.fyicenter.com/article/

Relational databases are universally conceived of as an advance over their predecessors network and hierarchical models. Superior in every querying respect, they turned out to be surprisingly incomplete when modeling transitive dependencies. Almost every couple of months a question about how to model a tree in the database pops up at the comp.database.theory newsgroup. In this article I'll investigate two out of four well known approaches to accomplishing this and show a connection between them. We'll discover a new method that could be considered as a "mix-in" between materialized path and nested sets.

Adjacency List
Tree structure is a special case of Directed Acyclic Graph (DAG). One way to represent DAG structure is:
create table emp (
ename varchar2(100),
mgrname varchar2(100)
);

Each record of the emp table identified by ename is referring to its parent mgrname. For example, if JONES reports to KING, then the emp table contains record. Suppose, the emp table also includes . Then, if the emp table doesn't contain the record, and the same is true for every pair of adjoined records, then it is called adjacency list. If the opposite is true, then the emp table is a transitively closed relation.

A typical hierarchical query would ask if SCOTT indirectly reports to KING. Since we don't know the number of levels between the two, we can't tell how many times to selfjoin emp, so that the task can't be solved in traditional SQL. If transitive closure tcemp of the emp table is known, then the query is trivial:

select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING'

The ease of querying comes at the expense of transitive closure maintenance.
Alternatively, hierarchical queries can be answered with SQL extensions: either SQL3/DB2 recursive query

with tcemp as (
select ename,mgrname from tcemp
union
select tcemp.ename,emp.mgrname from tcemp,emp
where tcemp.mgrname = emp.ename
) select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING';

that calculates tcemp as an intermediate relation, or Oracle proprietary connect-by syntax

select 'TRUE' from (
select ename from emp
connect by prior mgrname = ename
start with ename = 'SCOTT'
) where ename = 'KING';

in which the inner query "chases the pointers" from the SCOTT node to the root of the tree, and then the outer query checks whether the KING node is on the path.

Adjacency list is arguably the most intuitive tree model. Our main focus, however, would be the following two methods.

Materialized Path
In this approach each record stores the whole path to the root. In our previous example, lets assume that KING is a root node. Then, the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since materialized path has been invented long before then, the convention stuck to plain character string of nodes concatenated with some separator; most often '.' or '/'. In the latter case, an analogy to pathnames in UNIX file system is especially pronounced.

In more compact variation of the method, we use sibling numerators instead of node's primary keys within the path string. Extending our example:
ENAME PATH
KING 1
JONES 1.1
SCOTT 1.1.1
ADAMS 1.1.1.1
FORD 1.1.2
SMITH 1.1.2.1
BLAKE 1.2
ALLEN 1.2.1
WARD 1.2.2
CLARK 1.3
MILLER 1.3.1

Path 1.1.2 indicates that FORD is the second child of the parent JONES.

Let's write some queries.
1. An employee FORD and chain of his supervisors:
select e1.ename from emp e1, emp e2
where e2.path like e1.path || '%'
and e2.name = 'FORD'

2. An employee JONES and all his (indirect) subordinates:
select e1.ename from emp e1, emp e2
where e1.path like e2.path || '%'
and e2.name = 'JONES'

Although both queries look symmetrical, there is a fundamental difference in their respective performances. If a subtree of subordinates is small compared to the size of the whole hierarchy, then the execution where database fetches e2 record by the name primary key, and then performs a range scan of e1.path, which is guaranteed to be quick.

On the other hand, the "supervisors" query is roughly equivalent to
select e1.ename from emp e1, emp e2
where e2.path > e1.path and e2.path < e1.path || 'Z'
and e2.name = 'FORD'

Or, noticing that we essentially know e2.path, it can further be reduced to
select e1.ename from emp e1
where e2path > e1.path and e2path < e1.path || 'Z'

Here, it is clear that indexing on path doesn't work (except for "accidental" cases in which e2path happens to be near the domain boundary, so that predicate e2path > e1.path is selective).
The obvious solution is that we don't have to refer to the database to figure out all the supervisor paths! For example, supervisors of 1.1.2 are 1.1 and 1. A simple recursive string parsing function can extract those paths, and then the supervisor names can be answered by
select e1.ename from emp where e1.path in ('1.1','1')
which should be executed as a fast concatenated plan. Nested Sets
Both the materialized path and Joe Celko's nested sets provide the capability to answer hierarchical queries with standard SQL syntax. In both models, the global position of the node in the hierarchy is "encoded" as opposed to an adjacency list of which each link is a local connection between immediate neighbors only. Similar to materialized path, the nested sets model suffers from supervisors query performance problem:
select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'

(Note: This query is borrowed from the previously cited Celko article). Here, the problem is even more explicit than in the case of a materialized path: we need to find all the intervals that cover a given point. This problem is known to be difficult. Although there are specialized indexing schemes like R-Tree, none of them is as universally accepted as B-Tree. For example, if the supervisor's path contains just 10 nodes and the size of the whole tree is 1000000, none of indexing techniques could provide 1000000/10=100000 times performance increase. (Such a performance improvement factor is typically associated with index range scan in a similar, very selective, data volume condition.)

Unlike a materialized path, the trick by which we computed all the nodes without querying the database doesn't work for nested sets.

Another — more fundamental — disadvantage of nested sets is that nested sets coding is volatile. If we insert a node into the middle of the hierarchy, all the intervals with the boundaries above the insertion point have to be recomputed. In other words, when we insert a record into the database, roughly half of the other records need to be updated. This is why the nested sets model received only limited acceptance for static hierarchies.

Nested sets are intervals of integers. In an attempt to make the nested sets model more tolerant to insertions, Celko suggested we give up the property that each node always has (rgt-lft+1)/2 children. In my opinion, this is a half-step towards a solution: any gap in a nested set model with large gaps and spreads in the numbering still could be covered with intervals leaving no space for adding more children, if those intervals are allowed to have boundaries at discrete points (i.e., integers) only. One needs to use a dense domain like rational, or real numbers instead.
Nested Intervals
Nested intervals generalize nested sets. A node [clft, crgt] is an (indirect) descendant of [plft, prgt] if:
plft <= clft and crgt >= prgt
The domain for interval boundaries is not limited by integers anymore: we admit rational or even real numbers, if necessary. Now, with a reasonable policy, adding a child node is never a problem. One example of such a policy would be finding an unoccupied segment [lft1, rgt1] within a parent interval [plft, prgt] and inserting a child node [(2*lft1+rgt1)/3, (rgt1+2*lft)/3]:

Full article...