Arrays in SQL that Avoid Repeated Groups

By: Joe Celko
To read more DBA articles, visit http://dba.fyicenter.com/article/

It is certainly possible to fake an Array in SQL, but there are only a few occasions when it would be the best design. Most often, the wish for an array in SQL is a sign of a forlorn struggle against poorly-normalised data. One of the worst sins against Codd is the repeating group, as Joe explains.

For a long time, procedural languages have used arrays to hold and manipulate data, so programmers are used to designing data with them. The relational model states that all data is shown in tables, and we have no other data structures. To be in First Normal Form (1NF) a table can have no repeating groups, which mean that it has only columns of scalar values. You can make a good case that you can only declare a table in the 1NF in SQL. But that does not stop programmers from finding ways to “fake it” and make their SQL look like their familiar application languages.

In procedural languages, arrays are stored in “row major” or “column major” order. Go to Wikipedia (Row-major order) for details, but the important concept is how to take an n-dimensional array and “flatten” it into sequential physical storage. This is important for procedural languages because it determines the algorithms used to access the array elements. In SQL, we do not care about any physical storage.

It is quite common to see repeated groups used in SQL to attempt to simulate an Array. An example of a repeated group is an employee table with a set of columns for dependents, as follows:

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
emp_name VARCHAR(25) NOT NULL,
kid_name_1 VARCHAR(25),
kid_name_2 VARCHAR(25),
kid_name_3 VARCHAR(25),
kid_name_4 VARCHAR(25),
kid_name_5 VARCHAR(25),
..);

This table has many problems. All the columns in the repeating group must be NULL-able, otherwise you cannot represent employees with zero, one, two, three, or four dependent children. The alternative of requiring all employees to have exactly five kids does not work. Likewise, the first employee that has six children messes up the table. And you cannot require him to ditch one of his dependents.

If "number one son" dies, you have to decide to leave his slot NULL or to move the others up one notch. Where would you put the name of a new daughter in the group? Position in a repeating group often has meaning in itself, so you cannot use a single simple UPDATE, INSERT, or DELETE procedure for such tables.

Queries are much more difficult. As an exercise, try to write queries against the Personnel table to find:

1. All kids named "George" by the same employee. George Foreman should be the only answer to this one.
2. All employees with three or more offspring
3. All employees with a kid who has the same name as they do (find the juniors)
4. Pairs of employees whose children have the same names

You are unlikely to enjoy this task. Using children makes the silliness easy to see.

Representing Arrays in SQL
SQL cannot represent arrays directly, but vendors often provide array language extensions. Two methods for supporting arrays are to have columns with "array" data types (as a whole) or to allow the referencing of groups of columns by subscript (element by element). Subscripts are also called array indexes, but that term can be confused with table indexes in SQL, so I use the term "subscript" instead.

An array in other programming languages has a name and subscripts by which you reference the array elements. Typically, the array elements all have the same data type and the subscripts are all integers. Some languages start numbering at zero, some at one, and others let the user set the upper and lower bounds.

A Pascal array declaration, for example, would look like this:

MyArray : ARRAY [1..5] OF INTEGER;

This would have elements MyArray[1], MyArray[2], MyArray[3], MyArray[4], and MyArray[5]. The same structure is often mapped into a SQL declaration as:

CREATE TABLE MyArray1
(element1 INTEGER NOT NULL,
element2 INTEGER NOT NULL,
element3 INTEGER NOT NULL,
element5 INTEGER NOT NULL);

You have to go to all this trouble because there is no subscript that you can iterate in a loop. In fact, there is no loop control structure at all! You must use column names.

A better alternative approach to faking an array in the relational model is to map arrays into a table with an integer column for each subscript, as follows:

CREATE TABLE MyArray2
(i INTEGER NOT NULL CHECK (i BETWEEN 1 AND 5),
element INTEGER NOT NULL,
PRIMARY KEY (i));

This looks more complex than the first approach, but it is closer to what the original Pascal, or any other procedural language, declaration does behind the scenes. Subscripts resolve to unique physical addresses, so it is not possible to have two values for MyArray[i]; hence i is a key. The compiler will check to see that the subscripts are within the declared range using the CHECK() clause.

The first advantage of this approach is that you can easily handle multidimensional arrays by adding another column for each subscript. The Pascal declaration:

ThreeD : ARRAY [1..3, 1..4, 1..5] OF REAL;

becomes:
CREATE TABLE ThreeD
(i INTEGER NOT NULL CHECK (i BETWEEN 1 AND 3),
j INTEGER NOT NULL CHECK (j BETWEEN 1 AND 4),
k INTEGER NOT NULL CHECK (k BETWEEN 1 AND 5),
element INTEGER NOT NULL,
PRIMARY KEY (i, j, k));

Obviously, GROUP BY clauses on the subscript columns will produce row and column totals. If you used the original one-element/one-column approach, the table declaration would have 120 columns, named "element111" to "element345." This would be too many names to handle in any reasonable way.

This idiom can support matrix math, but I am not going to go into that in this article. If anyone is interested, try to write the following operations for a classic two dimensional matrix:
1. Matrix equality
2. Matrix addition
3. Matrix multiplications
4. Matrix sorting
5. Compute a determinant

Let's go back the Personnel table and declare it using this approach:
CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
emp_name VARCHAR(25) NOT NULL,
kid_name VARCHAR(25) NOT NULL,
birth_seq INTEGER NOT NULL, –-subscript with a good name
..);

Actually, you should normalize this table further by splitting it into Personnel and Dependents tables. The Dependents table needs its own constraints and references back to the Personnel table. But the way to think about it is that you are doing explicitly what has been done implicitly.

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
emp_name VARCHAR(25) NOT NULL,
..);

CREATE TABLE Dependents
(emp_id INTEGER NOT NULL
REFERENCES Personnel(emp_id)
ON DELETE CASCADE
ON UPDATE CASCADE,
birth_seq INTEGER NOT NULL
CHECK (birth_seq >l 0),
PRIMARY KEY (emp_id, birth_seq),
kid_name VARCHAR(25) NOT NULL,);
The four proposed queries are now simple and will work for families of any size.
1. All kids named "George"
SELECT kid_name
FROM Dependents
WHERE kid_name = 'George';

2. All Personnel with three or more offspring
SELECT P.emp_name, COUNT(*) AS kid_cnt
FROM Dependents AS D, Personnel AS P
WHERE D.emp_id = P.emp_id
GROUP BY P.emp_name
HAVING COUNT(*) >= 3;

3. All Personnel with a kid who has the same name as they do ( find the "juniors"). George Foreman will get several lines in the output…

SELECT P.emp_name AS junio, D.birth_seq
FROM Dependents AS D, Personnel AS P
WHERE D.emp_id = P.emp_id
AND D.kid_name = P.emp_name;

4. The query "find pairs of employees whose children all have the same names" is very restrictive. Both Mr. X and Mr. Y must have exactly the same number of dependents, and both sets of names must match and be in the same birth_seq. (We can assume that nobody has two children with the same name -- except George Foreman.) Begin by constructing a table of sample data:

INSERT INTO Dependents (emp_id, kid_name, birth_seq)
VALUES (1, 'Dick', 2), (1, 'Harry', 3), (1, 'Tom', 1),
(2, 'Dick', 3), (2, 'Harry', 1), (2, 'Tom', 2),
(3, 'Dick', 2), (3, 'Harry', 3), (3, 'Tom', 1),
(4, 'Harry', 1), (4, 'Tom', 2), (5, 'Curly', 2),
(5, 'Harry', 3), (5, 'Moe', 1):

In this test data, employees 1, 2, and 3 all have kids named Tom, Dick, and Harry. The birth order is the same for the children of employee 1 and 3. For testing purposes, you might consider adding an extra child to the family of employee 3, and so on.

While there are many ways to do this query, this approach gives you some flexibility that others do not. Construct a VIEW that gives you the number of each employee's dependents:

Full article...