Exotic Operators: List Aggregate
Exotic Operators
The core of the SQL language is fairly compact. Select-project-join, set
operators, nested subqueries, inner views, and aggregation make up a
very short but expressive list of operators. This is all that most users
ever need for everyday usage. Once in a while, however, there comes
a problem that can't be solved by these means. Often, such a
problem is evidence of a missing language feature.
List Aggregate
List aggregate is not exactly a missing feature. It is implemented as a
built-in operator in Sybase SQL Anywhere and MySQL. Given the
original Emp relation
select deptno, ename from emp;
DEPTNO ENAME
10 CLARK
10 KING
10 MILLER
20 ADAMS
20 FORD
20 JONES
20 SCOTT
20 SMITH
30 ALLEN
30 BLAKE
30 JAMES
30 MARTIN
30 TURNER
30 WARD
the query
select deptno, list(ename||', ')
from emp
group by deptno
is expected to return
DEPTNO LIST(ename||', ')
10 CLARK, KING, MILLER
20 ADAMS, FORD, JONES, SCOTT, SMITH
30 ALLEN, BLAKE, JAMES, MARTIN, TURNER, WARD
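Readers without Sybase or MySQL at hand can try the same idea in SQLite, whose built-in group_concat aggregate plays the role of list. The following is a minimal sketch, assuming Python's sqlite3 module and an in-memory copy of the Emp relation; it is an illustration, not the vendor syntax shown above. SQLite does not promise any particular order of the concatenated names, so the sketch sorts them after the fact.

```python
import sqlite3

# In-memory copy of the Emp relation from the text (assumption: only two columns are needed).
conn = sqlite3.connect(":memory:")
conn.execute("create table emp (deptno integer, ename text)")
conn.executemany("insert into emp values (?, ?)", [
    (10, "CLARK"), (10, "KING"), (10, "MILLER"),
    (20, "ADAMS"), (20, "FORD"), (20, "JONES"), (20, "SCOTT"), (20, "SMITH"),
    (30, "ALLEN"), (30, "BLAKE"), (30, "JAMES"), (30, "MARTIN"),
    (30, "TURNER"), (30, "WARD"),
])

# group_concat is SQLite's counterpart of the list aggregate.
rows = conn.execute(
    "select deptno, group_concat(ename, ', ') from emp "
    "group by deptno order by deptno"
).fetchall()

# The concatenation order is not guaranteed, so normalize it before display.
employees_by_dept = {deptno: sorted(names.split(", ")) for deptno, names in rows}
for deptno, names in employees_by_dept.items():
    print(deptno, ", ".join(names))
```

Controlling the order inside the aggregate itself is the subject of exercise 1.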
The other vendors don't have a built-in list aggregate, but offer
more than enough functionality to implement it easily. If your
platform allows programming user-defined aggregate functions, you
just have to search for the code on the net, as it is most likely that
somebody has already written it. For Oracle you may
easily find a string aggregate function implementation named stragg
on the Ask Tom forum.
User Defined Functions
Originally, SQL was intended to be a purely declarative
language. It had some built-in functions, but it was soon
discovered that introducing User Defined
Functions (UDF) makes the SQL engine extensible.
Today, the UDF is arguably one of the most abused
features. In the industry, I have seen a UDF with 200+
parameters wrapping a trivial insert statement, UDFs
used for query purposes, etc. Compare that to the
integer generator UDF from chapter 2, which was
written only once, and which is intended to be used
in numerous applications.
The idea behind the recursive SQL solution carries over to the
connect by solution
with concat_enames as (
select deptno, sys_connect_by_path(ename,',') aggr, level depth
from emp e
start with ename=(select min(ename) from emp ee
where e.deptno=ee.deptno)
connect by ename > prior ename and deptno = prior deptno
) select deptno, aggr from concat_enames e
where depth=(select max(depth) from concat_enames ee
where ee.deptno = e.deptno);
SELECT deptno,
CONCAT_LIST(
CAST(MULTISET(
SELECT ename||',' FROM EMP ee WHERE e.deptno=ee.deptno )
AS strings)) empls
FROM emp e
group by deptno;
including the one with a little bit cleaner syntax
SELECT deptno,
CONCAT_LIST(CAST( COLLECT(ename) AS strings )) empls
FROM emp
group by deptno;
select distinct
deptno,
CONCAT_LIST(CURSOR(
select ename ||',' from emp ee where e.deptno = ee.deptno
)) employees
from emp e;
Product
Product is another aggregate function that is not on the list of
built-in SQL functions. Randomly browsing any mathematical book,
however, reveals that the product symbol occurs much less
frequently than summation. Therefore, the product is unlikely
to grow beyond a very narrow niche. Yet there are at least
two meaningful applications.
Factorial
If the product aggregate function were available, it would make
factorial calculation effortless
select prod(num) from integers where num <=10 -- 10!
logarithms, and then exponentiate the result. The rewritten factorial
query
select exp(sum(ln(num))) from integers where num <=10
is still very succinct.
At this point most people realize that product via logarithms doesn't
work in all cases. The ln SQL function throws an exception for
negative numbers and zero. After recognizing this fact, however, we
can fix the problem immediately with a case expression.
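One way such a case expression might look is sketched below. It assumes SQLite driven from Python's sqlite3 module; the table name nums and the helper product are hypothetical, and ln and exp are registered by hand because not every SQLite build ships them. The idea is to return zero as soon as any factor is zero, to take logarithms of absolute values only, and to restore the sign from the parity of the negative factors.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register ln and exp explicitly (assumption: the SQLite build may lack math functions).
conn.create_function("ln", 1, math.log)
conn.create_function("exp", 1, math.exp)

PRODUCT_SQL = """
select case
         when sum(case when num = 0 then 1 else 0 end) > 0 then 0.0
         else (case when sum(case when num < 0 then 1 else 0 end) % 2 = 1
                    then -1 else 1 end)
              * exp(sum(case when num = 0 then 0 else ln(abs(num)) end))
       end
from nums
"""

def product(numbers):
    """Multiply numbers with exp(sum(ln())), guarding zero and negative factors."""
    conn.execute("drop table if exists nums")
    conn.execute("create table nums (num real)")
    conn.executemany("insert into nums values (?)", [(n,) for n in numbers])
    return conn.execute(PRODUCT_SQL).fetchone()[0]

print(product([-2, 3, -4, 5]))
```

The result is a floating point number, so it should be compared with a tolerance rather than for exact equality.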
Numbers in SQL
If you feel uneasy about fixing an elegant
exp(sum(ln())) expression with case analysis, it's
an indicator that you as a programmer have reached a
certain maturity level. Normally, there are no
problems when generalizing clean and simple
concepts in math. What exactly is broken?
The problem is that the logarithm is a
multivalued function defined on complex
numbers. For example, ln(-1) = iπ (taken on the
principal branch). Then, e^(iπ) = -1, as expected!
Joe Celko suggested one more method: calculating the factorial via the
Gamma function. A polynomial approximation with an error of less
than 3*10^-7 for 0 ≤ x ≤ 1 is:
Interpolation
Interpolation is a more pragmatic justification for the product
aggregate. Consider the following data
X Y
1 2
2 6
3
4 8
5
6
7 6
Can you guess the missing values?
When this SQL puzzle was posted on the Oracle OTN forum,
somebody immediately reacted, suggesting the lag and lead
analytic functions as a basis for calculating intermediate values by
averaging them. The problem is that the number of intermediate
values is not known in advance, so my first impression was that
the solution would require recursion or a hierarchical SQL
extension, at least. A breakthrough came from Gabe Romanescu,
who posted the following analytic SQL solution:
select X,Y
,case when Y is null
then yLeft+(rn-1)*(yRight-yLeft)/cnt
else Y end /*as*/ interp_Y
from (
select X, Y
,count(*) over (partition by grpa) cnt
,row_number() over (partition by grpa order by X) rn
,avg(Y) over (partition by grpa) yLeft
,avg(Y) over (partition by grpd) yRight
from (
select X, Y
,sum(case when Y is null then 0 else 1 end)
over (order by X) grpa
,sum(case when Y is null then 0 else 1 end)
over (order by X desc) grpd
from data
)
);
As usual with complex queries that have multiple levels of inner
views, the best way to understand this one is to execute it in small
increments. The innermost query
select X, Y
,sum(case when Y is null then 0 else 1 end)
over (order by X) grpa
,sum(case when Y is null then 0 else 1 end)
over (order by X desc) grpd
from data;
X  Y  GRPA  GRPD
1  2  1     4
2  6  2     3
3     2     2
4  8  3     2
5     3     1
6     3     1
7  6  4     1
introduces two new columns, with the sole purpose to group spans
with missing values together.
select X, Y
,count(*) over (partition by grpa) cnt
,row_number() over (partition by grpa order by X) rn
,avg(Y) over (partition by grpa) yLeft
,avg(Y) over (partition by grpd) yRight
from (
select X, Y
,sum(case when Y is null then 0 else 1 end)
over (order by X) grpa
,sum(case when Y is null then 0 else 1 end)
over (order by X desc) grpd
from data
)
X  Y  CNT  RN  YLEFT  YRIGHT
1  2  1    1   2      2
2  6  2    1   6      6
3     2    2   6      8
4  8  3    1   8      8
5     3    2   8      6
6     3    3   8      6
7  6  1    1   6      6
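calculates four more columns:
cnt: the length of each span
rn: the position of the row within its span
yLeft: the known value at the left end of the span
yRight: the known value that bounds the span on the right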
Now, everything is ready for the final step, where we apply the linear
interpolation formula
$$Y = yLeft + \frac{(yRight - yLeft)\,(X - xLeft)}{xRight - xLeft}$$
Figure 3.1: Linear interpolation formula calculates function value Y
at the intermediate coordinate X.
Note that instead of the xLeft and xRight variables we have cnt =
xRight - xLeft and rn - 1 = X - xLeft.
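To watch the whole pipeline produce numbers, the query can be run against the sample data. The sketch below is an assumption-laden illustration rather than the forum posting itself: it uses SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), declares Y as real so that the division is not truncated, and adds an order by for a stable result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or later
conn.execute("create table data (X integer, Y real)")
conn.executemany(
    "insert into data values (?, ?)",
    [(1, 2), (2, 6), (3, None), (4, 8), (5, None), (6, None), (7, 6)],
)

INTERPOLATE_SQL = """
select X,
       case when Y is null
            then yLeft + (rn - 1) * (yRight - yLeft) / cnt
            else Y end as interp_Y
from (
  select X, Y,
         count(*)     over (partition by grpa) as cnt,
         row_number() over (partition by grpa order by X) as rn,
         avg(Y)       over (partition by grpa) as yLeft,
         avg(Y)       over (partition by grpd) as yRight
  from (
    select X, Y,
           sum(case when Y is null then 0 else 1 end) over (order by X)      as grpa,
           sum(case when Y is null then 0 else 1 end) over (order by X desc) as grpd
    from data
  )
)
order by X
"""

# Round only for display; the query itself returns unrounded values.
interpolated = [(x, round(y, 2)) for x, y in conn.execute(INTERPOLATE_SQL)]
print(interpolated)
```

The missing value at X=3 lands halfway between 6 and 8, while the two missing values at X=5 and X=6 split the descent from 8 to 6 into thirds.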
What if the missing values aren't bounded by a known value at the
end of the range? We certainly can address this as a special case, at
the cost of making our query more complicated. Alternatively, this
snag is nicely solved with non-linear interpolation in general, and the
Lagrange Interpolating Polynomial in particular:
$$\sum_{j=1}^{n} y_j \prod_{k=1}^{j-1} \frac{x - x_k}{x_j - x_k} \prod_{k=j+1}^{n} \frac{x - x_k}{x_j - x_k}$$
Here is the product symbol, at last!
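Outside of SQL the formula is only a few lines long. The following is a minimal sketch in Python, with a hypothetical function name lagrange and exact rational arithmetic from the standard fractions module, so that the polynomial reproduces the known points without rounding noise. The two products of the formula are merged into a single product over all k different from j.

```python
from fractions import Fraction
from math import prod

def lagrange(points, x):
    """Evaluate the Lagrange interpolating polynomial through points at x."""
    x = Fraction(x)
    total = Fraction(0)
    for j, (xj, yj) in enumerate(points):
        # Product over all k != j of (x - xk) / (xj - xk).
        weight = prod(
            (Fraction(x - xk, xj - xk) for k, (xk, _) in enumerate(points) if k != j),
            start=Fraction(1),
        )
        total += yj * weight
    return total

# The known rows of the sample data: X = 1, 2, 4, 7.
known = [(1, 2), (2, 6), (4, 8), (7, 6)]
print(lagrange(known, 3))
```

Since the polynomial is defined everywhere, the same call also works for an x that lies outside the range of the known points.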
[Figure: four data points (x1, y1), (x2, y2), (x3, y3), (x4, y4).]
Pivot
Pivot and Unpivot are two fundamental operators that exchange
rows and columns. Pivot aggregates a set of rows into a single row
with additional columns. Informally, given the Sales relation
Product Month Amount
Shorts Jan 20
Shorts Feb 30
Shorts Mar 50
Jeans Jan 25
Jeans Feb 32
Jeans Mar 37
T-shirt Jan 10
T-shirt Feb 15
the pivot operator transforms it into a relation with fewer rows, and
some column headings changed
Product Jan Feb Mar
Shorts 20 30 50
Jeans 25 32 37
T-shirt 10 15
Unpivot is informally defined as a reverse operator, which alters this
relation back into the original one.
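In plain SQL, one common way to write pivot is conditional aggregation, with one sum(case ...) column per pivoted value. The sketch below assumes SQLite through Python's sqlite3 module and the Sales relation above; it is offered as an illustration of the technique, not as the only formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Sales (Product text, Month text, Amount integer)")
conn.executemany("insert into Sales values (?, ?, ?)", [
    ("Shorts", "Jan", 20), ("Shorts", "Feb", 30), ("Shorts", "Mar", 50),
    ("Jeans", "Jan", 25), ("Jeans", "Feb", 32), ("Jeans", "Mar", 37),
    ("T-shirt", "Jan", 10), ("T-shirt", "Feb", 15),
])

# One conditional aggregate per month; a month with no rows yields NULL.
pivoted = conn.execute("""
    select Product,
           sum(case when Month = 'Jan' then Amount end) as Jan,
           sum(case when Month = 'Feb' then Amount end) as Feb,
           sum(case when Month = 'Mar' then Amount end) as Mar
    from Sales
    group by Product
    order by Product
""").fetchall()

for row in pivoted:
    print(row)
```

Every pivoted value needs its own column expression, and the list of values has to be known when the query is written.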
Unfortunately, the approach with the straightforward query above
quickly shows its limitations. First, each column has a repetitive
syntax which is impossible to factor out. More important, however, is
our inability to accommodate a dynamic list of values. In this
example, the (full) list of months is static, but change months to
years, and we have a problem.
done easily with string concatenation. Therefore, the answer to the
problem is
select scount.*, ssum.* from (
select * from (
(select product, month || 'Cnt' as Month, amount from Sales)
pivot (count(*) for Month in ('JanCnt', 'FebCnt', 'MarCnt'))
)
) scount, (
select * from (
(select product, month || 'Sum' as Month, amount from Sales)
pivot (sum(Amount) for Month in ('JanSum', 'FebSum', 'MarSum'))
)
) ssum
where scount.product = ssum.product
2. the pivot clause were allowed in a select list, rather than being a
part of a table reference in the from clause.
select * from (
(select product, month || '_' || day as Month_Day, amount
from Sales)
pivot (count(*) for Month_Day in ('Jan_1', 'Jan_2', 'Jan_3', ...))
)
As this example demonstrates, the number of pivoted values easily
becomes unwieldy, which warrants more syntactic enhancements.
Symmetric Difference
Suppose there are two tables A and B with the same columns, and we
would like to know if there is any difference in their contents.
Relation Equality
A reader with an Object Oriented programming
background might wonder why the Relational
model in general, and SQL in particular, doesn't
have an equality operator. You'd think the first
operator you'd implement for a data type,
especially a fundamental data type, would be
equality! The reason is simple: this operator is not relationally
closed, as the result is a Boolean value and not a
relation.
[Figure: Venn diagram of A and B with the regions A\B, A∩B, and B\A.]
to be scanned twice. Then, four sort operators are applied in order
to exclude duplicates. Next, the two set differences are computed,
and, finally, the two results are combined together with the union
operator.
Duality between Set and Join Operators
For two tables A and B with the same columns, set
intersection
select * from A
intersect
select * from B
select * from (
select id, name,
sum(case when src=1 then 1 else 0 end) cnt1,
sum(case when src=2 then 1 else 0 end) cnt2
from (
select id, name, 1 src from A
union all
select id, name, 2 src from B
)
group by id, name
)
where cnt1 <> cnt2
This appeared to be a rather elegant solution [3], where each table has
to be scanned only once, until we discover that it has about the same
performance as the canonical symmetric difference query.
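The counting query is easy to try on toy data. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module and two hypothetical three-row tables; it says nothing about the performance observation above, which concerns a different engine and much larger data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("A", "B"):
    conn.execute(f"create table {table} (id integer, name text)")
conn.executemany("insert into A values (?, ?)", [(1, "x"), (2, "y"), (3, "z")])
conn.executemany("insert into B values (?, ?)", [(2, "y"), (3, "z"), (4, "w")])

# A row belongs to the symmetric difference when its counts in A and B differ.
difference = conn.execute("""
    select id, name, cnt1, cnt2 from (
      select id, name,
             sum(case when src = 1 then 1 else 0 end) cnt1,
             sum(case when src = 2 then 1 else 0 end) cnt2
      from (
        select id, name, 1 src from A
        union all
        select id, name, 2 src from B
      )
      group by id, name
    )
    where cnt1 <> cnt2
    order by id
""").fetchall()

print(difference)
```

Unlike the set-based formulation, the counts also expose rows that occur in both tables but a different number of times.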
When comparing data in two tables there are actually two questions
that one might want to ask:
1. Is there any difference? The expected answer is Boolean.
2. What are the rows that one table contains, and the other doesn't?
Question #1 can be answered faster than #2 with a hash value
based technique.
[3] Suggested by Marco Stefanetti in an exchange at the Ask Tom forum.
[4] With a dedicated | separator, in order to guarantee uniqueness.
select 1 * sum( ora_hash(id, POWER(2,16)-1) )
+ 2 * sum( ora_hash(name, POWER(2,16)-1) )
from A
Row hash values are added together with the ordinary sum aggregate,
but we could have written a modulo 2^16-1 user-defined aggregate
hash_sum in the spirit of the CRC (Cyclic Redundancy Check)
technique.
Histograms
The concept of histograms originated in statistics. Admittedly,
statistics never enjoyed the reputation of being the most exciting
math subject. I never was able to overcome a (presumably unfair)
impression that it is just a collection of ad-hoc methods. Yet a
typical database table has a huge volume of data, and histograms
provide an opportunity to present it in a compact, human-digestible
report.
Equal-Width Histogram
Equal-width histogram partitions the graph horizontally, as shown
on fig. 3.5.
Figure 3.5: Equal-width histogram partitioned the set {0,1,2,5} of
values into sets {0,1}, {2}, and {5}.
Evidently, there is something wrong here, either with the term
equal-width histogram, or with my unorthodox interpretation. First, the
partitioning is vertical; second, why do buckets have to be of
uniform size? Well, if we swap the coordinates in an attempt to
make the partitioning conform to its name, then it wouldn't be a
function any more. Also, partitioning doesn't have to be uniform: we'll
study histograms with logarithmic partitioning later.
Figure 3.6: Frequency histogram is equal-width histogram with the
finest possible partitioning of the value set.
Now that we have two kinds of objects -- the (indexed) values and
buckets -- we associate them in a query. We could ask either
what bucket a particular value falls in, or
Index# Value Bucket
0 0 0
1 0 0
2 0 0
3 1 0
4 2 1
5 5 2
As suggested in fig 3.4, the number in the Bucket column is
determined by the Value.
A simplified version of the previous query in verbose form
Assign a bucket number to each record in the IndexedValues table. The record with
value 0 is placed into 0th bucket, value 1 goes into bucket number 1, etc. For each bucket
count the number of values that fall into this bucket.
defines a frequency histogram. It is formally expressed as the celebrated
group by query
select value bucket, count(*)
from IndexedValues
group by value
Equal-Height Histogram
Equal-height histogram partitions the graph vertically, as shown on
fig. 3.7.
Value
Index#
Figure 3.7: Equal-height histogram partitioned the set {0,1,2,3,4,5} of
indexes into sets {0,1}, {2,3}, and {4,5}.
As the reader might have guessed already, the development in the rest
of the section mimics the equal-width case with the value and
index# roles reversed. There is some asymmetry, though. Unlike the
equal-width case, where partitioning values with the finest granularity
led to the introduction of the frequency histogram, there is nothing
interesting about partitioning indexes to the extreme. Each bucket
corresponds one-to-one to the index, so we don't need the
concept of a bucket anymore.
Let's proceed straight to queries. They require little thought other
than formal substitution of value by index#.
Assign a bucket number to each record in the IndexedValues table. All the records with
indexes in the range from 0 to 1 are placed into 0th bucket, indexes in the range 2 to 3 go into
bucket number 1, etc.
select index#, value, floor(index#/2) bucket
from IndexedValues
There is one subtlety for the aggregate query. The record count within
each bucket is not interesting anymore, as in our example we know
it's trivially 2. We might ask for other aggregate functions, though.
Assign a bucket number to each record in the IndexedValues table. All the records with
indexes in the range from 0 to 1 are placed into 0th bucket, indexes in the range from 2 to 3
go into bucket number 1, etc. For each bucket find the maximum and minimum value.
select floor(index#/2) bucket, min(value), max(value)
from IndexedValues
group by floor(index#/2)
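Both kinds of histogram can be checked against the six-row IndexedValues table. The sketch below assumes SQLite through Python's sqlite3 module; the column name idx is a stand-in for index#, and integer division replaces floor(index#/2), which is equivalent for non-negative indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "idx" stands in for the index# column of the text.
conn.execute("create table IndexedValues (idx integer, value integer)")
conn.executemany("insert into IndexedValues values (?, ?)",
                 list(enumerate([0, 0, 0, 1, 2, 5])))

# Frequency histogram: one bucket per distinct value.
frequency = conn.execute(
    "select value bucket, count(*) from IndexedValues "
    "group by value order by value"
).fetchall()

# Equal-height histogram: two consecutive indexes per bucket.
equal_height = conn.execute("""
    select idx / 2 bucket, min(value), max(value)
    from IndexedValues
    group by idx / 2
    order by bucket
""").fetchall()

print(frequency)
print(equal_height)
```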
Logarithmic Buckets
A bird's-eye view of big tables is essential for data analysis, with
aggregation and grouping being the tools of the trade. Standard,
equality-based grouping, however, fails to accommodate continuous
distributions of data. Consider a TaxReturns table, for example.
Would simple grouping by GrossIncome achieve anything? We'll
undoubtedly spot many individuals with identical incomes, but the
big picture would escape us because of the sheer report size.
Normally, we would like to know the distribution of incomes by
ranges of values. By choosing the ranges carefully we might hope to
produce a much more compact report.
In a capitalist system income is not bounded. There are numerous
individuals whose income goes far beyond the ranges where
most people fall. As of 2005 there were 222 billionaires in the US.
They don't all fit in the same tiny 10K bucket of incomes, of course.
Most likely each would be positioned in its own unique range, so that we
would need 222 10K ranges in order to accommodate them alone!
select power(10,floor(log(10,GrossIncome))), count(1)
from TaxReturns
group by floor(log(10,GrossIncome));
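A handful of made-up incomes is enough to see how few rows the logarithmic report needs. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module; the six incomes are invented for illustration, and log10, floor, and power are registered by hand, since SQLite spells them differently from Oracle and may not provide them at all.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register the math helpers explicitly (assumption: the build may not provide them).
conn.create_function("log10", 1, math.log10)
conn.create_function("floor", 1, math.floor)
conn.create_function("power", 2, lambda base, exponent: base ** exponent)

conn.execute("create table TaxReturns (GrossIncome integer)")
conn.executemany(
    "insert into TaxReturns values (?)",
    [(15000,), (42000,), (87000,), (120000,), (3000000,), (2000000000,)],
)

# Each bucket is identified by its lower bound, a power of ten.
buckets = conn.execute("""
    select power(10, floor(log10(GrossIncome))) lower_bound, count(1)
    from TaxReturns
    group by floor(log10(GrossIncome))
    order by lower_bound
""").fetchall()

for lower_bound, count in buckets:
    print(lower_bound, count)
```

Incomes spanning five orders of magnitude end up in four rows, whereas uniform 10K ranges would need a separate row for nearly every one of them.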
Skyline Query
Suppose you are shopping for a new car, and are specifically looking
for a big car with decent gas mileage. Unfortunately, we are trying to
satisfy the two conflicting goals. If we are querying the Cars relation
in the database, then we certainly can ignore all models that are
worse than others by both criteria. The remaining set of cars is called
the Skyline.
Figure 3.8 shows the Skyline of all cars in a sample set of 3 models.
The vertical axis is seating capacity while the horizontal axis is the
gas mileage. The vehicles are positioned in such a way that their
locations match their respective profiles. The Hummer is taller than
the Ferrari F1 racing car, which reflects the difference in their
seating accommodations: 4 vs. 1. The Ferrari protrudes to the right
of the Hummer, because it has superior gas mileage.
[Figure: car profiles plotted with Mileage (5, 10, 20) on the horizontal axis and Seats (1, 2, 4) on the vertical axis.]
Figure 3.8: Skyline of Car Profiles. Ferrari F1 loses to Roadster by
both gas mileage and seating capacity criteria.
More formally, the Skyline is defined as those points which are not
dominated by any other point. A point dominates another point if
it is as good or better in all the dimensions, and strictly better in at
least one. In our example the Roadster
with mileage=20 and seating=2 dominates the Ferrari F1 with mileage=10
and seating=1. This condition can certainly be expressed in SQL. In
our example, the Skyline query is
select * from Cars c
where not exists (
select * from Cars cc
where cc.seats >= c.seats and cc.mileage > c.mileage
or cc.seats > c.seats and cc.mileage >= c.mileage
);
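The query can be exercised on the three sample models. The sketch below assumes SQLite through Python's sqlite3 module; the column name model is hypothetical, and the Hummer's mileage of 5 is read off the figure axis rather than stated in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Cars (model text, seats integer, mileage integer)")
# Assumption: Hummer mileage = 5, taken from the axis ticks of the figure.
conn.executemany("insert into Cars values (?, ?, ?)",
                 [("Hummer", 4, 5), ("Ferrari F1", 1, 10), ("Roadster", 2, 20)])

# Keep the cars for which no other car is at least as good in both
# dimensions and strictly better in one of them.
skyline = conn.execute("""
    select model from Cars c
    where not exists (
      select * from Cars cc
      where cc.seats >= c.seats and cc.mileage > c.mileage
         or cc.seats > c.seats and cc.mileage >= c.mileage
    )
    order by model
""").fetchall()

print(skyline)
```

The Ferrari F1 drops out because the Roadster beats it on both criteria, while the Hummer and the Roadster each win on one.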
leverage an index to match it. Bitmapped indexes, which excel with
Boolean expressions similar to what we have, demand a constant on
one side of the inequality predicate.
Relational Division
Relational Division is a truly exotic operator. Search any popular
database forum for problems involving Relational Division. There
certainly would be a few, but chances are that they might not be
recognized as such. For example, chapter 1 problem 6 is an almost
literal reproduction of a message posted at the newsgroup
microsoft.public.sqlserver.programming. If nothing else, this section
should help you to quickly identify a problem as Relational
Division. After all, using the proper names is very important for
effective communication.
[5] We consider the two-dimensional case only. An interested reader is referred to more elaborate
methods in the paper by Stephan Börzsönyi, Donald Kossmann, Konrad Stocker: The Skyline
Operator. http://www.dbis.ethz.ch/research/publications/38.pdf
and a set of JobRequirements
Language
SQL
Java
then, the Cartesian Product JobApplicants × JobRequirements is
Name Language
Steve SQL
Pete Java
Kate SQL
Steve Java
Pete SQL
Kate Java
Inversely, given the JobApplicants × JobRequirements relation (called
the dividend), we could divide it by JobRequirements (the divisor) and
obtain JobApplicants (the quotient).
The analogy between relational algebra and arithmetic goes one step
further. Relational division is similar to integer division. If an integer
dividend x is not a multiple of the integer divisor y, the quotient is defined
as the maximal number q that satisfies the inequality
x ≥ q·y
or the equality
x = q·y + r
where r is called the remainder. In Relational Algebra, given a
dividend X and divisor Y, the quotient Q is defined as the maximal
relation that satisfies the inequality
X ⊇ Q × Y
or the equality
X = Q × Y ∪ R
In our example, let's reduce the JobApplicants × JobRequirements
relation to
Name Language
Steve SQL
Pete Java
Kate SQL
Kate Java
which we appropriately call ApplicantSkills. Then,
ApplicantSkills = QualifiedApplicants × JobRequirements ∪ UnqualifiedSkills
Cartesian Product
Name Language
Steve SQL
Pete Java
Kate SQL
Steve Java
Pete SQL
Kate Java
Subtract ApplicantSkills, then project the result to get the Names of all
the applicants who are not qualified. Finally, find all the applicants
who are qualified as a complement to the set of all applicants.
Formally
QualifiedApplicants = π_Name(ApplicantSkills)
 − π_Name( π_Name(ApplicantSkills) × JobRequirements − ApplicantSkills )
Translating it into SQL is quite straightforward, although the
resulting query gets quite unwieldy
select distinct Name from ApplicantSkills
minus
select Name from (
select Name, Language from (
select Name from ApplicantSkills
), (
select Language from JobRequirements
)
minus
select Name, Language from ApplicantSkills
)
[6] This is the very first time I have used the projection symbol. The Relational Algebra projection
operator π_Name(ApplicantSkills) is the succinct equivalent of the SQL query select
distinct Name from ApplicantSkills
The minus operator is essentially anti-join, so it is not surprising that
we can express it in another form that is popular in the literature
select distinct Name from ApplicantSkills i
where not exists (
select * from JobRequirements ii
where not exists (
select * from ApplicantSkills iii
where iii.Language = ii.Language
and iii.Name = i.Name
)
)
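Running the double not exists query on the sample relations confirms that it returns the quotient. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module and the ApplicantSkills and JobRequirements contents listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ApplicantSkills (Name text, Language text)")
conn.execute("create table JobRequirements (Language text)")
conn.executemany(
    "insert into ApplicantSkills values (?, ?)",
    [("Steve", "SQL"), ("Pete", "Java"), ("Kate", "SQL"), ("Kate", "Java")],
)
conn.executemany("insert into JobRequirements values (?)", [("SQL",), ("Java",)])

# An applicant qualifies when no requirement is left without a matching skill.
qualified = conn.execute("""
    select distinct Name from ApplicantSkills i
    where not exists (
      select * from JobRequirements ii
      where not exists (
        select * from ApplicantSkills iii
        where iii.Language = ii.Language
          and iii.Name = i.Name
      )
    )
""").fetchall()

print(qualified)
```

Only Kate has an entry for every required language, so she is the sole row of the result.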
Admittedly, this query is difficult to understand because of the
double-negative construction: a not exists clause inside another not
exists. The reason for the double negation is SQL's inability to
express universal quantification in the relational division query
Name the applicants such that for all job requirements there exists a corresponding entry in
the applicant skills
Mathematical logic is notorious for formal transformations of one
logical expression into another. In our case we rewrite
∀x(∃y f(x,y))
into
¬∃x(¬∃y f(x,y))
without too much thinking. The Relational division query becomes
Name the applicants such that there is no job requirement such that there doesn't exist a
corresponding entry in the applicant skills
which is a sloppy wording for the SQL query that we were analyzing.
A ⊆ B
is equivalent to
A \ B = ∅
Applied to our case it allows us to transform our rough first attempt
to a legitimate SQL query
select distinct Name from ApplicantSkills i
where not exists (
select Language from JobRequirements
minus
select Language from ApplicantSkills ii
where ii.Name = i.Name
)
Admittedly, this query doesn't look like relational division at all. As
we have already discussed, the relational division operator has two
inputs: the dividend relation and the divisor relation. All we have so
far is something that might measure up to the divisor role. Let's
suppress the lingering doubt, and promote the relation formally to
the divisor. In fact, why don't we even keep the JobRequirements
relation name from the previous development?
Name Language
2 2
2 4
2 6
3 3
3 6
3 9
There is one little optimization that can simplify the answer. The
ApplicantSkills.Language column can be eliminated. Projection of the
ApplicantSkills relation to the Name column is the familiar Integers
relation. Now that there is no longer any column for the equijoin
predicate s.Language = r.Language, the join condition between the
Integers and JobRequirements demands the remainder of integer
division to be 0. With the names that we mindlessly borrowed from
the sample relational division problem, the revised query
select Name from (
select num# as Name from Integers
where num# <= (select min(language) from JobRequirements)
), JobRequirements
where mod(Language,Name)=0
group by Name
having count(*) = (select count(*) from JobRequirements)
may enter an obfuscated SQL coding contest. One more step
switching to more appropriate names, and we have the final GCD
query
select Divisor from (
select num# as Divisor from Integers
where num# <= (select min(Element) from NumberSet)
), NumberSet
where mod(Element, Divisor)=0
group by Divisor
having count(*) = (select count(*) from NumberSet)
The only remaining action is finding maximum in the resulting set.
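That last step can be folded into the query as an outer max. The sketch below assumes SQLite through Python's sqlite3 module, an Integers table prefilled with 1 to 100, a hypothetical NumberSet of 12, 18, and 30, the column name num in place of num#, and the % operator in place of mod.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Integers (num integer)")
conn.executemany("insert into Integers values (?)", [(n,) for n in range(1, 101)])
conn.execute("create table NumberSet (Element integer)")
conn.executemany("insert into NumberSet values (?)", [(12,), (18,), (30,)])

# Inner query: divisors shared by every element; outer query: the greatest one.
gcd = conn.execute("""
    select max(Divisor) from (
      select Divisor from (
        select num as Divisor from Integers
        where num <= (select min(Element) from NumberSet)
      ), NumberSet
      where Element % Divisor = 0
      group by Divisor
      having count(*) = (select count(*) from NumberSet)
    )
""").fetchone()[0]

print(gcd)
```

The inner query returns the common divisors 1, 2, 3, and 6, of which the outer max keeps 6.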
Outer Union
The union operator definition in Relational Algebra has a problem.
It can be applied only to compatible relations that have identical
attributes. The Outer Union operator, invented by E. F. Codd [7], can
be applied to any pair of relations. Each relation is extended to the
schema that contains all the attributes from both relations. The
newly introduced columns are padded with NULLs. The resulting
relations have the same schema and their tuples can, therefore, be
combined together by (ordinary) union.
[7] Codd E.F. Extending the relational database model to capture more meaning. ACM Transactions
on Database Systems 4, 4 (Dec), 1979.
Dept Emp
10 Smith
20 Jones
and the Department
Dept Mgr
20 Blake
30 James
The outer union of Personnel and Department is
Dept Emp Mgr
10 Smith null
20 Jones null
20 null Blake
30 null James
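In SQL the padding has to be spelled out by hand. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module and the two sample relations above: each branch of an ordinary union supplies null for the column it lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Personnel (Dept integer, Emp text)")
conn.execute("create table Department (Dept integer, Mgr text)")
conn.executemany("insert into Personnel values (?, ?)", [(10, "Smith"), (20, "Jones")])
conn.executemany("insert into Department values (?, ?)", [(20, "Blake"), (30, "James")])

# Extend both relations to the schema (Dept, Emp, Mgr), then apply ordinary union.
outer_union = conn.execute("""
    select Dept, Emp, null as Mgr from Personnel
    union
    select Dept, null as Emp, Mgr from Department
""").fetchall()

for row in outer_union:
    print(row)
```

The row order of a union is unspecified, so the result should be compared as a set of rows.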
5.
Dept Emp Mgr
10 Smith null
20 Jones Blake
30 null James
Summary
An extensible RDBMS supports user-defined aggregate functions.
They can be either programmed as user-defined functions, or
implemented by leveraging object-relational features.
Exercises
1. Design a method to influence the order of summands of the
LIST aggregate function.
2. Check the execution plan of the symmetric difference query
in your database environment. Does it use set or join
operations? If the former, persuade the optimizer to
transform set into join. Which join method do you see in the
plan? Can you influence the nested loops anti-join [8]? Is there a
noticeable difference in performance between all the
alternative execution plans?
3. Binomial coefficient x choose y is defined by the formula
$$\binom{x}{y} = \frac{x!}{y!\,(x-y)!}$$
where n! is n-factorial. Implement it as a SQL query.
4. Write down the following query
select distinct Name from ApplicantSkills
informally in English. Try to make it sound as close as
possible to the Relational Division. What is the difference?
5. A tuple t1 subsumes another tuple t2, if t2 has more null
values than t1, and they coincide in all non-null attributes.
The minimum union of two relations is defined as an outer
union with subsequent removal of all the tuples subsumed by
the others. Suppose we have two relations R1 and R2 with all
the tuples subsumed by the others removed as well. Prove
that the outer join between R1 and R2 is the minimum union
of R1, R2 and R1 ⋈ R2.
6. Check if your RDBMS of choice supplies some ad-hoc
functions, which could be plugged into histogram queries.
Oracle users are referred to ntile and width_bucket.
7. Write SQL queries that produce natural outer union and
natural outer join.
8. Relational division is the prototypical example of a set join.
Set joins relate database elements on the basis of sets of
values, rather than single values as in a standard natural join.
Thus, the division R(x, y)/S(y) returns a set of single value
tuples
{ (x) | {y | R(x, y)} ⊇ {y | S(y)} }
More generally, one has the set-containment join of R(x, y)
and S(y, z), which returns a set of pairs
{ (x, z) | {y | R(x, y)} ⊇ {y | S(y, z)} }
In the job applicants example we can extend the
JobRequirements table with the column JobName. Then, listing all
the applicants together with jobs they are qualified for is a set
containment query. Write it in SQL.
[8] Nested Loops have acceptable performance with indexed access to the inner relation. The index
role is somewhat similar to that of the hash table in the case of the Hash Join.