Advanced SQL Processing
Hands-On Workshops
ABSTRACT
This session takes attendees through advanced uses of SQL, including HAVING, FULL JOINs, and the creation of views, indexes, and data sets. Remerging and subqueries are introduced, and you will gain an understanding of the relative merits of PROC SQL and Base SAS. Finally, we touch on some of the debugging tools available with PROC SQL. We assume that you have at least one year of experience with SQL.

SUMMARY FUNCTIONS
A series of functions is provided to work down the columns. A complete list of these functions is given in Q2.5.
PROGRAM EDITOR
*Q02E13 Analysis down a column for groups;
select mean(retail) as avprice
from saved.computer;
OUTPUT
  RETAIL       VAT      GROSS
  750.00    111.70     861.70
  800.00    119.14     919.14
  950.00    141.48   1,091.48
  950.00    141.48   1,091.48
1,150.00    171.27   1,321.27
1,150.00    171.27   1,321.27
...etc
2,350.00    350.00   2,700.00
2,450.00    364.89   2,814.89
3,350.00    498.93   3,848.93
3,750.00    558.51   4,308.51
With a single argument, but with other selected columns, the function gives a result for all the rows, then merges the summary back with each row:
PROGRAM EDITOR
*Q02E15 Merges summary value onto each row of output;
select cpu, disk,
       (retail-wholesal) as profit label='Profit',
       mean(retail-wholesal) as avprofit label='Average Profit',
       (retail-wholesal) - mean(retail-wholesal) as diff label='Difference'
from saved.computer
where supplier contains 'FLOPPY';
LOG
379  select cpu,
380         disk,
381         (retail-wholesal) as profit
382         label='Profit',
383         mean(retail-wholesal) as avprofit
384         label='Average Profit',
385         (retail-wholesal) - mean(retail-wholesal) as diff
386         label='Difference'
387  from saved.computer
388  where supplier contains 'FLOPPY';
NOTE: The query requires remerging summary statistics back with the original data.
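The remerge reported in the log can be written out explicitly in SQL dialects that do not remerge automatically. The following is a minimal sketch, not PROC SQL: it uses Python's standard sqlite3 module, the table contents are hypothetical, and the supplier filter is left out for brevity. The overall mean is computed by a scalar subquery and attached to every row, which is what the remerge does behind the scenes.

```python
import sqlite3

# Hypothetical stand-in for saved.computer (values invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table computer (cpu text, disk integer, retail real, wholesal real)"
)
conn.executemany(
    "insert into computer values (?, ?, ?, ?)",
    [
        ("286", 20, 750, 550),
        ("286", 40, 800, 600),
        ("386SX", 40, 1150, 950),
        ("386DX", 100, 2350, 2050),
    ],
)

# The scalar subquery yields one value (the overall mean profit),
# which is then repeated on every detail row.
remerged = conn.execute(
    """
    select cpu, disk,
           retail - wholesal as profit,
           (select avg(retail - wholesal) from computer) as avprofit,
           (retail - wholesal)
             - (select avg(retail - wholesal) from computer) as diff
    from computer
    order by disk, cpu
    """
).fetchall()

for row in remerged:
    print(row)
```

If the outer query carried a WHERE clause, the subquery would need the same filter for the mean to cover the same rows.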
With more than one argument, the function works across its arguments within each row:
PROGRAM EDITOR
*Q02E14 More than one argument to analyze each row;
select retail format=pound10.2,
       retail * 7/47 as VAT format=pound8.2,
       sum(retail, retail*7/47) as gross format=pound10.2
from saved.computer;
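The VAT and GROSS figures printed for this query can be checked by hand. The sketch below is in Python rather than SAS and recomputes retail*7/47 and retail + retail*7/47 for a few RETAIL values from the output. It assumes that the pound format truncates to whole pence instead of rounding (as a SAS picture format does by default); that assumption is ours, but it reproduces printed values such as 119.14 and 141.48. The function name is hypothetical.

```python
# Recompute VAT and GROSS for RETAIL values taken from the printed output.
# Integer pence arithmetic with floor division truncates to 2 decimals,
# which is the assumed behaviour of the pound format.
retail_prices = [750, 800, 950, 1150, 2350, 2450, 3350, 3750]


def vat_and_gross(retail):
    """Return (vat, gross) in pounds, truncated to whole pence."""
    vat_pence = retail * 100 * 7 // 47      # retail * 7/47
    gross_pence = retail * 100 * 54 // 47   # retail + retail * 7/47
    return vat_pence / 100, gross_pence / 100


for price in retail_prices:
    vat, gross = vat_and_gross(price)
    print(f"{price:>9,.2f} {vat:>9,.2f} {gross:>10,.2f}")
```

Rounding instead of truncating would give 119.15 and 141.49 for the second and third rows, which does not match the printed output.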
NESUG 15
OUTPUT
CPU     DISK   Profit   Average Profit   Difference
286       20      200           231.25       -31.25
286       40      200           231.25       -31.25
286      100      200           231.25       -31.25
386SX     40      200           231.25       -31.25
etc...
386DX    100      300
386DX    200      500
286       60      200
SAS PROGRAMMING
To accomplish the same thing with DATA and PROC steps requires either PROC MEANS/SUMMARY to create a one-observation, one-variable data set, which is then read into the DATA step alongside saved.computer, or two passes of the data in the same DATA step:
PROGRAM EDITOR
data new;
  retain avprofit;
  /* First pass: total the profit over all rows */
  if _n_ = 1 then do;
    do until(finish);
      set saved.computer end=finish nobs=numobs;
      profit = retail - wholesal;
      totprof + profit;
    end;
    avprofit = totprof / numobs;
  end;
  /* Second pass: compare each row with the average */
  set saved.computer;
  profit = retail - wholesal;
  diff = (retail - wholesal) - avprofit;
run;
proc print data=new label;
  var cpu disk profit avprofit diff;
  label profit='Profit' avprofit='Average Profit' diff='Difference';
run;

Analyzing groups of data is performed using the GROUP BY clause of the SELECT statement. The HAVING clause also affects the result. Together they give 5 styles of query:

STYLE 1
SELECT STATEMENT: 2 items: GROUP BY variable and summary function on second variable.
RESULT: Equivalent of BY statement. Has one row for each value of GROUP BY variable. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 2
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, at least one with summary function.
RESULT: Has one row for each row in the original file, subject to WHERE or HAVING clauses. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 3
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, all with summary function.
RESULT: Has one row for each value of GROUP BY variable. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 4
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, summary function on HAVING, not SELECT.
RESULT: Has one row for each row in the original file, subject to the HAVING clause. Data is calculated for each GROUP BY value. Data ordered by GROUP BY variable.

STYLE 5
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, no summary function on SELECT or HAVING.
RESULT: GROUP BY translated into an ORDER BY option. Has one row for each row in the original table, subject to WHERE and HAVING clauses. Data ordered by GROUP BY variable.
Style 1
PROGRAM EDITOR
proc means data=saved.computer mean;
  by disk;
  var retail;
run;
PROGRAM EDITOR
*Q02E17 Group By will group by statistic on select statement;
proc sql;
select disk, mean(retail) as avgret
from saved.computer
group by disk;
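The Style 1 pattern, one output row per GROUP BY value, can be tried without a SAS session. The following is a minimal sketch using Python's standard sqlite3 module rather than PROC SQL; the rows and the second supplier name are hypothetical. The first query mirrors Q02E17, and the second adds a WHERE filter so that rows are subset before they are grouped and summed.

```python
import sqlite3

# Hypothetical stand-in for saved.computer (values invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table computer (supplier text, disk integer, retail real)")
conn.executemany(
    "insert into computer values (?, ?, ?)",
    [
        ("KETCHUP COMPUTERS", 20, 750),
        ("KETCHUP COMPUTERS", 20, 800),
        ("KETCHUP COMPUTERS", 40, 950),
        ("OTHER SUPPLIER", 40, 1150),
    ],
)

# One row per disk value, carrying the mean retail price of that group.
avg_by_disk = conn.execute(
    "select disk, avg(retail) as avgret from computer "
    "group by disk order by disk"
).fetchall()

# WHERE removes rows first; the remaining rows are then grouped and summed.
total_by_disk = conn.execute(
    "select disk, sum(retail) as totret from computer "
    "where supplier = 'KETCHUP COMPUTERS' "
    "group by disk order by disk"
).fetchall()

print(avg_by_disk)
print(total_by_disk)
```

The explicit ORDER BY is there because standard SQL does not promise that GROUP BY output arrives sorted.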
Accumulating values for a column
With traditional SAS programming, the DATA step can be programmed to count the number in a group, or to sum a variable for the unique values of another, as well as any other statistical measure:
PROGRAM EDITOR
title 'What is the total retail for each disk sold?';
proc sort data=saved.computer out=sorted;
  by disk;
run;
data unique(keep=disk totret);
  set sorted(keep=disk supplier retail);
  by disk;
  if first.disk then totret=0;
  totret+retail;
  if last.disk;
  where supplier='KETCHUP COMPUTERS';
run;
proc print data=unique;
run;
With SQL, we use the summary function SUM() and GROUP BY:
PROGRAM EDITOR
*Q02E18 Summary for each unique value in a column after subsetting;
title 'Total retail for each disk type sold';
select disk, sum(retail) as totret
from saved.computer
where supplier='KETCHUP COMPUTERS'
group by disk;

Style 2
Quite often, we need to compare individual values with the average value for the group, instead of the whole file. Traditional SAS programming would comprise:
PROGRAM EDITOR
proc sort data=saved.demograf out=demograf;
  by gender;
run;
proc means data=demograf mean noprint;
  var salary;
  by gender;
  output out=stats mean=avgsal;
run;
data lowsal highsal;
  set demograf;
  by gender;
  if first.gender then set stats;
  if salary < avgsal then output lowsal;
  else output highsal;
run;
proc print data=lowsal;
  title 'Employees with lower than average salaries';
run;
proc print data=highsal;
  title 'Employees with higher than average salaries';
run;
[Figure: program schematic: qdata.demograf -> Sort -> work.demograf -> Proc Means]
USING SQL
Having
Use HAVING when you want to perform a WHERE for groups in the data:
PROGRAM EDITOR
*Q02E19 Having allows us to compare against group average;
select gender, status, salary, avg(salary) as avgsal
from saved.demograf
group by gender
having salary > avg(salary);
LOG
NOTE: The query requires remerging summary statistics back with the original data.
OUTPUT
GENDER   STATUS   SALARY     AVGSAL
F        SEP       18000   10980.95
F        S         30000   10980.95
F        M         15000   10980.95
F        M         13000   10980.95
F        M         15000   10980.95
F        W         30000   10980.95
F        M         18000   10980.95
M        M         23000   12007.14
M        M         23000   12007.14
M        M         12300   12007.14
M        M         40000   12007.14
The SALARY and AVGSAL columns are shown to illustrate the different averages for the 2 groups.

Style 3
Let's consider the last output. We wish to calculate the average salary and average number of cars owned for each value of the gender column; moreover, we are only interested in those who earn more than 10,000 per year. How do we alter the SQL so that only 2 rows result? To do this, we need to apply summary functions to all items on the SELECT list apart from the GROUP BY variable. (The MAX and MIN statistics can be applied to character variables.) The HAVING option needs to be replaced by a WHERE clause, so that the SELECT acts on rows, not groups. Otherwise all groups would be held, resulting in all rows.
PROGRAM EDITOR
*Q02E20 One row reporting on statistics for each group;
select avg(salary) label='Average Salary' format=8.2,
       avg(cars) label='Average Number of Cars' format=3.1,
       gender
from saved.demograf
where salary > 10000
group by gender;
OUTPUT
 Average     Average Number
  Salary            of Cars   GENDER
19700.00                1.4   F
20383.33                1.3   M
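The contrast between Q02E19 (detail rows compared with their group average) and Q02E20 (one summary row per group after a row-level WHERE) can be sketched outside SAS. The block below uses Python's standard sqlite3 module, not PROC SQL, with hypothetical demograf rows. Because the SQLite engine does not remerge, the group averages are computed in a grouped subquery and joined back to the detail rows by hand.

```python
import sqlite3

# Hypothetical stand-in for saved.demograf (values invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table demograf (gender text, status text, salary real, cars integer)"
)
conn.executemany(
    "insert into demograf values (?, ?, ?, ?)",
    [
        ("F", "S", 8000, 0),
        ("F", "M", 12000, 1),
        ("F", "S", 22000, 2),
        ("M", "M", 9000, 1),
        ("M", "S", 15000, 1),
        ("M", "M", 30000, 2),
    ],
)

# Q02E19 analogue: keep detail rows whose salary beats their own group's mean.
above_avg = conn.execute(
    """
    select d.gender, d.status, d.salary, g.avgsal
    from demograf as d
    join (select gender, avg(salary) as avgsal
          from demograf
          group by gender) as g
      on d.gender = g.gender
    where d.salary > g.avgsal
    order by d.gender, d.salary
    """
).fetchall()

# Q02E20 analogue: WHERE filters rows first, then each group is summarised.
group_stats = conn.execute(
    """
    select gender, avg(salary) as avgsal, avg(cars) as avgcars
    from demograf
    where salary > 10000
    group by gender
    order by gender
    """
).fetchall()

print(above_avg)
print(group_stats)
```

The first result has one row per qualifying employee; the second has exactly one row per gender.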
Style 4
Although there is no summary function on the SELECT list, the HAVING clause does have a summary function, and the GROUP BY can group the data for calculation:
PROGRAM EDITOR
*Q02E21 Group on the sum function values with having statement;
select gender, status, salary
from saved.demograf
group by gender
having salary > avg(salary);
OUTPUT
GENDER   STATUS   SALARY
F        M         30000
F        M         15000
F        SEP       18000
F        S         30000
F        S         13000
F        M         15000
F        M         13000
F        M         15000

Style 5
Because there is no summary function on either the SELECT list or the HAVING clause, no grouping can occur, and the GROUP BY is translated into an ORDER BY clause:
PROGRAM EDITOR
*Q02E22 Group By translated as Order By since no sum function exists;
select gender, status, salary
from saved.demograf
group by status
having salary > 10000;
LOG
WARNING: A GROUP BY clause has been transformed into an ORDER BY clause because neither the SELECT clause nor the optional HAVING clause of the associated table-expression referenced a summary function.
OUTPUT
GENDER   STATUS   SALARY
M        M         12300
M        M         23000
F        M         13000
M        M         40000
F        M         30000
M        M         23000
F        S         13000
F        S         30000
M        SEP       12000

NESTED SUBQUERIES
The result of a query may be embedded inside further queries; embedded queries are termed SUBQUERIES or INNER queries. They produce either a single result or a set of values, which then becomes part of the main query. The results of a subquery are typically used as part of a WHERE or HAVING clause.

Example
Let's examine the average profit for each CPU type:
PROGRAM EDITOR
*Q02E23 Nest query within query, the inner evaluated first;
OUTPUT
CPU       PROFIT
286     194.4444
386DX        285
386SX        200
486DX        250
486SX   266.6667
Clearly some CPU types are more profitable, some less so.

Comparison to overall figures
How do we compare the average profit for each CPU type to the overall average profit for all CPU types? We need to use the HAVING option and a subquery:
PROGRAM EDITOR
*Q02E24 Compare each type to overall stats regardless of type;
select cpu,                                   /* OUTER QUERY */
       avg(retail-wholesal) as profit
from saved.computer
group by cpu
having profit > (select avg(retail-wholesal)  /* INNER QUERY */
                 from saved.computer);        /* or SUBQUERY */
OUTPUT
CPUs with Higher Profit
CPU      PROFIT
386DX       285
486DX       250
486SX   266.667
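The Q02E24 pattern, a grouped outer query filtered by a single value from an inner query, can be sketched as follows. This uses Python's standard sqlite3 module rather than PROC SQL, and the rows are hypothetical. The HAVING clause repeats the aggregate expression instead of the column alias, since not every SQL dialect accepts an alias there.

```python
import sqlite3

# Hypothetical stand-in for saved.computer (values invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table computer (cpu text, retail real, wholesal real)")
conn.executemany(
    "insert into computer values (?, ?, ?)",
    [
        ("286", 750, 600),      # profit 150
        ("286", 800, 600),      # profit 200
        ("386DX", 2350, 2050),  # profit 300
        ("386DX", 2450, 2170),  # profit 280
        ("486SX", 3350, 3090),  # profit 260
    ],
)

# Inner query: one number, the mean profit over every row (238 here).
overall = conn.execute(
    "select avg(retail - wholesal) from computer"
).fetchone()[0]

# Outer query: mean profit per CPU, kept only if it beats the overall mean.
higher_profit = conn.execute(
    """
    select cpu, avg(retail - wholesal) as profit
    from computer
    group by cpu
    having avg(retail - wholesal) >
           (select avg(retail - wholesal) from computer)
    order by cpu
    """
).fetchall()

print(overall)
print(higher_profit)
```

The inner query does not depend on the outer one, so it can be evaluated once and its result reused for every group.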
One way to accomplish this with traditional DATA and PROC steps would be as follows:
PROGRAM EDITOR
data new;
  set saved.computer;
  profit = retail - wholesal;
run;
proc summary data=new;
  class cpu;
  var profit;
  output out=new2 mean=meanprof;
run;
data new3;
  retain totavprf;
  set new2;
  /* _type_ = 0 is the overall row produced by PROC SUMMARY */
  if _type_ = 0 then totavprf = meanprof;
  if meanprof > totavprf;
run;
proc print data=new3;
run;
OUTPUT
OBS   TOTAVPRF   CPU     _TYPE_   _FREQ_   MEANPROF
  1    233.333   386DX        1       10    285.000
  2    233.333   486DX        1        4    250.000
  3    233.333   486SX        1        3    266.667
Here the power of the SQL query is seen: one query replaces 4 steps. This further example shows an inner query that returns a list of values, which is then tested with the IN operator:
PROGRAM EDITOR
*Q02E25 Mutually exclusive subset using in option;
select cpu, disk, retail - wholesal as profit
from saved.computer
where cpu in (select distinct cpu
              from saved.computer
              where retail < 1200);
OUTPUT
DISK
  20
  40
 100
  40
  60
 120
 200

Correlated Subqueries
This term means using a subquery that depends on values in the outer query.
Schematic
Select name and date from A, but only those whose height is above 1.3:
Table A
code   name    date
aa     jimmy   24dec92
bb     jack    12oct91
cc     sally   03aug65
dd     suzie   14feb78
Table B
code   height   weight
aa     1.2000       76
ee     1.3300       51
bb     1.3550       68
ff     1.2200       58
select name, date
from A
where 1.3 < (select height
             from B
             where A.code=B.code)
Process, for each row of A:
row 1: a.code=aa   b.code=aa match   1.3 < 1.2 ?  no
row 2: a.code=bb   b.code=bb match   1.3 < 1.35 ? yes
row 3: a.code=cc   b.code= no match
row 4: a.code=dd   b.code= no match
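The schematic can be run as written. The sketch below loads Tables A and B exactly as listed into an in-memory SQLite database through Python's standard sqlite3 module (not PROC SQL; the lowercase table names are ours) and executes the same correlated subquery. For codes cc and dd the inner query finds no row in B and yields a null, so the comparison is not true and those rows drop out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table a (code text, name text, date text)")
conn.execute("create table b (code text, height real, weight integer)")

# Contents of Table A and Table B as given in the schematic.
conn.executemany(
    "insert into a values (?, ?, ?)",
    [
        ("aa", "jimmy", "24dec92"),
        ("bb", "jack", "12oct91"),
        ("cc", "sally", "03aug65"),
        ("dd", "suzie", "14feb78"),
    ],
)
conn.executemany(
    "insert into b values (?, ?, ?)",
    [
        ("aa", 1.2000, 76),
        ("ee", 1.3300, 51),
        ("bb", 1.3550, 68),
        ("ff", 1.2200, 58),
    ],
)

# The inner query refers to a.code, so it is re-evaluated for each row of a.
tall = conn.execute(
    """
    select name, date
    from a
    where 1.3 < (select height from b where a.code = b.code)
    """
).fetchall()

print(tall)
```

Only the bb row qualifies, matching the "yes" in the process walkthrough.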
The inner query cannot be evaluated on its own, because it refers to values from the outer query, so it is evaluated for each row of the outer table.

Example
Records are kept on the suppliers of computer equipment, covering delivery targets, product quality and technical support provided. Here is the data:
PROGRAM EDITOR
select * from saved.compsupp;
OUTPUT
SUPPLIER
DELIVERY 8 7
PRODUCT 4 8
SUPPORT 6
Ratings are out of 10 (1=poor, 10=excellent). We would like to assess which CPU type is the most profitable, and take into account the current quality ratings of our suppliers. The outer query provides statistics on the most profitable CPUs, while the inner query chooses only those suppliers whose average rating over all quality measures is above 6.5.
PROGRAM EDITOR
*Q02E26 Outer query checks profit, inner check suppliers quality;
select supplier label='Supplier',
       cpu,
       avg(retail-wholesal) label='Average Profit' format=7.2,
       std(retail-wholesal) label='Standard Deviation' format=5.2
from saved.computer
where 6.5 < (select mean(delivery,product,support)
             from saved.compsupp
             where computer.supplier=compsupp.supplier)
group by supplier, cpu;
OUTPUT
Supplier            CPU     Average Profit   Standard Deviation
KETCHUP COMPUTERS   286
KETCHUP COMPUTERS   386DX
KETCHUP COMPUTERS   386SX
KETCHUP COMPUTERS   486DX
KETCHUP COMPUTERS   486SX

SAS is a registered trademark or a trademark of the SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Copyright 1999, 2002 Destiny Corporation. All rights reserved.

Prepared by Destiny Corporation
100 Great Meadow Rd, Suite 601
Wethersfield, CT 06109-2379
Phone: (860) 721-1684, 1-800-7TRAINING
Fax: (860) 721-9784
Web: www.destinycorp.com
Email: info@destinycorp.com
Copyright 2002