
Advanced SQL Processing


NESUG 15

Hands-On Workshops

Destiny Corporation, Wethersfield, Ct


ABSTRACT
This session will bring attendees through advanced uses of SQL, including HAVING, FULL JOINs and creation of Views, Indexes, and Data sets. The joys of re-merging and sub-queries will be introduced and you will gain an understanding of the relative merits of PROC SQL and Base SAS. Finally, we will touch on some of the debugging tools available with PROC SQL. We assume that you will have at least one year of experience with SQL.

SUMMARY FUNCTIONS
A series of functions is provided to work down the columns. A complete list of these functions is given in Q2.5.
PROGRAM EDITOR
*Q02E13 Analysis down a column for groups;
select mean(retail) as avprice
from saved.computer;

OUTPUT
 AVPRICE
1929.167

This is the equivalent of:

PROGRAM EDITOR
proc means data=saved.computer mean;
  var retail;
run;

With more than one argument, the function operates across its arguments within each row:

PROGRAM EDITOR
*Q02E14 More than one argument to analyze each row;
select retail format=pound10.2,
       retail*7/47 as VAT format=pound8.2,
       sum(retail,retail*7/47) as gross format=pound10.2
from saved.computer;

OUTPUT
   RETAIL       VAT      GROSS
   750.00    111.70     861.70
   800.00    119.14     919.14
   950.00    141.48   1,091.48
   950.00    141.48   1,091.48
 1,150.00    171.27   1,321.27
 1,150.00    171.27   1,321.27
 ...etc
 2,350.00    350.00   2,700.00
 2,450.00    364.89   2,814.89
 3,350.00    498.93   3,848.93
 3,750.00    558.51   4,308.51

With a single argument, but with other selected columns, the function gives a result for all the rows, then merges the summary back with each row:

PROGRAM EDITOR
*Q02E15 Merges summary value onto each row of output;
select cpu, disk,
       (retail-wholesal) as profit label='Profit',
       mean(retail-wholesal) as avprofit label='Average Profit',
       (retail-wholesal) - mean(retail-wholesal) as diff label='Difference'
from saved.computer
where supplier contains 'FLOPPY';

LOG
select cpu, disk,
       (retail-wholesal) as profit label='Profit',
       mean(retail-wholesal) as avprofit label='Average Profit',
       (retail-wholesal) - mean(retail-wholesal) as diff label='Difference'
from saved.computer
where supplier contains 'FLOPPY';
NOTE: The query requires remerging summary statistics back with the original data.

OUTPUT
CPU      DISK   Profit   Average Profit   Difference
286        20      200           231.25       -31.25
286        40      200           231.25       -31.25
286       100      200           231.25       -31.25
386SX      40      200           231.25       -31.25
etc...
386DX     100      300           231.25        68.75
386DX     200      500           231.25       268.75
286        60      200           231.25       -31.25

To accomplish the same thing with Data and Proc steps requires either Proc Means/Summary to create a one-observation, one-variable data set, which is then read into the data step alongside saved.computer, or two passes of the data in the same data step:

PROGRAM EDITOR
data new;
  retain avprofit;
  /* First pass: accumulate total profit over all observations */
  if _n_ = 1 then do;
    do until(finish);
      set saved.computer end=finish nobs=numobs;
      profit = retail - wholesal;
      totprof + profit;
    end;
    avprofit = totprof / numobs;
  end;
  /* Second pass: compare each observation with the average */
  set saved.computer;
  profit = retail - wholesal;
  diff = (retail - wholesal) - avprofit;
run;
proc print data=new;
  var cpu disk profit avprofit diff;
  label profit='Profit' avprofit='Average Profit' diff='Difference';
run;

An important function is COUNT(*), which gives the number of rows:

PROGRAM EDITOR
*Q02E16 The count function supplies the number of rows;
select count(*) as no_rows
from saved.computer;
select sum(retail)/count(*) as average
from saved.computer;

OUTPUT
NO_ROWS
     36

 AVERAGE
1929.167

Analyzing groups of data is performed using the GROUP BY clause on the SELECT statement. The HAVING clause also affects the result. Together they give 5 styles of query:

STYLE 1
SELECT STATEMENT: 2 items: GROUP BY variable and summary function on second variable.
RESULT: Equivalent of BY statement. Has one row for each value of GROUP BY variable. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 2
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, at least one with summary function.
RESULT: Has one row for each row in the original file, subject to WHERE or HAVING clauses. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 3
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, all of which have a summary function.
RESULT: Has one row for each value of GROUP BY variable. Data calculated for each GROUP BY value. Ordered by GROUP BY.

STYLE 4
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, summary function on HAVING not SELECT.
RESULT: Has one row for each row in the original file, subject to the HAVING clause. Data is calculated for each GROUP BY value. Data ordered by GROUP BY variable.

STYLE 5
SELECT STATEMENT: Any number of items: GROUP BY variable and several variables, no summary function on SELECT or HAVING.
RESULT: GROUP BY translated into an ORDER BY option. Has one row for each row in the original table, subject to WHERE and HAVING clauses. Data ordered by GROUP BY variable.
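The difference between a result with one row per group and a result with one row per original row can be tried outside SAS. The following is only a sketch in Python, using the standard sqlite3 module as a stand-in for PROC SQL. The table name computer, its three columns and the four sample rows are hypothetical, not the contents of saved.computer. SQLite does not remerge automatically, so the join back to the group averages is written out by hand; it is meant to approximate what the remerging NOTE in the SAS log describes.

```python
import sqlite3

# Hypothetical sample rows; these are not the values in saved.computer.
rows = [
    ("286", 20, 200.0),
    ("286", 40, 200.0),
    ("386DX", 100, 300.0),
    ("386DX", 200, 500.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE computer (cpu TEXT, disk INTEGER, profit REAL)")
conn.executemany("INSERT INTO computer VALUES (?, ?, ?)", rows)

# One row per group: every selected item is the GROUP BY column or a summary.
per_group = conn.execute(
    "SELECT cpu, AVG(profit) FROM computer GROUP BY cpu ORDER BY cpu"
).fetchall()

# One row per original row: the group average is joined back explicitly,
# a hand-written version of the remerge step.
remerged = conn.execute(
    """
    SELECT c.cpu, c.disk, c.profit, g.avgprofit, c.profit - g.avgprofit
    FROM computer AS c
    JOIN (SELECT cpu, AVG(profit) AS avgprofit
          FROM computer GROUP BY cpu) AS g
      ON c.cpu = g.cpu
    ORDER BY c.cpu, c.disk
    """
).fetchall()

print(per_group)
for row in remerged:
    print(row)
```

The first query returns two rows and the second returns four, each carrying its group's average alongside the detail columns.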

Style 1

PROGRAM EDITOR
proc means data=saved.computer mean;
  by disk;
  var retail;
run;

PROGRAM EDITOR
*Q02E17 Group By will group by statistic on select statement;
proc sql;
select disk, mean(retail) as avgret
from saved.computer
group by disk;

OUTPUT
DISK     AVGRET
  20   1483.333
  40       1485
  60       2300
 100       1835
 120   2483.333
 200       3250

Accumulating values for a column
With traditional SAS programming, the data step can be programmed to count the number in a group, or to sum a variable for the unique values of another, as well as any other statistical measure:

PROGRAM EDITOR
title 'What is the total retail for each disk sold?';
proc sort data=saved.computer out=sorted;
  by disk;
run;
data unique(keep=disk totret);
  set sorted(keep=disk supplier retail);
  by disk;
  if first.disk then totret=0;
  totret+retail;
  if last.disk;
  where supplier='KETCHUP COMPUTERS';
run;
proc print data=unique;
run;

With SQL, we use the summary function SUM() and GROUP BY:

PROGRAM EDITOR
*Q02E18 Summary for each unique value in a column after subsetting;
title 'Total retail for each disk type sold';
select disk, sum(retail) as totret
from saved.computer
where supplier='KETCHUP COMPUTERS'
group by disk;

OUTPUT
Total retail for each disk type sold
DISK   TOTRET
  20     8150
  40     7800
  60     5600
 100     7300
 120     7450

Style 2

Comparing the averages
Quite often, we need to compare individual values with the average value for the group, instead of the whole file. Traditional SAS programming would comprise:

PROGRAM EDITOR
proc sort data=saved.demograf out=demograf;
  by gender;
run;
proc means data=demograf mean noprint;
  var salary;
  by gender;
  output out=stats mean=avgsal;
run;
data lowsal highsal;
  set demograf;
  by gender;
  if first.gender then set stats;
  if salary < avgsal then output lowsal;
  else output highsal;
run;
proc print data=lowsal;
  title 'Employees with lower than average salaries';
run;
proc print data=highsal;
  title 'Employees with higher than average salaries';
run;

[Program schematic: qdata.demograf, Sort, work.demograf, Proc Means, stats file, Data Step, Proc Print Report one, Proc Print Report 2]
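The traditional flow for comparing against group averages has two stages: first one average per group, then a row-by-row comparison against the matching average. As a rough illustration only, the same two stages can be sketched in plain Python. The (gender, salary) records below are hypothetical and are not taken from saved.demograf; the names people, avgsal, lowsal and highsal are chosen here to echo the data set names in the SAS steps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (gender, salary) records; not the contents of saved.demograf.
people = [
    ("F", 9000), ("F", 18000), ("F", 30000),
    ("M", 12300), ("M", 8000), ("M", 40000),
]

# Stage 1 (the role of PROC MEANS with BY): one average per group.
by_gender = defaultdict(list)
for gender, salary in people:
    by_gender[gender].append(salary)
avgsal = {gender: mean(salaries) for gender, salaries in by_gender.items()}

# Stage 2 (the role of the DATA step): route each record by comparing it
# with its own group's average; ties go to highsal, as in the ELSE branch.
lowsal = [p for p in people if p[1] < avgsal[p[0]]]
highsal = [p for p in people if p[1] >= avgsal[p[0]]]

print(avgsal)
print(lowsal)
print(highsal)
```

Keeping the group averages in a separate lookup mirrors the stats data set that the DATA step reads once per BY group.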

USING SQL

Having
Use HAVING when you want to perform a WHERE for groups in the data:

PROGRAM EDITOR
*Q02E19 Having allows us to compare against group average;
select gender, status, salary, avg(salary) as avgsal
from saved.demograf
group by gender
having salary>avg(salary);

LOG
The query requires remerging summary statistics back with the original data.

OUTPUT
GENDER   STATUS   SALARY     AVGSAL
F        SEP       18000   10980.95
F        S         30000   10980.95
F        M         15000   10980.95
F        M         13000   10980.95
F        M         15000   10980.95
F        W         30000   10980.95
F        M         18000   10980.95
M        M         23000   12007.14
M        M         23000   12007.14
M        M         12300   12007.14
M        M         40000   12007.14

Each salary is compared to each gender's average salary. The Salary and Avgsal columns have been shown to illustrate the different averages for the 2 groups.

Style 3
Let's consider the last output:

OUTPUT
GENDER   STATUS   SALARY     AVGSAL
F        SEP       18000   10980.95
F        S         30000   10980.95
F        M         15000   10980.95
F        M         13000   10980.95

What if we were only concerned with the last rows of gender?

We wish to calculate the average salary and average number of cars owned for each value of the gender column; moreover, we are only interested in those who earn more than 10,000 per year. How do we alter the SQL so that only 2 rows result? To do this, we need to apply summary functions to all items on the SELECT list. The MAX and MIN statistics can be applied to character variables.

The HAVING option needs to be replaced by a WHERE clause, so that the SELECT acts on rows, not groups. Otherwise all groups would be held, resulting in all rows.

PROGRAM EDITOR
*Q02E20 One row reporting on statistics for each group;
select avg(salary) label='Average Salary' format=8.2,
       avg(cars) label='Average Number of Cars' format=3.1,
       gender
from saved.demograf
where salary > 10000
group by gender;

OUTPUT
 Average   Average Number
  Salary          of Cars   GENDER
19700.00              1.4   F
20383.33              1.3   M

Style 4
Although there is no summary function on the SELECT list, the HAVING clause does have a summary function, and the GROUP BY can group the data for calculation:

PROGRAM EDITOR
*Q02E21 Group on the sum function values with having statement;
select gender, status, salary
from saved.demograf
group by gender
having salary>avg(salary);

OUTPUT
GENDER   STATUS   SALARY
F        M         30000
F        M         15000
F        SEP       18000
F        S         30000
F        S         13000
F        M         15000
F        M         13000
F        M         15000

Style 5
Because there is no summary function on either the SELECT list or the HAVING clause, no grouping can occur, and the GROUP BY is translated into an ORDER BY clause:

PROGRAM EDITOR
*Q02E22 Group By translated as Order By since no sum function exists;
select gender, status, salary
from saved.demograf
group by status
having salary>10000;

LOG
WARNING: A GROUP BY clause has been transformed into an ORDER BY clause because neither the SELECT clause nor the optional HAVING clause of the associated table-expression referenced a summary function.

OUTPUT
GENDER   STATUS   SALARY
M        M         12300
M        M         23000
F        M         13000
M        M         40000
F        M         30000
M        M         23000
F        S         13000
F        S         30000
M        SEP       12000

NESTED SUBQUERIES

The result of a query may be embedded inside further queries; embedded queries are termed SUBQUERIES or INNER queries. They produce either single results or a set of values which are then part of the main query. The results of a subquery are typically used as part of a WHERE or HAVING clause. The inner query is evaluated before the outer query.

Example
Let's examine the average profit for each CPU type:

PROGRAM EDITOR
*Q02E23 Nest query within query, the inner evaluated first;
select cpu, avg(retail-wholesal) as profit
from saved.computer
group by cpu;

OUTPUT
CPU        PROFIT
286      194.4444
386DX         285
386SX         200
486DX         250
486SX    266.6667

Clearly some CPU types are more profitable, some less so.

Comparison to overall figures
How do we compare the average profit for each CPU type to the overall average profit for all CPU types? We need to use the HAVING option and a subquery:

PROGRAM EDITOR
*Q02E24 Compare each type to overall stats regardless of type;
select cpu,                                      /* OUTER QUERY */
       avg(retail-wholesal) as profit
from saved.computer
group by cpu
having profit > (select avg(retail-wholesal)     /* INNER QUERY */
                 from saved.computer);           /* or SUBQUERY */

OUTPUT
Result of INNER QUERY
233.3333

OUTPUT
CPUs with Higher Profit
CPU       PROFIT
386DX        285
486DX        250
486SX    266.667

One way to accomplish this in traditional database processing steps would be as follows:

PROGRAM EDITOR
data new;
  set saved.computer;
  profit = retail - wholesal;
run;
proc summary data=new;
  class cpu;
  var profit;
  output out=new2 mean=meanprof;
run;
data new3;
  retain totavprf;
  set new2;
  /* _type_ = 0 is the overall row; keep its mean for comparison */
  if _type_ = 0 then totavprf = meanprof;
  if meanprof > totavprf;
run;
proc print data=new3;
run;

OUTPUT
OBS   TOTAVPRF   CPU     _TYPE_   _FREQ_   MEANPROF
  1    233.333   386DX        1       10    285.000
  2    233.333   486DX        1        4    250.000
  3    233.333   486SX        1        3    266.667

Here the power of the SQL query is seen, one query replacing 4 steps.

This further example shows an inner query that results in a list of values which is further compared using an IN operator:

PROGRAM EDITOR
*Q02E25 Mutually exclusive subset using in option;
select cpu, disk, retail - wholesal as profit
from saved.computer
where cpu in (select distinct cpu
              from saved.computer
              where retail < 1200);

Inner query result:

OUTPUT
CPU
286
386SX
386DX

Full query result:

OUTPUT
CPU      DISK   PROFIT
286        20      200
286        40      200
286       100      200
386SX      40      200
etc...
386DX      60      400
386DX     120      250
386DX     200      400

Correlated Subqueries
This term means using a subquery that depends on values in the outer query.

Schematic
Select name and date from A, but only those whose height is above 1.3:

Table A
code   name    date
aa     jimmy   24dec92
bb     jack    12oct91
cc     sally   03aug65
dd     suzie   14feb78

Table B
code   height   weight
aa     1.2000       76
ee     1.3300       51
bb     1.3550       68
ff     1.2200       58

select name, date
from A
where 1.3 < (select height
             from B
             where A.code=B.code)

Process, for each row of A:
row 1:   a.code=aa   b.code=aa   match      1.3 < 1.2 ?    no
row 2:   a.code=bb   b.code=bb   match      1.3 < 1.35 ?   yes
row 3:   a.code=cc   b.code=     no match
row 4:   a.code=dd   b.code=     no match
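The row-by-row process for the schematic can be checked with a small runnable sketch. It uses Python's standard sqlite3 module in place of PROC SQL, which is an assumption of this illustration rather than part of the workshop material. Tables A and B are loaded with the schematic's values, with the pairing of codes and heights in B read from the flattened table and from the process walk-through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (code TEXT, name TEXT, date TEXT)")
conn.execute("CREATE TABLE B (code TEXT, height REAL, weight INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?, ?)", [
    ("aa", "jimmy", "24dec92"),
    ("bb", "jack", "12oct91"),
    ("cc", "sally", "03aug65"),
    ("dd", "suzie", "14feb78"),
])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)", [
    ("aa", 1.2, 76),
    ("ee", 1.33, 51),
    ("bb", 1.355, 68),
    ("ff", 1.22, 58),
])

# The inner query refers to A.code, so it is re-evaluated for each row of A.
# Where B has no matching code the subquery yields NULL and the row is dropped.
result = conn.execute(
    "SELECT name, date FROM A "
    "WHERE 1.3 < (SELECT height FROM B WHERE A.code = B.code)"
).fetchall()

print(result)
```

Only the row for code bb passes, which agrees with the single "yes" in the process walk-through.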

The inner query cannot be evaluated directly, so the initial evaluation takes place for all rows in the table.

Example
Records are kept on the suppliers of computer equipment pertaining to delivery targets, product quality and technical support provided. Here is the data:

PROGRAM EDITOR
select * from saved.compsupp;

OUTPUT
SUPPLIER            DELIVERY   PRODUCT   SUPPORT
FLOPPY COMPUTERS           8         4         6
KETCHUP COMPUTERS          7         8         7

Ratings are out of 10, 1=poor, 10=excellent. We would like to assess which CPU type is the most profitable, and take into account the current quality ratings of our suppliers. The outer query provides statistics on the most profitable CPUs, while the inner query chooses only those suppliers whose average rating over all quality measures is above 6.5.

PROGRAM EDITOR
*Q02E26 Outer query checks profit, inner check suppliers quality;
select supplier label='Supplier',
       cpu,
       avg(retail-wholesal) label='Average Profit' format=7.2,
       std(retail-wholesal) label='Standard Deviation' format=5.2
from saved.computer
where 6.5 < (select mean(delivery,product,support)
             from saved.compsupp
             where computer.supplier=compsupp.supplier)
group by supplier,cpu;

OUTPUT
                            Average    Standard
Supplier            CPU      Profit   Deviation
KETCHUP COMPUTERS   286      187.50       25.00
KETCHUP COMPUTERS   386DX    275.00       98.74
KETCHUP COMPUTERS   386SX    200.00        0.00
KETCHUP COMPUTERS   486DX    233.33       57.74
KETCHUP COMPUTERS   486SX    400.00

This concludes the session on Advanced SQL.

SAS is a registered trademark or a trademark of the SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

Copyright 1999, 2002 Destiny Corporation. All rights reserved.

Prepared by Destiny Corporation
100 Great Meadow Rd Suite 601
Wethersfield, CT 06109-2379
Phone: (860) 721-1684 1-800-7TRAINING
Fax: (860) 721-9784
Web: www.destinycorp.com
Email: info@destinycorp.com
Copyright 2002
