Analytics Functions Demo
Analytic functions, which have been available since Oracle 8.1.6, are designed to address problems such as:
"Calculate a running total",
"Find percentages within a group",
"Top-N queries",
"Compute a moving average"
With enough effort, all of these can be achieved with standard PL/SQL.
However, analytic functions are tightly integrated into the Oracle kernel and avoid much of the overhead,
such as repeated passes over the same data, that the same logic would incur if coded in PL/SQL.
Many analytic functions are now part of the ANSI/ISO SQL standard (window functions were standardized in SQL:2003);
others, such as RATIO_TO_REPORT, are Oracle extensions.
Analytic list:
AVG CORR COUNT CUME_DIST
DENSE_RANK FIRST FIRST_VALUE LAG
LAST LAST_VALUE LEAD MAX
MIN NTILE PERCENT_RANK PERCENTILE_CONT
PERCENTILE_DISC RANK RATIO_TO_REPORT REGR_AVGX
REGR_AVGY REGR_COUNT REGR_INTERCEPT REGR_R2
REGR_SLOPE REGR_SXX REGR_SXY REGR_SYY
ROW_NUMBER STDDEV STDDEV_POP STDDEV_SAMP
SUM VAR_POP VAR_SAMP VARIANCE
Syntax
All analytic functions share this general syntax:
Analytic-Function(<Argument>,<Argument>,...)
OVER (
<Query-Partition-Clause>
<Order-By-Clause>
<Windowing-Clause>
)
Query-Partition-Clause (group)
The PARTITION BY clause logically breaks a single result set into N groups,
according to the criteria set by the partition expressions.
The words "partition" and "group" are used synonymously here.
The analytic functions are applied to each group independently; they are reset for each group.
Order-By-Clause (within group)
The ORDER BY clause specifies how the data is sorted within each group (partition).
For any function that depends on row order (running totals, rankings, LAG/LEAD), this ordering directly determines the result.
Windowing-Clause
The windowing clause defines a sliding or anchored window of rows within a group,
and the analytic function computes its value over that window only.
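As a minimal sketch of how the three clauses interact, the query below builds a small inline data set and applies SUM three ways. The names t, grp, seq_no and val are made up for illustration and are not part of the demo schema.

```sql
-- Hypothetical inline data: two groups (A and B) with a few values each.
WITH t AS (
  SELECT 'A' AS grp, 1 AS seq_no, 10 AS val FROM dual UNION ALL
  SELECT 'A', 2, 20 FROM dual UNION ALL
  SELECT 'A', 3, 30 FROM dual UNION ALL
  SELECT 'B', 1, 5  FROM dual UNION ALL
  SELECT 'B', 2, 15 FROM dual
)
SELECT grp, seq_no, val,
       -- PARTITION BY only: one total per group, repeated on every row
       SUM(val) OVER (PARTITION BY grp) AS grp_total,
       -- PARTITION BY + ORDER BY: running total that restarts for each group
       SUM(val) OVER (PARTITION BY grp ORDER BY seq_no) AS running_total,
       -- Explicit window: the current row plus the one row before it
       SUM(val) OVER (PARTITION BY grp ORDER BY seq_no
                      ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two
  FROM t
 ORDER BY grp, seq_no;
```

For group A this should give grp_total 60 on every row, running_total 10, 30, 60 and last_two 10, 30, 50; group B starts over with 5 and 20.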
Performance (changes to explain plan)
desc mydata0
Name Null? Type
------------------------------------------------------------------------ -------- ------------
CAL_MNTH NOT NULL NUMBER(28)
EMPLY_ID NOT NULL VARCHAR2(10)
ASMNTPLN_ID NOT NULL NUMBER(28)
RCGNTNLVL_ID NUMBER(28)
PREREQ_ASMNTPLN_ID NUMBER(28)
AMT NUMBER(7,2)
200602 26078 0
200602 26107 175
200602 26116 175
200602 29083 87.5
200603 26078 0
200603 26107 175
200603 26116 175
200603 29083 0
200604 26078 50
200604 26107 250
200604 26116 250
200604 29083 50
24 rows selected.
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=30 Card=26 Bytes=624)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'MYDATA0' (TABLE) (Cost=30 Card=26 Bytes=624)
2 1 INDEX (RANGE SCAN) OF 'MY_PK' (INDEX (UNIQUE)) (Cost=17 Card=26)
break on cal_mnth skip 1 dup
SELECT cal_mnth, emply_id, amt,
       -- running total across the whole result set
       SUM(amt)
         OVER (ORDER BY cal_mnth, emply_id
              ) running_total,
       -- running total that restarts for each cal_mnth
       SUM(amt)
         OVER (PARTITION BY cal_mnth
               ORDER BY emply_id
              ) reporting_period_total,
       -- row sequence within each cal_mnth
       ROW_NUMBER()
         OVER (PARTITION BY cal_mnth
               ORDER BY emply_id
              ) seq
  FROM mydata0
 WHERE 1=1
   AND emply_id IN (26078, 26107, 26116, 29083)
   AND cal_mnth BETWEEN 200601 AND 200606
 ORDER BY cal_mnth, emply_id;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=30 Card=26 Bytes=624)
1 0 WINDOW (BUFFER) (Cost=30 Card=26 Bytes=624)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'MYDATA0' (TABLE) (Cost=30 Card=26 Bytes=624)
3 2 INDEX (RANGE SCAN) OF 'MY_PK' (INDEX (UNIQUE)) (Cost=17 Card=26)
Output from the running total query listed above.
CAL_MNTH EMPLY_ID AMT RUNNING_TOTAL REPORTING_PERIOD_TOTAL SEQ
------------ ---------- --------------------- ------------- ---------------------- ----------
200601 26078 0 0 0 1
200601 26107 175 175 175 2
200601 26116 250 425 425 3
200601 29083 137.5 562.5 562.5 4
24 rows selected.
Notice/warning
If you compare these two result sets, the first looks wrong at first glance because its "running total"
is computed under a different ORDER BY than the one used to display the rows.
For most analytic functions, the ORDER BY clause is essential to understanding what is being presented.
The point of this example is to illustrate that ORDER BY clauses can and do affect the meaning of a report.
The query below lets you easily find your emply_id, since the output is sorted by that column.
You can then say: my amt was x, and the running total of the top contributors with my amount or higher is y.
In the report below, if you are employee 26107, your amount was 175, and the
running total of all people who contributed at your level or higher comes to 575 (175+400=575).
The report above is correct; it just answers a different question than a pure running total of what is displayed.
The SQL for this question would be:
200602 26078 0 0
200602 26107 175 175
200602 26116 175 350
200602 29083 87.5 437.5
200603 26078 0 0
200603 26107 175 175
200603 26116 175 350
200603 29083 0 350
200604 26078 50 50
200604 26107 250 300
200604 26116 250 550
200604 29083 50 600
200606 26078 50 50
200606 26107 175 225
200606 26116 250 475
200606 29083 87.5 562.5
24 rows selected.
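A minimal sketch of the point made in the notice above, using the four 200601 amounts from the earlier output as inline data. The WITH block m and the two column aliases are assumptions for illustration; this is not the author's original query.

```sql
-- Inline copy of the four 200601 rows shown earlier (emply_id, amt).
WITH m AS (
  SELECT 200601 AS cal_mnth, '26078' AS emply_id, 0 AS amt FROM dual UNION ALL
  SELECT 200601, '26107', 175   FROM dual UNION ALL
  SELECT 200601, '26116', 250   FROM dual UNION ALL
  SELECT 200601, '29083', 137.5 FROM dual
)
SELECT cal_mnth, emply_id, amt,
       -- accumulates in the same order the rows are displayed
       SUM(amt) OVER (PARTITION BY cal_mnth ORDER BY emply_id) AS total_by_id,
       -- accumulates from the largest amt down: "my amount or higher"
       SUM(amt) OVER (PARTITION BY cal_mnth ORDER BY amt DESC) AS total_at_or_above
  FROM m
 ORDER BY cal_mnth, emply_id;
```

Displayed in emply_id order, total_by_id reads as a natural running total (0, 175, 425, 562.5), while total_at_or_above reads 562.5, 425, 250, 562.5. The second column only looks wrong until you recall that it accumulates in amt DESC order rather than display order.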
Top-N Queries
Example 1
TOP x people in each reporting period
Break on cal_mnth skip 1
SELECT *
  FROM (
        SELECT cal_mnth, emply_id, amt,
               DENSE_RANK()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) dr,
               RANK()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) r,
               ROW_NUMBER()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) seq
          FROM mydata0
         WHERE 1=1
           -- AND emply_id IN (26078, 26107, 26116, 29083)
           AND cal_mnth BETWEEN 200601 AND 200603
       )
 WHERE dr <= 4
 ORDER BY cal_mnth, amt DESC;
Examine the data below: four people are tied for 1st and nine people are tied for 2nd by DENSE_RANK (dr).
Looked at by RANK (r), the same data reads differently:
four people are tied for 1st, nine people are tied for 5th, and then come 14th and 15th place.
The last column, seq, is a straight sequence (ROW_NUMBER) within that group.
CAL_MNTH EMPLY_ID AMT DR R SEQ
------------ ---------- --------------------- ---------- ---------- ----------
200601 09158 1500 1 1 1
03389 1500 1 1 2
28918 1500 1 1 3
27001 1500 1 1 4
27501 450 2 5 5
08201 450 2 5 6
08010 450 2 5 7
27237 450 2 5 8
27028 450 2 5 9
27866 450 2 5 10
26290 450 2 5 11
29249 450 2 5 12
26681 450 2 5 13
25585 400 3 14 14
28286 350 4 15 15
49 rows selected.
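The difference between the three ranking functions is easier to see on a handful of rows. The sketch below uses made-up inline data (the names s, emp and amt are illustrative only) with two ties.

```sql
-- Hypothetical data: two people tied at 1500, two tied at 450, one at 400.
WITH s AS (
  SELECT 'a' AS emp, 1500 AS amt FROM dual UNION ALL
  SELECT 'b', 1500 FROM dual UNION ALL
  SELECT 'c', 450  FROM dual UNION ALL
  SELECT 'd', 450  FROM dual UNION ALL
  SELECT 'e', 400  FROM dual
)
SELECT emp, amt,
       DENSE_RANK() OVER (ORDER BY amt DESC)      AS dr,   -- no gaps after ties
       RANK()       OVER (ORDER BY amt DESC)      AS r,    -- gaps after ties
       ROW_NUMBER() OVER (ORDER BY amt DESC, emp) AS seq   -- always unique
  FROM s
 ORDER BY amt DESC, emp;
```

Here dr comes out as 1, 1, 2, 2, 3, r as 1, 1, 3, 3, 5, and seq as 1 through 5. Which function you filter on therefore changes what "top 2" means: dr <= 2 keeps four rows, while r <= 2 or seq <= 2 keeps two.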
Windows
The windowing clause gives us a way to define a sliding window, within a group, on which the analytic function will operate.
The default window is an anchored window that simply starts at the first row of a group
and continues to the current row.
An ORDER BY in an analytic function implies a default window clause of RANGE UNBOUNDED PRECEDING.
That says to get all rows in our partition that came before us as specified by the ORDER BY clause,
including any rows that tie with the current row.
Windows can be based on only two criteria: RANGES of data values or ROWS offset from the current row.
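A small sketch of the default window, assuming made-up inline data (t, k and v are illustrative names). Two rows share the same ORDER BY value so that the difference between RANGE and ROWS becomes visible.

```sql
-- Hypothetical data: the two rows with k = 2 are peers under ORDER BY k.
WITH t AS (
  SELECT 1 AS k, 10 AS v FROM dual UNION ALL
  SELECT 2, 20 FROM dual UNION ALL
  SELECT 2, 30 FROM dual UNION ALL
  SELECT 3, 40 FROM dual
)
SELECT k, v,
       -- no windowing clause: the default window applies
       SUM(v) OVER (ORDER BY k) AS default_window,
       -- the default spelled out in full
       SUM(v) OVER (ORDER BY k
                    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_window,
       -- ROWS counts physical rows; v is added to the ORDER BY as a tiebreaker
       SUM(v) OVER (ORDER BY k, v
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_window
  FROM t
 ORDER BY k, v;
```

default_window and range_window should be identical: 10, 60, 60, 100, because both k = 2 rows see each other as part of the window. rows_window advances one row at a time: 10, 30, 60, 100.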
Range Windows
Range windows collect rows together based on the value of the ORDER BY expression, much as a WHERE clause would.
'RANGE 5 PRECEDING' means generate a sliding window of preceding rows in the group such
that they are within 5 units of the current row.
These units may be either numeric comparisons or date comparisons.
It is not valid to use RANGE with a numeric offset on datatypes other than numbers and dates.
Count of users created WITHIN 5 days of the current row.
29 rows selected.
. . .
In the report above, we are using a window of 5 days prior to the current row.
Because the value being compared includes a time component, not just a date,
this explains why the count jumps from 7 to 11 above.
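The query behind the report above is not shown, so the following is only a sketch of the technique on made-up inline data; the names u, username and created are assumptions.

```sql
-- Hypothetical users with creation dates (no time-of-day component).
WITH u AS (
  SELECT 'U1' AS username, DATE '2008-03-01' AS created FROM dual UNION ALL
  SELECT 'U2', DATE '2008-03-03' FROM dual UNION ALL
  SELECT 'U3', DATE '2008-03-05' FROM dual UNION ALL
  SELECT 'U4', DATE '2008-03-07' FROM dual UNION ALL
  SELECT 'U5', DATE '2008-03-20' FROM dual
)
SELECT username, created,
       -- rows whose created date lies between (created - 5) and created
       COUNT(*) OVER (ORDER BY created RANGE 5 PRECEDING) AS created_within_5_days
  FROM u
 ORDER BY created;
```

The counts should be 1, 2, 3, 3, 1. U4 does not count U1 because 01-MAR is more than 5 days before 07-MAR. With a DATE column the offset is in days, so a time-of-day component shifts the window boundary accordingly.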
Row Windows
Row windows are physical units: the physical number of rows to include in the window.
In the example below we total the AMT field for this person and the two people above them,
for example to answer: if I form teams of 3 people, what would their total be?
SELECT *
  FROM (
        SELECT cal_mnth, emply_id, amt,
               DENSE_RANK()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) dr,
               RANK()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) r,
               ROW_NUMBER()
                 OVER (PARTITION BY cal_mnth ORDER BY amt DESC) seq,
               SUM(amt)
                 OVER (PARTITION BY cal_mnth
                       ORDER BY amt DESC
                       ROWS 2 PRECEDING) you_and_2above
          FROM mydata0
         WHERE 1=1
           -- AND emply_id IN (26078, 26107, 26116)
           AND cal_mnth BETWEEN 200601 AND 200603
       )
 WHERE dr <= 4
 ORDER BY cal_mnth, amt DESC;
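To make the ROWS 2 PRECEDING arithmetic easy to check by hand, here is a sketch on four made-up rows (s, emp and amt are illustrative names, not the demo table).

```sql
-- Hypothetical data with no ties, so the row order is unambiguous.
WITH s AS (
  SELECT 'a' AS emp, 500 AS amt FROM dual UNION ALL
  SELECT 'b', 400 FROM dual UNION ALL
  SELECT 'c', 300 FROM dual UNION ALL
  SELECT 'd', 200 FROM dual
)
SELECT emp, amt,
       -- current row plus up to two physical rows before it
       SUM(amt) OVER (ORDER BY amt DESC ROWS 2 PRECEDING) AS you_and_2above
  FROM s
 ORDER BY amt DESC;
```

The column should read 500, 900, 1200, 900: the first two rows have fewer than two predecessors, and the last row drops the 500. When the ORDER BY column contains ties, as amt does in the demo data, which rows count as "the two above" is not deterministic unless a tiebreaker column such as emply_id is added to the ORDER BY.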
The following query shows when each user was created and their default tablespace.
It also shows what we used for the last user created, as well as what we will use for the next user created.
alter session set nls_date_format = 'yyyymmdd_hh24miss';
Column username format a15 trunc
29 rows selected.
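The query itself is missing from these notes. A sketch of the LAG/LEAD technique it describes might look like the following, using made-up inline rows in place of the real source (presumably a dictionary view such as DBA_USERS, which is an assumption here).

```sql
-- Hypothetical users; column names mirror USERNAME, CREATED, DEFAULT_TABLESPACE.
WITH u AS (
  SELECT 'ALICE' AS username, DATE '2008-01-10' AS created,
         'USERS' AS default_tablespace FROM dual UNION ALL
  SELECT 'BOB',   DATE '2008-02-15', 'TOOLS' FROM dual UNION ALL
  SELECT 'CAROL', DATE '2008-03-20', 'USERS' FROM dual
)
SELECT username, created, default_tablespace,
       -- value from the row created just before this one (NULL for the first row)
       LAG(default_tablespace)  OVER (ORDER BY created) AS prev_tablespace,
       -- value from the row created just after this one (NULL for the last row)
       LEAD(default_tablespace) OVER (ORDER BY created) AS next_tablespace
  FROM u
 ORDER BY created;
```

For BOB this should return USERS as both the previous and the next tablespace; ALICE has no previous row and CAROL has no next row, so those cells are NULL.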
This SQL is the basis for a true cross tab; it is close, but there are holes in the report.
SELECT cal_mnth,
       DECODE(seq,1,emply_id,null) first,
       DECODE(seq,2,emply_id,null) second,
       DECODE(seq,3,emply_id,null) third,
       DECODE(seq,1,amt,null) firstamt,
       DECODE(seq,2,amt,null) secondamt,
       DECODE(seq,3,amt,null) thirdamt
  FROM (SELECT cal_mnth, emply_id, amt,
               ROW_NUMBER()
                 OVER (PARTITION BY cal_mnth
                       ORDER BY amt DESC NULLS LAST) seq
          FROM mydata0
         WHERE 1=1
           AND cal_mnth BETWEEN 200601 AND 200603
       )
 WHERE seq <= 3
CAL_MNTH FIRST SECOND THIRD FIRSTAMT SECONDAMT THIRDAMT
---------- ---------- ---------- ---------- ---------- ---------- ----------
200601 09158 1500
03389 1500
28918 1500
9 rows selected.
The solution to make the above a cross tab is to wrap each column in MAX and GROUP BY cal_mnth.
SELECT cal_mnth,
       MAX(DECODE(seq,1,emply_id,null)) first,
       MAX(DECODE(seq,2,emply_id,null)) second,
       MAX(DECODE(seq,3,emply_id,null)) third,
       MAX(DECODE(seq,1,amt,null)) firstamt,
       MAX(DECODE(seq,2,amt,null)) secondamt,
       MAX(DECODE(seq,3,amt,null)) thirdamt
  FROM (SELECT cal_mnth, emply_id, amt,
               ROW_NUMBER()
                 OVER (PARTITION BY cal_mnth
                       ORDER BY amt DESC NULLS LAST) seq
          FROM mydata0
         WHERE 1=1
           AND cal_mnth BETWEEN 200601 AND 200606
       )
 WHERE seq <= 3
 GROUP BY cal_mnth;
6 rows selected.
NTILE Divides an ordered data set into a number of buckets.
In this example: what PK values do I need in order to divide a table into x pieces of equal row count?
10 rows selected.
Elapsed: 00:00:00.14
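The NTILE query is not included above. A sketch of the idea, assuming a generated single-column key in place of a real table (t and pk are illustrative names), could be:

```sql
-- Hypothetical table of 100 rows with primary key values 1..100.
WITH t AS (
  SELECT LEVEL AS pk FROM dual CONNECT BY LEVEL <= 100
)
SELECT bucket,
       MIN(pk)  AS low_pk,    -- first key value in the bucket
       MAX(pk)  AS high_pk,   -- last key value in the bucket
       COUNT(*) AS row_cnt
  FROM (SELECT pk,
               NTILE(4) OVER (ORDER BY pk) AS bucket   -- 4 equal-sized buckets
          FROM t)
 GROUP BY bucket
 ORDER BY bucket;
```

With 100 rows and 4 buckets, each bucket should hold 25 rows, with key ranges 1-25, 26-50, 51-75 and 76-100. The low_pk/high_pk pairs can then be used as BETWEEN boundaries to process the table in equal-sized chunks.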
The following query uses the same technique as the one above. In this example a caret (^) is used as a separator for the
values of a multi-column key. One can use this caret-separated list as boundary values so that the table can be
queried/split into 10 pieces of equal row count.
10 rows selected.
http://www.psoug.org/reference/analytic_functions.html
SUBMIT_DA NUM_VOTES
--------- ----------
23-MAR-08 100
24-MAR-08 150
25-MAR-08 75
24-MAR-08 25
26-MAR-08 50
6 rows selected.
DENSE_RANK This example ranks the rows within a group, leaving no gaps in the ranking sequence when there are ties
SELECT d.department_name, e.last_name, e.salary, DENSE_RANK()
OVER (PARTITION BY e.department_id ORDER BY e.salary) AS DENSE_RANK
FROM HR.employees e, HR.departments d
WHERE e.department_id = d.department_id
AND d.department_id IN (30, 60);
11 rows selected.
FIRST This example returns values from the rows ranked first (and, with LAST, last) by DENSE_RANK
SELECT last_name, department_id, salary,
MIN(salary) KEEP (DENSE_RANK FIRST ORDER BY commission_pct)
OVER (PARTITION BY department_id) "Worst",
MAX(salary) KEEP (DENSE_RANK LAST ORDER BY commission_pct)
OVER (PARTITION BY department_id) "Best"
FROM HR.employees
WHERE department_id IN (30, 60)
ORDER BY department_id, salary;
11 rows selected.
FIRST_VALUE This example returns the first value in an ordered set of values.
If the first value in the set is null, then the function returns NULL unless you specify IGNORE NULLS
SELECT last_name, salary, hire_date, FIRST_VALUE(hire_date)
OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lv
FROM (SELECT * FROM HR.employees WHERE department_id = 90
ORDER BY hire_date);
3 rows selected.
LEAD Provides access to a row at a given offset BEYOND the current position.
SELECT submit_date, num_votes,
LEAD(num_votes, 1, 0) OVER (ORDER BY submit_date) AS NEXT_VAL
FROM vote_count;
5 rows selected.
80 Banda 6200 .1 32
80 Johnson 6200 .1 32
80 Kumar 6100 .1 34
34 rows selected.
LAST_NAME SALARY RR
------------------------- ---------- ----------
Khoo 3100 .223021583
Baida 2900 .208633094
Tobias 2800 .201438849
Himuro 2600 .18705036
Colmenares 2500 .179856115
5 rows selected.
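The RR column above is consistent with RATIO_TO_REPORT, although the query that produced it is not shown. A sketch that reproduces those ratios from the five salaries listed, held in an inline block e (an assumption in place of the real table), is:

```sql
-- Inline copy of the five rows shown in the output above.
WITH e AS (
  SELECT 'Khoo' AS last_name, 3100 AS salary FROM dual UNION ALL
  SELECT 'Baida',      2900 FROM dual UNION ALL
  SELECT 'Tobias',     2800 FROM dual UNION ALL
  SELECT 'Himuro',     2600 FROM dual UNION ALL
  SELECT 'Colmenares', 2500 FROM dual
)
SELECT last_name, salary,
       -- each salary as a fraction of the total over all rows
       RATIO_TO_REPORT(salary) OVER () AS rr,
       -- the same value computed by hand, for comparison
       salary / SUM(salary) OVER ()    AS rr_by_hand
  FROM e
 ORDER BY salary DESC;
```

The salaries sum to 13900, so Khoo's ratio is 3100/13900, which matches the .223021583 shown above. The empty OVER () makes the whole result set one group; adding PARTITION BY would compute the ratio within each group instead.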