20-PST-044 Sas Record
Record Work
Subject Code: PST 1504
STATISTICAL DATA ANALYSIS USING SAS
I M.Sc. Statistics
Name: SNEHA TD
Department Number: 20-PST-044
Submitted to
Prof. M. Syluvai Anthony. M.Sc, M.Phil.,
CERTIFICATE
__ ____________________
SNEHA TD 20-PST-044
Examiners
CONTENTS
Exercise No    TITLE
1    CREATING A PERMANENT LIBRARY
2    CREATING A DATASET IN SAS USING DATALINES/CARDS
3    OBTAINING CONTENT OF A DATASET
4    IMPORTING AN EXCEL DATA FILE IN SAS
5    IMPORTING AN EXCEL FILE IN SAS
6    CREATING A COPY OF A DATASET
7    SUBSETTING A DATASET BASED ON IF CONDITION
8    RENAMING VARIABLES
9    KEEP VARIABLES
10   DROP VARIABLES
11   CREATING NEW VARIABLE USING IF ELSE STATEMENT
12   CREATING NEW VARIABLE USING IF ELSE IF STATEMENT
13   CREATING SUMMARY REPORT USING PROC MEANS
14   CREATING NEW VARIABLES USING MATHEMATICAL OPERATORS
15   CREATING SUMMARY USING PROC UNIVARIATE
16   CREATING FREQUENCY TABLE USING PROC FREQ
17   OBTAINING CORRELATION MATRIX USING PROC CORR
18   SORTING DATASET
19   AGGREGATING DATASET
20   TRANSPOSING DATASET
21   MULTIPLE LINEAR REGRESSION MODEL
22   STACKING DATASETS
23   MERGING DATASETS (INNER JOIN)
24   MERGING DATASETS (FULL OUTER JOIN)
25   MERGING DATASETS (LEFT OUTER JOIN)
26   MERGING DATASETS (RIGHT OUTER JOIN)
27   MERGING DATASETS (LEFT JOIN EXCLUDING INNER JOIN)
28   MERGING DATASETS (RIGHT JOIN EXCLUDING INNER JOIN)
29   MERGING DATASETS (FULL OUTER JOIN EXCLUDING INNER JOIN)
30   REMOVAL OF DUPLICATE RECORDS
31   SUBSETTING USING FIRST. AND LAST.
32   CREATING VARIABLES USING LAG FUNCTION
33   SCORING A DATASET USING REGRESSION MODEL
34   OBTAINING PREDICTED AND RESIDUALS USING OLS REGRESSION MODEL
35   CHECK FOR MULTICOLLINEARITY USING VIF AND CONDITION INDEX
36   TEST FOR NORMALITY OF ERROR
37   MODEL SELECTION IN OLS REGRESSION
38   MODEL VALIDATION
39   RETAIN STATEMENT TO CREATE ROW ID AND CUMULATIVE SUM
40   PARAMETRIC TEST – ANOVA & INDEPENDENT SAMPLE T TEST
41   FITTING OF DISTRIBUTION
42   PROC SQL – CREATING A COPY AND SORTING A DATASET
43   PROC SQL – SUBSETTING USING IN
44   PROC SQL – AGGREGATING DATASET
45   PROC SQL – MERGING DATASETS
46   PLOTTING THE TIME SERIES DATA
47   DIFFERENCING TO ACHIEVE STATIONARITY
48   FITTING OF ARIMA MODEL
49   FORECASTING FOR FUTURE TIME POINTS
50   SAS MACRO WITH INPUT ARGUMENT(S)
SAS CODE:
SAS CODE:
SAS CODE:
SAS RESULT:
SAS CODE:
SAS CODE:
CODE:
LOG WINDOW:
OUTPUT WINDOW:
AND STATEMENT:
(KEEP)
CODE:
LOG WINDOW:
OUTPUT:
(DELETE)
CODE:
LOG WINDOW:
OUTPUT:
OR STATEMENT:
(KEEP)
CODE:
LOG WINDOW:
OUTPUT:
(DELETE)
CODE:
LOG WINDOW:
OUTPUT:
“RENAMING VARIABLES”
CODE:
LOG WINDOW:
OUTPUT:
“KEEP VARIABLES”
CODE:
LOG WINDOW:
OUTPUT WINDOW:
“DROP VARIABLES”
CODE:
LOG WINDOW:
OUTPUT WINDOW:
CODE:
LOG WINDOW:
OUTPUT WINDOW:
CODE:
LOG WINDOW:
OUTPUT WINDOW:
CODE:
LOG WINDOW:
RESULT WINDOW:
CODE:
LOG WINDOW:
OUTPUT WINDOW:
CODE:
LOG WINDOW:
RESULT:
LIMIT:
RATING:
AGE:
BALANCE:
CODE:
LOG WINDOW:
RESULT:
SAS CODE:
SAS LOG:
SAS RESULT:
“SORTING DATASETS”
SAS CODE:
SAS LOG:
SAS OUTPUT:
“AGGREGATING DATASETS”
SAS CODE:
SAS LOG:
SAS OUTPUT:
“TRANSPOSING DATASET”
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS RESULT:
INTERPRETATION:
71.8% of the total variation in Credit balance is explained by the independent variables
in the model.
Overall fit of the model: The p-value is less than 0.05, so the null hypothesis that all
regression coefficients are zero is rejected at the 5% level of significance.
Individual coefficients:
H0: βj = 0
H1: βj ≠ 0
• INTERCEPT: The p-value is greater than 0.05, so the null hypothesis is not rejected.
• RATING: The p-value is greater than 0.05, so the null hypothesis is not rejected.
• LIMIT: The p-value is less than 0.05, so the null hypothesis is rejected.
• CARDS: The p-value is less than 0.05, so the null hypothesis is rejected.
• AGE: The p-value is less than 0.05, so the null hypothesis is rejected.
• EDUCATION: The p-value is greater than 0.05, so the null hypothesis is not rejected.
• CUSTOMER_MALE: The p-value is greater than 0.05, so the null hypothesis is not
rejected.
• STUDENT_IND: The p-value is less than 0.05, so the null hypothesis is rejected.
• MARRIED_IND: The p-value is greater than 0.05, so the null hypothesis is not rejected.
• CAUCASIAN: The p-value is greater than 0.05, so the null hypothesis is not rejected.
• AFRICANAMERICAN: The p-value is greater than 0.05, so the null hypothesis is not
rejected.
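A model of this form could be fitted with PROC REG. The following is only a minimal sketch: the dataset name work.credit is hypothetical, and the variable names are taken from the coefficient list above, so the names in the actual data may differ.

```sas
/* Sketch only: work.credit is an assumed dataset name. */
proc reg data=work.credit;
    /* Balance regressed on the predictors named in the interpretation */
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican;
run;
quit;
```

PROC REG prints R-square, the overall F test and a t test for each coefficient, which are the quantities read off in the interpretation.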
SAS OUTPUT:
“STACKING DATASETS”
SAS CODE:
SAS LOG:
SAS OUTPUT:
“MERGING DATASETS”
INNER JOIN:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
EXERCISE NO: 25
SAS CODE:
SAS LOG:
SAS OUTPUT:
EXERCISE NO: 26
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS RESULT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS RESULT:
SAS CODE:
SAS RESULT:
INTERPRETATION:
• The “vif” option prints the variance inflation factor (VIF) with the parameter
estimates; the VIF is the reciprocal of the tolerance.
• The “collin” option produces a detailed analysis of collinearity among the
regressors.
• From the parameter estimates table, the independent variables Rating and Limit have
VIF values greater than 10, which indicates that they are severely affected by
multicollinearity. In the collinearity diagnostics table, large values in the condition
index column indicate potential collinearity. The condition index for the 11th row is
167.20266, which is greater than 30, and the variance proportions for Rating and Limit
in that row are greater than 0.50, so these two variables are involved in a
multicollinear relationship.
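Both diagnostics are requested as options on the MODEL statement of PROC REG. A minimal sketch, assuming a hypothetical dataset work.credit with the variable names used in this record:

```sas
/* Sketch only: work.credit is an assumed dataset name. */
proc reg data=work.credit;
    /* vif    : variance inflation factor for each regressor      */
    /* tol    : tolerance, the reciprocal of the VIF               */
    /* collin : condition indices and variance proportions table  */
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican
                    / vif tol collin;
run;
quit;
```

For regressor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing that variable on the remaining regressors, so a VIF above 10 corresponds to R_j^2 above 0.90.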
SAS CODE:
SAS RESULT:
INTERPRETATION:
From the Tests for Normality table, the p-values of the Shapiro-Wilk, Cramér-von Mises
and Anderson-Darling tests are all less than 0.05. Hence we reject the null hypothesis
and conclude that the residuals do not follow a normal distribution.
In the P-P plot and the Q-Q plot, some points deviate from the straight line, which also
indicates that the residuals do not follow a normal distribution.
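The tests and plots referred to above could be obtained roughly as follows. This is a sketch under assumptions: work.credit, work.credit_resid and the variable resid are hypothetical names, and the model is the full regression used earlier in this record.

```sas
/* Step 1: save the residuals of the regression (assumed dataset names). */
proc reg data=work.credit noprint;
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican;
    output out=work.credit_resid r=resid;
run;
quit;

/* Step 2: normality tests plus Q-Q and P-P plots of the residuals. */
proc univariate data=work.credit_resid normal;
    var resid;
    qqplot resid / normal(mu=est sigma=est);
    ppplot resid / normal(mu=est sigma=est);
run;
```

The NORMAL option on the PROC UNIVARIATE statement produces the Tests for Normality table.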
FORWARD SELECTION:
SAS CODE:
The “slentry” option specifies the significance level for entry into the model; it is
used by the forward and stepwise selection methods.
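As an illustration of the option (not a reproduction of the code used for the output below), a forward selection run might look like this, assuming a hypothetical dataset work.credit:

```sas
/* Sketch only: work.credit is an assumed dataset name. */
proc reg data=work.credit;
    /* A variable enters only if its p-value is below slentry */
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican
                    / selection=forward slentry=0.05;
run;
quit;
```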
SAS LOG:
SAS OUTPUT:
INTERPRETATION:
In the forward selection method, the independent variables that are significant at the
5% level of significance (i.e. Rating, student_ind, Age, Cards, Limit) enter the model.
BACKWARD ELIMINATION:
SAS CODE:
The “slstay” option specifies the significance level for staying in the model; it is
used by the backward and stepwise methods.
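As an illustration of the option (not a reproduction of the code used for the output below), a backward elimination run might look like this, assuming a hypothetical dataset work.credit:

```sas
/* Sketch only: work.credit is an assumed dataset name. */
proc reg data=work.credit;
    /* Start from the full model; drop variables whose p-value exceeds slstay */
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican
                    / selection=backward slstay=0.05;
run;
quit;
```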
SAS LOG:
SAS OUTPUT:
INTERPRETATION:
In the backward elimination method, the independent variables that are not significant
at the 5% level of significance (i.e. Caucasian, AfricanAmerican, Rating, Customer_male,
Education, Married_ind) are removed from the model.
STEPWISE SELECTION:
SAS CODE:
SAS LOG:
SAS OUTPUT:
INTERPRETATION:
In the stepwise selection method, the independent variables that are significant at the
5% level of significance (i.e. student_ind, Age, Cards and Limit) enter the model, and
the independent variable that is no longer significant at the 5% level (i.e. Rating) is
removed from the model.
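Stepwise selection uses an entry threshold and a stay threshold together. A minimal sketch, assuming a hypothetical dataset work.credit and a 5% level for both thresholds:

```sas
/* Sketch only: work.credit is an assumed dataset name. */
proc reg data=work.credit;
    /* slentry controls which variables may enter;              */
    /* slstay controls which entered variables are kept later.  */
    model Balance = Rating Limit Cards Age Education Customer_male
                    Student_ind Married_ind Caucasian AfricanAmerican
                    / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```

With these settings a variable such as Rating can enter at an early step and be dropped again once other variables are in the model, which matches the behaviour described above.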
“MODEL VALIDATION”
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
OVERALL INTERPRETATION:
We divided the dataset into train (80%) and test (20%) data and built a regression
model on both datasets. Using “proc score”, we scored the two datasets, and using
“proc means”, we calculated the mean percentage error for the train and test datasets.
Since the mean percentage error differs only slightly between the train and test
datasets, we conclude that the model has good predictive power.
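The workflow described above can be sketched as follows. Everything here is an assumption made for illustration: the dataset names, the seed, the four predictors (those kept by stepwise selection), and the definition of percentage error. For brevity the sketch fits the model on the train data only and scores the test data; scoring the train data follows the same pattern.

```sas
/* 1. Flag a random 80% of the rows (Selected = 1) and keep all rows. */
proc surveyselect data=work.credit out=work.credit_split
                  method=srs samprate=0.8 seed=12345 outall;
run;

/* 2. Split into train and test datasets. */
data work.train work.test;
    set work.credit_split;
    if Selected = 1 then output work.train;
    else output work.test;
run;

/* 3. Fit on train and store the estimates; the label names the score. */
proc reg data=work.train outest=work.reg_est noprint;
    Balance_hat: model Balance = Limit Cards Age Student_ind;
run;
quit;

/* 4. Score the test data; predictions go into Balance_hat. */
proc score data=work.test score=work.reg_est type=parms
           out=work.test_scored;
    var Limit Cards Age Student_ind;
run;

/* 5. Absolute percentage error (assumed definition), skipping zero balances. */
data work.test_err;
    set work.test_scored;
    if Balance ne 0 then
        pct_error = abs(Balance - Balance_hat) / abs(Balance) * 100;
run;

/* 6. Mean percentage error on the test data. */
proc means data=work.test_err mean;
    var pct_error;
run;
```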
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
ANOVA:
H0: There is no significant difference in average balance across ethnic groups.
H1: There is a significant difference in average balance across ethnic groups.
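A one-way ANOVA for these hypotheses might be run along the following lines. This is a sketch only: work.credit is a hypothetical dataset name, and Ethnicity is an assumed categorical variable holding the ethnic group of each customer.

```sas
/* Sketch only: dataset and variable names are assumed. */
proc anova data=work.credit;
    class Ethnicity;            /* grouping factor */
    model Balance = Ethnicity;  /* compare mean balance across groups */
run;
quit;
```

H0 is rejected at the 5% level when the p-value of the F test for Ethnicity is below 0.05.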
SAS CODE:
SAS OUTPUT:
t-TEST:
H0: There is no significant difference between the average balance of students and
non-students.
H1: There is a significant difference between the average balance of students and
non-students.
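An independent-samples t test for these hypotheses might be run along the following lines. This is a sketch only: work.credit is a hypothetical dataset name, and Student_ind is assumed to be the two-level student indicator used elsewhere in this record.

```sas
/* Sketch only: dataset and variable names are assumed. */
proc ttest data=work.credit;
    class Student_ind;  /* two groups: student and non-student */
    var Balance;        /* variable whose means are compared */
run;
```

PROC TTEST reports both the pooled and the Satterthwaite results, together with a test for equality of variances that helps decide which of the two to read.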
SAS CODE:
SAS OUTPUT:
“FITTING OF DISTRIBUTION”
SAS CODE:
SAS RESULT:
FITTING OF NORMAL DISTRIBUTION:
Interpretation:
Among the fitted distributions, the Weibull distribution is the best fit to the given
data, since it has the lowest Cramér-von Mises test statistic of all the fitted
distributions.
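Several candidate distributions can be fitted in one PROC UNIVARIATE step and compared on their goodness-of-fit statistics. A minimal sketch, assuming a hypothetical dataset work.fitdata with a strictly positive variable x; the set of candidate distributions is also an assumption.

```sas
/* Sketch only: work.fitdata and x are assumed names.               */
/* With the default threshold of 0, the lognormal, Weibull and      */
/* gamma fits require all values of x to be greater than zero.      */
proc univariate data=work.fitdata;
    var x;
    histogram x / normal lognormal weibull gamma;
run;
```

Each fitted distribution gets its own goodness-of-fit table, so the Cramér-von Mises statistics can be compared side by side.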
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS LOG:
SAS OUTPUT:
LEFT JOIN:
SAS CODE:
SAS LOG:
SAS OUTPUT:
SAS LOG:
SAS OUTPUT:
SAS CODE:
SAS OUTPUT:
SAS CODE:
SAS OUTPUT:
SAS CODE:
SAS RESULT:
SAS OUTPUT:
INTERPRETATION:
We have successfully forecast the number of air passengers for 12 months, from NOV 2020
to OCT 2021, using the procedure “proc arima”.
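A 12-month forecast with PROC ARIMA has the following general shape. This is a sketch under assumptions: work.airpass, date and passengers are hypothetical names, and the differencing and ARMA orders are illustrative, since the fitted order is not stated in the text.

```sas
/* Sketch only: dataset, variable names and model order are assumed. */
proc arima data=work.airpass;
    /* First difference of the series */
    identify var=passengers(1);
    /* Illustrative ARMA(1,1) on the differenced series */
    estimate p=1 q=1;
    /* Forecast 12 monthly periods beyond the last observation */
    forecast lead=12 interval=month id=date out=work.air_forecast;
run;
quit;
```

The OUT= dataset holds the forecasts with their standard errors and confidence limits.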
SAS CODE:
SAS RESULT:
OUTPUT DATA:
INTERPRETATION:
Using the “%Macro” and “%Mend” statements, we created a user-defined macro
“Gold_Medal_List()” that produces the gold medal list for each department.
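A macro of this kind, taking the department as its input argument, might look like the sketch below. Only the macro name comes from the text; the dataset work.students, the variables department and marks, and the rule that the highest mark wins the gold medal are all assumptions made for illustration.

```sas
/* Sketch only: dataset, variables and selection rule are assumed. */
%macro Gold_Medal_List(dept);
    /* Rank the students of one department by marks, highest first */
    proc sort data=work.students out=work.ranked;
        where department = "&dept";
        by descending marks;
    run;

    /* Keep the top-ranked student (ties are not handled here) */
    data work.gold_&dept;
        set work.ranked(obs=1);
    run;

    proc print data=work.gold_&dept noobs;
        title "Gold medal list: &dept";
    run;
    title;
%mend Gold_Medal_List;

/* Example call; the argument must also be a valid SAS name here */
%Gold_Medal_List(Statistics)
```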