Data Analysis Final Assignment
Solution:
In this case, let x be the average balance on the credit cards.
Null hypothesis: H0 : x = 500
Alternate hypothesis: H1 : x ≠ 500
One-Sample t-Test
                        Variable 1    Variable 2
Mean                    520.015       0
Variance                211378.2      0
Observations            400           2
Hypothesized Mean       500
df                      399
t Stat                  0.870674
P(T<=t) one-tail        0.192228
t Critical one-tail     1.648682
P(T<=t) two-tail        0.384456
t Critical two-tail     1.965927
From the above table, which also appears in “Ans 1” of the Excel sheet, the
two-tail P value is 0.384456, which is greater than 0.05. We therefore fail to
reject the null hypothesis at the 5% significance level (95% confidence). The
data are consistent with the assertion that the average balance is 500,
although failing to reject the null hypothesis does not prove the assertion.
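The t Stat in the table can be reproduced from the summary statistics alone.
The snippet below is a minimal sketch that uses only the values reported above;
the variable names are chosen for illustration and the critical value is copied
from the table rather than computed.

```python
import math

# Summary statistics copied from the One-Sample t-Test table.
mean = 520.015
variance = 211378.2
n = 400
hypothesized_mean = 500
t_critical_two_tail = 1.965927  # from the table, df = 399

# Standard error of the mean, then the t statistic.
standard_error = math.sqrt(variance / n)
t_stat = (mean - hypothesized_mean) / standard_error

# Two-tail decision rule: reject H0 only if |t| exceeds the critical value.
reject_h0 = abs(t_stat) > t_critical_two_tail

print(f"t Stat = {t_stat:.4f}, reject H0: {reject_h0}")
```

Because |t Stat| is well below the two-tail critical value, the decision matches
the one reached from the P value.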
2) Is there a difference between men and women as far as average balance
is concerned? Use a two-sample t-test to draw your conclusion.
Solution:
In this case, let x1 be the average balance for men and x2 the average balance
for women.
Null hypothesis: H0 : x1 = x2
Alternate hypothesis: H1 : x1 ≠ x2
Two-Sample t-Test
                                Variable 1    Variable 2
Mean                            509.8031      529.5362
Variance                        213554.6      210187.1
Observations                    193           207
Hypothesized Mean Difference    0
df                              396
t Stat                          -0.42838
P(T<=t) one-tail                0.334302
t Critical one-tail             1.648711
P(T<=t) two-tail                0.668604
t Critical two-tail             1.965973
From the above table, which also appears in “Ans 2” of the Excel sheet, the
two-tail P value is 0.668604, which is greater than 0.05. We therefore fail to
reject the null hypothesis at the 5% significance level (95% confidence) and
conclude that there is no significant difference between men and women as far
as average balance is concerned.
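The t Stat and df in the table can be checked from the group summaries. The
sketch below assumes the unequal-variance (Welch) form of the two-sample
t-test, which is an assumption: it is consistent with the reported df of 396
(a pooled test would give 398), but the Excel option used is not stated. The
function name `welch_t` is illustrative.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t statistic, Welch-Satterthwaite df) from summary statistics."""
    a = var1 / n1  # squared standard error contributed by group 1
    b = var2 / n2  # squared standard error contributed by group 2
    t_stat = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df

# Values copied from the Two-Sample t-Test table (Variable 1, Variable 2).
t_stat, df = welch_t(509.8031, 213554.6, 193, 529.5362, 210187.1, 207)
print(f"t Stat = {t_stat:.5f}, df = {round(df)}")
```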
3) Is there a difference between students and non-students as far as
average balance is concerned? Use a two-sample t-test to draw your
conclusion.
Solution:
In regard to this case,
Let us assume: Average balance for students = x1 and Average balance for
non-students = x2.
Null hypothesis: - H0 : x1 = x2
Alternate Hypothesis is: - H1 : x1 ≠ x2
Two-Sample t-Test
                                Variable 1     Variable 2
Mean                            480.3694444    876.825
Variance                        193085.1361    240101.9
Observations                    360            40
Hypothesized Mean Difference    0
df                              46
t Stat                          -4.90277866
P(T<=t) one-tail                6.08619E-06
t Critical one-tail             1.678660414
P(T<=t) two-tail                0.00001217
t Critical two-tail             2.012895599
From the above table, which also appears in “Ans 3” of the Excel sheet, the
two-tail P value is 0.00001217, which is less than 0.05. There is also a large
difference between the means of variable 1 and variable 2. We therefore reject
the null hypothesis at the 5% significance level (95% confidence) and conclude
that there is a significant difference between students and non-students as
far as average balance is concerned.
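The size of the gap can also be expressed as a 95% confidence interval for the
difference in means. This is a minimal sketch, assuming the unequal-variance
standard error and reusing the two-tail critical value reported in the table;
the interval itself is not part of the Excel output.

```python
import math

# Values copied from the Two-Sample t-Test table (Variable 1, Variable 2).
mean1, var1, n1 = 480.3694444, 193085.1361, 360
mean2, var2, n2 = 876.825, 240101.9, 40
t_critical_two_tail = 2.012895599  # from the table, df = 46

diff = mean1 - mean2
standard_error = math.sqrt(var1 / n1 + var2 / n2)

# Interval for (mean of Variable 1) - (mean of Variable 2).
lower = diff - t_critical_two_tail * standard_error
upper = diff + t_critical_two_tail * standard_error

print(f"difference = {diff:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The interval lies entirely below zero, which agrees with rejecting the null
hypothesis.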
4) It is generally assumed that if there are more credit cards then the balance on the
cards will be more. Based on this dataset, do you think this is true? Calculate a
correlation coefficient and show a scatter plot to support your answer.
Solution:
In this case, we calculate the correlation coefficient between the number of
credit cards and the balance; the result is given below. The coefficient is
positive, so balance tends to rise with the number of cards, but at 0.086456
the relationship is very weak. The data therefore give only limited support
to the assumption that more credit cards mean a higher balance.
            Column 1    Column 2
Column 1    1
Column 2    0.086456    1
[Scatter plot: Balance (vertical axis, 0 to 2500) against number of cards
(horizontal axis, 0 to 10)]
The scatter plot shows no clear pattern between the number of cards and the
balance. (Kindly refer to “Ans 4” of the Excel sheet.)
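Whether a correlation this small differs from zero can be checked with the
usual t-test for a correlation coefficient. The sketch below assumes all 400
observations were used, and the critical value of 1.966 is an approximation for
df = 398 (the earlier tables report 1.965927 for df = 399); neither figure
comes from “Ans 4”.

```python
import math

r = 0.086456   # correlation between cards and balance, from the table above
n = 400        # assumption: all 400 observations enter the correlation
t_critical_approx = 1.966  # approximate two-tail 5% critical value, df = 398

# t statistic for H0: the true correlation is zero.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
significant = abs(t_stat) > t_critical_approx

print(f"t = {t_stat:.4f}, significant at 5%: {significant}")
```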
5) Examine whether the following demographic variables influence
balance: (a) age, (b) years of education, (c) marital status. For age and
years of education, use scatter plots to depict their relationship with
balance and calculate the correlation coefficient. For the relationship
between marital status and balance, use a two-sample t-test to draw
your conclusion
Solution:
In this case, we compute the correlation coefficient of balance with age and,
separately, the correlation coefficient of balance with education. From
“Ans 5.1” of the Excel sheet, the correlation coefficient of balance and age
is 0.001835119 and the correlation coefficient of balance and education is
-0.008061576. The scatter plots for balance-age and balance-education are
shown below as well as in “Ans 5.1” of the Excel sheet.
[Scatter plot: Balance (vertical axis, 0 to 2500) against age (horizontal
axis, 0 to 120)]
From the above calculation and the scatter plot, we can infer that the
correlation coefficient of age and balance is very close to zero. Thus age and
balance are not significantly correlated.
Scatter Plot of Balance-Education
[Scatter plot: Education (vertical axis, 0 to 25) against balance (horizontal
axis, 0 to 2500)]
From the above calculation and the scatter plot, we can infer that the
correlation coefficient of education and balance is very close to zero. Thus
education and balance are not significantly correlated.
Two-Sample t-Test
                                Variable 1    Variable 2
Mean                            517.9429      523.2903
Variance                        205696.7      221735
Observations                    245           155
Hypothesized Mean Difference    0
df                              319
t Stat                          -0.11223
P(T<=t) one-tail                0.455354
t Critical one-tail             1.649644
P(T<=t) two-tail                0.910709
t Critical two-tail             1.967428
From the above table, which also appears in “Ans 5” of the Excel sheet, the
two-tail P value is 0.910709, which is greater than 0.05. We therefore fail to
reject the null hypothesis at the 5% significance level (95% confidence) and
conclude that average balance does not differ significantly by marital status,
so marital status does not appear to influence balance.
Solution:
In this case, let x1, x2 and x3 be the average balances of African American,
Asian and Caucasian cardholders respectively. Therefore,
Null hypothesis: H0 : x1 = x2 = x3
Alternate hypothesis: H1 : at least one of x1, x2, x3 differs from the others
ANOVA Test
SUMMARY
Groups      Count    Sum       Average     Variance
Column 1    99       52569     531         235839.2
Column 2    102      52256     512.3137    231748.3
Column 3    199      103181    518.4975    190922.4

ANOVA
Source of Variation    SS          df     MS          F           P-value     F crit
Between Groups         18454.2     2      9227.1      0.043443    0.957492    3.018452
Within Groups          84321458    397    212396.6
From the above table, which also appears in “Ans 6” of the Excel sheet, the
P value is 0.957492, which is greater than 0.05. We therefore fail to reject
the null hypothesis at the 5% significance level (95% confidence) and conclude
that there is no significant difference between the groups: the ethnicity of
the cardholder does not appear to matter as far as balance is concerned.
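The F statistic can be rebuilt from the SUMMARY rows alone, which is a useful
check on the ANOVA table. This is a minimal sketch using the reported counts,
sums and variances; small differences from the table are expected because the
variances are rounded.

```python
# (count, sum, variance) for each group, copied from the SUMMARY table.
groups = [
    (99, 52569, 235839.2),
    (102, 52256, 231748.3),
    (199, 103181, 190922.4),
]

n_total = sum(count for count, _, _ in groups)
grand_mean = sum(total for _, total, _ in groups) / n_total

# Between-groups sum of squares: spread of the group means around the grand mean.
ss_between = sum(
    count * (total / count - grand_mean) ** 2 for count, total, _ in groups
)
# Within-groups sum of squares: recovered from each group's sample variance.
ss_within = sum((count - 1) * variance for count, _, variance in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.1f}, F = {f_stat:.6f}")
```

An F this far below F crit (3.018452) corresponds to the large P value in the
table.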
7) A general principle that credit card companies often follow is to assign a
higher credit limit to people with a higher credit rating. Does the data
show that this principle is being followed?
Solution:
In this case, we first calculate the correlation coefficient of limit and
rating. The result is 0.99688, which indicates a very strong positive
correlation. To support this further, we use the scatter plot shown below,
which also appears in “Ans 7” of the Excel sheet.
[Scatter plot: Rating (vertical axis, 0 to 1200) against credit limit
(horizontal axis, 0 to 15000)]
From the above diagram, which also appears in “Ans 7” of the Excel sheet, we
can infer that the data are consistent with the general principle that credit
card companies assign a higher credit limit to people with a higher credit
rating. Customers with a higher credit rating clearly have a higher credit
limit.
8) Run a simple linear regression of balance on the credit limit. (Here credit
limit is the X and the balance is the Y). Report the coefficients and the R-
squared. Show a scatter plot.
Solution:
In this case, we have assumed X to be credit limit and Y to be the balance.
The result of R squared and the coefficients are mentioned below as well as
in “Ans 8” of the excel sheet.
Regression Statistics
Multiple R             0.861697267
R Square               0.74252218
Adjusted R Square      0.741875251
Standard Error         233.5849982
Observations           400

ANOVA
              df     SS          MS          F           Significance F
Regression    1      62624255    62624255    1147.764    2.5E-119
Residual      398    21715657    54561.95
Total         399    84339912
From the above tables and the calculations in “Ans 8” of the Excel sheet, the
R square is 0.742522, the intercept is -292.7904955 and the coefficient of X
is 0.171637278.
Regression Equation: Y = -292.7904955 + 0.171637278*X
Scatter Plot
[Scatter plot: Balance (vertical axis, 0 to 2500) against credit limit
(horizontal axis, 0 to 15000)]
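The regression statistics for the balance-on-limit model are tied together by
the ANOVA table: R Square is the regression share of the total sum of squares,
F is the ratio of the two mean squares, and Standard Error is the square root
of the residual mean square. The sketch below checks these relationships using
only the reported SS and df values.

```python
import math

# Values copied from the ANOVA table for the regression of Balance on Limit.
ss_regression = 62624255
ss_residual = 21715657
df_regression = 1
df_residual = 398

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total           # share of variation explained
ms_residual = ss_residual / df_residual        # residual mean square
f_stat = (ss_regression / df_regression) / ms_residual
standard_error = math.sqrt(ms_residual)        # standard error of the regression

print(f"R Square = {r_squared:.6f}, F = {f_stat:.3f}, "
      f"Standard Error = {standard_error:.4f}")
```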
9) Run a simple linear regression of balance (Y) on credit rating (X). Report
the coefficients and R-squared. Show a scatter plot.
Solution:
In this case, we have assumed X to be credit rating and Y to be the balance.
The R squared and the coefficients are given below as well as in “Ans 9” of
the Excel sheet.
Regression Statistics
Multiple R             0.863625161
R Square               0.745848418
Adjusted R Square      0.745209846
Standard Error         232.0713048
Observations           400

ANOVA
              df     SS          MS          F           Significance F
Regression    1      62904790    62904790    1167.995    1.9E-120
Residual      398    21435122    53857.09
Total         399    84339912
From the above tables and the calculations in “Ans 9” of the Excel sheet, the
R square is 0.745848418, the intercept is -390.8463418 and the coefficient of
X is 2.566240327.
Regression Equation: Y = -390.8463418 + 2.566240327*X
Scatter Plot
[Scatter plot: Balance (vertical axis, 0 to 2500) against credit rating
(horizontal axis, 0 to 1200)]
10) Consider your findings in questions 8-9. Discuss business mechanisms
to increase or decrease the balance on credit cards. Try to quantify your
answers. In this context, focus on possible specific strategies using
variables in Q8 and Q9 that the business could adopt to increase the
balance on credit cards.
Solution:
In this case, let credit rating be X1, limit be X2 and balance be Y. From the
previous scatter plots and regressions, both rating and limit are positively
related to balance: when either one is higher, balance tends to be higher, and
when either one is lower, balance tends to be lower. Using the slope from Q8,
each additional unit of credit limit is associated with about 0.17 more
balance, so raising a limit by 1,000 corresponds to roughly 172 more balance.
Using the slope from Q9, each additional rating point is associated with about
2.57 more balance, so 100 rating points correspond to roughly 257 more
balance. The two variables are also strongly related to each other: a higher
credit rating goes with a higher credit limit, which in turn goes with a
higher balance, and the opposite holds for a lower credit rating. Therefore,
in order to increase the balance on credit cards, the business should focus on
credit ratings and limits, since these are the main levers associated with a
higher balance.
11) The credit limit is provided as a consolidated amount for all the credit
cards the cardholder has. Run a multiple linear regression of Balance
(Y) on Limit and Cards as two X variables. Report the coefficients.
Discuss the effect on the balance of (a) increasing the credit limit on the
same number of cards and (b) increasing the number of cards without
altering the total credit limit.
Solution:
In this case, we have assumed X1 to be limit, X2 to be card and Y to be the
balance. The result of R squared and the coefficients are mentioned below as
well as in “Ans 11” of the excel sheet.
Regression Statistics
Multiple R             0.865188295
R Square               0.748550786
Adjusted R Square      0.74728404
Standard Error         231.1247525
Observations           400

ANOVA
              df     SS          MS          F           Significance F
Regression    2      63132707    31566354    590.9238    9.8E-120
Residual      397    21207205    53418.65
Total         399    84339912
From the above tables, which are also calculated in “Ans 11” of the Excel
sheet, the intercept is -369.0359554, the coefficient of limit is 0.171479037
and the coefficient of cards is 26.03375427. Therefore,
Regression Equation: Y = -369.0359554 + 0.171479037*X1 + 26.03375427*X2.
From this, we can conclude two things. (a) Holding the number of cards fixed,
each additional unit of credit limit (X1) is associated with an increase of
about 0.17 in balance. (b) Holding the total credit limit fixed, each
additional card (X2) is associated with an increase of about 26.03 in balance.
Both coefficients are positive, so balance moves in the same direction as
either variable.
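The two effects can be illustrated by evaluating the regression equation for
one cardholder and changing a single input at a time. This is a sketch only:
the starting values (a limit of 5000 and 3 cards) are hypothetical and chosen
for illustration, and `predicted_balance` is an illustrative helper, not
something from the Excel sheet.

```python
# Coefficients reported for the regression of Balance on Limit and Cards.
INTERCEPT = -369.0359554
COEF_LIMIT = 0.171479037
COEF_CARDS = 26.03375427

def predicted_balance(limit, cards):
    """Evaluate the fitted equation Y = intercept + b1*limit + b2*cards."""
    return INTERCEPT + COEF_LIMIT * limit + COEF_CARDS * cards

# Hypothetical cardholder, used only to illustrate the two scenarios.
base = predicted_balance(limit=5000, cards=3)
more_limit = predicted_balance(limit=6000, cards=3)  # (a) +1000 limit, same cards
more_cards = predicted_balance(limit=5000, cards=4)  # (b) +1 card, same total limit

effect_limit = more_limit - base
effect_card = more_cards - base

print(f"(a) +1000 limit: {effect_limit:+.2f}")
print(f"(b) +1 card:     {effect_card:+.2f}")
```

Because the model is linear, these differences do not depend on the starting
values chosen.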
12) Run a simple linear regression equation with Income as X and Balance
as Y. Report the coefficients. Is the coefficient of Income significantly
different from zero? What does this say about the effect of income on
balance?
Solution:
In this case, we have assumed X to be income and Y to be the balance. The
result of R squared and the coefficients are mentioned below as well as in
“Ans 12” of the excel sheet.
Regression Statistics
Multiple R             0.463656457
R Square               0.21497731
Adjusted R Square      0.213004891
Standard Error         407.8647195
Observations           400

ANOVA
              df     SS          MS          F           Significance F
Regression    1      18131167    18131167    108.9917    1.03E-22
Residual      398    66208745    166353.6
Total         399    84339912
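From the above tables and the calculations in “Ans 12” of the Excel sheet, the
R square is 0.21497731, the intercept is 246.5147506 and the coefficient of X
is 6.048363409.
Regression Equation: Y = 246.5147506 + 6.048363409*X
Null hypothesis: H0 : c1 = 0 (where c1 = coefficient of X)
Alternate hypothesis: H1 : c1 ≠ 0
In a regression with a single X variable, Significance F is also the P value
for the slope. Here it is 1.03E-22, far below 0.05, so we reject the null
hypothesis: the coefficient of income is significantly different from zero,
and a higher income is associated with a higher balance. Income alone,
however, explains only about 21% of the variation in balance.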
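The test of H0 : c1 = 0 can be checked from the ANOVA table alone. In a simple
regression the squared t statistic of the slope equals F, so the t statistic
and the standard error of the slope follow from the reported F and
coefficient. This is a sketch of that relationship; the t statistic and
standard error are derived here and are not quoted from “Ans 12”.

```python
import math

f_stat = 108.9917            # F from the ANOVA table
coef_income = 6.048363409    # reported coefficient of Income

# With one X variable, t^2 = F; the sign of t follows the sign of the slope.
t_stat = math.copysign(math.sqrt(f_stat), coef_income)
se_slope = coef_income / t_stat

print(f"t = {t_stat:.3f}, standard error of slope = {se_slope:.4f}")
```

A t statistic of this size is far beyond the two-tail critical values of about
1.97 seen in the earlier tables.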
13) Based on the equation derived in question 12, what is the estimated
balance for a person with an income of USD 100k per year?
Solution:
In this case we assume Income to be X and balance to be Y.
Regression Equation: Y= 246.5147506 + 6.048363409*X
The value of X is USD 100k.
Therefore, the estimated balance for a person is:
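The estimate can be sketched as follows. It assumes that Income is recorded in
thousands of USD, so that an income of USD 100k corresponds to X = 100; this
unit is not stated above and should be checked against the data before the
figure is relied on.

```python
# Coefficients from the regression of Balance (Y) on Income (X) in question 12.
INTERCEPT = 246.5147506
COEF_INCOME = 6.048363409

# Assumption: Income is measured in thousands of USD, so USD 100k -> X = 100.
income_x = 100

estimated_balance = INTERCEPT + COEF_INCOME * income_x
print(f"Estimated balance: {estimated_balance:.2f}")
```

If Income were instead stored in single dollars, X would be 100000 and the
estimate would be very different, so the unit matters.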
Solution:
In this case, let us assume X1 to be income, X2 to be age, X3 to be
education, X4 to be limit and Y to be balance. The result of the coefficients
are mentioned below as well as in “Ans 14” of the excel sheet.
Regression Statistics
Multiple R             0.933826525
R Square               0.872031978
Adjusted R Square      0.870736099
Standard Error         165.298439
Observations           400

ANOVA
              df     SS          MS          F           Significance F
Regression    4      73547100    18386775    672.9272    7.8E-175
Residual      395    10792812    27323.57
Total         399    84339912
From the above tables and the calculations in “Ans 14” of the Excel sheet, the
intercept is -356.4394673, the coefficient of X1 is -7.560341603, the
coefficient of X2 is -0.803431365, the coefficient of X3 is 1.055668552 and
the coefficient of X4 is 0.263715465. The P values of income and limit are
less than 0.05, so for these two variables we reject the null hypothesis that
the coefficient is zero at the 5% significance level (95% confidence). Their
coefficients are significant and they do affect the balance. On the other
hand, the P values of age and education are higher than 0.05, so for them we
fail to reject the null hypothesis. Their coefficients are not statistically
significant, and the data do not show that they affect the balance.
Y = -356.4394673 - 7.560341603*X1 - 0.803431365*X2 + 1.055668552*X3 +
0.263715465*X4
(where X1, X2, X3 and X4 are as defined above).
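Since this model has four X variables, Adjusted R Square is the fairer figure
to compare with the one-variable income model. The sketch below reproduces the
adjusted values from the reported R Square, the number of observations and the
number of predictors; the helper name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Four-predictor model above (income, age, education, limit).
adj_four = adjusted_r_squared(0.872031978, 400, 4)
# Income-only model, using the R Square reported for that regression.
adj_income = adjusted_r_squared(0.21497731, 400, 1)

print(f"Adjusted R Square, four predictors: {adj_four:.6f}")
print(f"Adjusted R Square, income only:     {adj_income:.6f}")
```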