2 Assignment For Data Analysis For Decision Making: Dipanwita Ghosh
By Dipanwita Ghosh
PGPMex APR_A’21
Q1 : A company manager says that the average balance on their credit cards is $500. Do you
think that this assertion is justified? Use a one-sample t-test to draw your conclusion.
Hence, we fail to reject the Null Hypothesis that the average balance is $500. There is not
enough evidence to say that the company manager's assertion is unjustified.
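The calculation behind a one-sample t-test can be sketched as follows. This is only an illustration under stated assumptions, not the Excel workflow used for this assignment: the function name `one_sample_t` and the small sample are hypothetical, and the P value would still have to be read from a t-table or computed with a statistics package.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    se = sd / math.sqrt(n)             # standard error of the mean
    return (mean - mu0) / se, n - 1

# Hypothetical balances, NOT taken from the assignment's dataset.
toy_balances = [480, 530, 610, 455, 520, 495]
t_stat, df = one_sample_t(toy_balances, 500)
print(f"t = {t_stat:.3f}, df = {df}")
```

A t statistic close to zero (relative to the critical value for the given degrees of freedom) means the sample mean is compatible with the claimed value of 500.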
Q2 : Is there a difference between men and women as far as average balance is concerned?
Use a two-sample t-test to draw your conclusion.
The P value (two tail) is not less than 0.05. Hence, we fail to reject the Null Hypothesis.
There is not enough evidence to say that there is a difference between men and women as far as
average balance is concerned.
t-Test: Two-Sample Assuming Unequal Variances
The P value (two tail) is less than 0.05. Hence, we reject the Null Hypothesis.
There is enough evidence to say that there is a difference between students and non-students as
far as average balance is concerned.
The Correlation Coefficient of Number of Cards vs Credit Balance is 0.086, which indicates
a very weak positive correlation between these two parameters. The scatter plot also
portrays the same conclusion. Hence, we cannot say that the balance will be higher if there are
more credit cards.
           Cards         Balance
Cards      1
Balance    0.086456347   1
Q5 : Examine whether the following demographic variables influence balance: (a) age, (b) years
of education, (c) marital status. For age and years of education, use scatter plots to depict their
relationship with balance and calculate the correlation coefficient. For the relationship
between marital status and balance, use a two-sample t-test to draw your conclusion
A. Age Vs Balance : The correlation coefficient (0.0018) is close to zero, so there is no
evidence that age has a linear influence on balance.
CORRELATION 0.001835119
[Scatter plot: AGE VS BALANCE (x-axis: Age, y-axis: Balance)]
B. Education Vs Balance : The correlation coefficient (-0.008) is close to zero, so there is no
evidence that the years of education have a linear influence on balance.
CORRELATION OF EDUCATION VS BALANCE -0.008061576
[Scatter plot: Education vs Balance (x-axis: years of education, y-axis: Balance)]
C. Marital Status Vs Balance : The P value (two tail) is more than 0.05. Hence, we fail to
reject the Null Hypothesis. There is not enough evidence to say that marital status has an
influence on balance.
Q6 : “Ethnicity of the cardholder does not matter as far as balance is concerned.” Carry
out an analysis of variance (ANOVA) and discuss whether this statement is supported by the
data or not.
H0 : Ethnicity of the card holder does not matter as far as balance is concerned
H1 : Ethnicity of the card holder matters as far as balance is concerned
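As the P-Value (0.9156) is more than the significance level of 0.05, and F (0.088) is less than
F crit (3.026), we fail to reject H0.
Hence, the data do not contradict the statement - "Ethnicity of the cardholder does not
matter as far as balance is concerned". There is no evidence that the average balance differs
across the ethnic groups.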
SUMMARY
Groups                      Count   Sum     Average      Variance
AFRICAN AMERICAN BALANCE    99      52569   531          235839.163
ASIAN BALANCE               99      49897   504.010101   226080.112
CAUCASIAN BALANCE           99      50635   511.464646   192363.394

ANOVA
Source of Variation   SS           df    MS           F            P-value      F crit
Between Groups        38466.6128   2     19233.3064   0.08818806   0.91561289   3.0264659
Within Groups         64119701.6   294   218094.223
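
As a quick arithmetic check on the Excel output, the F statistic can be recomputed from the sums of squares and degrees of freedom reported in the ANOVA table. The variable names below are illustrative; the numbers are copied from the table.

```python
# Values copied from the ANOVA table above.
ss_between, df_between = 38466.6128, 2
ss_within, df_within = 64119701.6, 294
f_crit = 3.0264659

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # mean square within groups
f_stat = ms_between / ms_within        # F = MS between / MS within

print(f"MS between = {ms_between:.4f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_stat:.5f}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```

Because the recomputed F is far below F crit, the decision matches the one read from the P-value.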
The Correlation Coefficient of Rating vs Credit Limit is 0.997, which indicates a very strong
positive correlation. It is strong evidence that people with a higher rating have a higher
credit limit.
          Rating       Limit
Rating    1
Limit     0.99687974   1
[Scatter plot: Rating vs Limit (x-axis: Rating, y-axis: Limit)]
Please refer to sheet ANS 7 RATING VS LIMIT in the Excel file.
Q8 : Run a simple linear regression of balance on the credit limit. (Here credit limit is the X and
the balance is the Y). Report the coefficients and the R-squared. Show a scatter plot.
Coefficient of Limit = 0.1716
Intercept = -292.79
R Square = 0.7425
As per the regression output, with each unit increase in Limit, the Balance is expected to
increase by 0.1716, which is the coefficient of X.
Also, from the R Square value, we can conclude that about 74% of the variation in Balance is
explained by Limit, which is a substantial share.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.861697267
R Square 0.74252218
Adjusted R Square 0.741875251
Standard Error 233.5849982
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 62624255.25 62624255.3 1147.764214 2.5306E-119
Residual 398 21715656.66 54561.9514
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -292.7904955 26.68341452 -10.972752 1.18415E-24 -345.2485494 -240.3324415 -345.24855 -240.33244
Limit 0.171637278 0.005066234 33.878669 2.5306E-119 0.161677354 0.181597203 0.16167735 0.1815972
[Scatter plot: LIMIT VS BALANCE (x-axis: Limit, y-axis: Balance), with trendline
y = 0.1716x - 292.79, R² = 0.7425]
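
The R Square reported in the SUMMARY OUTPUT can be recomputed from the ANOVA sums of squares, which also makes its meaning concrete: it is the share of the total variation in Balance that the regression accounts for. The sketch below uses only numbers copied from the Q8 output; the variable names are illustrative.

```python
# Values copied from the Q8 SUMMARY OUTPUT.
ss_regression = 62624255.25
ss_residual = 21715656.66
n_obs, n_predictors = 400, 1

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total   # share of variation in Balance explained by Limit
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(f"R Square          = {r_squared:.6f}")
print(f"Adjusted R Square = {adj_r_squared:.6f}")
```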
Q9 : Run a simple linear regression of balance (Y) on credit rating (X). Report the coefficients
and R-squared. Show a scatter plot.
Coefficient of Rating = 2.566
Intercept = -390.85
R Square = 0.7458
As per the regression output, with each unit increase in Rating, the Balance is expected to
increase by 2.566, which is the coefficient of X.
Also, from the R Square value, we can conclude that about 75% of the variation in Balance is
explained by Rating.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.863625161
R Square 0.745848418
Adjusted R Square 0.745209846
Standard Error 232.0713048
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 62904789.88 62904789.88 1167.994581 1.8989E-120
Residual 398 21435122.03 53857.09053
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -390.8463418 29.06851463 -13.44569362 3.07318E-34 -447.993365 -333.69932 -447.99336 -333.69932
Rating 2.566240327 0.075089102 34.1759357 1.8989E-120 2.418619483 2.71386117 2.41861948 2.71386117
[Scatter plot: RATING VS BALANCE (x-axis: Rating, y-axis: Balance), with trendline
y = 2.5662x - 390.85, R² = 0.7458]
Q10 : Consider your findings in questions 8-9. Discuss business mechanisms to increase or
decrease the balance on credit cards. Try to quantify your answers.
From the regressions in questions 8 & 9, we found that Credit Limit and Credit Rating both have
a statistically significant effect on Credit Balance (both P-values are far below 0.05). The
R Square is 0.7425 for Limit and 0.7458 for Rating, so each variable on its own explains about
74-75% of the variation in Balance, which is a considerably high value.
The two coefficients (0.1716 for Limit and 2.566 for Rating) cannot be compared directly, because
Limit is measured in currency units and Rating in rating points. In quantified terms, an increase
of 1,000 in Credit Limit is associated with an increase of about 172 in Balance, while an increase
of 100 points in Credit Rating is associated with an increase of about 257 in Balance. Similarly,
reducing the Limit or the Rating by these amounts is associated with a reduction in Balance of
the same size.
So, credit card companies that want higher balances can look for individuals with higher credit
ratings, because our data analysis shows that such individuals are likely to have a higher
Credit Balance.
Also, credit card companies might think of increasing the Credit Limit of their existing customers.
However, Rating and Limit are almost perfectly correlated (0.997), so these two levers largely
reflect the same information rather than two independent effects.
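
To put the Limit and Rating effects on a comparable footing, one option is to look at the change in Balance for a realistic step in each variable, alongside the standardized slope. In a simple regression with a positive slope, the standardized slope equals the Multiple R. The sketch below uses the slopes and Multiple R values from the Q8 and Q9 outputs; the step sizes (1,000 for Limit, 100 for Rating) and the helper name `balance_change` are assumptions chosen only for illustration.

```python
# Slopes and Multiple R values copied from the Q8 and Q9 regression outputs.
SLOPES = {"Limit": 0.171637278, "Rating": 2.566240327}
MULTIPLE_R = {"Limit": 0.861697267, "Rating": 0.863625161}

def balance_change(predictor, step):
    """Expected change in Balance for a given change in one predictor (simple regression)."""
    return SLOPES[predictor] * step

# Hypothetical step sizes, chosen only to put both predictors on a readable scale.
for predictor, step in {"Limit": 1000, "Rating": 100}.items():
    print(f"{predictor}: +{step} -> Balance {balance_change(predictor, step):+.2f}, "
          f"standardized slope = {MULTIPLE_R[predictor]:.3f}")
```

The standardized slopes are almost identical, which suggests that the two variables are about equally strong predictors of Balance once their different units are taken into account.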
Q11 : The credit limit is provided as a consolidated amount for all the credit cards the
cardholder has. Run a multiple linear regression of Balance (Y) on Limit and Cards as two X
variables. Report the coefficients. Discuss the effect on the balance of (a) increasing the credit
limit on the same number of cards and (b) increasing the number of cards without altering the
total credit limit.
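Coefficients : Intercept = -369.04, Limit = 0.1715, Cards = 26.03
(a) As per the regression output, with each unit increase in Limit, the Balance is expected to
increase by 0.171 units, when the number of cards is constant. For example, an increase of 1,000
in the limit is associated with an increase of about 171 in Balance.
(b) With each additional card, the Balance is expected to increase by 26.03 units, when the
total Limit is constant.
In both cases the balance increases, and both coefficients are statistically significant
(P-values below 0.05). The two coefficients are in different units (per unit of limit vs per
card), so they should not be compared directly: one extra card has roughly the same effect as
an increase of about 152 in the limit (26.03 / 0.1715).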
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.865188295
R Square 0.748550786
Adjusted R Square 0.74728404
Standard Error 231.1247525
Observations 400
ANOVA
df SS MS F Significance F
Regression 2 63132707.37 31566353.7 590.923824 9.7585E-120
Residual 397 21207204.54 53418.6512
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -369.0359554 36.16414657 -10.20447 7.2269E-22 -440.133128 -297.93878 -440.13313 -297.93878
Limit 0.171479037 0.005013136 34.2059386 2.002E-120 0.161623424 0.18133465 0.16162342 0.18133465
Cards 26.03375427 8.438363509 3.08516625 0.00217682 9.444290848 42.6232177 9.44429085 42.6232177
Please refer to sheet ANS 11 MULTIPLE LINEAR REGRESSION in the Excel file.
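
The two scenarios in this question can be worked through with the fitted equation. The sketch below uses the coefficients from the SUMMARY OUTPUT; the function name `predicted_balance` and the example cardholder (limit of 5,000 and 3 cards) are hypothetical and serve only to illustrate the comparison.

```python
# Coefficients copied from the Q11 SUMMARY OUTPUT.
INTERCEPT, B_LIMIT, B_CARDS = -369.0359554, 0.171479037, 26.03375427

def predicted_balance(limit, cards):
    """Fitted Balance from the two-predictor model (Limit and Cards)."""
    return INTERCEPT + B_LIMIT * limit + B_CARDS * cards

# Hypothetical cardholder, used only for illustration.
base = predicted_balance(limit=5000, cards=3)
more_limit = predicted_balance(limit=6000, cards=3)   # (a) +1,000 limit, same number of cards
more_cards = predicted_balance(limit=5000, cards=4)   # (b) +1 card, same total limit

print(f"(a) +1000 limit: {more_limit - base:+.2f}")
print(f"(b) +1 card:     {more_cards - base:+.2f}")
```

Because the model is linear, these differences do not depend on the starting limit or number of cards chosen for the example.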
Q12 : Run a simple linear regression equation with Income as X and Balance as Y. Report the
coefficients. Is the coefficient of Income significantly different from zero? What does this say
about the effect of income on balance?
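Coefficient of Income = 6.048
Intercept = 246.51
H0 : Coefficient of Income is 0
H1 : Coefficient of Income is different from 0
The P-value of the Income coefficient (1.03E-22) is far below 0.05, so we reject H0 and conclude
that the coefficient of Income is significantly different from 0.
So, income has an effect on balance. With every unit increase in Income, the balance is expected
to increase by 6.05 units. However, the R Square is only 0.215, so Income alone explains about
21% of the variation in Balance.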
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.463656457
R Square 0.21497731
Adjusted R Square 0.213004891
Standard Error 407.8647195
Observations 400
ANOVA
df SS MS F Significance F
Regression 1 18131167.4 18131167.4 108.991715 1.03089E-22
Residual 398 66208744.51 166353.6294
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 246.5147506 33.19934735 7.425289058 6.9034E-13 181.2467485 311.782753 181.246749 311.782753
Income 6.048363409 0.579350163 10.43990973 1.0309E-22 4.909394402 7.18733242 4.9093944 7.18733242
[Scatter plot: INCOME VS BALANCE (x-axis: Income, 0 to 200; y-axis: Balance), with trendline
y = 6.0484x + 246.51, R² = 0.215]
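
The significance decision for the Income coefficient can be traced through three equivalent checks on the Excel output: the t Stat (coefficient divided by its standard error), the P-value against 0.05, and whether the 95% confidence interval excludes zero. The sketch below uses only values copied from the Income row of the SUMMARY OUTPUT; the variable names are illustrative.

```python
# Values copied from the Income row of the Q12 SUMMARY OUTPUT.
coef, std_err = 6.048363409, 0.579350163
p_value = 1.0309e-22
ci_low, ci_high = 4.909394402, 7.18733242
alpha = 0.05

t_stat = coef / std_err                          # test statistic for H0: coefficient = 0
ci_excludes_zero = not (ci_low <= 0 <= ci_high)  # True when zero lies outside the 95% interval
significant = p_value < alpha

print(f"t Stat = {t_stat:.4f}")
print(f"95% CI excludes zero: {ci_excludes_zero}")
print(f"Significant at {alpha}: {significant}")
```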
Q13 : Based on the equation derived in question 12, what is the estimated balance for a person
with an income of USD 100k per year?
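Income in this dataset appears to be recorded in thousands of dollars (the x-axis of the Income
vs Balance scatter plot runs from 0 to 200), so an income of USD 100k corresponds to X = 100.
Estimated balance = 6.0484*100 + 246.51
= 851.35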
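
The estimate can be reproduced with a small helper built on the fitted line from question 12. This sketch assumes that Income is measured in thousands of dollars, as the 0 to 200 range on the Income vs Balance scatter plot suggests, so USD 100k is entered as 100; the function name `estimated_balance` is hypothetical.

```python
def estimated_balance(income_thousands):
    """Fitted Balance from the Q12 line y = 6.0484x + 246.51 (x assumed in USD thousands)."""
    return 6.0484 * income_thousands + 246.51

# Income of USD 100k, entered as 100 under the units assumption above.
print(round(estimated_balance(100), 2))
```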
Q14 : Based on the dataset, explore the relationship between credit card balance (Y) and (a)
Income, (b) Age, (c) Education, (d) Limit, and (e) Rating as X variables. Estimate a multiple linear
regression model and report the statistical significance of each of these variables.
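The P values of Age (0.073), Education (0.451) and Limit (0.078) are higher than 0.05, hence these
variables are statistically insignificant at the 5% level, so we are ignoring these parameters.
The P values of Income (1.37E-61) and Rating (3.94E-05) are lower than 0.05, hence these variables
are statistically significant.
With every unit increase in Income, the balance will decrease by 7.61 units if all the other
parameters are constant. Note that the sign is opposite to the one in question 12 (+6.05), because
here the coefficient is estimated with Rating and Limit held constant.
With every unit increase in Rating, the balance will increase by 2.77 units if all the other
parameters are constant.
R Square value is 0.877, so the five variables together explain about 88% of the variation in
Balance.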
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.936702578
R Square 0.87741172
Adjusted R Square 0.875856031
Standard Error 161.9917647
Observations 400
ANOVA
df SS MS F Significance F
Regression 5 74000827.17 14800165.43 564.0020686 4.5908E-177
Residual 394 10339084.74 26241.33183
Total 399 84339911.91
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept -473.2514026 55.10833546 -8.587655545 2.08837E-16 -581.5945666 -364.9082387 -581.59457 -364.90824
Income -7.608832003 0.381931562 -19.92197755 1.37077E-61 -8.359710677 -6.85795333 -8.3597107 -6.8579533
Age -0.860030445 0.478700493 -1.796594023 0.073165937 -1.801157147 0.081096257 -1.8011571 0.08109626
Education 1.967791521 2.605290902 0.755305874 0.450516748 -3.154218733 7.089801776 -3.1542187 7.08980178
Limit 0.07901642 0.044791005 1.764113581 0.078487737 -0.009042839 0.167075679 -0.0090428 0.16707568
Rating 2.773843725 0.667079559 4.158190261 3.93909E-05 1.462363177 4.085324273 1.46236318 4.08532427
Please refer to sheet ANS 14 MULTIPLE VARIABLES in the Excel file.
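
The significance screening described for this question can be written as a simple rule applied to the P-value column. The sketch below copies the P-values from the coefficient table and assumes a 5% significance level; the variable names are illustrative.

```python
# P-values copied from the Q14 coefficient table.
p_values = {
    "Income": 1.37077e-61,
    "Age": 0.073165937,
    "Education": 0.450516748,
    "Limit": 0.078487737,
    "Rating": 3.93909e-05,
}
ALPHA = 0.05  # significance level used throughout the assignment

significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Significant at 5%:    ", significant)
print("Not significant at 5%:", not_significant)
```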