CS112 Final Project
Date: 12/21/2018
Executive Summary:
Based on our analysis, we claim in this memorandum that the results obtained by Hansen, Kobayashi,
and Orzech (2014) may be affected by their handling of the data. Our recommendations are:
- To extend the existing analyses and look further into the effect of inflation-adjusted family income
on college enrollment. An observational study design is suitable for this purpose.
- To apply the synthetic control method to verify whether the Great Recession (GR) actually had an effect
on college enrollment.
Conceptually, the GR would seem to have had an impact on college enrollment, since it
reduced the financial capacity of families to send their children to college. It is therefore important
to verify whether the GR had a causal effect on the college enrollment status of young people in the
United States by answering a causal question: what is the marginal effect on college enrollment status
of key predictors such as family income and demographic factors such as sex, race, and ethnicity?
The goal of this paper is to understand the causal effect of key predictors on college enrollment
and to synthesize lessons for future work. To that end, we deem it necessary to go into further depth
and examine how variation in these predictors affects students differently depending on their estimated
likelihood of college enrollment. We expect the results to partially answer the question: is family
income the most important factor in the decision to enroll in college?
Data and methodologies:
We first attempt to replicate the results from Hansen, Kobayashi, and Orzech (2014). In their
paper, they performed regression analyses, looking at certain predictors - variables with predictive
power on enrollment - for students from different income backgrounds. We perform the replication as
follows. We first clean and organize the data. We then fix the weights for 2004, stored as wtsupp and
PERWT04: both are person weights for the ASEC supplement, but PERWT04 is specific to 2004 while wtsupp
is more general and is also used for other CPS supplements. We then create a trichotomous outcome -
not enrolled in college, enrolled in college part-time, and enrolled in college full-time - from the
six original categories and store it in a new variable.
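A minimal sketch of the weight fix and the outcome recode described above, mirroring the variable names used in the code appendix (wtsupp, PERWT04, schlcoll):

# Use the 2004-specific person weight for the 2004 records
data$wtsupp[data$year == 2004] <- data$PERWT04[data$year == 2004]

# Collapse the six schlcoll categories into a trichotomous outcome
sc <- as.numeric(data$schlcoll)
data$multi_coll <- NA
data$multi_coll[sc == 6] <- 0   # not enrolled in college
data$multi_coll[sc == 5] <- 1   # enrolled part-time
data$multi_coll[sc == 4] <- 2   # enrolled full-time
data <- data[!is.na(data$multi_coll), ]   # drop high-school / out-of-universe cases
data$multi_coll <- as.factor(data$multi_coll)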
We then adjust the income data for inflation, converting all incomes to 1999 dollars as a
common baseline and then rescaling them to 2010 dollars, so that comparisons made between different
years are legitimate. We continue to clean and restructure the data, replacing all negative incomes
with 0 and adding $1 to each income before taking its logarithm.
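The year-by-year deflation in the appendix can be written more compactly as a sketch with a named vector of deflators (the values are the ones used in the code below); the 1.309 factor then rescales 1999 dollars to 2010 dollars:

# Deflators mapping each survey year's income to 1999 dollars
defl <- c("2001" = .94073, "2002" = .92593, "2003" = .90580, "2004" = .88183,
          "2005" = .85324, "2006" = .82645, "2007" = .80321, "2008" = .77399,
          "2009" = .77700, "2010" = .76394, "2011" = .74074, "2012" = .72621,
          "2013" = .70000)
data$inc_99  <- data$ftotval * defl[as.character(data$year)]
data$adj_inc <- 1.309 * data$inc_99          # rescale to 2010 dollars
data$adj_inc[data$adj_inc < 0] <- 0          # replace negative incomes with 0
data$log_inc <- log(data$adj_inc + 1)        # add $1 so the log is defined at zero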
We replicated their results and found the following key insights: []
Next, as an extension of their method, we performed a quantile regression. Our goal in
performing the quantile regression is to assess whether the effect of the key predictors - those used
by the authors to estimate a student's probability of being enrolled in college - differs between
students belonging to different quantiles of that probability distribution, that is, students who have
different probabilities of being enrolled in college. We think it is an insightful extension for the
following reasons:
(1) Students belonging to different quantiles of the distribution, i.e., having different probabilities of
being enrolled in college, may experience varied effects of the independent variables. For example, one
of the variables used is log_inc, an indicator of family income. We postulate that if you are in the
99th quantile, that is, you have a very high chance of enrolling in college, the effect your family
income has on your chance of enrollment will be very different from the effect on a student in the
25th quantile. As family income increases, other factors that may influence the decision to go to
college carry more weight. Conversely, if family income is very low, factors like the ranking of the
college, course availability, or more personal considerations such as future plans matter much less.
This is probably because the cost of college enrollment in the US is very high, making family income
the single most important factor: when students make the decision, they will likely choose the option
that costs the least, sometimes at the expense of their own preferences. Because of this, we predict
that if we perform quantile regression on income and compare low- and high-income students, the
variability in enrollment probability will be greater for higher-income students, because variation in
the other predictors will explain more of the variation in whether students go to college (a minimal
quantile-regression sketch follows this list).
(2)
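A minimal sketch of the comparison described in (1), assuming the fitted enrollment probabilities from the multinomial logit have already been stored in data$prob as in the code appendix:

library(quantreg)

# Fit the same specification at several quantiles of the predicted enrollment probability
fit_q <- rq(prob ~ log_inc + rec_year + log_inc_x_year,
            tau = c(0.25, 0.50, 0.75), data = data)

# Plot how the log_inc coefficient changes across quantiles; a non-flat profile would
# indicate that the income effect differs between low- and high-probability students
plot(summary(fit_q), parm = "log_inc")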
For our method, we first fit a linear regression and take note of the R-squared value to get
an idea of which predictors should be included in our model specification. The predictors that constitute
the best linear regression model (very low p-values and a respectable R-squared) are the log of income
(log_inc), the recentered year variable (rec_year), and their interaction term (log_inc_x_year). We find
this corroborative: variation in income explains more of the variation in college enrollment than other
variables in the dataset, such as sex and race, do.
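A minimal sketch of this kind of specification comparison, again assuming data$prob holds the fitted probabilities; the candidate models here are illustrative:

# Candidate specifications (illustrative); compare their fit via adjusted R-squared
m1 <- lm(prob ~ log_inc + rec_year + log_inc_x_year, data = data)
m2 <- lm(prob ~ log_inc + rec_year + log_inc_x_year + sex + race, data = data)
m3 <- lm(prob ~ log_inc, data = data)

sapply(list(income_year = m1, plus_demographics = m2, income_only = m3),
       function(m) summary(m)$adj.r.squared)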
One potential problem that may arise in our analyses is the high number of non-positive fis
(the local density estimates used by summary.rq when computing standard errors). This may lead to
bias in the estimated standard errors, so we adjusted the regression specification to minimize the
number of non-positive fis. The best result comes from including only income in 1999 dollars as the
predictor, for which non-positive fis constitute only about 2% of our dataset.
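As a robustness check that does not rely on the fis estimates at all, the quantile-regression standard errors can be recomputed with a bootstrap, for example for the 25th-quantile fit from the appendix:

# Bootstrap standard errors avoid the sparsity (fis) estimation entirely
quantreg25 <- rq(prob ~ log_inc + rec_year + log_inc_x_year, tau = 0.25, data = data)
summary(quantreg25, se = "boot", R = 500)   # R = number of bootstrap replications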
Conclusion and recommendation: Through the multinomial logistic regressions, we have been able to
predict the probabilities of people from different income levels attending college from 2001 to
2013, and we were able to show through our multinomial logit that there was no significant effect on
the probabilities of these students going to college at any of the income levels. We then extended
this idea by regressing the predicted probabilities from the multinomial logit on the log of income,
the recentered year, and their interaction term (log_inc multiplied by year), and by running quantile
regressions on the same specification.
In addition, we recommend looking into the synthetic control method to answer the counterfactual
question of whether the GR had an effect on college enrollment. The outcome variable would be the
probability of college enrollment, and since the GR is a treatment with a well-defined start date, we
suspect the method can discern whether the GR actually decreased college enrollment.
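A minimal sketch of what such an analysis could look like with the Synth package, assuming a hypothetical state-level panel (state_panel) in which treated and donor units can be defined (for example, states hit harder versus less hard by the recession); all unit and variable names below are placeholders rather than part of the original analysis:

library(Synth)

# Hypothetical panel: one row per state per year, with an enrollment rate,
# state identifiers, and pre-treatment covariates (names are placeholders)
prep <- dataprep(
  foo                   = state_panel,
  predictors            = c("median_income", "unemployment_rate"),
  predictors.op         = "mean",
  dependent             = "enroll_rate",
  unit.variable         = "state_id",
  unit.names.variable   = "state_name",
  time.variable         = "year",
  treatment.identifier  = 1,          # the treated unit
  controls.identifier   = 2:40,       # donor pool
  time.predictors.prior = 2001:2007,  # pre-recession years
  time.optimize.ssr     = 2001:2007,
  time.plot             = 2001:2013
)

out <- synth(data.prep.obj = prep)

# Compare the treated unit's observed enrollment path with its synthetic counterfactual
path.plot(synth.res = out, dataprep.res = prep)
gaps.plot(synth.res = out, dataprep.res = prep)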
Code:
> ###############################
> #Final Project
> #CS112
> #Fall2018
> #Anirudh Nair
> #Dang Minh Quang
> ###############################
>
> #Loading required libraries
> library(foreign)
> library(mlogit)
> library(quantreg)
>
> #Loading dataset
> cps <- read.dta("/Users/anirudhnair/Downloads/dataverse_files (9)/cps_00015.dta")
> head(cps)
year serial hwtsupp hhintype statefip month pernum wtsupp PERWT04 relate age sex race
1 2005 138 258.36 Interview Maine March 3 274.70 NA Child 18 Male White
2 2005 144 246.92 Interview Maine March 4 267.30 NA Child 19 Male White
3 2005 171 290.48 Interview Maine March 4 303.58 NA Child 18 Female White
4 2005 207 266.63 Interview Maine March 2 294.01 NA Grandchild 18 Female White
5 2005 218 542.22 Interview Maine March 4 553.32 NA Child 19 Female White
6 2005 237 272.94 Interview Maine March 3 331.86 NA Child 18 Male White
educ schlcoll ftotval adjginc
1 Grade 11 High school full time 82390 80
2 High school diploma or equivalent Does not attend school, college or university 92700 10000
3 High school diploma or equivalent College or university full time 54800 6000
4 Grade 11 High school full time 16600 8200
5 Some college but no degree College or university full time 115383 0
6 High school diploma or equivalent Does not attend school, college or university 65846 6139
> names(cps)
[1] "year" "serial" "hwtsupp" "hhintype" "statefip" "month" "pernum" "wtsupp" "PERWT04"
"relate"
[11] "age" "sex" "race" "educ" "schlcoll" "ftotval" "adjginc"
>
>
> ####Cleaning and Organizing the dataset####
>
> subset <- c("year", "serial", "statefip", "pernum", "wtsupp", "PERWT04", "sex", "race", "schlcoll",
"ftotval")
> data <- cps[subset]
>
> # fix 2004 weights
> data$wtsupp[data$year == 2004] <- data$PERWT04[data$year == 2004]
> data$schlcoll2 <- as.numeric(data$schlcoll)
>
> # create trichotomous outcome: multi_coll
> data[data[,"schlcoll2"]==6,"multi_coll"] <- 0 # not enrolled in college
> data[data[,"schlcoll2"]==5,"multi_coll"] <- 1 # enrolled pt (part-time) in college
> data[data[,"schlcoll2"]==4,"multi_coll"] <- 2 # enrolled ft (full-time) college
> data <- data[ which(data[,"multi_coll"]==0|data[,"multi_coll"]==1|data[,"multi_coll"]==2),] # get rid of
hs students & folks not in sample universe
> data[, "multi_coll"] <- as.factor(data[,"multi_coll"])
>
> # income to same scale
> data[,"inc_99"] <- 0
> data[data$year == 2001, "inc_99"] <- data[data$year == 2001, "ftotval"] *.94073
> data[data$year == 2002, "inc_99"] <- data[data$year == 2002, "ftotval"] *.92593
> data[data$year == 2003, "inc_99"] <- data[data$year == 2003, "ftotval"] *.90580
> data[data$year == 2004, "inc_99"] <- data[data$year == 2004, "ftotval"] *.88183
> data[data$year == 2005, "inc_99"] <- data[data$year == 2005, "ftotval"] *.85324
> data[data$year == 2006, "inc_99"] <- data[data$year == 2006, "ftotval"] *.82645
> data[data$year == 2007, "inc_99"] <- data[data$year == 2007, "ftotval"] *.80321
> data[data$year == 2008, "inc_99"] <- data[data$year == 2008, "ftotval"] *.77399
> data[data$year == 2009, "inc_99"] <- data[data$year == 2009, "ftotval"] *.77700
> data[data$year == 2010, "inc_99"] <- data[data$year == 2010, "ftotval"] *.76394
> data[data$year == 2011, "inc_99"] <- data[data$year == 2011, "ftotval"] *.74074
> data[data$year == 2012, "inc_99"] <- data[data$year == 2012, "ftotval"] *.72621
> data[data$year == 2013, "inc_99"] <- data[data$year == 2013, "ftotval"] *.70000
>
> # Adjusting Income to 2010
> data$adj_inc <- 1.309*data$inc_99
>
> # replace income < 0 with 0
> data$adj_inc[data$adj_inc < 0] <- 0
>
> # add $1 to all so i can take the log
> data$adj_inc <- data$adj_inc + 1
>
> # look at new data
> # length(data$adj_inc) # 39385 observations left
> # summary(data$adj_inc)
>
> # log of income
> data[,"log_inc"] <- log(data[,"adj_inc"])
>
> # generate recenter year variance. 2001 = 0. 2002 = 1. 2003 = 2....etc.
> data[,"rec_year"] <- data[,"year"] - 2001
>
> # generate interaction term
> data[,"log_inc_x_year"] <- data[,"log_inc"] * data[,"rec_year"]
>
> # recode race variable to collapse most subgroups into "other" category (recoding rationale less arbitrary;
detailed rationale/sensitivity tests omitted)
> data$race2 <- as.numeric(data$race)
> data[data[,"race2"]>9&data[,"race2"]<29,"race2"] <- 998
> levels(data$race) <- c(levels(data$race), "Other - Recoded")
> data$race[data$race2 > 9] <- "Other - Recoded"
>
>
> #creating the required subsets
> # subset for income of over $8103 (2010$) and income > 0
> high_inc <- data[data[,"log_inc"] > 9,]
>
> #the code given by the authors does not properly separate the income levels, so we changed it:
> #low_inc <- data[data[,"log_inc"] > 0 && data[,"log_inc"] < 9,]
> low_inc <- data[data[,"log_inc"] < 9,]
>
> # convert data into mlogit format (shape = wide)
> dd_all <- mlogit.data(data, shape = "wide", choice = "multi_coll")
> dd_pos <- mlogit.data(low_inc, shape = "wide", choice = "multi_coll")
> dd <- mlogit.data(high_inc, shape = "wide", choice = "multi_coll")
>
>
>
> #creating the required models
>
> # (1) all income
> model_all <- mlogit(formula = multi_coll ~ 1|log_inc + rec_year + log_inc_x_year + factor(sex) +
factor(race) | 0, weights = wtsupp,data=dd_all, method = "nr", print.level = 0)
> summary(model_all)
Call:
mlogit(formula = multi_coll ~ 1 | log_inc + rec_year + log_inc_x_year +
factor(sex) + factor(race) | 0, data = dd_all, weights = wtsupp,
method = "nr", print.level = 0)
Frequencies of alternatives:
0 1 2
0.401062 0.053737 0.545200
nr method
11 iterations, 0h:0m:7s
g'(-H)^-1g = 9.37E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
1:(intercept) -6.7261497 0.5448095 -12.3459 < 2.2e-16 ***
2:(intercept) -8.6065096 0.2959310 -29.0828 < 2.2e-16 ***
1:log_inc 0.3990510 0.0486064 8.2098 2.220e-16 ***
2:log_inc 0.7594981 0.0263684 28.8034 < 2.2e-16 ***
1:rec_year 0.2680925 0.0624962 4.2897 1.789e-05 ***
2:rec_year 0.4064330 0.0366950 11.0760 < 2.2e-16 ***
1:log_inc_x_year -0.0212123 0.0056298 -3.7678 0.0001647 ***
2:log_inc_x_year -0.0329269 0.0033002 -9.9773 < 2.2e-16 ***
1:factor(sex)Female 0.4321969 0.0465328 9.2880 < 2.2e-16 ***
2:factor(sex)Female 0.6498861 0.0224313 28.9722 < 2.2e-16 ***
1:factor(race)Black/Negro -0.0690633 0.0722774 -0.9555 0.3393094
2:factor(race)Black/Negro -0.1145814 0.0347877 -3.2937 0.0009887 ***
1:factor(race)American Indian/Aleut/Eskimo -0.5312826 0.2098853 -2.5313 0.0113641 *
2:factor(race)American Indian/Aleut/Eskimo -0.6920716 0.0957167 -7.2304 4.816e-13 ***
1:factor(race)Asian or Pacific Islander 1.7435608 0.2332586 7.4748 7.727e-14 ***
2:factor(race)Asian or Pacific Islander 1.5271674 0.1638135 9.3226 < 2.2e-16 ***
1:factor(race)Asian only 0.6044276 0.1487184 4.0642 4.819e-05 ***
2:factor(race)Asian only 1.1711225 0.0770446 15.2006 < 2.2e-16 ***
1:factor(race)Hawaiian/Pacific Islander only -0.2262365 0.3453409 -0.6551 0.5123964
2:factor(race)Hawaiian/Pacific Islander only -0.2545465 0.1595032 -1.5959 0.1105175
1:factor(race)White-Black -0.1498287 0.2864802 -0.5230 0.6009754
2:factor(race)White-Black -0.2556832 0.1384726 -1.8465 0.0648264 .
1:factor(race)White-American Indian 0.0677572 0.2071880 0.3270 0.7436433
2:factor(race)White-American Indian -0.4698346 0.1136590 -4.1337 3.569e-05 ***
1:factor(race)Other - Recoded -0.3588100 0.2435056 -1.4735 0.1406113
2:factor(race)Other - Recoded 0.0674515 0.0989201 0.6819 0.4953159
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -31350
McFadden R^2: 0.067576
Likelihood ratio test : chisq = 4544.1 (p.value = < 2.22e-16)
>
> # (2) income > $0
> model_pos <- mlogit(formula = multi_coll ~ 1|log_inc + rec_year + log_inc_x_year + factor(sex) +
factor(race) | 0, weights = wtsupp,data=dd_pos, method = "nr", print.level = 0)
Error in solve.default(H, g[!fixed]) :
system is computationally singular: reciprocal condition number = 1.18838e-16
> summary(model_pos)
Error in summary(model_pos) : object 'model_pos' not found
>
> # (3) income > $8,103
> model <- mlogit(formula = multi_coll ~ 1|log_inc + rec_year + log_inc_x_year + factor(sex) +
factor(race) | 0, weights = wtsupp,data=dd, method = "nr", print.level = 0)
> summary(model)
Call:
mlogit(formula = multi_coll ~ 1 | log_inc + rec_year + log_inc_x_year +
factor(sex) + factor(race) | 0, data = dd, weights = wtsupp,
method = "nr", print.level = 0)
Frequencies of alternatives:
0 1 2
0.388091 0.054202 0.557707
nr method
11 iterations, 0h:0m:7s
g'(-H)^-1g = 2.46E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
1:(intercept) -9.4611950 0.7086892 -13.3503 < 2.2e-16 ***
2:(intercept) -12.2286606 0.3611539 -33.8600 < 2.2e-16 ***
1:log_inc 0.6434245 0.0630052 10.2122 < 2.2e-16 ***
2:log_inc 1.0775150 0.0321440 33.5215 < 2.2e-16 ***
1:rec_year 0.3213244 0.0958037 3.3540 0.0007966 ***
2:rec_year 0.3598098 0.0490526 7.3352 2.214e-13 ***
1:log_inc_x_year -0.0257086 0.0085938 -2.9915 0.0027757 **
2:log_inc_x_year -0.0281891 0.0044046 -6.4000 1.554e-10 ***
1:factor(sex)Female 0.4498178 0.0473019 9.5095 < 2.2e-16 ***
2:factor(sex)Female 0.6916898 0.0233090 29.6747 < 2.2e-16 ***
1:factor(race)Black/Negro 0.0144013 0.0747707 0.1926 0.8472672
2:factor(race)Black/Negro -0.0076466 0.0366484 -0.2086 0.8347226
1:factor(race)American Indian/Aleut/Eskimo -0.5781569 0.2264741 -2.5529 0.0106842 *
2:factor(race)American Indian/Aleut/Eskimo -0.6650940 0.1015139 -6.5518 5.687e-11 ***
1:factor(race)Asian or Pacific Islander 1.7809296 0.2347228 7.5874 3.264e-14 ***
2:factor(race)Asian or Pacific Islander 1.5915565 0.1667623 9.5439 < 2.2e-16 ***
1:factor(race)Asian only 0.6457046 0.1494640 4.3201 1.559e-05 ***
2:factor(race)Asian only 1.2048959 0.0790278 15.2465 < 2.2e-16 ***
1:factor(race)Hawaiian/Pacific Islander only -0.2051497 0.3472076 -0.5909 0.5546169
2:factor(race)Hawaiian/Pacific Islander only -0.2534883 0.1650772 -1.5356 0.1246429
1:factor(race)White-Black -0.2654803 0.3178413 -0.8353 0.4035710
2:factor(race)White-Black -0.2143151 0.1459335 -1.4686 0.1419466
1:factor(race)White-American Indian 0.1015332 0.2088010 0.4863 0.6267772
2:factor(race)White-American Indian -0.4487385 0.1177648 -3.8105 0.0001387 ***
1:factor(race)Other - Recoded -0.3350312 0.2443890 -1.3709 0.1704082
2:factor(race)Other - Recoded 0.0590978 0.1026384 0.5758 0.5647596
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -29631
McFadden R^2: 0.078918
Likelihood ratio test : chisq = 5077.6 (p.value = < 2.22e-16)
>
> data$prob <- model_all$probabilities[,2]
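> # note: model_all$probabilities has one column per alternative (0, 1, 2), so column 2
> # corresponds to alternative 1, i.e., the part-time enrollment probability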
>
> #we choose this specification because the p-values and the R-squared are respectable
>
>
> regmodel <- lm(data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year)
> summary(regmodel)
Call:
lm(formula = data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year)
Residuals:
Min 1Q Median 3Q Max
-0.039059 -0.002661 -0.000025 0.003091 0.090441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.583e-02 6.425e-04 24.64 <2e-16 ***
data$log_inc 3.287e-03 5.765e-05 57.01 <2e-16 ***
data$rec_year 2.684e-03 8.144e-05 32.95 <2e-16 ***
data$log_inc_x_year -2.037e-04 7.344e-06 -27.73 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> # Quantile regression
>
> #25th Quantile
> quantreg25 <- rq(data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year, tau=0.25)
> summary(quantreg25)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.02327 0.00112 20.87301 0.00000
data$log_inc 0.00238 0.00009 25.38856 0.00000
data$rec_year 0.00346 0.00028 12.50722 0.00000
data$log_inc_x_year -0.00026 0.00002 -11.39288 0.00000
Warning message:
In summary.rq(quantreg25) : 1306 non-positive fis
>
> #50th Quantile
> quantreg50 <- rq(data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year, tau=0.5)
> summary(quantreg50)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.04473 0.00108 41.47107 0.00000
data$log_inc 0.00066 0.00009 7.05190 0.00000
data$rec_year 0.00431 0.00012 35.50346 0.00000
data$log_inc_x_year -0.00034 0.00001 -32.27365 0.00000
Warning message:
In summary.rq(quantreg50) : 8439 non-positive fis
>
> #75th Quantile
> quantreg75 <- rq(data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year, tau=0.75)
> summary(quantreg75)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.09069 0.00074 121.97262 0.00000
data$log_inc -0.00320 0.00006 -51.85006 0.00000
data$rec_year 0.00218 0.00010 22.83000 0.00000
data$log_inc_x_year -0.00016 0.00001 -20.33779 0.00000
Warning message:
In summary.rq(quantreg75) : 2496 non-positive fis
>
> # Simultaneous quantile regression
> quantreg2575 <- rq(data$prob ~ data$log_inc + data$rec_year + data$log_inc_x_year, tau=c(0.25,
0.75))
> summary(quantreg2575)
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.02327 0.00112 20.87301 0.00000
data$log_inc 0.00238 0.00009 25.38856 0.00000
data$rec_year 0.00346 0.00028 12.50722 0.00000
data$log_inc_x_year -0.00026 0.00002 -11.39288 0.00000
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.09069 0.00074 121.97262 0.00000
data$log_inc -0.00320 0.00006 -51.85006 0.00000
data$rec_year 0.00218 0.00010 22.83000 0.00000
data$log_inc_x_year -0.00016 0.00001 -20.33779 0.00000
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1306 non-positive fis
2: In summary.rq(xi, U = U, ...) : 2496 non-positive fis
>
> # ANOVA test for coefficient differences
> anova(quantreg25, quantreg75)
Quantile Regression Analysis of Deviance Table