Assignment
ASAD KHAN
Table of Contents
Question 1
Loading the Data in MySQL
Queries and their Outcome
Analysis
4. Visualization schematics
Justification
Question 2
OLAP Operations in R
Appendix
Question 3
Introduction
Question 1
Loading the Data in MySQL
To load the data, I need to create a table first.
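The CREATE TABLE and import statements themselves are not reproduced here. As an alternative sketch, the table could also be created and the CSV loaded from R with the DBI and RMySQL packages; the database name and credentials below are placeholders, not values from the original work.
library(DBI)
# Connect to the local MySQL server (hypothetical database name and credentials)
con <- dbConnect(RMySQL::MySQL(), dbname = 'cars_db',
                 host = 'localhost', user = 'root', password = 'password')
# Read the CSV locally and let dbWriteTable create the table and insert the rows
cars_info <- read.csv('C:/ProgramData/MySQL/MySQLServer/Uploads/cars_info.csv',
                      header = TRUE)
dbWriteTable(con, 'cars_info', cars_info, row.names = FALSE, overwrite = TRUE)
dbDisconnect(con)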
Analysis
The first step in the process of analyzing the datasets is loading them into R dataframes.
# Load CSV files
cars <- read.csv('C:/ProgramData/MySQL/MySQLServer/Uploads/cars_info.csv', header = TRUE)
cars_info <- cars
R’s str function gives me a look at the data types in the “cars_info” dataset. The summary function lets
me see basic summary statistics for each column using R.
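The structure check is a single call:
str(cars_info)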
summary(cars_info)
## ID mpg cylinders displacement
## Min. : 1.0 Min. : 9.00 Min. :3.000 Min. : 68.0
## 1st Qu.:100.2 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2
## Median :199.5 Median :23.00 Median :4.000 Median :148.5
## Mean :199.5 Mean :23.51 Mean :5.455 Mean :193.4
## 3rd Qu.:298.8 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :398.0 Max. :46.60 Max. :8.000 Max. :455.0
##
## horsepower weight acceleration model
## 150 : 22 Min. :1613 Min. : 8.00 Min. :70.00
## 90 : 20 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00
## 88 : 19 Median :2804 Median :15.50 Median :76.00
## 110 : 18 Mean :2970 Mean :15.57 Mean :76.01
## 100 : 17 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00
## 75 : 14 Max. :5140 Max. :24.80 Max. :82.00
## (Other):288
## origin car_name price
## Min. :1.000 ford pinto : 6 Min. : 1598
## 1st Qu.:1.000 amc matador : 5 1st Qu.:23110
## Median :1.000 ford maverick : 5 Median :30000
## Mean :1.573 toyota corolla: 5 Mean :29684
## 3rd Qu.:2.000 amc gremlin : 4 3rd Qu.:36430
## Max. :3.000 amc hornet : 4 Max. :53746
## (Other) :369
I see several issues with how the read.csv function imported the data that need to be cleaned up before
going in-depth with the analysis. I will fix those in the code below:
# The pipe operator and later data-manipulation verbs come from dplyr
library(dplyr)
# Cylinders came in as an integer, when it should be a multi-valued discrete,
# otherwise known as a "factor" in R.
cars$cylinders = cars$cylinders %>%
factor(labels = sort(unique(cars$cylinders)))
# Horsepower was imported as a factor, but it should be a continuous numerical
# variable.
cars$horsepower = as.numeric(levels(cars$horsepower))[cars$horsepower]
# I will change the model (year) column from an integer to a categorical factor.
model_years = sort(unique(cars$model))
cars$model = cars$model %>%
factor(labels = model_years)
2.1
#Displaying the First 10 Rows
head(cars, 10)
Results
The first ten rows of the table are returned, with the columns ID, mpg, cylinders, displacement, horsepower, weight, acceleration, model, origin, car_name and price.
2.2
Eight-cylinder cars with miles per gallon greater than 18
To show the eight-cylinder cars with miles per gallon greater than 18, I am going to use the following R script code to show the results.
cars_info[cars_info$cylinders == 8 & cars_info$mpg > 18, c('ID', 'mpg', 'car_name')]
Results
2.3
The average horsepower and mpg by cylinder group
table(cars_info$cylinders)
Results
##
## 3 4 5 6 8
## 4 204 3 84 103
mean(cars_info$mpg)
## 23.51457286
mean(cars_info$horsepower)
## 79.52173913
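The heading above asks for the averages by cylinder group rather than the overall means; a minimal sketch of how the group-wise means could be computed, using the cleaned cars data frame from the data-preparation step (not output reproduced from the original report):
# Average mpg and horsepower for each cylinder group
aggregate(cbind(mpg, horsepower) ~ cylinders, data = cars, FUN = mean)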
2.4
All cars with fewer than eight cylinders and with acceleration from 11 to 13
(both limits inclusive)
To show all cars with fewer than eight cylinders and with acceleration from 11 to 13 (both limits inclusive), I am going to use the following R script code to show the results.
cars_info[cars_info$cylinders < 8 & cars_info$acceleration >= 11 & cars_info$acceleration <= 13,
c('ID', 'cylinders', 'acceleration', 'car_name')]
Results
2.5
The car names and horsepower of the cars with 3 cylinders
To show the car names and horsepower of the cars with 3 cylinders I am going to use the following R script code to show the results.
cars_info[cars_info$cylinders == 3, c('car_name', 'horsepower')]
Results
The earlier queries can also be written in dplyr style. The eight-cylinder cars with miles per gallon greater than 18 (section 2.2):
cars_info %>% filter(cylinders == 8, mpg > 18) %>% select(ID, mpg, car_name)
The cars with fewer than eight cylinders and acceleration from 11 to 13 inclusive (section 2.4):
cars_info %>% filter(cylinders < 8, acceleration >= 11, acceleration <= 13) %>%
select(ID, cylinders, acceleration, car_name)
Results
4. Visualization schematics
In this section I will take a look at the distribution of values for each variable in the dataset by creating
histograms using ggplot2’s qplot function. I am trying to find out if there is more data to clean up,
including outliers or extraneous values. This also might help me begin to identify any relationships
between variables that are worth investigating further.
4.1
The distribution of values for Mpg and cylinders by creating two relevant
histograms.
# Miles Per Gallon
qplot(cars$mpg, xlab = 'Miles Per Gallon', ylab = 'Count', binwidth = 2,
main='Frequency Histogram: Miles per Gallon')
qplot(cars$cylinders, xlab = 'Cylinders', ylab = 'Count',
main='Frequency Histogram: Number of Cylinders')
4.2
Boxplot to show the mean and distribution of Mpg measurements for each year
in the sample
The next plot uses boxplots to show the mean and distribution of MPG measurements for each year in the
sample.
ggplot(data = cars, aes(x = model, y = mpg)) +
geom_boxplot() +
xlab('Model Year') +
ylab('MPG') +
ggtitle('MPG Comparison by Model Year')
The trend over time shows a meaningful increase in MPG from the earliest model years in the sample (1970) to the latest (1982).
4.3
Scatter plot showing the relationship between weight and Mpg
The same boxplot approach could also be used to compare the distribution of MPG values across the region of origin.
Next I use ggplot2 charting techniques to visualize how one variable affects another, starting with how weight affects MPG, using a scatter plot overlaid with a linear best-fit line.
ggplot(data = cars, aes(x = weight, y = mpg)) +
geom_point() +
geom_smooth(method='lm') +
xlab('Weight') +
ylab('MPG') +
ggtitle('MPG vs. Weight: Entire Sample')
Justification
The data clearly show that weight and MPG are inversely related: as weight increases, MPG decreases.
The R-squared of the linear best-fit line, as shown below, is over 70%. This means that variation in a
car’s weight explains over 70% of the variation in its MPG.
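The R-squared quoted above could be obtained from a simple linear model; a minimal sketch (the exact call does not appear in the report):
# Fit a linear model of MPG on weight and extract the R-squared of the fit
fit_wt <- lm(mpg ~ weight, data = cars)
summary(fit_wt)$r.squared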
Question 2
OLAP Operations in R
Generation of Sales Functions
Here we first create a sales fact table that records each sales transaction.
# Generating state table
state_table <-
data.frame(key=c("FF", "LA", "SY", "SS", "CT"),
name=c("Frankfurt", "Los Angeles", "Sydney", "Seoul", "Cape Town"),
country=c("Germany", "USA", "Australia", "Korea", "South Africa"))
#Generating Month table
month_table <-
data.frame(key=1:12,
desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
#Generating Products Table
prod_table <-
data.frame(key=c("Washing Machine", "Fridge", "Vacuum cleaner","Microwave Oven"),
price=c(200, 500, 400, 150))
#Generation of Sales Function
gen_sales <- function(no_of_recs) {
# Generate transaction data randomly
loc <- sample(state_table$key, no_of_recs,
replace=T, prob=c(2,2,1,1,1))
time_month <- sample(month_table$key, no_of_recs, replace=T)
time_year <- sample(c(2015,2016,2017,2018,2019,2020), no_of_recs, replace=T)
prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2, 1)) # one probability weight per product
unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
amount <- unit*prod_table$price[match(prod, prod_table$key)]
sales <- data.frame(month=time_month,
year=time_year,
loc=loc,
prod=prod,
unit=unit,
amount=amount)
# Sort the records by time order
sales <- sales[order(sales$year, sales$month),]
row.names(sales) <- NULL
return(sales)
}
#Generating 500 Rows of Sales_Fact and Showing First 10
sales_fact <- gen_sales(500)
head(sales_fact, 10)
#Creating Revenue Cube
revenue_cube <-
tapply(sales_fact$amount,
sales_fact[,c("prod", "month", "year", "loc")],
FUN=function(x){return(sum(x))})
Now we are going to print the full revenue cube, showing the sales for every combination of product, month, year and location.
#Printing the full revenue cube
revenue_cube
Results
OLAP Operations
Here are some common OLAP operations:
Slice
Dice
Rollup
Drilldown
Pivot
"Slice" is about fixing certain dimensions to analyze the remaining dimensions. For example, we can
focus in the sales happening in "2015", "Jan", or we can focus in the sales happening in "2015", "Jan",
"Fridge".
# Slice
# cube data in Jan, 2015
revenue_cube[, "1", "2015",]
# cube data for Fridge in Jan, 2015
revenue_cube["Fridge", "1", "2015",]
"Dice" is about limited each dimension to a certain range of values, while keeping the number of
dimensions the same in the resulting cube. For example, we can focus in sales happening in [Jan/
Feb/Mar, Fridge/Microwave Oven, LA].
#Dice: cube data for the first 3 months of each year
revenue_cube[c("Fridge","Microwave Oven"),
c("1","2","3"),
,
c("LA")]
"Rollup" is about applying an aggregation function to collapse a number of dimensions. For example,
we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't
care where we sold our product).
#Roll Up
apply(revenue_cube, c("year", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
"Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of
granularity. For example, we want to focus in the annual and monthly revenue for each product and
collapse the location dimension (ie: we don't care where we sold our product).
#Drilldown
apply(revenue_cube, c("year", "month", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
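"Pivot" was listed among the operations above but is not demonstrated in the original code; a minimal sketch (an assumption, not part of the original report) is to rotate the rolled-up view so that products become the rows and years the columns:
#Pivot (sketch): product-by-year view of the rollup
apply(revenue_cube, c("prod", "year"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})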
Appendix
Code
# Generating state table
state_table <-
data.frame(key=c("FF", "LA", "SY", "SS", "CT"),
name=c("Frankfurt", "Los Angeles", "Sydney", "Seoul", "Cape Town"),
country=c("Germany", "USA", "Australia", "Korea", "South Africa"))
#Generating Month table
month_table <-
data.frame(key=1:12,
desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
#Generating Products Table
prod_table <-
data.frame(key=c("Washing Machine", "Fridge", "Vacuum cleaner","Microwave Oven"),
price=c(200, 500, 400, 150))
#Generation of Sales Function
gen_sales <- function(no_of_recs) {
# Generate transaction data randomly
loc <- sample(state_table$key, no_of_recs,
replace=T, prob=c(2,2,1,1,1))
time_month <- sample(month_table$key, no_of_recs, replace=T)
time_year <- sample(c(2015,2016,2017,2018,2019,2020), no_of_recs, replace=T)
prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2, 1)) # one probability weight per product
unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
amount <- unit*prod_table$price[match(prod, prod_table$key)]
sales <- data.frame(month=time_month,
year=time_year,
loc=loc,
prod=prod,
unit=unit,
amount=amount)
# Sort the records by time order
sales <- sales[order(sales$year, sales$month),]
row.names(sales) <- NULL
return(sales)
}
#Generating 500 Rows of Sales_Fact and Showing First 10
sales_fact <- gen_sales(500)
head(sales_fact, 10)
#Creating Revenue Cube
revenue_cube <-
tapply(sales_fact$amount,
sales_fact[,c("prod", "month", "year", "loc")],
FUN=function(x){return(sum(x))})
#Printing the full revenue cube
revenue_cube
# Slice
# cube data in Jan, 2015
revenue_cube[, "1", "2015",]
# cube data for Fridge in Jan, 2015
revenue_cube["Fridge", "1", "2015",]
#Dice: cube data for the first 3 months of each year
revenue_cube[c("Fridge","Microwave Oven"),
c("1","2","3"),
,
c("LA")]
#Roll Up
apply(revenue_cube, c("year", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
#Drilldown
apply(revenue_cube, c("year", "month", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))})
Question 3
Introduction:
This analysis examines data from liver patients, focusing on the relationship between the main
liver enzymes, proteins, age and sex, which are used to try to predict the likelihood of liver disease.
Typically, models like these, trained on very large data sets, will eventually evolve to provide
better health outcomes with less invasive methods by predicting possible illness before symptoms appear,
rather than waiting for external symptoms.
In the case of this study, detecting early signs of liver disease from demographic information and
protein levels could reduce recovery time and increase the length and quality of life
of high-risk people.
Data:
Importing data in MySQL
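The import code did not survive in full; the following is a minimal sketch, assuming a CSV export with no header row and a hypothetical file name, with the column names as they are used later in the report:
# Read the liver patient records into R; the path and file name are hypothetical
liver <- read.csv('C:/ProgramData/MySQL/MySQLServer/Uploads/liver_patients.csv',
                  header = FALSE)
names(liver) <- c('Age', 'Sex', 'Tot_Bil', 'Dir_Bil', 'Alkphos', 'Alamine',
                  'Aspartate', 'Tot_Prot', 'Albumin', 'A_G_Ratio', 'Disease')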
About the Data
There are 583 observations: 416 represent subjects with diseased livers and 167 represent subjects
without diseased livers.
The data represent 441 male subjects (of whom 324 have liver disease) and 142 female subjects
(of whom 92 have liver disease).
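These counts can be checked with a simple cross-tabulation (a sketch, assuming the data frame read in above is named liver):
# Subjects by sex and disease status
table(liver$Sex, liver$Disease)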
Data Dictionary
1. Age = Age of the patient (all subjects greater than 89 are labelled 90)
2. Sex = Gender of the patient Female Male
3. Tot_Bil = Total Bilirubin
4. Dir_Bil = Direct Bilirubin
5. Alk_Phos = Alkaline Phosphatase
6. Alamine = Alamine Aminotransferase
7. Aspartate = Aspartate Aminotransferase
8. Tot_Prot = Total Proteins
9. Albumin = Albumin
10. A_G_Ratio = Albumin and Globulin Ratio
11. Disease = Disease State (as labelled by the medical experts): 0 = not diseased, 1 = diseased
This has helped to stabilize the data, but it is still not completely normal, so it is
important to have realistic expectations about the eventual predictive power of each variable
within the model as a whole.
Age
Summary statistics of Age are computed for each sex and disease group (mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se); the group sizes are:
Sex  Disease    n
F    0         50
F    1         92
M    0        117
M    1        324
Total Bilirubin
Bilirubin is a byproduct of hemolytic catabolism and is one of many substances the liver filters
from the body. A heightened presence of either total or direct bilirubin can be indicative of liver disease and is the
cause of the skin yellowing associated with jaundice (Medscape).
[Table: summary statistics for Total Bilirubin (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Direct Bilirubin
[Table: summary statistics for Direct Bilirubin (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Alkaline Phosphatase
This is one of the enzymes included in a normal liver panel, frequently used to estimate overall
liver health.
[Table: summary statistics for Alkaline Phosphatase (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Alamine Aminotransferase
Alamine Aminotransferase is a natural part of the liver ecosystem tested for in a liver panel.
[Table: summary statistics for Alamine Aminotransferase (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Aspartate Aminotransferase
Aspartate Aminotransferase is also a natural part of the liver ecosystem tested for in a liver
panel.
[Table: summary statistics for Aspartate Aminotransferase (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Total Proteins
Total protein is a measure of both albumin and globulin combined.
[Table: summary statistics for Total Proteins (n = 583): mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se]
Albumin
Albumin is a blood protein which adds structure to the vascular system, preventing blood
from seeping out through the vessel walls.
[Table: summary statistics for Albumin (mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se)]
Inference:
The goal of this analysis is to see how well a logistic regression can be tuned on these data to
predict the presence of liver disease.
Preparing Data for Analysis
-Disease, -Splits)
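Only the fragment above survived from the preparation code; the following is a minimal sketch of one way the data could be split into training and validation sets and the predictors separated from the outcome (the Splits column, proportions and seed are assumptions):
library(dplyr)
set.seed(123)                          # hypothetical seed
liver$Splits <- sample(c('train', 'test'), nrow(liver),
                       replace = TRUE, prob = c(0.7, 0.3))
train <- liver %>% filter(Splits == 'train')
test <- liver %>% filter(Splits == 'test')
# Predictors without the outcome and the split indicator, matching the
# surviving fragment select(-Disease, -Splits)
predictors <- train %>% select(-Disease, -Splits)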
Training Summary
[Table: summary statistics of the training data (n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se) for Age, Sex, Tot_Bil, Dir_Bil, Alkphos, Alamine, Aspartate, Tot_Prot, Albumin, A_G_Ratio, Disease and Splits]
Coefficients:
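The coefficient table itself is not reproduced here; the following is a minimal sketch of how the model, its coefficients and the McFadden pseudo R-squared discussed below could be obtained (formula and object names are assumptions):
# Logistic regression of disease status on all predictors in the training set
model <- glm(Disease ~ . - Splits, data = train, family = binomial)
summary(model)$coefficients

# McFadden pseudo R-squared: 1 minus the ratio of the model log-likelihood to
# the log-likelihood of an intercept-only model
null_model <- glm(Disease ~ 1, data = train, family = binomial)
1 - as.numeric(logLik(model) / logLik(null_model))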
Based on the above output, using the McFadden pseudo R-squared as a guide, it appears that this model
explains approximately 22.5% of the variation in the disease classification.
Another estimate of utility is the Coefficient of Discrimination, a test metric which subtracts the
mean of all no-disease probabilities from the mean of all disease probabilities, based on the
predicted outcomes of the test data.
# Add a predicted class, using a 50% probability cut-off to round up and down.
# Accuracy is simply the mean of all trues (1) where prediction = reality.
# Mean probabilities are taken where the true outcome is disease and where it is not;
# comparing the two gives a sense of the overall predictive power.
Accuracy: 0.6889952
The accuracy is simply the proportion of times the model was right, which, at around 69%, seems
considerably higher than the model fit estimates would suggest.
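A minimal sketch of how the accuracy and the Coefficient of Discrimination described above could be computed on the validation data (continuing from the sketches above; object names are assumptions):
# Predicted probabilities on the test set, then a predicted class at a 50% cut-off
test$predictions <- predict(model, newdata = test, type = 'response')
pred_class <- ifelse(test$predictions >= 0.5, 1, 0)

# Accuracy: the proportion of cases where prediction = reality
mean(pred_class == test$Disease)

# Coefficient of Discrimination: mean predicted probability for the diseased
# group minus the mean for the non-diseased group
mean(test$predictions[test$Disease == 1]) - mean(test$predictions[test$Disease == 0])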
Discussion, comparison of methods
The method relies on two conditions:
1. Each predictor X_i is linearly related to logit(p_i) when all other predictors are held constant.
Based on the following samples of x plotted against the predicted probabilities of the logistic
regression, it seems as though there may be some concerns with using these particular
regressors in a logistic model, even transformed.
require(ggplot2)
se = FALSE)
ggplot(train, aes(x = Dir_Bil, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
ggplot(train, aes(x = Alkphos, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
ggplot(train, aes(x = Albumin, y = predictions)) + geom_point() + geom_smooth(method = "lm",
se = FALSE)
2. Each Y_i is independent of the other outcomes.
No mention of relations between family members was included in the metadata for this set, so the
assumption is that all subjects are independent of each other.
Conclusion:
Given the estimated predictive power of the pseudo R-squared values at around 22%, as well
as the model condition of a linear relationship between the predictors and the logit not being met,
I would not consider using this model as a diagnostic tool, or automating it.
However, given that the accuracy is approximately 69% on the validation set, with further tuning of this model on
more validation data, it might be useful as a tool to suggest further definitive testing for liver
disease in patients whom this model flags as possibly positive.
It might be that this type of model, while not diagnostic, is useful in early detection, like a
mammogram, which also is indicative of breast cancer, but not diagnostic.
If such a model is to be useful, then a similar method would need to be applied to a more racially
diverse data set with more even gender distributions.
Improvements
Although there were variables which were not in and of themselves statistically significant
(meaning we cannot reliably tell what their precise contribution to the disease diagnosis is),
removing them in any order did nothing but lower the accuracy, the Coefficient of Discrimination
and the pseudo R-squared scores. This indicates that, for sheer ability to predict the likelihood of a
patient having liver disease based on the included variables, the whole model is more accurate
than an abridged logistic model.