
Assignment


ASAD KHAN
Table of Contents

Question 1
    Loading the Data in MySQL
    Queries and their Outcome
    Analysis
    Visualization schematics
    Justification
Question 2
    OLAP Operations in R
    Appendix
Question 3
    Introduction
Question 1
Loading the Data in MySQL
To load the data, I first need to create a table.

CREATE TABLE cars_info (
  ID int NOT NULL PRIMARY KEY AUTO_INCREMENT,
  mpg double,
  cylinders int,
  displacement double,
  horsepower varchar(35),  -- kept as text: the raw file marks missing values with a non-numeric placeholder
  weight int,
  acceleration double,
  model int,
  origin int,
  car_name varchar(35) NOT NULL,
  price double
);

 And now I can load the data from my directory.

# Load the CSV file
LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQLServer/Uploads/cars_info.csv'
INTO TABLE cars_info
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

Queries and their Outcome


I will use the following R libraries to assist in my analysis:
library(magrittr)
library(plyr)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)

Analysis
The first step in the process of analyzing the datasets is loading them into R dataframes.

# Load the CSV file
cars = read.csv('C:/ProgramData/MySQL/MySQLServer/Uploads/cars_info.csv', header=TRUE)

R’s str function gives me a look at the data types in the cars dataset, and the summary function shows basic summary statistics for each column.
summary(cars)
## ID mpg cylinders displacement
## Min. : 1.0 Min. : 9.00 Min. :3.000 Min. : 68.0
## 1st Qu.:100.2 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2
## Median :199.5 Median :23.00 Median :4.000 Median :148.5
## Mean :199.5 Mean :23.51 Mean :5.455 Mean :193.4
## 3rd Qu.:298.8 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :398.0 Max. :46.60 Max. :8.000 Max. :455.0
##
## horsepower weight acceleration model
## 150 : 22 Min. :1613 Min. : 8.00 Min. :70.00
## 90 : 20 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00
## 88 : 19 Median :2804 Median :15.50 Median :76.00
## 110 : 18 Mean :2970 Mean :15.57 Mean :76.01
## 100 : 17 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00
## 75 : 14 Max. :5140 Max. :24.80 Max. :82.00
## (Other):288
## origin car_name price
## Min. :1.000 ford pinto : 6 Min. : 1598
## 1st Qu.:1.000 amc matador : 5 1st Qu.:23110
## Median :1.000 ford maverick : 5 Median :30000
## Mean :1.573 toyota corolla: 5 Mean :29684
## 3rd Qu.:2.000 amc gremlin : 4 3rd Qu.:36430
## Max. :3.000 amc hornet : 4 Max. :53746
## (Other) :369

I see several issues with how the read.csv function imported the data that need to be cleaned up before
going in-depth with the analysis. I will fix those in the code below:

# Cylinders came in as an integer, when it should be a multi-valued discrete, 
# otherwise known as a "factor" in R. 
cars$cylinders = cars$cylinders %>%
  factor(labels = sort(unique(cars$cylinders)))

# Horsepower was imported as a factor, but it should be a continuous numerical 
# variable.
cars$horsepower = as.numeric(levels(cars$horsepower))[cars$horsepower]

# I will change the model (year) column from an integer to a categorical factor.
model_years = sort(unique(cars$model))
cars$model = cars$model %>%
  factor(labels = model_years)
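To confirm the conversions took effect, the structure can be re-checked; a quick sketch using the str function mentioned above:

# Re-check column types after cleanup: cylinders and model should now be
# factors, and horsepower should be numeric
str(cars)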

2.1
#Displaying the First 10 Rows 
head(cars, 10)

Results

The first 10 rows are:

# A tibble: 10 x 11
      ID   mpg cylinders displacement horsepower weight acceleration model origin car_name                   price
   <dbl> <dbl>     <dbl>        <dbl> <chr>       <dbl>        <dbl> <dbl>  <dbl> <chr>                      <dbl>
 1     1    18         8          307 130          3504         12.0    70      1 chevrolet chevelle malibu 25562.
 2     2    15         8          350 165          3693         11.5    70      1 buick skylark 320         24221.
 3     3    18         8          318 150          3436         11.0    70      1 plymouth satellite        27241.
 4     4    16         8          304 150          3433         12.0    70      1 amc rebel sst             33685.
 5     5    17         8          302 140          3449         10.5    70      1 ford torino               20000
 6     6    15         8          429 198          4341         10.0    70      1 ford galaxie 500          30000
 7     7    14         8          454 220          4354          9.0    70      1 chevrolet impala          35764.
 8     8    14         8          440 215          4312          8.5    70      1 plymouth fury iii         25899.
 9     9    14         8          455 225          4425         10.0    70      1 pontiac catalina          32883.
10    10    15         8          390 190          3850          8.5    70      1 amc ambassador dpl        32617.

2.2
Eight-cylinder cars with miles per gallon greater than 18

To show the eight-cylinder cars with miles per gallon greater than 18, I am going to use the following R script:

cars[cars$cylinders == 8 & cars$mpg > 18, c('ID', 'mpg', 'car_name')]

Results

The results are the following:

 ID  mpg car_name
166 20.0 chevrolet monza 2+2
250 19.9 oldsmobile cutlass salon brougham
251 19.4 dodge diplomat
252 20.2 mercury monarch ghia
263 19.2 chevrolet monte carlo landau
265 18.1 ford futura
289 18.2 dodge st. regis
292 19.2 chevrolet malibu classic (sw)
293 18.5 chrysler lebaron town @ country (sw)
299 23.0 cadillac eldorado
301 23.9 oldsmobile cutlass salon brougham
365 26.6 oldsmobile cutlass ls

2.3
The average horsepower and mpg by cylinder group

First, we need to know the cylinder groups:

table(cars$cylinders)

Results

##
##   3   4   5   6   8
##   4 204   3  84 103

The overall mpg average is:

mean(cars$mpg)

## 23.51457286

The overall horsepower average is:

mean(cars$horsepower, na.rm = TRUE)

## 79.52173913
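The calls above give the overall means; since the heading asks for the averages by cylinder group, a minimal dplyr sketch (assuming the cleaned cars data frame from above) would group before averaging:

# Average mpg and horsepower within each cylinder group
cars %>%
  group_by(cylinders) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE),
            avg_horsepower = mean(horsepower, na.rm = TRUE))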

2.4
All cars with fewer than eight cylinders and acceleration from 11 to 13 (both limits inclusive)

To show all cars with fewer than eight cylinders and with acceleration from 11 to 13 (both limits inclusive), I am going to use the following R script:

# cylinders was converted to a factor above, so convert it back to numeric for the comparison
cars[as.numeric(as.character(cars$cylinders)) < 8 &
     cars$acceleration >= 11 & cars$acceleration <= 13,
     c('ID', 'cylinders', 'acceleration', 'car_name')]

Results

The results are the following:

 ID cylinders acceleration car_name
 24         4         12.5 bmw 2002
 34         6         13.0 amc gremlin
204         4         12.2 volkswagen rabbit
243         4         12.8 bmw 320i
307         6         11.3 chevrolet citation
308         6         12.9 oldsmobile omega brougham
334         6         11.4 datsun 280-zx
335         3         12.5 mazda rx-7 gs
342         6         12.6 chevrolet citation
343         4         12.9 plymouth reliant
362         6         12.6 toyota cressida
392         4         13.0 dodge charger 2.2
396         4         11.6 dodge rampage

2.5
The car names and horsepower of the cars with 3 cylinders

To show the car names and horsepower of the cars with 3 cylinders, I am going to use the following R script:

cars[cars$cylinders == 3, c('ID', 'cylinders', 'horsepower', 'car_name')]

The results are the following:

 ID cylinders horsepower car_name
 72         3         97 mazda rx2 coupe
244         3        110 mazda rx-4
335         3        100 mazda rx-7 gs
3
Eight-cylinder cars with miles per gallon greater than 18

To show the eight-cylinder cars with miles per gallon greater than 18 using dplyr, I am going to use the following R script:

cars %>%
  filter(cylinders == 8, mpg > 18) %>%
  select(ID, mpg, car_name)

Results

The results are the following:

 ID  mpg car_name
166 20.0 chevrolet monza 2+2
250 19.9 oldsmobile cutlass salon brougham
251 19.4 dodge diplomat
252 20.2 mercury monarch ghia
263 19.2 chevrolet monte carlo landau
265 18.1 ford futura
289 18.2 dodge st. regis
292 19.2 chevrolet malibu classic (sw)
293 18.5 chrysler lebaron town @ country (sw)
299 23.0 cadillac eldorado
301 23.9 oldsmobile cutlass salon brougham
365 26.6 oldsmobile cutlass ls

All cars with fewer than eight cylinders and acceleration from 11 to 13 (both limits inclusive)

To show all cars with fewer than eight cylinders and with acceleration from 11 to 13 (both limits inclusive) using dplyr, I am going to use the following R script:

cars %>%
  filter(as.numeric(as.character(cylinders)) < 8,
         between(acceleration, 11, 13))

Results

The results are the following:

 ID  mpg cylinders displacement horsepower weight acceleration model origin car_name
 24 26.0         4          121        113   2234         12.5    70      2 bmw 2002
 34 19.0         6          232        100   2634         13.0    71      1 amc gremlin
204 29.5         4           97         71   1825         12.2    76      2 volkswagen rabbit
243 21.5         4          121        110   2600         12.8    77      2 bmw 320i
307 28.8         6          173        115   2595         11.3    79      1 chevrolet citation
308 26.8         6          173        115   2700         12.9    79      1 oldsmobile omega brougham
334 32.7         6          168        132   2910         11.4    80      3 datsun 280-zx
335 23.7         3           70        100   2420         12.5    80      3 mazda rx-7 gs
342 23.5         6          173        110   2725         12.6    81      1 chevrolet citation
343 30.0         4          135         84   2385         12.9    81      1 plymouth reliant
362 25.4         6          168        116   2900         12.6    81      3 toyota cressida
392 36.0         4          135         84   2370         13.0    82      1 dodge charger 2.2
396 32.0         4          135         84   2295         11.6    82      1 dodge rampage

The car names and horsepower of the cars with 3 cylinders

To show the car names and horsepower of the cars with 3 cylinders using dplyr, I am going to use the following R script:

cars %>%
  filter(cylinders == 3) %>%
  select(ID, cylinders, horsepower, car_name)

Results

The results are the following:

 ID cylinders horsepower car_name
 72         3         97 mazda rx2 coupe
244         3        110 mazda rx-4
335         3        100 mazda rx-7 gs

4. Visualization schematics
In this section I will take a look at the distribution of values for each variable in the dataset by creating
histograms using ggplot2’s qplot function. I am trying to find out if there is more data to clean up,
including outliers or extraneous values. This also might help me begin to identify any relationships
between variables that are worth investigating further.

4.1
The distribution of values for Mpg and cylinders by creating two relevant
histograms.

# Miles Per Gallon
qplot(cars$mpg, xlab = 'Miles Per Gallon', ylab = 'Count', binwidth = 2, 
      main='Frequency Histogram: Miles per Gallon')

qplot(cars$cylinders, xlab = 'Cylinders', ylab = 'Count', 
      main='Frequency Histogram: Number of Cylinders')
4.2
Boxplot to show the mean and distribution of Mpg measurements for each year
in the sample

The next plot uses boxplots to show the mean and distribution of MPG measurements for each year in the
sample.
ggplot(data = cars, aes(x = model, y = mpg)) +
  geom_boxplot() +
  xlab('Model Year') +
  ylab('MPG') +
  ggtitle('MPG Comparison by Model Year')

The trend over time shows a meaningful increase in MPG: the yearly medians rise noticeably from the earliest model years to the latest.
4.3
Scatter plot showing the relationship between weight and Mpg

Next, I will use ggplot2 charting techniques to visualize how one variable affects another, starting with how weight affects MPG, using a scatter plot overlaid with a linear best-fit line.
ggplot(data = cars, aes(x = weight, y = mpg)) +
  geom_point() +
  geom_smooth(method='lm') +
  xlab('Weight') +
  ylab('MPG') +
  ggtitle('MPG vs. Weight: Entire Sample')

Justification
The data clearly shows that weight and MPG are inversely related: as weight increases, MPG decreases.
The R-squared of the linear best-fit line, as shown below, is over 70%. This means that variation in a
car’s weight explains over 70% of the variation in its MPG.
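That figure can be checked by fitting the same linear model that geom_smooth draws; a sketch assuming the cars data frame from above:

# Fit mpg ~ weight and report the R-squared of the linear best-fit line
fit_wt <- lm(mpg ~ weight, data = cars)
summary(fit_wt)$r.squared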
Question 2
OLAP Operations in R
Generation of Sales Functions
Here we first create a sales fact table that records each sales transaction.

# Generating the state table
state_table <- 
  data.frame(key=c("FF", "LA", "SY", "SS", "CT"),
             name=c("Frankfurt", "Los Angeles", "Sydney", "Seoul-s", "Cape-Town"),
             country=c("Germany", "USA", "Australia", "Korea", "South Africa"))

#Generating Month table
month_table <- 
  data.frame(key=1:12,
             desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
             quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
#Generating Products Table
prod_table <- 
  data.frame(key=c("Washing Machine", "Fridge", "Vacuum cleaner","Microwave Oven"),
             price=c(200, 500, 400, 150))

#Generation of Sales Function
gen_sales <- function(no_of_recs) {
  
  # Generate transaction data randomly
  loc <- sample(state_table$key, no_of_recs, 
                replace=T, prob=c(2,2,1,1,1))
  time_month <- sample(month_table$key, no_of_recs, replace=T)
  time_year <- sample(c(2015,2016,2017,2018,2019,2020), no_of_recs, replace=T)

  # NOTE: four sampling weights are needed because prod_table has four
  # products (three weights, as originally written, would error);
  # the fourth weight is an assumption
  prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2, 1))
  unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
  # Look up each product's price by name rather than by factor index
  amount <- unit*prod_table$price[match(prod, prod_table$key)]
  
  sales <- data.frame(month=time_month,
                      year=time_year,
                      loc=loc,
                      prod=prod,
                      unit=unit,
                      amount=amount)
  
  # Sort the records by time order
  sales <- sales[order(sales$year, sales$month),]
  row.names(sales) <- NULL
  return(sales)
}

# Generating 500 rows of sales_fact and showing the first 10
sales_fact <- gen_sales(500)
head(sales_fact, 10)

Revenue Cube Creation


Now, we turn this fact table into a hypercube with multiple dimensions. Each cell within the cube
holds an aggregate value for a single combination of dimension values.

#Creating Revenue Cube
revenue_cube <- 
  tapply(sales_fact$amount, 
         sales_fact[,c("prod", "month", "year", "loc")], 
         FUN=function(x){return(sum(x))})
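A quick way to see the cube's structure is to inspect its dimensions; a sketch assuming the revenue_cube built above:

# Expect one entry per dimension level,
# e.g. 4 products x 12 months x 6 years x 5 locations
dim(revenue_cube)
dimnames(revenue_cube)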

Now we print out the full cube, which covers every product, month, year, and location.

# Print the full revenue cube
revenue_cube
Results
OLAP Operations
Here are some common OLAP operations:

- Slice
- Dice
- Rollup
- Drilldown
- Pivot

"Slice" is about fixing certain dimensions to analyze the remaining dimensions.  For example, we can
focus in the sales happening in "2015", "Jan", or we can focus in the sales happening in "2015", "Jan",
"Fridge".
# Slice
# cube data in Jan, 2015
revenue_cube[, "1", "2015",]

# cube data for Fridge in Jan, 2015
revenue_cube["Fridge", "1", "2015",]

 "Dice" is about limited each dimension to a certain range of values, while keeping the number of
dimensions the same in the resulting cube.  For example, we can focus in sales happening in [Jan/
Feb/Mar, Fridge/Microwave Oven, LA].

# Dice: cube data for Fridge/Microwave Oven, first 3 months of each year, in LA
revenue_cube[c("Fridge","Microwave Oven"), 
             c("1","2","3"), 
             ,
             c("LA")]
"Rollup" is about applying an aggregation function to collapse a number of dimensions.  For example,
we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't
care where we sold our product). 

#Roll Up
apply(revenue_cube, c("year", "prod"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})

"Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of
granularity.  For example, we want to focus in the annual and monthly revenue for each product and
collapse the location dimension (ie: we don't care where we sold our product).

#Drilldown
apply(revenue_cube, c("year", "month", "prod"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))})
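"Pivot" is the last operation in the list above: rotating the cube to compare two different dimensions against each other. A minimal sketch, reusing the same apply pattern on revenue_cube, that views total revenue as months against years:

# Pivot: aggregate away product and location, leaving a month x year view
apply(revenue_cube, c("month", "year"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})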

Appendix
Code#
# Generating the state table
state_table <- 
  data.frame(key=c("FF", "LA", "SY", "SS", "CT"),
             name=c("Frankfurt", "Los Angeles", "Sydney", "Seoul-s", "Cape-Town"),
             country=c("Germany", "USA", "Australia", "Korea", "South Africa"))
#Generating Month table
month_table <- 
  data.frame(key=1:12,
             desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
             quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4"))
#Generating Products Table
prod_table <- 
  data.frame(key=c("Washing Machine", "Fridge", "Vacuum cleaner","Microwave Oven"),
             price=c(200, 500, 400, 150))
#Generation of Sales Function
gen_sales <- function(no_of_recs) {
  
  # Generate transaction data randomly
  loc <- sample(state_table$key, no_of_recs, 
                replace=T, prob=c(2,2,1,1,1))
  time_month <- sample(month_table$key, no_of_recs, replace=T)
   time_year <- sample(c(2015,2016,2017,2018,2019,2020), no_of_recs, replace=T)
  # NOTE: four sampling weights are needed because prod_table has four
  # products; the fourth weight is an assumption
  prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2, 1))
  unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
  # Look up each product's price by name rather than by factor index
  amount <- unit*prod_table$price[match(prod, prod_table$key)]
  
  sales <- data.frame(month=time_month,
                      year=time_year,
                      loc=loc,
                      prod=prod,
                      unit=unit,
                      amount=amount)
  
  # Sort the records by time order
  sales <- sales[order(sales$year, sales$month),]
  row.names(sales) <- NULL
  return(sales)
}
# Generating 500 rows of sales_fact and showing the first 10
sales_fact <- gen_sales(500)
head(sales_fact, 10)
#Creating Revenue Cube
revenue_cube <- 
  tapply(sales_fact$amount, 
         sales_fact[,c("prod", "month", "year", "loc")], 
         FUN=function(x){return(sum(x))})
# Print the full revenue cube
revenue_cube
# Slice
# cube data in Jan, 2015
revenue_cube[, "1", "2015",]
# cube data for Fridge in Jan, 2015
revenue_cube["Fridge", "1", "2015",]
# Dice: cube data for Fridge/Microwave Oven, first 3 months of each year, in LA
revenue_cube[c("Fridge","Microwave Oven"), 
             c("1","2","3"), 
             ,
             c("LA")]
#Roll Up
apply(revenue_cube, c("year", "prod"),
      FUN=function(x) {return(sum(x, na.rm=TRUE))})
#Drilldown
apply(revenue_cube, c("year", "month", "prod"), 
      FUN=function(x) {return(sum(x, na.rm=TRUE))}
Question 3
Introduction:
This analysis examines data from liver patients, focusing on the relationship between key liver enzymes, proteins, age, and sex, and on how these measurements can be used to predict the likelihood of liver disease.

Typically, models like these, trained on very large data sets, will eventually evolve to provide better health outcomes with less invasive methods by predicting possible illness before symptoms appear, rather than waiting for external symptoms.

In the case of this study, detecting early signs of liver disease from demographic information and protein levels could reduce recovery time and increase the length and quality of life of high-risk people.

Data:
Importing data in MySQL

To load the data, I first need to create a table.

CREATE TABLE liver (
  ID int NOT NULL PRIMARY KEY AUTO_INCREMENT,
  age int,
  gender varchar(20),
  tot_bilirubin double,
  direct_bilirubin double,
  tot_proteins int,
  albumin int,
  ag_ratio int,
  sgpt double,
  sgot double NOT NULL,
  alkphos double,
  is_patient int
);

 And now I can load the data from my directory.

# Load the CSV file
LOAD DATA INFILE 'C:/ProgramData/MySQL/MySQLServer/Uploads/liver.csv'
INTO TABLE liver
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS;

Importing the data in R

liver_data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00225/Indian%20Liver%20Patient%20Dataset%20(ILPD).csv",
                       header = FALSE)

colnames(liver_data) <- c("Age", "Sex", "Tot_Bil", "Dir_Bil", "Alkphos", "Alamine",
                          "Aspartate", "Tot_Prot", "Albumin", "A_G_Ratio", "Disease")

liver_data$Sex <- ifelse(liver_data$Sex == "Male", "M", "F")  # made shorter

liver_data$Disease <- as.numeric(ifelse(liver_data$Disease == 2, 0, 1))  # converted to zeros and ones

Age Sex Tot_Bil Dir_Bil Alkphos Alamine Aspartate Tot_Prot Albumin A_G_Ratio Disease
 65   F     0.7     0.1     187      16        18      6.8     3.3      0.90       1
 62   M    10.9     5.5     699      64       100      7.5     3.2      0.74       1
 62   M     7.3     4.1     490      60        68      7.0     3.3      0.89       1
 58   M     1.0     0.4     182      14        20      6.8     3.4      1.00       1
 72   M     3.9     2.0     195      27        59      7.3     2.4      0.40       1
 46   M     1.8     0.7     208      19        14      7.6     4.4      1.30       1

About the Data
There are 583 observations: 416 represent subjects with diseased livers and 167 represent subjects
without diseased livers.

The data represent 441 male subjects (of whom 324 have liver disease) and 142 female subjects
(of whom 92 have liver disease).
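These counts can be verified directly from the data frame; a quick sketch assuming liver_data as loaded above:

# Cross-tabulate sex against the disease label (1 = diseased)
table(liver_data$Sex, liver_data$Disease)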

Data Dictionary
1. Age = Age of the patient (all subjects older than 89 are labelled 90)
2. Sex = Gender of the patient (Female / Male)
3. Tot_Bil = Total Bilirubin
4. Dir_Bil = Direct Bilirubin
5. Alkphos = Alkaline Phosphatase
6. Alamine = Alamine Aminotransferase
7. Aspartate = Aspartate Aminotransferase
8. Tot_Prot = Total Proteins
9. Albumin = Albumin
10. A_G_Ratio = Albumin and Globulin Ratio
11. Disease = Disease state (labeled by medical experts): 0 = not diseased, 1 = diseased

Exploratory Data Analysis:


The graphs below were created to examine the distribution of data points within each variable, as a foundation for the analysis that follows. In the first graphs, six variables were skewed to the left or right, so they were transformed using a natural log before plotting and analysis.

This has helped make the data better behaved, but the distributions are still not completely normal, so it is important to keep realistic expectations about the predictive power of each variable.

Age
   vars   n  mean    sd median trimmed   mad min max range  skew kurtosis   se
X1    1 583 44.75 16.19     45   44.84 17.79   4  90    86 -0.03    -0.57 0.67
Gender
Sex Disease Frequency
  F       0        50
  F       1        92
  M       0       117
  M       1       324

Total Bilirubin
Bilirubin is a byproduct of hemolytic catabolism and one of many substances the liver filters
from the body. A heightened presence of either total or direct bilirubin can be indicative of liver disease,
and is the cause of the skin yellowing associated with jaundice. (Medscape)

   vars   n mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 583 0.46 1.02   0.00    0.29 0.53 -0.92 4.32  5.23 1.31     0.89 0.04
Direct Bilirubin
   vars   n  mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 583 -0.65 1.33  -1.20   -0.79 1.03 -2.30 2.98  5.28 0.83    -0.30 0.05
Alkaline Phosphatase
This is one of the enzymes included in a normal liver panel, frequently used to estimate overall
liver health.

   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
X1    1 583 5.49 0.53   5.34    5.43 0.38 4.14 7.65  3.51 1.32     2.23 0.02
Alamine Aminotransferase
Alamine Aminotransferase is a natural part of the liver ecosystem tested for in a liver panel.

   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
X1    1 583 3.75 0.90   3.56    3.64 0.69 2.30 7.60  5.30 1.42     2.58 0.04
Aspartate Aminotransferase
Aspartate Aminotransferase is also a natural part of the liver ecosystem tested for in a liver
panel.

   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
X1    1 583 3.96 1.00   3.74    3.84 0.86 2.30 8.50  6.20 1.19     1.56 0.04
Total Proteins
Total protein is a measure of both albumin and globulin combined.

   vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
X1    1 583 1.85 0.18   1.89    1.87 0.15 0.99 2.26  1.27 -0.97     1.98 0.01
Albumin
Albumin is a blood protein which adds structure to the vascular system, preventing blood
from seeping out through the vessel walls.

   vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 583 3.14 0.80    3.1    3.15 0.89 0.9 5.5   4.6 -0.04    -0.40 0.03
Albumin-Globulin Ratio
The Albumin-Globulin ratio is considered an index of systemic diseases in general.

   vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 579 0.95 0.32   0.93    0.93 0.25 0.3 2.8   2.5 0.99     3.22 0.01
The above graphs suggest that the enzymes and proteins in question vary relative to the diseased
state which we may be able to understand more completely after the analysis.

Inference:
The goal of this analysis is to see how well a logistic regression can be tuned on these data to
predict the presence of liver disease.

Preparing Data for Analysis

library(caTools)  # for sample.split

set.seed(455)  # for reproducibility

liver_data$Splits <- sample.split(liver_data, SplitRatio = 0.7)  # set indexes

liver_data <- liver_data %>% mutate_each(funs(log), -Age, -Sex, -Albumin, -A_G_Ratio, -Disease, -Splits)

train <- liver_data[liver_data$Splits == TRUE, ]  # extract training set using indexes

test <- liver_data[liver_data$Splits == FALSE, ]  # extract test set using indexes

Training Summary
[Training-set descriptive statistics (vars, n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se) for Age, Sex*, Tot_Bil, Dir_Bil, Alkphos, Alamine, Aspartate, Tot_Prot, Albumin, A_G_Ratio, Disease, and Splits*.]

Logistic Regression Model


fit <- glm(Disease ~ Age + Sex + Tot_Bil + Dir_Bil + Alkphos + Alamine + Aspartate +
             Tot_Prot + Albumin + A_G_Ratio, data = train, family = binomial(link = "logit"))
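The coefficient table below comes from summarising the fitted model; a sketch:

# Coefficient estimates, standard errors, z values, and p-values
summary(fit)$coefficients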

Coefficients:

               Estimate  Std. Error    z value   Pr(>|z|)
(Intercept) -15.8559002   4.1200217 -3.8484992  0.0001188
Age           0.0217058   0.0087573  2.4786079  0.0131896
SexM         -0.4007260   0.3266459 -1.2267904  0.2199014
Tot_Bil       0.4276794   0.7432995  0.5753797  0.5650346
Dir_Bil       0.1718378   0.4851157  0.3542202  0.7231739
Alkphos       0.9463127   0.3856115  2.4540572  0.0141254
Alamine       0.9373542   0.3219296  2.9116743  0.0035950
Aspartate     0.1980968   0.2889246  0.6856348  0.4929434
Tot_Prot      5.3262285   2.3915094  2.2271409  0.0259379
Albumin      -1.4553451   0.7649599 -1.9025117  0.0571043
A_G_Ratio     1.8906917   1.2224601  1.5466286  0.1219528

Model Pseudo R-Square and Log-Likelihoods (from pscl package)

      llh    llhNull      G2  McFadden      r2ML      r2CU
-170.5744 -220.09894 99.0489 0.2250101 0.2348626 0.3375943
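These values can be reproduced with the pscl package; a sketch assuming the fitted model fit from above:

library(pscl)

# Log-likelihoods and pseudo R-squared measures for the fitted model
pR2(fit)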

Based on the above table, using the McFadden R-squared as a guide, it appears that this model
explains approximately 22.5% of the variation in disease classification.
Another estimate of utility is the Coefficient of Discrimination, a test metric which subtracts the
mean of all no-disease probabilities from the mean of all disease probabilities, based on the
predicted outcomes of the test data.

# Start by making a data frame of predictions

Test_Predictions <- data.frame(Probability = predict(fit, test, type = "response"))

# then add the predicted class, using a 50% probability cutoff

Test_Predictions$Prediction <- ifelse(Test_Predictions$Probability > 0.5, 1, 0)

# add the actual disease / non-disease classification to the set

Test_Predictions$Disease <- test$Disease

# accuracy is simply the mean of all trues (1) where prediction = reality

accuracy <- mean(Test_Predictions$Disease == Test_Predictions$Prediction, na.rm = TRUE)


# For the Coefficient of Discrimination, create one array of predicted
# probabilities where the true outcome is disease and one where it is not;
# the difference of their means gives a sense of the overall predictive power

disease <- Test_Predictions$Probability[which(Test_Predictions$Disease == 1)]

non <- Test_Predictions$Probability[which(Test_Predictions$Disease == 0)]

Coef_Desc <- mean(disease, na.rm = TRUE) - mean(non, na.rm = TRUE)

Coefficient of Discrimination : 0.1420897

Accuracy: 0.6889952

The accuracy is simply the proportion of times the model was right, which, at around 69%, seems
considerably higher than the model fit estimates would suggest.
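The accuracy can be broken down further with a confusion matrix; a sketch assuming the Test_Predictions data frame built above:

# Rows: predicted class; columns: actual disease label
table(Predicted = Test_Predictions$Prediction,
      Actual = Test_Predictions$Disease)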

plot(Test_Predictions$Probability, Test_Predictions$Disease, xlim = c(0.1, 1.1),
     xlab = "Probability", ylim = c(-0.1, 1.1), ylab = "Disease", col = "blue",
     pch = 18)
Discussion, comparison of methods
The logistic regression model rests on two conditions:

1. Each predictor X_i is linearly related to logit(p_i) when all other predictors are held constant.

Based on the following samples of x plotted against the predicted probabilities of the logistic
regression, it seems as though there may be some concerns with using these particular
regressors in a logistic model, even transformed.

train$predictions <- predict(fit, train, type = "response")

require(ggplot2)

ggplot(train, aes(x = Tot_Bil, y = predictions)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(train, aes(x = Dir_Bil, y = predictions)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(train, aes(x = Alkphos, y = predictions)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(train, aes(x = Albumin, y = predictions)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE)
2. Each outcome Y_i is independent of the other outcomes.

No mention of relationships between family members was included in the metadata for this set, so the
assumption is that all subjects are independent of each other.

Conclusion:
Given the estimated predictive power of the pseudo R-squared values, at around 22%, as well as the
model condition of a linear relationship between the predictors and the logit not being met, I would
not consider using this model as a diagnostic tool, or automating it.

However, given the accuracy of about 69% on the validation set, with further tuning of this model on
more validation data, it might be useful as a tool to suggest further definitive testing for liver
disease in patients whom the model flags as likely positive.

It might be that this type of model, while not diagnostic, is useful in early detection, like a
mammogram, which also is indicative of breast cancer, but not diagnostic.

If such a model is to be useful, then a similar method would need to be applied to a more racially
diverse data set with more even gender distributions.

Improvements
Although some variables were not in and of themselves statistically significant (meaning we cannot
tell whether their precise contribution to the disease diagnosis is reliable), removing them in any
order did nothing but lower the accuracy, the Coefficient of Discrimination, and the pseudo R-square
scores. This indicates that, for sheer ability to predict the likelihood of a patient having liver
disease from the included variables, the whole model is more accurate than an abridged logistic model.
