Airline Data Analysis
Airline Data
In this project, I analyze over 6 GB of airline data and answer some questions that I think would be important when looking at data similar to this.
# load() creates the winterDelays data frame; data only holds the name of the loaded object
data = load(url("http://eeyore.ucdavis.edu/stat141/Data/winterDelays.rda"))
Number 1
How many flights are there in the data set?
> nrow(winterDelays)
[1] 1961489
So there are 1,961,489 flights in the data set, because that is the number of rows in the set.
Number 2
Which airline has the most flights?
sort(table(winterDelays$UNIQUE_CARRIER))
    VX     HA     F9     YV     9E     AS     FL     B6
 17216  23468  23699  41305  44967  46737  61988  75816
    US     MQ     UA     AA     OO     DL     EV     WN
130453 145487 161857 172342 201171 227539 232481 354963
So WN (Southwest Airlines) has the most flights (354,963) in this data set.
Number 3
Compute the number of flights for each originating airport and airline
carrier. Show only the rows and columns for the 20 airports with the
largest number of flights, and the 10 airline carriers with the most
flights.
> tab = table(winterDelays$ORIGIN, winterDelays$UNIQUE_CARRIER)
> m = margin.table(tab, 1)
> ord = order(m, decreasing = TRUE)[1:20]
> m2 = margin.table(tab, 2)
> ord2 = order(m2, decreasing = TRUE)[1:10]
> tab[ord, ord2]
      WN    EV    DL   OO    AA    UA    MQ   US  B6    FL
ATL 3369 29393 63957  558  1656   260  1992 1734   0 17897
ORD    0 17570  1852 8740 15653 18113 25434 2319 479     0
This table shows the number of flights for the 20 airports with the largest number of flights (rows) and the 10 airline carriers with the most flights (columns). Only the first two rows, ATL and ORD, are reproduced here.
Number 4
Is the mean delay in November different from the mean delay in
December?
> mean(winterDelays[winterDelays$MONTH == 11, 'ARR_DELAY'], na.rm=TRUE)
[1] -0.1246967
> mean(winterDelays[winterDelays$MONTH == 12, 'ARR_DELAY'], na.rm=TRUE)
[1] 6.892993
Yes, the mean delays for November and December are different: the mean arrival delay for November is -0.125 minutes (flights arrive slightly early on average) and the mean arrival delay for December is 6.893 minutes.
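The two monthly means can also be computed in a single call by grouping on MONTH. The sketch below does this on a small made-up data frame (the `toy` values are illustrative and not taken from winterDelays), using the same tapply pattern that appears later in this report.

```r
# Toy stand-in for winterDelays; the values are made up for illustration.
toy = data.frame(
  MONTH     = c(11, 11, 11, 12, 12, 12),
  ARR_DELAY = c(-3, 1, NA, 4, 10, 7)
)

# Mean arrival delay per month, ignoring missing delays.
month_means = tapply(toy$ARR_DELAY, toy$MONTH, mean, na.rm = TRUE)
print(month_means)

# Difference between the December and November means.
diff_dec_nov = as.numeric(month_means["12"] - month_means["11"])
print(diff_dec_nov)
```

On the real data the same call would be tapply(winterDelays$ARR_DELAY, winterDelays$MONTH, mean, na.rm = TRUE), which also returns the January and February means.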
Number 5
Which is a better measure for characterizing the center of the
distribution for overall delay - mean, median or mode? Why?
hist(winterDelays$ARR_DELAY, main="Histogram of Delays", xlab="Delay time")
I believe that the median is the best measure because the distribution is not normal (it is skewed right). Since the data are so heavily skewed, the mean would not be a good measure of the typical delay, because the outliers pull the mean upward. With heavily skewed data, the median or mode is usually the better measure of the center; in this case I am going with the median.
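A small worked example may make the effect of skew concrete. The delays below are made up for illustration: most are near zero and a single flight is 300 minutes late.

```r
# Made-up right-skewed delays (minutes): mostly near zero, one very late flight.
delays = c(-5, -4, -3, -2, 0, 2, 5, 300)

mean_delay   = mean(delays)    # pulled upward by the single 300-minute delay
median_delay = median(delays)  # middle of the sorted values, not affected by it

print(c(mean = mean_delay, median = median_delay))
```

Here the mean is 36.625 minutes even though seven of the eight flights were at most 5 minutes late, while the median is -1 minute.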
Number 6
What is the mean and standard deviation of the arrival delays for all
United airlines (UA) flights on the weekend out of SFO?
> delays = subset(winterDelays, UNIQUE_CARRIER == 'UA' & ORIGIN == 'SFO' & DAY_OF_WEEK %in% c(6,7), ARR_DELAY)
> mean(delays$ARR_DELAY, na.rm=TRUE)
[1] 0.9390957
So the mean arrival delay for United Airlines weekend flights out of SFO is about 1 minute.
> sd(delays$ARR_DELAY, na.rm=TRUE)
[1] 36.76253
So the standard deviation of the arrival delay for these flights is about 37 minutes.
Number 7
Plot the distributions of overall delay for each month. What is the best
way to display this?
Delay1 = subset(winterDelays, winterDelays$MONTH == "1")
Delay2 = subset(winterDelays, winterDelays$MONTH == "2")
Delay3 = subset(winterDelays, winterDelays$MONTH == "11")
Delay4 = subset(winterDelays, winterDelays$MONTH == "12")
This plot shows the distributions of delay for each month, including all the outliers. As one can see, this graph is hard to read because of all the outliers, but it shows the true data and distributions with no manipulation.
boxplot(list(Delay1$ARR_DELAY, Delay2$ARR_DELAY, Delay3$ARR_DELAY, Delay4$ARR_DELAY), xaxt='n', xlab="MONTH", ylab="Arrival Delay (minutes)", outline=FALSE)
axis(1, at=1:4, labels=c("Jan", "Feb", "Nov", "Dec"))
axis(2)
This plot shows the distributions of delay for each month without the outliers (which hides part of the data), but it is much clearer and easier to read.
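The four per-month subsets can be avoided with the formula interface of boxplot. The sketch below is one possible way to do that, run on a made-up data frame (`toy` is illustrative only); the factor levels are set by hand so the boxes appear in season order rather than numeric order.

```r
# Toy stand-in for winterDelays; values are made up for illustration.
toy = data.frame(
  MONTH     = c(1, 1, 1, 2, 2, 2, 11, 11, 11, 12, 12, 12),
  ARR_DELAY = c(-5, 0, 5, -4, 1, 6, -6, -1, 4, 0, 8, 16)
)

# Put the months in the order of the winter season and label them.
toy$MONTH = factor(toy$MONTH, levels = c(11, 12, 1, 2),
                   labels = c("Nov", "Dec", "Jan", "Feb"))

# The formula interface gives one box per month without separate subsets.
# plot = FALSE returns the box statistics instead of drawing the plot.
b = boxplot(ARR_DELAY ~ MONTH, data = toy, outline = FALSE, plot = FALSE)
print(b$names)
print(b$stats[3, ])  # row 3 of stats holds the medians
```

Dropping plot = FALSE draws the plot itself, and the month labels then come from the factor levels, so no separate axis() call is needed.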
Number 8
Display the number of flights for each airport on a single plot so we
can quickly compare them.
flight = table(winterDelays$ORIGIN)
dotchart(flight, cex=.3, main="Number of Flights per Airport", xlab="Number of Flights", ylab="Airport")
## Warning: 'x' is neither a vector nor a matrix: using as.numeric(x)
Above is the number of flights for each airport on a single plot. The Y axis is impossible to read because I have plotted all airports, but the plot could easily be made more readable by plotting only a few airports on each of several plots.
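One way to get a readable version is to sort the counts and keep only the busiest airports. The sketch below uses made-up airport codes, and the cut-off `n` is a hypothetical choice, not something used in the report.

```r
# Made-up origin codes standing in for winterDelays$ORIGIN.
origin = c("ATL", "ORD", "ATL", "SFO", "ATL", "ORD", "LAX", "SFO", "ATL", "ORD")

# Count flights per airport and keep the n busiest.
n = 3  # hypothetical cut-off
top = sort(table(origin), decreasing = TRUE)[1:n]
print(top)

# c() drops the table class and leaves a named vector, which dotchart()
# accepts without the as.numeric(x) warning.
top_vec = c(top)
# dotchart(top_vec, main = "Number of Flights per Airport", xlab = "Number of Flights")
```

The dotchart call is left commented out so the sketch runs without a graphics device.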
Number 9
Are there many more flights on weekdays relative to Saturday and
Sunday?
> compare = table(winterDelays$DAY_OF_WEEK)
> compare
     1      2      3      4      5      6      7
286929 275181 288557 308239 292745 235456 274382
# With 1 being Monday, 2 being Tuesday, 3 being Wednesday, 4 being Thursday, 5 being Friday, 6 being Saturday and 7 being Sunday.
It appears that there are more flights on weekdays than on Saturday and Sunday, especially Saturday, since it has the lowest number of flights (235,456). Sunday (274,382) is close to Tuesday (275,181), so most of the difference comes from Saturday.
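Because there are five weekdays and only two weekend days, comparing the average number of flights per day is a fairer comparison than comparing totals. The sketch below does this with the counts copied from the table above.

```r
# Flights per day of week, copied from the table above (1 = Monday ... 7 = Sunday).
flights = c("1" = 286929, "2" = 275181, "3" = 288557, "4" = 308239,
            "5" = 292745, "6" = 235456, "7" = 274382)

weekday_avg = mean(flights[as.character(1:5)])  # average per weekday
weekend_avg = mean(flights[c("6", "7")])        # average per weekend day

print(c(weekday = weekday_avg, weekend = weekend_avg))
print(weekday_avg / weekend_avg)  # ratio of the two averages
```

This gives about 290,330 flights per weekday against 254,919 per weekend day, so an average weekday has roughly 14% more flights.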
Number 10
What day of the week has the most number of delayed flights?
> table(winterDelays$DAY_OF_WEEK, winterDelays$ARR_DELAY > 0)
   FALSE   TRUE
1 176174 106163
2 175396  94570
3 177613 104861
4 186625 115142
5 178447 108652
6 155173  76932
7 172365  97596
Here, the TRUE column shows the number of delayed flights (ARR_DELAY > 0). The rows are the day of the week (1 being Monday, 2 being Tuesday and so on). We can see that Thursday has the most delayed flights (115,142).
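Thursday also has the most flights overall, so it is worth checking the share of flights delayed on each day and not only the raw count. The sketch below rebuilds the table above as a matrix and uses prop.table for the row-wise proportions.

```r
# Counts copied from the table above: rows are days (1 = Monday ... 7 = Sunday),
# columns are ARR_DELAY > 0 being FALSE or TRUE.
counts = matrix(
  c(176174, 106163,
    175396,  94570,
    177613, 104861,
    186625, 115142,
    178447, 108652,
    155173,  76932,
    172365,  97596),
  ncol = 2, byrow = TRUE,
  dimnames = list(as.character(1:7), c("FALSE", "TRUE"))
)

# Row-wise proportions: the share of each day's flights that arrived late.
share_late = prop.table(counts, margin = 1)[, "TRUE"]
print(round(share_late, 3))
print(names(which.max(share_late)))
```

Thursday still comes out highest (about 38% of its flights arrive late) and Saturday lowest (about 33%), so the conclusion does not change.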
Number 11
What day of the week has the largest median overall delay? 90th
quantile for overall delay?
> daymedian = with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK), median, na.rm=TRUE))
> daymedian
 1  2  3  4  5  6  7
-4 -5 -5 -4 -4 -6 -5
There is a tie between Monday, Thursday and Friday for the largest median overall delay, at -4 minutes. It is important to remember that this is the median overall delay and not the mean overall delay.
> with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK), quantile, probs=.9, na.rm=TRUE))
 1  2  3  4  5  6  7
32 27 34 32 32 26 31
Wednesday has the highest 90th quantile for overall delay, at 34 minutes.
Number 12
Consider the 10 airports with the most number of flights. For this
subset of the data, which routes (origin-destination pair) have the
worst median delay.
# The Professor said to acknowledge when we had outside help. On this problem I was directed by Nick Ulle.
tt = table(winterDelays$ORIGIN)
mtt = margin.table(tt, 1)
order.mtt = order(mtt, decreasing=TRUE)[1:10]
# This will show the top ten airports and their corresponding number of flights.
tt[order.mtt]
   Group.1 Group.2    x
25     ORD     DEN  0.0
49     DFW     LAS  0.0
58     DFW     LAX  0.0
24     LAX     DEN -1.0
57     DEN     LAX -1.0
59     IAH     LAX -1.0
12     DFW     CLT -2.0
16     ORD     CLT -2.0
Here I have chosen to show only the first part of the table to save paper. The full output was very long and showed all of the routes with the median delay for each. The routes shown are the ones with the largest median delay for the top ten airports, with Group.1 as the origin and Group.2 as the destination. As one can see, ORD to DEN, DFW to LAS and DFW to LAX have the largest median delay. This delay, though, is zero minutes, because at least half of the flights on these routes arrive on time or early. (Again, we are looking at the median here, not the mean.)
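The code that produced the route table is not shown above. The column names Group.1, Group.2 and x are what aggregate() produces when it is given an unnamed list of grouping variables, so a minimal sketch of that step, assuming aggregate() was used and running on a made-up data frame, could look like this:

```r
# Toy stand-in for the top-ten-airport subset; values are made up.
toy = data.frame(
  ORIGIN    = c("ORD", "ORD", "ORD", "DFW", "DFW", "DFW", "LAX", "LAX", "LAX"),
  DEST      = c("DEN", "DEN", "DEN", "LAX", "LAX", "LAX", "DEN", "DEN", "DEN"),
  ARR_DELAY = c(-2, 0, 9, -6, -3, NA, -4, -1, 15)
)

# Median arrival delay per origin-destination pair.
# aggregate() names the grouping columns Group.1, Group.2 and the result x.
routes = aggregate(toy$ARR_DELAY, list(toy$ORIGIN, toy$DEST),
                   median, na.rm = TRUE)

# Sort so the routes with the worst (largest) median delay come first.
routes = routes[order(routes$x, decreasing = TRUE), ]
print(routes)
```

On the real data the input would first be restricted to the ten busiest airports, for example with %in% on the airport names selected by order.mtt.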
Number 13
Graphically display any relationship between the distance of the flight
(between the two origin and destination) airports and the overall
delay. Interpret the display.
smoothScatter(winterDelays$ARR_DELAY, winterDelays$DISTANCE, xlab="Delay", ylab="Distance(miles)")
## Warning: Binning grid too coarse for current (small) bandwidth: consider
## increasing 'gridsize'
It is hard to tell from this plot, because there are many more short flights than long ones, but it is probably safe to say that there is not much of a difference in delay times between shorter and longer flights. That being said, there are a few more extreme outliers among the shorter flights than among the longer ones, but this is likely because the vast majority of flights are short. I used smoothScatter on this problem, as the professor had recommended, because it shows the graph in a much neater fashion and it is easier to read.
Number 14
What are the worst hours to fly in terms of experiencing delays?
> tapply(winterDelays$ARR_DELAY, winterDelays$ARR_TIME_BLK, mean, na.rm=TRUE)
0001-0559 0600-0659 0700-0759 0800-0859 0900-0959 1000-1059 1100-1159 1200-1259 1300-1359 1400-1459
   2.5888   -1.4700   -1.3306   -0.9237   -0.4361   -0.2953    0.2934    1.3193    1.7777    3.3557
1500-1559 1600-1659 1700-1759 1800-1859 1900-1959 2000-2059 2100-2159 2200-2259 2300-2359
   3.8163    4.7015    5.9254    6.1220    6.5939    6.3258    6.5707    6.5587    5.1105
The times above are stated in military (24-hour) time, as this is what the airlines use as a way of keeping time. The worst hours to fly, in terms of experiencing delays, are between 6:00 and 11:00 p.m. (the 1800-1859 through 2200-2259 blocks), where the mean delay is a little over 6 minutes. The absolute worst hour to fly is between 7:00 and 8:00 p.m. (6.59 minutes).
Number 15
Are the delays worse on December 25th than other days?
Thanksgiving? Provide evidence to support your conclusions.
> tapply(winterDelays$ARR_DELAY, list(winterDelays$MONTH, winterDelays$DAY_OF_MONTH), mean, na.rm=TRUE)
            1          2          3         4          5         6          7          8         9        10
1   6.8945763  5.9187169  4.6276680 -1.253096  0.7645577  1.391044 -3.5615892 -0.6202718 -1.571971 -1.633439
2   3.7956690  3.0728452 -0.7300243 10.649105 -2.2401363 -2.808513  4.9996482  2.7355633 -2.238645 10.826545
11  1.4928729 -0.6600911 -4.6630724 -2.244804 -2.6255178 -2.084708  0.1688204  0.8479014  1.293546 -2.655190
12 -0.7579075  2.4712065 -2.2010932 -3.697318 -3.3443914 -4.176649 -0.5729222 -2.8824730  5.661869 16.335838
          11        12        13         14        15        16         17        18        19         20
1   2.401867 -1.844984  7.201963  2.5351881 -1.541786 6.9567796  0.1099049 -0.705212 -6.126114  1.2997991
2  11.548020 -2.928880 -1.979298 -0.3131504 -1.050239 1.4220683  0.1417358  0.942776  6.703037  2.4348809
11  3.966439 10.177729 -1.115996 -2.1576971  2.156633 0.9675679 -2.9009901 -3.221705 -2.974814  0.1846513
12 -1.402158 -3.225194 -2.745070 -0.9295866 -0.358311 8.5672392  9.6133976  3.798906 10.656995 18.5276957
          21         22        23        24        25         26        27         28        29         30
1   2.112755 -0.3910833  1.182579  6.407166 13.318865  0.1107449  5.841744  5.5110657  5.766958 19.3347537
2  10.141076 12.1875036  5.183197  2.091021  3.569462  9.8617102  8.933590  0.7131929        NA         NA
11  7.154388 -8.8592179 -6.118507 -2.850553  1.918476  3.0102443  2.688409 -1.8931600 -1.034617  0.4279986
12 25.894130 11.0176950  8.934984  2.568117 14.125010 33.2713518 22.885562 13.9638004 16.154251  8.5911889
In this table the rows are months and the columns are days of the month. The mean arrival delay on December 25 is about 14.1 minutes, which is higher than on most days in the table but lower than on several other late-December days, such as December 21 (25.9), December 26 (33.3) and December 27 (22.9).
Number 16
colSums(is.na(winterDelays))
YEAR  QUARTER  MONTH  DAY_OF_MONTH  DAY_OF_WEEK
   0        0      0             0            0
FL_DATE  UNIQUE_CARRIER  AIRLINE_ID  CARRIER  TAIL_NUM
      0               0           0        0         0
FL_NUM  ORIGIN_AIRPORT_ID  ORIGIN_AIRPORT_SEQ_ID  ORIGIN_CITY_MARKET_ID  ORIGIN
     0                  0                      0                      0       0
ORIGIN_CITY_NAME  ORIGIN_STATE_ABR  ORIGIN_STATE_FIPS  ORIGIN_STATE_NM  ORIGIN_WAC
               0                 0                  0                0           0
DEST_AIRPORT_ID  DEST_AIRPORT_SEQ_ID  DEST_CITY_MARKET_ID  DEST  DEST_CITY_NAME
              0                    0                    0     0               0
DEST_STATE_ABR  DEST_STATE_FIPS  DEST_STATE_NM  DEST_WAC  CRS_DEP_TIME
             0                0              0         0             0
DEP_TIME  DEP_DELAY  DEP_DELAY_NEW  DEP_DEL15  DEP_DELAY_GROUP
   30721      30721          30721      30721            30721
DEP_TIME_BLK  TAXI_OUT  WHEELS_OFF  WHEELS_ON  TAXI_IN
           0     31540       31540      32950    32950
CRS_ARR_TIME  ARR_TIME  ARR_DELAY  ARR_DELAY_NEW  ARR_DEL15
           0     32950      35780          35780      35780
ARR_DELAY_GROUP  ARR_TIME_BLK  CANCELLED  CANCELLATION_CODE  DIVERTED
          35780             0          0                  0         0
CRS_ELAPSED_TIME  ACTUAL_ELAPSED_TIME  AIR_TIME  FLIGHTS  DISTANCE
               0                35780     35780        0         0
DISTANCE_GROUP  CARRIER_DELAY  WEATHER_DELAY  NAS_DELAY  SECURITY_DELAY
             0        1619153        1619153    1619153         1619153
LATE_AIRCRAFT_DELAY  FIRST_DEP_TIME  TOTAL_ADD_GTIME  LONGEST_ADD_GTIME  DIV_AIRPORT_LANDINGS
            1619153         1950485          1950485            1950485                     0
DIV_REACHED_DEST  DIV_ACTUAL_ELAPSED_TIME  DIV_ARR_DELAY  DIV_DISTANCE  DIV1_AIRPORT
         1957674                  1958659        1958659       1957755             0
DIV1_AIRPORT_ID  DIV1_AIRPORT_SEQ_ID  DIV1_WHEELS_ON  DIV1_TOTAL_GTIME  DIV1_LONGEST_GTIME
        1957250              1957250         1957249           1957249             1957249
DIV1_WHEELS_OFF  DIV1_TAIL_NUM  DIV2_AIRPORT  DIV2_AIRPORT_ID  DIV2_AIRPORT_SEQ_ID
        1958597              0             0          1961419              1961419
DIV2_WHEELS_ON  DIV2_TOTAL_GTIME  DIV2_LONGEST_GTIME  DIV2_WHEELS_OFF  DIV2_TAIL_NUM
       1961419           1961419             1961419          1961479              0
DIV3_AIRPORT  DIV3_AIRPORT_ID  DIV3_AIRPORT_SEQ_ID  DIV3_WHEELS_ON  DIV3_TOTAL_GTIME
           0          1961487              1961487         1961487           1961487
DIV3_LONGEST_GTIME  DIV3_WHEELS_OFF  DIV3_TAIL_NUM  DIV4_AIRPORT  DIV4_AIRPORT_ID
           1961487          1961489        1961489       1961489          1961489
DIV4_AIRPORT_SEQ_ID  DIV4_WHEELS_ON  DIV4_TOTAL_GTIME  DIV4_LONGEST_GTIME  DIV4_WHEELS_OFF
            1961489         1961489           1961489             1961489          1961489
DIV4_TAIL_NUM  DIV5_AIRPORT  DIV5_AIRPORT_ID  DIV5_AIRPORT_SEQ_ID  DIV5_WHEELS_ON
      1961489       1961489          1961489              1961489         1961489
DIV5_TOTAL_GTIME  DIV5_LONGEST_GTIME  DIV5_WHEELS_OFF  DIV5_TAIL_NUM        X
         1961489             1961489          1961489        1961489  1961489
The table above shows the number of missing values (NAs) for each variable in the data set.
Number 17
Each of the variables DEP_TIME, DEP_DELAY, DEP_DELAY_NEW have the
same number of missing values. Do these missing values correspond to the
same records for each of these variables?
> length(which(is.na(winterDelays$DEP_TIME)))
[1] 30721
> length(which(is.na(winterDelays$DEP_DELAY)))
[1] 30721
> length(which(is.na(winterDelays$DEP_DELAY_NEW)))
[1] 30721
The output above shows that each of the three variables has exactly the same number of missing values.
> identical(which(is.na(winterDelays$DEP_DELAY_NEW)), which(is.na(winterDelays$DEP_TIME)), which(is.na(winterDelays$DEP_DELAY)))
[1] TRUE
The output is TRUE, so the missing values correspond to the same records. Note that identical() compares only its first two arguments (a third positional argument is taken as num.eq), so this call confirms the match between DEP_DELAY_NEW and DEP_TIME; DEP_DELAY needs its own identical() comparison against one of them.
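Since identical() takes two objects at a time, checking three or more columns needs one comparison per extra column. The sketch below shows one way to do that on a made-up data frame (`toy` and its values are illustrative only).

```r
# Toy stand-in for winterDelays; values are made up for illustration.
toy = data.frame(
  DEP_TIME      = c(900, NA, 1230, NA, 1745),
  DEP_DELAY     = c(-2,  NA,    5, NA,   30),
  DEP_DELAY_NEW = c( 0,  NA,    5, NA,   30)
)

cols = c("DEP_TIME", "DEP_DELAY", "DEP_DELAY_NEW")

# Row positions of the missing values in each column.
na_pos = lapply(toy[cols], function(v) which(is.na(v)))

# Compare every column against the first one and require all
# comparisons to hold.
same_records = all(vapply(na_pos[-1], identical, logical(1), na_pos[[1]]))
print(same_records)
```

The same pattern works for any number of columns by extending `cols`.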
Number 18
Does the distribution of delays depend on the time of day? Provide
evidence for your conclusion.
library("lattice", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library")
bwplot(ARR_DELAY ~ ARR_TIME_BLK, data = winterDelays, scales=list(rot=45), ylim=c(-80,80), main="Delays by Time of Day", xlab="Time of Day", ylab="Delay (in minutes)")
The median delay increases as the day goes on, as this plot shows. The spread of the delays increases as the day goes on as well. This is likely due to some of the flights being dependent on the ones before them, as there are only a certain number of gates at each airport. So, yes, the distribution of delays does depend on the time of day.
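Number 19
What proportion of flights took off late?
nrow(winterDelays[winterDelays$DEP_DELAY > 0, ]) / nrow(winterDelays)
[1] 0.3814918
About 38% of flights took off late by this count. Because rows with a missing DEP_DELAY are kept by the logical index, the count also includes the 30,721 flights with no recorded departure delay.
Number 20
What proportion of flights arrived late? What proportion arrived early?
> nrow(winterDelays[winterDelays$ARR_DELAY > 0, ]) / nrow(winterDelays)
[1] 0.3771094
About 38% of flights arrived late by this count, which likewise includes the 35,780 flights with a missing ARR_DELAY. In the Number 10 table, 703,916 flights have ARR_DELAY > 0, which is about 35.9% of all 1,961,489 flights.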
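Indexing a data frame with a logical vector that contains NA keeps one all-NA row per NA, so nrow() on the result counts flights with a missing delay as well. The sketch below shows the difference on a made-up data frame (`toy` and its values are illustrative only).

```r
# Toy stand-in for winterDelays; values are made up for illustration.
toy = data.frame(
  FL_NUM    = 1:6,
  ARR_DELAY = c(5, -3, NA, 10, 0, NA)
)

# Logical indexing keeps a row of NAs for every NA in the condition,
# so missing delays are counted as if they were late.
n_indexed = nrow(toy[toy$ARR_DELAY > 0, ])

# sum() with na.rm = TRUE counts only the flights known to be late.
n_late = sum(toy$ARR_DELAY > 0, na.rm = TRUE)

print(c(indexed = n_indexed, late = n_late))

# Proportion late among all flights, and among flights with a recorded delay.
prop_all   = n_late / nrow(toy)
prop_known = mean(toy$ARR_DELAY > 0, na.rm = TRUE)
print(c(all = prop_all, known = prop_known))
```

Which denominator to use depends on whether flights without a recorded delay should count as flights that were not late or be left out altogether.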
Number 21
What proportion of flights that took off late also arrived late?
depart=nrow(subset(winterDelays,DEP_DELAY>0))
arrive=nrow(subset(winterDelays,ARR_DELAY>0))
depart/(arrive+depart)
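So this ratio is about 51%. Note that it divides the number of late departures by the sum of late departures and late arrivals; the proportion of late departures that also arrived late would instead be nrow(subset(winterDelays, DEP_DELAY>0 & ARR_DELAY>0))/depart.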
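The question asks for a conditional proportion: among flights that took off late, the share that also arrived late. A minimal sketch of that calculation on a made-up data frame (`toy` and its values are illustrative only) is:

```r
# Toy stand-in for winterDelays; values are made up for illustration.
toy = data.frame(
  DEP_DELAY = c(10, 5, 20, -3, 0, 15, NA),
  ARR_DELAY = c( 4, -2, 25, 6, -5, NA,  3)
)

# Flights that took off late and have a recorded arrival delay.
late_dep = subset(toy, DEP_DELAY > 0 & !is.na(ARR_DELAY))

# Share of those flights that also arrived late.
prop_both = mean(late_dep$ARR_DELAY > 0)
print(prop_both)
```

Here three flights left late with a known arrival delay and two of them arrived late, so the proportion is 2/3. Dropping flights with a missing ARR_DELAY from the denominator is an assumption of this sketch.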
Number 22
Do planes leaving late tend to make up time?
depart2 = subset(winterDelays, DEP_DELAY > 0)
nrow(subset(depart2, (DEP_DELAY - ARR_DELAY) > 0))
[1] 497560
nrow(subset(depart2, (DEP_DELAY - ARR_DELAY) < 0))
[1] 196348
The numbers above tell us that flights that leave late do tend to make up time. A flight that leaves late would be expected to arrive late by the same amount of time or more, but most of the flights that left late (497,560, against 196,348 that lost further time) arrived with a smaller delay than they departed with, meaning they made up time. A very impressive result in my opinion.
Number 23
Do flights that take off late fly faster to make up time?
tapply(winterDelays$DISTANCE/winterDelays$AIR_TIME, winterDelays$ARR_DELAY <= 0, mean)
   FALSE     TRUE
6.471606 6.732395
The TRUE column is the mean speed (in miles/minute) for flights that arrived on time or early, and the FALSE column is for flights that arrived late. Note that the grouping here is by arrival delay, not by departure delay. So the flights that are late are not flying faster than those that are on time.
Number 24
For flights originating from SFO, what are the 5 most popular
destination airports?
org.SFO = subset(winterDelays, ORIGIN == "SFO")
SF.tab = table(org.SFO$DEST)
m.SFO = margin.table(SF.tab, 1)
SF.ORDER = order(m.SFO, decreasing=TRUE)[1:5]
SF.tab[SF.ORDER]
 LAX  LAS  JFK  SAN  ORD
5292 2732 2634 2508 2434
So here we can see that the top 5 destination airports from SFO are LAX, LAS, JFK, SAN, and ORD, in that order.
Number 25
For flights originating from SFO, compute the distance to the 5 most
popular destination airports. How did you do this? Can we do this for
all pairs of airports?
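tapply(winterDelays$DISTANCE, winterDelays$ORIGIN == "SFO" & winterDelays$DEST == "ORD", mean)
    FALSE      TRUE
 760.3089 1846.0000
The TRUE entry is the distance from SFO to ORD, 1,846 miles; the FALSE entry is the mean distance of all other flights. Yes, this can be done for any pair of airports in the data set with a simple substitution of the origin and destination airports.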
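Instead of substituting one destination at a time, the distances to several destinations can be obtained in one call by grouping on DEST. The sketch below runs on a made-up data frame: the SFO to ORD distance is the 1846 miles reported above, while the other distances and the `top_dest` list are placeholders for illustration.

```r
# Toy stand-in for winterDelays. Only the SFO-ORD distance (1846) comes
# from the report; the other distances are placeholders.
toy = data.frame(
  ORIGIN   = c("SFO", "SFO", "SFO", "SFO", "SFO", "LAX"),
  DEST     = c("ORD", "ORD", "LAX", "LAX", "JFK", "SFO"),
  DISTANCE = c(1846, 1846, 300, 300, 2500, 300)
)

top_dest = c("ORD", "LAX", "JFK")  # hypothetical list of destinations

# Keep SFO departures to the chosen destinations, then take the mean
# distance per destination (assuming every flight on a route reports
# the same distance).
sfo = subset(toy, ORIGIN == "SFO" & DEST %in% top_dest)
dist_by_dest = tapply(sfo$DISTANCE, as.character(sfo$DEST), mean)
print(dist_by_dest)
```

Grouping on both ORIGIN and DEST in the same way would give the distance for every pair of airports in the data.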
Number 26
For flights from SFO to these 5 most popular destinations, compute
and display the distribution of the average speed of the airplane on
each flight?
subset = winterDelays[winterDelays$ORIGIN == "SFO" & as.character(winterDelays$DEST) %in% c("ORD", "SAN", "LAX", "JFK", "LAS"), ]
subset[, "DEST"] = as.character(subset[, "DEST"])
# speed is not defined in the original listing; distance over air time (miles/minute) is assumed here
speed = subset$DISTANCE / subset$AIR_TIME
splt = split(speed, list(subset$DEST))
boxplot(splt)
The plot above shows the distribution of the average speed of the airplanes on each flight from SFO to the top 5 destination airports.
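As a small illustration of the speed calculation, the sketch below computes the average speed of each flight in miles per hour and groups it by destination. The data frame is made up, and it assumes DISTANCE is in miles and AIR_TIME is in minutes.

```r
# Toy stand-in for the SFO subset; values are made up for illustration.
toy = data.frame(
  DEST     = c("LAX", "LAX", "JFK", "JFK"),
  DISTANCE = c(300, 300, 2500, 2500),  # miles (assumed unit)
  AIR_TIME = c(60, 50, 300, 250)       # minutes (assumed unit)
)

# Average speed per flight, converted from miles/minute to miles/hour.
toy$speed_mph = toy$DISTANCE / toy$AIR_TIME * 60

# One vector of speeds per destination, ready for boxplot().
by_dest = split(toy$speed_mph, toy$DEST)
med_speed = sapply(by_dest, median)
print(med_speed)
# boxplot(by_dest, ylab = "Average speed (mph)")
```

The boxplot call is left commented out so the sketch runs without a graphics device.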
Number 27
Are the distributions of delays for "commuter" flights between SFO and
LAX similar to those for SFO to JFK, EWR?
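delay = (winterDelays$ARR_DELAY)
splt2 = split(delay, list(subset$DEST))
## Warning: data length is not a multiple of split variable
boxplot(splt2, outline=FALSE)
The above plot shows that the distributions of delays for flights from SFO to LAS and LAX are similar to those for flights to JFK, despite the JFK flights covering a much greater distance. This plot does not show the outliers like the one below, but it gives us a better view of the distributions. Note that the warning arises because delay covers every flight while subset$DEST covers only the SFO flights, so the destination labels are recycled across all flights and are not matched to the SFO flights; splitting subset$ARR_DELAY instead would align the two, and the comparison should be read with that caveat.
boxplot(splt2)
This is the same box plot as the one above, but it includes the outliers and still shows that the distributions are similar.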
Number 28
What other questions could we address with this data?
We could ask which days of the winter months are the best days to fly in terms of delay time. We could also find which airline carrier had the highest average delay time for the winter months.
Number 29
What other data/variables would allow us to address additional
interesting questions?
If we had data on the weather conditions for each flight, we could find what percentage of the delays were a result of weather conditions. We could also gather data on passengers and see if flights with more passengers had a higher average delay time. We could gather delay data from the entire year to find which are the worst and best times to fly in terms of delay.