Airline Data Analysis

This document analyzes over 6GB of airline data to answer several questions. It finds that there are 1,961,489 flights in the data set. Southwest Airlines (WN) has the most flights at 354,963. It also shows the top 20 airports and 10 airlines by number of flights, with Atlanta (ATL) having the most flights. Finally, it compares the mean arrival delay in November and December, finding they are different.

Uploaded by

Austin Kinion
Copyright
© All Rights Reserved

Analyzing Airline Data
In this project, I will be analyzing over 6 GB of airline data and answering
some questions that I think would be important when looking at data similar to this.
data=load(url("http://eeyore.ucdavis.edu/stat141/Data/winterDelays.rda"))

Number 1
How many flights are there in the data set?
>nrow(winterDelays)

1961489

So there are 1,961,489 flights in the data set because that is the number of rows in
the set.
Number 2
Which airline has the most flights?
sort(table(winterDelays$UNIQUE_CARRIER))


    VX     HA     F9     YV     9E     AS     FL     B6     US     MQ
 17216  23468  23699  41305  44967  46737  61988  75816 130453 145487
    UA     AA     OO     DL     EV     WN
161857 172342 201171 227539 232481 354963

So WN (Southwest Airlines) has the most flights (354,963) in this data set.
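The busiest carrier can also be pulled out of the frequency table directly instead of being read off the sorted output. The following is a minimal sketch; the vector `carriers` is made-up toy data standing in for winterDelays$UNIQUE_CARRIER.

```r
# Toy stand-in for winterDelays$UNIQUE_CARRIER (hypothetical values)
carriers <- c("WN", "DL", "WN", "AA", "DL", "WN", "UA")

counts <- table(carriers)                 # flights per carrier
top_carrier <- names(which.max(counts))   # label of the largest count
top_count <- max(counts)                  # the count itself

top_carrier
top_count
```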
Number 3
Compute the number of flights for each originating airport and airline
carrier. Show only the rows and columns for the 20 airports with the
largest number of flights, and the 10 airline carriers with the most
flights.
>tab = table(winterDelays$ORIGIN, winterDelays$UNIQUE_CARRIER)
>m = margin.table(tab,1)
>ord = order(m, decreasing = TRUE)[1:20]
>m2 = margin.table(tab,2)
>ord2 = order(m2,decreasing = TRUE)[1:10]
>tab[ord,ord2]

        WN    EV    DL    OO    AA    UA    MQ    US    B6    FL
ATL   3369 29393 63957   558  1656   260  1992  1734     0 17897
ORD      0 17570  1852  8740 15653 18113 25434  2319   479     0
DFW      0  1633  1638  2005 50114  1326 29051  2163   337     0
DEN  17944  4975  2094 15603  1636 14383   756  1498   324   285
LAX  11797     0  6104 19803  9582  9268  2756  1921  1032   293
IAH      0 25998   772  6463  1363 21311   793  1809     0     0
PHX  18698    15  2249  7006  1906  2065    56 18814   229    64
SFO   5063     0  2490 16838  3457 14453     0  1658  1276   295
CLT      0  1732  1600   193   664    99  1713 28284   479   644
LAS  23477     0  3731  2446  3115  3931     0  2081  1152   421
DTW   1874  7449 14912  1104   743   231  1561  1098     0   676
EWR   2045 15783  1312     0  1136 14783   848  1413  2232     0
MSP   2334  2350 16429  8173  1037   882  1094  1334     0   482
MCO   9471    50  5465     0  3168  3869     0  2942  6148  5506
SLC   3572   123  9700 18149   579   343   492   664   360     0
JFK      0   426  6375     0  4674  1429  2279   901 13363     0
BOS   2175   860  3061     0  3376  3697     0  5693 11041  1172
BWI  18762   920  2167     0   934  1081   574  1510   491  4518
LGA   1834   876  7885     1  4918  2446  5571  3985  2039  1284
SEA   3442     0  2636  1863  1523  2980     0  1022   451     0

This table shows the top 20 airports with the largest number of flights, and the 10
airline carriers with the most flights.
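The row and column selection used here can be wrapped in a small helper. This is only a sketch on made-up toy data; the function name `top_block` and the vectors `origin` and `carrier` are hypothetical and not part of the original analysis.

```r
# Hypothetical toy data: origin airport and carrier for eight flights
origin  <- c("ATL", "ATL", "ORD", "ATL", "SFO", "ORD", "LAX", "ATL")
carrier <- c("DL",  "DL",  "UA",  "DL",  "UA",  "AA",  "WN",  "DL")

tab <- table(origin, carrier)

# Keep the n_row busiest rows and the n_col busiest columns of a two-way table
top_block <- function(tab, n_row, n_col) {
  rows <- order(rowSums(tab), decreasing = TRUE)[seq_len(min(n_row, nrow(tab)))]
  cols <- order(colSums(tab), decreasing = TRUE)[seq_len(min(n_col, ncol(tab)))]
  tab[rows, cols, drop = FALSE]
}

small <- top_block(tab, n_row = 2, n_col = 2)
small
```

On the real data the equivalent call would be top_block(tab, 20, 10), assuming tab is built from ORIGIN and UNIQUE_CARRIER as in the commands for this question.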
Number 4
Is the mean delay in November different from the mean delay in
December?
>mean(winterDelays[winterDelays$MONTH == 11,'ARR_DELAY'], na.rm=TRUE)
[1] -0.1246967
> mean(winterDelays[winterDelays$MONTH == 12,'ARR_DELAY'], na.rm=TRUE)
[1] 6.892993

Yes, the mean delays for November and December are different: the mean arrival
delay for November is -0.125 minutes and the mean arrival delay for December is
6.893 minutes.
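Comparing two sample means by eye does not say whether the gap is larger than chance variation would produce. One common way to check is Welch's two-sample t-test, sketched below on made-up numbers; the vectors `nov` and `dec` are hypothetical stand-ins for the November and December ARR_DELAY values, and this test was not part of the original analysis.

```r
# Hypothetical arrival delays in minutes (toy data, not from winterDelays)
nov <- c(-5, -3, 0, 2, 4)
dec <- c(3, 6, 8, 10, 12)

# t.test() performs Welch's unequal-variance test by default
res <- t.test(dec, nov)

mean_gap <- unname(res$estimate[1] - res$estimate[2])  # mean(dec) - mean(nov)
mean_gap
res$p.value
```

With close to two million flights in the real data, even a small difference in means would be expected to come out as statistically significant, so the size of the gap matters more than the p-value.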







Number 5
Which is a better measure for characterizing the center of the
distribution for overall delay - mean, median or mode? Why?
hist(winterDelays$ARR_DELAY, main="Histogram of Delays", xlab= "Delay time")

I believe that the median is the best measure because the distribution is not
normal (it is skewed right). Since the data are so heavily skewed, the mean would
not be a good measure of the typical delay, because the outliers pull it heavily
to the right. Usually, with heavily skewed data the median or mode is the best
measure for characterizing the center; in this case I am going with the median.
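The effect of a long right tail on the mean can be seen on a tiny example. The numbers below are made up for illustration and are not taken from winterDelays.

```r
# Hypothetical delays in minutes: mostly small values plus one extreme delay
delays <- c(-5, -4, -3, -2, 0, 1, 3, 240)

delay_mean <- mean(delays)      # pulled upward by the single 240-minute delay
delay_median <- median(delays)  # stays near the bulk of the data

delay_mean
delay_median
```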



Number 6
What is the mean and standard deviation of the arrival delays for all
United airlines (UA) flights on the weekend out of SFO?
>delays = subset(winterDelays, UNIQUE_CARRIER == 'UA' & ORIGIN == 'SFO'
& DAY_OF_WEEK%in%c(6,7), ARR_DELAY)

>mean(delays$ARR_DELAY,na.rm=TRUE)
[1] 0.9390957

So the mean arrival delay for United Airlines weekend flights out of SFO is about 1 minute.
>sd(delays$ARR_DELAY,na.rm=TRUE)
[1] 36.76253

So the standard deviation of the arrival delay for these flights is about 37
minutes.
Number 7
Plot the distributions of overall delay for each month. What is the best
way to display this?
Delay1= subset(winterDelays, winterDelays$MONTH=="1")
Delay2= subset(winterDelays, winterDelays$MONTH=="2")
Delay3= subset(winterDelays, winterDelays$MONTH=="11")
Delay4= subset(winterDelays, winterDelays$MONTH=="12")

# xaxt='n' suppresses the default x axis so the month labels can be added by hand
boxplot(list(Delay1$ARR_DELAY, Delay2$ARR_DELAY, Delay3$ARR_DELAY,
Delay4$ARR_DELAY), xaxt='n', xlab="MONTH", ylab="Arrival Delay (minutes)",
outline=TRUE)
axis(1, at=1:4, labels=c("Jan", "Feb", "Nov", "Dec"))


This plot shows the distributions of delay for each month, including ALL the
outliers. As one can see, this is a hard graph to read because of all the outliers,
but it shows the true data and distributions with no manipulation.
boxplot(list(Delay1$ARR_DELAY, Delay2$ARR_DELAY, Delay3$ARR_DELAY,
Delay4$ARR_DELAY), xaxt='n', xlab="MONTH", ylab="Arrival Delay (minutes)",
outline=FALSE)
axis(1, at=1:4, labels=c("Jan", "Feb", "Nov", "Dec"))

This plot shows the distributions of delay for each month, not including the
outliers (which is lying in a sense), but it is much clearer and easier to read.
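The same kind of per-month boxplot can be drawn without splitting the data by hand, using the formula interface of boxplot(). This is a sketch on a made-up data frame; `toy` and its values are hypothetical, and only the column names MONTH and ARR_DELAY mirror the real data.

```r
# Hypothetical toy data with the same column names as winterDelays
toy <- data.frame(
  MONTH     = c(1, 1, 2, 2, 11, 11, 12, 12),
  ARR_DELAY = c(-4, 10, -6, 3, -8, 1, 5, 30)
)

# Turn the month numbers into a labelled factor so the axis reads Jan, Feb, Nov, Dec
toy$MONTH <- factor(toy$MONTH, levels = c(1, 2, 11, 12),
                    labels = c("Jan", "Feb", "Nov", "Dec"))

# One box per month; outline = FALSE hides the outlier points
bp <- boxplot(ARR_DELAY ~ MONTH, data = toy, outline = FALSE,
              xlab = "Month", ylab = "Arrival delay (minutes)")
bp$names
```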
Number 8
Display the number of flights for each airport on a single plot so we
can quickly compare them.
flight=table(winterDelays$ORIGIN)
dotchart(flight, cex=.3, main="Number of Flights per Airport",
xlab="Number of Flights", ylab="Airport" )
## Warning: 'x' is neither a vector nor a matrix: using as.numeric(x)

Above is the number of flights for each airport on a single plot. It is impossible to
read the Y axis, because I have plotted ALL airports, but it could easily be more
readable by plotting only a few airports on several plots.
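One way to get a readable version is to plot only the busiest airports, sorted by count. The sketch below uses made-up data; `origin` is a hypothetical stand-in for winterDelays$ORIGIN. Converting the table to a plain named vector also avoids the "neither a vector nor a matrix" warning from dotchart().

```r
# Hypothetical toy stand-in for winterDelays$ORIGIN
origin <- c(rep("ATL", 5), rep("ORD", 3), rep("SFO", 2), "BOS")

counts_tab <- table(origin)
# A plain named vector instead of a table object
counts <- setNames(as.vector(counts_tab), names(counts_tab))

# Keep the three busiest airports (20 might be a sensible choice on the real data)
top <- sort(counts, decreasing = TRUE)[1:3]

# rev() puts the busiest airport at the top of the chart
dotchart(rev(top), main = "Number of Flights per Airport",
         xlab = "Number of Flights")
top
```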

Number 9
Are there many more flights on weekdays relative to Saturday and
Sunday?
>compare=table(winterDelays$DAY_OF_WEEK)
>compare
1 2 3 4 5 6 7
286929 275181 288557 308239 292745 235456 274382
# With 1 being Monday, 2 being Tuesday, 3 being Wednesday, 4 being
# Thursday, 5 being Friday, 6 being Saturday and 7 being Sunday.

It appears that there are more flights on weekdays relative to Saturday
and Sunday, especially Saturday since it has the lowest number of flights
(235,456). Sunday (274,382) is much closer to the weekday counts and is nearly
level with Tuesday (275,181).
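To put a number on the comparison, the average count per weekday can be set against the average count per weekend day. The sketch below simply re-enters the seven counts printed in the table above; the variable names are illustrative.

```r
# Flight counts per day of week, copied from the table above (1 = Monday ... 7 = Sunday)
flights_by_day <- c(Mon = 286929, Tue = 275181, Wed = 288557, Thu = 308239,
                    Fri = 292745, Sat = 235456, Sun = 274382)

weekday_mean <- mean(flights_by_day[c("Mon", "Tue", "Wed", "Thu", "Fri")])
weekend_mean <- mean(flights_by_day[c("Sat", "Sun")])

weekday_mean                 # average flights per weekday
weekend_mean                 # average flights per weekend day
weekday_mean / weekend_mean  # ratio of the two averages
```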
Number 10
What day of the week has the most number of delayed flights?
>table(winterDelays$DAY_OF_WEEK, winterDelays$ARR_DELAY>0)

FALSE TRUE
1 176174 106163
2 175396 94570
3 177613 104861
4 186625 115142
5 178447 108652
6 155173 76932
7 172365 97596

Here, the TRUE column shows the number of delayed flights (ARR_DELAY > 0). The rows
signify the day of the week (1 being Monday, 2 being Tuesday and so on). We can see
that Thursday has the largest number of delayed flights (115,142).
Number 11
What day of the week has the largest median overall delay? 90th
quantile for overall delay?
>daymedian= with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK),
median, na.rm=TRUE))
>daymedian
1 2 3 4 5 6 7
-4 -5 -5 -4 -4 -6 -5

There is a tie between Mon, Thurs, and Fri for the largest median overall delay at
-4 minutes. It is important to remember that this is median overall delay and not
mean overall delay.
>with(winterDelays, tapply(ARR_DELAY, list(DAY_OF_WEEK), quantile,
prob=.9, na.rm=TRUE))

1 2 3 4 5 6 7
32 27 34 32 32 26 31

Wednesday has the highest 90th quantile for overall delay with 34 minutes.
Number 12
Consider the 10 airports with the most number of flights. For this
subset of the data, which routes (origin-destination pair) have the
worst median delay.
# The Professor said to acknowledge when we had outside help. On this
# problem I was directed by Nick Ulle.
tt = table(winterDelays$ORIGIN)
mtt= margin.table(tt,1)
order.mtt = order(mtt,decreasing=TRUE)[1:10]
# This will show the top ten airports and their corresponding number of
# flights.
tt[order.mtt]

tbest <- c("LAS","CLT","SFO", "PHX", "IAH","LAX", "DEN", "DFW","ORD",
"ATL")
sub= subset(winterDelays, winterDelays$ORIGIN %in% tbest &
winterDelays$DEST %in% tbest)
tbest.route = aggregate(sub$ARR_DELAY,by=list(sub$ORIGIN,sub$DEST),
FUN="median",na.rm = TRUE)
tbest.route[order(tbest.route$x,decreasing=TRUE),]

Group.1 Group.2 x
25 ORD DEN 0.0
49 DFW LAS 0.0
58 DFW LAX 0.0
24 LAX DEN -1.0
57 DEN LAX -1.0
59 IAH LAX -1.0
12 DFW CLT -2.0
16 ORD CLT -2.0

Here I have chosen to show just the first part of the large table to save paper. The
full output was very long and showed every route with its median delay.
The ones shown are the routes with the largest median delay among the top ten
airports. As one can see, ORD to DEN, DFW to LAS, and DFW to LAX have the largest
median delay. This delay, though, is zero minutes, because at least half of the
flights on every route arrive on time or early (again, we are looking at the median
here, not the mean).
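The route medians can also be computed with the formula interface of aggregate(), which keeps the original column names in the result. This is a sketch on a made-up data frame; `toy` is hypothetical and only borrows the column names ORIGIN, DEST and ARR_DELAY.

```r
# Hypothetical toy flights (the last one has a missing arrival delay)
toy <- data.frame(
  ORIGIN    = c("ORD", "ORD", "ORD", "DFW", "DFW", "LAX"),
  DEST      = c("DEN", "DEN", "DEN", "LAS", "LAS", "DEN"),
  ARR_DELAY = c(-3, 0, 5, -2, 4, NA)
)

# Median arrival delay per origin-destination pair.
# The formula method drops rows with NA before grouping (na.action = na.omit).
routes <- aggregate(ARR_DELAY ~ ORIGIN + DEST, data = toy, FUN = median)

# Worst routes first
routes <- routes[order(routes$ARR_DELAY, decreasing = TRUE), ]
routes
```

Because the only LAX to DEN flight has a missing delay, that route does not appear in the result at all, which is worth keeping in mind when counting routes.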


Number 13
Graphically display any relationship between the distance of the flight
(between the two origin and destination) airports and the overall
delay. Interpret the display.
smoothScatter(winterDelays$ARR_DELAY, winterDelays$DISTANCE,
xlab="Delay", ylab="Distance(miles)")
## Warning: Binning grid too coarse for current (small) bandwidth: consider
## increasing 'gridsize'

It is hard to tell from this data, because there are many more flights with shorter
distances than long ones, but it would probably be safe to say that there is not much
of a difference in delay times between shorter and longer flights. That being said,
there are a few more significant outliers among the shorter flights than among the
longer ones, but this is likely due to the vast majority of flights being short ones.
I used smoothScatter on this problem, as the professor had recommended, because it
shows the graph in a much neater fashion and it is easier to read.

Number 14
What are the worst hours to fly in terms of experiencing delays.
> tapply(winterDelays$ARR_DELAY, winterDelays$ARR_TIME_BLK, mean,
na.rm=TRUE)

0001-0559 0600-0659 0700-0759 0800-0859 0900-0959 1000-1059 1100-1159
   2.5888   -1.4700   -1.3306   -0.9237   -0.4361   -0.2953    0.2934
1200-1259 1300-1359 1400-1459 1500-1559 1600-1659 1700-1759 1800-1859
   1.3193    1.7777    3.3557    3.8163    4.7015    5.9254    6.1220
1900-1959 2000-2059 2100-2159 2200-2259 2300-2359
   6.5939    6.3258    6.5707    6.5587    5.1105

The times above are stated in military (24-hour) time, as this is what the airlines
use as a way of keeping time.
So the worst hours to fly, in terms of experiencing delays, would be between 6:00
and 11:00 pm, where the mean delay is a little over 6 minutes in every block. The
absolute worst hour to fly would be between 7:00 and 8:00 p.m. (about 6.6 minutes).
Number 15
Are the delays worse on December 25th than other days?
Thanksgiving? Provide evidence to support your conclusions.
> tapply(winterDelays$ARR_DELAY, list(winterDelays$MONTH,
winterDelays$DAY_OF_MONTH), mean, na.rm=TRUE)
            1          2          3          4          5          6          7          8          9         10
 1  6.8945763  5.9187169  4.6276680  -1.253096  0.7645577   1.391044 -3.5615892 -0.6202718  -1.571971  -1.633439
 2  3.7956690  3.0728452 -0.7300243  10.649105 -2.2401363  -2.808513  4.9996482  2.7355633  -2.238645  10.826545
11  1.4928729 -0.6600911 -4.6630724  -2.244804 -2.6255178  -2.084708  0.1688204  0.8479014   1.293546  -2.655190
12 -0.7579075  2.4712065 -2.2010932  -3.697318 -3.3443914  -4.176649 -0.5729222 -2.8824730   5.661869  16.335838
           11         12         13         14         15         16         17         18         19         20
 1   2.401867  -1.844984   7.201963  2.5351881  -1.541786  6.9567796  0.1099049  -0.705212  -6.126114  1.2997991
 2  11.548020  -2.928880  -1.979298 -0.3131504  -1.050239  1.4220683  0.1417358   0.942776   6.703037  2.4348809
11   3.966439  10.177729  -1.115996 -2.1576971   2.156633  0.9675679 -2.9009901  -3.221705  -2.974814  0.1846513
12  -1.402158  -3.225194  -2.745070 -0.9295866  -0.358311  8.5672392  9.6133976   3.798906  10.656995 18.5276957
           21         22         23         24         25         26         27         28         29         30
 1   2.112755 -0.3910833   1.182579   6.407166  13.318865  0.1107449   5.841744  5.5110657   5.766958 19.3347537
 2  10.141076 12.1875036   5.183197   2.091021   3.569462  9.8617102   8.933590  0.7131929         NA         NA
11   7.154388 -8.8592179  -6.118507  -2.850553   1.918476  3.0102443   2.688409 -1.8931600  -1.034617  0.4279986
12  25.894130 11.0176950   8.934984   2.568117  14.125010 33.2713518  22.885562 13.9638004  16.154251  8.5911889

The table above shows month (rows corresponding to 1=Jan, 2=Feb, 11=Nov, 12=Dec)
against day of month, with the mean delay in minutes for each day.
For November, the worst day for delays was November 12th, with a mean delay of about
10 minutes; November 21st, the day before Thanksgiving, was second worst at about
7 minutes. Thanksgiving fell on November 22 in 2012, and the mean delay for
Thanksgiving is about -9 minutes. So Thanksgiving day delays are not 'bad' compared
to other days of the year, but delays on the day before Thanksgiving are.
For December, the worst day for delays was December 26th, with a mean delay of about
33 minutes. Christmas day delays were still pretty bad, though, with a mean delay of
about 14 minutes. That being said, several days in December (the 10th, 20th, 21st,
26th, 27th and 29th) have worse delays than Christmas day.
So Thanksgiving day is not a bad day to travel in terms of delay times, while
Christmas day is much worse.
Number 16

How many missing values are there in each variable?

colSums(is.na(winterDelays))
                   YEAR                 QUARTER                   MONTH            DAY_OF_MONTH             DAY_OF_WEEK
                      0                       0                       0                       0                       0
                FL_DATE          UNIQUE_CARRIER              AIRLINE_ID                 CARRIER                TAIL_NUM
                      0                       0                       0                       0                       0
                 FL_NUM       ORIGIN_AIRPORT_ID   ORIGIN_AIRPORT_SEQ_ID   ORIGIN_CITY_MARKET_ID                  ORIGIN
                      0                       0                       0                       0                       0
       ORIGIN_CITY_NAME        ORIGIN_STATE_ABR       ORIGIN_STATE_FIPS         ORIGIN_STATE_NM              ORIGIN_WAC
                      0                       0                       0                       0                       0
        DEST_AIRPORT_ID     DEST_AIRPORT_SEQ_ID     DEST_CITY_MARKET_ID                    DEST          DEST_CITY_NAME
                      0                       0                       0                       0                       0
         DEST_STATE_ABR         DEST_STATE_FIPS           DEST_STATE_NM                DEST_WAC            CRS_DEP_TIME
                      0                       0                       0                       0                       0
               DEP_TIME               DEP_DELAY           DEP_DELAY_NEW               DEP_DEL15         DEP_DELAY_GROUP
                  30721                   30721                   30721                   30721                   30721
           DEP_TIME_BLK                TAXI_OUT              WHEELS_OFF               WHEELS_ON                 TAXI_IN
                      0                   31540                   31540                   32950                   32950
           CRS_ARR_TIME                ARR_TIME               ARR_DELAY           ARR_DELAY_NEW               ARR_DEL15
                      0                   32950                   35780                   35780                   35780
        ARR_DELAY_GROUP            ARR_TIME_BLK               CANCELLED       CANCELLATION_CODE                DIVERTED
                  35780                       0                       0                       0                       0
       CRS_ELAPSED_TIME     ACTUAL_ELAPSED_TIME                AIR_TIME                 FLIGHTS                DISTANCE
                      0                   35780                   35780                       0                       0
         DISTANCE_GROUP           CARRIER_DELAY           WEATHER_DELAY               NAS_DELAY          SECURITY_DELAY
                      0                 1619153                 1619153                 1619153                 1619153
    LATE_AIRCRAFT_DELAY          FIRST_DEP_TIME         TOTAL_ADD_GTIME       LONGEST_ADD_GTIME    DIV_AIRPORT_LANDINGS
                1619153                 1950485                 1950485                 1950485                       0
       DIV_REACHED_DEST DIV_ACTUAL_ELAPSED_TIME           DIV_ARR_DELAY            DIV_DISTANCE            DIV1_AIRPORT
                1957674                 1958659                 1958659                 1957755                       0
        DIV1_AIRPORT_ID     DIV1_AIRPORT_SEQ_ID          DIV1_WHEELS_ON        DIV1_TOTAL_GTIME      DIV1_LONGEST_GTIME
                1957250                 1957250                 1957249                 1957249                 1957249
        DIV1_WHEELS_OFF           DIV1_TAIL_NUM            DIV2_AIRPORT         DIV2_AIRPORT_ID     DIV2_AIRPORT_SEQ_ID
                1958597                       0                       0                 1961419                 1961419
         DIV2_WHEELS_ON        DIV2_TOTAL_GTIME      DIV2_LONGEST_GTIME         DIV2_WHEELS_OFF           DIV2_TAIL_NUM
                1961419                 1961419                 1961419                 1961479                       0
           DIV3_AIRPORT         DIV3_AIRPORT_ID     DIV3_AIRPORT_SEQ_ID          DIV3_WHEELS_ON        DIV3_TOTAL_GTIME
                      0                 1961487                 1961487                 1961487                 1961487
     DIV3_LONGEST_GTIME         DIV3_WHEELS_OFF           DIV3_TAIL_NUM            DIV4_AIRPORT         DIV4_AIRPORT_ID
                1961487                 1961489                 1961489                 1961489                 1961489
    DIV4_AIRPORT_SEQ_ID          DIV4_WHEELS_ON        DIV4_TOTAL_GTIME      DIV4_LONGEST_GTIME         DIV4_WHEELS_OFF
                1961489                 1961489                 1961489                 1961489                 1961489
          DIV4_TAIL_NUM            DIV5_AIRPORT         DIV5_AIRPORT_ID     DIV5_AIRPORT_SEQ_ID          DIV5_WHEELS_ON
                1961489                 1961489                 1961489                 1961489                 1961489
       DIV5_TOTAL_GTIME      DIV5_LONGEST_GTIME         DIV5_WHEELS_OFF           DIV5_TAIL_NUM                       X
                1961489                 1961489                 1961489                 1961489                 1961489

The table above shows the number of missing values (NA's) for each variable in the
data set.
Number 17
Each of the variables DEP_TIME, DEP_DELAY, DEP_DELAY_NEW have the
same number of missing values. Do these missing values correspond to the
same records for each of these variables?
>length(which(is.na(winterDelays$DEP_TIME)))
[1] 30721
> length(which(is.na(winterDelays$DEP_DELAY)))
[1] 30721
> length(which(is.na(winterDelays$DEP_DELAY_NEW)))
[1] 30721

The above shows that each of the three variables has exactly the same number of
missing values.
> identical(which(is.na(winterDelays$DEP_DELAY_NEW)),
which(is.na(winterDelays$DEP_TIME)), which(is.na(winterDelays$DEP_DELAY)))
[1] TRUE

The output 'TRUE' shows that the missing values of DEP_DELAY_NEW and DEP_TIME fall
on the same records. Note that identical() compares only its first two arguments
(a third positional argument is taken as the num.eq option), so DEP_DELAY needs its
own pairwise comparison before we can say that all three variables are missing on
the same records.
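A check that covers any number of columns at once can compare the whole missing-value pattern against its first column. The following is a minimal sketch on made-up data; the helper name `same_na_records` and the data frame `toy` are hypothetical.

```r
# Hypothetical toy data using the three column names from this question
toy <- data.frame(
  DEP_TIME      = c(900, NA, 1230, NA),
  DEP_DELAY     = c(5,   NA, -2,   NA),
  DEP_DELAY_NEW = c(5,   NA, 0,    NA)
)

# TRUE when every listed column is missing on exactly the same rows
same_na_records <- function(df, cols) {
  na_pattern <- is.na(df[, cols])      # logical matrix, one column per variable
  all(na_pattern == na_pattern[, 1])   # compare every column with the first one
}

cols <- c("DEP_TIME", "DEP_DELAY", "DEP_DELAY_NEW")
all_match <- same_na_records(toy, cols)

# Break the pattern in one column to show the check can fail
toy2 <- toy
toy2$DEP_DELAY[1] <- NA
broken_match <- same_na_records(toy2, cols)

all_match
broken_match
```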

Number 18
Does the distribution of delays depend on the time of day? Provide
evidence for your conclusion..

library("lattice",
lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library")

bwplot(winterDelays$ARR_DELAY~winterDelays$ARR_TIME_BLK, data =
winterDelays, scales=list(rot=45),ylim=c(-80,80),
main="Delays by Time of Day",xlab="Time of Day", ylab="Delay (in minutes)")

The median delay increases as the day goes on, as this plot shows. The spread of the
distribution of delays increases as the day goes on as well. This is likely due to
some of the flights being dependent on the ones before them, as there are only a
certain number of gates at each airport. So, yes, the distribution of delays does
depend on time of day.

Number 19
What proportion of flights took off late?
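nrow (winterDelays [winterDelays$DEP_DELAY>0,] )/ nrow(winterDelays)
[1] 0.3814918

So about 38% of flights took off late by this count. Note that the count includes
the 30,721 flights with a missing DEP_DELAY, because indexing with NA keeps an
all-NA row for each of them; leaving those out lowers the proportion to roughly
36.6%.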
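Missing values need some care in this kind of calculation: when a data frame is indexed by a logical vector that contains NA, R returns an all-NA row for every NA, and nrow() counts those rows. The sketch below shows the effect on made-up data; `toy` is hypothetical.

```r
# Hypothetical toy data with one missing record (row 3)
toy <- data.frame(
  DEP_DELAY = c(10, -3, NA, 0, 25),
  ARR_DELAY = c(4, -8, NA, -1, 30)
)

# Indexing with NA keeps an extra all-NA row, so the count is 3 instead of 2
naive <- nrow(toy[toy$DEP_DELAY > 0, ]) / nrow(toy)

# Count only the TRUE values, still as a share of all flights
share_of_all <- sum(toy$DEP_DELAY > 0, na.rm = TRUE) / nrow(toy)

# Share among flights that have a recorded departure delay
share_of_recorded <- mean(toy$DEP_DELAY > 0, na.rm = TRUE)

c(naive = naive, share_of_all = share_of_all, share_of_recorded = share_of_recorded)
```

Which denominator is appropriate depends on whether flights without a recorded delay should count as flights at all; that choice is a judgment call rather than something the data settles.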

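Number 20
What proportion of flights arrived late? What proportion arrived early?
> nrow (winterDelays [winterDelays$ARR_DELAY>0,] )/ nrow(winterDelays)
[1] 0.3771094

This figure is slightly too high, because indexing with a logical vector that
contains NA keeps an all-NA row for each of the 35,780 flights with a missing
ARR_DELAY. Using the counts from Number 10, 703,916 flights arrived late, which is
about 36% of all 1,961,489 flights.

> nrow (winterDelays [winterDelays$DEP_DELAY<0,] )/ nrow(winterDelays)
[1] 0.5756775

This command tests DEP_DELAY rather than ARR_DELAY, so 0.5756775 is the share of
flights that departed early (again counting the NA rows), not the share that arrived
early. From the Number 10 counts, 1,221,793 flights (about 62%) arrived on time or
early, so the proportion that arrived strictly early is at most 62%.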


Number 21
What proportion of flights that took off late also arrived late?
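depart=nrow(subset(winterDelays,DEP_DELAY>0))
arrive=nrow(subset(winterDelays,ARR_DELAY>0))
depart/(arrive+depart)
This ratio comes out at about 51%, but it divides the number of late
departures by the combined number of late departures and late arrivals,
so it is not the proportion of late departures that also arrived late.
That proportion is the number of flights with both DEP_DELAY>0 and
ARR_DELAY>0 divided by the number of flights with DEP_DELAY>0.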
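A proportion "of flights that took off late" is a conditional proportion: first restrict to the late departures, then ask what share of them arrived late. Here is a minimal sketch on made-up data; `toy` and its values are hypothetical.

```r
# Hypothetical toy data: six flights, one with a missing arrival delay
toy <- data.frame(
  DEP_DELAY = c(10, 5, -2, 20, 0, 15),
  ARR_DELAY = c(3, -4, 6, 12, -1, NA)
)

# Step 1: keep only flights that departed late (subset() drops NA conditions)
late_dep <- subset(toy, DEP_DELAY > 0)

# Step 2: share of those flights that also arrived late,
# ignoring flights with no recorded arrival delay
late_both_share <- mean(late_dep$ARR_DELAY > 0, na.rm = TRUE)

nrow(late_dep)
late_both_share
```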

Number 22
Do planes leaving late tend to make up time?
depart2= subset(winterDelays, DEP_DELAY>0)
nrow(subset(depart2,(DEP_DELAY - ARR_DELAY) > 0))
[1] 497560
nrow(subset(depart2,(DEP_DELAY - ARR_DELAY) < 0))
[1] 196348
The numbers above tell us that, in fact, flights that leave late do
tend to make up time. If no time were made up, a flight leaving late
would arrive late by the same amount of time or more, but 497,560 of
the flights that left late had an arrival delay smaller than their
departure delay, against 196,348 whose arrival delay was larger,
meaning most of them made up time. A very impressive
result in my opinion.
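The two counts above can be turned into a single share. This sketch only re-enters the two printed numbers; flights whose two delays are equal or missing are in neither count and are left out of the denominator.

```r
# Counts printed above for flights that departed late
made_up_time <- 497560   # DEP_DELAY - ARR_DELAY > 0
lost_time    <- 196348   # DEP_DELAY - ARR_DELAY < 0

# Share of late departures that made up time, among those where the two delays differ
share_made_up <- made_up_time / (made_up_time + lost_time)
round(share_made_up, 3)
```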

Number 23
Do flights that take off late fly faster to make up time?
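tapply(winterDelays$DISTANCE/winterDelays$AIR_TIME,
winterDelays$ARR_DELAY<=0, mean)
   FALSE     TRUE
6.471606 6.732395

The TRUE column is the mean speed (in miles/minute) for flights that
arrived on time or early, and the FALSE column is for flights that
arrived late. So it is clear that the flights that are late are not
flying faster than those that are on time. Note that the grouping here
is by arrival delay; grouping by DEP_DELAY>0 instead would compare the
late departures directly.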


Number 24
For flights originating from SFO, what are the 5 most popular
destination airports?
org.SFO= subset(winterDelays, ORIGIN=="SFO")
SF.tab= table(org.SFO$DEST)
m.SFO= margin.table(SF.tab,1)
SF.ORDER= order(m.SFO, decreasing=TRUE)[1:5]
SF.tab[SF.ORDER]
 LAX  LAS  JFK  SAN  ORD
5292 2732 2634 2508 2434
So here we can see that the top 5 destination airports from SFO are:
LAX, LAS, JFK, SAN, and ORD respectively.


Number 25
For flights originating from SFO, compute the distance to the 5 most
popular destination airports. How did you do this? Can we do this for
all pairs of airports?
tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" &
winterDelays$DEST=="ORD", mean)
    FALSE      TRUE
 760.3089 1846.0000

So the average distance from SFO to ORD is 1846 miles.

tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" &
winterDelays$DEST=="SAN", mean)
   FALSE     TRUE
762.0589 447.0000

So the average distance from SFO to SAN is 447 miles.

tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" &
winterDelays$DEST=="JFK", mean)
   FALSE     TRUE
 759.203 2586.000

So the average distance from SFO to JFK is 2586 miles.

tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" &
winterDelays$DEST=="LAS", mean)
  FALSE    TRUE
762.141 414.000

So the average distance from SFO to LAS is 414 miles.

tapply(winterDelays$DISTANCE, winterDelays$ORIGIN=="SFO" &
winterDelays$DEST=="LAX", mean)
   FALSE     TRUE
762.8049 337.0000

So the average distance from SFO to LAX is 337 miles.

In each case the TRUE group contains only the flights on that route, so the mean of
DISTANCE over that group is the distance for the route. Yes, this can be done for any
pair of airports in the data set with a simple substitution of airport destination
and origin.
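Rather than substituting one pair at a time, the distance for every origin-destination pair can be read off in one step by keeping one row per distinct route. This is a sketch on made-up rows; `toy` is hypothetical, with the SFO distances taken from the results above and the LAX to SFO row assumed for illustration.

```r
# Hypothetical toy flights; the SFO distances match the values computed above
toy <- data.frame(
  ORIGIN   = c("SFO", "SFO", "SFO", "LAX"),
  DEST     = c("LAX", "LAX", "JFK", "SFO"),
  DISTANCE = c(337, 337, 2586, 337)
)

# One row per distinct (origin, destination, distance) combination
route_dist <- unique(toy[, c("ORIGIN", "DEST", "DISTANCE")])

# Distances for routes out of SFO only
sfo_dist <- route_dist[route_dist$ORIGIN == "SFO", ]
sfo_dist
```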
Number 26
For flights from SFO to these 5 most popular destinations, compute
and display the distribution of the average speed of the airplane on
each flight?
#
subset= winterDelays[winterDelays$ORIGIN=="SFO" &
as.character(winterDelays$DEST)%in%c("ORD", "SAN", "LAX", "JFK",
"LAS"),]

subset[,"DEST"]=as.character(subset[,"DEST"])

speed= (subset$DISTANCE/ subset$AIR_TIME)

splt=split(speed, list(subset$DEST))

boxplot(splt)

The boxplot above shows the distribution of average speed (in miles per minute) for
flights from SFO to each of the top 5 destination airports.
Number 27
Are the distributions of delays for "commuter" flights between SFO and
LAX similar to those for SFO to JFK, EWR?
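delay=(winterDelays$ARR_DELAY)

splt2=split(delay, list(subset$DEST))
## Warning: data length is not a multiple of split variable
boxplot(splt2, outline=FALSE)

The above plot suggests that the distributions for flights from SFO to LAS and LAX
are similar to those to JFK, despite the flights being much further in distance.
This comparison should be read with caution, though: the warning arises because
delay holds ARR_DELAY for every flight in winterDelays while subset$DEST only covers
the SFO flights from Number 26, so split() recycles the destinations and the groups
do not correspond to actual routes. Using subset$ARR_DELAY, and adding EWR to the
destinations, would compare the routes the question asks about. This plot does not
show the outliers like the one below, but it gives us a better view of the
distributions.
boxplot(splt2)

This is the same box plot as the one above, but it includes the outliers and still
shows that the distributions are similar.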
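When delays are split by destination, the delay vector and the grouping vector have to come from the same rows. The sketch below shows that pattern on made-up data; `toy` is hypothetical, and EWR is included because the question asks about it.

```r
# Hypothetical toy flights with the same column names as winterDelays
toy <- data.frame(
  ORIGIN    = c("SFO", "SFO", "SFO", "SFO", "LAX"),
  DEST      = c("LAX", "JFK", "LAX", "EWR", "SFO"),
  ARR_DELAY = c(-5, 12, 3, NA, 8),
  stringsAsFactors = FALSE
)

# Restrict to the routes of interest first ...
sfo <- toy[toy$ORIGIN == "SFO" & toy$DEST %in% c("LAX", "JFK", "EWR"), ]

# ... then take both the delays and the grouping from that same subset
by_dest <- split(sfo$ARR_DELAY, sfo$DEST)
by_dest
```

Passing `by_dest` to boxplot() would then draw one box per destination, each based only on flights that actually flew that route.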
Number 28
What other questions could we address with this data?
We could ask which days of the winter months are the best days to fly
in terms of delay time.

We could find which airline carrier had the highest average delay time for
the winter months.

We could also see which airline carrier experienced the most
cancellations.

Number 29
What other data/variables would allow us to address additional
interesting questions?
If we had data on the weather conditions per flight, we could find what percentage
of the delays were a result of weather conditions.

We could also gather data on passengers, and see if flights with more passengers
had a higher average delay time.
We could gather delay data from the entire year to find which are the worst and best
times to fly in terms of delay.
