# install.packages("anyflights")
# install.packages("sqldf")
Aviation Data Analysis using SQL and ggplot2
EDAV Community Contribution (Fall 2024)
Background
Data Manipulation is a very important step before performing any kind of analysis. It allows us to alter the structure of the data and helps to derive additional data from the existing data. Without proper data manipulation, the quality of analysis or visualizations can get compromised. In R, dplyr
is one of the most popular and effective data manipulation libraries. Moreover, SQL is also another language that is dedicated to handle and manipulate data. Almost all the operations that can be done with SQL queries can be translated into an equivalent logic written with the help of the dplyr
library.
In this notebook, I have used the anyflights
package to retrieve some data pertaining to the on-time performance of flights departing out of all the three airports serving the New York City (NYC) region (i.e. LGA, JFK, and EWR). This package is a great open-source resource for aviation data since it queries the annual flight data released by Federal Aviation Administration (FAA). Moreover, it can be used for performing geo-spatial and time-series data analysis.
Five questions are formulated based on the retrieved data and data manipulation is performed by writing SQL select queries for them. In addition to data manipulation, simple plots are also made to visualize the data obtained as a result of the select queries.
Installations
This notebook explores two new packages in R. These include the following:
anyflights
: It is a package that provides useful information about the arrival and departure delays of flights (within the United States). It queries the yearly data released by the Federal Aviation Administration (FAA). In addition to the flight on-time performance, it also provides information about the aircraft information (such astail number
,manufacturer
, andmodel name
), weather information (such asvisibility
and precipitation), airline metadata, and airport metadata.(ref: https://cran.rproject.org/web/packages/anyflights/anyflights.pdf)
sqldf
: It is a package that allows to write SQL select queries on data frames. The package has only one function which issqldf()
and it takes one required argument, the SQL query. It is one of the simplest ways to execute SQL queries on data frames in R.(ref: https://cran.r-project.org/web/packages/sqldf/sqldf.pdf)
Loading the required libraries
library(ggplot2, quietly = T)
library(dplyr, quietly = T)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyverse, quietly = T)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
✔ readr 2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(anyflights, quietly = T)
library(RSQLite)
library(sqldf)
Loading required package: gsubfn
Loading required package: proto
Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 0x0006): Library not loaded: /opt/X11/lib/libSM.6.dylib
Referenced from: <31EADEB5-0A17-3546-9944-9B3747071FE8> /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/modules/R_X11.so
Reason: tried: '/opt/X11/lib/libSM.6.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/X11/lib/libSM.6.dylib' (no such file), '/opt/X11/lib/libSM.6.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libSM.6.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk-11.0.18+10/Contents/Home/lib/server/libSM.6.dylib' (no such file)
tcltk DLL is linked to '/opt/X11/lib/libX11.6.dylib'
Could not load tcltk. Will use slower R code instead.
library(DBI)
Obtaining the data
The anyflights()
has three required arguments as follows:
station
: The origin airport(s). This can be a single value or a list of values.year
: The year for which the data is required. This can only be a single value.month
: The month(s) for which the data is required. This can be a single value or a list of values.
Note: The fetching and processing of the data takes significant amount of time. Hence, increasing the default timeout
of 60 seconds using the timeout
argument of options()
can be helpful to omit a chance of timeout while obtaining the data.
# options(timeout = 120)
# nycflights23 <- anyflights(c("JFK", "LGA", "EWR"), 2023, c(1,2))
All the data is stored in .csv
files to avoid repeated API calls.
# write.csv(x = nycflights23$flights, file = "flights.csv")
# write.csv(x = nycflights23$planes, file = "planes.csv")
# write.csv(x = nycflights23$weather, file = "weather.csv")
<- read.csv(file = "flights.csv")
flights <- read.csv(file = "planes.csv")
planes <- read.csv(file = "weather.csv") weather
An interesting feature of the anyflights
package is that after loading the data, it can be converted to be used as a package for future use. However, this works only for the current session.
# as_flights_package(nycflights23)
Revisiting SQL
SQL stands for Structured Query Language and it is used to store relational data. Any entry in the table must be uniquely identifiable with primary key that can include a single column or more than one column. Any table can have one or more than one foreign key that can help retrieve relevant information from other tables as and when required. Moreover, these foreign keys are often used to consolidate two or more tables together with the help of the concept of joins.
Select queries in SQL are data control statements since they only retrieve the data and do not perform any modifications to it. The typical structure of a select query includes the following keywords:
SELECT <col1>, <col2>, ...
: The select keyword defines which columns need to be included in the data. Here,<col1>
and<col2>
are the columns that will be a part of the result obtained after the execution of the query. Other columns will not be included in the result.FROM <table>
: The from keyword tells which table will be queried. Here,<table>
is the placeholder for the table that will be queried.<table>
often includes a keywordJOIN
that combines two or more tables together based on a join predicate generally includes conditions on the foreign keys that make the tables related to each other.WHERE <condition>
: Thewhere
keyword filters the results based on the given condition(s).GROUP BY <col1>, <col2>, ...
: The group by keyword is used to arrange columns in the form of groups and apply aggregate functions on them.ORDER BY <col1>, <col2>, ...
: The order by keyword is used to sort the rows based on the columns mentioned after it. Here,<col1>
and<col2>
are the columns that will be used to the results obtained after the query execution. By default, the values are sorted in ascending order. However, if descending order is preferred then the column name can be succeeded byDESC
.AS
: The as keyword is used to alias column and table names.WITH
: The with keyword allows to store results of a select query so that it can be used multiple times in the main select query.COUNT()
,AVG()
,ROUND()
: These are aggregate functions and must be used along withGROUP BY
There are a plethora of other keywords (such as HAVING
) and functions that can be used inside select queries in SQL.
Background for on-time performance for flights
With reference to the aviation industry, on-time performance deals with how well the flight abides to its designated schedule. This leads to three cases as follows:
If the flight overshoots the scheduled departure or arrival time, then it is referred to as late. This implies that a positive
dep_delay
orarr_delay
means that the flight arrived or departed late.If the flight departs or arrives before schedule, then it is referred to as early. This implies that a negative
dep_delay
orarr_delay
means that the flight arrived or departed early.If the flight departs or arrives exactly as per the schedule, then it is referred to as on-time. This implies that a zero
dep_delay
orarr_delay
means that the flight arrived or departed on-time.
A flight can be late for numerous reasons and totally depends on the real-time situation. Most obvious reasons for the delay could be weather-related issues, technical issues related to aircraft equipment, insufficient airport infrastructure, and busy air traffic. Diagnosing the reason behind the delay can be a very challenging task for a computer since there are a lot of factors that come into picture and there might be a totally novel factor that could lead to a flight delay.
In this notebook, five such common parameters are computed as follows:
Average departure delay of the origin airports
Average arrival delay of the destination airports
Market share of airlines
Age of the aircraft
Visibility of the departure airport
Data is fetched and manipulated for each of the above cases.
Performing data transformation using SQL in R
The library(nycflights23)
does work properly when the rendering the .qmd
file.
# library(nycflights23)
A remote connection to SQLite
database is opened by the dbConnect()
function. The dbWriteTable()
and dbReadTable()
are used to copy data frames into a database table. I apply all the functions to the flights
, planes
, and weather
data frames from nycflights23
.
<- dbConnect(SQLite(), "")
connection dbWriteTable(connection, 'flights', flights)
= dbReadTable(connection, 'flights')
flights_tbl
dbWriteTable(connection, 'planes', planes)
= dbReadTable(connection, 'planes')
planes_tbl
dbWriteTable(connection, 'weather', weather)
= dbReadTable(connection, 'weather') weather_tbl
Let us formulate five questions that will be solved using the SQL with help of sqldf
.
Writing SQL queries in R
Q. What is the average departure delay for the airports that serve NYC region?
The origin
column in the flights
data frame contains only three airports which serve the New York City region. COUNT(*)
computes the number of rows for each origin airport (which is nothing but all the flights operated from that airport). AVG(dep_delay)
computes the average departure delay at each origin airport with help of GROUP BY
on origin
.
The values in the result are sorted in descending order of num_flights
.
Note: SQL is case insensitive but writing the keywords in capital letters helps distinguish between keywords or functions and column names.
# only keeping common airlines
<- c("DL", "AA", "UA", "WN", "B6", "YX")
airlines $carrier <- ifelse(flights$carrier %in% airlines,
flights$carrier, "Other") flights
<- sqldf("SELECT origin,
avg_dep_delay COUNT(*) AS num_flights,
AVG(dep_delay) AS avg_dep_delay
FROM flights
GROUP BY origin
ORDER BY num_flights DESC;")
avg_dep_delay
origin num_flights avg_dep_delay
1 LGA 26682 10.83581
2 EWR 22614 13.41770
3 JFK 21485 13.73005
Q. What is the average arrival delay for each airport?
The logic to this question is very similar to the previous except that the dest
and arr_delay
is used instead of origin
and dep_delay
respectively.
<- sqldf("SELECT dest,
avg_arr_delay COUNT(*) AS num_flights,
AVG(arr_delay) AS avg_arr_delay
FROM flights
GROUP BY dest
ORDER BY num_flights DESC
LIMIT 15;")
avg_arr_delay
dest num_flights avg_arr_delay
1 BOS 3438 5.9398564
2 ORD 3122 4.2192897
3 MCO 2932 7.7469051
4 MIA 2885 3.2115922
5 ATL 2800 4.3634067
6 FLL 2515 7.8836743
7 LAX 2449 2.4705155
8 CLT 2068 3.7313141
9 DFW 1901 5.5391400
10 RDU 1854 3.6706849
11 DCA 1825 -2.4446309
12 SFO 1742 0.9865889
13 DTW 1618 0.8500627
14 DEN 1575 6.5267915
15 PBI 1556 9.2992889
Q. What is the market share of each carrier on every route?
Market share is considered as the number of flights operated by the carrier divided by the total number of flights operated on that route. Initially, the total number of flights on each route are computed and is stored temporarily as total_flights
. In the main query, the two tables: flights
and total_flights
are joined based on the origin
and dest
. The market share is computed as a proportion of flights operated by the carrier from the total number of flights between origin
and dest
.
Note: In SQL, when computing percentages, it is important to use 100.0 instead of 100 to get the percentage with decimal value precision.
<- sqldf("WITH total_flights AS (
mk_share SELECT origin,
dest,
COUNT(*) AS num_flights
FROM flights
GROUP BY origin, dest
ORDER BY num_flights
)
SELECT flights.origin,
flights.dest,
carrier,
ROUND(COUNT(*)*100.0/num_flights, 1) AS market_share
FROM flights JOIN total_flights
ON flights.origin = total_flights.origin AND
flights.dest = total_flights.dest
GROUP BY flights.origin, flights.dest, carrier
ORDER BY total_flights.num_flights DESC;"
)
<- mk_share |>
mk_share_wide pivot_wider(names_from = carrier, values_from = market_share)
|> head(15) mk_share_wide
# A tibble: 15 × 9
origin dest AA DL Other UA YX B6 WN
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 LGA ORD 38.2 22.5 9 21.1 9.2 NA NA
2 JFK LAX 31.9 29.8 NA NA NA 38.3 NA
3 LGA BOS NA 9.9 NA NA 41.8 48.3 NA
4 LGA ATL 0.1 59.7 4.7 NA 17.6 NA 17.9
5 EWR MCO NA NA 21.3 57.7 NA 21 NA
6 LGA DCA 34.5 NA NA NA 65.5 NA NA
7 JFK BOS NA 15.3 4.6 NA 35 45.1 NA
8 JFK SFO 19.6 28.8 21.5 NA NA 30.2 NA
9 JFK MIA 54.1 22.3 NA NA NA 23.6 NA
10 EWR FLL NA NA 25.3 50.5 NA 24.1 NA
11 LGA DFW 70 24.8 5.2 NA NA NA NA
12 LGA MIA 58.3 23.5 18.2 NA NA NA NA
13 EWR ATL NA 47.6 12.1 35.6 4.7 NA NA
14 LGA MCO NA 36.2 24.8 NA NA 36.4 2.5
15 EWR ORD 32.4 NA 4.3 63.2 NA NA NA
= c("DL" = "#EE7663", "AA" = "#A0BEF5", "WN" = "lightgreen",
fill_color "UA" = "cornflowerblue", "B6" = "gold", "Other" = "#91AFB3",
"YX" = "#85decb")
|> head(15) |>
mk_share_wide pivot_longer(cols = !c("origin", "dest"),
names_to = "carrier",
values_to = "market_share") |>
ggplot(aes(x = paste(origin, " to ", dest),
y = market_share, fill = carrier)) +
geom_col(position = "stack") +
geom_text(aes(label = market_share),
position = position_stack(vjust = .5),
size = 3.5) +
scale_fill_manual(values = fill_color) +
labs(title = "Marketshare of Carrier by Route",
x = "Route",
y = "Market Share (%)") +
coord_flip() +
theme_minimal()
Warning: Removed 53 rows containing missing values or values outside the scale range
(`geom_col()`).
Warning: Removed 53 rows containing missing values or values outside the scale range
(`geom_text()`).
If a carrier has significantly higher market share on a particular route, then it is highly probable that the carrier has a hub at either the origin or destination airport. For example, United Airlines does not have any flight departing from JFK or LGA. Hence, it has a hub at EWR airport.
Q. What is the age of the plane operating the flight?
The age of the plane corresponds to the difference between the current year and the year in which it came into service. The two tables flights and planes are joined together based on the tail number of plane (foreign key) and the difference between flights.year
and planes.year
is computed. The planes for which the which the year of start of service is not provided are ignored in the calculation.
The final result is sorted in descending order of the dep_delay
.
<- sqldf("SELECT origin,
plane_age dest,
sched_dep_time,
flights.tailnum,
planes.year AS service_start_year,
flights.year - planes.year AS age_of_plane,
dep_delay,
arr_delay
FROM flights JOIN planes
ON flights.tailnum = planes.tailnum
WHERE service_start_year IS NOT NULL
ORDER BY dep_delay DESC;")
|> head(15) plane_age
origin dest sched_dep_time tailnum service_start_year age_of_plane dep_delay
1 LGA DFW 1700 N967AN 2001 22 1665
2 JFK STT 1241 N8001N 2013 10 1201
3 LGA DEN 1215 N47281 2022 1 1181
4 EWR MSP 1500 N241SY 2016 7 1169
5 LGA ATL 1340 N336NW 1992 31 1097
6 JFK SLC 1630 N394DA 2000 23 1074
7 EWR DTW 555 N955AT 2000 23 1057
8 EWR ATL 1400 N350NA 1993 30 1043
9 LGA ATL 1555 N366NW 1999 24 1038
10 JFK MIA 1955 N118DY 2021 2 1027
11 JFK ORD 1600 N261SY 2017 6 1010
12 LGA MIA 1759 N310FR 2017 6 1006
13 LGA ORD 1915 N145SY 2015 8 996
14 EWR SLC 1843 N849DN 2015 8 974
15 JFK SFO 730 N556AS 2006 17 967
arr_delay
1 1647
2 1193
3 1170
4 1183
5 1082
6 1073
7 1035
8 1044
9 1009
10 1008
11 1012
12 961
13 1016
14 970
15 982
|>
plane_age ggplot(aes(x = age_of_plane, y = dep_delay, color = arr_delay)) +
geom_point(alpha = 1.5) +
labs(title = "Age of plane vs. Departure delay",
subtitle = "filled with arrival delay",
x = "Age of plane (in years)",
y = "Departure delay (in minutes)") +
theme_minimal()
Warning: Removed 905 rows containing missing values or values outside the scale range
(`geom_point()`).
From the plot, it is not certain that if the plane is old, there will be a delay.
Q. What is the visibility at the time of take-off for every flight?
Visibility is a key factor that can lead to flights taking off late or arriving late. It is basically refers to how clearly the ground is visible. Visibility is denoted by visib
column in the weather
data frame. The values of visibility are only reported for the origin airport on an hourly basis. Therefore, the flights and weather columns are joined based on the origin
. At this point, all the instances of the flight are mapped to the 24 hourly readings of visibility. This result is filtered by checking whether the month
, day
, and hour
of flight match.
The hour
is not provided explicitly in the flights
data frame but it is be derived from the sched_dep_time
column. The sched_dep_time
provides an integer which indicates the scheduled time of departure. If the scheduled time of departure is 23:10, then the sched_dep_time
column holds 2310 in integer format. The hour is derived by dividing the value in sched_dep_time
by 100 and rounding it off to the nearest integer. The resultant value is compared with the hour
column of the weather
data frame.
All rows in the result are sorted in ascending order of visibility
and descending order of dep_delay
.
<- sqldf("SELECT flights.origin,
visib dest,
visib AS visibility,
dep_delay,
arr_delay
FROM flights JOIN weather
ON flights.origin = weather.origin
WHERE flights.month = weather.month AND
flights.day = weather.day AND
ROUND(flights.sched_dep_time/100) = weather.hour AND
visib IS NOT NULL
ORDER BY visib ASC;")
|> head(15) visib
origin dest visibility dep_delay arr_delay
1 JFK MIA 0.24 -6 -6
2 JFK ATL 0.24 151 126
3 JFK SJU 0.24 -2 -8
4 JFK TPA 0.24 18 28
5 JFK SLC 0.24 20 26
6 JFK MIA 0.24 25 10
7 JFK MIA 0.24 -3 11
8 JFK LAS 0.24 -2 15
9 JFK BQN 0.24 -2 2
10 EWR GSO 0.24 0 2
11 EWR LAS 0.24 0 -6
12 EWR RDU 0.24 -7 -11
13 EWR AUS 0.24 -5 -20
14 EWR PBI 0.24 1 0
15 EWR ORD 0.24 26 27
|>
visib ggplot(aes(x = visibility, y = dep_delay, color = arr_delay)) +
geom_point(alpha = 0.9) +
labs(title = "Visibility vs. Departure delay",
subtitle = "filled with arrival delay",
x = "Visibility (in miles)",
y = "Departure delay (in minutes)") +
theme_minimal()
Warning: Removed 890 rows containing missing values or values outside the scale range
(`geom_point()`).
The delays because of bad visibility are not very high enough. Rather, the maximum delays are at a very high visibility of 10 miles which means there is some other reason for the delays.
Conclusion
This notebook discusses the availability of an open-source library that can provide actual flight information for any given set of origin airports, months, and a specific year.
All the data manipulation was carried out using SQL. However, all the above SQL queries can be written in R code using the dplyr
library. Both the methods give the same results and choosing which one to use is completely your choice.
In addition to performing using SQL within R, this notebook also gives an idea on how to simple select queries in SQL.