Sunday, June 30, 2013

Better Bird Strike Visualizations with R and ggplot2


Last week I wrote about building some graphs for the FAA Bird Strike Dataset. I used R's built-in graphics capabilities for that work. In this post, I re-do the graphs using the ggplot2 plotting system for R. Why? Because ggplot2 builds upon the features and capabilities of the previous two graphics systems, base and lattice, and improves upon them. It seems to be the graphics system of choice if you are doing any serious R. So in any case, something I thought was worth learning, so I did.

In the interests of time, I will skip over the back story behind the graphs, since I already covered them in depth on my previous post.

The first graph displays the frequency of bird strikes for each of the states in the US in the form of a map. It makes use of a column called Origin.State in the Bird Strike dataset. Here is the code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
setwd("/path/to/project")

df <- read.csv("BirdStrikes.csv")
df.clean <- df[df$Origin.State != "N/A", ]

library(maps)
library(ggplot2)
library(RColorBrewer)

usa.map <- map_data("state")

# produce a data frame (region, val) where region is the 
# lowercased state name and val is the number of bird hits.
# We want to merge it with usa.map$region
hits.by.state <- as.data.frame(table(df.clean$Origin.State))
names(hits.by.state) <- c("region", "val")
total.hits <- sum(hits.by.state$val)
hits.by.state$region <- tolower(hits.by.state$region)

# merge the bird strike data in
usa.map.with.hits <- merge(usa.map, hits.by.state, by="region", all=TRUE)
usa.map.with.hits <- usa.map.with.hits[order(usa.map.with.hits$order), ]

# plot the data
colors <- brewer.pal(9, "Reds")
(qplot(long, lat, data=usa.map.with.hits, geom="polygon", 
       group=group, fill=val)) +
  theme_bw() +
  labs(title="States by Strike Count", x="", y="", fill="") + 
  scale_fill_gradient(low=colors[1], high=colors[9]) +
  theme(legend.position="bottom", legend.direction="horizontal")

And the graph generated looks like this:


The second graph shows the total cost of the impact by state. In an effort to pack in more information, we use a color gradient to indicate the number of bird hits for each state. So we combine the second and third graphs from the previous post into one here. We also flipped the graph to make it horizontal, that way its also easier to read. Here is the code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
library(ggplot2)
library(RColorBrewer)

# We want to produce a single bar plot where each state is represented
# by a bar. The color of the bar indicates the number of bird hits and
# the length of the bar indicates the cost incurred as a result
strikes <- as.data.frame(table(df.clean$Origin.State))
names(strikes) <- c("state", "num_hits")
costs <- as.data.frame(aggregate(
  as.numeric(Cost..Total..) ~ Origin.State, data=df.clean, FUN="sum"))
names(costs) <- c("state", "cost")
comb.strikes.costs <- merge(strikes, costs, all=TRUE)
comb.strikes.costs <- comb.strikes.costs[
  rev(order(comb.strikes.costs$cost)), ]

# clean up N/A from merged data, and add a state.level field for
# sorting by state name for ggplot. If we use state, ggplot will
# plot with the Alpha list of states, we want it sorted by cost
comb.strikes.costs <- comb.strikes.costs[
  comb.strikes.costs$state != "N/A", ]
comb.strikes.costs <- transform(comb.strikes.costs, 
                                state.level=reorder(state, cost))
comb.strikes.costs <- transform(comb.strikes.costs, 
                                hit.level=cut(num_hits, 9))

# plot the data
ggplot(comb.strikes.costs, aes(x=state.level, y=cost, colour=hit.level)) + 
  geom_bar(width=0.5, stat="identity") + 
  coord_flip() +
  labs(title="Costs by State", x="", y="(Dollars)", fill="") +
  scale_colour_brewer(palette="Reds") +
  theme(legend.position="none")

And our graph looks like this:


Our next graph draws the total bird strikes (ie, across all states) as a time series. The time dimension is given by the FlightDate column. We also compute a trend line using a linear model similar to the previous post and draw it on the graph. Here is the code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
library(ggplot2)
library(RColorBrewer)

# draw line chart of number of bird hits by flight date.
# Convert flight date to Date so we can sort
hit.dates <- as.data.frame(table(df$FlightDate))
names(hit.dates) <- c("flight_date", "num_hits")
hit.dates$flight_date <- as.Date(hit.dates$flight_date, 
                                 format="%m/%d/%Y %H:%M")

# Compute trend line and plot
trend <- lm(hit.dates$num_hits ~ hit.dates$flight_date)
trend.coeffs <- as.array(coef(trend))

# graph the number of hits over date
ggplot(data=hit.dates, aes(x=flight_date, y=num_hits, group=1)) + 
  geom_line() +
  labs(title="Bird Strikes by Date", x="Flight Date", y="#-hits") +
  geom_abline(intercept=trend.coeffs[[1]], slope=trend.coeffs[[2]], 
              color="red", size=2)

And the output is the graph as shown below:


Finally, we want to graph the incidence of various effects as a result of the bird strikes. The effect is given by the Effect..Impact.to.flight column. It is a set of 6 category variables, out of which we discard "Null", "None" and "Other" since they don't appear interesting enough for our analysis. To keep the graph smooth, we rollup the numbers into monthly numbers. Here is the code:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(scales)

# roll up flight dates to nearest month. We do this by replacing all
# flight dates to the first of the month, then computing the number
# of days since the epoch (1970-01-01).
df.clean$FlightDate <- as.numeric(as.Date(
  format(as.Date(df.clean$FlightDate, format="%m/%d/%Y %H:%M"), 
  "%Y-%m-01")) - as.Date("1970-01-01"))

# Populate a table whose rows are flight dates and whose columns
# are each category of bird strike impact.
flight.dates <- unique(df.clean$FlightDate)
flight.dates <- flight.dates[order(flight.dates)]
effects <- unique(df.clean$Effect..Impact.to.flight)

# build up the data frame for plotting.
plot.df <- data.frame()
for (flight.date in flight.dates) {
  df.filter <- df.clean[df.clean$FlightDate == flight.date, ]
  counts <- table(df.filter$Effect..Impact.to.flight)
  row <- c(flight.date)
  for (effect in effects) {
    if (nchar(effect) > 0) {
      effect.count <- counts[effect]
      row <- cbind(row, effect.count)
    }
  }
  plot.df <- rbind(plot.df, row)
}
names(plot.df) <- c("flight_date", "NO", "PL", "AT", "OT", "ES")

# graph the number of hits over date. We restrict it to (PL, AT, ES)
# because the others seem unimportant.
plot.df.melted <- melt(plot.df, 
  id="flight_date", measure=c("PL", "AT", "ES"))
# convert the number of days since epoch back to actual dates for 
# graphing
plot.df.melted$flight_date <- as.POSIXct(
  plot.df.melted$flight_date * 86400, origin="1970-01-01")
ggplot(plot.df.melted, 
  aes(x=flight_date, y=value, colour=variable)) +
  labs(title="Impact Types over Time", x="Flight Date", y="#-impacts") +
  theme(legend.position="bottom") +
  geom_line(stat="identity")

And the corresponding graph is shown below:


If you compare the graphics generated in this post to my previous post, I think you will agree that the graphs look nicer and more professional. You may also notice a slight improvement in the quality of the R code. Thats because with ggplot2, it is easier if you can convert your input set to a data frame, so you will probably find the current R code more structured and easier to understand. I also use a few built-in R functions that I didn't know existed before.

1 comments (moderated to prevent spam):

Sujit Pal said...

Hi Sojar, just wanted to point out that this was a post about /not/ using Tableau (and using ggplot2 instead). Also wanted to point out that the link you provided 404s, you may want to fix/get fixed.