Insights from Flight Delay Data Visualization

Satyam
9 min read · Mar 19, 2022
Photo by VOO QQQ on Unsplash

Introduction

This work was the winning solution submitted by my teammate Shubham Gupta and me in an analytics competition organized by the Analytics Club of IIM Indore.

You can download the case from here and the data from here. You can also download the code from my GitHub profile.

Objective:
The objective of the case was to find patterns and insights in the data that help explain the reasons for flight delays.

Data-set

The dataset consisted of 36 files, each containing the data for one calendar month, so it covers a period of 3 years. The data are records of all flights taking off from and landing at US airports.

Data-set Provided (Image by Author)

Data Preprocessing

The data types of some variables, such as dates, were changed, and the columns marked as irrelevant in the case were dropped. After this, missing values were handled.

Missing Value Treatment

Missing Value Visualization (Image by Author)

All rows with missing values in the delay-related columns were removed.

Missing values in the other columns were filled with the column median.
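The two rules above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the repository: the column names (`ARRIVAL_DELAY`, `DEPARTURE_DELAY`, `DISTANCE`) are assumptions, and the sketch assumes every non-delay column is numeric so that a median exists.

```python
from statistics import median

# Hypothetical column names; the real dataset's headers may differ.
DELAY_COLUMNS = ("ARRIVAL_DELAY", "DEPARTURE_DELAY")

def treat_missing(rows, delay_columns=DELAY_COLUMNS):
    """Drop rows missing any delay field, then median-fill the remaining gaps."""
    kept = [dict(r) for r in rows if all(r.get(c) is not None for c in delay_columns)]
    other_columns = {c for r in kept for c in r} - set(delay_columns)
    for col in other_columns:
        present = [r[col] for r in kept if r.get(col) is not None]
        if not present:
            continue  # nothing to compute a median from
        fill = median(present)
        for r in kept:
            if r.get(col) is None:
                r[col] = fill
    return kept

flights = [
    {"ARRIVAL_DELAY": 5, "DEPARTURE_DELAY": 3, "DISTANCE": 500},
    {"ARRIVAL_DELAY": None, "DEPARTURE_DELAY": 2, "DISTANCE": 700},
    {"ARRIVAL_DELAY": 12, "DEPARTURE_DELAY": 9, "DISTANCE": None},
    {"ARRIVAL_DELAY": -4, "DEPARTURE_DELAY": 0, "DISTANCE": 900},
]
cleaned = treat_missing(flights)
```

Note that the median is computed after the delay rows are dropped, so removed flights do not influence the fill value. Whether the original analysis did it in that order is not stated in the case.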

Feature Engineering

New features based on conditions given in the problem statement were created.

Derived Features

  1. ‘Status’: To show delay status
  2. ‘DELAYED’: Based on condition given in Problem Statement
  3. ‘LANDING_DELAYED’: Based on condition given in Problem Statement
  4. ‘TAKEOFF_DELAYED’: Based on condition given in Problem Statement
  5. ‘SYSTEMIC_DELAY’: To show delay at airports: 1 if the delay is more than 15% higher than the regular delay at that airport, else 0

Condition mentioned in the Case:

Delayed: A flight will be considered delayed if (actual arrival time - scheduled arrival time) / (scheduled flying time) > 10%.
Landing Delay: A landing will be considered delayed if (actual arrival time - actual departure time - scheduled flying time) / (scheduled flying time) > 10%.
Take-off Delay: A take-off will be considered delayed if (actual departure time - scheduled departure time) / (scheduled flying time) > 10%.
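As a quick illustration of the three conditions, here is a minimal sketch. It assumes all times are expressed in minutes on a common clock and that the scheduled flying time equals scheduled arrival minus scheduled departure; the function and parameter names are made up for this example.

```python
THRESHOLD = 0.10  # the 10% cutoff from the case

def delay_flags(sched_dep, actual_dep, sched_arr, actual_arr):
    """Return (DELAYED, LANDING_DELAYED, TAKEOFF_DELAYED) as 0/1 flags.

    All times are minutes on a common clock (an assumption of this sketch).
    """
    sched_flying = sched_arr - sched_dep
    if sched_flying <= 0:
        raise ValueError("scheduled flying time must be positive")
    delayed = (actual_arr - sched_arr) / sched_flying > THRESHOLD
    landing = (actual_arr - actual_dep - sched_flying) / sched_flying > THRESHOLD
    takeoff = (actual_dep - sched_dep) / sched_flying > THRESHOLD
    return int(delayed), int(landing), int(takeoff)

# Scheduled 600 -> 720 (a 120-minute flight); left 20 min late, arrived 15 min late.
flags = delay_flags(sched_dep=600, actual_dep=620, sched_arr=720, actual_arr=735)
```

In this example the flight is flagged as delayed overall and delayed at take-off, but not delayed at landing, because it made up 5 minutes in the air.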

Systemic Delay:
All delays may not be attributable to the airlines. For example, there may be a baggage handling system fault in an airport on a given day. All flights going out of that airport will be delayed that day. Or, there may be a security alert / heavy snow in an airport one day. All flights coming in or going out of that airport will be delayed on that day. These types of delays are called ‘systemic delay’. For evaluating airline performance, all systemic delays will need to be ignored. Also, consider only those airlines that have operated at least an average of 10 flights per day for at least 2 years (not necessarily contiguous) during the 3-year period.
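The airline eligibility rule at the end of the condition above could be checked roughly as follows. This sketch assumes "year" means calendar year and that each flight is available as an (airline code, date) pair; the sample years are placeholders, not the years in the dataset.

```python
from collections import Counter
from datetime import date

def eligible_airlines(flights, min_daily_avg=10, min_years=2):
    """flights: iterable of (airline_code, flight_date) pairs."""
    per_airline_year = Counter((airline, d.year) for airline, d in flights)
    qualifying_years = Counter()
    for (airline, year), n_flights in per_airline_year.items():
        days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
        if n_flights / days_in_year >= min_daily_avg:
            qualifying_years[airline] += 1
    # The qualifying years do not need to be contiguous.
    return {a for a, n in qualifying_years.items() if n >= min_years}

sample = (
    [("AA", date(2019, 1, 1))] * 3650
    + [("AA", date(2021, 6, 1))] * 3650
    + [("ZZ", date(2019, 1, 1))] * 3650
    + [("ZZ", date(2020, 1, 1))] * 100
)
eligible = eligible_airlines(sample)
```

Here the hypothetical carrier "ZZ" reaches the daily average in only one year, so only "AA" is kept.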

Airport Delay Analysis

Average Delay at Major Airports

First, we analyzed how the average delay varies across airports. We also analyzed arrival and departure delays separately.

Arrival Delay (Image by Author)
Departure Delay (Image by Author)

Note:
As mentioned in the case, only those airports that had, on average, at least 10 take-offs and 10 landings per day during the 3-year period were considered.
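A minimal sketch of this airport filter, assuming each flight is reduced to an (origin, destination) pair and the number of days in the period is known. The function name and sample data are illustrative only.

```python
from collections import Counter

def major_airports(flights, n_days, min_daily_avg=10):
    """flights: iterable of (origin, destination) pairs covering n_days days."""
    takeoffs, landings = Counter(), Counter()
    for origin, destination in flights:
        takeoffs[origin] += 1
        landings[destination] += 1
    return {
        airport
        for airport in takeoffs.keys() & landings.keys()
        if takeoffs[airport] / n_days >= min_daily_avg
        and landings[airport] / n_days >= min_daily_avg
    }

sample = (
    [("ATL", "LAX")] * 20
    + [("LAX", "ATL")] * 20
    + [("DAL", "ATL")] * 20
    + [("ATL", "DAL")] * 5
)
majors = major_airports(sample, n_days=2)
```

In the toy data, DAL has enough take-offs but too few landings, so it is excluded.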

Inference:
1) For all the major airports, there seems to be a high correlation between arrival and departure delay
2) “ATL” has the highest average arrival and departure delay, with the departure delay a little higher than the arrival delay

Seasonal Delay

In this part, we wanted to understand how seasonal delay varies across quarters. For this analysis, the average seasonal delay in each quarter of each year was plotted.

Note:
A seasonal delay will be recognized if a high proportion of delays (compared to the average proportion of delays at that airport) have occurred in the same set of consecutive days / weeks / months in at least 2 of the 3 years. Consider only those cases (days / airports) that have had at least 10 take-offs and 10 landings on a given day.
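One way to read this definition at quarter granularity is sketched below for a single airport. The case does not say how high a "high proportion" must be, so this sketch simply treats any quarter whose delayed share exceeds the airport's overall share as high; that threshold, the record layout, and the placeholder years are all assumptions. The per-day filter on take-offs and landings is left out for brevity.

```python
from collections import defaultdict

def seasonal_quarters(records, min_years=2):
    """records: iterable of (year, quarter, delayed_flag) for one airport."""
    totals = defaultdict(lambda: [0, 0])  # (year, quarter) -> [delayed, flights]
    all_delayed = all_flights = 0
    for year, quarter, delayed in records:
        totals[(year, quarter)][0] += delayed
        totals[(year, quarter)][1] += 1
        all_delayed += delayed
        all_flights += 1
    if all_flights == 0:
        return []
    baseline = all_delayed / all_flights  # average proportion at this airport
    years_above = defaultdict(int)
    for (year, quarter), (n_delayed, n_flights) in totals.items():
        if n_delayed / n_flights > baseline:
            years_above[quarter] += 1
    return sorted(q for q, n in years_above.items() if n >= min_years)

def make(year, quarter, n_delayed, n_total=10):
    """Build n_total toy records, n_delayed of them flagged as delayed."""
    return [(year, quarter, 1)] * n_delayed + [(year, quarter, 0)] * (n_total - n_delayed)

records = []
for year in (2001, 2002, 2003):  # placeholder years
    records += make(year, 1, 4) + make(year, 2, 1) + make(year, 3, 1)
    records += make(year, 4, 3 if year < 2003 else 1)
seasonal = seasonal_quarters(records)
```

With this toy data, quarter 1 is above the baseline in all three years and quarter 4 in two of them, so both are recognized as seasonal.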

Inference:
1) Seasonal behavior in the proportion of delayed flights is observed in the 1st and 4th quarters of every year
2) The increase in delays could be attributed to unfavorable meteorological conditions affecting operations

Systemic Delay Analysis

Before moving on to the systemic delay analysis, recall how systemic delay was calculated; see the conditions given under Derived Features.

Systemic Delay Calculation (Assumption):
Days on which the average delay at an airport is more than 15% higher than the regular delay at that airport
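A minimal sketch of this assumption, taking the "regular delay" to be the airport's average delay over the whole period (that reading, and the record layout, are assumptions of this example):

```python
from collections import defaultdict
from statistics import mean

def systemic_days(records, margin=0.15):
    """records: iterable of (airport, day, delay_minutes).

    Returns {(airport, day): 0 or 1}, mirroring the SYSTEMIC_DELAY flag.
    """
    by_airport = defaultdict(list)
    by_airport_day = defaultdict(list)
    for airport, day, delay in records:
        by_airport[airport].append(delay)
        by_airport_day[(airport, day)].append(delay)
    # "Regular delay" = mean delay at the airport over all days (assumption).
    regular = {a: mean(d) for a, d in by_airport.items()}
    return {
        key: int(mean(delays) > (1 + margin) * regular[key[0]])
        for key, delays in by_airport_day.items()
    }

records = [
    ("PDX", 1, 10), ("PDX", 1, 10),
    ("PDX", 2, 10), ("PDX", 2, 10),
    ("PDX", 3, 40), ("PDX", 3, 40),
]
flags = systemic_days(records)
```

The comparison only behaves as intended when the regular delay is positive; an airport whose average delay is negative would need a different rule.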

Percentage Systemic Delay over 3 years

As in the seasonal delay analysis, the percentage of systemic delay in each quarter was calculated.

Inference:
1) There is seasonality in delay. Delay in quarter 1 is maximum followed by quarter 4, quarter 2, and quarter 3.

Average Systemic Delay at Major Airports

Next, we analyzed how systemic delay varied across the major airports. In the graph below, the y-axis shows the percentage of flights delayed due to systemic delay.

Inference:
1) Systemic delay is highest for PDX, where there is systemic delay more than 55% of the time.
2) Systemic delay is lowest for DFW

Volume vs Delay Analysis

Let's understand how flight volume at different airports affects take-off and landing delay. The total number of take-offs was plotted alongside the average percentage delay in take-offs from major airports to understand the relation between flight volume and delay. A similar analysis was done for landings.

Inference:
1) At airports such as DAL, the average percentage delay in take-offs is higher than at other major airports even though take-off volume is low. Also, there is no landing delay.
2) At many airports, such as LAX, both landings and take-offs are delayed and traffic is high. This suggests the inefficiency is due to congestion

Route Analysis

Next, we analyzed the delay on the busiest routes.

Flight Delay on Busiest Route

Inference:
1. Routes connecting to ‘LAX’ are the busiest. 6 out of the 10 busiest routes include ‘LAX’.
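A count like the one behind this inference can be sketched as follows. Routes are treated here as directed (origin, destination) pairs, which is an assumption; the sample flights are invented for the example.

```python
from collections import Counter

def busiest_routes(flights, top_n=10):
    """flights: iterable of (origin, destination); returns [(route, count), ...]."""
    return Counter(flights).most_common(top_n)

def routes_through(routes, airport):
    """How many of the given routes start or end at the airport."""
    return sum(airport in route for route, _ in routes)

flights = (
    [("LAX", "SFO")] * 5
    + [("SFO", "LAX")] * 4
    + [("JFK", "BOS")] * 3
    + [("ORD", "DFW")] * 1
)
top_routes = busiest_routes(flights, top_n=3)
lax_share = routes_through(top_routes, "LAX")
```

Treating LAX to SFO and SFO to LAX as one undirected route would only require sorting each pair before counting.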

Routes with Maximum Flight Delay

Inference:
1. Compared to the busiest routes (by air traffic), the problematic routes (by delay duration) have nearly 1.5x more delay

Distance Vs. Delay

Inference:
1. Arrival and departure delays increase with distance up to a distance of 2200 units
2. In the distance range of 2900 to 3800 units, there is a sharp decline in delays

It would be interesting to understand why there is a sudden drop in delay between 2900 and 3800 units.

Geographical Analysis

We grouped airports into clusters based on their average arrival and departure delay, and then plotted the airports in each cluster on a map for better understanding.

Cluster 0 (in red) denotes airports with the lowest arrival and departure delay, followed by cluster 2, cluster 1 and cluster 3.
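The article does not name the clustering method, so the following is only a sketch of one common choice, k-means, applied to (average arrival delay, average departure delay) pairs. It uses two clusters and made-up values to stay short, whereas the analysis above used four clusters.

```python
from math import dist

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm; the initial centroids are the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (avg arrival delay, avg departure delay) per airport, in minutes.
airport_delays = [(2, 3), (30, 35), (3, 2), (31, 33)]
labels, centroids = kmeans(airport_delays, k=2)
```

Taking the first k points as starting centroids keeps the example deterministic; real use would normally pick a more careful initialization.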

Inference:
1) Some airports are in the same geography but perform differently: New York, LA, Miami, Houston
2) For the above airports (high-delay cohort), geography, weather and terrain do not appear to have any influence, since other airports in similar geography (low-delay cohort) are performing well
3) Low-delay airports have an average elevation of 250 ft., whereas for mediocre-performing airports elevation ranges from 750 to 1250 ft.

Airlines Evaluation

Everyone wants to travel with the best airline. In this part, we evaluated the performance of different airlines.

Percentage of Flights per Company

Arrival and Departure Delay Density

Weekly Variation of Airlines Delay

Inference:
1) Significant disparity exists in the number of flights operated by each airline
2) DL, US & AA account for ~50% of the flights in operation
3) However, the differences in delay (arrival/departure) among airlines are less pronounced
4) The density plots show that most of the data centers around 0, so most flights had little to no delay
5) For most of the airlines, except for ‘WN’, the proportion of delays increases from Monday to Friday and dips during the weekend
6) The weekly delay pattern for airlines (except ‘NW’) is agnostic of fleet size

Arrival & Departure Delay

Inference:
1. Delays at arrival are generally lower than at departure
2. This indicates that airlines adjust their flight speed in order to reduce the delays at arrival
3. Except for AS & AA all other airlines have arrival delay less than the departure delay

Delay Magnitude across Air Carriers

Inference:
1. Independent of the airline, delays greater than 45 minutes account for only a few percent of flights
2. For AA and UA, there is no significant reduction in the share of flights when going from small to large delays

Airline Performance – Total Delays (Excluding Systemic Delay)

Airlines Ranking

1. Mean delays behave homogeneously among airlines
2. The low value of b is a consequence of the large proportion of flights that take off on time
3. The a & b coefficients will be correlated, with a∝1/b
4. Low values of a correspond to airlines with a large proportion of significant delays; on the contrary, airlines that benefit from their punctuality have high a values

Exponential Approximation:
f(x)=a exp(−x/b)
a & b are parameters obtained to describe each airline

This figure shows the normalized distribution of delays, which I modeled with an exponential distribution f(x)=a exp(−x/b). The a and b parameters obtained for each airline are given in the upper right corner of each panel. Note that the normalization of the distribution implies that ∫f(x)dx∼1. We do not have a strict equality here, since the normalization applies to the histograms but not to the model function. However, this relation means that the a and b coefficients will be correlated, with a∝1/b, and hence only one of the two values is needed to describe the distributions. Finally, from the value of either a or b, it is possible to rank the companies: low values of a correspond to airlines with a large proportion of significant delays, whereas airlines that shine for their punctuality have high a values.
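To see where a∝1/b comes from: for x ≥ 0, the integral of a exp(−x/b) from 0 to infinity equals a·b, so a distribution normalized to 1 gives a ≈ 1/b. The sketch below recovers a and b from binned densities with a straight-line fit to ln f(x) = ln a − x/b. The fitting procedure actually used for the figure is not described, so this log-linear least squares is an assumption, and the input densities are synthetic.

```python
from math import exp, log

def fit_exponential(xs, densities):
    """Least-squares line through (x, ln density); returns (a, b)."""
    pairs = [(x, log(d)) for x, d in zip(xs, densities) if d > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
        (x - mean_x) ** 2 for x, _ in pairs
    )
    intercept = mean_y - slope * mean_x
    # ln f(x) = ln a - x / b, so a = exp(intercept) and b = -1 / slope.
    return exp(intercept), -1.0 / slope

# Synthetic densities generated from a = 0.05, b = 20, so a * b = 1.
xs = [2.5, 7.5, 12.5, 17.5, 22.5]
densities = [0.05 * exp(-x / 20) for x in xs]
a, b = fit_exponential(xs, densities)
```

On real histograms the product a·b will only be close to 1, for the reason given above: the normalization applies to the histogram, not to the fitted curve.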

Categorizing Delay Among Airlines (Image by Author)

You can reach out to me on LinkedIn. Follow me for more articles on Analytics and Data Science.
