Introduction
This work was the winning solution that my teammate Shubham Gupta and I submitted to the analytics competition organized by the Analytics Club of IIM Indore.
You can download the case from here and data from here. You can also download the code from my Github profile.
Objective:
The objective of the case was to find patterns and insights in the data that help explain the reasons for flight delays.
Data-set
The dataset consisted of 36 files, each containing the data for one calendar month. Together they record all flights taking off from and landing at US airports over a period of 3 years.
Data Preprocessing
The data types of some variables, such as dates, were converted, and irrelevant columns were dropped as specified in the case. After this, missing values were handled.
Missing Value Treatment
All rows with missing values in the delay-related columns were removed
Missing values in the other columns were filled with the median
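The two rules above can be sketched in pandas as follows. This is only a minimal illustration, not the original competition code: the column names (`ARR_DELAY`, `DEP_DELAY`, `TAXI_OUT`) and the helper `treat_missing` are hypothetical, and it assumes the median is computed per numeric column after the incomplete rows are dropped.

```python
import numpy as np
import pandas as pd

def treat_missing(df, delay_cols):
    """Drop rows with missing delay values, then median-fill other numeric columns."""
    # Rule 1: remove rows where any delay-related column is missing.
    cleaned = df.dropna(subset=delay_cols).copy()
    # Rule 2 (assumption): fill each remaining numeric column with its own median.
    other_numeric = cleaned.select_dtypes(include="number").columns.difference(delay_cols)
    cleaned[other_numeric] = cleaned[other_numeric].fillna(cleaned[other_numeric].median())
    return cleaned

# Hypothetical toy data; the real files have many more columns.
flights = pd.DataFrame({
    "ARR_DELAY": [5.0, np.nan, 12.0, -3.0],
    "DEP_DELAY": [2.0, 4.0, np.nan, 0.0],
    "TAXI_OUT": [10.0, 12.0, 15.0, np.nan],
})
cleaned = treat_missing(flights, ["ARR_DELAY", "DEP_DELAY"])
print(cleaned)
```

Only the first and last rows survive the drop, and the missing `TAXI_OUT` value is filled with the median of what remains.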
Feature Engineering
New features based on conditions given in the problem statement were created.
Derived Features
- ‘Status’: To show delay status
- ‘DELAYED’: Based on condition given in Problem Statement
- ‘LANDING_DELAYED’: Based on condition given in Problem Statement
- ‘TAKEOFF_DELAYED’: Based on condition given in Problem Statement
- ‘SYSTEMIC_DELAY’: To show delay at the airport level: 1 if the delay is more than 15% above the regular delay, else 0
Condition mentioned in the Case:
Delayed: A flight will be considered delayed if (actual arrival time - scheduled arrival time) / (scheduled flying time) > 10%.
Landing Delay: A landing will be considered delayed if (actual arrival time - actual departure time - scheduled flying time) / (scheduled flying time) > 10%.
Take-off Delay: A take-off will be considered delayed if (actual departure time - scheduled departure time) / (scheduled flying time) > 10%.
Systemic Delay:
All delays may not be attributable to the airlines. For example, there may be a baggage handling system fault in an airport on a given day. All flights going out of that airport will be delayed that day. Or, there may be a security alert / heavy snow in an airport one day. All flights coming in or going out of that airport will be delayed on that day. These types of delays are called ‘systemic delay’. For evaluating airline performance, all systemic delays will need to be ignored. Also, consider only those airlines that have operated at least an average of 10 flights per day for at least 2 years (not necessarily contiguous) during the 3-year period.
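The three ratio conditions (Delayed, Landing Delay, Take-off Delay) can be sketched as below. This is an illustrative sketch under assumptions: all times are taken to be minutes on a common clock, and the column names (`SCHED_DEP`, `ACT_DEP`, `SCHED_ARR`, `ACT_ARR`, `SCHED_FLY_TIME`) are hypothetical stand-ins for whatever the raw files use.

```python
import pandas as pd

THRESHOLD = 0.10  # the 10% cut-off from the case

def add_delay_flags(df):
    """Add the three 0/1 delay flags defined in the case (hypothetical column names)."""
    out = df.copy()
    fly = out["SCHED_FLY_TIME"]
    # Delayed: arrival lateness relative to scheduled flying time.
    out["DELAYED"] = ((out["ACT_ARR"] - out["SCHED_ARR"]) / fly > THRESHOLD).astype(int)
    # Landing delay: extra time spent between actual departure and actual arrival.
    out["LANDING_DELAYED"] = (
        (out["ACT_ARR"] - out["ACT_DEP"] - fly) / fly > THRESHOLD
    ).astype(int)
    # Take-off delay: departure lateness relative to scheduled flying time.
    out["TAKEOFF_DELAYED"] = ((out["ACT_DEP"] - out["SCHED_DEP"]) / fly > THRESHOLD).astype(int)
    return out

# Three toy flights, each scheduled to fly 100 minutes.
flights = pd.DataFrame({
    "SCHED_DEP": [0, 0, 0],
    "ACT_DEP": [5, 30, 0],
    "SCHED_ARR": [100, 100, 100],
    "ACT_ARR": [105, 125, 120],
    "SCHED_FLY_TIME": [100, 100, 100],
})
flagged = add_delay_flags(flights)
print(flagged[["DELAYED", "LANDING_DELAYED", "TAKEOFF_DELAYED"]])
```

The second toy flight leaves 30 minutes late but makes up 5 minutes in the air, so it counts as a delayed flight and a delayed take-off but not a delayed landing.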
Airport Delay Analysis
Average Delay at Major Airports
First, we analyzed how the average delay varies across different airports. We also analyzed arrival and departure delays separately.
Note:
As mentioned in the case, only those airports that have had at least 10 take-offs and 10 landings per day on average during the 3-year period were considered.
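A minimal sketch of this major-airport filter is shown below. The `ORIGIN` and `DEST` column names and the `major_airports` helper are assumptions for illustration, with one row per flight and `n_days` being the number of days covered by the data.

```python
import pandas as pd

def major_airports(df, n_days, min_per_day=10):
    """Return airports averaging at least min_per_day take-offs AND landings per day."""
    takeoffs = df.groupby("ORIGIN").size() / n_days
    landings = df.groupby("DEST").size() / n_days
    # Outer-join the two counts; an airport missing from one side gets 0.
    daily = pd.concat([takeoffs, landings], axis=1, keys=["takeoffs", "landings"]).fillna(0)
    keep = (daily["takeoffs"] >= min_per_day) & (daily["landings"] >= min_per_day)
    return sorted(daily.index[keep])

# Toy example: 2 days of data and a lowered threshold of 2 flights per day.
flights = pd.DataFrame({
    "ORIGIN": ["A"] * 4 + ["B"] * 4 + ["C"],
    "DEST": ["B"] * 4 + ["A"] * 4 + ["A"],
})
majors = major_airports(flights, n_days=2, min_per_day=2)
print(majors)
```

On the real data the call would use the default `min_per_day=10` and the number of days in the 3-year period.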
Inference:
1) For all the major airports, there seems to be a high correlation between arrival and departure delay
2) “ATL” has the highest average arrival and departure delay, with departure delay slightly higher than arrival delay
Seasonal Delay
In this part, we wanted to understand how seasonal delays vary across quarters. For this analysis, the average seasonal delay in each quarter of each year was plotted, as shown below.
Note:
A seasonal delay will be recognized if a high proportion of delays (compared to the average proportion of delays at that airport) have occurred in the same set of consecutive days / weeks / months in at least 2 of the 3 years. Consider only those cases (days / airports) that have had at least 10 take-offs and 10 landings on a given day.
Inference:
1) Seasonal behavior in the proportion of delayed flights is observed in the 1st and 4th quarters of every year
2) The increase in delays could be attributed to unfavorable meteorological conditions affecting operations
Systemic Delay Analysis
Before going into the systemic delay analysis, recall how systemic delay was calculated; see the conditions given under Derived Features.
Systemic Delay Calculation (Assumption):
A day counts as a systemic-delay day at an airport if the average delay on that day is more than 15% above the regular delay at that airport
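One possible reading of this assumption is sketched below. The write-up does not define "regular delay" precisely, so this sketch takes it to be the mean of the daily average delays at each airport; that choice, the column names (`AIRPORT`, `DATE`, `DELAY`), and the helper `flag_systemic_days` are all assumptions. The comparison also presumes the regular delay is positive.

```python
import pandas as pd

def flag_systemic_days(df, margin=0.15):
    """Flag airport-days whose average delay exceeds the airport's regular delay by > margin."""
    # Average delay per airport per day.
    daily = df.groupby(["AIRPORT", "DATE"], as_index=False)["DELAY"].mean()
    daily = daily.rename(columns={"DELAY": "DAILY_AVG"})
    # Assumption: "regular delay" = mean of the daily averages at that airport.
    daily["REGULAR"] = daily.groupby("AIRPORT")["DAILY_AVG"].transform("mean")
    daily["SYSTEMIC_DELAY"] = (
        daily["DAILY_AVG"] > (1 + margin) * daily["REGULAR"]
    ).astype(int)
    return daily

# Toy data: airport X has one unusually bad day, airport Y is steady.
flights = pd.DataFrame({
    "AIRPORT": ["X"] * 6 + ["Y"] * 2,
    "DATE": ["d1", "d1", "d2", "d2", "d3", "d3", "d1", "d2"],
    "DELAY": [10, 10, 10, 10, 40, 40, 5, 5],
})
systemic = flag_systemic_days(flights)
print(systemic)
```

For airport X the regular delay is 20 minutes, so only the day averaging 40 minutes crosses the 15% margin.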
Percentage Systemic Delay over 3 years
Similar to the seasonal delay analysis, the percentage of systemic delay was calculated for each quarter.
Inference:
1) There is seasonality in delay. Delay is highest in quarter 1, followed by quarter 4, quarter 2, and quarter 3.
Average Systemic Delay at Major Airports
Next, we analyzed how systemic delay varies across the major airports. In the graph below, the y-axis shows the percentage of flights delayed due to systemic delay.
Inference:
1) Systemic delay is highest for PDX, where there is systemic delay more than 55% of the time.
2) Systemic delay is lowest for DFW.
Volume vs Delay Analysis
Let's understand how flight volume at different airports affects take-off and landing delay. The total number of take-offs from each major airport was plotted along with the average percentage delay in take-offs to understand the relation between flight volume and delay. A similar analysis was done for landings.
Inference:
1) At some airports, such as DAL, the take-off volume is low, yet the average percentage delay is higher than at other major airports. Also, there is no landing delay.
2) At many airports, such as LAX, both landings and take-offs are delayed, and traffic is high. This suggests the inefficiency is due to congestion
Route Analysis
Next, we analyzed the delay on the busiest routes.
Flight Delay on Busiest Route
Inference:
1. Routes connecting to ‘LAX’ are the busiest. 6 of the 10 busiest routes include ‘LAX’.
Routes with Maximum Flight Delay
Inference:
1. Compared to the busiest routes (by air traffic), the problematic routes (by delay duration) have nearly 1.5x more delay
Distance Vs. Delay
Inference:
1. Arrival and departure delays increase with distance up to 2200 units
2. In the distance range of 2900 to 3800 units, there is a sharp decline in delays
It would be interesting to understand why there is a sudden drop in delay between 2900 and 3800 units.
Geographical Analysis
We grouped the airports into clusters on the basis of average arrival and departure delay, then plotted the airports of each cluster on a map for better understanding.
Cluster 0 (in red) denotes the airports with the lowest arrival and departure delay, followed by cluster 2, cluster 1, and cluster 3.
Inference:
1) Airports in the same geography show different performance: New York, LA, Miami, Houston
2) For the above airports (high-delay cohort), geography, weather, and terrain do not appear to have any influence, as other airports in similar geography (low-delay cohort) are performing well
3) Low-delay airports have an average elevation of 250 ft., whereas for mediocre-performing airports the elevation ranges from 750–1250 ft.
Airlines Evaluation
Everyone wants to travel with the best airline. In this part, we evaluated the performance of different airlines.
Percentage of Flights per Company
Arrival and Departure Delay Density
Weekly Variation of Airlines Delay
Inference:
1) Significant disparity exists in the number of flights operated by each airline
2) DL, US & AA account for ~50% of the carriers in operation
3) However, the differences in delay (arrival/departure) among airlines are less pronounced
4) The density plot shows that most of the data is centered around 0, so most flights had little to no delay
5) For most of the airlines, except for ‘WN’, the proportion of delays increases from Monday to Friday and dips during the weekend
6) The weekly delay pattern for airlines (except ‘NW’) is independent of fleet size
Arrival & Departure Delay
Inference:
1. Delays at arrival are generally lower than at departure
2. This indicates that airlines adjust their flight speed in order to reduce the delays at arrival
3. Except for AS & AA, all airlines have an arrival delay lower than their departure delay
Delay Magnitude across Air Carriers
Inference:
1. Regardless of the airline, delays greater than 45 minutes account for only a few percent of flights
2. For AA and UA, there is no significant reduction in going from small to large delays
Airline Performance –Total Delays (Excluding Systemic Delay)
Airlines Ranking
- Mean delays behave homogeneously among airlines
- The low value of “b” is a consequence of the large proportion of flights that take off on time
- The a & b coefficients are correlated, with a∝1/b
- Low values of “a” correspond to airlines with a large proportion of significant delays; conversely, airlines that benefit from their punctuality have high a values
Exponential Approximation:
f(x)=a exp(−x/b)
a & b are parameters obtained to describe each airline
This figure shows the normalized distribution of delays, which I modeled with an exponential distribution f(x)=a exp(−x/b). The a and b parameters obtained for each airline are given in the upper right corner of each panel. Note that the normalization of the distribution implies that ∫f(x)dx∼1. We do not have a strict equality, since the normalization applies to the histograms but not to the model function. Because integrating a exp(−x/b) over x from 0 to infinity gives a·b, this relation means that the a and b coefficients are correlated, with a∝1/b, and hence only one of the two values is needed to describe the distributions. Finally, the value of either a or b can be used to rank the companies: low values of a correspond to airlines with a large proportion of significant delays, whereas airlines that stand out for their punctuality have high a values.
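To make the a∝1/b relation concrete, here is a small sketch on synthetic data. It is not the original fitting code: it swaps in a plain log-linear least-squares fit with NumPy (since log f(x) = log a − x/b is linear in x), and the helper name `fit_exponential`, the bin count, and the 100-minute cut-off are assumptions.

```python
import numpy as np

def fit_exponential(delays, bins=50, max_delay=100.0):
    """Fit f(x) = a * exp(-x / b) to the normalized histogram of positive delays."""
    delays = np.asarray(delays, dtype=float)
    delays = delays[(delays > 0) & (delays <= max_delay)]
    # density=True normalizes the histogram so that it integrates to 1.
    density, edges = np.histogram(delays, bins=bins, range=(0.0, max_delay), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = density > 0  # the log is undefined for empty bins
    # Straight-line fit of log(density) against delay: slope = -1/b, intercept = log(a).
    slope, intercept = np.polyfit(centers[mask], np.log(density[mask]), 1)
    return np.exp(intercept), -1.0 / slope

# Synthetic delays drawn from an exponential with a mean of 20 minutes.
rng = np.random.default_rng(0)
synthetic_delays = rng.exponential(scale=20.0, size=100_000)
a, b = fit_exponential(synthetic_delays)
print(f"a = {a:.4f}, b = {b:.2f}, a*b = {a * b:.3f}")
```

On this synthetic sample the fitted b should land close to 20 and the product a·b close to 1, which is the sense in which a single coefficient is enough to rank the airlines.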
You can reach out to me on LinkedIn. Follow me for more articles on Analytics and Data Science.