1. HOW DATA CAN HELP TO REDUCE AVIATION ACCIDENTS 1
HOW DATA CAN HELP TO REDUCE
AVIATION ACCIDENTS
Carmen Serpa, Northwestern University. Sunilkumar Kakade, Northwestern University
Abstract - This project is based on the data from the National
Transportation Safety Board (NTSB). It involves classifying a set
of aircraft accident/incident data covering the years 2000 to 2015
in the United States. The NTSB provides one of the most
comprehensive online aircraft accident and incident databases. It
includes dates, places, aircraft and engine types, scheduled and
non-scheduled certificated air carrier, and name of air carrier.
These filters enable business aircraft operators to identify risks
associated with the aircraft they operate, the types of operations
and procedures they typically fly, and the airports they frequent.
The big data industry has mastered the art of gathering and
logging terabytes of data, but the challenge is to base forecasts
and decisions on this real data, which is why Hadoop is so
important.
The purpose of this study is to optimize analysis of the NTSB's
massive data with Hadoop and uncover hidden patterns that can
suggest how to reduce the rate of aviation accidents and thus
decrease risk and improve safety. While many factors can
contribute to accidents, it is important to identify which ones
are the most common in order to fix them.
Specifically, the focus is to see what we can learn from the past in
order to identify appropriate measures to prevent repetitive errors
in the future.
Hadoop should be used to extend the ability of human pattern
recognition to uncover accidents’ causes in multivariate data.
Although Hadoop can assimilate more data at any given time than
the analyst, it is the analyst, at the end, who must make the
necessary decisions and judgements about the data.
I. INTRODUCTION
Whether it is fine-grained monitoring of shop floor operations,
gauging consumer sentiment, or any number of other large-scale
analytic challenges, big data is having a tremendous impact
on the aviation enterprise. The amount of data generated in
every flight has risen over the years, and more and more
types of information are being stored in digital formats.
However, it is not just access to new data sources, but access
to patterns and interrelationships among these elements, that
is of interest. Collecting lots of diverse types of data very
quickly does not create value by itself; what matters is the
analysis of this data to uncover insights that will help your
organization.
This project is based on data from the National
Transportation Safety Board (NTSB) and consists of
26,931 records from 2000 to 2015. The NTSB handles not only
USA data but also world data, and it currently manages
data from the 1940s to the present. This generates such
tremendous amounts of log data that storage and analysis
become issues that need to be solved. Storing vehicle
monitoring data is very important for the NTSB in order to
provide information support for departments such as public
security, criminal investigation, economic investigation, and
front-line police.
Hadoop is widely used to solve many types of Big Data
problems, and for this study I have used some Hadoop
components, such as Hive, Pig, and Mahout, to analyze the
data. Many questions about aviation accidents in the USA have
been answered. However, the big question still remains:
how can aviation be safer, particularly as the number of
flights increases? Consequently, more questions can be posed
and analyzed in order to get a more complete analysis of this
data.
After analyzing this data, the most common theme in these
mishaps is the apparent failure to stabilize the aircraft
during landing, which can be due to many factors, such as the
apparent willingness of the flight crew to attempt to land
the aircraft in unsafe conditions. Analyzing this data
uncovered that bad meteorological conditions are not the main
reason for aviation accidents. A more exhaustive analysis
could be done only for airports where bad meteorological
conditions have influenced aviation accidents. A detailed
analysis was done for the state of Illinois, revealing that most
aviation accidents in Illinois happen during landing and that
Cessna is the aircraft make involved in the most accidents
there during the last 15 years.
Based on the history of aviation accidents in the USA, probable
accidents have been forecast using Mahout for some
states. It predicts the phase of flight in which aviation
accidents are most likely to occur.
This project has demonstrated that Hadoop has a high degree
of sustainability and robustness in analyzing and storing
large quantities of data. Big data doesn’t only bring new data
types and storage mechanisms, but new types of analysis as
well. Big data analysis is a continuum, not an isolated set of
activities.
II. DATASET PREPARATION AND DESCRIPTION
This project is based on the data from the National
Transportation Safety Board (NTSB). The data contains
information from 1941 to the present from around the world, but
for the purpose of this study I have worked with data from
2000 to 2015 in the USA. The reason for this reduction is
that, after a quick inspection, I realized that much of the
data is missing, especially in the aircraft category field.
The period from 2000 to 2015 also has missing data, but in a
smaller proportion than before the year 2000.
Missing data for this study was replaced by “unknown”,
and thus calculations had to be based on this information.
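The replacement step described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used in the study: the helper name `fill_missing`, the column names, and the sample rows are made up for the example, and it assumes that a missing value shows up as an empty or whitespace-only CSV field.

```python
import csv
import io


def fill_missing(rows, placeholder="unknown"):
    """Return a copy of rows where empty or blank fields become the placeholder."""
    return [[field if field.strip() else placeholder for field in row]
            for row in rows]


# Made-up sample in CSV form (illustrative only, not real NTSB records).
sample_csv = (
    "EventId,State,AircraftCategory\n"
    "E1,IL,Airplane\n"
    "E2,CA,\n"
    "E3,,Helicopter\n"
)

rows = list(csv.reader(io.StringIO(sample_csv)))
cleaned = fill_missing(rows)

for row in cleaned:
    print(row)
```

Keeping the placeholder as an explicit category means that later counts (for example, per aircraft category) show how many records had no value instead of silently dropping them.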
After that, the aircraft accident and incident records were
uploaded into Hadoop HDFS using Hue. The workflow
here is to download each year of aviation data, then
upload it using the “Upload –> Files” drop-down menu.
The aviation file was in CSV format. After the file was
uploaded, a table containing 32 attributes and 26,931 records
was created: Aviationusa.csv
However, to facilitate the use of Pig, I created a txt file
based on the original aviation accidents file in CSV. Pig can
also load a CSV file, but that requires a different script.
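One way to produce such a text file is sketched below in Python. It rests on an assumption the paper does not state: that the txt file is tab-delimited, which is what Pig's default loader (PigStorage) expects when a LOAD statement has no USING clause. The function name `csv_to_pig_text` and the sample rows are hypothetical.

```python
import csv
import io


def csv_to_pig_text(csv_text):
    """Convert CSV text to tab-delimited lines.

    The csv module handles quoted fields such as "CHICAGO, IL", so embedded
    commas survive. Tabs inside a field are replaced by spaces so that they
    cannot be mistaken for field separators.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        fields = [field.replace("\t", " ") for field in row]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Made-up sample rows (illustrative only).
sample_csv = (
    'E1,"CHICAGO, IL",IL,Landing\n'
    'E2,"LOS ANGELES, CA",CA,Cruise\n'
)

pig_text = csv_to_pig_text(sample_csv)
print(pig_text, end="")
```

Converting up front keeps the Pig script simple, at the cost of maintaining a second copy of the data.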
III. METHODOLOGY
To analyze this data I have used Hive and Pig in order to
uncover hidden patterns that help to answer specific
questions. Mahout is also used to predict aviation
accidents during a specific phase of flight. R is used in this
study to visualize the difference between injured and
uninjured persons in aviation accidents from 2000 to 2015.
IV. RESULTS
Analyzing Aviation Accidents Dataset with Hive
Every flight is different, but in some cases accidents follow
well-worn patterns. Whether heedless, hapless, or simply
clueless, pilots keep falling into the same traps that have
snared others before them and that is what I will try to
discover in this study.
I have performed quick ad hoc analyses of the dataset in
order to get some insight into the factors that may
cause air accidents.
First, I located the aviation accidents on a map of the USA in
Cloudera.
SELECT Latitude, Longitude FROM Aviationusa;
Map from Cloudera
Answering some questions:
1. Weather can be a factor in accidents.
Analyzing the data, we can see that in the USA it is not a
determining influence.
IMC - Instrument meteorological conditions - this means
flying in cloud or bad weather.
VMC - Visual meteorological conditions - an aviation
flight category in which pilots have sufficient visibility to
fly the aircraft.
2. Many aviation occurrence reporting systems
capture the phase of operation or the phase of
flight in which the event that is to be reported
occurred. Analyzing this factor, I found that
most accidents occurred during landing.
3. How many air accidents happened, by type of
injury and number of fatalities?
There were 21,063 non-fatal accidents, but 2,671
accidents with at least one fatality. See
charts below.
Number of Occurrences per type of Injury and quantity of
fatalities.
Number of Occurrences per type of Injury (general)
4. Which states had the most accidents during the
last 15 years?
California has the most registered accidents.
5. Which aircraft category has the most accidents?
Airplane
Chart from Cloudera indicating which type of aircraft has
been involved in the majority of air accidents:
6. For fatal accidents, what was the
aircraft’s phase of flight at the moment of the
accident and what was the aircraft damage?
7. Which aircraft manufacturers have been most involved in
accidents where their aircraft was completely
destroyed? Cessna
8. When a fatal accident happened, in what phase of
flight was the aircraft? What was the aircraft
category and model?
9. We can determine accidents on a specific date:
How many accidents happened on November 23,
2014 in the USA? Were they fatal?
10. During the last 15 years, how many accidents
happened per state per type of Injury Severity?
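Most of the questions above reduce to counting records per group. The sketch below illustrates that pattern for questions 1, 2, and 10 using Python's built-in sqlite3 module as a local stand-in for Hive. It is not the set of queries actually run in the study: the rows are made up, the column names are borrowed from the Pig schema later in this section, and the helper `count_by` is hypothetical. The generated SQL uses only SELECT, COUNT(*), GROUP BY, and ORDER BY, which I would expect to carry over to HiveQL with little or no change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Aviationusa (
           state TEXT,
           injuryseverity TEXT,
           weathercondition TEXT,
           broadphaseofflight TEXT)"""
)

# Made-up rows (illustrative only, not real NTSB records).
sample_rows = [
    ("IL", "Non-Fatal", "VMC", "Landing"),
    ("IL", "Fatal(1)", "IMC", "Cruise"),
    ("CA", "Non-Fatal", "VMC", "Landing"),
    ("CA", "Non-Fatal", "VMC", "Takeoff"),
    ("CA", "Fatal(2)", "VMC", "Maneuvering"),
]
conn.executemany("INSERT INTO Aviationusa VALUES (?, ?, ?, ?)", sample_rows)


def count_by(*columns):
    """Count accidents per distinct combination of the given columns."""
    cols = ", ".join(columns)
    sql = (
        f"SELECT {cols}, COUNT(*) AS ct FROM Aviationusa "
        f"GROUP BY {cols} ORDER BY ct DESC, {cols}"
    )
    return conn.execute(sql).fetchall()


weather_counts = count_by("weathercondition")               # question 1
phase_counts = count_by("broadphaseofflight")               # question 2
state_severity_counts = count_by("state", "injuryseverity")  # question 10

print(weather_counts)
print(phase_counts)
print(state_severity_counts)
```

Questions restricted to fatal accidents or to a single date would add a WHERE clause to the same pattern.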
Analyzing Aviation Accidents in Illinois using Pig
How many accidents occurred during the last 15 years per state:
A = LOAD '/user/hive/warehouse/aviation/Aviationusa.txt'
AS (
eventid: CHARARRAY,
investigationtype: CHARARRAY,
accidentnumber: CHARARRAY,
eventdate: CHARARRAY,
location: CHARARRAY,
state: CHARARRAY,
country: CHARARRAY,
latitude: FLOAT,
longitude: FLOAT,
airportcode: CHARARRAY,
airportname: CHARARRAY,
injuryseverity: CHARARRAY,
aircraftdamage: CHARARRAY,
aircraftcategory: CHARARRAY,
registrationnumber: CHARARRAY,
make: CHARARRAY,
model: CHARARRAY,
amateurbuilt: CHARARRAY,
numberofengines: INT,
enginetype: CHARARRAY,
fardescription: CHARARRAY,
schedule: CHARARRAY,
purposeofflight: CHARARRAY,
aircarrier: CHARARRAY,
totalfatalinjuries: INT,
totalseriousinjuries: INT,
totalminorinjuries: INT,
totaluninjured: INT,
weathercondition: CHARARRAY,
broadphaseofflight: CHARARRAY,
reportstatus: CHARARRAY,
publicationdate: CHARARRAY
);
GF = Group A BY state;
CC = FOREACH GF GENERATE group, COUNT(A) AS ct;
top = ORDER CC BY ct DESC;
top10 = LIMIT top 10;
DUMP top10;
1. How many accidents happened in Illinois, by
injury severity:
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY injuryseverity;
injury = FOREACH GF GENERATE group, COUNT(FF)
AS ct;
DUMP injury;
2. How many accidents happened in Illinois,
based on the broad phase of flight?
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY broadphaseofflight;
injury = FOREACH GF GENERATE group, COUNT(FF)
AS ct;
top = ORDER injury BY ct DESC;
DUMP top;
3. Which aircraft makes were most involved in
air accidents in Illinois?
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY make;
injurymake = FOREACH GF GENERATE group,
COUNT(FF) AS ct;
top = ORDER injurymake BY ct DESC;
top10 = LIMIT top 10;
DUMP top10;
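The Pig scripts above all follow the same FILTER, GROUP, COUNT, ORDER, LIMIT pattern. For checking results on a small local sample, that pattern can be mirrored in plain Python, as in the sketch below. The helper `top_counts` and the sample records are hypothetical. Note one deliberate difference: the Pig filters compare against ' IL ' with surrounding spaces, whereas this sketch trims whitespace before comparing, which is an assumption about how the state field is stored.

```python
from collections import Counter


def top_counts(records, key, state=None, limit=None):
    """Count records per value of `key`, optionally for one state only.

    Mirrors FILTER (state), GROUP + COUNT (key), ORDER ... DESC and LIMIT.
    Ties are broken alphabetically so the result is deterministic.
    """
    selected = [r for r in records
                if state is None or r["state"].strip() == state]
    counts = Counter(r[key].strip() for r in selected)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if limit is None else ranked[:limit]


# Made-up records (illustrative only).
records = [
    {"state": " IL ", "make": "CESSNA", "broadphaseofflight": "Landing"},
    {"state": " IL ", "make": "PIPER", "broadphaseofflight": "Landing"},
    {"state": " IL ", "make": "CESSNA", "broadphaseofflight": "Takeoff"},
    {"state": " CA ", "make": "BEECH", "broadphaseofflight": "Cruise"},
]

per_state = top_counts(records, "state", limit=10)
illinois_makes = top_counts(records, "make", state="IL", limit=10)
illinois_phases = top_counts(records, "broadphaseofflight", state="IL")

print(per_state)
print(illinois_makes)
print(illinois_phases)
```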
Analyzing Aviation Accidents data with R
To analyze this data, I can also use R, which is a powerful
open source language. In the beginning, big data and R
were not natural friends. R programming requires that all
objects be loaded into the main memory of a single
machine, while distributed systems such as Hadoop
lack strong statistical techniques but are ideal for scaling
complex operations and tasks.
Now, integrations of Hadoop with R connect the two and
combine high-level programming and querying languages
with Hadoop.
Through R, I can examine complex data sets,
manipulate data through sophisticated modeling functions,
and create sleek graphics to represent the numbers, in just a
few lines of code. It has been likened to a hyperactive version
of Excel.
Just to give a glimpse of what we can do with R, in this
study I have used it to visualize the difference between
injured and uninjured people from 2000 to 2015.
Fortunately, the difference is huge: far more of the people
involved in accidents were uninjured than injured.
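The comparison behind that chart amounts to two column totals. The R code used in the study is not shown, so the following is only a Python sketch of the arithmetic under stated assumptions: "injured" is taken to be the sum of the fatal, serious, and minor injury columns from the Pig schema, values that cannot be parsed as integers (such as "unknown") count as zero, and the sample records are made up.

```python
INJURY_COLUMNS = ("totalfatalinjuries", "totalseriousinjuries", "totalminorinjuries")


def to_int(value):
    """Parse a count; treat missing or non-numeric values as zero."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def injured_vs_uninjured(records):
    """Return (total injured, total uninjured) across all records."""
    injured = 0
    uninjured = 0
    for record in records:
        injured += sum(to_int(record.get(col)) for col in INJURY_COLUMNS)
        uninjured += to_int(record.get("totaluninjured"))
    return injured, uninjured


# Made-up records (illustrative only).
records = [
    {"totalfatalinjuries": "1", "totalseriousinjuries": "0",
     "totalminorinjuries": "2", "totaluninjured": "3"},
    {"totalfatalinjuries": "unknown", "totalseriousinjuries": "0",
     "totalminorinjuries": "1", "totaluninjured": "150"},
]

injured, uninjured = injured_vs_uninjured(records)

# Crude text bar chart, one '#' per 10 people (rounded down).
for label, total in (("Injured", injured), ("Uninjured", uninjured)):
    print(f"{label:<10} {total:>5} {'#' * (total // 10)}")
```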
Analyzing Aviation accidents data with Apache Mahout
Apache Mahout is a machine-learning library with
tools for clustering, classification, and several types of
recommenders, including tools to calculate most-similar
items or build item recommendations for users. Mahout
employs the Hadoop framework to distribute calculations
across a cluster, and now includes additional work
distribution methods.
For this project I use the recommenditembased job with the
SIMILARITY_COOCCURRENCE similarity measure, because I
want to use Mahout’s matrix algebra to get from
user-behavior histories to useful indicators. To prepare the
data for building a recommendation engine with Mahout, I
codified all states and broad phases of flight. Aircraft
damage was used to rate the phase of flight per state. With
this coding, the data is in the input format that the Mahout
recommender expects:
Broad Phase of Flight
Approach 0
Climb 1
Cruise 2
Descent 3
Go_Around 4
Landing 5
Maneuvering 6
Other 7
Standing 8
Takeoff 9
Taxi 10
Aircraft Damage
Unknown 1
Minor 2
Substantial 3
Destroyed 4
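The coding above can be sketched in Python as follows, producing one `state-id,phase-id,damage-rating` line per record, in the user,item,preference layout. The phase and damage codes are copied from the two tables; everything else is an assumption made for illustration: the numeric state ids (the paper does not list its state codes, so they are assigned here in alphabetical order), the function name `encode_for_mahout`, and the sample records. How repeated state/phase pairs were combined in the study is not stated, so this sketch leaves them as separate lines.

```python
PHASE_IDS = {
    "Approach": 0, "Climb": 1, "Cruise": 2, "Descent": 3, "Go_Around": 4,
    "Landing": 5, "Maneuvering": 6, "Other": 7, "Standing": 8,
    "Takeoff": 9, "Taxi": 10,
}
DAMAGE_RATINGS = {"Unknown": 1, "Minor": 2, "Substantial": 3, "Destroyed": 4}


def encode_for_mahout(records):
    """Return (state_ids, lines) where each line is 'state,phase,rating'."""
    # Hypothetical state coding: alphabetical order, starting at 0.
    states = sorted({record["state"] for record in records})
    state_ids = {state: index for index, state in enumerate(states)}
    lines = []
    for record in records:
        user = state_ids[record["state"]]
        item = PHASE_IDS[record["broadphaseofflight"]]
        rating = DAMAGE_RATINGS[record["aircraftdamage"]]
        lines.append(f"{user},{item},{rating}")
    return state_ids, lines


# Made-up records (illustrative only).
records = [
    {"state": "IL", "broadphaseofflight": "Landing", "aircraftdamage": "Substantial"},
    {"state": "CA", "broadphaseofflight": "Cruise", "aircraftdamage": "Destroyed"},
    {"state": "IL", "broadphaseofflight": "Takeoff", "aircraftdamage": "Minor"},
]

state_ids, lines = encode_for_mahout(records)
print(state_ids)
print("\n".join(lines))
```

Keeping the `state_ids` mapping is what allows the numeric ids in the recommender output to be translated back into state abbreviations.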
For convenience and simplicity in this project, I
configured the recommendation engine to give 1
recommendation per state.
This is the command to create the aviation accident
recommender in Mahout:
[cloudera@quickstart ~]$ mahout recommenditembased \
    --input /user/cloudera/Avion.csv \
    --tempDir /user/cloudera/temp0 \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --output /user/cloudera/result0 \
    --numRecommendations 1
State [Phase of flight-id: recommendation-strength]
AK [Approach: 2.0]
AL [Landing: 2.0]
AZ [Approach: 2.0]
CO [Approach: 2.0]
CT [Cruise: 2.0]
DC [Cruise: 2.0]
DE [Go-Around: 2.1316726]
GA [Approach: 2.0]
HI [Go-Around: 2.469751]
IA [Standing: 2.2961373]
IL [Cruise: 2.5659575]
LA [Landing: 2.7575758]
MT [Standing: 2.639485]
ND [Standing: 2.4506438]
NE [Go-Around: 2.1032028]
NJ [Approach: 2.0]
NM [Landing: 2.0]
NV [Maneuver: 2.3461537]
NY [Standing: 2.508772]
OH [Standing: 2.2735848]
OK [Climb: 2.1538463]
OR [Cruise: 2.2248063]
PA [Standing: 2.1374407]
RI [Takeoff: 2.0]
SC [Landing: 2.131068]
SD [Landing: 2.0]
TN [Standing: 2.2230768]
TX [Approach: 2.0]
UT [Takeoff: 2.0]
VA [Approach: 2.0]
VT [Approach: 2.0]
WA [Descent: 1.7841727]
WI [Descent: 2.0267856]
WV [Landing: 2.0]
WY [Landing: 2.0]
AR [Cruise: 2.0]
Mahout uses cross-action cooccurrence analysis to limit
the views to ones that do predict the phase of flight in
which an accident may occur. Mahout does this by treating
the primary action (phase of flight) as data for the indicator
matrix and using the secondary action (aircraft damage) to
calculate the cross-cooccurrence indicator matrix.
The first value is a user id, in this case a state, and the
key-value pairs inside the brackets are
phase of flight-id:recommendation-strength tuples. The maximum
recommendation strength, one hundred percent, is 4.0 in this
case. The results show preference values greater and less than 2.
Values less than 2.0 mean that there is a weak possibility
that the aircraft is destroyed during a specific phase
of flight; 2.0 is just an average value; and values greater than
2.5 indicate a strong possibility that an aircraft
in a given phase of flight will end in an accident
in which the aircraft is destroyed.
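To make the idea of cooccurrence-based scoring more concrete, the sketch below implements a simplified item-based recommender in plain Python. It is not Mahout's implementation and is not expected to reproduce the strengths listed above. It assumes that the similarity between two phases of flight is the number of states in which both appear, and that the predicted strength of an unseen phase for a state is the similarity-weighted average of that state's existing damage ratings. All names and sample values are made up.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(prefs):
    """Count, for each ordered pair of items, how many users have both."""
    counts = defaultdict(int)
    for items in prefs.values():
        for a, b in permutations(items, 2):
            counts[(a, b)] += 1
    return counts


def recommend(prefs, user, top_n=1):
    """Score items the user has not rated and return the top_n of them."""
    counts = cooccurrence(prefs)
    rated = prefs[user]
    all_items = {item for items in prefs.values() for item in items}
    scores = {}
    for candidate in all_items - set(rated):
        weights = [(counts[(candidate, item)], rating)
                   for item, rating in rated.items()]
        total = sum(weight for weight, _ in weights)
        if total:  # skip candidates that never co-occur with rated items
            scores[candidate] = sum(w * r for w, r in weights) / total
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up preferences: state -> {phase of flight: damage rating 1..4}.
prefs = {
    "IL": {"Landing": 3, "Takeoff": 2},
    "CA": {"Landing": 3, "Takeoff": 4, "Cruise": 4},
    "TX": {"Landing": 2, "Cruise": 3},
}

recommendation = recommend(prefs, "IL")
print(recommendation)
```

Because the score is a weighted average of ratings between 1 and 4, it always falls within that range, which is consistent with the strengths reported above clustering around 2.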
V. CONCLUSION
The only sure way to avoid risks associated with accidents
and incidents is to study mishap history. This project has
demonstrated that Hadoop has a high degree of sustainability
and robustness in analyzing and storing large quantities of
data. Using Hive we have answered some important questions:
most accidents occurred during landing, 2,671 accidents
involved at least one fatality from 2000 to 2015, and there
were 21,063 non-fatal accidents during this period.
One fascinating fact is that bad meteorological conditions
are not the main cause of aviation accidents.
Also, it was uncovered that California had the most
registered accidents during the past 15 years, with 296
fatalities.
The aircraft category with the most accidents is the
airplane, and the aircraft manufacturer most involved in
accidents where the aircraft was completely
destroyed is Cessna, with 766 accidents during the last 15
years. More interesting still, 121 of them
happened during the maneuvering phase of flight.
Pig was used to analyze aviation accidents in Illinois, and it
helped us uncover that most aviation accidents in Illinois
happen during landing (161 accidents). It was also
interesting to learn that Illinois had 482 accidents with no
fatalities from 2000 to 2015.
It was even more interesting to verify that in Illinois
Cessna is also the aircraft make involved in the most
accidents during the last 15 years.
Based on the history of aviation accidents in the USA, probable
accidents have been forecast using Mahout for some
states.
The goal of this study is to present some ideas to help
reduce the rate of aviation accidents; although many
factors can contribute to them, it is important to identify
which ones are the most common in order to fix them or at
least decrease them. Specifically, the focus is to see what
we can learn from the past in order to identify appropriate
measures to prevent repetitive errors in the future. For this
purpose, Hadoop has met expectations.