HOW DATA CAN HELP TO REDUCE
AVIATION ACCIDENTS
Carmen Serpa, Northwestern University. Sunilkumar Kakade, Northwestern University
Abstract - This project is based on data from the National
Transportation Safety Board (NTSB). It involves classifying a set
of aircraft accident and incident records covering the years 2000 to 2015
in the United States. The NTSB provides one of the most
comprehensive online aircraft accident and incident databases. It
includes dates, places, aircraft and engine types, scheduled and
non-scheduled certificated air carriers, and the name of the air carrier.
These filters enable business aircraft operators to identify risks
associated with the aircraft they operate, the types of operations
and procedures they typically fly, and the airports they frequent.
The big data industry has mastered the art of gathering and
logging terabytes of data, but the challenge is to base forecasts and
decisions on this real data, which is why Hadoop
is so important.
The purpose of this study is to analyze the large NTSB
dataset with Hadoop and uncover hidden patterns that can
suggest how to reduce the rate of aviation
accidents and thus decrease risk and improve safety. While many
factors can contribute to accidents, it is important to identify
which ones are the most common in order to address them.
Specifically, the focus is on what we can learn from the past in
order to identify appropriate measures to prevent repeated errors
in the future.
Hadoop should be used to extend the ability of human pattern
recognition to uncover the causes of accidents in multivariate data.
Although Hadoop can assimilate more data at any given time than
the analyst, it is the analyst, in the end, who must make the
necessary decisions and judgements about the data.
I. INTRODUCTION
Whether it is monitoring shop floor operations, gauging
consumer sentiment, or any number of other large-scale
analytic challenges, big data is having a tremendous impact
on the aviation enterprise. The amount of data
generated in every flight has risen over the years, and more
and more types of information are being stored in digital
formats. However, it is not just access to new data sources
that is of interest, but access to the patterns and
interrelationships among these elements. Collecting many diverse types
of data very quickly does not create value by itself; what matters is
the analysis of this data to uncover insights that will help your
organization.
This project is based on data from the National
Transportation Safety Board (NTSB) and consists of
26,931 records from 2000 to 2015. The NTSB handles not only
USA data but also world data, and it currently manages
data from the 1940s to the present. This generates such large
amounts of data that storage and analysis become issues
that need to be solved. Storing vehicle monitoring data is
very important for the NTSB in order to give information support
to departments such as public security, criminal
investigation, economic investigation, and front-line
police.
Hadoop is widely used to tackle Big Data problems,
and for this study I have used some Hadoop
components, namely Hive, Pig, and Mahout, to analyze the
data. Many questions about aviation accidents in the USA have
been answered. However, the big question still remains:
how can flying be made safer, particularly as the number
of flights increases? Consequently, more
questions can be posed and analyzed in order to get a more
complete picture from this data.
After analyzing this data, the most common theme in these
mishaps is the apparent failure to stabilize the aircraft
during landing, which can be due to many factors, such as the
apparent willingness of the flight crew to attempt to land
the aircraft in unsafe conditions. Analyzing this data
also showed that bad meteorological conditions are not the main
reason for aviation accidents. A more exhaustive analysis
could be restricted to airports where bad meteorological
conditions have influenced aviation accidents. A detailed
analysis was done for the state of Illinois, revealing that most
aviation accidents in Illinois happen during landing and that
Cessna is the aircraft make involved in the most
accidents there during the last 15 years.
Based on the history of aviation accidents in the USA, probable
accidents have been forecast using Mahout for some
states. The forecast indicates in which phase of flight aviation accidents
are most likely to occur.
This project has demonstrated that Hadoop has a high degree
of sustainability and robustness in analyzing and storing
large quantities of data. Big data brings not only new data
types and storage mechanisms, but new types of analysis as
well. Big data analysis is a continuum, not an isolated set of
activities.
II. DATASET PREPARATION AND DESCRIPTION
This project is based on data from the National
Transportation Safety Board (NTSB). The data contains
information from 1941 to the present from all over the world, but
for the purpose of this study I have worked with data from
2000 to 2015 in the USA. The reason for this reduction
is that, after a quick inspection, I realized that there is
much missing data, especially in the aircraft category.
During the period from 2000 to 2015 there is also missing
data, but in a smaller proportion than before the year 2000.
Missing data for this study was replaced by “unknown”,
and thus calculations had to be based on this information.
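The replacement of missing values can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used in the study; the sample rows and the `fill_missing` helper are hypothetical, and the column names simply mirror the Pig schema that appears later in this paper.

```python
import csv
import io

# Hypothetical sample rows; the real file has 32 attributes.
SAMPLE = """eventid,state,aircraftcategory,broadphaseofflight
E1,IL,Airplane,LANDING
E2,CA,,CRUISE
E3,,Helicopter,
"""

def fill_missing(rows, placeholder="unknown"):
    """Return copies of the rows with empty fields replaced by a placeholder."""
    cleaned = []
    for row in rows:
        cleaned.append({
            key: (value.strip() if value and value.strip() else placeholder)
            for key, value in row.items()
        })
    return cleaned

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned_rows = fill_missing(rows)
for row in cleaned_rows:
    print(row)
```

Because "unknown" then behaves like any other category, it shows up as its own group in every later count, which is why the calculations have to take it into account.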
After that, the aircraft accident and incident records were
uploaded into Hadoop HDFS using Hue. The workflow
is to download each year of aviation data, then
upload it using the “Upload –> Files” drop-down menu.
The aviation file was in CSV format. After the file was
uploaded, a table containing 32 attributes and 26,931 records was
created from Aviationusa.csv.
However, to facilitate the use of Pig, I created a txt file
based on the original aviation accidents CSV file. Pig can
also load a CSV file, but it needs a different script.
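Pig's default loader splits each line on tab characters, so one plausible form for the txt file is tab-delimited text. The sketch below shows such a conversion in Python; it is an assumption about the file layout, not the conversion used in the study, and `csv_to_tsv` and the sample records are hypothetical.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to tab-delimited text, one record per line."""
    lines = []
    for record in csv.reader(io.StringIO(csv_text)):
        # A literal tab or newline inside a field would break the
        # tab-delimited layout, so replace each with a space.
        fields = [
            f.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            for f in record
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"

# Hypothetical two-record sample; the quoted field contains a comma.
SAMPLE_CSV = 'E1,"CHICAGO, IL",IL\nE2,"LOS ANGELES, CA",CA\n'
tsv_text = csv_to_tsv(SAMPLE_CSV)
print(tsv_text, end="")
```

Using a CSV parser rather than a plain string replacement keeps quoted fields such as "CHICAGO, IL" in a single column. A header row, if the file has one, would have to be dropped separately so that Pig does not read it as a record.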
III. METHODOLOGY
To analyze this data I have used Hive and Pig in order to
uncover hidden patterns that help to answer specific
questions. Mahout is also used to predict the phase of flight
in which aviation accidents are likely to occur. R is used in this
study to visualize the difference between injured and
uninjured persons in aviation accidents from 2000 to 2015.
IV. RESULTS
Analyzing Aviation Accidents Dataset with Hive
Every flight is different, but in some cases accidents follow
well-worn patterns. Whether heedless, hapless, or simply
clueless, pilots keep falling into the same traps that have
snared others before them, and that is what I try to
uncover in this study.
I have performed quick ad hoc analyses of the dataset in
order to get some insight into the factors that may
cause air accidents.
First, locating aviation accidents on a map of the USA in
Cloudera:
SELECT Latitude, Longitude FROM Aviationusa;
 
Map from Cloudera
Answering some questions:
1. Weather can be a factor in accidents.
Analyzing the data, we can see that in the USA it is not a
determining influence.
IMC - Instrument meteorological conditions - this means
flying in cloud or bad weather.
VMC - Visual meteorological conditions - this is an aviation
flight category in which pilots have sufficient visibility to
fly the aircraft.
 
2. Many aviation occurrence reporting systems
capture the phase of operation or the phase of
flight in which the reported event
occurred. Analyzing this factor, I found that
accidents have mostly occurred during
landing.
3. How many air accidents happened, by
type of injury and number of fatalities?
There were 21,063 non-fatal accidents and 2,671
accidents with at least one fatality. See the
charts below.
 
 
 
Number of Occurrences per type of Injury and quantity of
fatalities.
 
Number of Occurrences per type of Injury (general)
4. Which states have had the most accidents during the
last 15 years?
California has the most registered accidents.
 
5. Which aircraft category has the most accidents?
Airplane
Chart from Cloudera indicating which type of aircraft has
been involved in the majority of air accidents:
 
6. Given that the accident was fatal, what was the
aircraft’s phase of flight at the moment of the
accident and what was the aircraft damage?
 
7. Which aircraft manufacturers have been most involved in
accidents in which the aircraft was completely
destroyed? Cessna.
 
 
 
8. When a Fatal accident happened, in what phase of
flight was the aircraft? What was the aircraft
category and model?
 
9. We can look up accidents on a specific date:
how many accidents happened on November 23,
2014 in the USA? Were they fatal?
 
10. During the last 15 years, how many accidents
happened per state per type of Injury Severity?
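
Most of the questions above reduce to filtering the records and counting them per group. The following is a minimal Python sketch of that pattern; it is not the Hive queries used in the study. The records, the "Fatal(n)" severity labels, and the `count_by` helper are hypothetical and only illustrate the shape of the computation.

```python
from collections import Counter

# Hypothetical records as (state, injuryseverity, broadphaseofflight).
RECORDS = [
    ("IL", "Non-Fatal", "LANDING"),
    ("IL", "Fatal(1)", "CRUISE"),
    ("CA", "Non-Fatal", "LANDING"),
    ("CA", "Fatal(2)", "MANEUVERING"),
    ("CA", "Non-Fatal", "TAKEOFF"),
    ("TX", "Non-Fatal", "LANDING"),
]

def count_by(records, index, keep=lambda record: True):
    """Count records by the field at `index`, optionally filtering first."""
    return Counter(record[index] for record in records if keep(record))

# Question 2: accidents per phase of flight.
by_phase = count_by(RECORDS, 2)
# Question 4: accidents per state.
by_state = count_by(RECORDS, 0)
# Questions 6 and 8 (in part): phase of flight for fatal accidents only.
fatal_by_phase = count_by(
    RECORDS, 2, keep=lambda record: record[1].startswith("Fatal")
)

print(by_phase.most_common(1))
print(by_state.most_common(1))
print(sorted(fatal_by_phase.items()))
```

In Hive the same pattern would be a GROUP BY with a COUNT, with the filter expressed in a WHERE clause.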
 
Analyzing Aviation Accidents in Illinois using Pig
How many accidents occurred per state over the last 15 years:
A = LOAD '/user/hive/warehouse/aviation/Aviationusa.txt'
AS (
eventid: CHARARRAY,
investigationtype: CHARARRAY,
accidentnumber: CHARARRAY,
eventdate: CHARARRAY,
location: CHARARRAY,
state: CHARARRAY,
country: CHARARRAY,
latitude: FLOAT,
longitude: FLOAT,
airportcode: CHARARRAY,
airportname: CHARARRAY,
injuryseverity: CHARARRAY,
aircraftdamage: CHARARRAY,
aircraftcategory: CHARARRAY,
registrationnumber: CHARARRAY,
make: CHARARRAY,
model: CHARARRAY,
amateurbuilt: CHARARRAY,
numberofengines: INT,
enginetype: CHARARRAY,
fardescription: CHARARRAY,
schedule: CHARARRAY,
purposeofflight: CHARARRAY,
aircarrier: CHARARRAY,
totalfatalinjuries: INT,
totalseriousinjuries: INT,
totalminorinjuries: INT,
totaluninjured: INT,
weathercondition: CHARARRAY,
broadphaseofflight: CHARARRAY,
reportstatus: CHARARRAY,
publicationdate: CHARARRAY
);
GF = GROUP A BY state;
CC = FOREACH GF GENERATE group, COUNT(A) AS ct;
top = ORDER CC BY ct DESC;
top10 = LIMIT top 10;
DUMP top10;
 
 
1. How many accidents happened in Illinois, by
injury severity?
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY injuryseverity;
injury = FOREACH GF GENERATE group, COUNT(FF)
AS ct;
DUMP injury;
 
2. How many accidents happened in Illinois,
by broad phase of flight?
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY broadphaseofflight;
injury = FOREACH GF GENERATE group, COUNT(FF)
AS ct;
top = ORDER injury BY ct DESC;
DUMP top;
 
 
3. Which aircraft makes were most involved in
air accidents in Illinois?
FF = FILTER A BY state == ' IL ';
GF = GROUP FF BY make;
injurymake = FOREACH GF GENERATE group,
COUNT(FF) AS ct;
top = ORDER injurymake BY ct DESC;
top10 = LIMIT top 10;
DUMP top10;
 
 
Analyzing Aviation Accidents data with R
To analyze this data, I can also use R, which is a powerful
open source language. In the beginning, big data and R
were not natural friends. R programming requires that all
objects be loaded into the main memory of a single
machine, while distributed systems such as Hadoop
lack strong statistical techniques but are ideal for scaling
complex operations and tasks.
Now, integrations of Hadoop with R connect the two
and combine high-level programming and
querying languages with Hadoop.
Through R, I can explore complex data sets,
manipulate data through sophisticated modeling functions,
and create sleek graphics to represent the numbers, in just a
few lines of code. It has been likened to a hyperactive version of
Excel.
Just to give a glimpse of what we can do with R, in this
study I have used it to visualize the difference between
injured and uninjured people from 2000 to 2015.
Fortunately, the difference is huge: most people involved in
accidents were uninjured.
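
The comparison behind this chart amounts to two totals. The sketch below computes them in Python rather than R, under the assumption that "injured" means fatal, serious, and minor injuries combined; the paper does not state the exact definition, and the sample counts are hypothetical.

```python
# Hypothetical per-accident counts: (fatal, serious, minor, uninjured).
ACCIDENTS = [
    (0, 0, 1, 2),
    (1, 0, 0, 0),
    (0, 1, 0, 3),
    (0, 0, 0, 4),
]

def injured_vs_uninjured(accidents):
    """Return (injured, uninjured) totals across all accidents."""
    # Assumption: "injured" = fatal + serious + minor injuries.
    injured = sum(fatal + serious + minor
                  for fatal, serious, minor, _ in accidents)
    uninjured = sum(row[3] for row in accidents)
    return injured, uninjured

injured, uninjured = injured_vs_uninjured(ACCIDENTS)
print(f"injured={injured} uninjured={uninjured}")
```

The four counts correspond to the totalfatalinjuries, totalseriousinjuries, totalminorinjuries, and totaluninjured fields of the schema shown in the Pig section.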
 
 
 
 
Analyzing Aviation accidents data with Apache Mahout
Apache Mahout is a machine-learning library with
tools for clustering, classification, and several types of
recommenders, including tools to calculate most-similar
items or build item recommendations for users. Mahout
employs the Hadoop framework to distribute calculations
across a cluster, and now includes additional work
distribution methods.
For this project I use the recommenditembased job with the
SIMILARITY_COOCCURRENCE similarity measure because I
want to use Mahout’s matrix algebra to get from
user-behavior histories to useful indicators. To prepare the
data for building a recommendation engine with Mahout, I
have encoded all states and broad phases of flight as numeric codes.
Aircraft damage was used to rate the phase of flight per state. With
this encoding, the data is in the input format that the Mahout
recommender expects:
Broad Phase of Flight
Approach 0
Climb 1
Cruise 2
Descent 3
Go_Around 4
Landing 5
Maneuvering 6
Other 7
Standing 8
Takeoff 9
Taxi 10
Aircraft Damage
Unknown 1
Minor 2
Substantial 3
Destroyed 4
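The encoding described above can be sketched in Python as follows. The phase and damage codes are the ones listed in the tables; everything else is an assumption made for illustration. In particular, the paper does not list its state codes, so the sketch numbers states alphabetically, and the `encode` helper and sample records are hypothetical. The output lines follow the userID,itemID,preference layout that Mahout's item-based recommender reads.

```python
PHASE_CODES = {
    "Approach": 0, "Climb": 1, "Cruise": 2, "Descent": 3, "Go_Around": 4,
    "Landing": 5, "Maneuvering": 6, "Other": 7, "Standing": 8,
    "Takeoff": 9, "Taxi": 10,
}
DAMAGE_CODES = {"Unknown": 1, "Minor": 2, "Substantial": 3, "Destroyed": 4}

def encode(records):
    """Turn (state, phase, damage) records into 'user,item,preference' lines."""
    # Assumption: states get integer IDs in alphabetical order.
    states = sorted({state for state, _, _ in records})
    state_codes = {state: code for code, state in enumerate(states)}
    lines = [
        f"{state_codes[state]},{PHASE_CODES[phase]},{DAMAGE_CODES[damage]}"
        for state, phase, damage in records
    ]
    return state_codes, lines

# Hypothetical sample records.
SAMPLE = [
    ("IL", "Landing", "Substantial"),
    ("CA", "Cruise", "Destroyed"),
    ("IL", "Takeoff", "Minor"),
]
state_codes, lines = encode(SAMPLE)
print("\n".join(lines))
```

Here the state plays the role of the user, the phase of flight plays the role of the item, and the damage code plays the role of the rating.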
For convenience and simplicity in this project, I
configured the recommendation engine to give one
recommendation per state.
This is the command that runs the aviation accident
recommender in Mahout:
[cloudera@quickstart ~]$ mahout recommenditembased \
    --input /user/cloudera/Avion.csv \
    --tempDir /user/cloudera/temp0 \
    --similarityClassname SIMILARITY_COOCCURRENCE \
    --output /user/cloudera/result0 \
    --numRecommendations 1
State [Phase of flight: recommendation strength]
AK [Approach: 2.0]
AL [Landing: 2.0]
AZ [Approach: 2.0]
CO [Approach: 2.0]
CT [Cruise: 2.0]
DC [Cruise: 2.0]
DE [Go-Around: 2.1316726]
GA [Approach: 2.0]
HI [Go-Around: 2.469751]
IA [Standing: 2.2961373]
IL [Cruise: 2.5659575]
LA [Landing: 2.7575758]
MT [Standing: 2.639485]
ND [Standing: 2.4506438]
NE [Go-Around: 2.1032028]
NJ [Approach: 2.0]
NM [Landing: 2.0]
NV [Maneuver: 2.3461537]
NY [Standing: 2.508772]
OH [Standing: 2.2735848]
OK [Climb: 2.1538463]
OR [Cruise: 2.2248063]
PA [Standing: 2.1374407]
RI [Takeoff: 2.0]
SC [Landing: 2.131068]
SD [Landing: 2.0]
TN [Standing: 2.2230768]
TX [Approach: 2.0]
UT [Takeoff: 2.0]
VA [Approach: 2.0]
VT [Approach: 2.0]
WA [Descent: 1.7841727]
WI [Descent: 2.0267856]
WV [Landing: 2.0]
WY [Landing: 2.0]
AR [Cruise: 2.0]
Mahout uses cross-action cooccurrence analysis to limit
the views to ones that do predict the phase of flight in
which an accident may occur. Mahout does this by treating
the primary action (phase of flight) as data for the indicator
matrix and using the secondary action (aircraft damage) to
calculate the cross-cooccurrence indicator matrix.
The first field is a user ID, in this case a state, and the
key-value pairs inside the brackets are
phase-of-flight-id:recommendation-strength tuples. A recommendation
strength of a hundred percent corresponds to 4.0 in this case. The
results show preference values greater and less than 2.0.
Values less than 2.0 mean that there is a weak possibility
that the aircraft was destroyed during a specific phase
of flight; 2.0 is just an average value; values greater than
2.5 indicate that there is a strong possibility that an accident
in the given phase of flight will end with
the aircraft destroyed.
V. CONCLUSION
The only sure way to avoid risks associated with accidents
and incidents is to study mishap history. This project has
demonstrated that Hadoop has a high degree of sustainability
and robustness in analyzing and storing large quantities of
data. Using Hive, we have answered some important questions:
accidents have mostly occurred during landing,
2,671 accidents with at least one fatality
occurred from 2000 to 2015, and there were 21,063
non-fatal accidents during this period.
One fascinating finding is that bad meteorological conditions are
not a determining factor in aviation accidents.
Also, it was uncovered that California had the most
registered accidents during the past 15 years, with 296
fatalities.
The aircraft category that has the most accidents is the
airplane, and the manufacturer most
involved in accidents in which the aircraft was completely
destroyed is Cessna, with 766 accidents during the last 15
years. More interesting still, 121 of them
happened during the maneuvering phase of flight.
Pig was used to analyze aviation accidents in Illinois, and it
helped us to uncover that in Illinois most aviation accidents
happen during landing (161 accidents). It was also
interesting to learn that Illinois had 482 accidents with no
fatalities from 2000 to 2015.
It was even more interesting to verify that in Illinois
Cessna is also the aircraft make involved in the most
accidents during the last 15 years.
Based on the history of aviation accidents in the USA, probable
accidents have been forecast using Mahout for some
states.
The goal of this study is to present some ideas to help
reduce the rate of aviation accidents. Although many
factors can contribute to them, it is important to identify
which ones are the most common in order to fix them or at
least decrease them. Specifically, the focus is on what
we can learn from the past in order to identify appropriate
measures to prevent repeated errors in the future. For this
purpose, Hadoop has met expectations.
References
[1] A. Garibay and J. Young. “Reducing General Aviation
Accidents using Airline Operational Strategies”.
Department of Aviation Technology
http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1019
&context=atgrads
[2] F. George. “Lessons from past Mishaps”.
http://aviationweek.com/aftermarket-solutions/lessons-past-
mishaps
 
[3] E. McNulty. “Data Science. Understanding Big Data:
The Ecosystem”. 2014.
[4] V. Oster, J. Strong, C. Zorn. “Analyzing aviation safety:
Problems, challenges, opportunities”. Research in
Transportation Economics 43 (2013) 148e164. 2013.
[5] S. Owen, R. Anil, T. Dunning and E. Friedman.
“Mahout in Action”. 2012.
[6] J. Pontani, J. “Hands-on with Apache Mahout”. 2014.
[7] Seshachala, S. (2015) “Big Data – Understanding
Hadoop and its Ecosystem”.
[8] J. Thaker, J. “Why Hadoop is important in Handling Big
Data? ”. 2013.
[9] R. Woodhouse. “Accident Analysis Jet and Turboprop
Business Aircraft 1998-2003 Potential Impact of IS-BAO”.
2006.

Weitere ähnliche Inhalte

Andere mochten auch

Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahoutbigdatasyd
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsRaghu Pandey
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android appSai Sachin
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer SkillsPhowr Quang
 

Andere mochten auch (7)

UGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & DetailsUGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & Details
 
Machine Learning with Mahout
Machine Learning with MahoutMachine Learning with Mahout
Machine Learning with Mahout
 
RBC
RBCRBC
RBC
 
TorchFi platform
TorchFi platformTorchFi platform
TorchFi platform
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for Schools
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android app
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer Skills
 

Ähnlich wie HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS

Predictive Analytics - NTSB Aviation accidents data
Predictive Analytics - NTSB Aviation accidents dataPredictive Analytics - NTSB Aviation accidents data
Predictive Analytics - NTSB Aviation accidents dataSurya Adavi
 
Predictive analytics NTSB aviation accidents data
Predictive analytics NTSB aviation accidents dataPredictive analytics NTSB aviation accidents data
Predictive analytics NTSB aviation accidents dataAshish Kumar Doke
 
Flight data analysis using apache pig--------------Final Year Project
Flight data analysis using apache pig--------------Final Year ProjectFlight data analysis using apache pig--------------Final Year Project
Flight data analysis using apache pig--------------Final Year ProjectSanjib Mitra
 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining projectAkshay Kumar Bhushan
 
INFORMS AAS Newsletter Spring 2013 - Copy
INFORMS AAS Newsletter Spring 2013 - CopyINFORMS AAS Newsletter Spring 2013 - Copy
INFORMS AAS Newsletter Spring 2013 - CopyBenjamin Levy
 
Validating enterprise data lake using open source data validator
Validating enterprise data lake using open source data validatorValidating enterprise data lake using open source data validator
Validating enterprise data lake using open source data validatorPrachi Gupta
 
Avi news letter 15th issue
Avi news letter 15th issueAvi news letter 15th issue
Avi news letter 15th issueAvitrueSpares
 
AVI-NEWS Letter 15th Issue
AVI-NEWS Letter 15th IssueAVI-NEWS Letter 15th Issue
AVI-NEWS Letter 15th IssueAvitrue Spares
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docxlorainedeserre
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docxjesusamckone
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docxcharisellington63520
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docxtodd521
 
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAYMACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAYIJDKP
 
Master's Thesis_Reber_2007
Master's Thesis_Reber_2007Master's Thesis_Reber_2007
Master's Thesis_Reber_2007Dan Reber
 
Air Travel Analytics in SAS
Air Travel Analytics in SASAir Travel Analytics in SAS
Air Travel Analytics in SASRohan Nanda
 
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...Kavika Roy
 
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptx
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptxbigdatatoavoidweatherrelatedflightdelays-201219091805.pptx
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptxeternalisone
 
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...FAA Safety Team Central Florida
 

Ähnlich wie HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS (20)

Predictive Analytics - NTSB Aviation accidents data
Predictive Analytics - NTSB Aviation accidents dataPredictive Analytics - NTSB Aviation accidents data
Predictive Analytics - NTSB Aviation accidents data
 
Predictive analytics NTSB aviation accidents data
Predictive analytics NTSB aviation accidents dataPredictive analytics NTSB aviation accidents data
Predictive analytics NTSB aviation accidents data
 
Flight data analysis using apache pig--------------Final Year Project
Flight data analysis using apache pig--------------Final Year ProjectFlight data analysis using apache pig--------------Final Year Project
Flight data analysis using apache pig--------------Final Year Project
 
Flight delay detection data mining project
Flight delay detection data mining projectFlight delay detection data mining project
Flight delay detection data mining project
 
INFORMS AAS Newsletter Spring 2013 - Copy
INFORMS AAS Newsletter Spring 2013 - CopyINFORMS AAS Newsletter Spring 2013 - Copy
INFORMS AAS Newsletter Spring 2013 - Copy
 
Validating enterprise data lake using open source data validator
Validating enterprise data lake using open source data validatorValidating enterprise data lake using open source data validator
Validating enterprise data lake using open source data validator
 
Avi news letter 15th issue
Avi news letter 15th issueAvi news letter 15th issue
Avi news letter 15th issue
 
AVI-NEWS Letter 15th Issue
AVI-NEWS Letter 15th IssueAVI-NEWS Letter 15th Issue
AVI-NEWS Letter 15th Issue
 
Foqa good one
Foqa good oneFoqa good one
Foqa good one
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
 
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
2HOW THANKSGIVING AND SUPER BOWL TRAFFIC CONTRIBUTE TO FLIGH.docx
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
 
Running Head SAFETY IN AVIATION .docx
Running Head SAFETY IN AVIATION                                  .docxRunning Head SAFETY IN AVIATION                                  .docx
Running Head SAFETY IN AVIATION .docx
 
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAYMACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY
MACHINE LEARNING TECHNIQUES FOR ANALYSIS OF EGYPTIAN FLIGHT DELAY
 
Master's Thesis_Reber_2007
Master's Thesis_Reber_2007Master's Thesis_Reber_2007
Master's Thesis_Reber_2007
 
Air Travel Analytics in SAS
Air Travel Analytics in SASAir Travel Analytics in SAS
Air Travel Analytics in SAS
 
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...
Data Warehousing Fuses With Data Visualization To Solve Key Problems of Enter...
 
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptx
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptxbigdatatoavoidweatherrelatedflightdelays-201219091805.pptx
bigdatatoavoidweatherrelatedflightdelays-201219091805.pptx
 
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...
General Aviation Pilot’s Guide to Preflight Weather Planning, Weather Self-Br...
 
Big data
Big data Big data
Big data
 

HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS

  • 1.                                            HOW DATA CAN HELP TO REDUCE AVIATION ACCIDENTS                                       1    HOW DATA CAN HELP TO REDUCE AVIATION ACCIDENTS Carmen Serpa, Northwestern University. Sunilkumar Kakade, Northwestern University Abstract - This project is based on the data from the National Transportation Safety Board (NTSB). It involves classifying a set of aircraft accident/incident data covering the years 2000 to 2015 in United States. The NTSB provides one of the most comprehensive online aircraft accident and incident databases. It includes dates, places, aircraft and engine types, scheduled and non-scheduled certificated air carrier, and name of air carrier. These filters enable business aircraft operators to identify risks associated with the aircraft they operate, the types of operations and procedures they typically fly, and the airports they frequent. The big data industry has mastered the art of gathering and logging terabytes of data, but the challenge is to base forecasts and make decisions derived from this real data, which is why Hadoop is so important. The purpose of this study is to optimize analyses of the NTSB massive data with Hadoop and uncover hidden patterns that can give some ideas of how we can reduce the rate of aviation accidents and thus decrease risk and improve safety. While many factors can contribute to accidents, it is important to identify which ones are the most commons in order to fix them. Specifically, the focus is to see what we can learn from the past in order to identify appropriate measures to prevent repetitive errors in the future. Hadoop should be used to extend the ability of human pattern recognition to uncover accidents’ causes in multivariate data. Although Hadoop can assimilate more data at any given time than the analyst, it is the analyst, at the end, who must make the necessary decisions and judgements about the data. I. 
INTRODUCTION Whether its fine monitoring shop floor operations, gauging consumer sentiment, or any number of other large-scale analytic challenges, big data is having a tremendous impact on the aviation enterprise. The amount of data that is generated in every flight has risen over the years and more and more types of information are being stored in digital formats. However, it is not just access to new data sources, but access to patterns and interrelationships among these elements that are of interest. Collecting lots of diverse types of data very quickly does not create value, it is the analysis of this data to uncover insights that will help your organization, which is important. This project is based on data from the National Transportation Safety Board (NTSB) and it consist of 26,931 records from 2000 to 2015. The NTSB not only handles USA data but world data, and it currently manages data from the 1940s to present. This generates tremendous amounts of log data, that storing and analysis become issues that need to be solved. Storing vehicle monitoring data is very important for the NTSB to give information support for departments such as public security, criminal investigation, and economic investigation and front-line police. Hadoop is widely known to solve all types of problems with Big Data, and for this study I have used some Hadoop components, such as: Hive, Pig and Mahout to analyze the data. Many questions about aviation accidents in USA have been answered. However, the big question still remains: how can it be safer, particularly in the context of increasing the amount of flights? Consequently and definitively more questions can be solved an analyzed in order to get a more complete analysis of this data. 
The analysis shows that the most common theme in these mishaps is the apparent failure to stabilize the aircraft during landing, which can be due to many factors, such as the apparent willingness of the flight crew to attempt to land the aircraft in unsafe conditions. The analysis also uncovered that bad meteorological conditions are not the main reason for aviation accidents. A more exhaustive analysis could be restricted to airports where bad meteorological conditions have influenced aviation accidents.

A detailed analysis was done for the state of Illinois, revealing that most aviation accidents in Illinois happen during landing and that Cessna is the aircraft make involved in the most accidents there during the last 15 years.

Based on the history of aviation accidents in the USA, probable accidents have been forecasted using Mahout for some states. The forecast indicates in which phase of flight aviation accidents are most likely to occur.

This project has demonstrated that Hadoop has a high degree of sustainability and robustness in analyzing and storing large quantities of data. Big data brings not only new data types and storage mechanisms, but new types of analysis as well. Big data analysis is a continuum, not an isolated set of activities.
II. DATASET PREPARATION AND DESCRIPTION

This project is based on data from the National Transportation Safety Board (NTSB). The data contains information from 1941 to the present from all over the world, but for the purpose of this study I have worked with data from 2000 to 2015 in the USA. The reason for this reduction is that a quick inspection revealed a large amount of missing data, especially in the aircraft category. The period from 2000 to 2015 also has missing data, but in a smaller proportion than before the year 2000. Missing values were replaced by "unknown" for this study, and the calculations therefore had to be based on this information.

After that, the aircraft accident and incident records were uploaded into Hadoop HDFS using Hue. The workflow is to download each year of aviation data and then upload it using the "Upload –> Files" drop-down menu. The aviation file was in CSV format. After the file was uploaded, a table containing 32 attributes and 26,931 records was created: Aviationusa.csv

However, to facilitate the use of Pig, I created a txt file based on the original aviation accidents CSV file. Pig can also read a CSV file, but it needs a different script.

III. METHODOLOGY

To analyze this data I have used Hive and Pig in order to uncover hidden patterns that help to answer specific questions. Mahout is also used to predict aviation accidents during a specific phase of flight. R is used in this study to visualize the difference between injured and uninjured persons in aviation accidents from 2000 to 2015.

IV. RESULTS

Analyzing Aviation Accidents Dataset with Hive

Every flight is different, but in some cases accidents follow well-worn patterns.
Whether heedless, hapless, or simply clueless, pilots keep falling into the same traps that have snared others before them, and that is what I try to uncover in this study. I have performed quick ad hoc analyses of the dataset in order to get some insight into the factors that may cause air accidents.

First, locating aviation accidents on a map of the USA in Cloudera:

    SELECT Latitude, Longitude FROM Aviationusa;

Map from Cloudera

Answering some questions:

1. Weather can be a factor in accidents. Analyzing the data, we can see that in the USA it is not a determining influence. IMC (instrument meteorological conditions) means flying in cloud or bad weather. VMC (visual meteorological conditions) is an aviation flight category in which pilots have sufficient visibility to fly the aircraft.

2. Many aviation occurrence reporting systems capture the phase of operation or the phase of flight in which the reported event occurred. Analyzing this factor, I found that accidents have mostly occurred during landing.
3. How many air accidents happened per type of injury and number of fatalities? There were 21,063 non-fatal accidents and 2,671 accidents with at least one fatality. See charts below.

Number of occurrences per type of injury and number of fatalities.

Number of occurrences per type of injury (general).

4. Which states had the most accidents during the last 15 years? California has the most registered accidents.

5. Which aircraft category has the most accidents? Airplane.
Chart from Cloudera indicating which type of aircraft has been involved in the majority of air accidents.

6. For fatal accidents, what was the aircraft's phase of flight at the moment of the accident, and what was the aircraft damage?

7. Which aircraft manufacturers have been most involved in accidents where the aircraft was completely destroyed? Cessna.

8. When a fatal accident happened, in what phase of flight was the aircraft? What was the aircraft category and model?

9. We can look up accidents on a specific date: how many accidents happened on November 23, 2014 in the USA? Were they fatal?

10. During the last 15 years, how many accidents happened per state per type of injury severity?

Analyzing Aviation Accidents in Illinois using Pig

How many accidents occurred during the last 15 years per state:

    A = LOAD '/user/hive/warehouse/aviation/Aviationusa.txt' AS (
        eventid: CHARARRAY,
        investigationtype: CHARARRAY,
        accidentnumber: CHARARRAY,
        eventdate: CHARARRAY,
        location: CHARARRAY,
        state: CHARARRAY,
        country: CHARARRAY,
        latitude: FLOAT,
        longitude: FLOAT,
        airportcode: CHARARRAY,
        airportname: CHARARRAY,
        injuryseverity: CHARARRAY,
        aircraftdamage: CHARARRAY,
        aircraftcategory: CHARARRAY,
        registrationnumber: CHARARRAY,
        make: CHARARRAY,
        model: CHARARRAY,
        amateurbuilt: CHARARRAY,
        numberofengines: INT,
        enginetype: CHARARRAY,
        fardescription: CHARARRAY,
        schedule: CHARARRAY,
        purposeofflight: CHARARRAY,
        aircarrier: CHARARRAY,
        totalfatalinjuries: INT,
        totalseriousinjuries: INT,
        totalminorinjuries: INT,
        totaluninjured: INT,
        weathercondition: CHARARRAY,
        broadphaseofflight: CHARARRAY,
        reportstatus: CHARARRAY,
        publicationdate: CHARARRAY
    );
    -- Count accidents per state and keep the ten states with the most accidents.
    GF = GROUP A BY state;
    CC = FOREACH GF GENERATE group, COUNT(A) AS ct;
    top = ORDER CC BY ct DESC;
    top10 = LIMIT top 10;
    DUMP top10;

1. How many accidents happened in Illinois by injury severity?

    FF = FILTER A BY state == ' IL ';
    GF = GROUP FF BY injuryseverity;
    injury = FOREACH GF GENERATE group, COUNT(FF) AS ct;
    DUMP injury;

2. How many accidents happened in Illinois per broad phase of flight?

    FF = FILTER A BY state == ' IL ';
    GF = GROUP FF BY broadphaseofflight;
    injury = FOREACH GF GENERATE group, COUNT(FF) AS ct;
    top = ORDER injury BY ct DESC;
    DUMP top;

3. Which aircraft makes were most involved in air accidents in Illinois?

    FF = FILTER A BY state == ' IL ';
    GF = GROUP FF BY make;
    injurymake = FOREACH GF GENERATE group, COUNT(FF) AS ct;
    top = ORDER injurymake BY ct DESC;
    top10 = LIMIT top 10;
    DUMP top10;

Analyzing Aviation Accidents data with R

To analyze this data, I can also use R, which is a powerful open source language. In the beginning, big data and R were not natural friends. R requires that all objects be loaded into the main memory of a single machine, while distributed systems such as Hadoop lack strong statistical techniques but are ideal for scaling complex operations and tasks.
Now, Hadoop integrations for R connect the two and combine high-level programming and querying languages with Hadoop. Through R, I can examine complex data sets, manipulate data through sophisticated modeling functions, and create sleek graphics to represent the numbers in just a few lines of code. It has been likened to a hyperactive version of Excel. Just to give a glimpse of what can be done with R, in this study I have used it to visualize the difference between
injured and uninjured people from 2000 to 2015. Fortunately, the difference is huge: most people involved in accidents were uninjured.

Analyzing Aviation accidents data with Apache Mahout

Apache Mahout is a machine-learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster, and now includes additional work distribution methods.

For this project I use the recommenditembased job with the SIMILARITY_COOCCURRENCE similarity measure, because I want to use Mahout's matrix algebra to get from user-behavior histories to useful indicators.

To prepare the data for building a recommendation engine with Mahout, I codified all states and broad phases of flight. Aircraft damage was used to rate the phase of flight per state. With this encoding, the data is in the input format the Mahout recommender expects:

    Broad Phase of Flight    Code
    Approach                 0
    Climb                    1
    Cruise                   2
    Descent                  3
    Go_Around                4
    Landing                  5
    Maneuvering              6
    Other                    7
    Standing                 8
    Takeoff                  9
    Taxi                     10

    Aircraft Damage          Rating
    Unknown                  1
    Minor                    2
    Substantial              3
    Destroyed                4

For convenience and simplicity, I configured the recommendation engine to give one recommendation per state.
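The encoding step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the phase and damage codes come from the tables above, but the alphabetical numbering of states, the function names and the sample records are hypothetical, since the paper does not say how the states were codified. The output follows the comma-separated `userID,itemID,preference` layout that Mahout's item-based recommender reads.

```python
import csv
import io

# Code tables taken from the text above.
PHASE_CODES = {
    "Approach": 0, "Climb": 1, "Cruise": 2, "Descent": 3, "Go_Around": 4,
    "Landing": 5, "Maneuvering": 6, "Other": 7, "Standing": 8,
    "Takeoff": 9, "Taxi": 10,
}
DAMAGE_RATINGS = {"Unknown": 1, "Minor": 2, "Substantial": 3, "Destroyed": 4}


def encode_records(records):
    """Turn (state, phase, damage) tuples into (state_id, phase_id, rating) rows.

    State ids are assigned alphabetically. This numbering is an assumption;
    the paper does not say how the states were codified.
    """
    states = sorted({state for state, _, _ in records})
    state_ids = {state: i for i, state in enumerate(states)}
    rows = [
        (state_ids[state], PHASE_CODES[phase], DAMAGE_RATINGS[damage])
        for state, phase, damage in records
    ]
    return rows, state_ids


def to_mahout_csv(rows):
    """Serialize rows as 'userID,itemID,preference' lines."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample records, not taken from the NTSB data.
sample = [
    ("IL", "Landing", "Substantial"),
    ("IL", "Cruise", "Destroyed"),
    ("CA", "Takeoff", "Minor"),
]
rows, state_ids = encode_records(sample)
print(to_mahout_csv(rows), end="")
```

In this sketch the state plays the role of the user, the phase of flight plays the role of the item, and the aircraft damage rating is the preference value, which mirrors the roles described in the text.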
This is the code to create the aviation accident recommender in Mahout:

    [cloudera@quickstart ~]$ mahout recommenditembased --input /user/cloudera/Avion.csv --tempDir /user/cloudera/temp0 --similarityClassname SIMILARITY_COOCCURRENCE --output /user/cloudera/result0 --numRecommendations 1

    State   Phase of flight-id: recommendation-strength
    AK      [Approach: 2.0]
    AL      [Landing: 2.0]
    AZ      [Approach: 2.0]
    CO      [Approach: 2.0]
    CT      [Cruise: 2.0]
    DC      [Cruise: 2.0]
    DE      [Go-Around: 2.1316726]
    GA      [Approach: 2.0]
    HI      [Go-Around: 2.469751]
    IA      [Standing: 2.2961373]
    IL      [Cruise: 2.5659575]
    LA      [Landing: 2.7575758]
    MT      [Standing: 2.639485]
    ND      [Standing: 2.4506438]
    NE      [Go-Around: 2.1032028]
    NJ      [Approach: 2.0]
    NM      [Landing: 2.0]
    NV      [Maneuver: 2.3461537]
    NY      [Standing: 2.508772]
    OH      [Standing: 2.2735848]
    OK      [Climb: 2.1538463]
    OR      [Cruise: 2.2248063]
    PA      [Standing: 2.1374407]
    RI      [Takeoff: 2.0]
    SC      [Landing: 2.131068]
    SD      [Landing: 2.0]
    TN      [Standing: 2.2230768]
    TX      [Approach: 2.0]
    UT      [Takeoff: 2.0]
    VA      [Approach: 2.0]
    VT      [Approach: 2.0]
    WA      [Descent: 1.7841727]
    WI      [Descent: 2.0267856]
    WV      [Landing: 2.0]
    WY      [Landing: 2.0]
    AR      [Cruise: 2.0]

Mahout uses cross-action cooccurrence analysis to limit the data to observations that do predict the phase of flight in which an accident may occur. It does this by treating the primary action (phase of flight) as data for the indicator matrix and using the secondary action (aircraft damage) to calculate the cross-cooccurrence indicator matrix.

The first field is the user id, in this case a state, and the key-value pairs inside the brackets are phase of flight-id:recommendation-strength tuples. The recommendation strengths are on a scale whose maximum (one hundred percent) is 4.0 in this case. The results show preference values both greater and less than 2. Values less than 2.0 mean that there is a weak possibility that the aircraft was destroyed during a specific phase of flight; 2.0 is just an average value; values greater than 2.5 indicate a strong possibility that an aircraft in the given phase of flight will end in an accident in which the aircraft is destroyed.

V. CONCLUSION

The only sure way to avoid the risks associated with accidents and incidents is to study mishap history. This project has demonstrated that Hadoop has a high degree of sustainability and robustness in analyzing and storing large quantities of data.

Using Hive, we answered some important questions: accidents mostly occurred during landing, 2,671 accidents involved at least one fatality between 2000 and 2015, and there were 21,063 non-fatal accidents during this period. One fascinating fact is that bad meteorological conditions are not a determining factor in aviation accidents.
Also, it was uncovered that California had the most registered accidents during the past 15 years, with 296 fatalities. The aircraft category with the most accidents is the airplane, and the manufacturer most involved in accidents where the aircraft was completely destroyed is Cessna, with 766 accidents during the last 15 years. More interesting still, 121 of them happened during the maneuvering phase of flight.

Pig was used to analyze aviation accidents in Illinois, and it helped us uncover that most aviation accidents in Illinois happen during landing (161 accidents). It was also interesting to learn that Illinois had 482 accidents with no fatalities from 2000 to 2015, and to verify that in Illinois Cessna is also the aircraft make involved in the most accidents during the last 15 years.

Based on the history of aviation accidents in the USA, probable accidents have been forecasted using Mahout for some states.

The goal of this study is to present some ideas to help reduce the rate of aviation accidents; although many factors can contribute to them, it is important to identify which ones are the most common in order to fix them or at least decrease them. Specifically, the focus is to see what we can learn from the past in order to identify appropriate measures to prevent repeated errors in the future. For this purpose, Hadoop has met expectations.