Big Data Analysis
 Concepts and References
 Use Cases in Airline Industry
Author: Vikram Andem, ISRM & IT GRC Conference
 The objective of this presentation is to provide awareness and familiarize a general business or
management user with the terms and terminology of Big Data Analysis, and with references to use cases that
can be (or are currently) applied in the airline industry.
 The presentation is intended to help a business or management user with the thinking process of
formulating an analytical question for a given business situation/problem for Big Data Analysis.
 The presentation may also help provide insight into the basic terms and concepts one needs to know,
what to ask, and how to evaluate and/or help solve a business problem for a potential Big Data Analysis use
case, as well as what to expect from the work of a competent Data Scientist when dealing with such a use case
for Big Data Analysis.
NOTE: Just reviewing this presentation will most likely NOT make you competent enough to instantly perform Big Data
Analysis. Big Data Analysis is a new (very recent) aspect of Data Science and requires some college- or university-level
course work in fields such as (but not limited to) mathematics, statistics, computer science, management science,
econometrics, and engineering.
The presentation is divided into three parts, following a separate presentation on Big Data Security &
Governance, Risk Management & Compliance:
Part 1. Big Data: Introduction (page # 3)
Part 2. Very quick introduction to understanding Data and analysis of Data (page # 8)
(Beginner: if you are new to understanding data and the use of data, you should start here.)
Part 3. Big Data Analysis: Concepts and References to Use Cases in Airline Industry (page # 17)
(Advanced: if you understand data and how to use data, you may jump to this part.)
Part 1
Big Data: Introduction
You may skip this section if you are familiar with Big Data and directly jump to Part 2 (page # 8).
Introduction
[Chart: projected growth and use of unstructured vs. structured data¹, 2012 through 2020, measured in Zettabytes*.]
¹ 2013 IEEE Big Data conference (projected growth of data combined for all Fortune 500 companies only)
Limitations of existing Data Analytics Architecture
[Diagram: instrumentation and collection feed a storage-only grid holding the original raw data (mostly append); an ETL compute grid moves aggregated data into an RDBMS that serves BI reports and interactive apps.]
Limit #1: Moving data to compute doesn't scale.
Limit #2: Can't explore high-fidelity raw data.
Limit #3: Archiving = premature data death.
* Zettabytes of data: 1 Zettabyte = 1,000 Exabytes = 1 million Petabytes = 1 billion Terabytes.
Big Data is a general term that refers to large, voluminous amounts (at least terabytes)
of poly-structured data that is gleaned from traditional and non-traditional sources
and continuously flows through and around organizations, including but not limited to e-mail,
text, event logs, audio, video, blogs, social media and transactional records.
What does this information hold?
What is the challenge extracting it?
It holds the promise of giving enterprises like United a deeper insight into their
customers, partners, and business. This data can provide answers to questions
they may not have even thought to ask. Companies like United can benefit
from a multidimensional view of their business when they add insight from big
data to the traditional types of information they collect and analyze.
[Chart: demand (number of results) vs. popularity rank, running from more generic (small tail) to more specific (long tail). The traditional EDW and classical statistics cover the small tail of transactional data (e.g., reservations) with specific spikes; Big Data covers the long tail of non-transactional and raw data (e.g., searches, event logs).]
The challenge of extracting value from big data is similar in many ways to
the age-old problem of distilling business intelligence from transactional
data. At the heart of this challenge is the process used to extract data
from multiple sources, transform it to fit your analytical needs, and load it
into a data warehouse for subsequent analysis, a process known as
“Extract, Transform & Load” (ETL). The nature of big data requires that the
infrastructure for this process can scale cost-effectively.
Data Storage and Analysis
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. A typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all the data on a single drive, and writing is even slower.
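For a quick check, the arithmetic behind the two read-time figures quoted above (using only the numbers already given) is:

```latex
\frac{1{,}370\ \mathrm{MB}}{4.4\ \mathrm{MB/s}} \approx 311\ \mathrm{s} \approx 5\ \mathrm{minutes},
\qquad
\frac{1\ \mathrm{TB}}{100\ \mathrm{MB/s}} = \frac{1{,}000{,}000\ \mathrm{MB}}{100\ \mathrm{MB/s}} = 10{,}000\ \mathrm{s} \approx 2.8\ \mathrm{hours}
```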
Hadoop
Apache Hadoop is a scalable, fault-tolerant distributed system for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop can be used to store Exabytes of unstructured and semi-structured data reliably on tens of thousands of general-purpose servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Using Hadoop in this way, an organization like United gains an additional ability to store and access data that it "might" need, data that may never be loaded into the data warehouse.
The Key Benefit: Agility/Flexibility
Schema-on-Write (RDBMS)
 Schema must be created before any data can be loaded.
 An explicit load operation has to take place which transforms data to the DB-internal structure.
 New columns must be added explicitly before new data for such columns can be loaded into the database.
 Pros: Read is fast; standards/governance.
 Use when: Interactive OLAP analytics (<1 sec); multistep ACID transactions; 100% SQL compliance.
Schema-on-Read (Hadoop)
 Data is simply copied to the file store; no transformation is needed.
 A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).
 New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
 Pros: Load is fast; flexibility/agility.
 Use when: Structured or not (flexibility); scalability of storage/compute; complex data processing.
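As a small illustration of the schema-on-read idea, the Python sketch below applies structure to raw log lines only at read time, the way a SerDe would. The log contents, field layout and parsing logic are hypothetical, not taken from the presentation.

```python
import csv
from io import StringIO

# Hypothetical raw web-log lines landed as-is in the file store (no load-time schema).
RAW_LOG = """2014-03-01T10:00:00,ORD,IAH,searched
2014-03-01T10:00:05,ORD,SFO,booked
2014-03-01T10:00:09,EWR,ORD,searched"""

def read_with_schema(raw_text):
    """A toy 'SerDe': impose column names while reading, not while writing (late binding)."""
    schema = ["timestamp", "origin", "destination", "action"]
    for row in csv.reader(StringIO(raw_text)):
        yield dict(zip(schema, row))

for record in read_with_schema(RAW_LOG):
    print(record["origin"], "->", record["destination"], record["action"])
```

If the raw lines later gain extra fields, only the reader (the "SerDe") changes; the stored data never has to be reloaded.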
Hadoop Architecture
Central to the scalability of Hadoop is the
distributed processing framework known as
MapReduce which splits the input
data-set into multiple chunks, each of
which is assigned a map task that can
process the data in parallel. Each map task
reads the input as a set of (key, value) pairs
and produces a transformed set of (key,
value) pairs as the output. The framework
shuffles and sorts outputs of the map tasks,
sending the intermediate (key, value) pairs
to the reduce tasks, which group them into
final results. MapReduce uses Job Tracker
and Task Tracker mechanisms to schedule
tasks, monitor them, & restart any that fail.
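To make the map/shuffle/reduce flow concrete, here is a minimal, single-process Python sketch of the classic word-count job. It imitates what the framework does (map to (key, value) pairs, shuffle and sort by key, reduce each group) without any actual Hadoop cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map task: read input and emit (key, value) pairs, here (word, 1)."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Framework step: group intermediate values by key (shuffle and sort)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: fold the grouped values into a final result per key."""
    return (key, sum(values))

lines = ["big data analysis", "big data big value"]
intermediate = [pair for line in lines for pair in map_phase(line)]
results = [reduce_phase(k, v) for k, v in sorted(shuffle(intermediate).items())]
print(results)   # [('analysis', 1), ('big', 3), ('data', 2), ('value', 1)]
```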
Hadoop Distributed File System (HDFS) is designed for scalability and fault tolerance. HDFS stores large files by dividing them
into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read
and write data in parallel. Capacity and performance can be scaled by adding Data Nodes, and a single NameNode mechanism manages data
placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
² Pictures source: Intel White Paper on Big Data Analytics
Use Cases
Data Storage: Collect and store unstructured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.
Analytics: Ability to query in real time, at the speed of thought, on petabyte-scale unstructured and semi-structured data using HBase and Hive.
Batch Processing of Unstructured Data: Ability to batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data.
Data Archive: Medium-term (12-36 months) archival of data from EDW/DBMS to meet data retention policies.
Integration with EDW: Extract, transfer and load data in and out of Hadoop into a separate DBMS for advanced analytics.
Search and Predictive Analytics: Crawl, extract, index and transform structured and unstructured data for search and predictive analytics.
Southwest Airlines uses a Hadoop-based solution for its "Rapid Rewards" loyalty program for Customer Service.
Common Patterns of Hadoop Use
Pattern #1: Hadoop as a Data Refinery
[Diagram: data sources, both traditional (RDBMS, OLTP, OLAP) and new (web logs, email, sensor data, social media), flow into Hadoop alongside the traditional repositories (RDBMS, EDW) that serve business analytics, custom applications and enterprise applications.]
Collect data and apply a known algorithm to it in a trusted operational process:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: push to the existing data warehouse for use with existing analytic tools.
Pattern #2: Data Exploration with Hadoop
[Diagram: the same traditional and new data sources land in Hadoop alongside the traditional repositories that serve business analytics, custom and enterprise applications.]
Collect data and perform iterative investigation for value:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: explore and visualize with analytics tools supporting Hadoop.
Pattern #3: Application Enrichment with Hadoop
[Diagram: as above, traditional and new data sources land in Hadoop alongside the traditional repositories that serve business analytics, custom and enterprise applications.]
Collect data, analyze and present salient results for online apps:
1. Capture: capture all data.
2. Process: parse, cleanse, apply structure and transform.
3. Exchange: incorporate data directly into applications.
Top 5 General Usages:
Big Data Exploration
Obtaining a 360-degree view of customers
Operations analytics
Data warehouse augmentation
Social media
How is the Airline Industry using Hadoop? ³
Top 5 Airline Usages:
Capturing sensor data to optimize maintenance
Forecasting the weather to optimize fuel loads
Identifying and capturing the demand signal (competitive offerings, travel partner feeds)
Loyalty and promotions
Webpage visits, log storage
Prospects: CTO Questions?
 When is the best time of day / day of week / time of year to fly to minimize delays?
 Do older planes suffer more delays?
 How does the number of people flying between different locations change over time?
 How well does weather predict plane delays?
 Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?
American Airlines is utilizing a Hadoop-based solution for clickstream, customer, kiosk and data analytics.
British Airways uses a Hadoop-based solution for forecasting and revenue management.
Orbitz Worldwide utilizes a Hadoop-based solution for statistical analysis to identify the best possible promotions that combine air travel with a hotel stay.
³ Source: Internet (Google search)
Cost Benefit Analysis
Economics of Data
[Chart: cost per terabyte vs. size of data, from gigabyte scale to petabyte scale. The cost of current, structured data-management technologies rises steeply with data size, while the cost of Hadoop stays low; the gap between the two curves is the value captured by customers (like United) using Hadoop.]
Why is Hadoop a value proposition?
TCOD (Total Cost of Data), the cost of owning (and using!) data over time for analytic purposes, is a better metric than TCO (Total Cost of Ownership) for the cost-benefit analysis of this case. TCOD estimates what a company like United will really spend to get to its business goal, and the focus is on total cost, not just the platform cost. In this case the TCOD comparison is made between an EDW platform/appliance and Hadoop for the same amount of (raw or unstructured) data.
Cost Comparison & Benefits are based on underlying “Data Management” Requirements⁴
Project A: Emphasis on "Data Refining" Requirements
1. Hundreds of TB of data per week (500 TB of data).
2. Raw data life: a few hours to a few days.
3. Challenge: find the important events or trends.
4. Analyze the raw data once or a few times.
5. When analyzing, read entire files.
6. Keep only the significant data.
                          EDW Appliance   Hadoop
Total System Cost         $23M            $1.3M
System and Data Admin     $2M             $2M
Application Development   $5M             $6M
Total Cost of Data        $30M            $9.3M
The cost equation is favorable to Hadoop for data refining, data landing and archival requirements.
Project B: Emphasis on "EDW" Requirements
1. Data volume 500 TB to start; all must be retained for at least five years.
2. Continual growth of data and workload.
3. Data sources: thousands.
4. Data sources change their feeds frequently.
5. Challenges: data must be correct and data must be integrated.
6. Typical enterprise data lifetime: decades.
7. Analytic application lifetime: years.
8. Millions of data users.
9. Hundreds of analytic applications.
10. Thousands of one-time analyses.
11. Tens of thousands of complex queries.
                          EDW Appliance   Hadoop
Total System Cost         $45M            $5M
System and Data Admin     $50M            $100M
Application Development   $40M            $300M
ETL                       $60M            $100M
Complex Queries           $40M            $80M
Analysis                  $30M            $70M
Total Cost of Data        $265M           $655M
The cost equation is favorable to the EDW for data warehouse appliance requirements.
Cost Comparison Conclusions Each technology has large advantages in its sweet spot(s).
 Neither platform is cost effective in the other’s sweet spot.
 Biggest differences for the data warehouse are the development cost for “Complex Queries” & “Analytics”.
 Total cost is extremely sensitive to technology choice.
 Analytic architectures will require both Hadoop and data warehouse platforms.
 Focus on total cost, not platform cost, in making your choice for a particular application or use.
 Many analytic processes will use both Hadoop & EDW technology – so integration cost also counts!
⁴ source for TCOD comparison is “The Real Cost of Big Data Spreadsheet” provided by Winter Corp (www.wintercorp.com/tcod)
Hadoop is ideal for storage of
• data which is rarely needed;
• data which can grow rapidly;
• data which can grow very large;
• data for which it is uncertain how it will be needed in the future;
• data which may or may not have structure;
• data which may require ETL and analysis sometime in the future but just needs to be stored now ... for some as yet unknown use.
TCOD is the cost of owning (and using!) data over time for analytic purposes.
[Diagram: TCOD components, including software development/maintenance cost: system cost, admin cost, ETL*, applications, queries, analytics.]
* ETL is extract, transform and load (preparing data for analytic use).
Part 2
Very quick introduction to understanding Data and analysis of Data
Start here if you are new to understanding the data or do not know how to analyze data.
Introduction to Data
Data Analysis Benefits
Descriptive Analytics: What happened?
 Formulates and analyzes historical data.
 Pattern recognition from samples for reporting of trends.
 Benefit: Medium
Predictive Analytics: What could happen?
 Forecasts future probabilities and trends.
 Finds associations in data not readily apparent with customary analysis.
 Benefit: High
Prescriptive Analytics: What is the best action/outcome?
 Targets business constraints.
 Assesses and determines new ways to operate.
 Benefit: Very High
Myth: I have large sets of data on the Hadoop File System. Running powerful analytical tools (e.g., R, SAS, Tableau, etc.) on the Hadoop infrastructure will perform all the data analysis work for me and provide/deliver useful information.
Fact: The data by itself may not contain the answer, Big or Small: you need the right data. The combination of data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. Performing data analysis with an understanding (and application) of data science principles, by correctly framing the analytical problem (with correct data sampling/collection methods) and with the use of appropriate analytical tools, will most likely provide useful information for statistical/analytical inference.
Data Facts
 Data is just data.
 Data does not live nor die.
 Data does not offer truth, nor does it lie.
 Data is neither large nor small.
 Data has always existed as both big data and small data.
Data is the second most important thing
 The most important thing in data science is the question.
 The second most important is the data.
 Often the data will limit or enable the questions.
 But having data is useless if you don't have an appropriate question.
Data are values of qualitative or quantitative variables, belonging to a set of items. Source: Wikipedia
Data Types: Basics
Types of Data Variables
Data variables are either numerical (continuous or discrete) or categorical (regular categorical or ordinal).
Examples (from a survey table on the original slide):
 gender: categorical
 sleep: numerical, continuous
 bedtime: categorical, ordinal
 countries: numerical, discrete
 dread: categorical, ordinal; could also be used as numerical
Numerical (quantitative)
variables take on numerical
values. It is sensible to add,
subtract, take averages, etc. with
these values.
Categorical (qualitative) variables take on a limited
number of distinct categories. These categories can be
identified with numbers, but it wouldn’t be sensible to
do arithmetic operations with these values.
Continuous numerical
variables are measured,
and can take on any
numerical value.
Discrete numerical variables are
counted, and can take on only
whole non-negative numbers.
Categorical variables that have
ordered levels are called ordinal. Think
about a flight survey question where
you are asked about how satisfied you
are with the customer service you
received, and the options are very
unsatisfied, unsatisfied, neutral,
satisfied or very satisfied. These levels
will have inherent ordering and hence
the variable will be called ordinal.
If the levels of the categorical variable do
not have an inherent ordering to them,
then the variable is simply called (regular)
categorical. (e.g., do you prefer morning
flight or an evening flight?)
Observations, Variables and Data Matrices
Data are organized in what we call a data matrix, where each row represents an observation (or a case), and each column represents a variable.
Example data matrix:
City            no_flights   %_ontime   ...   region
Chicago         350          90         ...   Midwest
Houston         330          96         ...   South
...             ...          ...        ...   ...
Newark          306          92         ...   Northeast
San Francisco   310          93         ...   West
 The first variable is City, an identifier variable for the name of the city United serves and for which the data are gathered.
 Next is no_flights (number of flights) served by United daily, a discrete numerical variable.
 Next is %_ontime (percentage on time), representing the United flights that operated on time (arrival or departure); this is a continuous numerical variable (it can take on any value between zero and 100, even though the values shown here are rounded to whole numbers).
 The last column is region, representing where the city is located in the USA as designated by the US census (Northeast, Midwest, South, and West); this is a categorical variable.
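A minimal sketch of the same data matrix as a pandas DataFrame (the numbers are simply the ones shown above; the column names are illustrative), with the variable types made explicit:

```python
import pandas as pd

# Rows are observations (cases), columns are variables.
flights = pd.DataFrame({
    "City": ["Chicago", "Houston", "Newark", "San Francisco"],   # identifier
    "no_flights": [350, 330, 306, 310],                          # discrete numerical
    "pct_ontime": [90.0, 96.0, 92.0, 93.0],                      # continuous numerical
    "region": ["Midwest", "South", "Northeast", "West"],         # categorical
})

# Make the categorical variable explicit so tools treat it as categories, not free text.
flights["region"] = flights["region"].astype("category")

print(flights.dtypes)
print(flights.describe(include="all"))
```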
Relationships between Variables
 When two variables show some connection with one another, they are called associated, or dependent, variables.
 The association can be further described as positive or negative.
 If two variables are not associated, they are said to be independent.
Data Observational Studies and Experiments
Observational Study
 In an observational study you collect data in
a way that does not directly interfere with
how the data arise, i.e. merely “observe”.
 We can only establish an association (or
correlation) between the explanatory and
response variables.
 If an observational study uses data from the
past, it's called a retrospective study,
whereas if data are collected throughout the
study, it's called a prospective study.
Experiment
 In an experiment, you randomly
assign subjects to various
treatments and can therefore
establish a causal connection
between the explanatory and
response variables.
Example: Suppose you want to evaluate the relationship between regularly working out and energy level. We can design the study as an observational study or as an experiment.
[Diagram: in the observational study, people who work out and people who don't are sampled and their average energy levels compared; in the experiment, sampled people are randomly assigned to work out or not, and the average energy levels of the two groups are compared.]
In the observational study we sample two types of people from the population, those who choose to work out regularly and those who don't, then we find the average energy level for the two groups of people and compare. On the other hand, in an experiment, we sample a group of people from the population and then randomly assign these people to two groups: those who will work out regularly throughout the course of the study and those who will not. The difference is that the decision of whether to work out or not is not left to the subjects, as in the observational study, but is instead imposed by the researcher. At the end we compare the average energy levels of the two groups.
Based on the observational study, even if we find a difference between the energy levels of these two groups of people, we really can't attribute this difference solely to working out, because there may be other variables that we didn't control for in this study that contribute to the observed difference. For example, people who are in better shape might be more likely to work out and also have high energy levels. However, in the experiment such variables that are likely to contribute to the outcome are equally represented in the two groups, due to random assignment. Therefore if we find a difference between the two averages, we can indeed make a causal statement attributing this difference to working out.
Question: What type of study is this, an observational study or an experiment?
"Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days."
Answer: This is an observational study, since the researchers merely observed the behavior of the girls (subjects) as opposed to imposing treatments on them. The study concludes there is an association between girls eating breakfast and being slimmer. There are 3 possible explanations:
#1: Eating breakfast causes girls to be thinner.
#2: Being thin causes girls to eat breakfast.
#3: A third variable is responsible for both. What could it be? An extraneous variable that affects both the explanatory and the response variable and makes it seem like there is a relationship between the two is called a confounding variable.
• Whether we can infer correlation or causation depends on the type of study that we are basing our decision on.
• Observational studies for the most part only allow us to make correlation statements, while experiments allow us to infer causation.
Correlation does not imply causation.
Data Sampling and Sources of Bias
Data Sampling
 Think about sampling something that you are
cooking - you taste (examine) a small part of what
you’re cooking to get an idea about the dish as a
whole.
 When you taste a spoonful of soup and decide the
spoonful you tasted isn’t salty enough, that’s called
exploratory analysis for the sample at hand.
 If you can generalize and conclude that your entire
soup needs salt, that’s making an inference.
 For your inference to be valid, the spoonful you
tasted (the sample) needs to be representative of
the entire pot (the population).
 If your spoonful comes only from the surface and
the salt is collected at the bottom of the pot, what
you tasted is probably not representative of the
whole pot.
 On the other hand, if you first stir the soup
thoroughly before you taste, your spoonful will
more likely be representative of the whole pot.
(Key ideas: exploratory analysis, inference, representative sample.)
Sources of Sampling Bias
 Convenience sample bias: Individuals who are easily accessible are more likely to be included in the sample. Example: say you want to find out how people in your city feel about a recent increase in public transportation costs. If you only poll people in your neighborhood, as opposed to a representative sample from the entire city, your study will suffer from convenience sample bias.
 Voluntary response bias: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the entire population. Example: say you place polling machines at all bus stops and metro stations in your city, but only those affected by the price increase choose to actually take the time to vote and express their opinion on the recent increase in public transportation fares. The people who respond to such a poll are not representative of the entire population.
 Non-response bias: If only a (non-random) fraction of the randomly sampled people choose to respond to a survey, the sample is no longer representative of the entire population. Example: say you take a random sample of individuals from your city and attempt to survey them, but certain segments of the population, say those from a lower socio-economic status, are less likely to respond to the survey; the sample is then not representative of the entire population.
Sampling bias, a historical example: Landon vs. FDR
In 1936, Landon sought the Republican presidential nomination, opposing the re-election of FDR.
A popular magazine of the time (1936), the Literary Digest, polled about 10 million Americans and got responses from about 2.4 million. To put things in perspective, nowadays reliable polls in the USA poll about 1,500 to 3,000 people, so the "10 million" poll was a very large sample.
The poll showed that Landon would likely be the overwhelming winner and that FDR would get only 43% of the votes.
Election result: FDR won, with 62% of the votes.
What went wrong with
the Literary Digest Poll?
The magazine had surveyed
 its own readers:
 registered automobile owners,
 and registered telephone users.
These groups had incomes well above the national average of the
day (remember, this is Great Depression era) which resulted in lists
of voters far more likely to support Republicans than a truly typical
voter of the time, i.e. the sample was not representative of the
American population at the time.
The Literary Digest election
poll was based on a sample
size of 2.4 million, which is huge,
but since the sample was
biased, the sample did not
yield an accurate prediction.
Back to the soup analogy:
If the soup is not well stirred, it
doesn’t matter how large a
spoon you have, it will still not
taste right. If the soup is well
stirred, a small spoon will suffice
to test the soup.
Data Sampling Methods & Experimental Design
Obtaining Good Samples
 Almost all statistical methods are based on the notion of implied randomness.
 If observational data are not collected in a random framework from a population, the estimates produced by these statistical methods and the errors associated with those estimates are not reliable.
 The most commonly used random sampling techniques are simple, stratified, and cluster sampling.
Simple Random Sample
Randomly select cases
from the population,
where there is no
implied connection
between the points
that are selected.
Stratified Sample
Strata are made up of
similar observations. We
take a simple random
sample from each
stratum.
Cluster Sample
Clusters are usually not
made up of homogeneous
observations, and we take
a simple random sample
from a random sample of
clusters. Usually preferred
for economical reasons.
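A small sketch of the three sampling schemes just described, using numpy and pandas on a hypothetical population table (all names and sizes here are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: passengers with a region (stratum) and a departure airport (cluster).
population = pd.DataFrame({
    "passenger_id": np.arange(10_000),
    "region": rng.choice(["Northeast", "Midwest", "South", "West"], size=10_000),
    "airport": rng.choice(["ORD", "IAH", "EWR", "SFO", "DEN", "LAX"], size=10_000),
})

# Simple random sample: every case has the same chance of selection.
simple = population.sample(n=500, random_state=42)

# Stratified sample: a simple random sample drawn from each stratum (region).
stratified = population.groupby("region", group_keys=False).apply(
    lambda stratum: stratum.sample(n=125, random_state=42)
)

# Cluster sample: pick a few whole clusters (airports) at random, then sample within them.
chosen_airports = rng.choice(population["airport"].unique(), size=2, replace=False)
cluster = population[population["airport"].isin(chosen_airports)].sample(n=500, random_state=42)

print(len(simple), len(stratified), len(cluster))
```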
(data) Experimental Design Concepts
• Control: Compare treatment of interest to a control group.
• Randomize: Randomly assign subjects to treatments, and randomly sample from the population
whenever possible.
• Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the
entire study.
• Block: If there are variables that are known or suspected to affect the response variable, first
group subjects into blocks based on these variables, and then randomize cases within each block
to treatment groups.
• Blocking example: We would like to design an experiment to investigate whether energy gels make you
run faster. Treatment: energy gel; Control: no energy gel.
• It is suspected that energy gels might affect pro and amateur athletes differently, therefore we
block for pro status:
• Divide the sample to pro and amateur. Randomly assign pro athletes to treatment and
control groups. Randomly assign amateur athletes to treatment and control groups.
• Pro/amateur status is equally represented in resulting treatment and control groups
Random Assignment vs. Random Sampling
Hypothesis Testing
Two competing claims (example: gender discrimination in promotions)
Claim 1. "There is nothing going on." Promotion and gender are independent, there is no gender discrimination, and the observed difference in proportions is simply due to chance. => Null hypothesis
Claim 2. "There is something going on." Promotion and gender are dependent, there is gender discrimination, and the observed difference in proportions is not due to chance. => Alternative hypothesis
A court trial as a hypothesis test
Hypothesis testing is very much like a court trial.
 H0: Defendant is innocent
 HA: Defendant is guilty
 We then present the evidence: collect data.
 Then we judge the evidence - “Could these data plausibly have happened by
chance if the null hypothesis were true?”
 If they were very unlikely to have occurred, then the evidence raises
more than a reasonable doubt in our minds about the null
hypothesis.
 Ultimately we must make a decision. How unlikely is unlikely?
 If the evidence is not strong enough to reject the assumption of innocence, the
jury returns with a verdict of “not guilty”.
 The jury does not say that the defendant is innocent, just that there
is not enough evidence to convict.
 The defendant may, in fact, be innocent, but the jury has no way of
being sure.
 Said statistically, we fail to reject the null hypothesis.
 We never declare the null hypothesis to be true, because we simply
do not know whether it’s true or not.
 Therefore we never “accept the null hypothesis”.
 In a trial, the burden of proof is on the prosecution.
 In a hypothesis test, the burden of proof is on the unusual claim.
 The null hypothesis is the ordinary state of affairs (the status quo), so it’s the
alternative hypothesis that we consider unusual and for which we must gather
evidence.
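As a concrete illustration of the promotion-and-gender claims above, here is a small simulation-based hypothesis test in Python. The counts are hypothetical (not from the presentation): suppose 30 male and 30 female personnel files were reviewed and 21 men vs. 14 women were promoted; the simulation asks how often a difference at least that large would arise by chance alone if promotion were independent of gender.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 male and 30 female files, 21 vs. 14 promotions.
promoted_male, n_male = 21, 30
promoted_female, n_female = 14, 30
observed_diff = promoted_male / n_male - promoted_female / n_female

# Null hypothesis: promotion is independent of gender.
# Simulate by shuffling the promotion labels across all 60 files many times.
promotions = np.array([1] * (promoted_male + promoted_female)
                      + [0] * (n_male + n_female - promoted_male - promoted_female))

diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(promotions)
    diffs.append(shuffled[:n_male].mean() - shuffled[n_male:].mean())

# p-value: fraction of shuffles with a difference at least as extreme as the observed one.
p_value = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"observed difference = {observed_diff:.3f}, simulated p-value = {p_value:.3f}")
```

A small p-value means the observed difference would rarely happen by chance under the null hypothesis, so we reject it; a large p-value means we fail to reject the null hypothesis.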
Statistical Inference and Prediction
Statistical Inference
Statistical inference is the process
of drawing conclusions from data
that are subject to random variation,
for example, observational errors or
sampling variation. Initial
requirements of such a system of
procedures for inference and
induction are that the system should
produce reasonable answers when
applied to well-defined situations and
that it should be general enough to
be applied across a range of
situations. Inferential statistics are
used to test hypotheses and make
estimations using sample data.
The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.
Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
Confidence Level
Confidence level is the probability value (1-alpha) associated with a
confidence interval. It is often expressed as a percentage. For example,
say alpha = 0.05 = 5%, then the confidence level is equal to
(1-0.05) = 0.95, i.e. a 95% confidence level.
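A short sketch of a 95% confidence interval for a mean, assuming a large random sample so the normal approximation applies (the data here are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated sample: e.g., 400 observed departure delays in minutes (illustrative only).
sample = rng.exponential(scale=12.0, size=400)

n = len(sample)
mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)      # standard error of the mean

z = 1.96                                        # critical value for a 95% confidence level
lower, upper = mean - z * std_err, mean + z * std_err
print(f"95% CI for the mean delay: ({lower:.2f}, {upper:.2f}) minutes")
```

With alpha = 0.05 the confidence level is 1 - 0.05 = 0.95, matching the definition above.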
Prediction
In statistics, prediction
is the process of
determining the
magnitude of statistical
variates at some future
point of time. For data
analysis context the word
may also occur in slightly
different meanings; e.g. in a
regression equation expressing a dependent variate y in terms of
independent x's, the value given for y by specified values of the x's is
called the "predicted" value even when no temporal element is involved.
Prediction vs. Inference
 Using data to predict an event that has yet to occur is statistical
prediction. Inferring the value of a population quantity such as the
average income of a country or the proportion of eligible voters who say
they will vote ‘yes’ is statistical inference.
 Prediction and inference answer different types of data analysis
questions.
Examples of predictions (because the events have not occurred at the time
of writing this content):
 The probability that Chicago Bulls will win the 2018 NBA playoffs is __.
 The probability that Republican Party will win the 2020 Presidential
election is __.
Examples of inferences: (because the questions involve estimating a
population value.)
 The proportion of NBA fans that currently believe Chicago Bulls will win
the 2018 playoffs is __.
 The proportion of eligible voters that currently state they will vote for
Republican Party in the 2020 Presidential election is __.
Before you proceed to Part 3 :
Please quickly review the
Appendix section to familiarize
with terms and terminology that will
be used in the rest of the presentation.
Part 3
Big Data Analysis
 Concepts and References
 Use Cases in Airline Industry
Jump here directly if you are an advanced user who understands data and knows how to analyze data.
Big Data Analysis: Concepts and Airline Industry Use Cases
Big Data application areas: Data Analysis & Machine Learning
Data Analysis
Models derive useful analytical information so humans can better understand it.
Examples:
 Does spending more money on marketing & sales in area "X" vs. area "Y" make the company more profitable?
 What does the customer want? (e.g., customer survey)
Machine Learning
Models allow machines (software programs & applications) to make "real-time" (automatic) decisions.
Examples:
 Google search / Amazon product recommendations, Facebook news feed, etc.
 Geographic (GPS-based) specific advertisements, or event-based (holiday, weather, traffic) promotions.
Bayesian Approach and Bayes Rule
Bayesian Approach
Differences between Bayesians and Non-Bayesians
Bayes Theorem
Example: Email Spam Filtering. The probability that an email message is spam, given the words in the email message, is expressed as below.
With Bayes:
• A key benefit: The ability to incorporate prior knowledge
• A key weakness: The need to incorporate prior knowledge
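The slide's formula itself is an image that is not reproduced in this text; the standard Bayes-rule form of the spam-filter expression it describes, with the usual naive assumption that words occur independently given the class, is:

```latex
P(\mathrm{spam} \mid \mathrm{words})
  = \frac{P(\mathrm{words} \mid \mathrm{spam})\, P(\mathrm{spam})}{P(\mathrm{words})}
  \;\propto\; P(\mathrm{spam}) \prod_{i} P(w_i \mid \mathrm{spam})
```

Here the prior P(spam) is exactly the "prior knowledge" the slide lists as both the key benefit and the key weakness of the Bayesian approach.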
Bayesian Belief Network Representation of Airline Passenger Behavior
Source: Booz Allen Hamilton
The basis of this slide is the paper titled "Airline Analytics: Decision Analytics Center of Excellence" by Cenk Tunasar, Ph.D., and Alex Cosmas of Booz Allen Hamilton.
In the paper, the authors state that Booz Allen used the Big Data infrastructure of an airline client and were able to analyze large datasets containing more than 3 years' worth of passenger data (approximately 100 GB+). Booz Allen generated hypotheses to test from the Big Data set, including, but not limited to:
Airline Market Performance
• What are the client’s natural market types and their
distinct attributes?
• What is the client’s competitive market health?
• Where does the client capture fare premiums or fare
discounts relative to other carriers?
Passenger Behavior
• What is the variability of booking curves by market type?
• What are the intrinsic attributes of markets with the
highest earn and highest burn rates?
• Can predictive modeling be developed for reservation
changes and no-show rates for individual passengers on
individual itineraries?
Consumer Choice
• What is the demand impact of increasing connection time?
• What is the effect of direct versus connecting
itineraries on passenger preference?
A use case in the Airline industry.
(URL: http://www.boozallen.com/media/file/airline-analytics-brochure.pdf)
Bayesian Ideas are very important for Big Data Analysis
Bayesian Themes
Prediction
Average over
unknowns,
don't maximize.
Uncertainty
Probability coherently
represents uncertainty.
Combine Information
Hierarchical models
combine information
from multiple sources.
Source: Steve Scott (Google Inc.)
Sparsity
Sparsity plays an important role in
modeling Big Data
 Models are "big" because of a small
number of factors with many levels.
 Big data problems are often big
collections of small data problems.
Multi-armed
Bandits Problem
Multi-armed bandit
problem is the problem a
gambler faces at a row of
slot machines, sometimes
known as "one-armed
bandits", when deciding
which slot machines to
play, how many times to
play each machine and in
which order to play them.
When played, each
machine provides a
random reward from a
distribution specific to
that machine. The
objective of the gambler
is to maximize the sum of
rewards earned through a
sequence of lever pulls.
Source: Wikipedia
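The multi-armed bandit setting above maps directly onto a Bayesian strategy known as Thompson sampling: keep a posterior over each machine's payout rate, sample once from each posterior, and play the arm whose sample looks best, so that exploration is driven by parameter uncertainty. A minimal Beta-Bernoulli sketch (with made-up payout probabilities) is:

```python
import numpy as np

rng = np.random.default_rng(7)

true_payout = [0.04, 0.06, 0.09]          # hidden success rate of each "slot machine" (made up)
wins = np.ones(3)                         # Beta posterior parameters, starting from a uniform prior
losses = np.ones(3)

total_reward = 0
for _ in range(5_000):
    sampled_rates = rng.beta(wins, losses)        # one draw per arm from its posterior
    arm = int(np.argmax(sampled_rates))           # play the arm whose draw looks best
    reward = rng.random() < true_payout[arm]      # observe a win or a loss
    wins[arm] += reward
    losses[arm] += 1 - reward
    total_reward += reward

print("posterior mean payout per arm:", np.round(wins / (wins + losses), 3))
print("total reward:", total_reward)
```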
Bayes Rule applied to Machine Learning
A use
case in
Airline
industry
Big Data Project at Southwest Airlines
The URL below provides a visual (interactive graphics) presentation of the Big Data project at Southwest Airlines
and how they used a Bayesian approach and Naive Bayes classification with the WEKA ("Waikato Environment for
Knowledge Analysis") tool to analyze the following questions:
1) What are the important factors that cause delays, and what are their weights?
2) What kind of weather (e.g., sunny, cloudy, snow, rain, etc.) causes weather delays?
3) Are there time periods during the day (e.g., early morning, morning, noon, etc.) that are
more prone to delays than others?
(URL: http://prezi.com/f3bsv9m6yl2g/big-data-project_southwest-airlines/)
Entirely driven by parameter uncertainty
Example: Bayesian-based "Search Optimization" on the Google File System (Source: Google Analytics)
Source: Steve Scott (Google Inc.)
Personalization as a "Big Logistic Regression":
Search words: "Chicago to Houston today"
Search words: "Chicago to Houston flight tomorrow"
Search words: "Chicago to Houston cheapest"
Meta Analysis
Meta-analysis refers to methods that focus on contrasting and combining results from different studies, in the hope of identifying patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies. In its simplest form, meta-analysis is normally done by identification of a common measure of effect size. A weighted average of that common measure is the output of a meta-analysis. The weighting is related to sample sizes within the individual studies. More generally there are other differences between the studies that need to be allowed for, but the general aim of a meta-analysis is to more powerfully estimate the true effect size, as opposed to a less precise effect size derived in a single study under a given single set of assumptions and conditions. Source: Wikipedia
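A minimal sketch of the "weighted average of a common effect measure" idea, using inverse-variance (fixed-effect) weights; the three study effects and standard errors below are invented purely for illustration:

```python
import numpy as np

# Hypothetical effect sizes (e.g., price elasticities) and standard errors from three studies.
effects = np.array([-1.10, -0.85, -1.30])
std_errors = np.array([0.20, 0.15, 0.30])

weights = 1.0 / std_errors**2                     # more precise studies get more weight
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled_effect:.3f} (SE {pooled_se:.3f})")
print(f"95% CI: ({pooled_effect - 1.96 * pooled_se:.3f}, {pooled_effect + 1.96 * pooled_se:.3f})")
```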
Advantages
 Results can be generalized to a larger population,
 The precision and accuracy of estimates can be improved as
more data is used. This, in turn, may increase the statistical
power to detect an effect.
 Inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between-study heterogeneity?
 Hypothesis testing can be applied to summary estimates.
A use
case in
Airline
industry
Price Elasticities of Demand for Passenger Air Travel
A good discussion of the topic is detailed in the paper listed below:
Price Elasticities of Demand for Passenger Air Travel: A Meta-Analysis
by Martijn Brons, Eric Pels, Peter Nijkamp, Piet Rietveld
of Tinbergen Institute
(URL: http://papers.tinbergen.nl/01047.pdf)
Meta Analysis and Big Data
A good discussion of the topic is detailed in the article listed below:
Meta-Analysis: The Original 'Big Data‘
by Blair T. Johnson , Professor at University of Connecticut
(URL: http://meta-analysis.ning.com/profiles/blogs/meta-analysis-the-original-big-data)
Effect Size
Effect Size
Effect size is a measure of the strength of a phenomenon (for
example, the change in an outcome after experimental
intervention). An effect size calculated from data is a descriptive
statistic that conveys the estimated magnitude of a relationship
without making any statement about whether the apparent
relationship in the data reflects a true relationship in the
population. In that way, effect sizes complement inferential
statistics such as p-values. Among other uses, effect size measures
play an important role in meta-analysis studies that summarize
findings from a specific area of research, and in statistical power
analyses. Source: Wikipedia
Example: A weight loss program may boast that it leads to an
average weight loss of 30 pounds. In this case, 30 pounds is the
claimed effect size. if the weight loss program results in an average
loss of 30 pounds, it is possible that every participant loses exactly
30 pounds, or half the participants lose 60 pounds and half lose
no weight at all.
"Small", “Medium", “Large" Effect Sizes
The terms "small", "medium" and "large" applied to effect sizes are relative. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Cohen's conventional criteria for small, medium, and large are near-ubiquitous across many fields. Power analysis or sample-size planning requires an assumed population effect size.
For Cohen's d, an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect, and 0.8 or above a "large" effect.
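For reference, Cohen's d (one of the most commonly used standardized effect-size measures referred to above) compares two group means in units of their pooled standard deviation:

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

So a "large" effect of d = 0.8 means the two group means differ by 0.8 pooled standard deviations.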
Monte Carlo Method
Monte Carlo Method
Monte Carlo methods (or experiments) are a broad
class of computational algorithms that rely on
repeated random sampling to obtain numerical results;
typically one runs simulations many times over in order
to obtain the distribution of an unknown probabilistic
entity. The name comes from the resemblance of the
technique to the act of playing and recording results in
a real gambling casino. They are often used in physical
and mathematical problems and are most useful when
it is difficult or impossible to obtain a closed-form
expression, or infeasible to apply a deterministic
algorithm. Monte Carlo methods are mainly used in
three distinct problem classes: optimization, numerical
integration and generation of draws from a probability
distribution. Source: Wikipedia
Monte Carlo methods vary, but tend to follow a particular
pattern:
 Define a domain of possible inputs.
 Generate inputs randomly from a probability distribution
over the domain.
 Perform a deterministic computation on the inputs.
 Aggregate the results.
For example: Consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas of π/4, the value of π can be approximated using a Monte Carlo method (a code sketch follows the steps below):
 Draw a square on ground, then inscribe a circle within it.
 Uniformly scatter some objects of uniform size (grains of
rice or sand) over the square.
 Count the number of objects inside the circle and the total
number of objects.
 The ratio of the two counts is an estimate of the ratio of
the two areas, which is π/4. Multiply the result by 4 to
estimate π.
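The same four steps, written out as a short Python simulation (random points in the unit square instead of grains of rice):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 1_000_000

# 1. Domain of possible inputs: points (x, y) in the unit square.
# 2. Generate inputs randomly from a uniform distribution over that domain.
x, y = rng.random(n), rng.random(n)

# 3. Deterministic computation: is the point inside the inscribed circle of radius 1/2?
inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25

# 4. Aggregate the results: the fraction inside estimates pi/4.
pi_estimate = 4 * inside.mean()
print(pi_estimate)        # close to 3.1416 for large n
```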
Monte Carlo Methods for Bayesian Analysis and Big Data
A good discussion of the topic is detailed in the paper listed below:
A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
by David Madigan, Professor and Dean at Columbia University
and Greg Ridgeway, Deputy Director at National Institute of Justice.
(URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753529/ )
A use case in the Airline industry: Flight Delay-Cost (initial delay "type I" and propagated delay "type II") and Dynamic Simulation Analysis for Airline Schedule Optimization
A good discussion of the topic is detailed in the Ph.D. thesis listed below. The reliability modeling approach developed in this project (to enhance the dispatch reliability of the Australian X airline fleet) is based on probability distributions and Monte Carlo Simulation (MCS) techniques. Initial (type I) delay and propagated (type II) delay are adopted as the criteria for data classification and analysis.
"Flight Delay-Cost Simulation Analysis and Airline Schedule Optimization"
by Duojia Yuan of RMIT University, Victoria, Australia
(URL: http://researchbank.rmit.edu.au/eserv/rmit:9807/Yuan.pdf)
A general use case for Customer Satisfaction and Customer Loyalty: Concurrent Reinforcement Learning from Customer Interactions
In the paper below, the authors present a framework for concurrent reinforcement learning, a new method of a company interacting concurrently with many customers with an objective function to maximize revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website). The proposed framework uses a variant of temporal-difference learning to learn efficiently from partial interaction sequences. The goal is to maximize the future rewards for each customer, given their history of interactions with the company. The framework differs from traditional reinforcement learning paradigms due to the concurrent nature of the customer interactions; this distinction leads to new considerations for reinforcement learning algorithms.
"Concurrent Reinforcement Learning from Customer Interactions"
by David Silver of University College London (published 2013) and Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall of Causata Ltd.
(URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/concurrent-rl.pdf)
Bayes and Big Data: Consensus Monte Carlo and Nonparametric Bayesian Data Analysis
A good discussion of the topic is detailed in the article listed below:
“Bayes and Big Data: The Consensus Monte Carlo Algorithm”
by
 Robert E. McCulloch, of University of Chicago, Booth School of Business
 Edward I. George, of University of Pennsylvania, The Wharton School
 Steven L. Scott, of Google, Inc
 Alexander W. Blocker, of Google, Inc
 Fernando V. Bonassi, Google, Inc.
(URL: http://www.rob-mcculloch.org/some_papers_and_talks/papers/working/consensus-mc.pdf)
Consensus Monte Carlo
For Bayesian methods to work in a MapReduce / Hadoop environment,
we need algorithms that require very little communication.
Need: A useful definition of “big data” is data that is too big to fit on a
single machine, either because of processor, memory, or disk
bottlenecks. Graphics Processing Units (GPU) can alleviate the processor
bottleneck, but memory or disk bottlenecks can only be alleviated by
splitting “big data” across multiple machines. Communication between
large numbers of machines is expensive (regardless of the amount of
data being communicated), so there is a need for algorithms that
perform distributed approximate Bayesian analyses with minimal
communication.
Consensus Monte Carlo operates by running a separate Monte Carlo
algorithm on each machine, and then averaging the individual Monte
Carlo draws. Depending on the model, the resulting draws can be nearly
indistinguishable from the draws that would have been obtained by
running a single machine algorithm for a very long time.
Source: Steve Scott (Google Inc.)
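A toy sketch of the consensus idea for a case where it is exact (a Gaussian mean with known noise variance): each "machine" draws from the posterior given only its shard of the data, and the shard-level draws are combined by precision-weighted averaging. This illustrates only the averaging step, under simplifying assumptions, not the paper's full algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)

# Full data set: noisy observations of an unknown mean (sigma known).
true_mu, sigma = 2.0, 4.0
data = rng.normal(true_mu, sigma, size=100_000)

shards = np.array_split(data, 10)          # pretend each shard lives on its own worker
n_draws = 5_000
prior_mu, prior_var = 0.0, 100.0

shard_draws, shard_precisions = [], []
for shard in shards:
    # Conjugate normal posterior for the mean, using only this shard
    # (each worker takes a 1/10 share of the prior, as in consensus Monte Carlo).
    post_prec = len(shard) / sigma**2 + (1.0 / prior_var) / len(shards)
    post_mean = (shard.sum() / sigma**2 + (prior_mu / prior_var) / len(shards)) / post_prec
    shard_draws.append(rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=n_draws))
    shard_precisions.append(post_prec)

# Consensus step: precision-weighted average of the per-shard draws.
w = np.array(shard_precisions)[:, None]
consensus = (w * np.array(shard_draws)).sum(axis=0) / w.sum()

print("consensus posterior mean ~", np.round(consensus.mean(), 3))
```

Only the final draws need to move between machines, which is exactly the "minimal communication" requirement described above.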
Non-Parametric Bayesian Data Analysis
Non-parametric analysis refers to comparative properties (statistics) of the data, or population, which do not include the typical parameters of mean, variance, standard deviation, etc.
Need / Motivation: Models are never correct for real-world data.
Non-Parametric Modelling of Large Data Sets
What is a nonparametric model?
 A parametric model where the number of parameters increases with data.
 A really large parametric model.
 A model over infinite-dimensional function or measure spaces.
 A family of distributions that is dense in some large space.
Why nonparametric models in the Bayesian theory of learning?
 A broad class of priors that allows data to "speak for itself".
 Side-steps model selection and averaging.
A use case in the Airline industry: Airline Delays in International Air Cargo Logistics
A good discussion of the topic is detailed in the paper below:
"Nonparametric Bayesian Analysis in International Air Cargo Logistics"
by Yan Shang of the Fuqua School of Business, Duke University
(URL: https://bayesian.org/abstracts/5687 )
Homoscedasticity vs. Heteroskedasticity
Homoscedasticity
In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance.
A sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance.
Heteroskedasticity
In regression analysis, heteroskedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroskedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance.
A collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity.
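A small simulation contrasting the two situations: the residual spread is constant in the homoscedastic case and grows with x in the heteroscedastic case (all numbers are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.linspace(1, 10, 2_000)

# Homoscedastic: error variance is the same for every x.
y_homo = 3 + 2 * x + rng.normal(0, 1.0, size=x.size)

# Heteroscedastic: error variance grows with x.
y_hetero = 3 + 2 * x + rng.normal(0, 0.4 * x, size=x.size)

for name, y in [("homoscedastic", y_homo), ("heteroscedastic", y_hetero)]:
    residuals = y - (3 + 2 * x)                  # residuals around the true line
    low = residuals[x < 5].std()
    high = residuals[x >= 5].std()
    print(f"{name:>16}: residual SD for x<5 = {low:.2f}, for x>=5 = {high:.2f}")
```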
Benford’s Law
Benford’s Law
Benford's Law, also called the First-Digit Law, refers to the
frequency distribution of digits in many (but not all) real-life
sources of data. In this distribution, the number 1 occurs as the
leading digit about 30% of the time, while larger numbers occur in
that position less frequently: 9 as the first digit less than 5% of the
time. Benford's Law also concerns the expected distribution for
digits beyond the first, which approach a uniform distribution.
This result has been found to apply to a wide variety of data sets,
including electricity bills, street addresses, stock prices, population
numbers, death rates, lengths of rivers, physical and mathematical
constants, and processes described by power laws (which are very
common in nature). It tends to be most accurate when values are
distributed across multiple orders of magnitude. Source: Wikipedia
Numerically, the leading digits have the following distribution in Benford's Law, where d is the leading digit and P(d) the probability: P(d) = log10(1 + 1/d), which gives P(1) ≈ 30.1%, P(2) ≈ 17.6%, ... , P(9) ≈ 4.6%.
Benford's Law (base 10)
The graph on the original slide shows Benford's Law for base 10. There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization from the leading 1 digit to the leading n digits. A set of numbers is said to satisfy Benford's Law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability P(d) = log10(1 + 1/d).
Facts
Benford's Law holds true for a data set that grows exponentially (e.g., doubles, then doubles again in the same time span). It is best applied to data sets that span multiple orders of magnitude. The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. The theory also does not hold true when a data set covers only one or two orders of magnitude.
Big Data Application: Fraud Detection
 Helps identify duplicates and other data-pattern anomalies in large data sets.
 Enables auditors and data analysts to focus on possible anomalies in very large data sets.
 It does not "directly" prove that error or fraud exists, but identifies items that deserve further study on statistical grounds.
 Mainly used for setting future auditing plans; a low-cost entry point for continuous analysis of very large data sets.
 Not good for sampling: results in very large selection sizes.
 As technology matures, findings of fraud will increase (not decrease).
 Not all data sets are suitable for analysis.
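A quick sketch of how an auditor might screen a column of transaction amounts against the Benford distribution; the amounts generated here are synthetic, purely to exercise the code, and the flag threshold is an arbitrary screening choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "transaction amounts" spanning several orders of magnitude.
amounts = np.exp(rng.uniform(np.log(10), np.log(1_000_000), size=50_000))

# Observed first-digit frequencies.
first_digits = np.array([int(str(a)[0]) for a in amounts.astype(int)])
observed = np.array([(first_digits == d).mean() for d in range(1, 10)])

# Expected frequencies under Benford's Law: P(d) = log10(1 + 1/d).
expected = np.log10(1 + 1 / np.arange(1, 10))

for d in range(1, 10):
    flag = "  <-- review" if abs(observed[d - 1] - expected[d - 1]) > 0.02 else ""
    print(f"digit {d}: observed {observed[d - 1]:.3f}  expected {expected[d - 1]:.3f}{flag}")
```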
A use case in the Airline industry
A financial/accounting auditor can evaluate very large data sets (in a continuous-monitoring or continuous-audit environment) that represent a continuous stream of transactions, such as the sales made by a (third-party) online retailer or the internal airline reservation system.
Fraud Detection in Airline Ticket Purchases
Christopher J. Rosetti, CPA, CFE, DABFA of KPMG, in his presentation titled "SAS 99: Detecting Fraud Using Benford's Law" given to the FAE/NYSSCPA Technology Assurance Committee on March 13, 2003, claims that United Airlines currently uses Benford's Law for fraud detection!
(URL: http://www.nysscpa.org/committees/emergingtech/law.ppt )
Multiple Hypothesis Testing
Multiple Testing Problem
Multiple testing problem occurs when one considers a set of
statistical inferences simultaneously or infers a subset of
parameters selected based on the observed values. Errors in
inference, including confidence intervals that fail to include their
corresponding population parameters or hypothesis tests that
incorrectly reject the null hypothesis are more likely to occur when
one considers the set as a whole. Source: Wikipedia
For example, one might declare that a coin was biased if in 10 flips it landed
heads at least 9 times. Indeed, if one assumes as a null hypothesis that the
coin is fair, then the probability that a fair coin would come up heads at least
9 out of 10 times is (10 + 1) × (1/2)10 = 0.0107. This is relatively unlikely, and
under statistical criteria such as p-value < 0.05, one would declare that the
null hypothesis should be rejected — i.e., the coin is unfair.
A multiple-comparisons problem arises if one wanted to use this test (which
is appropriate for testing the fairness of a single coin), to test the fairness of
many coins. Imagine if one were to test 100 fair coins by this method. Given
that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107,
one would expect that in flipping 100 fair coins ten times each, to see a
particular (i.e., pre-selected) coin comes up heads 9 or 10 times would still be
very unlikely, but seeing any coin behave that way, without concern for which
one, would be more likely than not. Precisely, the likelihood that all 100 fair
coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore,
the application of our single-test coin-fairness criterion to multiple
comparisons would be more likely to falsely identify at least one fair coin as
unfair.
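The arithmetic above can be checked in a few lines; the Bonferroni-style correction at the end is not mentioned in the text and is added here only as the most common remedy for the multiple-comparisons problem.

```python
from math import comb

# P(a fair coin shows >= 9 heads in 10 flips) = (C(10,9) + C(10,10)) / 2**10 = 11/1024
p_single = (comb(10, 9) + comb(10, 10)) / 2 ** 10
print(round(p_single, 4))                      # 0.0107

# Probability that at least one of 100 fair coins looks "unfair" by this criterion
p_any_false_alarm = 1 - (1 - p_single) ** 100
print(round(p_any_false_alarm, 2))             # ~0.66, i.e. more likely than not

# Bonferroni correction: test each coin at alpha / number_of_tests instead of alpha
alpha, n_tests = 0.05, 100
per_test_threshold = alpha / n_tests           # 0.0005
print(p_single < per_test_threshold)           # False -> no single coin is flagged
```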
Multiple Hypothesis Testing
A use
case in
Airline
industry
Predicting Flight Delays using Multiple Hypothesis Testing
A good discussion of the topic is detailed in the paper listed below:
Predicting Flight Delays
by Dieterich Lawson and William Castillo of Stanford University
(URL: http://cs229.stanford.edu/proj2012/CastilloLawson-PredictingFlightDelays.pdf )
Also detailed in the book
“Big Data for Chimps: A Seriously Fun guide to Terabyte-scale data processing“
by the same author (Dieterich Lawson) and Philip Kromer. Sample source code for
modelling in Matlab is also provided by Dieterich Lawson and can be found at
URL: https://github.com/infochimps-labs/big_data_for_chimps
The German Tank Problem
The German Tank Problem
The problem of estimating the
maximum of a discrete uniform
distribution from sampling without
replacement is known in English as the
German tank problem, due to its
application in World War II to the
estimation of the number of German
tanks. The analyses illustrate the
difference between frequentist
inference and Bayesian inference.
Estimating the population maximum
based on a single sample yields
divergent results, while the estimation
based on multiple samples is an
instructive practical estimation
question whose answer is simple but
not obvious. Source: Wikipedia
During World War II, production of
German tanks such as the Panther
(below photo) was accurately
estimated by Allied intelligence using
statistical methods.
Example: Suppose an intelligence officer has spotted k = 4 tanks with
serial numbers, 2, 6, 7, and 14, with maximum observed serial number, m = 14.
The unknown total number of tanks is called N.
The frequentist approach estimates the total
number of tanks as N ≈ m + m/k − 1, i.e. the maximum
observed serial number plus the average gap between
observed serial numbers; for m = 14 and k = 4 this gives 16.5.
The Bayesian analysis instead yields a probability mass
function for the number of tanks, from which the number of
tanks can be estimated, e.g. by its posterior mean, which for
this example is (m − 1)(k − 1)/(k − 2) = 19.5.
This distribution has positive skewness,
related to the fact that there are at least 14 tanks.
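As a hedged sketch of the two estimators discussed above, applied to the observed serial numbers 2, 6, 7 and 14 (the posterior below uses the standard non-informative treatment of the problem, truncated at an arbitrary upper bound):

```python
from math import comb

def frequentist_estimate(serials):
    """Maximum observed serial plus the average gap between serials: N ~ m + m/k - 1."""
    m, k = max(serials), len(serials)
    return m + m / k - 1

def bayesian_posterior(serials, n_max=2000):
    """Posterior over the total number of tanks N, proportional to 1/C(n, k)
    for n >= m (standard non-informative treatment); truncated at n_max."""
    m, k = max(serials), len(serials)
    weights = {n: 1.0 / comb(n, k) for n in range(m, n_max + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

serials = [2, 6, 7, 14]                     # k = 4 observed tanks, m = 14
print(frequentist_estimate(serials))        # 16.5
posterior = bayesian_posterior(serials)
posterior_mean = sum(n * p for n, p in posterior.items())
print(round(posterior_mean, 1))             # ~19.5, matching (m-1)(k-1)/(k-2)
```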
During the course of the war the Western Allies
made sustained efforts to determine the extent of
German production, and approached this in two major ways: conventional intelligence
gathering and statistical estimation. To do this they used the serial numbers on
captured or destroyed tanks. The principal numbers used were gearbox numbers, as
these fell in two unbroken sequences. Chassis and engine numbers were also used,
though their use was more complicated. Various other components were used to cross-
check the analysis. Similar analyses were done on tires, which were observed to be
sequentially numbered (i.e., 1, 2, 3, ..., N). The analysis of tank wheels yielded an
estimate for the number of wheel molds that were in use.
Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an
estimate of 270 produced in February 1944, substantially more than had previously
been suspected. German records after the war showed production for the month of
February 1944 was 276. The statistical approach proved to be far more accurate than
conventional intelligence methods, and the phrase German tank problem
became accepted as a descriptor for this type of statistical analysis.
Application in Big Data Analysis
 Similar to the German Tank Problem, we can estimate/analyze (large or
small) data sets that we don't have (or assume that we don't have).
 There is “leaky” data all around us; all we have to do is to think outside
the box. Companies very often don’t think about the data they publish
publicly and we can either extrapolate from that data (as in the German
Tank problem) or simply extract useful information from it.
 A company’s competitors' websites (publicly available data) can be a
valuable hunting ground. Think about whether you can use it to estimate
some missing data (as with the serial numbers) and/or combine that
data with other, seemingly innocuous, sets to produce some vital
information. If that information gives your company a commercial
advantage and is legal, then you should use it as part of your analysis.
Source: Wikipedia
Nyquist–Shannon Sampling Theorem
Nyquist–Shannon Sampling Theorem
The Nyquist Theorem, also known as the sampling theorem, is a
principle that engineers follow in the digitization of analog signals.
For analog-to-digital conversion (ADC) to result in a faithful
reproduction of the signal, slices, called samples, of the analog
waveform must be taken frequently. The number of samples per
second is called the sampling rate or sampling frequency.
Any analog signal consists of components at various frequencies.
The simplest case is the sine wave, in which all the signal energy is
concentrated at one frequency. In practice, analog signals usually
have complex waveforms, with components at many frequencies.
The highest frequency component in an analog signal determines
the bandwidth of that signal. The higher the frequency, the
greater the bandwidth, if all other factors are held constant.
Suppose the highest frequency component, in hertz, for a given
analog signal is fmax. According to the Nyquist Theorem, the
sampling rate must be at least 2fmax, or twice the highest analog
frequency component. The sampling in an analog-to-digital
converter is actuated by a pulse generator (clock). If the sampling
rate is less than 2fmax, some of the highest frequency components
in the analog input signal will not be correctly represented in the
digitized output. When such a digital signal is converted back to
analog form by a digital-to-analog converter, false frequency
components appear that were not in the original analog signal.
This undesirable condition is a form of distortion called aliasing.
Application in Big Data Analysis
 Even though the Nyquist–Shannon Sampling Theorem is about the
minimum sampling rate of a continuous wave, in Big Data Analysis
practice it tells you how frequently you need to collect Big Data
from sensors such as smart meters.
 The frequency of data collection for Big Data is the “Velocity”, one of
the three “V”s that define Big Data: Volume, Velocity and
Variety.
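A small sketch of the sampling-rate rule; the smart-meter figures are hypothetical. It also shows how an under-sampled frequency "folds back" into the measurable band, which is the aliasing described at left.

```python
def minimum_sampling_rate(f_max_hz):
    """Nyquist rate: sample at least twice the highest frequency component."""
    return 2.0 * f_max_hz

def aliased_frequency(f_signal_hz, f_sample_hz):
    """Apparent frequency of a sinusoid when sampled at f_sample_hz.
    Frequencies above f_sample/2 fold back into the band [0, f_sample/2]."""
    f = f_signal_hz % f_sample_hz
    return min(f, f_sample_hz - f)

# Hypothetical smart-meter example: load fluctuations up to 0.5 cycles per hour
print(minimum_sampling_rate(0.5))          # 1.0 sample per hour, at minimum

# Under-sampling example: a 7 Hz component sampled at 10 Hz appears as 3 Hz
print(aliased_frequency(7.0, 10.0))        # 3.0
```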
Left figure: X(f) (top blue) and XA(f) (bottom blue) are continuous Fourier transforms of two
different functions, x(t) and xA(t) (not shown). When the functions are sampled at rate fs, the
images (green) are added to the original transforms (blue) when one examines the discrete-time
Fourier transforms (DTFT) of the sequences. In this hypothetical example, the DTFTs are identical,
which means the sampled sequences are identical, even though the original continuous pre-
sampled functions are not. If these were audio signals, x(t) and xA(t) might not sound the same. But
their samples (taken at rate fs) are identical and would lead to identical reproduced sounds; thus
xA(t) is an alias of x(t) at this sample rate. In this example (of a bandlimited function), such aliasing
can be prevented by increasing fs such that the green images in the top figure do not overlap the
blue portion.
Right figure: Spectrum, Xs(f), of a properly sampled bandlimited signal (blue) and the adjacent DTFT
images (green) that do not overlap. A brick-wall low-pass filter, H(f), removes the images, leaves the
original spectrum, X(f), and recovers the original signal from its samples. Source: Wikipedia
Source: Wikipedia
Simpson’s Paradox
Simpson’s Paradox
Simpson's paradox is a paradox in which a trend that appears in
different groups of data disappears when these groups are
combined, and the reverse trend appears for the aggregate data.
This result is particularly confounding when frequency data are
unduly given causal interpretations. Simpson's Paradox disappears
when causal relations are brought into consideration.
Example:
 It's a well accepted rule of thumb that the larger the data set, the more
reliable the conclusions drawn. Simpson's paradox, however, slams a
hammer down on the rule and the result is a good deal worse than a
sore thumb. Simpson's paradox demonstrates that a great deal of care
has to be taken when combining small data sets into a large one.
Sometimes conclusions from the large data set may be the exact
opposite of conclusion from the smaller sets. Unfortunately, the
conclusions from the large set can (also) be wrong.
 The lurking variables (or confounding variable) in Simpson’s paradox are
categorical. That is, they break the observation into groups, such as the
city of origin for the airline flights. Simpson’s paradox is an extreme form
of the fact that the observed associations can be misleading when there
are lurking variables.
Status      Airline A   Airline B
On Time         718        5534
Delayed          74         532
Total           792        6066
From the table above: Airline A is delayed 9.3% (74/792) of the time;
Airline B is delayed only 8.8% (532/6066) of the time.
So Airline A would NOT be preferable.
                  Chicago                         Houston
Airline   On Time   Delayed   Total       On Time   Delayed   Total
A             497        62     559           221        12     233
B             694       117     811          4840       415    5255
From the table above:
From Chicago, Airline A is delayed 11.1% (62/559) of the time, but
Airline B is delayed 14.4% (117/811) of the time.
From Houston, Airline A is delayed 5.2% (12/233) of the time, but
Airline B is delayed 7.9% (415/5255) of the time.
Consequently, Airline A would be preferable.
This conclusion contradicts the previous conclusion.
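The reversal can be verified directly from the counts in the two tables; a minimal sketch:

```python
# Counts copied from the tables above: (on_time, delayed) per airline and city
data = {
    "A": {"Chicago": (497, 62), "Houston": (221, 12)},
    "B": {"Chicago": (694, 117), "Houston": (4840, 415)},
}

def delay_rate(on_time, delayed):
    return delayed / (on_time + delayed)

for airline, cities in data.items():
    # Per-city rates: Airline A is better in BOTH cities...
    for city, (ot, dl) in cities.items():
        print(airline, city, f"{delay_rate(ot, dl):.1%}")
    # ...yet worse in aggregate, because A flies mostly out of delay-prone Chicago
    total_ot = sum(ot for ot, _ in cities.values())
    total_dl = sum(dl for _, dl in cities.values())
    print(airline, "overall", f"{delay_rate(total_ot, total_dl):.1%}")
```

The aggregate comparison misleads because Airline A flies mostly out of Chicago, where both airlines are delayed more often; the city of origin is the lurking categorical variable.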
Simpson's Paradox is when Big Data sets CAN go wrong
A use
case in
Airline
industry
Airline On-Time Performance at Hub-and-Spoke Flight Networks
A good discussion of the topic is detailed in the paper listed below:
Simpson’s Paradox, Aggregation,
and Airline On-time Performance
by Bruce Brown of Cal State Polytechnic University
(URL: http://www.csupomona.edu/~bbrown/Brown_SimpPar_WEAI06.pdf)
 Big Data doesn’t happen overnight and there’s no
magic to it.
 Just deploying Big Data tools and analytical
solutions (R, SAS, and Tableau etc.) doesn’t
guarantee anything, as Simpson’s Paradox proves.
Machine Learning
Machine Learning and Data Mining
Machine learning concerns the construction and study
of systems that can learn from data. For example, a machine
learning system could be trained on email messages to learn to
distinguish between spam and non-spam messages. After learning,
it can then be used to classify new email messages into spam and
non-spam folders. The core of machine learning deals with
representation and generalization. Representation of data
instances and functions evaluated on these instances are part of
all machine learning systems. Generalization is the property that
the system will perform well on unseen data instances.
Source: Wikipedia
These two terms are commonly confused, as they often employ the same
methods and overlap significantly. They can be roughly defined as follows:
 Machine learning focuses on prediction, based on known properties
learned from the training data.
 Data mining focuses on the discovery of (previously) unknown properties
in the data. This is the analysis step of Knowledge Discovery in
Databases.
Terminology
 Classification: The learned attribute is categorical (“nominal”)
 Regression: The learned attribute is numeric
 Supervised Learning (“Training”) : We are given examples of inputs and
associated outputs and we learn the relationship between them.
 Unsupervised Learning (sometimes: “Mining”): We are given inputs, but
no outputs (such as unlabeled data) and we learn the “Latent” labels.
(example: Clustering).
Example:
Document
Classification
• Highly accurate predictions on real-time and continuous data (based
on rule sets learned earlier from training / historical data).
• Goal is not to uncover underlying “truth”.
• Emphasis on methods that can handle very large datasets for better
predictions.
A use
case in
Airline
industry
Southwest Airlines' use of Machine Learning for Airline Safety
The below URL details an article (published September 2013) on how
Southwest Airlines uses Machine Learning algorithms for Big Data
purposes to analyze vast amounts of very large data sets (which are
publicly accessible from NASA's DASHlink site) to find anomalies and
potential safety issues and to identify patterns to improve airline safety.
URL: http://www.bigdata-startups.com/BigData-startup/southwest-
airlines-uses-big-data-deliver-excellent-customer-service/
Primary Goal of Machine Learning: Why Machine Learning?
Use Case 1: Increase barrier to entry when product / service
quality is dependent on data
vs.
Use Case 2: Customize product / service to
increase engagement and profits.
Example: Customize sales page to increase
conversion rates for online products.
Classification Rules and Rule Sets
Rule Set to Classify Data
Golf Example: To Play or Not to Play
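The rule set itself appears as an image on the original slide; below is a minimal sketch of the commonly quoted rules for the classic "play golf" weather data set, which may differ in detail from the slide's version.

```python
def play_golf(outlook, humidity, windy):
    """Classic rule set learned from the 'weather / play golf' data set.
    outlook in {'sunny', 'overcast', 'rain'}, humidity in {'high', 'normal'},
    windy is a boolean. Returns 'Play' or "Don't Play"."""
    if outlook == "overcast":
        return "Play"
    if outlook == "sunny":
        return "Don't Play" if humidity == "high" else "Play"
    if outlook == "rain":
        return "Don't Play" if windy else "Play"
    raise ValueError("unknown outlook")

print(play_golf("sunny", "high", False))     # Don't Play
print(play_golf("overcast", "high", True))   # Play
print(play_golf("rain", "normal", True))     # Don't Play
```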
A use
case in
Airline
industry
Optimal Airline Ticket Purchasing (automated feature selection)
A good discussion of the topic is detailed in the paper listed below:
Optimal Airline Ticket Purchasing Using
Automated User-Guided Feature Selection
by William Groves and Maria Gini of University of Minnesota
(URL: http://ijcai.org/papers13/Papers/IJCAI13-032.pdf )
Classification Problems
Examples of Classification Problems:
• Text categorization (e.g., spam filtering)
• Market segmentation (e.g., predict if a
customer will respond to a promotion).
• Natural-language processing
(e.g., spoken language understanding).
Decision Tree Learning
Example:
Good vs. Evil
Decision tree learning uses
a decision tree as a
predictive model which
maps observations about
an item to conclusions
about the item's target
value. More descriptive
names for such tree models
are classification trees or
regression trees. In these
tree structures, leaves
represent class labels and
branches represent
conjunctions of features
that lead to those class
labels. In decision analysis,
a decision tree can be used
to visually and explicitly
represent decisions and
decision making. In data
mining, a decision tree
describes data but not
decisions; rather the
resulting classification tree
can be an input for decision
making. Source: Wikipedia
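As a hedged illustration of how such a learned tree maps an observation to a class label (the "Good vs. Evil" tree from the slide image is not reproduced; the features and splits below are invented):

```python
# A hypothetical classification tree, stored as nested dicts:
# internal nodes test one feature, leaves carry a class label.
tree = {
    "feature": "wears_cape",
    "branches": {
        True:  {"feature": "helps_people",
                "branches": {True: "Good", False: "Evil"}},
        False: {"feature": "laughs_maniacally",
                "branches": {True: "Evil", False: "Good"}},
    },
}

def classify(node, example):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):
        node = node["branches"][example[node["feature"]]]
    return node  # leaf = class label

print(classify(tree, {"wears_cape": True, "helps_people": False}))          # Evil
print(classify(tree, {"wears_cape": False, "laughs_maniacally": False}))    # Good
```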
Tree Size vs. Accuracy
Accuracy, Confusion Matrix, Overfitting, Good/Bad Classifiers, and Controlling Tree Size
Building an Accurate Classifier
Good and
Bad
Classifiers
A use
case in
Airline
industry
Predicting Airline Customers Future Values
A good discussion of the topic is detailed in the paper listed below:
Applying decision trees for value-based customer relations
management: Predicting airline customers future values
by Giuliano Tirenni, Christian Kaiser and Andreas Herrmann
of the Center for Business Metrics at University of St. Gallen, Switzerland.
(URL: http://ipgo.webs.upv.es/azahar/Pr%C3%A1cticas/articulo2.pdf )
Theory
Overfitting example
Accuracy and Confusion Matrix
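A minimal sketch of the accuracy and confusion-matrix bookkeeping these labels refer to; the true and predicted labels are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """matrix[a][p] = number of examples with true class a predicted as class p."""
    counts = Counter(zip(actual, predicted))
    return [[counts.get((a, p), 0) for p in labels] for a in labels]

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical "will this customer respond to an upgrade offer?" predictions
actual    = ["yes", "no", "no", "yes", "no", "yes", "no", "no"]
predicted = ["yes", "no", "yes", "yes", "no", "no", "no", "no"]

print(confusion_matrix(actual, predicted, ["yes", "no"]))  # [[2, 1], [1, 4]]
print(accuracy(actual, predicted))                         # 0.75
```

Overfitting shows up when this accuracy is high on the training data used to grow the tree but much lower on held-out data, which is why tree size is controlled.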
Entropy and Information Gain
Entropy
Question: How do you determine which
attribute best classifies data or a data set?
Answer: Entropy
Entropy is a measure of unpredictability
of information content.
Example : A poll on some political issue.
Usually, such polls happen because the
outcome of the poll isn't already known.
In other words, the outcome of the poll
is relatively unpredictable, and actually
performing the poll and learning the
results gives some new information;
these are just different ways of saying
that the entropy of the poll results is
large. Now, consider the case that the
same poll is performed a second time
shortly after the first poll. Since the
result of the first poll is already known,
the outcome of the second poll can be
predicted well and the results should not
contain much new information; in this case
the entropy of the second poll results is
small. Source: Wikipedia
Statistical quantity measuring how well an
attribute classifies the data.
 Calculate the information gain for each attribute.
 Choose attribute with greatest information gain.
If there are n equally probable possible messages,
then the probability p of each is 1/n
Information conveyed by a message is -log(p) = log(n) (base-2 logarithm).
For example, if there are 16 messages, then log(16) = 4
and we need 4 bits to identify/send each message.
In general, if we are given a probability distribution
P = (p1, p2, .., pn)
The information conveyed by distribution
(aka Entropy of P) is:
I(P) = -(p1*log(p1) + p2*log(p2) + .. + pn*log(pn))
Information Theory: Background
Information
Gain
Largest Entropy: Boolean functions with the same
number of ones and zeros have the largest entropy.
In machine learning, this concept can be used to
define a preferred sequence of attributes to investigate to most rapidly
narrow down the state of X. Such a sequence (which depends on the
outcome of the investigation of previous attributes at each stage) is
called a decision tree. Usually an attribute with high mutual
information should be preferred to other attributes.
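A small sketch of these calculations, using base-2 logarithms as above; the toy data set is invented for illustration.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """I(P) = -sum(p * log2(p)) over the class proportions of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    total = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return total - remainder

# Toy data: does a flight-delay alert get sent?
rows = [
    {"weather": "storm", "rush_hour": True,  "alert": "yes"},
    {"weather": "storm", "rush_hour": False, "alert": "yes"},
    {"weather": "clear", "rush_hour": True,  "alert": "yes"},
    {"weather": "clear", "rush_hour": False, "alert": "no"},
    {"weather": "clear", "rush_hour": False, "alert": "no"},
    {"weather": "clear", "rush_hour": True,  "alert": "no"},
]
for attr in ("weather", "rush_hour"):
    print(attr, round(information_gain(rows, attr, "alert"), 3))
# The attribute with the greater information gain ("weather" here) is tested first.
```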
A use
case in
Airline
industry
An Airline matching Airplanes to Routes
(using Machine Learning)
(URL: http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_helmbold09a.pdf )
A good discussion of the topic is detailed in the paper listed below:
Learning Permutations with Exponential Weights
by David P. Helmbold and Manfred K.Warmuth
of University of California, Santa Cruz
The Bootstrap
The Bootstrap
A good discussion of the topic is detailed in the article listed below:
“The Big Data Bootstrap”
by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar
and Michael I. Jordan of University of California, Berkeley
(URL: http://www.cs.berkeley.edu/~jordan/papers/blb_icml2012.pdf )
Bootstrapping is a method for assigning measures of
accuracy (defined in terms of bias, variance, confidence
intervals, prediction error or some other such measure) to
sample estimates. This technique allows estimation of the
sampling distribution of almost any statistic using only very
simple methods. Generally, it falls in the broader class of
resampling methods. The basic idea of bootstrapping is that
inference about a population from sample data (sample →
population) can be modeled by resampling the sample data
and performing inference on (resample → sample). As the
population is unknown, the true error in a sample statistic
against its population value is unknowable. In bootstrap-
resamples, the 'population' is in fact the sample, and this is
known; hence the quality of inference from resample data →
'true' sample is measurable. Source: Wikipedia
Concept
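A minimal sketch of the resample-the-sample idea described above, estimating the variability of a mean by resampling with replacement; the revenue figures are hypothetical, and this naive loop is exactly the cost that the Bag of Little Bootstraps below is designed to avoid at terabyte scale.

```python
import random
import statistics

def bootstrap_confidence_interval(sample, statistic, n_resamples=10_000, alpha=0.05):
    """Resample the observed sample with replacement and return an
    empirical (1 - alpha) percentile interval for the statistic."""
    estimates = []
    for _ in range(n_resamples):
        resample = random.choices(sample, k=len(sample))   # sample -> resample
        estimates.append(statistic(resample))
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical ancillary revenue (USD) per passenger on a sampled flight
revenue = [0, 0, 7.5, 12, 0, 25, 8, 0, 14, 0, 32, 5, 0, 9, 18]
print(statistics.mean(revenue))
print(bootstrap_confidence_interval(revenue, statistics.mean))
```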
Big Data and the Bootstrap
Abstract from the paper listed on the lower left side: The bootstrap provides a simple and powerful means
of assessing the quality of estimators. However, in settings involving large datasets, the computation of
bootstrap-based quantities can be prohibitively demanding. As an alternative, the authors present the Bag
of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and
subsampling to obtain a robust,
computationally efficient means of
assessing estimator quality. BLB is
well suited to modern parallel and
distributed computing architectures
and retains the generic applicability,
statistical efficiency, and favorable
theoretical properties of the bootstrap.
The authors provide the results of an
extensive empirical and theoretical
investigation of BLB's behavior,
including a study of its statistical
correctness, its large-scale
implementation and performance, selection of hyper parameters, and performance on real data.
The authors claim their procedure for quantifying estimator quality is “accurate”, “automatic” and
“scalable”, and have tested it on data sets of size exceeding 1 Terabyte.
A use
case in
Airline
industry
Modeling Demand and Supply for Domestic and International Air
Travel Economics for Cost Minimization and Profit Maximization
An in-depth and "excellent" scholarly treatment of the application of bootstrapping to
modelling Domestic and International Air travel Economics (demand / supply) for an
Airline company is given in the Ph.D. thesis listed below
(slightly old, published April 1999, but still very relevant today):
Essays on Domestic and International Airline
Economics with Some Bootstrap Applications
by Anthony Kenneth Postert
of Rice University
(URL: http://scholarship.rice.edu/bitstream/handle/1911/19428/9928581.PDF?sequence=1 )
Bootstrap and Big Data
Ensemble Learning, Bagging and Boosting
Ensemble Learning
The basis of this slide is from the original presentation titled
Bayesian Ensemble Learning for Big Data
by Rob McCulloch
of University of Chicago, Booth School of Business
Published, November 17, 2013
(URL: http://www.rob-mcculloch.org/some_papers_and_talks/talks/dsi-bart.pdf )
Ensemble methods use multiple learning algorithms to
obtain better predictive performance than could be
obtained from any of the constituent learning
algorithms. Unlike a statistical ensemble in statistical
mechanics, which is usually infinite, a machine learning
ensemble refers only to a concrete finite set of
alternative models, but typically allows for much more
flexible structure to exist between those alternatives.
Source : Wikipedia
Bagging
A use
case in
Airline
industry
Air Traffic Capacity impact during Adverse Weather conditions
A good discussion of the topic is detailed in the paper listed below:
Translation of Ensemble Weather Forecasts
into Probabilistic Air Traffic Capacity Impact
by Matthias Steiner, Richard Bateman, Daniel Megenhardt,
Yubao Liu, Mei Xu, Matthew Pocernich,
of the National Center for Atmospheric Research,
and by Jimmy Krozel of Metron Aviation
(URL: http://nldr.library.ucar.edu/repository/assets/osgc/OSGC-000-000-000-687.pdf )
Bootstrap aggregating, also called Bagging, is a
machine learning ensemble meta-algorithm
designed to improve the stability and accuracy
of machine learning algorithms used in
statistical classification and regression. It also
reduces variance and helps to avoid overfitting.
Although it is usually applied to decision tree
methods, it can be used with any type of
method. Bagging is a special case of the model
averaging approach. Source : Wikipedia
Boosting
Boosting is a machine learning meta-
algorithm for reducing bias in supervised
learning. Boosting is based on the
question: Can a set of weak learners
create a single strong learner? A weak
learner is defined to be a classifier which
is only slightly correlated with the true
classification (it can label examples
better than random guessing). In
contrast, a strong learner is a classifier
that is arbitrarily well-correlated with the
true classification. Source : Wikipedia
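A hedged sketch of bagging only (boosting would instead reweight the training examples between rounds): each weak learner is a one-feature "decision stump" fitted to a bootstrap resample, and the ensemble predicts by majority vote. The stump learner and the toy data are assumptions for illustration, not the method of the weather-impact paper cited at left.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Weak learner: a one-feature threshold rule chosen to minimize training errors."""
    best = None
    for feature in range(len(rows[0][0])):
        for threshold in sorted({x[feature] for x, _ in rows}):
            for sign in (1, -1):
                errors = sum(
                    (1 if sign * (x[feature] - threshold) > 0 else 0) != y
                    for x, y in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, sign)
    _, feature, threshold, sign = best
    return lambda x: 1 if sign * (x[feature] - threshold) > 0 else 0

def bagging(rows, n_models=25, seed=0):
    """Fit one stump per bootstrap resample and aggregate by majority vote."""
    random.seed(seed)
    models = [fit_stump(random.choices(rows, k=len(rows))) for _ in range(n_models)]
    def predict(x):   # the "mode" of the individual model outputs
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

# Toy data: ([wind_speed, visibility], capacity_reduced?)
rows = [([5, 9], 0), ([7, 8], 0), ([12, 4], 1), ([15, 3], 1), ([9, 6], 0), ([14, 2], 1)]
predict = bagging(rows)
print([predict(x) for x, _ in rows])   # ideally reproduces the training labels
```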
Ensemble Learning and Big Data
Random Forests
Random Forests
Random forests are an
ensemble learning
method for classification
(and regression) that
operate by constructing
a multitude of decision
trees at training time
and outputting the
class that is the mode
of the classes output by
individual trees. Source: Wikipedia
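A brief sketch using scikit-learn's RandomForestClassifier (assuming scikit-learn is available; this is not the distributed R-on-Hadoop approach mentioned below). Each tree is grown on a bootstrap sample of the rows with a random subset of features, and the forest outputs the mode of the trees' predictions; the flight features are invented.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features per flight: [departure_hour, day_of_week, origin_delay_minutes]
X = [[7, 1, 5], [18, 5, 45], [9, 2, 0], [17, 4, 60], [6, 1, 2], [19, 5, 30]]
y = [0, 1, 0, 1, 0, 1]   # 1 = arrival delayed more than 15 minutes

# 100 trees, each trained on a bootstrap sample; prediction = majority vote of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[18, 5, 50], [8, 2, 0]]))   # e.g. [1 0]
```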
A use
case in
Airline
industry
Network based model for Predicting Air Traffic Delays
The authors of the below paper propose a model using Random Forest (RF)
algorithms, considering both temporal and spatial (that is, network) delay states as
explanatory variables. In addition to local delay variables that describe the arrival or
departure delay states of the most influential airports and origin-destination (OD)
pairs in the network, the authors propose new network delay variables that depict
the global delay state of the entire NAS at the time of prediction.
A Network-Based Model for Predicting Air Traffic Delays
by Juan Jose Rebollo and Hamsa Balakrishnan
of Massachusetts Institute of Technology
(URL: http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanICRAT2012.pdf)
Random Forests in Big Data
Cloudera: In the below URL link, Cloudera (a major Big
Data vendor) shows how to implement a Poisson
approximation to enable training a random forest on an
enormous data set (with R, an open source statistical
package, on the Hadoop File System). The link also provides
Map and Reduce source code.
URL: https://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-
large-data-set-in-parallel-with-r-on-hadoop/
k-nearest Neighbours
k-nearest Neighbours
k -nearest Neighbors algorithm (or k-NN for short) is a non-parametric
method used for classification and regression. In both cases, the input
consists of the k closest training examples in the feature space. The output
depends on whether k -NN is used for classification or regression:
 In k-NN classification, the output is a class membership. An object is classified by a
majority vote of its neighbors, with the object being assigned to the class most
common among its k nearest neighbors (k is a positive integer, typically small). If k
= 1, then the object is simply assigned to the class of that single nearest neighbor.
 In k-NN regression, the output is the property value for the object. This value is
the average of the values of its k nearest neighbors.
k-NN is a type of instance-based learning, or lazy learning, where
the function is only approximated locally and all computation is
deferred until classification. The k-NN algorithm is among the
simplest of all machine learning algorithms. Both for classification
and regression, it can be useful to weight contributions of
neighbors, so that the nearer neighbors contribute more to the
average than the more distant ones.
For example, a common weighting scheme consists in giving each
neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are
taken from a set of objects for which the class (for k -NN classification) or the object
property value (for k -NN regression) is known. This can be thought of as the training
set for the algorithm, though no explicit training step is required. A shortcoming of
the k -NN algorithm is that it is sensitive to the local structure of the data.
Example of k-NN classification. The test sample (green circle) should be
classified either to the first class of blue squares or to the second class of red triangles. If k = 3
(solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square
inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2
triangles inside the outer circle).
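The caption's example can be reproduced in a few lines; a minimal k-NN classifier with the optional 1/d distance weighting described above. The point coordinates are invented to mimic the figure.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(query, points, k, weighted=False):
    """points: list of ((x, y), label). Majority vote among the k nearest neighbours,
    optionally weighting each vote by 1/d as described above."""
    nearest = sorted(points, key=lambda p: dist(query, p[0]))[:k]
    votes = Counter()
    for coords, label in nearest:
        votes[label] += 1.0 / max(dist(query, coords), 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]

# Invented coordinates mimicking the figure: 2 triangles and 1 square near the query,
# 2 more squares a bit further out.
points = [((1.0, 0.5), "triangle"), ((0.8, -0.6), "triangle"),
          ((-0.9, 0.4), "square"), ((-1.6, -0.8), "square"), ((0.3, 1.9), "square")]
query = (0.0, 0.0)
print(knn_classify(query, points, k=3))   # 'triangle' (2 triangles vs. 1 square)
print(knn_classify(query, points, k=5))   # 'square'   (3 squares vs. 2 triangles)
```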
Data Reduction
Data reduction is one of the most important problems for work with huge data sets.
Usually, only some of the data points are needed for accurate classification. Those data
are called the prototypes and can be found as follows:
1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for
a given k)
2. Separate the rest of the data into two sets: (i) the prototypes that are used for the
classification decisions and (ii) the absorbed points that can be correctly classified by k-
NN using prototypes which can be removed from the training set.
K-Nearest Neighbours and Big Data
A good discussion of how to execute kNN joins in a MapReduce cluster, with
algorithms in MapReduce to perform efficient parallel kNN joins on large data, is presented
in the paper below. The authors demonstrated their ideas on Hadoop with extensive
experiments on large real and synthetic datasets, with tens or hundreds of millions of
records in up to 30 dimensions, showing efficiency, effectiveness, and scalability.
Efficient Parallel kNN Joins for Large Data in MapReduce
by Chi Zhang of Florida State University and Jeffrey Jestes of University of Utah
(URL: http://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf )
Source: Wikipedia
Background: Nearest Neighbor Graph
Nearest neighbor graph (NNG) for a set of n objects P
in a metric space (e.g., for a set of points in the plane
with Euclidean distance) is a directed graph with P
being its vertex set and with a directed edge from
p to q whenever q is a nearest neighbor of p.
Example: The right side image shows a nearest
neighbor graph of 100 points in the Euclidean plane.
k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q
are connected by an edge, if the distance between p and q is among the k-th
smallest distances from p to other objects from P
Source: Wikipedia
k-nearest Neighbours (continued)
A use
case in
Airline
industry
Seating Arrangement and Inflight Purchase / Buying Behavior of Airline Customers
The below paper investigates and characterizes how social influence affects buying behavior of airline passengers who can purchase items through an
individual entertainment system located in front of them. The author used the seating configuration in the airplane as a basis for the analysis.
The author used large data sets with a sample size of about 257,000 passengers in nearly 2,000 flights for the analysis,
where the passengers performed 65,525 transactions, with an average of 33.3 transactions per flight. The author claims that he finds strong evidence of
social effects and states that the average number of transactions per passenger increases 30% upon observation of a neighbor's purchase. Analyzing within- and cross-
category effects, the author found that passengers are likely to buy from the same category purchased by their neighbors. For example, a
purchase of an alcoholic beverage increases the probability of same category purchases by a neighbor in 78%. The author claims peer effects also take
place at a deeper level than product category. Passengers 'imitate' their peers' decisions on the type of food, alcohol and even movie genre. The paper
also investigates the determinants of social influence: the author claims no support is found for informational learning as a significant mechanism in driving
social influence. The main determinant of social influence is found to be the number of neighbors observed purchasing an item. The results are consistent
with informational learning where consumers only learn from others' actions, but not from their inaction.
Peer Effects in Buying Behavior: Evidence from In-Flight Purchases
By Pedro M. Gardete, Assistant Professor of Marketing at Stanford University (published September 2013)
(URL: http://faculty-gsb.stanford.edu/gardete/documents/SocialEffects_8_2013.pdf )
Stochastic Gradient Descent
Gradient
Descent
Gradient descent is a first-order optimization algorithm. To find a local minimum of a function
using gradient descent, one takes steps proportional to the negative of the gradient (or of the
approximate gradient) of the function at the current point. If instead one takes steps
proportional to the positive of the gradient, one approaches a local maximum of that function;
the procedure is then known as gradient ascent. Gradient descent is also known as steepest
descent, or the method of steepest descent. When known as the latter, gradient descent
should not be confused with the method of steepest descent for approximating integrals.
Source: Wikipedia
Stochastic Gradient Descent
Stochastic gradient descent is a gradient descent optimization
method for minimizing an objective function that is written as a sum
of differentiable functions.
Source: Wikipedia
Source:
Wikipedia
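A minimal sketch of stochastic gradient descent on an objective written as a sum of differentiable per-example losses, here the squared error of a line y ≈ w·x + b; the data, learning rate and epoch count are assumptions for illustration. A minibatch variant would simply average the gradient over a small batch of examples before each update.

```python
import random

def sgd_linear_fit(data, learning_rate=0.01, epochs=200, seed=0):
    """Minimize sum_i (w*x_i + b - y_i)^2 by updating (w, b) one example at a time,
    i.e. following the gradient of a single term of the sum per step."""
    random.seed(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            error = (w * x + b) - y
            w -= learning_rate * 2 * error * x   # d/dw of (w*x + b - y)^2
            b -= learning_rate * 2 * error       # d/db of (w*x + b - y)^2
    return w, b

# Noisy samples of y = 3x + 1
data = [(x, 3 * x + 1 + random.uniform(-0.1, 0.1)) for x in [0, 1, 2, 3, 4, 5]]
print(sgd_linear_fit(data))   # approximately (3.0, 1.0)
```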
Stochastic Gradient Descent (continued)
Gradient Descent Example
Stochastic Gradient Descent (continued)
Stochastic Gradient Descent: Example
Source:
Wikipedia
Stochastic Gradient Descent vs.
Minibatch Gradient Descent
Stochastic Gradient Descent and Big Data
The below URL link provides a presentation on “Stochastic Optimization for Big Data Analytics”.
Stochastic Optimization for Big Data Analytics
by Tianbao Yang and Shenghuo Zhu of NEC Laboratories America
and Rong Jin of Michigan State University
(URL: http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf)
The authors in the below paper present stochastic gradient descent techniques for online learning
and ensemble methods to scale out to large amounts of data at Twitter, with details on how to
integrate machine learning tools into the Hadoop platform (using Pig, a programming tool).
Large-Scale Machine Learning at Twitter
by Jimmy Lin and Alek Kolcz of Twitter, Inc.
(URL: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf )
Stochastic Games and Markov Perfect Equilibrium
Importance of Markov Perfect Equilibrium in Airline industry Pricing
As an example of the use of this equilibrium concept we consider the competition between firms which had invested heavily into fixed costs and
are dominant producers in an industry, forming an oligopoly. The players are taken to be committed to levels of production capacity in the short
run, and the strategies describe their decisions in setting prices. Firms' objectives are modeled as maximizing present discounted value of profits.
Airfare Game / Airline Pricing Game: Often an airplane ticket for a certain route has the same price on either airline A or airline B. Presumably,
the two airlines do not have exactly the same costs, nor do they face the same demand function given their varying frequent-flyer programs, the
different connections their passengers will make, and so forth. Thus, a realistic general equilibrium model would be unlikely to result in nearly
identical prices. Both airlines have made sunk investments into the equipment, personnel, and legal framework. In the near term we may think of
them as committed to offering service. We therefore see that they are engaged, or trapped, in a strategic game with one another when setting
prices.
Equilibrium: Consider the following strategy of an airline for setting the ticket price for a certain route. At every price-setting opportunity:
 If the other airline is charging $300 or more, or is not selling tickets on that flight, charge $300
 If the other airline is charging between $200 and $300, charge the same price
 If the other airline is charging $200 or less, choose randomly between the following three options with equal probability:
matching that price, charging $300, or exiting the game by ceasing indefinitely to offer service on this route.
This is a Markov strategy because it does not depend on a history of past observations. It satisfies also the Markov reaction function definition
because it does not depend on other information which is irrelevant to revenues and profits. Assume now that both airlines follow this strategy
exactly. Assume further that passengers always choose the cheapest flight and so if the airlines charge different prices, the one charging the
higher price gets zero passengers. Then if each airline assumes that the other airline will follow this strategy, there is no higher-payoff alternative
strategy for itself, i.e. it is playing a best response to the other airline strategy. If both airlines followed this strategy, it would form a Nash
equilibrium in every proper subgame, thus a subgame-perfect Nash equilibrium. Source: Wikipedia
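A hedged sketch of the Markov reaction function spelled out above: the rule depends only on the rival's current price (the payoff-relevant state), never on the history of play. The simulation scaffolding and the rival's opening promotional fare are assumptions for illustration.

```python
import random

def markov_price_response(rival_price, rng=random):
    """Markov reaction function from the example above: depends only on the rival's
    current price (the state). Returns the new price, or None to exit the route."""
    if rival_price is None or rival_price >= 300:
        return 300
    if rival_price > 200:          # between $200 and $300: match it
        return rival_price
    # $200 or less: match, post $300, or exit, each with probability 1/3
    return rng.choice([rival_price, 300, None])

def simulate(rounds=10, seed=1):
    rng = random.Random(seed)
    price_a, price_b = 300, 180    # B opens with a hypothetical $180 promotional fare
    for _ in range(rounds):
        price_a = markov_price_response(price_b, rng)
        if price_a is None:
            return "Airline A exits the route"
        price_b = markov_price_response(price_a, rng)
        if price_b is None:
            return "Airline B exits the route"
    return f"Prices after {rounds} rounds: A=${price_a}, B=${price_b}"

print(simulate())
```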
Stochastic (or Markov) Games
Stochastic (or Markov) game, is a dynamic game with
probabilistic transitions played by one or more players. The
game is played in a sequence of stages. At the beginning of
each stage the game is in some state. The players select
actions and each player receives a payoff that depends on
the current state and the chosen actions. The game then
moves to a new random state whose distribution depends on
previous state and actions chosen by players. The procedure
is repeated at the new state and play continues for a finite or
infinite number of stages. Total payoff to a player is often
taken to be discounted sum of stage payoffs or the limit
inferior of averages of stage payoffs. Source: Wikipedia
Markov Perfect Equilibrium
A Markov perfect equilibrium is an equilibrium concept in game theory. It is the refinement of the concept of sub
game perfect equilibrium to extensive form games for which a pay-off relevant state space can be readily
identified. In extensive form games, and specifically in stochastic games, a Markov perfect equilibrium is a set of
mixed strategies for each of the players which satisfy the following criteria:
 The strategies have the Markov property of memorylessness, meaning that each player's mixed strategy can
be conditioned only on the state of the game. These strategies are called Markov reaction functions.
 The state can only encode payoff-relevant information. This rules out strategies that depend on non-
substantive moves by the opponent. It excludes strategies that depend on signals, negotiation, or
cooperation between players (e.g. cheap talk or contracts).
 The strategies form a subgame perfect equilibrium of the game. Source: Wikipedia
Subgame Perfect Equilibrium
Subgame Perfect Equilibrium is a refinement of a Nash equilibrium used in dynamic games. A strategy profile
is a subgame perfect equilibrium if it represents a Nash equilibrium of every subgame of the original game.
Informally, this means that if (1) the players played any smaller game that consisted of only one part of the
larger game and (2) their behavior represents a Nash equilibrium of that smaller game, then their behavior is a
subgame perfect equilibrium of the larger game. Source: Wikipedia
Dynamic Airfare Pricing
and Competition
The below paper details Airline industry price
competition for an oligopoly in a dynamic
setting, where each of the sellers has a fixed
number of units available for sale over a fixed
number of periods. Demand is stochastic, and
depending on how it evolves, sellers may
change their prices at any time.
Dynamic Price Competition
with Fixed Capacities
by Kalyan Talluri & Victor Martinez de Albeniz
A use
case in
Airline
industry
Most of the work in this
paper was done prior to the
writing of the paper as part
of both authors'
Ph.D. dissertations at
Massachusetts Institute of
Technology
(published February 2010)
(URL: www.econ.upf.edu/docs/papers/downloads/1205.pdf)
Stochastic Games and Markov Perfect Equilibrium (continued)
A use
case in
Airline
industry
Dynamic Revenue Management in Airline Alliances / Code Sharing
The below paper presents an excellent formalization of a Markov-game model of a two-partner airline alliance that can be used to analyze the effects of
these mechanisms on each partner's behavior. The authors show that no Markovian transfer pricing mechanism can coordinate an arbitrary alliance. Next,
the authors derive the equilibrium acceptance policies under each scheme and use analytical techniques, as well as numerical analyses of sample
alliances, to generate fundamental insights about partner behavior under each scheme. The analysis and numerical examples also illustrate how certain
transfer price schemes are likely to perform in networks with particular characteristics.
Dynamic Revenue Management in Airline Alliances
by Robert Shumsky of Dartmouth College and Christopher Wright, Harry Groenevelt of University of Rochester (published February 2009)
(URL: http://www.researchgate.net/publication/220413135_Dynamic_Revenue_Management_in_Airline_Alliances/file/72e7e5215a1f91ed5b.pdf )
How Accurate are Carbon Emissions Projections?IES VE
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 

Kürzlich hochgeladen (20)

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 

Big Data Analysis: Concepts and References (slide transcript)

  • 1. Big Data Analysis 1Author: Vikram Andem ISRM & IT GRC Conference Big Data Analysis  Concepts and References  Use Cases in Airline Industry
  • 2. Big Data Analysis 2Author: Vikram Andem ISRM & IT GRC Conference  The objective of this presentation is to provide awareness and familiarize a general business or management user with terms and terminology of Big Data Analysis and some references to use cases that can be (or are currently) applied in Airline industry.  The presentation is intended for an business or a management user to help with the thinking process on formulating an analytical question given business situation / problem for Big Data Analysis.  The presentation may also help provide an insight on basic terms and concepts that are a need to know, what to ask, how to evaluate and/or help solve a business problem for a potential Big Data Analysis use case and what to expect from the work of an competent Data Scientist when dealing with such use case for Big Data Analysis. NOTE: Just reviewing this presentation will most likely NOT make you competent enough to instantly perform Big Data Analysis. Big Data Analysis is a new (very recent) aspect of Data Science and requires some college or university level course work in (fields such as, but not limited to) mathematics, statistics, computer science , management science, econometrics, engineering etc. The presentation is divided into three parts following a separate presentation on Big Data Security & Governance, Risk Management & Compliance Part 1. Big Data : Introduction ( page # 3) Part 2. Very quick introduction to understanding Data and analysis of Data ( page # 8) (Beginner: if you are new to understanding data and use of data you should start here) Part 3. Big Data Analysis : Concepts and References to Use Cases in Airline Industry ( page # 17) (Advanced: if you understand data and how to use data, you may jump to this part).
  • 3. Big Data Analysis 3Author: Vikram Andem ISRM & IT GRC Conference Big Data: Introduction You may skip this section if you are familiar with Big Data and directly jump to Part 2 (page # 8) Part 1
  • 4. Big Data Analysis 4Author: Vikram Andem ISRM & IT GRC Conference Introduction Projected growth and use of Unstructured vs. Structured data¹ 2012 2013 2014 2015 2016 2017 2018 2019 2020 Unstructured Structured ¹ 2013 IEEE Bigdata conference (projected growth of data combined for all fortune 500 companies only) Limitations of existing Data Analytics Architecture BI Reports + Interactive Apps RDBMS (aggregated data) ETL Computer Grid Collection Instrumentation Storage Only Grid (original raw data) Mostly Append Limit #1 : Moving data to compute doesn’t scale. Limit #2 : Can’t explore high fidelity raw data Limit #3 : Archiving = Premature data death *Zettabyte’sofdata * 1 Zettabyte = 1000 Exabyte's = 1 Million Petabyte’s = 1 Billion Terabyte’s. Big Data a general term refers to the large voluminous amounts (at least terabytes) of poly-structured data that is gleaned from traditional and non-traditional sources and continuously flows through and around organizations, including but not limited-to e-mail, text, event logs, audio, video, blogs, social media and transactional records. What does this information hold? What is the challenge extracting it? It holds the promise of giving enterprises like United a deeper insight into their customers, partners, and business. This data can provide answers to questions they may not have even thought to ask. Companies like United can benefit from a multidimensional view of their business when they add insight from big data to the traditional types of information they collect and analyze. NumberofResults Demand <- More Generic More Specific -> Popularity Rank <- Small Tail Long Tail -> Traditional EDW Classical Statistics Big Data Specific Spikes Transactional Data (e.g., Reservations) Non-Transactional & Raw Data (e.g., Search's, Event logs) - + The challenge of extracting value from big data is similar in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as “Extract, Transform & Load” (ETL). The nature of big data requires that the infrastructure for this process can scale cost-effectively. While the storage capacities of hard drives have increased massively over the years, access speeds — the rate at which data can be read from drives — have not kept up. One typical drive from year 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all data off the disk. This is a long time to read all data on a single drive — and writing is even slower. Data Storage and Analysis
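To make the disk arithmetic above concrete, the small sketch below simply recomputes the quoted read times from capacity and transfer speed; the figures are the ones given on the slide.

```python
# Quick check of the drive read-time arithmetic quoted above.
def hours_to_read(capacity_mb, speed_mb_per_s):
    return capacity_mb / speed_mb_per_s / 3600

print(f"1990 drive (1,370 MB at 4.4 MB/s): {hours_to_read(1370, 4.4) * 60:.1f} minutes")  # about 5 minutes
print(f"1 TB drive at 100 MB/s: {hours_to_read(1_000_000, 100):.1f} hours")               # about 2.8 hours
```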
  • 5. Big Data Analysis 5 Author: Vikram Andem ISRM & IT GRC Conference
Hadoop: Apache Hadoop is a scalable, fault-tolerant distributed system for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop can be used to store exabytes of unstructured and semi-structured data reliably on tens of thousands of general-purpose servers, while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Using Hadoop in this way, organizations like United gain an additional ability to store and access data that they "might" need, data that may never be loaded into the data warehouse.
The key benefit: agility/flexibility.
Schema-on-Write (RDBMS): the schema must be created before any data can be loaded; an explicit load operation transforms data to the database's internal structure; new columns must be added explicitly before new data for those columns can be loaded. Pros: read is fast; standards and governance. Use when you need interactive OLAP analytics (<1 sec), multistep ACID transactions, or 100% SQL compliance.
Schema-on-Read (Hadoop): data is simply copied to the file store, and no transformation is needed; a SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding); new data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it. Pros: load is fast; flexibility and agility. Use when data may or may not be structured (flexibility), or when you need scalability of storage/compute or complex data processing.
Hadoop Architecture: Central to the scalability of Hadoop is the distributed processing framework known as MapReduce, which splits the input data set into multiple chunks, each of which is assigned a map task that can process the data in parallel. Each map task reads the input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as the output. The framework shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them, and restart any that fail. (A minimal map/reduce sketch follows below.)
Hadoop Distributed File System (HDFS) is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding DataNodes, and a single NameNode manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes. ² Picture source: Intel white paper on Big Data analytics.
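The MapReduce description above can be illustrated with a short sketch. The snippet below is not Hadoop code; it runs the map, shuffle and reduce steps locally in plain Python on a few made-up flight records (the CSV layout and field positions are assumptions) so the pattern is visible end to end.

```python
# Minimal illustration of the MapReduce pattern described above, run locally in
# plain Python (no Hadoop cluster needed). It counts flights per origin airport
# from hypothetical CSV records of the form "date,origin,dest,delay_minutes".
from collections import defaultdict

records = [
    "2014-05-01,ORD,IAH,12",
    "2014-05-01,ORD,EWR,0",
    "2014-05-01,SFO,ORD,45",
    "2014-05-02,IAH,ORD,7",
]

def map_phase(line):
    """Map: emit (key, value) pairs, here (origin_airport, 1)."""
    origin = line.split(",")[1]
    yield origin, 1

def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)

# Shuffle/sort: group intermediate pairs by key (done by the framework on a real cluster).
grouped = defaultdict(list)
for line in records:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]
print(results)   # [('IAH', 1), ('ORD', 2), ('SFO', 1)]
```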
  • 6. Big Data Analysis 6 Author: Vikram Andem ISRM & IT GRC Conference
Use cases (use case / description):
Data Storage: collect and store unstructured data in a fault-resilient, scalable data store that can be organized and sorted for indexing and analysis.
Analytics: ability to query in real time, at the speed of thought, on petabyte-scale unstructured and semi-structured data using HBase and Hive.
Batch Processing of Unstructured Data: ability to batch-process (index, analyze, etc.) tens to hundreds of petabytes of unstructured and semi-structured data.
Data Archive: medium-term (12-36 months) archival of data from the EDW/DBMS to meet data retention policies.
Integration with EDW: extract, transfer and load data in and out of Hadoop into a separate DBMS for advanced analytics.
Search and Predictive Analytics: crawl, extract, index and transform structured and unstructured data for search and predictive analytics.
Common patterns of Hadoop use. Each pattern connects data sources, traditional (RDBMS, OLTP, OLAP) and new (web logs, email, sensor data, social media), with data systems (RDBMS, EDW, traditional repositories) and applications (business analytics, custom and enterprise applications):
Pattern #1, Hadoop as a Data Refinery: collect data and apply a known algorithm within a trusted operational process. Capture all data; process it (parse, cleanse, apply structure and transform); exchange it by pushing to the existing data warehouse for use with existing analytic tools.
Pattern #2, Data Exploration with Hadoop: collect data and perform iterative investigation for value. Capture all data; process it (parse, cleanse, apply structure and transform); exchange it by exploring and visualizing with analytics tools that support Hadoop.
Pattern #3, Application Enrichment with Hadoop: collect data, analyze it and present salient results for online apps. Capture all data; process it (parse, cleanse, apply structure and transform); exchange it by incorporating the data directly into applications.
Top 5 general usages: obtaining a 360-degree view of customers; operations analytics; data warehouse augmentation; social media; big data exploration.
Top 5 airline usages: capturing sensor data to optimize maintenance; forecasting the weather to optimize fuel loads; identifying and capturing the demand signal (competitive offerings, travel partner feeds); loyalty and promotions; webpage visits and log storage.
Typical analytical questions: When is the best time of day, day of week or time of year to fly to minimize delays? Do older planes suffer more delays? How does the number of people flying between different locations change over time? How well does weather predict plane delays? Can you detect cascading failures as delays in one airport create delays in others, and are there critical links in the system? (A small delay-by-hour sketch follows below.)
How the airline industry is using Hadoop:³ Southwest Airlines uses a Hadoop-based solution for its "Rapid Rewards" loyalty program and customer service. American Airlines is utilizing a Hadoop-based solution for clickstream, customer, kiosk and data analytics. British Airways uses a Hadoop-based solution for forecasting and revenue management. Orbitz Worldwide utilizes a Hadoop-based solution for statistical analysis to identify the best possible promotions that combine air travel with a hotel stay. Prospects: CTO Q's? ³ Source: Internet (Google search).
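As a hedged illustration of the first analytical question above (the best time of day to minimize delays), the sketch below aggregates a tiny in-memory sample with pandas. The column names and sample rows are invented; at Hadoop scale the same aggregation would typically be expressed in Hive or Spark rather than pandas.

```python
# A toy sketch (pandas, in-memory) of the "best time of day to minimize delays"
# question above. Column names and the sample rows are illustrative assumptions;
# at petabyte scale the same aggregation would normally run in Hive or Spark.
import pandas as pd

flights = pd.DataFrame({
    "sched_dep_hour": [6, 6, 9, 9, 17, 17, 21],
    "dep_delay_min":  [0, 5, 12, 3, 40, 25, 10],
})

# Average delay and share of departures delayed more than 15 minutes, by scheduled hour.
summary = flights.groupby("sched_dep_hour")["dep_delay_min"].agg(
    mean_delay="mean",
    pct_delayed_15=lambda s: (s > 15).mean() * 100,
)
print(summary.sort_values("mean_delay"))
```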
  • 7. Big Data Analysis 7 Author: Vikram Andem ISRM & IT GRC Conference
Cost Benefit Analysis: why is Hadoop a value proposition?
[Chart "Economics of Data": cost per terabyte vs. size of data (gigabyte to petabyte scale), comparing the cost of current structured data management technologies with the cost of Hadoop; the gap represents value captured by customers (like United) using Hadoop.]
TCOD (Total Cost of Data), the cost of owning (and using!) data over time for analytic purposes, is a better metric than TCO (Total Cost of Ownership) for the cost benefit analysis of this case. TCOD estimates what a company like United will really spend to get to its business goal, and the focus is on total cost, not just the platform cost. In this case the TCOD comparison is made between an EDW platform/appliance and Hadoop for the same amount of (raw or unstructured) data. The cost comparison and benefits are based on the underlying "data management" requirements.⁴ (* ETL is extract, transform and load, i.e. preparing data for analytic use. Cost components considered: system cost, admin cost, ETL, applications, queries, analytics.)
Project A, emphasis on "data refining" requirements: 1. Hundreds of TB of data per week (500 TB of data). 2. Raw data life: a few hours to a few days. 3. Challenge: find the important events or trends. 4. Analyze the raw data once or a few times. 5. When analyzing, read entire files. 6. Keep only the significant data.
  Cost (EDW appliance vs. Hadoop): Total System Cost $23M vs. $1.3M; System and Data Admin $2M vs. $2M; Application Development $5M vs. $6M; Total Cost of Data $30M vs. $9.3M. The cost equation is favorable to Hadoop for data refining, data landing and archival requirements.
Project B, emphasis on "EDW" requirements: 1. Data volume 500 TB to start, all of which must be retained for at least five years. 2. Continual growth of data and workload. 3. Data sources: thousands. 4. Data sources change their feeds frequently. 5. Challenges: data must be correct and data must be integrated. 6. Typical enterprise data lifetime: decades. 7. Analytic application lifetime: years. 8. Millions of data users. 9. Hundreds of analytic applications. 10. Thousands of one-time analyses. 11. Tens of thousands of complex queries.
  Cost (EDW appliance vs. Hadoop): Total System Cost $45M vs. $5M; System and Data Admin $50M vs. $100M; Application Development $40M vs. $300M; ETL $60M vs. $100M; Complex Queries $40M vs. $80M; Analysis $30M vs. $70M; Total Cost of Data $265M vs. $655M. The cost equation is favorable to the EDW for data warehouse appliance requirements.
Cost comparison conclusions: Each technology has large advantages in its sweet spot(s). Neither platform is cost effective in the other's sweet spot. The biggest differences for the data warehouse are the development costs for "complex queries" and "analytics". Total cost is extremely sensitive to technology choice. Analytic architectures will require both Hadoop and data warehouse platforms. Focus on total cost, not platform cost, in making your choice for a particular application or use. Many analytic processes will use both Hadoop and EDW technology, so integration cost also counts! (A small sketch that recomputes these totals follows below.)
Hadoop is ideal for storage of: data which is rarely needed; data which can grow rapidly; data which can grow very large; data for which it is uncertain how it will be needed in the future; data which may or may not have structure; data which may require ETL and analysis sometime in the future but just needs to be stored now, for some unknown use.
⁴ Source for the TCOD comparison is "The Real Cost of Big Data Spreadsheet" provided by Winter Corp (www.wintercorp.com/tcod).
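As a minor aid, the sketch below recomputes the Project B totals from the cost components quoted above (figures in $M, taken from the slide's Winter Corp comparison), just to make the TCOD arithmetic explicit.

```python
# Recomputes the TCOD totals quoted above from their cost components (figures in $M,
# as listed on the slide).
project_b = {
    "EDW Appliance": {"system": 45, "admin": 50, "app_dev": 40, "etl": 60, "queries": 40, "analysis": 30},
    "Hadoop":        {"system": 5, "admin": 100, "app_dev": 300, "etl": 100, "queries": 80, "analysis": 70},
}

for platform, costs in project_b.items():
    print(f"Project B, {platform}: TCOD = ${sum(costs.values())}M")
# Project B, EDW Appliance: TCOD = $265M
# Project B, Hadoop: TCOD = $655M
```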
  • 8. Big Data Analysis 8Author: Vikram Andem ISRM & IT GRC Conference Very quick introduction to understanding Data and analysis of Data Start here if you are new to understanding the data or do not know how to analyze data. Part 2
  • 9. Big Data Analysis 9Author: Vikram Andem ISRM & IT GRC Conference Introduction to Data  Targets business constraints.  Assesses and determines new ways to operate. What is the best action/outcome? What could happen? What happened?  Finds associations in data not readily apparent with customary analysis .  Forecasts future probabilities and trends.  Pattern recognition from samples for reporting of trends.  Formulates and analyzes historical data. Descriptive Analytics Predictive Analytics Prescriptive Analytics Benefits Very High High Medium Myth: I have large sets of data on Hadoop File System. Running powerful analytical tools (e.g., R, SAS, Tableau etc.) on Hadoop Infrastructure will perform all the data analysis work for me and provide/deliver useful information. Fact : The data by itself may not contain the answer, Big or Small - you need the right data. The combination of data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data . Performing data analysis with an understanding (and application of) data science principles, by correctly framing the analytical problem (with correct data sampling/collection methods) and with the use of appropriate analytical tools will most likely provide useful information for statistical/analytical inference. Data Analysis Benefits Data Facts  Data is just data.  Data does not live nor die.  Data does not offer truth nor does it lie.  Data is not large nor its small.  Data has always existed as both big data and small data. Data is the second most important thing  The most important thing in data science is the question.  The second most important is the data.  Often the data will limit or enable the questions.  But having data is useless if you don't have an appropriate question. Data are values of qualitative or quantitative variables, belonging to a set of items. Source: Wikipedia
  • 10. Big Data Analysis 10 Author: Vikram Andem ISRM & IT GRC Conference
Data Types: Basics
Types of data variables: numerical (continuous or discrete) and categorical (regular categorical or ordinal).
Numerical (quantitative) variables take on numerical values; it is sensible to add, subtract, take averages, etc. with these values. Categorical (qualitative) variables take on a limited number of distinct categories; these categories can be identified with numbers, but it wouldn't be sensible to do arithmetic operations with those values. Continuous numerical variables are measured, and can take on any numerical value. Discrete numerical variables are counted, and can take on only whole non-negative numbers. Categorical variables that have ordered levels are called ordinal: think of a flight survey question where you are asked how satisfied you are with the customer service you received, and the options are very unsatisfied, unsatisfied, neutral, satisfied or very satisfied; these levels have an inherent ordering, hence the variable is called ordinal. If the levels of a categorical variable do not have an inherent ordering, the variable is simply called (regular) categorical (e.g., do you prefer a morning flight or an evening flight?).
Examples (from the table shown on the slide): gender: categorical; sleep: numerical, continuous; bedtime: categorical, ordinal; countries: numerical, discrete; dread: categorical, ordinal (could also be used as numerical).
Observations, Variables and Data Matrices: Data are organized in what we call a data matrix, where each row represents an observation (or a case), and each column represents a variable. Example:
  City          | no_flights | %_ontime | ... | region
  Chicago       | 350        | 90       | ... | Midwest
  Houston       | 330        | 96       | ... | South
  ...           | ...        | ...      | ... | ...
  Newark        | 306        | 92       | ... | Northeast
  San Francisco | 310        | 93       | ... | West
The first variable is City, an identifier variable holding the name of the city United serves for which the data are gathered. Next is no_flights (the number of flights served by United daily), a discrete numerical variable. Next is %_ontime (the percentage of United flights that operated on time, arrival or departure), a continuous numerical variable: it can take on any value between zero and 100, even though the values shown here are rounded to whole numbers. The last column is region, representing where the city is located in the USA as designated by the US census (Northeast, Midwest, South and West); this is a categorical variable.
Relationships between variables: When two variables show some connection with one another, they are called associated, or dependent, variables. The association can be further described as positive or negative. If two variables are not associated, they are said to be independent. (A short pandas sketch of this data matrix follows below.)
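A minimal sketch of the same data matrix in pandas follows; the numbers are the illustrative values from the table above, and the dtype choices simply mirror the variable types discussed (discrete and continuous numerical, categorical).

```python
# A minimal pandas version of the data-matrix example above: each row is an
# observation (a city United serves), each column a variable, with explicit types.
import pandas as pd

flights = pd.DataFrame({
    "city":       ["Chicago", "Houston", "Newark", "San Francisco"],
    "no_flights": [350, 330, 306, 310],          # discrete numerical
    "pct_ontime": [90.0, 96.0, 92.0, 93.0],      # continuous numerical
    "region":     ["Midwest", "South", "Northeast", "West"],
})
flights["region"] = flights["region"].astype("category")   # regular categorical

print(flights.dtypes)
print(flights.describe(include="all"))
```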
  • 11. Big Data Analysis 11Author: Vikram Andem ISRM & IT GRC Conference Data Observational Studies and Experiments Observational Study  In an observational study you collect data in a way that does not directly interfere with how the data arise, i.e. merely “observe”.  We can only establish an association (or correlation) between the explanatory and response variables.  If an observation study uses data from the past, it’s called a retrospective study, whereas if data are collected throughout the study, it’s called prospective. Experiment  In an experiment, you randomly assign subjects to various treatments and can therefore establish a causal connection between the explanatory and response variables. work out don't work out average energy level average energy level work out don't work out average energy level average energy level random assignment In observational study we sample two types of people from the population, those who choose to work out regularly and those who don’t, then we find the average energy level for the two groups of people and compare. On the other hand in an experiment, we sample a group of people from the population and then we randomly assign these people into two groups, those who will regularly workout throughout the course of the study and those who will not. The difference is that the decision of whether to work out or not is not left to the subjects as in the observational study but is instead imposed by the researcher. At the end we compare the average energy levels of the two groups. Based on the observational study, even if we find the difference between the energy levels of these two groups of people, we really can’t attribute this difference solely to working out, because there may be other variables that we didn’t control for in this study that contribute to the observed difference. For example people who are in better shape might be more likely to work out and also have high energy levels. However in the experiment such variables that likely to contribute to the outcome are equally represented in the two groups, due to random assignment. Therefore if we find a difference between the two averages, we can indeed make a causal statement attributing this difference to working out. Example: Suppose you want to evaluate the relationship between regularly working out and energy level. We can design the study as observational study or an experiment. What type of study is this, observational study or an experiment? “Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those who skipped the morning meal, according to a study that tracked nearly 2,400 girls for 10 years. [...] As part of the survey, the girls were asked once a year what they had eaten during the previous three days.” This is an observational study since the researchers merely observed the behavior of the girls (subjects) as opposed to imposing treatments on them. The study concludes there is an association between girls eating breakfast and being slimmer. #1 : Eating breakfast causes girls to be thinner. #2: Being thin causes girls to eat breakfast #3: A third variable is responsible for both. What could it be? An extraneous variable that affects both the explanatory and the response variable and that make it seem like there is a relationship between the two are called confounding variables. 
• What determines whether we can infer correlation or causation depends on the type of study that we are basing our decision on. • Observational studies, for the most part, only allow us to make correlation statements, while experiments allow us to infer causation. Correlation does not imply causation.
  • 12. Big Data Analysis 12Author: Vikram Andem ISRM & IT GRC Conference Data Sampling and Sources of Bias Data Sampling  Think about sampling something that you are cooking - you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.  When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s called exploratory analysis for the sample at hand.  If you can generalize and conclude that your entire soup needs salt, that’s making an inference.  For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).  If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.  On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot. Exploratory Analysis Representative Sample Inference Sources of Sampling Bias  Convenience sample bias: Individuals who are easily accessible are more likely to be included in the sample. Example: say you want to find out how people in your city feel about recent increase in public transportation costs. If you only poll people in your neighborhood as opposed to a representative of a whole sample from the entire city your study will suffer from Convenience sample bias.  Voluntary Response bias: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the entire population. Example: say you place polling machines at all bus stops and metro stations in your city, but only those who suffered by the price increase choose to actually take the time to vote and express their opinion on the recent increase in public transportation fares. The people who respond to such sample do not make up the representative of the entire population.  Non-Response sampling bias: If only a (non-random) fraction of the randomly sampled people choose to respond to a survey, the sample is no longer a representative of the entire population. Example: say you take a random sample of individuals from your city and attempt to survey them but certain segments of the population; say those from the lower socio- economic status are less likely to respond to the survey then its not a representative of entire population. Sampling Bias a historical example: Landon vs. FDRIn 1936, Landon sought the Republican presidential nomination opposing the re-election of FDR. A popular magazine of the time (1936) “Literary Digest” polled about 10 million Americans, and got responses from about 2.4 million. To put things in perspective, nowadays reliable polls in USA poll about 1500 to 3000 people, so the “10 million” poll was very huge sample. The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes. Election result: FDR won, with 62% of the votes. What went wrong with the Literary Digest Poll? The magazine had surveyed  its own readers:  registered automobile owners,  and registered telephone users. These groups had incomes well above the national average of the day (remember, this is Great Depression era) which resulted in lists of voters far more likely to support Republicans than a truly typical voter of the time, i.e. the sample was not representative of the American population at the time. 
The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since the sample was biased, the sample did not yield an accurate prediction. Back to the soup analogy: If the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.
  • 13. Big Data Analysis 13Author: Vikram Andem ISRM & IT GRC Conference Data Sampling Methods & Experimental Design Obtaining Good Samples  Almost all statistical methods are based on the notion of implied randomness.  If observational data are not collected in a random framework from a population, these statistical methods the estimates & errors associated with the estimates are not reliable.  Most commonly used random sampling techniques are simple, stratified, and cluster sampling. Simple Random Sample Randomly select cases from the population, where there is no implied connection between the points that are selected. Stratified Sample Strata are made up of similar observations. We take a simple random sample from each stratum. Cluster Sample Clusters are usually not made up of homogeneous observations, and we take a simple random sample from a random sample of clusters. Usually preferred for economical reasons. (data) Experimental Design Concepts • Control: Compare treatment of interest to a control group. • Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible. • Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study. • Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups. • Blocking example: We would like to design an experiment to investigate if energy gels makes you run faster: Treatment: energy gel ; Control: no energy gel • It is suspected that energy gels might affect pro and amateur athletes differently, therefore we block for pro status: • Divide the sample to pro and amateur. Randomly assign pro athletes to treatment and control groups. Randomly assign amateur athletes to treatment and control groups. • Pro/amateur status is equally represented in resulting treatment and control groups Random Assignment vs. Random Sampling
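The sketch below illustrates the three random sampling techniques above on a made-up population of passengers grouped by hub; the hubs and population size are arbitrary assumptions used only to show the mechanics.

```python
# Illustrative sketch of simple, stratified and cluster sampling as described above,
# using an invented "population" of passengers labeled by hub (the strata/clusters).
import random

random.seed(42)
population = [{"id": i, "hub": random.choice(["ORD", "IAH", "EWR", "SFO"])} for i in range(1000)]

# Simple random sample: every case has the same chance of selection.
simple = random.sample(population, 50)

# Stratified sample: take a simple random sample within each stratum (hub).
strata = {}
for p in population:
    strata.setdefault(p["hub"], []).append(p)
stratified = [p for members in strata.values() for p in random.sample(members, 12)]

# Cluster sample: randomly pick a few clusters (hubs), then sample within them.
chosen_hubs = random.sample(sorted(strata), 2)
cluster = [p for hub in chosen_hubs for p in random.sample(strata[hub], 25)]

print(len(simple), len(stratified), len(cluster))
```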
  • 14. Big Data Analysis 14Author: Vikram Andem ISRM & IT GRC Conference Hypothesis Testing Two competing claims Claim 1. “There is nothing going on.” Promotion and gender are independent, no gender discrimination, observed difference in proportions is simply due to chance. => Null hypothesis Claim 2. “There is something going on.” Promotion and gender are dependent, there is gender discrimination, observed difference in proportions is not due to chance. => Alternative hypothesis A court trial as a hypothesis test Hypothesis testing is very much like a court trial.  H0: Defendant is innocent  HA: Defendant is guilty  We then present the evidence: collect data.  Then we judge the evidence - “Could these data plausibly have happened by chance if the null hypothesis were true?”  If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in our minds about the null hypothesis.  Ultimately we must make a decision. How unlikely is unlikely?  If the evidence is not strong enough to reject the assumption of innocence, the jury returns with a verdict of “not guilty”.  The jury does not say that the defendant is innocent, just that there is not enough evidence to convict.  The defendant may, in fact, be innocent, but the jury has no way of being sure.  Said statistically, we fail to reject the null hypothesis.  We never declare the null hypothesis to be true, because we simply do not know whether it’s true or not.  Therefore we never “accept the null hypothesis”.  In a trial, the burden of proof is on the prosecution.  In a hypothesis test, the burden of proof is on the unusual claim.  The null hypothesis is the ordinary state of affairs (the status quo), so it’s the alternative hypothesis that we consider unusual and for which we must gather evidence. Hypothesis Testing
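A simulation-style version of the promotion/gender test above is sketched below. The counts are hypothetical (the slide does not give them): 48 personnel files, 24 labeled male and 24 female, with 21 versus 14 promotions. Under the null hypothesis the labels are interchangeable, so shuffling them shows how often a difference this large arises by chance alone.

```python
# A simulation-based version of the promotion/gender hypothesis test sketched above.
# The counts are hypothetical: 24 "male" and 24 "female" files, with 21 vs. 14 promotions.
# Under H0 (promotion and gender are independent) any difference is due to chance, so we
# shuffle the labels and see how often a difference at least as large as observed appears.
import random

random.seed(1)
promoted = [1] * 21 + [0] * 3 + [1] * 14 + [0] * 10   # first 24 "male", last 24 "female"
observed_diff = sum(promoted[:24]) / 24 - sum(promoted[24:]) / 24

count_extreme = 0
n_sims = 10_000
for _ in range(n_sims):
    random.shuffle(promoted)
    diff = sum(promoted[:24]) / 24 - sum(promoted[24:]) / 24
    if diff >= observed_diff:
        count_extreme += 1

print(f"observed difference = {observed_diff:.3f}, simulated one-sided p-value = {count_extreme / n_sims:.3f}")
# A small p-value means the data would be unlikely if H0 were true, so we reject H0;
# otherwise we "fail to reject" H0 (we never accept it).
```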
  • 15. Big Data Analysis 15Author: Vikram Andem ISRM & IT GRC Conference Statistical Inference and Prediction Statistical Inference Statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation. Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. Confidence Interval Outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy. A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter. Confidence Level Confidence level is the probability value (1-alpha) associated with a confidence interval. It is often expressed as a percentage. For example, say alpha = 0.05 = 5%, then the confidence level is equal to (1-0.05) = 0.95, i.e. a 95% confidence level. Prediction In statistics, prediction is the process of determining the magnitude of statistical variates at some future point of time. For data analysis context the word may also occur in slightly different meanings; e.g. in a regression equation expressing a dependent variate y in terms of dependent x’s, the value given for y by specified values of x’s is called the “predicted” value even when no temporal element is involved. Prediction vs. Inference  Using data to predict an event that has yet to occur is statistical prediction. Inferring the value of a population quantity such as the average income of a country or the proportion of eligible voters who say they will vote ‘yes’ is statistical inference.  Prediction and inference answer different types of data analysis questions. Examples of predictions (because the events have not occurred at the time of writing this content):  The probability that Chicago Bulls will win the 2018 NBA playoffs is __.  The probability that Republican Party will win the 2020 Presidential election is __. Examples of inferences: (because the questions involve estimating a population value.)  The proportion of NBA fans that currently believe Chicago Bulls will win the 2018 playoffs is __.  The proportion of eligible voters that currently state they will vote for Republican Party in the 2020 Presidential election is __.
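One simple way to see a confidence interval in practice is the bootstrap sketch below, applied to a small invented sample of departure delays; it is only one of several ways to construct such an interval, shown here purely for illustration.

```python
# Minimal sketch of a confidence interval as described above: a 95% bootstrap interval
# for a mean, computed from a small made-up sample of departure delays (in minutes).
import random
import statistics

random.seed(7)
sample = [3, 0, 12, 45, 7, 0, 22, 5, 9, 31, 2, 16]

boot_means = []
for _ in range(10_000):
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(statistics.mean(resample))

boot_means.sort()
lower, upper = boot_means[249], boot_means[9749]   # roughly the 2.5th and 97.5th percentiles
print(f"sample mean = {statistics.mean(sample):.1f}, 95% CI roughly ({lower:.1f}, {upper:.1f})")
# Interpretation: if we repeatedly drew samples like this one and built an interval each
# time, about 95% of those intervals would contain the unknown population mean.
```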
  • 16. Big Data Analysis 16Author: Vikram Andem ISRM & IT GRC Conference Before you proceed to Part 3 : Please quickly review the Appendix section to familiarize with terms and terminology that will be used in the rest of the presentation.
  • 17. Big Data Analysis 17Author: Vikram Andem ISRM & IT GRC Conference Big Data Analysis  Concepts and References  Use Cases in Airline Industry Jump here directly, if you are a advanced user who understands data and knows how to analyze data. Part 3
  • 18. Big Data Analysis 18Author: Vikram Andem ISRM & IT GRC Conference Big Data Analysis: Concepts and Airline Industry Use Cases Data Analysis Machine Learning & Models derive useful analytical information so humans can better understand it. Examples:  Does spending more money on marketing & sales in area “X” vs. area “Y” makes the company more profitable?  What does the customer want ? (e.g., customer survey). Models allow machines (software programs & applications) to make “real-time” (auto) decisions. Examples:  Google search / Amazon product recommendations, Facebook news feed, etc.  Geographic (GPS based) specific advertisements or event (holiday, weather, traffic) based promotions. Big Data application areas
  • 19. Big Data Analysis 19 Author: Vikram Andem ISRM & IT GRC Conference
Bayesian Approach and Bayes Rule: the Bayesian approach, differences between Bayesians and non-Bayesians, and Bayes' theorem.
Example, email spam filtering with Bayes: the probability that the email message is spam, given the words in the email message, is expressed as P(spam | words) = P(words | spam) P(spam) / P(words).
• A key benefit: the ability to incorporate prior knowledge. • A key weakness: the need to incorporate prior knowledge. (A tiny numeric sketch follows below.)
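The tiny numeric sketch below plugs assumed probabilities into the Bayes rule stated above; none of the numbers come from real spam data.

```python
# A tiny numeric illustration of the Bayes rule stated above for spam filtering.
# All probabilities below are illustrative assumptions, not estimates from real data.
p_spam = 0.2                      # prior: P(spam)
p_word_given_spam = 0.6           # likelihood: P("free upgrade" appears | spam)
p_word_given_ham = 0.05           # P("free upgrade" appears | not spam)

# Total probability of seeing the word, then Bayes rule for the posterior.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | word) = {p_spam_given_word:.2f}")   # = 0.12 / 0.16 = 0.75
```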
  • 20. Big Data Analysis 21Author: Vikram Andem ISRM & IT GRC Conference Bayesian Belief Network Representation of Airline Passenger Behavior Source: Booz Allen Hamilton Bayesian Belief Network Representation of Airline Passenger BehaviorThe basis of this slide is from the paper titled Airline Analytics: Decision Analytics Center of Excellence by Cenk Tunasar, Ph.D., and Alex Cosmas of Booz Allen Hamilton In the above listed paper authors claim Booz Allen used the Big Data infrastructure of an airline client, and were able to analyze large datasets containing more than 3 years’ worth of passenger data of approximately 100 GB+. Booz Allen generated hypotheses to test from the Big Data set including , but not limited to: Airline Market Performance • What are the client’s natural market types and their distinct attributes? • What is the client’s competitive market health? • Where does the client capture fare premiums or fare discounts relative to other carriers? Passenger Behavior • What is the variability of booking curves by market type? • What are the intrinsic attributes of markets with the highest earn and highest burn rates? • Can predictive modeling be developed for reservation changes and no-show rates for individual passengers on individual itineraries? Consumer Choice • What is the demand impact of increasing connection time? • What is the effect of direct versus connecting itineraries on passenger preference? A use case in Airline industry (URL: http://www.boozallen.com/media/file/airline-analytics-brochure.pdf)
  • 21. Big Data Analysis 22 Author: Vikram Andem ISRM & IT GRC Conference
Bayesian ideas are very important for Big Data analysis. Bayesian themes (source: Steve Scott, Google Inc.):
Prediction: average over unknowns, don't maximize.
Uncertainty: probability coherently represents uncertainty.
Combine information: hierarchical models combine information from multiple sources.
Sparsity: sparsity plays an important role in modeling Big Data; models are "big" because of a small number of factors with many levels, and big data problems are often big collections of small data problems.
Multi-armed Bandits Problem: the multi-armed bandit problem is the problem a gambler faces at a row of slot machines, sometimes known as "one-armed bandits", when deciding which slot machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. (Source: Wikipedia.) Bayes rule applied to machine learning: the choice of which arm to play is entirely driven by parameter uncertainty. (A Thompson sampling sketch follows below.)
A use case in the airline industry, the Big Data project at Southwest Airlines: the URL below provides a visual (interactive graphics) presentation of the Big Data project at Southwest Airlines and how they used a Bayesian approach and Naive Bayes classification with the WEKA ("Waikato Environment for Knowledge Analysis") tool to analyze the following questions: 1) What are the important factors that cause delays, and what are their weights? 2) What kind of weather (e.g. sunny, cloudy, snow, rain, etc.) causes weather delays? 3) Are there time periods during the day (e.g. early morning, morning, noon, etc.) that are more prone to delays than others? (URL: http://prezi.com/f3bsv9m6yl2g/big-data-project_southwest-airlines/)
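Thompson sampling is a standard Bayesian answer to the bandit problem above: sample a plausible success rate for each arm from its posterior and play the arm that looks best. The sketch below uses three made-up arms with Beta-Bernoulli updates; it is illustrative and not the method used in the Southwest project cited.

```python
# A minimal Thompson sampling sketch for the multi-armed bandit problem described
# above (Beta-Bernoulli case). The three "arms" and their true success rates are
# invented; in practice an arm might be a promotion or a web-page variant.
import random

random.seed(3)
true_rates = [0.05, 0.11, 0.08]                  # unknown to the algorithm
successes = [0, 0, 0]
failures = [0, 0, 0]

for _ in range(5000):
    # Sample a plausible rate for each arm from its Beta posterior, play the best one.
    sampled = [random.betavariate(1 + successes[i], 1 + failures[i]) for i in range(3)]
    arm = sampled.index(max(sampled))
    reward = 1 if random.random() < true_rates[arm] else 0
    successes[arm] += reward
    failures[arm] += 1 - reward

plays = [successes[i] + failures[i] for i in range(3)]
print("plays per arm:", plays)       # most plays should concentrate on arm 1 (the best)
print("posterior means:", [round((1 + successes[i]) / (2 + plays[i]), 3) for i in range(3)])
```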
  • 22. Big Data Analysis 23Author: Vikram Andem ISRM & IT GRC Conference Example: Bayesian based “Search Optimization” on Google File System (Source: Google Analytics) Source: Steve Scott (Google Inc.) Personalization as a “Big Logistic Regression" Search words: “Chicago to Houston today” Search words: “Chicago to Houston flight tomorrow” Search words: “Chicago to Houston cheapest”
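As a toy version of "personalization as a big logistic regression" described above, the sketch below fits a click-propensity model on two invented features with scikit-learn; a production system would use a sparse, distributed model over vastly more features, but the basic shape is the same.

```python
# A toy version of "personalization as a big logistic regression": predict whether a
# user clicks a fare offer from two made-up features (days to departure, prior searches).
# The data are simulated; this is only meant to show the basic shape of the model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
days_to_departure = rng.integers(1, 60, size=n)
prior_searches = rng.integers(0, 10, size=n)
# Simulated truth: urgency and engagement both raise the click probability.
logit = -1.0 - 0.05 * days_to_departure + 0.4 * prior_searches
clicked = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([days_to_departure, prior_searches])
model = LogisticRegression().fit(X, clicked)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("P(click) for 2 days out, 6 prior searches:", model.predict_proba([[2, 6]])[0, 1].round(3))
```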
  • 23. Big Data Analysis 24Author: Vikram Andem ISRM & IT GRC Conference Meta Analysis Meta Analysis Meta-analysis refers to methods that focus on contrasting and combining results from different studies, in the hope of identifying patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies. In its simplest form, meta-analysis is normally done by identification of a common measure of effect size. A weighted average of that common measure is the output of a meta- analysis. The weighting is related to sample sizes within the individual studies. More generally there are other differences between the studies that need to be allowed for, but the general aim of a meta- analysis is to more powerfully estimate the true effect size as opposed to a less precise effect size derived in a single study under a given single set of assumptions and conditions. Source: Wikipedia Advantages  Results can be generalized to a larger population,  The precision and accuracy of estimates can be improved as more data is used. This, in turn, may increase the statistical power to detect an effect.  Inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between- study heterogeneity.  Hypothesis testing can be applied on summary estimates. A use case in Airline industry Price Elasticities of Demand for Passenger Air Travel A good discussion of the topic is detailed in the paper listed below: Price Elasticities of Demand for Passenger Air Travel: A Meta-Analysis by Martijn Brons, Eric Pels, Peter Nijkamp, Piet Rietveld of Tinbergen Institute (URL: http://papers.tinbergen.nl/01047.pdf) Meta Analysis and Big Data A good discussion of the topic is detailed in the article listed below: Meta-Analysis: The Original 'Big Data‘ by Blair T. Johnson , Professor at University of Connecticut (URL: http://meta-analysis.ning.com/profiles/blogs/meta-analysis-the-original-big-data)
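A minimal fixed-effect meta-analysis, in the spirit described above, is sketched below: study effects are combined with an inverse-variance weighted average. The study estimates and standard errors are invented.

```python
# A minimal fixed-effect meta-analysis in the spirit described above: combine effect
# estimates from several studies with an inverse-variance weighted average. The study
# numbers are invented (they could be, say, price elasticities of air travel demand).
effects  = [-1.10, -0.80, -1.40, -0.95]     # estimated effect size per study
std_errs = [0.30, 0.20, 0.50, 0.25]         # standard error of each estimate

weights = [1 / se**2 for se in std_errs]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(f"pooled effect = {pooled:.2f}, 95% CI = ({pooled - 1.96*pooled_se:.2f}, {pooled + 1.96*pooled_se:.2f})")
```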
  • 24. Big Data Analysis 25 Author: Vikram Andem ISRM & IT GRC Conference
Effect Size: Effect size is a measure of the strength of a phenomenon (for example, the change in an outcome after an experimental intervention). An effect size calculated from data is a descriptive statistic that conveys the estimated magnitude of a relationship without making any statement about whether the apparent relationship in the data reflects a true relationship in the population. In that way, effect sizes complement inferential statistics such as p-values. Among other uses, effect size measures play an important role in meta-analysis studies that summarize findings from a specific area of research, and in statistical power analyses. (Source: Wikipedia.)
Example: A weight loss program may boast that it leads to an average weight loss of 30 pounds. In this case, 30 pounds is the claimed effect size. If the weight loss program results in an average loss of 30 pounds, it is possible that every participant loses exactly 30 pounds, or that half the participants lose 60 pounds and half lose no weight at all.
"Small", "Medium", "Large" Effect Sizes: Effect sizes apply terms such as "small", "medium" and "large" to the size of the effect, and these labels are relative. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition. Cohen's conventional criteria for small, medium, and large effects are near ubiquitous across many fields. Power analysis and sample size planning require an assumed population value for the effect size. For Cohen's d, an effect size of 0.2 to 0.3 might be a "small" effect, around 0.5 a "medium" effect, and 0.8 or above a "large" effect.
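The sketch below computes Cohen's d for two invented groups (echoing the weight-loss example above) so the small/medium/large thresholds have something concrete to attach to.

```python
# Cohen's d for two groups, matching the "small/medium/large" thresholds quoted above.
# The two samples are invented (say, weight loss in pounds for treatment vs. control).
import statistics

treatment = [28, 35, 31, 24, 40, 30, 33]
control   = [5, 12, 8, 15, 10, 7, 9]

mean_diff = statistics.mean(treatment) - statistics.mean(control)
# Pooled standard deviation (similar group sizes assumed for this sketch).
n1, n2 = len(treatment), len(control)
pooled_var = ((n1 - 1) * statistics.variance(treatment) + (n2 - 1) * statistics.variance(control)) / (n1 + n2 - 2)
d = mean_diff / pooled_var ** 0.5

print(f"Cohen's d = {d:.2f}")   # |d| around 0.2 is small, 0.5 medium, 0.8+ large
```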
  • 25. Big Data Analysis 26Author: Vikram Andem ISRM & IT GRC Conference Monte Carlo Method Monte Carlo Method Monte Carlo methods (or experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results; typically one runs simulations many times over in order to obtain the distribution of an unknown probabilistic entity. The name comes from the resemblance of the technique to the act of playing and recording results in a real gambling casino. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to obtain a closed-form expression, or infeasible to apply a deterministic algorithm. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration and generation of draws from a probability distribution. Monte Carlo methods vary, but tend to follow a particular pattern:  Define a domain of possible inputs.  Generate inputs randomly from a probability distribution over the domain.  Perform a deterministic computation on the inputs.  Aggregate the results. For example: Consider a circle inscribed in a unit square. Given that circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:  Draw a square on ground, then inscribe a circle within it.  Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.  Count the number of objects inside the circle and the total number of objects.  The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π. Monte Carlo Methods for Bayesian Analysis and Big Data A good discussion of the topic is detailed in the paper listed below: A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets by David Madigan, Professor and Dean at Columbia University and Greg Ridgeway, Deputy Director at National Institute of Justice. (URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2753529/ ) Source: Wikipedia A use case in Airline industry Flight Delay-Cost (Initial delay – “type I” and Propagated delay “type II”) and Dynamic Simulation Analysis for Airline Schedule Optimization Flight Delay-Cost Simulation Analysis and Airline Schedule Optimization by Duojia Yuan of RMIT University, Victoria, Australia (URL: http://researchbank.rmit.edu.au/eserv/rmit:9807/Yuan.pdf General use case for Customer Satisfaction and Customer Loyalty Concurrent Reinforcement Learning from Customer Interactions Concurrent Reinforcement Learning from Customer Interactions by David Silver of University College London (published 2013) and Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall of Causata Ltd . (URL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/concurrent-rl.pdf ) A good discussion of the topic is detailed in the Ph.D. thesis listed below. The reliability modeling approach developed in this project (to enhance the dispatch reliability of Australian X airline fleet) is based on the probability distributions and Monte Carlo Simulation (MCS) techniques. Initial (type I) delay and propagated (type II) delay are adopted as the criterion for data classification and analysis. 
In the above paper, the authors present a framework for concurrent reinforcement learning, a new method in which a company interacts concurrently with many customers with an objective function to maximize revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website). The proposed concurrent reinforcement learning framework uses a variant of temporal-difference learning to learn efficiently from partial interaction sequences. The goal is to maximize the future rewards for each customer, given their history of interactions with the company. The proposed framework differs from traditional reinforcement learning paradigms, due to the concurrent nature of the customer interactions. This distinction leads to new considerations for reinforcement learning algorithms.
  • 26. Big Data Analysis 27Author: Vikram Andem ISRM & IT GRC Conference Bayes and Big Data: Consensus Monte Carlo and Nonparametric Bayesian Data Analysis A good discussion of the topic is detailed in the article listed below: “Bayes and Big Data: The Consensus Monte Carlo Algorithm” by  Robert E. McCulloch, of University of Chicago, Booth School of Business  Edward I. George, of University of Pennsylvania, The Wharton School  Steven L. Scott, of Google, Inc  Alexander W. Blocker, of Google, Inc  Fernando V. Bonassi, Google, Inc. (URL: http://www.rob-mcculloch.org/some_papers_and_talks/papers/working/consensus-mc.pdf) Consensus Monte Carlo For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication. Need: A useful definition of “big data” is data that is too big to fit on a single machine, either because of processor, memory, or disk bottlenecks. Graphics Processing Units (GPU) can alleviate the processor bottleneck, but memory or disk bottlenecks can only be alleviated by splitting “big data” across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging the individual Monte Carlo draws. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Source: Steve Scott (Google Inc.) Non-Parametric Bayesian Data Analysis A use case in Airline industry Airline Delays in International Air Cargo Logistics A good discussion of the topic is detailed in the paper below: “Nonparametric Bayesian Analysis in International Air Cargo Logistics” by Yan Shang of Fuqua School of Business, Duke University (URL: https://bayesian.org/abstracts/5687 ) Non-Parametric Analysis refers to comparative properties (statistics) of the data, or population, which do not include the typical parameters, of mean, variance, standard deviation, etc. Need / Motivation: Models are never correct for real world data. Non-Parametric Modelling of Large Data Sets What is a nonparametric model?  A parametric model where the number of parameters increases with data.  A really large parametric model.  A model over infinite dimensional function or measure spaces.  A family of distributions that is dense in some large space. Why nonparametric models in Bayesian theory of learning?  Broad class of priors that allows data to “speak for itself”.  Side-step model selection and averaging. Bayes and Big Data
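A minimal sketch of the consensus idea described above: run a separate sampler on each data shard and combine the per-machine draws with a weighted average. The Gaussian mean model, the shard count, and the precision-based weights below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)   # toy "big" data set
shards = np.array_split(data, 10)                     # pretend each shard lives on one machine
n_draws, prior_var, noise_var = 5_000, 100.0, 1.0

def shard_draws(shard, n_machines):
    """Posterior draws for a normal mean on one shard, with the prior split across machines."""
    prec = len(shard) / noise_var + 1.0 / (prior_var * n_machines)
    mean = (shard.sum() / noise_var) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_draws), prec

draws, weights = zip(*(shard_draws(s, len(shards)) for s in shards))
# Consensus draws: precision-weighted average of the individual machines' draws.
consensus = np.average(np.vstack(draws), axis=0, weights=np.array(weights))
print(round(consensus.mean(), 3))   # close to the full-data posterior mean (~2.0)
```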
  • 27. Big Data Analysis 28Author: Vikram Andem ISRM & IT GRC Conference Homoscedasticity vs. Heteroskedasticity Homoscedasticity In regression analysis, homoscedasticity means a situation in which the variance of the dependent variable is the same for all the data. Homoscedasticity facilitates analysis because most methods are based on the assumption of equal variance. A sequence or a vector of random variables is homoscedastic if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance. Heteroskedasticity In regression analysis, heteroskedasticity means a situation in which the variance of the dependent variable varies across the data. Heteroskedasticity complicates analysis because many methods in regression analysis are based on an assumption of equal variance. A collection of random variables is heteroscedastic if there are subpopulations that have different variabilities from others. Here "variability" could be quantified by the variance or any other measure of statistical dispersion. Thus heteroscedasticity is the absence of homoscedasticity.
  • 28. Big Data Analysis 29Author: Vikram Andem ISRM & IT GRC Conference Benford’s Law Benford’s Law Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution. This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude. Source: Wikipedia Numerically, the leading digits have the following distribution in Benford's Law, where d is the leading digit and P(d) the probability: P(1) = 30.1%, P(2) = 17.6%, P(3) = 12.5%, P(4) = 9.7%, P(5) = 7.9%, P(6) = 6.7%, P(7) = 5.8%, P(8) = 5.1%, P(9) = 4.6%. Benford's Law is usually stated for base 10. There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization from the leading 1 digit to the leading n digits. A set of numbers is said to satisfy Benford's Law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability P(d) = log10(1 + 1/d). Facts Benford’s Law holds true for a data set that grows exponentially (e.g., doubles, then doubles again in the same time span). It is best applied to data sets that go across multiple orders of magnitude. The theory does not hold true for data sets in which digits are predisposed to begin with a limited set of digits. The theory also does not hold true when a data set covers only one or two orders of magnitude. Big Data Application: Fraud Detection  Helps identify duplicates & other data pattern anomalies in large data sets.  Enables auditors and data analysts to focus on possible anomalies in very large data sets.  It does not "directly" prove that error or fraud exist, but identifies items that deserve further study on statistical grounds.  Mainly used for setting future auditing plans and is a low cost entry for continuous analysis of very large data sets.  Not good for sampling – results in very large selection sizes.  As technology matures, finding fraud will increase (not decrease).  Not all data sets are suitable for analysis. A use case in Airline industry A financial/accounting auditor can evaluate very large data sets (in a continuous monitoring or continuous audit environment) that represent a continuous stream of transactions, such as the sales made by a (third party) online retailer or the internal airline reservation system. Fraud Detection in Airline Ticket Purchases Christopher J. Rosetti, CPA, CFE, DABFA of KPMG, in his presentation titled "SAS 99: Detecting Fraud Using Benford’s Law" presented at the FAE/NYSSCPA Technology Assurance Committee on March 13, 2003, claims that United Airlines currently uses Benford's law for fraud detection! (URL: http://www.nysscpa.org/committees/emergingtech/law.ppt )
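A minimal Python sketch of a first-digit check against Benford's Law; the transaction amounts below are hypothetical, standing in for values that would come from a real transaction stream:

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected frequency of leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit_profile(values):
    """Compare observed leading-digit frequencies against Benford's Law."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    for d in range(1, 10):
        observed = counts.get(d, 0) / total
        print(f"digit {d}: observed {observed:.3f}  expected {benford_expected(d):.3f}")

# Hypothetical ticket-transaction amounts; in practice these would come from the reservation system.
amounts = [118.40, 23.75, 1420.00, 310.99, 87.20, 19.99, 264.50, 1110.00, 45.60, 920.35]
leading_digit_profile(amounts)
```

Large observed-versus-expected gaps do not prove fraud; as noted above, they only flag items that deserve further study.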
  • 29. Big Data Analysis 30Author: Vikram Andem ISRM & IT GRC Conference Multiple Hypothesis Testing Multiple Testing Problem Multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole. Source: Wikipedia For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) × (1/2)^10 ≈ 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would declare that the null hypothesis should be rejected — i.e., the coin is unfair. A multiple-comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin), to test the fairness of many coins. Imagine if one were to test 100 fair coins by this method. Given that the probability of a fair coin coming up 9 or 10 heads in 10 flips is 0.0107, one would expect that, in flipping 100 fair coins ten times each, seeing a particular (i.e., pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing any coin behave that way, without concern for which one, would be more likely than not. Precisely, the likelihood that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would be more likely to falsely identify at least one fair coin as unfair. Multiple Hypothesis Testing A use case in Airline industry Predicting Flight Delays using Multiple Hypothesis Testing A good discussion of the topic is detailed in the paper listed below: Predicting Flight Delays by Dieterich Lawson and William Castillo of Stanford University (URL: http://cs229.stanford.edu/proj2012/CastilloLawson-PredictingFlightDelays.pdf ) Also detailed in the book “Big Data for Chimps: A Seriously Fun guide to Terabyte-scale data processing“ by the same author (Dieterich Lawson) and Philip Kromer. Sample Source Code for modelling in Matlab is also provided by Dieterich Lawson and can be found at URL: https://github.com/infochimps-labs/big_data_for_chimps
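The coin calculation above can be reproduced in a few lines of Python; the Bonferroni correction at the end is one standard adjustment added here for illustration and is not part of the slide's example:

```python
from math import comb

# Probability that one fair coin shows at least 9 heads in 10 flips.
p_single = sum(comb(10, k) for k in (9, 10)) / 2**10
print(round(p_single, 4))            # ~0.0107, "significant" at the 0.05 level

# Probability that at least one of 100 fair coins passes that test by chance.
p_at_least_one = 1 - (1 - p_single) ** 100
print(round(p_at_least_one, 2))      # ~0.66: a false "unfair" finding is more likely than not

# A simple Bonferroni correction: require p < 0.05 / 100 for each individual coin instead.
print(p_single < 0.05 / 100)         # False: no single fair coin would be flagged
```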
  • 30. Big Data Analysis 31Author: Vikram Andem ISRM & IT GRC Conference The German Tank Problem The German Tank Problem The problem of estimating the maximum of a discrete uniform distribution from sampling without replacement is known in English as the German tank problem, due to its application in World War II to the estimation of the number of German tanks. The analyses illustrate the difference between frequentist inference and Bayesian inference. Estimating the population maximum based on a single sample yields divergent results, while the estimation based on multiple samples is an instructive practical estimation question whose answer is simple but not obvious. Source: Wikipedia During World War II, production of German tanks such as the Panther (below photo) was accurately estimated by Allied intelligence using statistical methods. Example: Suppose an intelligence officer has spotted k = 4 tanks with serial numbers, 2, 6, 7, and 14, with maximum observed serial number, m = 14. The unknown total number of tanks is called N. The formula for estimating the total number of tanks suggested by the frequentist approach outlined is N ≈ m + m/k − 1 (the observed maximum plus the average gap between observed serial numbers), which gives 14 + 14/4 − 1 = 16.5 for this example. Whereas the Bayesian analysis yields a probability mass function for the number of tanks, Pr(N = n | m, k) = ((k − 1)/k) · C(m − 1, k − 1)/C(n, k) for n ≥ m (and 0 otherwise), from which we can estimate the number of tanks according to the posterior mean (m − 1)(k − 1)/(k − 2), which gives 19.5 for this example. This distribution has positive skewness, related to the fact that there are at least 14 tanks. During the course of the war the Western Allies made sustained efforts to determine the extent of German production, and approached this in two major ways: conventional intelligence gathering and statistical estimation. To do this they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. Similar analyses were done on tires, which were observed to be sequentially numbered (i.e., 1, 2, 3, ..., N). The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use. Analysis of wheels from two tanks (48 wheels each, 96 wheels total) yielded an estimate of 270 tanks produced in February 1944, substantially more than had previously been suspected. German records after the war showed production for the month of February 1944 was 276. The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase German tank problem became accepted as a descriptor for this type of statistical analysis. Application in Big Data Analysis  Similar to the German Tank Problem we can estimate/analyze (large or small) data sets that we don’t have (or assumed that we don’t have).  There is “leaky” data all around us; all we have to do is to think outside the box. Companies very often don’t think about the data they publish publicly and we can either extrapolate from that data (as in the German Tank problem) or simply extract useful information from it.  A company’s competitors' websites (publicly available data) can be a valuable hunting ground. Think about whether you can use it to estimate some missing data (as with the serial numbers) and/or combine that data with other, seemingly innocuous, sets to produce some vital information. If that information gives your company a commercial advantage and is legal, then you should use it as part of your analysis. Source: Wikipedia
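A minimal sketch of both estimators for the example above, using the standard frequentist estimate and the usual Bayesian posterior under an improper uniform prior (the truncation at 1000 in the sanity check is an arbitrary illustrative cutoff):

```python
from math import comb

serials = [2, 6, 7, 14]
k, m = len(serials), max(serials)

# Frequentist estimate: sample maximum plus the average gap between observations.
freq_estimate = m + m / k - 1
print(freq_estimate)                     # 16.5

# Bayesian posterior over N (improper uniform prior).
def posterior(n, m, k):
    return (k - 1) * comb(m - 1, k - 1) / (k * comb(n, k)) if n >= m else 0.0

post_mean = (m - 1) * (k - 1) / (k - 2)  # closed form for k >= 3
print(post_mean)                         # 19.5
print(round(sum(posterior(n, m, k) for n in range(m, 1000)), 4))  # ~1.0 sanity check
```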
  • 31. Big Data Analysis 32Author: Vikram Andem ISRM & IT GRC Conference Nyquist–Shannon Sampling Theorem Nyquist–Shannon Sampling Theorem The Nyquist Theorem, also known as the sampling theorem, is a principle that engineers follow in the digitization of analog signals. For analog-to-digital conversion (ADC) to result in a faithful reproduction of the signal, slices, called samples, of the analog waveform must be taken frequently. The number of samples per second is called the sampling rate or sampling frequency. Any analog signal consists of components at various frequencies. The simplest case is the sine wave, in which all the signal energy is concentrated at one frequency. In practice, analog signals usually have complex waveforms, with components at many frequencies. The highest frequency component in an analog signal determines the bandwidth of that signal. The higher the frequency, the greater the bandwidth, if all other factors are held constant. Suppose the highest frequency component, in hertz, for a given analog signal is fmax. According to the Nyquist Theorem, the sampling rate must be at least 2fmax, or twice the highest analog frequency component. The sampling in an analog-to-digital converter is actuated by a pulse generator (clock). If the sampling rate is less than 2fmax, some of the highest frequency components in the analog input signal will not be correctly represented in the digitized output. When such a digital signal is converted back to analog form by a digital-to-analog converter, false frequency components appear that were not in the original analog signal. This undesirable condition is a form of distortion called aliasing. Application in Big Data Analysis  Although the Nyquist–Shannon Sampling Theorem is about the minimum sampling rate of a continuous signal, in Big Data Analysis practice it tells you how frequently you need to collect Big Data from sensors such as smart meters.  The frequency of data collection for Big Data is the “Velocity”, one of the three “V”s that define Big Data: Volume, Velocity and Variety. Left figure: X(f) (top blue) and XA(f) (bottom blue) are continuous Fourier transforms of two different functions, x(t) and xA(t) (not shown). When the functions are sampled at rate fs, the images (green) are added to the original transforms (blue) when one examines the discrete-time Fourier transforms (DTFT) of the sequences. In this hypothetical example, the DTFTs are identical, which means the sampled sequences are identical, even though the original continuous pre-sampled functions are not. If these were audio signals, x(t) and xA(t) might not sound the same. But their samples (taken at rate fs) are identical and would lead to identical reproduced sounds; thus xA(t) is an alias of x(t) at this sample rate. In this example (of a bandlimited function), such aliasing can be prevented by increasing fs such that the green images in the top figure do not overlap the blue portion. Right figure: Spectrum, Xs(f), of a properly sampled bandlimited signal (blue) and the adjacent DTFT images (green) that do not overlap. A brick-wall low-pass filter, H(f), removes the images, leaves the original spectrum, X(f), and recovers the original signal from its samples. Source: Wikipedia Source: Wikipedia
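A small Python demonstration of aliasing; the 6 Hz signal and the two sampling rates are arbitrary illustrative values chosen so that one rate satisfies the 2fmax rule and the other does not:

```python
import numpy as np

f_signal = 6.0   # analog component at 6 Hz, so the Nyquist rate is 12 Hz

def apparent_frequency(rate_hz):
    """Sample the 6 Hz sine for one second at the given rate and report the dominant frequency."""
    ts = np.arange(0, 1, 1 / rate_hz)
    xs = np.sin(2 * np.pi * f_signal * ts)
    spectrum = np.abs(np.fft.rfft(xs))
    freqs = np.fft.rfftfreq(len(xs), d=1 / rate_hz)
    return freqs[np.argmax(spectrum)]

print(apparent_frequency(20))   # 20 Hz > 2 * 6 Hz: apparent frequency ~6 Hz (faithful)
print(apparent_frequency(8))    # 8 Hz < 2 * 6 Hz: apparent frequency ~2 Hz (aliased)
```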
  • 32. Big Data Analysis 33Author: Vikram Andem ISRM & IT GRC Conference Simpson’s Paradox Simpson’s Paradox Simpson's paradox is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is particularly confounding when frequency data are unduly given causal interpretations. Simpson's Paradox disappears when causal relations are brought into consideration. Example:  It's a well-accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn. Simpson's paradox, however, slams a hammer down on the rule and the result is a good deal worse than a sore thumb. Simpson's paradox demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set may be the exact opposite of the conclusions from the smaller sets. Unfortunately, the conclusions from the large set can (also) be wrong.  The lurking variables (or confounding variables) in Simpson’s paradox are categorical. That is, they break the observation into groups, such as the city of origin for the airline flights. Simpson’s paradox is an extreme form of the fact that the observed associations can be misleading when there are lurking variables. Aggregate delay counts: Airline A: On Time 718, Delayed 74, Total 792; Airline B: On Time 5534, Delayed 532, Total 6066. From this aggregate table: Airline A is delayed 9.3% (74/792) of the time; Airline B is delayed only 8.8% (532/6066) of the time. So Airline A would NOT be preferable. Broken out by city: Chicago: Airline A: On Time 497, Delayed 62, Total 559; Airline B: On Time 694, Delayed 117, Total 811. Houston: Airline A: On Time 221, Delayed 12, Total 233; Airline B: On Time 4840, Delayed 415, Total 5255. From this per-city table: From Chicago, Airline A is delayed 11.1% (62/559) of the time, but Airline B is delayed 14.4% (117/811) of the time. From Houston, Airline A is delayed 5.2% (12/233) of the time, but Airline B is delayed 7.9% (415/5255) of the time. Consequently, Airline A would be preferable. This conclusion contradicts the previous conclusion. Simpson's Paradox is when Big Data sets CAN go wrong A use case in Airline industry Airline On-Time Performance at Hub-and-Spoke Flight Networks A good discussion of the topic is detailed in the paper listed below: Simpson’s Paradox, Aggregation, and Airline On-time Performance by Bruce Brown of Cal State Polytechnic University (URL: http://www.csupomona.edu/~bbrown/Brown_SimpPar_WEAI06.pdf)  Big Data doesn’t happen overnight and there’s no magic to it.  Just deploying Big Data tools and analytical solutions (R, SAS, and Tableau etc.) doesn’t guarantee anything, as Simpson’s Paradox proves.
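The reversal can be reproduced directly from the counts above with a few lines of Python:

```python
# Delay counts from the slide above: (on_time, delayed) per airline and city.
data = {
    "Chicago": {"A": (497, 62), "B": (694, 117)},
    "Houston": {"A": (221, 12), "B": (4840, 415)},
}

def delay_rate(on_time, delayed):
    return delayed / (on_time + delayed)

totals = {"A": [0, 0], "B": [0, 0]}
for city, airlines in data.items():
    for airline, (on_time, delayed) in airlines.items():
        totals[airline][0] += on_time
        totals[airline][1] += delayed
        print(f"{city} {airline}: {delay_rate(on_time, delayed):.1%} delayed")

for airline, (on_time, delayed) in totals.items():
    print(f"Overall {airline}: {delay_rate(on_time, delayed):.1%} delayed")
# Airline A has the lower delay rate in each city, yet the higher rate in the aggregate.
```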
  • 33. Big Data Analysis 34Author: Vikram Andem ISRM & IT GRC Conference Machine Learning Machine Learning and Data Mining Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances. Source: Wikipedia These two terms (machine learning and data mining) are commonly confused, as they often employ the same methods and overlap significantly. They can be roughly defined as follows:  Machine learning focuses on prediction, based on known properties learned from the training data.  Data mining focuses on the discovery of (previously) unknown properties in the data. This is the analysis step of Knowledge Discovery in Databases. Terminology  Classification: The learned attribute is categorical (“nominal”)  Regression: The learned attribute is numeric  Supervised Learning (“Training”): We are given examples of inputs and associated outputs and we learn the relationship between them.  Unsupervised Learning (sometimes: “Mining”): We are given inputs, but no outputs (such as unlabeled data) and we learn the “Latent” labels. (example: Clustering). Example: Document Classification Primary Goal of Machine Learning • Highly accurate predictions on real time and continuous data (based on rule sets with earlier training / learning and training / historical data). • Goal is not to uncover underlying “truth”. • Emphasis on methods that can handle very large datasets for better predictions. A use case in Airline industry Southwest Airlines' use of Machine Learning for Airline Safety The below URL details an article (published September 2013) on how Southwest Airlines uses Machine Learning algorithms for Big Data purposes to analyze vast amounts of very large data sets (which are publicly accessible from NASA’s DASHlink site) to find anomalies and potential safety issues and to identify patterns to improve airline safety. URL: http://www.bigdata-startups.com/BigData-startup/southwest-airlines-uses-big-data-deliver-excellent-customer-service/ Why Machine Learning? Use Case 1: Increase barrier to entry when product / service quality is dependent on data. Use Case 2: Customize product / service to increase engagement and profits. Example: Customize sales page to increase conversion rates for online products.
  • 34. Big Data Analysis 35Author: Vikram Andem ISRM & IT GRC Conference Classification Rules and Rule Sets Rule Set to Classify Data Golf Example: To Play or Not to Play A use case in Airline industry Optimal Airline Ticket Purchasing (automated feature selection) A good discussion of the topic is detailed in the paper listed below: Optimal Airline Ticket Purchasing Using Automated User-Guided Feature Selection by William Groves and Maria Gini of University of Minnesota (URL: http://ijcai.org/papers13/Papers/IJCAI13-032.pdf ) Classification Problems Examples of Classification Problems: • Text categorization (e.g., spam filtering) • Market segmentation (e.g.: predict if customer will respond to promotion). • Natural-language processing (e.g., spoken language understanding).
  • 35. Big Data Analysis 36Author: Vikram Andem ISRM & IT GRC Conference Decision Tree Learning Example: Good vs. Evil Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data but not decisions; rather the resulting classification tree can be an input for decision making. Source: Wikipedia
  • 36. Big Data Analysis 37Author: Vikram Andem ISRM & IT GRC Conference Tree Size vs. Accuracy Accuracy, Confusion Matrix, Overfitting, Good/Bad Classifiers, and Controlling Tree Size Building an Accurate Classifier Good and Bad Classifiers A use case in Airline industry Predicting Airline Customers Future Values A good discussion of the topic is detailed in the paper listed below: Applying decision trees for value-based customer relations management: Predicting airline customers future values by Giuliano Tirenni, Christian Kaiser and Andreas Herrmann of the Center for Business Metrics at University of St. Gallen, Switzerland. (URL: http://ipgo.webs.upv.es/azahar/Pr%C3%A1cticas/articulo2.pdf ) Theory Overfitting example Accuracy and Confusion Matrix
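To make the accuracy and confusion matrix ideas above concrete, here is a minimal Python sketch; the "delayed" / "on_time" labels and the prediction vector are hypothetical examples, not output of any of the cited models:

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive="delayed"):
    """2x2 confusion counts plus accuracy for a binary classifier."""
    pairs = Counter(zip(actual, predicted))
    tp = pairs[(positive, positive)]
    tn = sum(v for (a, p), v in pairs.items() if a != positive and p != positive)
    fp = sum(v for (a, p), v in pairs.items() if a != positive and p == positive)
    fn = sum(v for (a, p), v in pairs.items() if a == positive and p != positive)
    accuracy = (tp + tn) / len(actual)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "accuracy": accuracy}

actual    = ["delayed", "on_time", "on_time", "delayed", "on_time", "delayed"]
predicted = ["delayed", "on_time", "delayed", "on_time", "on_time", "delayed"]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2, 'accuracy': 0.666...}
```

High training accuracy with a much weaker confusion matrix on held-out data is the practical symptom of the overfitting discussed on this slide.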
  • 37. Big Data Analysis 38Author: Vikram Andem ISRM & IT GRC Conference Entropy and Information Gain Entropy Question: How do you determine which attribute best classifies data or a data set? Answer: Entropy Entropy is a measure of unpredictability of information content. Example: A poll on some political issue. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the entropy of the second poll results is small. Source: Wikipedia Information Gain A statistical quantity measuring how well an attribute classifies the data.  Calculate the information gain for each attribute.  Choose the attribute with the greatest information gain. Information Theory: Background If there are n equally probable possible messages, then the probability p of each is 1/n, and the information conveyed by a message is -log2(p) = log2(n). For example, if there are 16 messages, then log2(16) = 4 and we need 4 bits to identify/send each message. In general, if we are given a probability distribution P = (p1, p2, .., pn), the information conveyed by the distribution (aka the Entropy of P) is: I(P) = -(p1*log2(p1) + p2*log2(p2) + .. + pn*log2(pn)) Largest Entropy: Boolean functions with the same number of ones and zeros have the largest entropy. In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree. Usually an attribute with high mutual information should be preferred to other attributes. A use case in Airline industry An Airline matching Airplanes to Routes (using Machine Learning) A good discussion of the topic is detailed in the paper listed below: Learning Permutations with Exponential Weights by David P. Helmbold and Manfred K. Warmuth of University of California, Santa Cruz (URL: http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_helmbold09a.pdf )
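A minimal Python sketch of the entropy and information gain calculations described above; the tiny "play golf" style records are made-up illustrative values in the spirit of the earlier golf example:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Tiny illustrative records (values are made up for the example).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
for attr in ("outlook", "windy"):
    print(attr, round(information_gain(rows, attr, "play"), 3))
# "outlook" has the higher gain here, so a decision tree would split on it first.
```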
  • 38. Big Data Analysis 39Author: Vikram Andem ISRM & IT GRC Conference The Bootstrap The Bootstrap A good discussion of the topic is detailed in the article listed below: “The Big Data Bootstrap” by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan of University of California, Berkeley (URL: http://www.cs.berkeley.edu/~jordan/papers/blb_icml2012.pdf ) Concept Bootstrapping is a method for assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using only very simple methods. Generally, it falls in the broader class of resampling methods. The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample). As the population is unknown, the true error in a sample statistic against its population value is unknowable. In bootstrap resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference from resample data → 'true' sample is measurable. Source: Wikipedia Big Data and the Bootstrap Abstract from the paper listed above: The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, the authors present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. The authors provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data. The authors claim their procedure for quantifying estimator quality is “accurate”, “automatic” and “scalable”, and state that they have tested it on data sets of size exceeding 1 Terabyte. A use case in Airline industry Modeling Demand and Supply for Domestic and International Air Travel Economics for Cost Minimization and Profit Maximization An in-depth and excellent scholarly treatment of the application of bootstrapping for modelling Domestic and International Air travel Economics (demand / supply) for an Airline company is detailed in the Ph.D. thesis listed below (slightly old – published April 1999, but still very relevant today): Essays on Domestic and International Airline Economics with Some Bootstrap Applications by Anthony Kenneth Postert of Rice University (URL: http://scholarship.rice.edu/bitstream/handle/1911/19428/9928581.PDF?sequence=1 )
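A minimal single-machine Python sketch of the basic bootstrap idea (not the BLB procedure from the paper); the exponential "delay minutes" sample, the resample count, and the confidence level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.exponential(scale=30.0, size=500)   # e.g., hypothetical flight delay minutes

def bootstrap_ci(data, stat=np.mean, n_resamples=5000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic of the sample."""
    stats = np.array([
        stat(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

low, high = bootstrap_ci(sample)
print(f"mean = {sample.mean():.1f}, 95% bootstrap CI = ({low:.1f}, {high:.1f})")
```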
  • 39. Big Data Analysis 40Author: Vikram Andem ISRM & IT GRC Conference Ensemble Learning, Bagging and Boosting Ensemble Learning The basis of this slide is from the original presentation titled Bayesian Ensemble Learning for Big Data by Rob McCulloch of University of Chicago, Booth School of Business, published November 17, 2013 (URL: http://www.rob-mcculloch.org/some_papers_and_talks/talks/dsi-bart.pdf ) Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives. Source: Wikipedia A use case in Airline industry Air Traffic Capacity impact during Adverse Weather conditions A good discussion of the topic is detailed in the paper listed below: A Translation of Ensemble Weather Forecasts into Probabilistic Air Traffic Capacity Impact by Matthias Steiner, Richard Bateman, Daniel Megenhardt, Yubao Liu, Mei Xu, Matthew Pocernich of the National Center for Atmospheric Research, and by Jimmy Krozel of Metron Aviation (URL: http://nldr.library.ucar.edu/repository/assets/osgc/OSGC-000-000-000-687.pdf ) Bagging Bootstrap aggregating, also called Bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Source: Wikipedia Boosting Boosting is a machine learning meta-algorithm for reducing bias in supervised learning. Boosting is based on the question: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Source: Wikipedia Ensemble Learning and Big Data
  • 40. Big Data Analysis 41Author: Vikram Andem ISRM & IT GRC Conference Random Forests Random Forests Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. Source: Wikipedia A use case in Airline industry Network-based model for Predicting Air Traffic Delays The authors of the below paper propose a model using Random Forest (RF) algorithms, considering both temporal and spatial (that is, network) delay states as explanatory variables. In addition to local delay variables that describe the arrival or departure delay states of the most influential airports and origin-destination (OD) pairs in the network, the authors propose new network delay variables that depict the global delay state of the entire NAS (National Airspace System) at the time of prediction. A Network-Based Model for Predicting Air Traffic Delays by Juan Jose Rebollo and Hamsa Balakrishnan of Massachusetts Institute of Technology (URL: http://www.mit.edu/~hamsa/pubs/RebolloBalakrishnanICRAT2012.pdf) Random Forests in Big Data Cloudera: In the below URL link, Cloudera (a major Big Data vendor) shows how to implement a Poisson approximation to train a random forest on an enormous data set (with R, a free open source statistical software, on the Hadoop File System). The link also provides Map and Reduce source code. URL: https://blog.cloudera.com/blog/2013/02/how-to-resample-from-a-large-data-set-in-parallel-with-r-on-hadoop/
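For readers who want to try the idea on a single machine, here is a minimal sketch using scikit-learn (an assumption; this is not the distributed R/Hadoop workflow from the Cloudera link, nor the Rebollo and Balakrishnan model). The delay features and labels are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([
    rng.integers(0, 24, n),          # scheduled departure hour (synthetic)
    rng.integers(0, 7, n),           # day of week (synthetic)
    rng.normal(15, 10, n),           # average origin-airport delay in minutes (synthetic)
])
y = (X[:, 2] + rng.normal(0, 10, n) > 25).astype(int)   # 1 = flight delayed (synthetic rule)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))            # held-out accuracy
print(model.feature_importances_.round(3))              # which features drive the predictions
```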
  • 41. Big Data Analysis 42Author: Vikram Andem ISRM & IT GRC Conference k-nearest Neighbours k-nearest Neighbours The k-nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:  In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.  In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. Both for classification and regression, it can be useful to weight contributions of neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). Data Reduction Data reduction is one of the most important problems for work with huge data sets. Usually, only some of the data points are needed for accurate classification. Those data are called the prototypes and can be found as follows: 1. Select the class-outliers, that is, training data that are classified incorrectly by k-NN (for a given k) 2. Separate the rest of the data into two sets: (i) the prototypes that are used for the classification decisions and (ii) the absorbed points that can be correctly classified by k-NN using prototypes, which can be removed from the training set. K-Nearest Neighbours and Big Data A good discussion of how to execute kNN joins in a MapReduce cluster with algorithms in MapReduce to perform efficient parallel kNN joins on large data is presented in the paper below. The authors demonstrated ideas on using Hadoop with extensive experiments in large real and synthetic datasets, with tens or hundreds of millions of records in up to 30 dimensions, demonstrating efficiency, effectiveness, and scalability. 
Efficient Parallel kNN Joins for Large Data in MapReduce by Chi Zhang of Florida State University and Jeffrey Jestes of University of Utah (URL: http://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf ) Source: Wikipedia Background: Nearest Neighbor Graph Nearest neighbor graph (NNG) for a set of n objects P in a metric space (e.g., for a set of points in the plane with Euclidean distance) is a directed graph with P being its vertex set and with a directed edge from p to q whenever q is a nearest neighbor of p. Example: The right side image shows a nearest neighbor graph of 100 points in the Euclidean plane. k-nearest neighbor graph (k-NNG) is a graph in which two vertices p and q are connected by an edge, if the distance between p and q is among the k-th smallest distances from p to other objects from P Source: Wikipedia Source: Wikipedia
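A minimal from-scratch Python sketch of k-NN classification as described above; the toy 2-D points loosely mimic the squares-versus-triangles illustration and are made up for the example:

```python
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D points mimicking the squares-vs-triangles illustration above.
train = [((1, 1), "square"), ((1, 2), "square"), ((2, 1), "square"),
         ((5, 5), "triangle"), ((6, 5), "triangle"), ((5, 6), "triangle")]
print(knn_predict(train, (2, 2), k=3))   # square
print(knn_predict(train, (5, 4), k=3))   # triangle
```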
  • 42. Big Data Analysis 43Author: Vikram Andem ISRM & IT GRC Conference k-nearest Neighbours (continued) A use case in Airline industry Seating Arrangement and Inflight Purchase / Buying Behavior of Airline Customers The below paper investigates and characterizes how social influence affects the buying behavior of airline passengers who can purchase items through an individual entertainment system located in front of them. The author used the seating configuration in the airplane as a basis for the analysis. The author used large data sets with a sample size for the analysis of purchase behavior of about 257,000 passengers in nearly 2,000 flights, where the passengers performed 65,525 transactions, with an average of 33.3 transactions per flight. The author claims to find strong evidence of social effects and states that the average number of transactions per passenger increases by 30% upon observation of a neighbor's purchase. Analyzing within- and cross-category effects, the author found that passengers are likely to buy from the same category purchased by their neighbors. For example, a purchase of an alcoholic beverage increases the probability of a same-category purchase by a neighbor by 78%. The author claims peer effects also take place at a deeper level than product category: passengers 'imitate' their peers' decisions on the type of food, alcohol and even movie genre. The paper also investigates the determinants of social influence: the author claims no support is found for informational learning as a significant mechanism in driving social influence. The main determinant of social influence is found to be the number of neighbors observed purchasing an item. The results are consistent with informational learning where consumers only learn from others' actions, but not from their inaction. Peer Effects in Buying Behavior: Evidence from In-Flight Purchases By Pedro M. Gardete, Assistant Professor of Marketing at Stanford University (published September 2013) (URL: http://faculty-gsb.stanford.edu/gardete/documents/SocialEffects_8_2013.pdf )
  • 43. Big Data Analysis 44Author: Vikram Andem ISRM & IT GRC Conference Stochastic Gradient Descent Gradient Descent Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals. Source: Wikipedia Stochastic Gradient Descent Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions. Source: Wikipedia Source: Wikipedia
  • 44. Big Data Analysis 45Author: Vikram Andem ISRM & IT GRC Conference Stochastic Gradient Descent (continued) Gradient Descent Example
  • 45. Big Data Analysis 46Author: Vikram Andem ISRM & IT GRC Conference Stochastic Gradient Descent (continued) Stochastic Gradient Descent: Example Source: Wikipedia Stochastic Gradient Descent vs. Minibatch Gradient Descent Stochastic Gradient Descent and Big Data The below URL link provides a presentation on “Stochastic Optimization for Big Data Analytics”. Stochastic Optimization for Big Data Analytics by Tianbao Yang and Shenghuo Zhu of NEC Laboratories America and Rong Jin of Michigan State University (URL: http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf) The authors in the below paper present stochastic gradient descent techniques for online learning and ensemble methods to scale out to large amounts of data at Twitter, with details on how to integrate machine learning tools into the Hadoop platform (using Pig, a high-level programming tool for Hadoop). Large-Scale Machine Learning at Twitter by Jimmy Lin and Alek Kolcz of Twitter, Inc. (URL: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf )
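A minimal single-machine Python sketch of stochastic gradient descent for a simple linear fit; the synthetic data, learning rate, and epoch count are illustrative assumptions (a minibatch variant would update on small groups of examples instead of one at a time):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic linear data: y = 3*x + 2 + noise.
X = rng.uniform(-1, 1, size=(10_000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=10_000)

w, b, lr = 0.0, 0.0, 0.05
for epoch in range(5):
    for i in rng.permutation(len(X)):      # one randomly ordered example at a time = stochastic
        pred = w * X[i, 0] + b
        err = pred - y[i]
        w -= lr * err * X[i, 0]            # gradient of 0.5*err^2 with respect to w
        b -= lr * err                      # gradient of 0.5*err^2 with respect to b
print(round(w, 2), round(b, 2))            # close to the true values 3 and 2
```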
  • 46. Big Data Analysis 47Author: Vikram Andem ISRM & IT GRC Conference Stochastic Games and Markov Perfect Equilibrium Importance of Markov Perfect Equilibrium in Airline industry Pricing As an example of the use of this equilibrium concept we consider the competition between firms which had invested heavily into fixed costs and are dominant producers in an industry, forming an oligopoly. The players are taken to be committed to levels of production capacity in the short run, and the strategies describe their decisions in setting prices. Firms' objectives are modeled as maximizing present discounted value of profits. Airfare Game / Airline Pricing Game: Often an airplane ticket for a certain route has the same price on either airline A or airline B. Presumably, the two airlines do not have exactly the same costs, nor do they face the same demand function given their varying frequent-flyer programs, the different connections their passengers will make, and so forth. Thus, a realistic general equilibrium model would be unlikely to result in nearly identical prices. Both airlines have made sunk investments into the equipment, personnel, and legal framework. In the near term we may think of them as committed to offering service. We therefore see that they are engaged, or trapped, in a strategic game with one another when setting prices. Equilibrium: Consider the following strategy of an airline for setting the ticket price for a certain route. At every price-setting opportunity:  If the other airline is charging $300 or more, or is not selling tickets on that flight, charge $300  If the other airline is charging between $200 and $300, charge the same price  If the other airline is charging $200 or less, choose randomly between the following three options with equal probability: matching that price, charging $300, or exiting the game by ceasing indefinitely to offer service on this route. This is a Markov strategy because it does not depend on a history of past observations. It satisfies also the Markov reaction function definition because it does not depend on other information which is irrelevant to revenues and profits. Assume now that both airlines follow this strategy exactly. Assume further that passengers always choose the cheapest flight and so if the airlines charge different prices, the one charging the higher price gets zero passengers. Then if each airline assumes that the other airline will follow this strategy, there is no higher-payoff alternative strategy for itself, i.e. it is playing a best response to the other airline strategy. If both airlines followed this strategy, it would form a Nash equilibrium in every proper subgame, thus a subgame-perfect Nash equilibrium. Source: Wikipedia Stochastic (or Markov) Games Stochastic (or Markov) game, is a dynamic game with probabilistic transitions played by one or more players. The game is played in a sequence of stages. At the beginning of each stage the game is in some state. The players select actions and each player receives a payoff that depends on the current state and the chosen actions. The game then moves to a new random state whose distribution depends on previous state and actions chosen by players. The procedure is repeated at the new state and play continues for a finite or infinite number of stages. Total payoff to a player is often taken to be discounted sum of stage payoffs or the limit inferior of averages of stage payoffs. 
Source: Wikipedia Markov Perfect Equilibrium A Markov perfect equilibrium is an equilibrium concept in game theory. It is the refinement of the concept of sub game perfect equilibrium to extensive form games for which a pay-off relevant state space can be readily identified. In extensive form games, and specifically in stochastic games, a Markov perfect equilibrium is a set of mixed strategies for each of the players which satisfy the following criteria:  The strategies have the Markov property of memorylessness, meaning that each player's mixed strategy can be conditioned only on the state of the game. These strategies are called Markov reaction functions.  The state can only encode payoff-relevant information. This rules out strategies that depend on non- substantive moves by the opponent. It excludes strategies that depend on signals, negotiation, or cooperation between players (e.g. cheap talk or contracts).  The strategies form a subgame perfect equilibrium of the game. Source: Wikipedia Subgame Perfect Equilibrium Subgame Perfect Equilibrium is a refinement of a Nash equilibrium used in dynamic games. A strategy profile is a subgame perfect equilibrium if it represents a Nash equilibrium of every subgame of the original game. Informally, this means that if (1) the players played any smaller game that consisted of only one part of the larger game and (2) their behavior represents a Nash equilibrium of that smaller game, then their behavior is a subgame perfect equilibrium of the larger game. Source: Wikipedia Dynamic Airfare Pricing and Competition The below paper details Airline industry price competition for an oligopoly in a dynamic setting, where each of the sellers has a fixed number of units available for sale over a fixed number of periods. Demand is stochastic, and depending on how it evolves, sellers may change their prices at any time. Dynamic Price Competition with Fixed Capacities by Kalyan Talluri & Victor Martinez de Albeniz A use case in Airline industry Most of the work in this paper was done prior to the writing of the paper as part of (both author's) Ph.D. dissertation at Massachusetts Institute of Technology (published February 2010) (URL: www.econ.upf.edu/docs/papers/downloads/1205.pdf)
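A minimal sketch simulating the Markov reaction function from the airfare game above; the starting prices, the number of rounds, and the use of None to represent exit are illustrative assumptions, not part of the cited material:

```python
import random

def markov_reaction(other_price, rng):
    """The Markov pricing strategy from the airfare game: react only to the
    rival's current posted price (no history)."""
    if other_price is None or other_price >= 300:
        return 300
    if other_price > 200:
        return other_price                            # match any price between $200 and $300
    return rng.choice([other_price, 300, None])       # match, reset to $300, or exit the route

rng = random.Random(0)
price_a, price_b = 260, 240                           # illustrative starting prices
for day in range(10):
    price_a = markov_reaction(price_b, rng)
    if price_a is None:
        print("Airline A exits the route")
        break
    price_b = markov_reaction(price_a, rng)
    if price_b is None:
        print("Airline B exits the route")
        break
    print(f"day {day}: A=${price_a}, B=${price_b}")
```

With these starting prices the two carriers immediately match and hold the same fare, which illustrates why identical posted fares are consistent with the equilibrium behavior described above.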
  • 47. Big Data Analysis 48Author: Vikram Andem ISRM & IT GRC Conference Stochastic Games and Markov Perfect Equilibrium (continued) A use case in Airline industry Dynamic Revenue Management in Airline Alliances / Code Sharing The below paper presents and excellent formulization of a Markov-game model of a two-partner airline alliance that can be used to analyze the effects of these mechanisms on each partner’s behavior. The authors show that no Markovian transfer pricing mechanism can coordinate an arbitrary alliance. Next, the authors derive the equilibrium acceptance policies under each scheme and use analytical techniques, as well a numerical analyses of sample alliances, to generate fundamental insights about partner behavior under each scheme. The analysis and numerical examples also illustrate how certain transfer price schemes are likely to perform in networks with particular characteristics. Dynamic Revenue Management in Airline Alliances by Robert Shumsky of Dartmouth College and Christopher Wright, Harry Groenevelt of University of Rochester (published February 2009) (URL: http://www.researchgate.net/publication/220413135_Dynamic_Revenue_Management_in_Airline_Alliances/file/72e7e5215a1f91ed5b.pdf )