Big Data: An Overview
What Is Big Data?
What Is Big Data?
• Big Data is not simply a huge pile of information
• A good starting place is the following paraphrase:
“Big Data describes datasets so large they become awkward
to manage with traditional database tools at a reasonable cost.”
VOLUME VELOCITY VARIETY VALUE
[Slide graphic: social, blog, and smart-meter data streams rendered as raw binary.]
A Breakdown Of What Makes Up Big Data
Data Growth Explosion
• 1 GB of stored content can create 1 PB of data in transit
Data & Image courtesy of IDC
• The totality of stored data is doubling about every 2 years
• This meant 130 EB in 2005
• 1227 EB in 2010 (1.19 ZB)
• 7910 EB in 2015 (7.72 ZB)
Growth Of Big Data
Harnessing Insight From Big Data Is Now Possible
[Chart: GB of data (in billions), from 0 to 10,000, over 2005 to 2015, split into
structured data (managed inside relational databases) and unstructured data
(managed outside relational databases).]
• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data, managed outside relational databases
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
So, Just Any Dataset?
• Big Data Can Work
With Any Dataset
• However, Big Data
Shines When Dealing
With Unstructured Data
Structured Vs. Unstructured
Structured Data is any data to which a
pre-defined data model can be applied
in an automated fashion, producing a
semantically meaningful result without
referencing outside elements.
In other words, if you can apply some
template to a data set and have it
instantly make sense to the average
person, it’s structured.
If you can’t, it’s unstructured.
Really? Only Two Categories?
Okay, there’s also
semi-structured data.
Which basically
means after the
template is applied,
some of the result
will make sense and
some will not.
XML is a classic
example of this
kind of data.
Formal Definitions Of Data Types
Structured Data:
Entities in the same group have the same descriptions (or attributes), while descriptions for
all entities in a group (or schema): a) have the same defined format; b) have a predefined
length; c) are all present; and d) follow the same order. Structured data are what is normally
associated with conventional databases such as relational transactional ones where
information is organized into rows and columns within tables. Spreadsheets are another
example. Nearly all widely used database management systems (DBMS) are designed for
structured data.
Semi-Structured Data:
Semi-structured data are intermediate between structured and unstructured data, wherein “tags” or
“structure” are associated or embedded within unstructured data. Semi-structured data are
organized in semantic entities, similar entities are grouped together, entities in the same
group may not have same attributes, the order of attributes is not necessarily important, not
all attributes may be required, and the size or type of same attributes in a group may differ. To
be organized and searched, semi-structured data should be provided electronically from
database systems, file systems (e.g., bibliographic data, Web data) or via data exchange
formats (e.g., EDI, scientific data, XML).
Unstructured Data:
Data can be of any type and do not necessarily follow any format or sequence, do not follow
any rules, are not predictable, and can generally be described as “free form.” Examples of
unstructured data include text, images, video or sound (the latter two also known as
“streaming media”). Generally, “search engines” are used for retrieval of unstructured data
via querying on keywords or tokens that are indexed at time of the data ingest.
Informal Definitions Of Data Types
Structured Data:
Fits neatly into a relational structure.
Semi-Structured Data:
Think documents or EDI.
Unstructured Data:
Can be anything.
Text Video Sound Images
Tools For Dealing With Semi/Un-Structured Data
What Is Hadoop?
“The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.”
“The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be
prone to failures.”
Rather than moving the data to a central server for processing…
The Paradigm Shift Of Hadoop
Centralized Processing Doesn’t Work
Moving data to a central location for
processing (like, say, Informatica)
cannot scale. You can only buy a
machine so big.
The Paradigm Shift Of Hadoop
Bandwidth Is The Bottleneck
• Moving data around
is expensive.
• Bandwidth $$ > CPU $$
The Paradigm Shift Of Hadoop
Process The Data Locally Where It Lives
The Paradigm Shift Of Hadoop
Then Return Only The Results
• You move much less data
around this way
• You also gain the advantage
of greater parallel processing
Where Did Hadoop Originate?
GFS
Presented To The
Public In 2003
MapReduce
Presented To The
Public in 2004
Spreading Out From Google
Doug Cutting was working on “Nutch”, Yahoo’s next-generation search
engine, when he read the Google papers and reverse engineered the
technology. The elephant was his son’s toy, named….
Going Open Source
HDFS MapReduce
Released To Public 2006
A Bit More In Depth, Then A Lot More In Depth
HDFS MapReduce
HDFS is primarily a data
redundancy solution.
MapReduce is where
the work gets done.
How Hadoop Works
Hadoop is basically a massively parallel, shared-nothing,
distributed processing framework.
GFS / HDFS
HDFS Distributes Files At The Block Level Across Multiple
Commodity Devices For Redundancy On The Cheap
Not RAID:
Distribution Is Across Machines/Racks
Data Distribution
By Default, HDFS Writes Into Blocks & The Blocks
Are Distributed x3
WORM
Data Is Written Once & (Basically) Never Erased
How Is The Data Manipulated?
Not Random Reads
Data Is Read From The Stream In
Large, Contiguous Chunks
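A small sketch of these write-once / stream-read semantics through the HDFS Java API (the cluster address, file path, and contents are illustrative assumptions, not from the deck):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address
    FileSystem fs = FileSystem.get(conf);

    // Write once: HDFS splits the file into blocks and, by default,
    // replicates each block three times across machines/racks.
    Path p = new Path("/data/events.log");
    FSDataOutputStream out = fs.create(p, true);
    out.writeBytes("first record\n");
    out.close();
    fs.setReplication(p, (short) 3); // explicit here, though 3 is the default

    // Read as a stream, front to back, in large contiguous chunks.
    // There is no in-place update or random-access rewrite.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line);
    }
    in.close();
  }
}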
The Key To Hadoop Is MapReduce
In a Shared Nothing architecture,
programmers must break the work
down into distinct segments that are:
• Autonomous
• Digestible
• Able to be processed independently
• Written with the expectation of
incipient failure at every step
A Canonical MapReduce Example
Image Credit: Martijn van Groningen
The data
arrives into
the system.
A MapReduce Example
The Input
The data is moved into the
HDFS system and divided into
blocks, each of which is copied
multiple times for redundancy.
A MapReduce Example
Splitting The Input Into Chunks
The Mapper picks up a chunk for
processing. The MR Framework
ensures only one mapper will be
assigned to a given chunk
A MapReduce Example
Mapping The Chunks
In this case, the Mapper
emits a word with the number
of times it was found.
A MapReduce Example
Mapping The Chunks
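A minimal sketch of this Mapper in Java (using the standard org.apache.hadoop.mapreduce API; in the simplest form it emits (word, 1) per occurrence and lets the framework do the counting downstream):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Whitespace tokenization is an illustrative simplification.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1) for every occurrence
      }
    }
  }
}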
The Shuffler can do a rough
sort of like items (optional)
A MapReduce Example
A Shuffle Sort
The Reducer combines
the Mapper’s output into
a total
A MapReduce Example
Reducing The Emissions
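A matching Reducer sketch: the framework guarantees that all values for a given word arrive together, so a running sum yields the total.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  public void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    context.write(word, total); // emit (word, total occurrences)
  }
}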
The job completes with a
numeric index of words found
within the original input.
A MapReduce Example
The Output
MapReduce Is Not Only Hadoop
http://blogs.oracle.com/datawarehousing/2009/10/in-database_map-reduce.html
MapReduce is a programming paradigm, not a language. You can do MapReduce
within an Oracle database; it’s just usually not a good idea. A large MapReduce
job would quickly exhaust the SGA of any Oracle environment.
Problem Solving With MapReduce
• The key feature is the Shared Nothing architecture.
• Any MapReduce program has to understand
and leverage that architecture.
• This is usually a paradigm shift for most
programmers and one that many cannot
overcome.
Programming With MapReduce
• HDFS & MapReduce Are
Written In Java
package org.myorg;

import java.io.*;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount2 extends Configured implements Tool {

  public static class Map
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();

    private long numRecords = 0;
    private String inputFile;

    public void setup(Context context) {
      Configuration conf = context.getConfiguration();
      caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
      inputFile = conf.get("mapreduce.map.input.file");

      if (conf.getBoolean("wordcount.skip.patterns", false)) {
        Path[] patternsFiles = new Path[0];
        try {
          patternsFiles = DistributedCache.getLocalCacheFiles(conf);
        } catch (IOException ioe) {
          System.err.println("Caught exception while getting cached files: "
              + StringUtils.stringifyException(ioe));
        }
        for (Path patternsFile : patternsFiles) {
          parseSkipFile(patternsFile);
        }
      }
    }

    private void parseSkipFile(Path patternsFile) {
      try { // … (listing truncated)
• Will Work With Any Language
Supporting STDIN/STDOUT
• Lots Of People Using Python,
R, Matlab, Perl, Ruby et al
• Is Still Very Immature &
Requires Low Level Coding
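To round out the WordCount2 excerpt above, here is a sketch of the driver boilerplate that wires a job together (the class names refer to the mapper and reducer sketches shown earlier; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordMapper.class);    // mapper sketched earlier
    job.setCombinerClass(WordReducer.class); // optional local pre-aggregation
    job.setReducerClass(WordReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

Packaged into a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCountDriver /input /output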
What Are Some Big Data Use Cases?
• Inverse Frequency / Weighting
• Co-Occurrence
• Behavioral Discovery
• “The Internet Of Things”
• Classification / Machine Learning
• Sorting
• Indexing
• Data Intake
• Language Processing
Basically, Clustering And Targeting
Inverse Frequency Weighting
Recommendation
Systems
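As a reference point (this is the textbook tf-idf definition, not a formula spelled out on the slide), the inverse-document-frequency weight of a term t across N documents, and the combined weight, are:

\mathrm{idf}(t) = \log \frac{N}{\lvert \{\, d : t \in d \,\} \rvert},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

Terms that appear everywhere score near zero and rare terms score high, which is what makes the weighting useful for recommendation and retrieval systems.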
Co-Occurrence
Fundamental Data Mining –
People Who Did This Also Do That
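A hedged sketch of the co-occurrence idea as a MapReduce mapper (the one-basket-per-line, comma-separated input format is an invented assumption); the emitted pairs can then be totaled by a summing reducer like the one shown earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CoOccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override
  public void map(LongWritable offset, Text basket, Context context)
      throws IOException, InterruptedException {
    String[] items = basket.toString().split(",");
    // Emit every ordered pair of distinct items in the basket:
    // "people who did A also did B".
    for (String a : items) {
      for (String b : items) {
        if (!a.equals(b)) {
          pair.set(a.trim() + "\t" + b.trim());
          context.write(pair, ONE);
        }
      }
    }
  }
}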
Behavioral Discovery
Behavioral Discovery
“The best minds of my generation are
thinking about how to make people
click ads.”
Jeff Hammerbacher,
Former Research Scientist at Facebook
Currently Chief Scientist at Cloudera
“The Internet Of Things”
“Data Exhaust”
Classification / Machine Learning
Sorting
Current Record Holder:
•10PB sort
•8000 nodes
•6 hours, 27 minutes
•September 7, 2011
Current Record Holder:
•1.5 TB
•2103 nodes
•59 seconds
•February 26, 2013
Indexing
Data Intake
Hadoop can be used as a massively parallel ETL tool:
Flume to ingest files, MapReduce to transform them.
Language Processing
Includes Sentiment
Analysis
How can you infer
meaning from
someone’s words?
Does that smile mean
happy? Sarcastic?
Bemusement?
Anticipation?
How Can Big Data Help You?
9 Use Cases:
• Natural Language Processing
• Internal Misconduct
• Fraud Detection
• Marketing
• Risk Management
• Compliance / Regulatory Reporting
• Portfolio Management
• IT Optimization
• Predictive Analysis
Compliance / Regulatory Reporting
Predictive Analysis
Think data mining on
steroids. One of the main
benefits Hadoop brings to
the enterprise is the ability
to analyze every piece of
data, not just a statistical
sample or an aggregated
form of the entire
datastream.
Risk Management
Photo credit: Guinness World
Records (88 catches, by the way)
When considering a new hire, an extended investigation may show risky behavior on the
applicant’s part which may exclude him or her from more sensitive positions.
Risk Management
Behavioral Analysis
Fraud Detection
“Dear Company: I hurt myself working on the line and now I can’t walk without a
cane.” Then he tells his Facebook friends he’s going to his house in Belize for
some waterskiing.
Internal Misconduct
One of the reasons why the FBI was able to close in on the identities of the people
involved is that they geolocated the sender and recipient of the Gmail emails and
connected those IP addresses with known users on those same IP addresses.
Portfolio Management
• Evaluate portfolio performance on existing holdings
• Evaluate portfolio for future activities
• High speed arbitrage trading
• Simply keeping up:
"Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th
straight year in a row”
10,000 credit card transactions per second
Statistics courtesy of ComputerWorld, April 2012
Sentiment Analysis – Social Network Analysis
Companies used to
rely on warranty cards
and the like to collect
demographic data.
People either did not
fill out the forms or did
so with inaccurate
information.
Sentiment Analysis – Social Network Analysis
People are much more likely to be truthful when talking to their friends.
Sentiment Analysis – Social Network Analysis
This person – and
20 of their friends
– are talking about
the NFL.
This person
is a runner
Someone
likes Kindle
Someone is
current with
pop music
Sentiment Analysis – Social Network Analysis
Even Where You Least Expect It.
You Might Be Thinking Something Like “My Customer Will Never Use Social
Media For Anything I Care About. No Sergeant Is Ever Going To Tweet “The Straps
On This New Rucksack Are So Comfortable!!!”
Sentiment Analysis – Social Network Analysis
Internal Social Networking At Customer Sites
• Oracle already uses an internal social network to facilitate work.
• The US Military is beginning to explore a similar type of environment.
• It is not unreasonable to plan for the DoD installing a network on base; your
company could incorporate feedback from end users into design decisions.
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Apple Released iOS6 with their
own version of Maps. It has had
some issues, to put it mildly.
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Over half of all trades in the US are initiated by a computer algorithm.
Source: Planet Money (NPR) Aug 2012
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
People started to
tweet about the
maps problem, and
it went viral (to the
point that someone
created a Tumblr
blog to make fun
of Apple’s fiasco).
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
As the Twitter stream started to peak, Apple’s
stock price took a short dip. I believe it likely
that automatic trading algorithms started to
sell off Apple based on the negative sentiment
analysis from Twitter and Facebook.
Natural Language Processing
Big, Huge, Blooming, Ample, Blimp, Gigantic, Abundant, Broad, Bulky, Capacious,
Colossal, Comprehensive, Copious, Enormous, Excessive, Exorbitant, Extensive,
Extravagant, Full, Generous, Giant, Goodly, Grand, Grandiose, Great, Hefty,
Humongous, Immeasurable, Immense, Jumbo, Gargantuan, Massive, Monumental,
Mountainous, Plentiful, Populous, Roomy, Sizable, Spacious, Stupendous,
Substantial, Super, Sweeping, Vast, Voluminous, Whopping, Wide, Ginormous,
Mongo, Badonka, Booku, Doozy
Natural Language Processing
[The same word cloud as above, resolving to a single concept:]
Large
Natural Language Processing
Anticipate Customer Need
Natural Language Processing
React To Competitor’s Missteps
Natural Language Processing
Cultural Fit For Hires
As of Apr 22, there were 724 Hadoop
openings in the DC area. There will be
hundreds – if not thousands – of applicants
for each position. How can you determine
who is the most appropriate candidate, not
just technically, but culturally?
Natural Language Processing
Cultural Fit?
A good way to think of cultural fit is
the “airport test.” If you’re thinking
of hiring someone and you had to sit
with them in an airport for a few
hours because of a delayed flight,
would that make you happy? Or
would you cringe at the thought of
hours of forced conversation?
Natural Language Processing
Analyze Their Writings For Cultural Fit
Go beyond simple keyword searches to find out more about the person.
Regardless of what their resume says, language analysis can reveal details about
where they grew up and where they experienced their formative years.
Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or
“hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words
are synonyms, they carry cultural connotations with them. Find candidates with
the same markers as your existing team for a more cohesive unit.
Natural Language Processing
Analyze Their Writings For Cultural Fit
IT Optimization
IT Optimization – Enabling The Environment
I’m running
out of
supplies!
I’m overheating!
Everything
Is Fine.
Wheel 21
is out of
alignment.
I’m 42.4%
full.
IT Optimization – Enabling The Shop Floor
A More Specific Example
I’m 42.4%
full.
IT Optimization – Enabling The Shop Floor
Make The Trash Smart
We can make the trash bins “smart” by
putting a wifi-enabled scale beneath each
bin and using that to determine when the
bins are reaching capacity.
As of now, the custodian has to check each bin to see if
it is full. With a “smart” bin, the custodian can check his
smart phone and see what does and does not need to be done.
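As a toy sketch of the calculation involved (the tare and full weights are invented calibration constants, chosen so the example bin reports 42.4% full, matching the callout above):

public class SmartBin {

  private static final double TARE_KG = 12.0; // empty bin + platform (assumed)
  private static final double FULL_KG = 52.0; // weight at capacity (assumed)

  /** Returns the fill level as a fraction between 0 and 1. */
  public static double fillLevel(double measuredKg) {
    double level = (measuredKg - TARE_KG) / (FULL_KG - TARE_KG);
    return Math.max(0.0, Math.min(1.0, level));
  }

  public static void main(String[] args) {
    System.out.printf("%.1f%% full%n", 100 * fillLevel(28.96)); // prints 42.4% full
  }
}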
IT Optimization – Enabling The Shop Floor
Cut Down On Clean Up Labor
More importantly, we can now focus on what is happening
to the bins and how they are being used. For example, we
may find outliers where one bin is filling much faster than
all of the others.
IT Optimization – Enabling The Shop Floor
Cut Down On Clean Up Labor
“Data Exhaust”
We can drill into why that bin is filling faster, leverage
the Six Sigma efficiency processes already in place
and improve the overall performance of the line.
IT Optimization – Enabling The Shop Floor
Drilling Into Waste Production
IT Optimization – Classify Legacy Data
A customer can use a machine learning process
to take unknown data and sort it into useful data
elements. For example, a retail car part company
might use this process to sort photos – is that
circle a steering wheel, a hubcap or a tire?
So, All We Need Is Hadoop, Right?
Hadoop is amazing at processing, but lacks a number of features found in
traditional RDBMS platforms (like, say, Oracle):
• Security
• Ad-hoc query support
• SQL support
• Readily available technical resources
Then How Do We Fix Those Problems?
In general, do the data crunching in Hadoop, then import the results into a system
like Oracle for more traditional BI analysis.
Oracle’s Big Data Appliance
Oracle’s Big Data Appliance
In Depth
Big Data Appliance
The Specs Of The Machine
Hardware:
• 18 Compute/Storage Nodes
  • 2 six-core Intel processors
  • 48G Memory (up to 144G)
  • 12 x 3TB SAS Disks
• 3 InfiniBand Switches
• Ethernet Switch, KVM, PDU
• 42U rack
Software:
• Oracle Linux
• Java Virtual Machine
• Cloudera Hadoop Distribution
• R (statistical programming language)
• Oracle NoSQL Database
Environmental:
• 12.25 kVA (12.0 kW) Power Draw
• 41k BTU/hr (42k kJ/hr) Cooling
• 1886 CFM Airflow
Totals: 216 Cores, 864G RAM (2.5T Max), 648T Storage
Big Data Appliance
The Cloudera Distribution
The Analytics Evolution
What Is Happening In The Industry
[Chart: analytics capabilities plotted by degree of complexity against competitive
advantage; some organizations are partway up the curve, with growing investment
at the higher stages.]
Standard Reporting: What Happened?
Ad Hoc Reporting: How Many, How Often, Where?
Query/Drill Down: What Exactly Is The Problem?
Alerts: What Actions Are Needed?
Simulation: What Could Happen…?
Forecasting: What If These Trends Continue?
Predictive Modeling: What Will Happen Next If…?
Optimization: How Can We Achieve The Best Outcome?
Stochastic Optimization: How Can We Achieve The Best Outcome, Including The Effects Of Variability?
Descriptive: Analyzing Data To Determine What Has Happened Or Is Happening Now
Predictive: Examining Data To Discover Whether Trends Will Continue Into The Future
Prescriptive: Studying Data To Evaluate The Best Course Of Action For The Future
Source: Competing On Analytics: The New Science Of Winning; Thomas Davenport & Jeanne Harris, 2007
The Analytics Evolution
Where Big Data Fits On This Model
[The same chart as above, with Big Data positioned toward the upper end of the
curve: the predictive and prescriptive stages, where both complexity and
competitive advantage are highest.]
Typical Stages In Analytics
Choosing The Right Solutions For The Right Data Needs
[Chart: the same analytics stages, with "growing investment here" callouts at two
points on the curve.]
The Data Warehouse Evolution
What Are Oracle’s Customers Deploying Today?
[Chart: increasing business value plotted against information architecture maturity.]
• Data Marts: what happened yesterday
• Consolidated Data / Data Warehouse: what is happening today (most are here!)
• Data & Analytics Diversity / Big Data: what could happen tomorrow (some are here; growing investment here)
What Is Your Big Data Strategy?
Where Does Your Data Originate?
[Cycle diagram: ACQUIRE, ORGANIZE, ANALYZE, DECIDE, with ACQUIRE highlighted.]
How will you acquire live streams of unstructured data?
What Is Your Big Data Strategy?
What Do You Do With It Once You Have It?
[Same cycle diagram, with ORGANIZE highlighted.]
How will you organize big data so it can be integrated into your data center?
What Is Your Big Data Strategy?
How Do You Manipulate It Once You Have It?
[Same cycle diagram, with ANALYZE highlighted.]
What skill sets and tools will you use to analyze big data?
What Is Your Big Data Strategy?
What Do You Do After You’re Done?
[Same cycle diagram, with DECIDE highlighted.]
How will you share the analysis in real time?
Big Data In Action
Make Better Decisions Using Big Data
[Cycle diagram: ACQUIRE, ORGANIZE, ANALYZE, DECIDE.]
The Big Data Development Process
[Diagram contrasting Traditional BI (driven by change requests against a fixed
hypothesis) with Big Data (an iterative loop: identify data sources, explore
results, reduce ambiguity, refine models, improved hypothesis).]
Oracle’s Big Data Solution
[Pipeline diagram: Acquire, Organize & Discover, Analyze, Decide. Oracle Big Data
Appliance on the acquire/organize side, Oracle Exadata and Oracle Exalytics on the
analyze side, Oracle Real-Time Decisions on the decide side, with Endeca
Information Discovery for discovery, all linked by InfiniBand.]
Oracle’s Big Data Solution
Pre-Built And Optimized Out Of The Box
[Chart: performance achievement over time. A custom configuration takes months:
assembling dozens of components, multi-vendor finger pointing, testing and
debugging failure modes, then measuring, diagnosing, tuning and reconfiguring.
The pre-built appliance reaches 100% in days.]
Big Data Appliance Performance Comparisons
• 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs
• 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
Oracle Big Data Connectors
• Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database
• Oracle Direct Connector for HDFS: access data directly in HDFS using external tables
• ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH
• Oracle R Connector for Hadoop
[Diagram: the BDA connected to Oracle Exadata over InfiniBand, loading results
into Oracle Database at 12TB/hour.]
• The R open source environment for statistical computing and
graphics is growing in popularity for advanced analytics
• Widely taught in colleges and universities
• Popular among millions of statisticians
• R programs can run unchanged against
data residing in the Oracle Database
• Reduce latency
• Improve data security
• Augment results with powerful graphics
• Integrate R results and graphics with
OBIEE dashboards
Oracle Database Advanced Analytics Option
Oracle R Enterprise
Oracle Database Advanced Analytics Option
Oracle Data Mining
Problem, algorithm, and applicability:
• Classification: Logistic Regression (GLM), classical statistical technique; Decision Trees, popular / rules / transparency; Naïve Bayes, embedded app; Support Vector Machine, wide / narrow data / text
• Regression: Multiple Regression (GLM), classical statistical technique; Support Vector Machine, wide / narrow data / text
• Anomaly Detection: One Class Support Vector Machine (SVM), lack of examples
• Attribute Importance: Minimum Description Length (MDL), attribute reduction, identify useful data, reduce data noise
• Association Rules: Apriori, market basket analysis, link analysis
• Clustering: Hierarchical K-Means, Hierarchical O-Cluster, product grouping, text mining, gene and protein analysis
• Feature Extraction: Non-Negative Matrix Factorization (NMF), text analysis, feature reduction
• Ranking functions
• rank, dense_rank, cume_dist, percent_rank,
ntile
• Window Aggregate functions (moving and cumulative)
• Avg, sum, min, max, count, variance, stddev,
first_value, last_value
• LAG/LEAD functions
• Direct inter-row reference using offsets
• Reporting Aggregate functions
• Sum, avg, min, max, variance, stddev, count,
ratio_to_report
• Statistical Aggregates
• Correlation, linear regression family, covariance
• Linear regression
• Fitting of an ordinary-least-squares regression
line to a set of number pairs.
• Frequently combined with the COVAR_POP,
COVAR_SAMP, and CORR functions
Descriptive Statistics
• DBMS_STAT_FUNCS: summarizes numerical
columns of a table and returns count, min, max,
range, mean, median, stats_mode, variance,
standard deviation, quantile values, +/- n sigma
values, top/bottom 5 values
• Correlations
• Pearson’s correlation coefficients, Spearman's
and Kendall's (both nonparametric).
• Cross Tabs
• Enhanced with % statistics: chi squared, phi
coefficient, Cramer's V, contingency coefficient,
Cohen's kappa
• Hypothesis Testing
• Student t-test , F-test, Binomial test, Wilcoxon
Signed Ranks test, Chi-square, Mann Whitney
test, Kolmogorov-Smirnov test, One-way
ANOVA
• Distribution Fitting
• Kolmogorov-Smirnov Test, Anderson-Darling
Test, Chi-Squared Test, Normal, Uniform,
Weibull, Exponential
Oracle Database SQL Analytics
Included In The Oracle Database
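As a hedged illustration of calling one of these analytic functions from Java via JDBC (the connection string, credentials, and the trades table are all invented for the example; the Oracle JDBC driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LagExample {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger"); // assumed
    Statement stmt = conn.createStatement();
    // LAG() compares each price with the previous one per symbol,
    // without a self-join.
    ResultSet rs = stmt.executeQuery(
        "SELECT symbol, trade_time, price, "
      + "       LAG(price, 1) OVER (PARTITION BY symbol "
      + "                           ORDER BY trade_time) AS prev_price "
      + "FROM trades");
    while (rs.next()) {
      System.out.printf("%s %s (prev %s)%n",
          rs.getString("symbol"), rs.getString("price"), rs.getString("prev_price"));
    }
    conn.close();
  }
}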
Oracle Big Data Ecosystem
[Diagram: the ACQUIRE, ORGANIZE, ANALYZE, DECIDE cycle, extended with
DISCOVER, VISUALIZE and STREAM.]
Having Said That…
Big Data Is More Than Just Hardware & Software
The Math Is The Hard Part
This is a very simple equation for a Fourier transformation of a wave kernel at 0.
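The equation itself was an image and did not survive extraction. As a stand-in (an assumption, not necessarily the slide's exact formula), the Fourier transform of a kernel K and its value at 0 are:

\hat{K}(\xi) = \int_{-\infty}^{\infty} K(x)\, e^{-2\pi i x \xi}\, dx,
\qquad
\hat{K}(0) = \int_{-\infty}^{\infty} K(x)\, dx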
The Math Is The Hard Part
This is a photograph of a data scientist’s white board at Bit.ly
Data Scientists Are Expensive And Hard To Find
• Typical Job Description:
“Ph.D. in data mining, machine
learning, statistical analysis,
applied mathematics or equivalent;
three-plus years hands-on practical
experience with large-scale data
analysis; and fluency in analytical
tools such as SAS, R, etc.”
• Looking For “baIT”:
• Business
• Analytics
• IT
All in the same person
These people exist, but are
very expensive.
Growing Your Own Data Scientist
• Business Acumen
• Familiarity/Likes Computational
Linear Algebra / Matrix Analysis
• Interest in SAS, R, Matlab
• Familiarity/Likes Lisp
Big Data Cannot Do Everything
Big Data Cannot Do Everything
Big Data Is A Great Tool
But Not A Silver Bullet
You would never run a POS system on Hadoop; Hadoop is far too batch oriented
to support this type of activity. Similarly, random access of data does not work
well in the Hadoop world.
When Big Data? When Relational?
[Chart: size of data (rough measure).]
When Big Data? When Relational?
RDBMS vs Hadoop: A Comparison
RDBMS | Hadoop
• Fully SQL compliant | Helper languages (Hive, Pig)
• Many RDBMS vendors extend SQL in useful ways | Very useful but not as robust as SQL
• Optimized for query performance; tunable (input vs output, long running queries, etc.) | Optimized for analytics operations, specifically those of a statistical nature
• Armies of trained and available resources | Resources are hard to find and expensive when found
• Requires more specialized hardware at performance extremes | Designed to work on commodity hardware at all levels
• OLTP, OLAP, ODS, DSS, hybrid; more general purpose | Basically only for analytics
• Expensive to implement over wide geographical distribution | Designed to span data centers
• Very mature technology | Very new technology
• Real time or batch processing | Batch operations only
• Nontrivial licensing costs | Open source ("free"-ish)
• About 2 PB as largest commercial cluster (a telecom company) | 100+ PB as largest commercial cluster (Facebook, as of March 2013)
• Ad hoc operations common, if not encouraged | Ad hoc operations possible with HBase, but nontrivial
It Is Not An “Either/Or” Choice
RDBMS and Hadoop Each Solve Different Problems
Where Are Things Heading?
A Quick Recap
GFS
Presented To The
Public In 2003
MapReduce
Presented To The
Public in 2004
Hadoop Is Already Dead?
Yes… Sort Of*
* = for a specific set of problems…
The New Stuff In Overview

Colossus (pub. n/a). Use: GFS for realtime systems. Open source: no.

Caffeine (2009). Use: real time search. What it does: incremental updates of analytics and indexes in real time. Impact: estimated to be 100x faster than Hadoop. Open source: no.

Pregel (2009). Use: social graphs, location graphs, learning & discovery, network optimization, the Internet of Things. What it does: analyzes next-neighbor problems. Impact: estimated to handle billions of nodes & trillions of edges. Open source: alpha (Apache Giraph).

Percolator (2010). Use: large scale incremental processing using distributed transactions. What it does: makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Impact: data in the environment remains much more up to date with less effort.

Dremel (2010). Use: SQL-like language for queries on the above technologies. What it does: interactive, ad hoc queries over trillion-row tables in subsecond time; works against Caffeine / Pregel / Colossus without requiring MapReduce. Impact: easier for analysts and non-technical people to be productive (i.e. not as many data scientists are required). Open source: very alpha (Apache Drill, Incubator).

Spanner (Oct 2012). Use: fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe. What it does: uses GPS sensors and atomic clocks to keep the clocks of servers in sync regardless of location or other factors. Impact: transactional support on a global scale at a fraction of the cost, and where (many times) not technically possible otherwise. Open source: no, and unlikely to ever be.

Storm (2012). Use: real time Hadoop-like processing (not from Google; from Twitter). What it does: the power of Hadoop in real time. Impact: eliminates the requirement for batch processing. Open source: yes (beta*).
One Last Thing
Hadoop Is Just The Start Of The Equation
One Last Thing
Hadoop For Analytics And Determining Boundary Conditions
Is Just The Start Of The Equation
Use Hadoop to analyze all of the data in your environment and then generate
mathematical models from that data.
One Last Thing
Acting On Boundary Conditions
Once the model has been built (and vetted), it can be used to resolve events in
real time, thereby getting around the batch bottleneck of Hadoop.
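A toy sketch of that pattern (all coefficients and the threshold are invented): the model is trained offline in Hadoop, but scoring an individual event reduces to cheap arithmetic that can run in real time.

public class BoundaryCheck {

  // Linear score learned offline in Hadoop (assumed):
  // score = w0 + w1*x1 + w2*x2
  private static final double W0 = -1.2, W1 = 0.8, W2 = 2.1;
  private static final double THRESHOLD = 3.0; // vetted boundary (assumed)

  /** True if the event crosses the boundary and needs action now. */
  public static boolean outOfBounds(double x1, double x2) {
    return W0 + W1 * x1 + W2 * x2 > THRESHOLD;
  }

  public static void main(String[] args) {
    System.out.println(outOfBounds(2.0, 2.0)); // score 4.6 > 3.0, prints true
  }
}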
No Really. One More Last Thing
Who Is Hilary Mason?
• Chief Data Scientist At bit.ly
• One of the major innovators
in data science
• Scary smart and fun to be around
• A heck of a teacher, to boot
Photo credit: Pinar Ozger, Strata 2011
Interpret
The end goal of any Big Data solution is to provide data which can be interpreted
into meaningful decisions. But, before we can interpret the data, we must first…
The Mason 5 Step Process For Big Data
In Reverse Order
Model
Model the data into a useful paradigm which will allow us to make sense of any
new data based on past experiences. But, before we can model the data, we must
first….
The Mason 5 Step Process For Big Data
In Reverse Order
Explore
Explore the data we have and look for meaningful patterns from which we could
extract a useful model. But, before we can look through the data for meaningful
patterns, we first have to…
The Mason 5 Step Process For Big Data
In Reverse Order
Scrub
Clean and clarify the data we have to make it as neat as possible and easier to
manipulate. But, before we can clean the data, we have to start with…
The Mason 5 Step Process For Big Data
In Reverse Order
Obtain
Obtaining as much data as possible. Advances in technology – coupled with
Moore’s law – means that DASD is very, very cheap these days. So much so that
you may as well hang on to as much data as you can, because you never know
when it will prove useful.
The Mason 5 Step Process For Big Data
In Reverse Order
Questions?
Some Resources
White Papers:
• An Architect’s Guide To Big Data
• Big Data For The Enterprise
• Big Data Gets Real Time
• Build vs. Buy For Hadoop
This Deck:
Slideshare
Web Resources:
• Oracle Big Data
• Oracle Big Data Appliance
• Oracle Big Data Connectors
Me:
charles dot scyphers at oracle dot com
@scyphers (twitter)
153

Weitere ähnliche Inhalte

Was ist angesagt?

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profilingShailja Khurana
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceDenodo
 
Big Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesBig Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesSlideTeam
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...HostedbyConfluent
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for DinnerKent Graziano
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality DashboardsWilliam Sharp
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
The data quality challenge
The data quality challengeThe data quality challenge
The data quality challengeLenia Miltiadous
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes Minio
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance WorkshopCCG
 
Classification of data mart
Classification of data martClassification of data mart
Classification of data martkhush_boo31
 
Power BI : A Detailed Discussion
Power BI : A Detailed DiscussionPower BI : A Detailed Discussion
Power BI : A Detailed DiscussionSwatiTripathi44
 

Was ist angesagt? (20)

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and Governance
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Big Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesBig Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation Slides
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality Dashboards
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
The data quality challenge
The data quality challengeThe data quality challenge
The data quality challenge
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance Workshop
 
Classification of data mart
Classification of data martClassification of data mart
Classification of data mart
 
Power BI : A Detailed Discussion
Power BI : A Detailed DiscussionPower BI : A Detailed Discussion
Power BI : A Detailed Discussion
 

Andere mochten auch

Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewSivashankar Ganapathy
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionDataStax
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
Top 10 campus interview questions with answers
Top 10 campus interview questions with answersTop 10 campus interview questions with answers
Top 10 campus interview questions with answerstoddharry267
 
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsCambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsSmart Villages
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in HueRomain Rigaux
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Basic erp concepts
Basic erp conceptsBasic erp concepts
Basic erp conceptsmukki4u
 

Andere mochten auch (20)

Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
Top 10 campus interview questions with answers
Top 10 campus interview questions with answersTop 10 campus interview questions with answers
Top 10 campus interview questions with answers
 
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsCambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in Hue
 
Physical features of canada
Physical features of canadaPhysical features of canada
Physical features of canada
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Basic erp concepts
Basic erp conceptsBasic erp concepts
Basic erp concepts
 

Ähnlich wie Big Data: An Overview

Ähnlich wie Big Data: An Overview (20)

Anju
AnjuAnju
Anju
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big data
Big dataBig data
Big data
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 

Kürzlich hochgeladen

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Kürzlich hochgeladen (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Big Data: An Overview

  • 1. <Insert Picture Here> Big Data: An Overview
  • 2. What Is Big Data?
  • 3. What Is Big Data? • Big Data is not simply a huge pile of information • A good starting place is the following paraphrase: “Big Data describes datasets so large they become awkward to manage with traditional database tools at a reasonable cost.”
  • 4. VOLUME VELOCITY VARIETY VALUE SOCIAL BLOG SMART METER 101100101001 001001101010 101011100101 010100100101 A Breakdown Of What Makes Up Big Data
  • 5. Data Growth Explosion • 1 GB of stored content can create 1 PB of data in transit Data & Image courtesy of IDC • The totality of stored data is doubling about every 2 years • This meant 130 EB in 2005 • 1227 EB in 2010 (1.19 ZB) • 7910 EB in 2015 (7.72 ZB)
  • 6. 2005 20152010 • More than 90% is unstructured data and managed outside Relational Database • Approx. 500 quadrillion files • Quantity doubles every 2 years 1.8 trillion gigabytes of data was created in 2011… 10,000 0 GBofData (INBILLIONS) STRUCTURED DATA (MANAGED INSIDE RELATIONAL DATABASE) UNSTRUCTURED DATA (MANAGED OUTSIDE RELATIONAL DATABASE) Growth Of Big Data Harnessing Insight From Big Data Is Now Possible
  • 7. So, Just Any Dataset? • Big Data Can Work With Any Dataset • However, Big Data Shines When Dealing With Unstructured Data
  • 8. Structured Vs. Unstructured Structured Data is any data to which a pre-defined data model can be applied in an automated fashion, producing a semantically meaningful result without referencing outside elements. In other words, if you can apply some template to a data set and have it instantly make sense to the average person, it’s structured. If you can’t, it’s unstructured.
  • 9. Really? Only Two Categories? Okay, there’s also semi-structured data. Which basically means after the template is applied, some of the result will make sense and some will not. XML is a classic example of this kind of data.
  • 10. Formal Definitions Of Data Types Structured Data: Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases such as relational transactional ones where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all traditional database management systems (DBMS) are designed for structured data. Semi-Structured Data: Semi-structured data are intermediate between structured and unstructured data, wherein “tags” or “structure” are associated or embedded within unstructured data. Semi-structured data are organized in semantic entities, similar entities are grouped together, entities in the same group may not have the same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of the same attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML). Unstructured Data: Data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at time of the data ingest.
  • 11. Informal Definitions Of Data Types Structured Data: Fits neatly into a relational structure. Semi-Structured Data: Think documents or EDI. Unstructured Data: Can be anything. Text Video Sound Images
  • 12. Tools For Dealing With Semi/Un-Structured Data
  • 13. What Is Hadoop? “The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • 14. The Paradigm Shift Of Hadoop Centralized Processing Doesn’t Work Moving data to a central server for processing (like, say, Informatica) cannot scale. You can only buy a machine so big.
  • 15. The Paradigm Shift Of Hadoop Bandwidth Is The Bottleneck • Moving data around is expensive. • Bandwidth $$ > CPU $$
  • 16. The Paradigm Shift Of Hadoop Process The Data Locally Where It Lives
  • 17. The Paradigm Shift Of Hadoop Then Return Only The Results • You move much less data around this way • You also gain the advantage of greater parallel processing
  • 18. Where Did Hadoop Originate? GFS Presented To The Public In 2003 MapReduce Presented To The Public in 2004
  • 19. Spreading Out From Google Doug Cutting was working on “Nutch”, Yahoo’s next generation search engine, when he read the Google papers and reverse engineered the technology. The elephant was his son’s toy, named….
  • 20. Going Open Source HDFS MapReduce Released To Public 2006
  • 21. A Bit More In Depth, Then A Lot More In Depth HDFS MapReduce HDFS is primarily a data redundancy solution. MapReduce is where the work gets done.
  • 22. How Hadoop Works Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm
  • 23. GFS / HDFS HDFS Distributes Files At The Block Level Across Multiple Commodity Devices For Redundancy On The Cheap Not RAID: Distribution Is Across Machines/Racks
  • 24. Data Distribution By Default, HDFS Writes Into Blocks & The Blocks Are Distributed x3
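To make that distribution concrete, here is a toy Python sketch of HDFS-style block placement. This is illustrative only, not actual HDFS code; the block size, rack layout, and placement rule below are simplifying assumptions (real HDFS prefers the writer's own node for the first replica, then places two replicas on a single remote rack).

    # Toy illustration of HDFS-style block splitting and 3x placement.
    # Not real HDFS code; sizes and topology are assumptions.
    BLOCK_SIZE = 128 * 1024 * 1024   # bytes; the HDFS block size is configurable
    REPLICATION = 3

    racks = {
        "rack1": ["node1", "node2", "node3"],
        "rack2": ["node4", "node5", "node6"],
    }

    def block_count(file_size_bytes):
        # A file occupies ceil(size / block_size) blocks.
        return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

    def place_block(block_id):
        # One replica on the "local" rack, two on a single remote rack.
        local = racks["rack1"][block_id % 3]
        remote = racks["rack2"]
        return [local, remote[block_id % 3], remote[(block_id + 1) % 3]]

    for b in range(block_count(300 * 1024 * 1024)):   # a 300 MB file -> 3 blocks
        print("block", b, "->", place_block(b))

Losing any one node, or even a whole rack, still leaves at least one copy of every block, which is the point of the 3x distribution.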
  • 25. WORM Data Is Written Once & (Basically) Never Erased
  • 26. How Is The Data Manipulated? Not Random Reads Data Is Read From The Stream In Large, Contiguous Chunks
  • 27. The Key To Hadoop Is MapReduce In a Shared Nothing architecture, programmers must break the work down into distinct segments that are: • Autonomous • Digestible • Independently processable • Written with the expectation of incipient failure at every step
  • 28. A Canonical MapReduce Example Image Credit: Martijn van Groningen
  • 29. The data arrives into the system. A MapReduce Example The Input
  • 30. The data is moved into the HDFS system, divided into blocks, each of which is copied multiple times for redundancy. A MapReduce Example Splitting The Input Into Chunks
  • 31. The Mapper picks up a chunk for processing. The MR Framework ensures only one mapper will be assigned to a given chunk A MapReduce Example Mapping The Chunks
  • 32. In this case, the Mapper emits a word with the number of times it was found. A MapReduce Example Mapping The Chunks
  • 33. The Shuffler can do a rough sort of like items (optional) A MapReduce Example A Shuffle Sort
  • 34. The Reducer combines the Mapper’s output into a total A MapReduce Example Reducing The Emissions
  • 35. The job completes with a numeric index of words found within the original input. A MapReduce Example The Output
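The walkthrough above compresses into a few lines of Python. This is a single-process sketch of the map, shuffle, and reduce phases, not cluster code; on a real cluster the framework performs the sort/shuffle and runs many mappers and reducers in parallel.

    # Single-process sketch of the word-count flow described above.
    from itertools import groupby

    def mapper(line):
        for word in line.split():
            yield (word.lower(), 1)          # emit (word, 1) per occurrence

    def reducer(word, counts):
        return (word, sum(counts))           # total the emissions per word

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    pairs = [kv for line in lines for kv in mapper(line)]   # map phase
    pairs.sort(key=lambda kv: kv[0])                        # shuffle/sort phase

    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(reducer(word, (n for _, n in group)))         # reduce phase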
  • 36. MapReduce Is Not Only Hadoop http://blogs.oracle.com/datawarehousing/2009/10/in-database_map-reduce.html MapReduce is a programming paradigm, not a language. You can do MapReduce within an Oracle database; it’s just usually not a good idea. A large MapReduce job would quickly exhaust the SGA of any Oracle environment.
  • 37. Problem Solving With MapReduce • The key feature is the Shared Nothing architecture. • Any MapReduce program has to understand and leverage that architecture. • This is usually a paradigm shift for most programmers and one that many cannot overcome.
  • 38. Programming With MapReduce • HDFS & MapReduce Is Written In Java 1. package org.myorg; 2. 3. import java.io.*; 4. import java.util.*; 5. 6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.filecache.DistributedCache; 8. import org.apache.hadoop.conf.*; 9. import org.apache.hadoop.io.*; 10. import org.apache.hadoop.mapreduce.*; 11. import org.apache.hadoop.mapreduce.lib.input.*; 12. import org.apache.hadoop.mapreduce.lib.output.*; 13. import org.apache.hadoop.util.*; 14. 15. public class WordCount2 extends Configured implements Tool { 16. 17. public static class Map 18. extends Mapper<LongWritable, Text, Text, IntWritable> { 19. 20. static enum Counters { INPUT_WORDS } 21. 22. private final static IntWritable one = new IntWritable(1); 23. private Text word = new Text(); 24. 25. private boolean caseSensitive = true; 26. private Set<String> patternsToSkip = new HashSet<String>(); 27. 28. private long numRecords = 0; 29. private String inputFile; 30. 31. public void setup(Context context) { 32. Configuration conf = context.getConfiguration(); 33. caseSensitive = conf.getBoolean("wordcount.case.sensitive", true); 34. inputFile = conf.get("mapreduce.map.input.file"); 35. 36. if (conf.getBoolean("wordcount.skip.patterns", false)) { 37. Path[] patternsFiles = new Path[0]; 38. try { 39. patternsFiles = DistributedCache.getLocalCacheFiles(conf); 40. } catch (IOException ioe) { 41. System.err.println("Caught exception while getting cached files: " 42. + StringUtils.stringifyException(ioe)); 43. } 44. for (Path patternsFile : patternsFiles) { 45. parseSkipFile(patternsFile); 46. } 47. } 48. } 49. 50. private void parseSkipFile(Path patternsFile) { 51. try { ,,,,,, • Will Work With Any Language Supporting STDIN/STDOUT • Lots Of People Using Python, R, Matlab, Perl, Ruby et al • Is Still Very Immature & Requires Low Level Coding
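Since anything that reads STDIN and writes STDOUT will do, the same word count can be written as a pair of Hadoop Streaming scripts. A minimal Python sketch; the location of the streaming jar varies by distribution, so the run command in the final comment is an assumption.

    #!/usr/bin/env python
    # mapper.py: read raw text on STDIN, emit one "word<TAB>1" line per word.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py: streaming delivers mapper output sorted by key, so equal
    # words arrive contiguously and a running total is enough.
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

    # Run (jar location varies by distribution):
    # hadoop jar hadoop-streaming.jar -input /in -output /out \
    #   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py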
  • 39. What Are Some Big Data Use Cases? • Inverse Frequency / Weighting • Co-Occurrence • Behavioral Discovery • “The Internet Of Things” • Classification / Machine Learning • Sorting • Indexing • Data Intake • Language Processing Basically, Clustering And Targeting
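To make the first item on that list concrete: inverse frequency weighting scores a term higher the more often it appears in one document and the rarer it is across the whole collection. The standard TF-IDF formulation, with N documents in collection D, is:

    \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\left|\{\, d' \in D : t \in d' \,\}\right|}

A term that appears in every document gets weight \log(N/N) = 0, which is why common words like “the” carry no signal.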
  • 41. Co-Occurrence Fundamental Data Mining – People Who Did This Also Do That
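A minimal sketch of the counting behind “people who did this also do that”, assuming each transaction is just a list of item ids (the items here are invented):

    # Count how often item pairs occur together across transactions.
    from itertools import combinations
    from collections import Counter

    transactions = [
        ["brakes", "rotors", "pads"],
        ["pads", "rotors"],
        ["wipers", "pads"],
    ]

    pair_counts = Counter()
    for basket in transactions:
        # sorted() makes (a, b) and (b, a) count as the same pair
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1

    # The most frequent pairs are the "also bought" recommendations.
    print(pair_counts.most_common(3))

On a cluster the same logic becomes a mapper that emits the pairs and a reducer that sums them, exactly like the word count earlier.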
  • 43. Behavioral Discovery “The best minds of my generation are thinking about how to make people click ads.” Jeff Hammerbacher, Former Research Scientist at Facebook Currently Chief Scientist at Cloudera
  • 44. “The Internet Of Things” “Data Exhaust”
  • 46. Sorting Current Record Holder: •10PB sort •8000 nodes •6 hours, 27 minutes •September 7, 2011 Current Record Holder: •1.5 TB •2103 nodes •59 seconds •February 26, 2013
  • 48. Data Intake Hadoop can be used as a massive parallel ETL tool; Flume to ingest files, MapReduce to transform them.
  • 49. Language Processing Includes Sentiment Analysis How can you infer meaning from someone’s words? Does that smile mean happy? Sarcastic? Bemusement? Anticipation?
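The simplest possible starting point is a lexicon lookup. A toy sketch with invented word lists; it shows the mechanics, and everything it ignores (negation, sarcasm, context) is precisely why sentiment is hard:

    # Toy lexicon-based sentiment scorer. Real systems need far more.
    POSITIVE = {"comfortable", "great", "love", "happy"}
    NEGATIVE = {"broken", "hate", "awful", "sad"}

    def score(text):
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(score("the straps on this new rucksack are so comfortable"))  # 1
    print(score("the maps app is awful and broken"))                    # -2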
  • 50. How Can Big Data Help You? 9 Use Cases: • Natural Language Processing • Internal Misconduct • Fraud Detection • Marketing • Risk Management • Compliance / Regulatory Reporting • Portfolio Management • IT Optimization • Predictive Analysis
  • 52. Predictive Analysis Think data mining on steroids. One of the main benefits Hadoop brings to the enterprise is the ability to analyze every piece of data, not just a statistical sample or an aggregated form of the entire datastream.
  • 53. Risk Management Photo credit: Guinness World Records (88 catches, by the way)
  • 54. When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from more sensitive positions. Risk Management Behavioral Analysis
  • 55. Fraud Detection “Dear Company: I hurt myself working on the line and now I can’t walk without a cane.” Then he tells his Facebook friends he’s going to his house in Belize for some waterskiing.
  • 56. Internal Misconduct One of the reasons why the FBI was able to close in on the identities of the people involved is that they geolocated the sender and recipient of the Gmail emails and connected those IP addresses with known users on those same IP addresses.
  • 57. Portfolio Management • Evaluate portfolio performance on existing holdings • Evaluate portfolio for future activities • High speed arbitrage trading • Simply keeping up: "Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row” 10,000 credit card transactions per second Statistics courtesy of ComputerWorld, April 2012
  • 58. Sentiment Analysis – Social Network Analysis Companies used to rely on warranty cards and the like to collect demographic data. People either did not fill out the forms or did so with inaccurate information.
  • 59. Sentiment Analysis – Social Network Analysis People are much more likely to be truthful when talking to their friends.
  • 60. Sentiment Analysis – Social Network Analysis This person – and 20 of their friends – are talking about the NFL. This person is a runner Someone likes Kindle Someone is current with pop music
  • 61. Sentiment Analysis – Social Network Analysis Even Where You Least Expect It. You Might Be Thinking Something Like “My Customer Will Never Use Social Media For Anything I Care About. No Sergeant Is Ever Going To Tweet ‘The Straps On This New Rucksack Are So Comfortable!!!’”
  • 62. Sentiment Analysis – Social Network Analysis Internal Social Networking At Customer Sites • Oracle already uses an internal social network to facilitate work. • The US Military is beginning to explore a similar type of environment. • It is not unreasonable to plan for the DoD installing a network on base; your company could incorporate feedback from end users into design decisions.
  • 63. Sentiment Analysis – Apple iOS6, Maps & Stock Price Apple released iOS6 with their own version of Maps. It has had some issues, to put it mildly. Photo courtesy of http://theamazingios6maps.tumblr.com/
  • 64. Sentiment Analysis – Apple iOS6, Maps & Stock Price Over half of all trades in the US are initiated by a computer algorithm. Source: Planet Money (NPR) Aug 2012
  • 65. Sentiment Analysis – Apple iOS6, Maps & Stock Price Photo courtesy of http://theamazingios6maps.tumblr.com/ People started to tweet about the maps problem, and it went viral (to the point that someone created a Tumblr blog to make fun of Apple’s fiasco).
  • 66. Sentiment Analysis – Apple iOS6, Maps & Stock Price Photo courtesy of http://theamazingios6maps.tumblr.com/ As the Twitter stream started to peak, Apple’s stock price took a short dip. I believe it likely that automatic trading algorithms started to sell off Apple based on the negative sentiment analysis from Twitter and Facebook.
  • 70. Natural Language Processing React To Competitor’s Missteps
  • 71. Natural Language Processing Cultural Fit For Hires As of Apr 22, there were 724 Hadoop openings in the DC area. There will be hundreds – if not thousands – of applicants for each position. How can you determine who is the most appropriate candidate, not just technically, but culturally?
  • 72. Natural Language Processing Cultural Fit? A good way to think of cultural fit is the “airport test.” If you’re thinking of hiring someone and you had to sit with them in an airport for a few hours because of a delayed flight, would that make you happy? Or would you cringe at the thought of hours of forced conversation?
  • 73. Natural Language Processing Analyze Their Writings For Cultural Fit Go beyond simple keyword searches to find out more about the person. Regardless of what their resume says, language analysis can reveal details about where they grew up and where they experienced their formative years.
  • 74. Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or “hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words are synonyms, they carry cultural connotations with them. Find candidates with the same markers as your existing team for a more cohesive unit. Natural Language Processing Analyze Their Writings For Cultural Fit
  • 76. IT Optimization – Enabling The Environment I’m running out of supplies! I’m overheating! Everything Is Fine. Wheel 21 is out of alignment. I’m 42.4% full.
  • 77. IT Optimization – Enabling The Shop Floor A More Specific Example I’m 42.4% full.
  • 78. IT Optimization – Enabling The Shop Floor Make The Trash Smart We can make the trash bins “smart” by putting a wifi enabled scale beneath each bin and using that to determine when the bins are reaching capacity.
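A sketch of the check itself, assuming each scale reports a (bin id, current weight) pair; the bin capacities and the 80% alert threshold are invented for the example:

    # Flag bins approaching capacity from wifi scale readings.
    FULL_KG = {"bin-07": 40.0, "bin-12": 40.0, "bin-31": 55.0}  # hypothetical
    ALERT_AT = 0.80                                             # flag at 80% full

    def bins_needing_service(readings):
        for bin_id, kg in readings:
            fill = kg / FULL_KG[bin_id]
            if fill >= ALERT_AT:
                yield bin_id, fill

    readings = [("bin-07", 17.0), ("bin-12", 34.4), ("bin-31", 53.9)]
    for bin_id, fill in bins_needing_service(readings):
        print("%s is %.0f%% full" % (bin_id, fill * 100))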
  • 79. As of now, the custodian has to check each bin to see if it is full. With a “smart” bin, the custodian can check his smart phone and see what does and does not need to be done. IT Optimization – Enabling The Shop Floor Cut Down On Clean Up Labor
  • 80. More importantly, we can now focus on what is happening to the bins and how they are being used. For example, we may find outliers where one bin is filling much faster than all of the others. IT Optimization – Enabling The Shop Floor Cut Down On Clean Up Labor
  • 81. “Data Exhaust” We can drill into why that bin is filling faster, leverage the Six Sigma efficiency processes already in place and improve the overall performance of the line. IT Optimization – Enabling The Shop Floor Drilling Into Waste Production
  • 82. IT Optimization – Classify Legacy Data A customer can use a machine learning process to take unknown data and sort it into useful data elements. For example, a retail car part company might use this process to sort photos – is that circle a steering wheel, a hubcap or a tire?
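A minimal sketch of such a classifier using scikit-learn, assuming feature vectors have already been extracted from the photos (a real pipeline would compute them from pixels first); the feature names, labels, and values are all invented:

    # Classify part photos from precomputed feature vectors.
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical features: (roundness, spoke_count, rubber_ratio)
    X_train = [[0.90, 3, 0.0], [0.95, 8, 0.1], [0.99, 0, 0.9]]
    y_train = ["steering wheel", "hubcap", "tire"]

    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    print(clf.predict([[0.97, 1, 0.85]]))   # -> ['tire']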
  • 83. So, All We Need Is Hadoop, Right? Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle): • Security • Ad-hoc Query Support • SQL Support • Readily Available Technical Resources
  • 84. Then How Do We Fix Those Problems? In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis.
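One simple way to do that hand-off, sketched in Python with the cx_Oracle driver; the table, file, and connection details are hypothetical, and at scale the Oracle Big Data Connectors described below do this far faster:

    # Load summarized Hadoop output (a small CSV) into Oracle for BI.
    import csv
    import cx_Oracle

    conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl")  # hypothetical
    cur = conn.cursor()

    with open("wordcounts.csv") as f:
        rows = [(word, int(n)) for word, n in csv.reader(f)]

    cur.executemany("INSERT INTO word_counts (word, total) VALUES (:1, :2)", rows)
    conn.commit()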
  • 85. Oracle’s Big Data Appliance
  • 86. Oracle’s Big Data Appliance In Depth
  • 87. Big Data Appliance The Specs Of The Machine Hardware: • 18 Compute/Storage Nodes (216 cores, 864G RAM (2.5T max), 648T storage in total) • 2 six-core Intel processors per node • 48G Memory per node (up to 144G) • 12x3TB SAS Disks per node • 3 InfiniBand Switches • Ethernet Switch, KVM, PDU • 42U Rack Software: • Oracle Linux • Java Virtual Machine • Cloudera Hadoop Distribution • R (statistical programming language) • Oracle NoSQL Database Environmental: • 12.25 kVA (12.0 kW) Power Draw • 41k BTU/hr (42k kJ/hr) Cooling • 1886 CFM Airflow
  • 88. Big Data Appliance The Cloudera Distribution
  • 89. The Analytics Evolution What Is Happening In The Industry [chart: competitive advantage vs. degree of complexity; source: Competing On Analytics: The New Science Of Winning, Thomas Davenport & Jeanne Harris, 2007]
Descriptive (analyzing data to determine what has happened or is happening now): • Standard Reporting: What Happened? • Ad Hoc Reporting: How Many, How Often, Where? • Query/Drill Down: What Exactly Is The Problem?
Predictive (examining data to discover whether trends will continue into the future): • Alerts: What Actions Are Needed? • Simulation: What Could Happen…? • Forecasting: What If These Trends Continue? • Predictive Modeling: What Will Happen Next If…?
Prescriptive (studying data to evaluate the best course of action for the future): • Optimization: How Can We Achieve The Best Outcome? • Stochastic Optimization: How Can We Achieve The Best Outcome, Including The Effects Of Variability?
  • 90. The Analytics Evolution Where Big Data Fits On This Model [same chart as the previous slide, annotated to show that Big Data best fits the upper, more complex tiers: predictive modeling, optimization and stochastic optimization]
  • 91. Typical Stages In Analytics Choosing The Right Solutions For The Right Data Needs [chart: stages running from initial data discovery to predictive analytics; investment is growing at both ends of the spectrum]
  • 92. The Data Warehouse Evolution What Are Oracle’s Customers Deploying Today? [chart: increasing business value vs. information architecture maturity] • Data Marts: what happened yesterday • Consolidated Data Warehouse: what is happening today (most are here!) • Big Data & analytics diversity: what could happen tomorrow (some are here, with growing investment)
  • 93. What Is Your Big Data Strategy? Where Does Your Data Originate? (Acquire → Organize → Analyze → Decide) How will you acquire live streams of unstructured data?
  • 94. What Is Your Big Data Strategy? What Do You Do With It Once You Have It? How will you organize big data so it can be integrated into your data center?
  • 95. What Is Your Big Data Strategy? How Do You Manipulate It Once You Have It? What skill sets and tools will you use to analyze big data?
  • 96. What Is Your Big Data Strategy? What Do You Do After You’re Done? How will you share the analysis in real time?
  • 97. Big Data In Action Make Better Decisions Using Big Data (Acquire → Organize → Analyze → Decide)
  • 98. The Big Data Development Process [diagram] Traditional BI: requirements are known up front and evolve through change requests. Big Data: a fluid loop of Hypothesis → Identify Data Sources → Explore Results → Reduce Ambiguity → Refine Models → Improved Hypothesis.
  • 99. Oracle’s Big Data Solution Acquire: Oracle Big Data Appliance → (InfiniBand) → Organize & Discover: Oracle Exadata, Endeca Information Discovery → (InfiniBand) → Analyze: Oracle Exalytics → Decide: Oracle Real-Time Decisions
  • 100. Oracle’s Big Data Solution Pre-Built And Optimized Out Of The Box [chart: performance achievement over time] A custom configuration takes months to reach 100%: assemble dozens of components, endure multi-vendor finger pointing, test & debug failure modes, then measure, diagnose, tune and reconfigure. The pre-built appliance reaches the same point in days.
  • 101. Big Data Appliance Performance Comparisons • 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs • 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
  • 102. Oracle Big Data Connectors • Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database • Oracle Direct Connector for HDFS: access data directly in HDFS using external tables • ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH • Oracle R Connector for Hadoop • Load results into Oracle Database at 12TB/hour (BDA → InfiniBand → Oracle Exadata)
  • 103. • The R open source environment for statistical computing and graphics is growing in popularity for advanced analytics • Widely taught in colleges and universities • Popular among millions of statisticians • R programs can run unchanged against data residing in the Oracle Database • Reduce latency • Improve data security • Augment results with powerful graphics • Integrate R results and graphics with OBIEE dashboards Oracle Database Advanced Analytics Option Oracle R Enterprise
  • 104. Oracle Database Advanced Analytics Option Oracle Data Mining [table: problem | algorithm | applicability]
Classification | Logistic Regression (GLM), Decision Trees, Naïve Bayes, Support Vector Machine | Classical statistical technique; popular / rules / transparency; embedded app; wide / narrow data / text
Regression | Multiple Regression (GLM), Support Vector Machine | Classical statistical technique; wide / narrow data / text
Anomaly Detection | One Class Support Vector Machine (SVM) | Lack of examples
Attribute Importance | Minimum Description Length (MDL) | Attribute reduction; identify useful data; reduce data noise
Association Rules | Apriori | Market basket analysis; link analysis
Clustering | Hierarchical K-Means, Hierarchical O-Cluster | Product grouping; text mining; gene and protein analysis
Feature Extraction | Non-Negative Matrix Factorization (NMF) | Text analysis; feature reduction
  • 105. • Ranking functions • rank, dense_rank, cume_dist, percent_rank, ntile • Window Aggregate functions (moving and cumulative) • Avg, sum, min, max, count, variance, stddev, first_value, last_value • LAG/LEAD functions • Direct inter-row reference using offsets • Reporting Aggregate functions • Sum, avg, min, max, variance, stddev, count, ratio_to_report • Statistical Aggregates • Correlation, linear regression family, covariance • Linear regression • Fitting of an ordinary-least-squares regression line to a set of number pairs. • Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions Descriptive Statistics • DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, median, stats_mode, variance, standard deviation, quantile values, +/- n sigma values, top/bottom 5 values • Correlations • Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric). • Cross Tabs • Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa • Hypothesis Testing • Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA • Distribution Fitting • Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential Oracle Database SQL Analytics Included In The Oracle Database
  • 108. Big Data Is More Than Just Hardware & Software
  • 109. The Math Is The Hard Part This is a very simple equation for a Fourier transformation of a wave kernel at 0.
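The equation itself is only an image on the slide; for reference, the usual Fourier transform convention, and its value at zero, look like this (assuming the slide used the standard convention):

    \hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx,
    \qquad
    \hat{f}(0) = \int_{-\infty}^{\infty} f(x)\, dx

That is, the transform at 0 is just the integral (the total mass) of the kernel.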
  • 110. The Math Is The Hard Part This is a photograph of a data scientist’s white board at Bit.ly
  • 111. Data Scientists Are Expensive And Hard To Find • Typical Job Description: “Ph.D. in data mining, machine learning, statistical analysis, applied mathematics or equivalent; three-plus years hands-on practical experience with large-scale data analysis; and fluency in analytical tools such as SAS, R, etc.” • Looking For “baIT”: Business, Analytics, IT, all in the same person. These people exist, but are very expensive.
  • 112. Growing Your Own Data Scientist • Business Acumen • Familiarity With (And A Liking For) Computational Linear Algebra / Matrix Analysis • Interest In SAS, R, Matlab • Familiarity With Lisp
  • 113. Big Data Cannot Do Everything
  • 114. Big Data Cannot Do Everything Big Data Is A Great Tool But Not A Silver Bullet You would never run a POS system on Hadoop; Hadoop is far too batch oriented to support this type of activity. Similarly, random access of data does not work well in the Hadoop world.
  • 115. When Big Data? When Relational? Size Of Data (rough measure)
  • 116. When Big Data? When Relational? RDBMS vs Hadoop: A Comparison
RDBMS | Hadoop
Fully SQL compliant; many RDBMS vendors extend SQL in useful ways | Helper languages (Hive, Pig): very useful but not as robust as SQL
Optimized for query performance; tunable (input vs output, long running queries, etc.) | Optimized for analytics operations, specifically those of a statistical nature
Armies of trained and available resources | Resources are hard to find and expensive when found
Requires more specialized hardware at performance extremes | Designed to work on commodity hardware at all levels
OLTP, OLAP, ODS, DSS, hybrid -- more general purpose | Basically only for analytics
Expensive to implement over wide geographical distribution | Designed to span data centers
Very mature technology | Very new technology
Real time or batch processing | Batch operations only
Nontrivial licensing costs | Open source (“free” --ish)
About 2 PB as largest commercial cluster (telecom company) | 100+ PB as largest commercial cluster (Facebook, as of March 2013)
Ad hoc operations common, if not encouraged | Ad hoc operations possible with HBase but nontrivial
  • 117. It Is Not An “Either/Or” Choice RDBMS and Hadoop Each Solve Different Problems
  • 118. Where Are Things Heading?
  • 119. A Quick Recap GFS Presented To The Public In 2003 MapReduce Presented To The Public in 2004
  • 120. Hadoop Is Already Dead? YES. Sort Of* (* = for a specific set of problems…)
  • 121. The New Stuff In Overview [table: name, publication year, use, what it does, impact, open source]
Colossus (n/a): GFS for realtime systems. Not open source.
Caffeine (2009): real time search; incremental updates of analytics and indexes in real time. Estimated to be 100x faster than Hadoop. Not open source.
Pregel (2009): social graphs, location graphs, learning & discovery, network optimization, the Internet of Things; analyzes next-neighbor problems. Estimated to handle billions of nodes & trillions of edges. Open source analog: Apache Giraph (alpha).
Percolator (2010): large scale incremental processing using distributed transactions; makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Data in the environment remains much more up to date with less effort.
Dremel (2010): SQL-like language for queries on the above technologies; interactive, ad hoc queries over trillion-row tables in subsecond time, working against Caffeine / Pregel / Colossus without requiring MapReduce. Easier for analysts and non-technical people to be productive (i.e. not as many data scientists are required). Open source analog: Apache Drill (incubator, very alpha).
Spanner (Oct 2012): fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe; uses GPS sensors and atomic clocks to keep server clocks in sync regardless of location or other factors. Transactional support on a global scale at a fraction of the cost, and where (many times) not technically possible otherwise. Not open source, and unlikely to ever be.
Storm (2012): real time Hadoop-like processing; the power of Hadoop in real time, eliminating the requirement for batch processing. Not from Google; from Twitter. Open source (beta).
  • 122. One Last Thing Hadoop Is Just The Start Of The Equation
  • 123. One Last Thing Hadoop For Analytics And Determining Boundary Conditions Is Just The Start Of The Equation Use Hadoop to analyze all of the data in your environment and then generate mathematical models from that data.
  • 124. One Last Thing Acting On Boundary Conditions Once the model has been built (and vetted), it can be used to resolve events in real time, thereby getting around the batch bottleneck of Hadoop.
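That split looks like this in miniature: train offline on the Hadoop-crunched history, then score each incoming event immediately. A sketch with scikit-learn; the features and data are invented stand-ins for whatever the batch job actually produced:

    # Batch-train once, then score events in real time.
    from sklearn.linear_model import LogisticRegression

    # Offline: fit on (feature, label) history reduced by the Hadoop job.
    X_hist = [[0.10], [0.35], [0.40], [0.70], [0.85], [0.90]]
    y_hist = [0, 0, 0, 1, 1, 1]
    model = LogisticRegression().fit(X_hist, y_hist)

    # Online: no batch job in the loop; each event is scored on arrival.
    def handle_event(features):
        return model.predict([features])[0]

    print(handle_event([0.80]))   # -> 1, act on this event now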
  • 125. No Really. One More Last Thing
  • 126. Who Is Hilary Mason? • Chief Data Scientist At bit.ly • One of the major innovators in data science • Scary smart and fun to be around • A heck of a teacher, to boot Photo credit: Pinar Ozger, Strata 2011
  • 127. Interpret The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first… The Mason 5 Step Process For Big Data In Reverse Order
  • 128. Model Model the data into a useful paradigm which will allow us to make sense of any new data based on past experiences. But, before we can model the data, we must first…. The Mason 5 Step Process For Big Data In Reverse Order
  • 129. Explore Explore the data we have and look for meaningful patterns from which we could extract a useful model. But, before we can look through the data for meaningful patterns, we first have to… The Mason 5 Step Process For Big Data In Reverse Order
  • 130. Scrub Clean and clarify the data we have to make it as neat as possible and easier to manipulate. But, before we can clean the data, we have to start with… The Mason 5 Step Process For Big Data In Reverse Order
  • 131. Obtain Obtaining as much data as possible. Advances in technology – coupled with Moore’s law – means that DASD is very, very cheap these days. So much so that you may as well hang on to as much data as you can, because you never know when it will prove useful. The Mason 5 Step Process For Big Data In Reverse Order
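Read forward, the five steps chain naturally. A purely illustrative Python skeleton, with each placeholder body standing in for the real work described above:

    # Obtain -> Scrub -> Explore -> Model -> Interpret, in forward order.
    def obtain():       return ["  Raw Record 1", "raw record 2  "]  # keep everything
    def scrub(raw):     return [r.strip().lower() for r in raw]      # clean & normalize
    def explore(data):  return {"records": len(data)}                # look for patterns
    def model(data):    return lambda new_item: "expected"           # build the paradigm
    def interpret(m, stats):
        return "decision based on %(records)d records" % stats       # act on it

    data = scrub(obtain())
    stats = explore(data)
    m = model(data)
    print(interpret(m, stats))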
  • 133. Some Resources White Papers: • An Architect’s Guide To Big Data • Big Data For The Enterprise • Big Data Gets Real Time • Build vs. Buy For Hadoop This Deck: Slideshare Web Resources: • Oracle Big Data • Oracle Big Data Appliance • Oracle Big Data Connectors Me: charles dot scyphers oracle dot com @scyphers (twitter)

Editor's Notes

  1. Not just a lot of information. [click] My working definition is anything so large it becomes very hard to manage with the usual tools. It’s not that you cannot work with big data using your traditional toolsets, it’s just that with Big Data tools you can do it faster and cheaper.
  2. CIOs see licensing as a barrier; focus pricing on researchers. The data management era brings a number of new challenges. Volume has always been a problem, but more so now because of the increased opportunity to gather data: equipment has more and more monitors in it, which generate more and more data. In the past, people typically grabbed the piece of information they wanted and ditched the rest; today people find these streams of data more interesting and want to keep hold of them, so the volume of data you would like to retain is growing rapidly. Linked to that is velocity: not only is the data growing, it is arriving a lot faster. Collecting data from a machine or any other source these days can come at a phenomenal rate, like terabytes per minute. And typically people are looking to dive into a lot more different data sources: data they generate themselves or data from outside sources such as LinkedIn, Twitter and others, scraping information and linking it into what they have. The types of data are not just text and numbers, but images, pictures, graphs, TV cameras. Linked to that is the challenge of value: you have this huge collection of data, in all these different types, and across multiple groups you get huge value, but only small pieces of data from each of these groups are relevant to your business or the research being done. These are the challenges, so how can Oracle help you get that value-add?
  3. Data in transit – your phone call or the email of your vacation photos while traveling over the network backbone 1 GB stored content can create 1 PB in transit Stored data is doubling about every 2 years. 130 Exabytes in 2005 1227 Exabytes in 2010 (1.19 Zettabytes) 7910 EB in 2015 (7.72 Zettabytes)
  4. Big Data is driving significant data volume in customers who are leveraging it. A wide variety of sources provide this type of data.
  5. Definitions are from Peter Wood, Professor of Computer Science at the University of London
  6. These definitions are solely my own
  7. There are lots, but the main one (and the one on which we are going to focus today) is Hadoop
  8. It costs a lot more money to build bandwidth than it does CPU
  9. Meanwhile at Yahoo, Doug Cutting was working on Nutch, Yahoo’s next generation search tool. The elephant is important; trust me
  10. Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm
  11. HDFS Distributes Files At The Block Level Across Multiple Commodity Devices For Redundancy On The Cheap Not RAID: Distribution Is Across Machines/Racks
  12. By Default, HDFS Writes Into Blocks & The Blocks Are Distributed Three Times. The block size can be set by the user. Pay attention to the NameNode here; this server keeps track of where all the chunks have been distributed across the file system. If you lose it, you’re hosed and have to rebuild everything from scratch.
  13. Data Is Written Once & (Basically) Never Erased
  14. Data Is Read From The Stream In Large, Contiguous Chunks, Not Random Reads
  15. Hadoop is just a programming paradigm. You can do MapReduce inside an Oracle database; you generally just don’t want to do so.
  16. Basically, a way of measuring how important an attribute is to the whole. The number of times it appears within the item compared to the background environment.
  17. What does a given person do and how would they behave in a given situation
  18. “80% of all network traffic (internet or otherwise) is one machine talking with another machine.” Mike Olson, Cloudera
  19. Spam vs. Ham
  20. Flume as an intake device, MapReduce as a transformation engine. Instead of the classic hub & spoke of Informatica, you can run your ETL across a few thousand nodes and massively increase the throughput. Facebook uses Hadoop as an underlying architecture (through lots of filtering) in its messaging application: 1.5M ops/sec at peak, 75B+ ops/day.
  21. Includes Sentiment Analysis What is this person thinking? Is that a happy smile, a sarcastic smile, a sad smile?
  22. A customer can ingest all the logs from every machine in their environment and data mine the results to find any machine out of compliance.
  23. Monte Carlo simulations, complex derivative valuations, predicting when a customer is heading into credit problems (and shortening their terms before you get caught in their problems), demand forecasting.
  24. Credit risk, scoring and analysis; parallelizing data access as well as computation. “A large financial institution combined their data warehouses into a single Hadoop environment. They then used that information to more accurately score their customer portfolio risk.” Social networking activity, bill payments (cell phone, for example), how often have you moved.
  25. When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from some of the more sensitive areas.
  26. I hurt myself on the yards and you have to pay me workers comp. Then he tells Twitter he’s going to his house in Belize for some waterskiing.
  27. Look for bad actors within NGC; Nick Leeson at Barings in 1995, for example. Shrinkage detection. Enable the security people to better do their jobs in monitoring the activities of people in sensitive positions. The Petraeus scandal: one of the reasons why the FBI was able to close in on the identities of the people involved is that they were able to geolocate the sender and receiver of the Gmail emails and then connect those IP addresses with known users having the same IP addresses.
  28. Portfolio evaluation for existing holdings, portfolio evaluation for future activities, high speed arbitrage trading, simply keeping up: "Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row", 10k credit card transactions per second. All stats here from ComputerWorld, 04/25/12.
  29. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  30. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  31. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  32. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  33. Social Networking is coming to NGC’s customers at some point in time. It won’t be Facebook, but it will be something internal for the Navy (and/or the military). Oracle uses a secured social network internally to great effect… Live Twitter demo: http://50.17.239.57:9704/analytics/saw.dll?dashboard&PortalPath=%2Fshared%2FSentiment%20Analysis%2F_portal%2FSentiment%20Analsysis weblogic/welcome1
  34. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  35. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  36. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  37. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  38. Advance Auto Parts
  39. Advance Auto Parts
  40. Use machine processing to “read” the press releases and blogs of your customers to learn when they are getting ready to cut their budget. NGC can then position themselves to best answer their customer needs. [click] This can also extend to picking opportunities [click] from other competitors when they fall short. For that matter, [click] have programs scouring your competitor’s site and then use their own information against them. “Gosh, Air Force, I don’t know if I’d trust Boeing right about now; aren’t they using some of the same Dreamliner tech on their avionics package? Maybe we could help out there….”
  41. Use machine processing to “read” the press releases and blogs of your customers to learn when they are getting ready to cut their budget. NGC can then position themselves to best answer their customer needs. [click] This can also extend to picking opportunities [click] from other competitors when they fall short. For that matter, [click] have programs scouring your competitor’s site and then use their own information against them. “Gosh, Air Force, I don’t know if I’d trust Boeing right about now; aren’t they using some of the same Dreamliner tech on their avionics package? Maybe we could help out there….”
  42. As of Monday, there are [click] 724 Hadoop postings open in the DC area. For each of those jobs, [click] you’ll have hundreds, if not thousands, of applicants. So, how can you determine [click] that she is the one you want? Not because she’s the most technically adept, but because she is going to fit with your corporate culture and existing team.
  43. What do I mean by Cultural Fit? Well, the easiest way to get this across is what I call the airport test. When you’re thinking of hiring someone [click] and you have to sit in an airport [click] with them while the flight is delayed [click] for a few hours, would that make you happy or would you cringe at the thought of hours of chit-chat and making conversation.
  44. Instead of doing a simple keyword match in the resume, go beyond the resume and find out more about the person. Regardless of where their resume says they worked or went to school, language analysis can reveal details about where they grew up and where they experienced their formative years. [click] is that a faucet or a spigot? [click] A wallet or a billfold? [click] A dog, a hound or a hound dog? And it’s more than just regional. All these words basically mean the same thing, but come from a different cultural point in time. You can use all of this information – and you can get from Facebook, twitter, blog posts and the like – to help determine if a potential hire is going to work well within your team. And you can do this all before they ever set foot on your property for an interview.
  45. Instead of doing a simple keyword match in the resume, go beyond the resume and find out more about the person. Regardless of where their resume says they worked or went to school, language analysis can reveal details about where they grew up and where they experienced their formative years. [click] is that a faucet or a spigot? [click] A wallet or a billfold? [click] A dog, a hound or a hound dog? And it’s more than just regional. All these words basically mean the same thing, but come from a different cultural point in time. You can use all of this information – and you can get from Facebook, twitter, blog posts and the like – to help determine if a potential hire is going to work well within your team. And you can do this all before they ever set foot on your property for an interview.
  46. Log analysis Improve uptimes through predictive failure analysis
  47. The machines on a manufacturing floor produce data exhaust: Use this exhaust to improve the efficiency of the production line.
  48. Trash bins are not an item most would consider when it comes to the internet of things. Here’s how they could provide valuable intelligence
  49. We make the trash bins smart. [advance] You can buy a consumer grade, wifi enabled scale for about $100 apiece; I’ve seen bulk quotes on the internet for as low as $40 a pop. Put one of these scales under each of the bins [advance] and now the bin will tell you when it’s full.
  50. Currently, the custodian has to go [just start advancing 13 times], check each bin in turn and then empty the bin if necessary. With a self-reporting bin, the custodian [advance to phone image] can check his smart phone [advance to next slide]. Walgreens did this, and cut $57M out of their bottom line in 2012.
  51. And see where he needs to go. Less time on the floor, lower costs for cleanup, a more efficient waste management process. But, more importantly, we can now focus on what is happening when these bins are filling up. [advance] We can create a histogram for the amount of waste ingested at each bin. If you look [advance], you can see an outlier on the high side and [advance] an outlier on the low side. Take this one. [advance]
  52. Why does this particular bin fill up so much faster than all the others? Is there something inefficient in the line which can be remedied? [advance] This is an example of data exhaust from before. Once we learn that this bin is filling up much faster than the other bins, we can start to look into the line around it and see if there is something about the manufacturing process which can be improved. After a bit of digging, we may discover that there is a problem with the machine cutting away too much metal; we refactor the line to send less metal down the pipe, saving on material costs and improving the efficiency of the line.
  53. Advance Auto Parts
  54. No. Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle). These features include (but are not limited to): security, ad-hoc query support, SQL support, readily available technical resources.
  55. In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis. Oracle Connectors; other options
  56. Storage is the primary limiting factor, with one exception
  57. Storage is the primary limiting factor, with one exception
  58. If you remember from before, the NameNode controls the file distribution. It’s also the bottleneck for growth; you can only add nodes and files to the system if the NameNode can hold that information with its available RAM.
  59. So, for the NameNode, load the machine up with as much memory as possible.
  60. FUSE-DFS is a utility that allows a user to mount the distributed file system as a traditional file system (e.g. you can mount it on another server as a remote disk)
  61. Hue is the Cloudera analog to OEM
  62. Here are some of the powerful capabilities of Cloudera Manager Service health and performance – Cloudera Manager is the only Hadoop management application that gives you the ability to get a real time view of the health of all the services running in the Hadoop stack. Competitive products tend to focus primarily on the file system, which is only 1 piece of the solution. Host-Level Snapshots – this gives you a view into that status of each host or node in your cluster Monitor and Diagnose Workloads – with Cloudera Manager, you can view and compare current and historical job performance for benchmarking, troubleshooting and optimization View/Search Hadoop Logs – Cloudera Manager is the only Hadoop management application that provides comprehensive log management. Each screen provides contextual log views, so you only need to view the logs that are relevant to what you’re looking at. You can also search logs by keyword, type and severity. Track Events – Cloudera Manager creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching Usage/Performance Reports – With Cloudera Manager you can visualize current and historical disk usage by user, group and directory. Track MapReduce activity on the cluster by job or user
  63. Mahout is a collection of machine learning libraries. Mahout is also the job title for elephant wranglers in India
  64. Oozie manages workflow and dependencies between MR jobs
  65. Flume supports massively fast intake of log files.
  66. Sqoop is a very simple connector between Hadoop and any ANSI SQL database using JDBC.
  67. Pig and Hive are helper languages to provide a more SQL like interface to the Hadoop environment. Both work with MapReduce behind the scenes.
  68. Hbase supports read/write access in a columnar store style
  69. Whirr is the deployment tool to push out new nodes into the Hadoop environment. Very similar to Chef or Puppet, if your customers are already familiar with either of those tools
  70. Zookeeper manages the coordination between all of the distributed services
  71. BigTop is a test harness for Hadoop – both the environment as well as specific MapReduce jobs
  72. Build slide. In Analytics, we start with [click] Standard report, move to [click] Ad Hoc, then [click] Drill Down. These are all [click] ways of analyzing what has happened or what is happening right now. Next, are alerts [click] to let me know that action must be taken, [click] simulation to experiment with ways to shape the action, [click] forecasting to take a look at what is happening now and projecting it into the future, and [click] prediction to play “what if?” All of these are [click] predictive in nature – what’s going to happen next. The top tier is when you get into [click] various forms of [click] optimization – both when you believe you have a good handle on the circumstances and when you do not. These areas are [click] prescriptive – given what we expect to be next, what is the best course of action.
  73. Big Data can play across of these areas, but it is better suited for the higher level, more complex operations. It’s not that Big Data cannot support a more standard approach to reporting, it’s just that those areas are probably better served by existing, lower cost options.
  74. This is what Oracle sees as the typical stages in analytics … ranges from initial data discovery to predictive analytics. [click] Many organizations are investing at the two ends of this spectrum today.
  75. Our customers continue to evolve. [click] While there is a lot of hype and promise from Big Data, most are continuing to focus on aligning data warehouses with business needs, etc. [click] However, investments in Big Data are becoming much more common, often starting with proof of concepts.
  76. &quot;Big Data is not only about analytics, it&apos;s about the entire value chain. So when you think about Big Data solutions you have to think about all the different steps. In the first step, you need to actually acquire and store the data.
  77. The next step is to organize the data – you will have acquired massive amounts of unstructured data, but it won’t be of use until you organize or transform and distill it such that it can be easily integrated into your data center.
  78. Next, you will want to analyze the data – slice it and dice it, do data mining on it , look at it in tables and cubes etc. Basically, you want to know what this means.
  79. And lastly, you want to turn this into something useful something that decision makers can see in their dashboards quickly so that they can act upon in near real-time.
  80. There are a lot of new technologies out there that address the challenges at each stage of the process we just talked about.
  81. Conquering Big Data with the Oracle Information Model. We typically look at capabilities through People, Process, and Tools. We had a lot of discussion this morning on tools and products, so let me direct your attention to a few other dimensions of big data capability. First, the Big Data process is different. The development of traditional BI and DW is entirely different from Big Data. With traditional BI, you know the answer you are looking for: you simply define requirements and build to your objective. With Big Data (of course, not in all cases), you may have an idea or interest, but you don’t know what will come out of it. The answer to your initial question will trigger the next set of questions, so the development process is more fluid. It requires that you explore the data as you develop and refine your hypothesis. So this might be a process you go through with big data: Hypothesis (the big idea); Data Sources (acquire, access, capture data: private weblogs, streams, public [data.gov]); Explore Results (simple MapReduce results with Hive/QL or SQL, interactive query through search, visualization); Reduce Ambiguity (apply statistical models: eliminate outliers, find concentrations, and make correlations). You interpret the outcome and continuously refine models and establish an improved hypothesis. In the end, this analysis might lead to the creation of new theories and predictions based upon the data. Again, it’s very fluid and very different from traditional SDLC and BI development.
  82. The comparison with the 30-node cloud based cluster is showing a single 18 node BDA being 2.5x faster than an almost twice as large Amazon cluster. The reason that this is only 2.5x is because a 30 node cluster has substantially more mappers and reducers running. On a normalized basis a BDA achieves 4x the throughput of the Amazon cluster.
  83. Direct Connect: Optimized version of External Tables for HDFS. Fast, Parallelized data movement with automatic load balancing Loader: A MapReduce utility to load data from Hadoop into Oracle. Handles data conversion on the Hadoop side, makes loads very fast and efficient ODI Adapter: Works with ODI, creates MapReduce jobs behind the scenes, uses Hive (qv) R Connector: Writes MR jobs behind the scenes, Connects R, Oracle, local file system and HDFS.
  84. Embedded analytics focus: Oracle R Enterprise enabling R statistics programs to be run against data in the Oracle Database eliminating latency and improving data security.
  85. Embedded analytics focus: Data Mining algorithms available via SQL as part of the Advanced Analytics Option.
  86. Embedded analytics focus: What’s included in the Oracle Database at no charge.
  87. Oracle Endeca Information Discovery provides the Endeca Server which provides a “multi-faceted” data model that automatically provides drill paths through structured and unstructured data that is loaded into the server.
  88. Support for mobile experience provided by the BI Foundation Suite for iOS (Apple) devices, here represented as being hosted on Exalytics.
  89. Oracle’s goal is to reduce the amount of time required to implement these solutions, simplify support, allow you to focus on delivering value rather than on maintaining infrastructure, and provide the tools you need to effectively analyze data and generate insights. Let’s look at this picture from left to right. Twitter data streamed into the ..
  90. This is a very simple equation for a Fourier transformation of a wave kernel at 0. If you think the data analysts at your customer would look at the above equation and cringe, or would hear the description I just gave and glaze over, then they are not ready for this.
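The slide’s equation is an image and is not reproduced in these notes; as an illustrative stand-in (an assumption, not the actual slide), a Fourier transform evaluated at frequency zero collapses to the plain integral of the signal:

\hat{f}(0) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \cdot 0 \cdot t}\, dt = \int_{-\infty}^{\infty} f(t)\, dt

If that notation makes an analyst flinch, the point of the slide stands.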
  91. A picture of one whiteboard at bit.ly
  92. The demand for people with programming skills, math skills and business acumen is out of this world.
  93. Many companies are opting to grow their own rather than hire from the outside. If this is your customer, they need to look for a programmer who liked Lisp in college, knows computational matrices, and knows his/her way around the business issues.
  94. Big Data is a very powerful tool, but it is not the right tool for every problem.
  95. You would never operate a POS system on Hadoop – you can only sell that widget once and only once, and the batch-processing nature of Hadoop doesn’t support this type of activity. If you remember from the technical overview, Hadoop reads data in contiguous streams, so random access of data does not work very well in a Hadoop world.
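To see why Hadoop favors contiguous streams, here is a minimal Hadoop Streaming-style mapper in Python (a hypothetical word-count-style job, not from the slides). Each record arrives once, in order, on stdin; there is no way to seek to an arbitrary record, which is exactly what a POS lookup would require:

```python
#!/usr/bin/env python
# mapper.py - Hadoop Streaming feeds each input split as one contiguous stream
import sys

for line in sys.stdin:                      # strictly sequential; no seek-to-record
    for word in line.split():
        sys.stdout.write(f"{word}\t1\n")    # emit key<TAB>count for the reducers
```

The framework then sorts and groups the emitted keys before the reduce phase, another inherently batch-oriented step.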
  96. The amount of data is the wrong measurement. <1 / 1-50 / 50-300 / 300-600 / 600+ is my yardstick, but only if I have to make a size determination.
  98. In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis – via the Oracle connectors or other options.
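As a minimal illustration of that hand-off pattern (using Python’s built-in sqlite3 as a stand-in for the Oracle target; in practice you would use the connectors described earlier, and the table and data here are hypothetical):

```python
import sqlite3

# Pretend this is the aggregated output of a Hadoop job: key<TAB>count lines
hadoop_output = ["widget_a\t42", "widget_b\t17", "widget_c\t99"]

conn = sqlite3.connect(":memory:")   # stand-in for the BI-side database
conn.execute("CREATE TABLE sales_summary (product TEXT, units INTEGER)")

rows = ((p, int(c)) for p, c in (line.split("\t") for line in hadoop_output))
conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)

# Traditional BI-style query over the imported results
for product, units in conn.execute(
        "SELECT product, units FROM sales_summary ORDER BY units DESC"):
    print(product, units)
```

The heavy lifting happens in the batch layer; the relational side only ever sees the small, already-crunched summary.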
  99. Caffeine was built by Google to address real-time indexing (instant results when searching). This technology will be of high interest to organizations looking to access their quickly changing data in real time, but it is not as useful for longitudinal or historical introspection.
  100. Use Hadoop to analyze all of the data within your corpus and then generate a mathematical model. This model can be as simple as a hard-knee waveform or as complex as a multivariate linear regression.
  101. Once the model has been created (and properly vetted, of course), it can be used to determine the resolution of events in real time – thereby getting around the batch bottleneck of Hadoop. These real-time events can be handled quite well in a system like Oracle’s Complex Event Processing. (hand over)
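A hedged sketch of that two-phase pattern in Python, assuming NumPy and made-up feature names (neither the data nor the model comes from the slides): fit a multivariate linear regression offline on the batch output, then score individual events as they arrive.

```python
import numpy as np

def hard_knee(x, knee=50.0):
    """The 'simple' end of the spectrum: zero output until x crosses the knee."""
    return 0.0 if x < knee else x - knee

# --- Batch phase: fit the model on the output of a Hadoop run over the corpus ---
# Hypothetical features per event: [clicks, session_seconds]; target: purchases
X = np.array([[10, 120], [25, 300], [40, 480], [55, 600]], dtype=float)
y = np.array([1.0, 2.0, 4.0, 5.0])

X1 = np.hstack([X, np.ones((len(X), 1))])      # append an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit

# --- Real-time phase: score single events (e.g., inside a CEP engine) ---
def score(clicks, seconds):
    """Apply the pre-fitted coefficients to one live event; no batch job needed."""
    return coef[0] * clicks + coef[1] * seconds + coef[2]

print(score(30, 360))   # predicted purchases for one incoming event
```

The expensive fitting step stays in the batch world; the per-event scoring is a handful of multiplications, cheap enough for real time.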
  102. Hilary Mason is the chief data scientist at bit.ly (a web service which shortens links for social media). They handle ~80M new URLs per day and ~300M clicks per day. She’s an excellent lecturer and instructor – you really should find time to listen to her speak – and I’ve learned quite a bit from her over the years. She views Big Data projects as moving across 5 distinct stages. Let’s go through them…. in reverse order. In other words, let’s start at the end. What do we want as the end result of a Big Data project?
  103. The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first….
  104. Model the data into a useful paradigm which will allow us to make sense of any new data based upon past experiences. But before we can model the data, we must first…
  105. Explore the data we have and look for meaningful patterns from which we could extract a useful model. But before we can look through the data for a meaningful pattern, we first have to…
  106. Clean and clarify the data we have, to make it as neat as possible and as easy as possible to manipulate. But before we can clean the data, we have to start with…
  107. Obtaining as much data as possible. Advances in technology coupled with Moore’s law mean that DASD is very, very cheap these days – so much so that you might as well hang on to as much data as you can, because you never know when it will prove useful. And here’s where the BDA comes back into play: able to ingest terabytes of data per hour, with the disk to store it (particularly when coupled with ZFS), it’s a great starting place.
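Putting the five stages together, a minimal pipeline skeleton in Python (stage names taken from the notes above; the bodies are placeholders, not an actual bit.ly implementation) might look like:

```python
def obtain():
    """Acquire and keep as much raw data as is practical (storage is cheap)."""
    ...

def clean(raw):
    """Scrub the data: normalize formats, deduplicate, fix encodings."""
    ...

def explore(tidy):
    """Look for meaningful patterns worth turning into a model."""
    ...

def model(patterns):
    """Fit a model that makes sense of new data based on past experience."""
    ...

def interpret(fitted):
    """Turn model output into decisions people can act on."""
    ...

if __name__ == "__main__":
    interpret(model(explore(clean(obtain()))))
```

Read forward, the call chain runs in exactly the order the talk presented in reverse: obtain, clean, explore, model, interpret.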