Big Data: An Overview
What Is Big Data?
What Is Big Data?
• Big Data is not simply a huge pile of information
• A good starting place is the following paraphrase:
“Big Data describes datasets so large they become awkward
to manage with traditional database tools at a reasonable cost.”
VOLUME VELOCITY VARIETY VALUE
[Slide graphic: social, blog, and smart-meter data streams rendered as raw binary.]
A Breakdown Of What Makes Up Big Data
Data Growth Explosion
• 1 GB of stored content can create 1 PB of data in transit
Data & Image courtesy of IDC
• The totality of stored data is doubling about every 2 years
• This meant 130 EB in 2005
• 1227 EB in 2010 (1.19 ZB)
• 7910 EB in 2015 (7.72 ZB)
Growth Of Big Data
Harnessing Insight From Big Data Is Now Possible
[Chart: GB of data (in billions), from 0 to 10,000, over 2005 to 2015, split into
structured data (managed inside relational databases) and unstructured data
(managed outside relational databases).]
• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data, managed outside relational databases
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
So, Just Any Dataset?
• Big Data Can Work
With Any Dataset
• However, Big Data
Shines When Dealing
With Unstructured Data
Structured Vs. Unstructured
Structured Data is any data to which a
pre-defined data model can be applied
in an automated fashion, producing a
semantically meaningful result without
referencing outside elements.
In other words, if you can apply some
template to a data set and have it
instantly make sense to the average
person, it’s structured.
If you can’t, it’s unstructured.
Really? Only Two Categories?
Okay, there’s also
semi-structured data.
Which basically
means after the
template is applied,
some of the result
will make sense and
some will not.
XML is a classic
example of this
kind of data.
Formal Definitions Of Data Types
Structured Data:
Entities in the same group have the same descriptions (or attributes), while descriptions for
all entities in a group (or schema): a) have the same defined format; b) have a predefined
length; c) are all present; and d) follow the same order. Structured data are what is normally
associated with conventional databases such as relational transactional ones where
information is organized into rows and columns within tables. Spreadsheets are another
example. Nearly all widely used database management systems (DBMS) are designed for
structured data.
Semi-Structured Data:
Semi-structured data are intermediate between structured and unstructured data, wherein “tags” or
“structure” are associated or embedded within unstructured data. Semi-structured data are
organized in semantic entities, similar entities are grouped together, entities in the same
group may not have same attributes, the order of attributes is not necessarily important, not
all attributes may be required, and the size or type of same attributes in a group may differ. To
be organized and searched, semi-structured data should be provided electronically from
database systems, file systems (e.g., bibliographic data, Web data) or via data exchange
formats (e.g., EDI, scientific data, XML).
Unstructured Data:
Data can be of any type and do not necessarily follow any format or sequence, do not follow
any rules, are not predictable, and can generally be described as “free form.” Examples of
unstructured data include text, images, video or sound (the latter two also known as
“streaming media”). Generally, “search engines” are used for retrieval of unstructured data
via querying on keywords or tokens that are indexed at time of the data ingest.
Informal Definitions Of Data Types
Structured Data:
Fits neatly into a relational structure.
Semi-Structured Data:
Think documents or EDI.
Unstructured Data:
Can be anything.
Text Video Sound Images
Tools For Dealing With Semi/Un-Structured Data
What Is Hadoop?
“The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.”
“The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver
high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be
prone to failures.”
Rather than moving the data to a central server for processing…
The Paradigm Shift Of Hadoop
Centralized Processing Doesn’t Work
Moving data to a central location for
processing (like, say, Informatica)
cannot scale. You can only buy a
machine so big.
The Paradigm Shift Of Hadoop
Bandwidth Is The Bottleneck
• Moving data around
is expensive.
• Bandwidth $$ > CPU $$
The Paradigm Shift Of Hadoop
Process The Data Locally Where It Lives
The Paradigm Shift Of Hadoop
Then Return Only The Results
• You move much less data
around this way
• You also gain the advantage
of greater parallel processing
Where Did Hadoop Originate?
GFS
Presented To The
Public In 2003
MapReduce
Presented To The
Public in 2004
Spreading Out From Google
Doug Cutting was working on “Nutch”, Yahoo’s next-generation search
engine, when he read the Google papers and reverse engineered the
technology. The elephant was his son’s toy, named….
Going Open Source
HDFS MapReduce
Released To Public 2006
A Bit More In Depth, Then A Lot More In Depth
HDFS MapReduce
HDFS is primarily a data
redundancy solution.
MapReduce is where
the work gets done.
How Hadoop Works
Hadoop is basically a massively parallel, shared-nothing,
distributed processing framework.
GFS / HDFS
HDFS Distributes Files At The Block Level Across Multiple
Commodity Devices For Redundancy On The Cheap
Not RAID:
Distribution Is Across Machines/Racks
Data Distribution
By Default, HDFS Writes Into Blocks & The Blocks
Are Distributed x3
WORM
Data Is Written Once & (Basically) Never Erased
How Is The Data Manipulated?
Not Random Reads
Data Is Read From The Stream In
Large, Contiguous Chunks
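A small sketch of these write-once / stream-read semantics through the HDFS Java API (the cluster address, file path, and contents are illustrative assumptions, not from the deck):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address
    FileSystem fs = FileSystem.get(conf);

    // Write once: HDFS splits the file into blocks and, by default,
    // replicates each block three times across machines/racks.
    Path p = new Path("/data/events.log");
    FSDataOutputStream out = fs.create(p, true);
    out.writeBytes("first record\n");
    out.close();
    fs.setReplication(p, (short) 3); // explicit here, though 3 is the default

    // Read as a stream, front to back, in large contiguous chunks.
    // There is no in-place update or random-access rewrite.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
    for (String line; (line = in.readLine()) != null; ) {
      System.out.println(line);
    }
    in.close();
  }
}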
The Key To Hadoop Is MapReduce
In a Shared Nothing architecture,
programmers must break the work
down into distinct segments that are:
• Autonomous
• Digestible
• Able to be processed independently
• Written with the expectation of
incipient failure at every step
A Canonical MapReduce Example
Image Credit: Martijn van Groningen
The data
arrives into
the system.
A MapReduce Example
The Input
The data is moved into the
HDFS system and divided into
blocks, each of which is copied
multiple times for redundancy.
A MapReduce Example
Splitting The Input Into Chunks
The Mapper picks up a chunk for
processing. The MR Framework
ensures only one mapper will be
assigned to a given chunk
A MapReduce Example
Mapping The Chunks
In this case, the Mapper
emits a word with the number
of times it was found.
A MapReduce Example
Mapping The Chunks
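A minimal sketch of this Mapper in Java (using the standard org.apache.hadoop.mapreduce API; in the simplest form it emits (word, 1) per occurrence and lets the framework do the counting downstream):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Whitespace tokenization is an illustrative simplification.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE); // emit (word, 1) for every occurrence
      }
    }
  }
}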
The Shuffler can do a rough
sort of like items (optional)
A MapReduce Example
A Shuffle Sort
The Reducer combines
the Mapper’s output into
a total
A MapReduce Example
Reducing The Emissions
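A matching Reducer sketch: the framework guarantees that all values for a given word arrive together, so a running sum yields the total.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  public void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    context.write(word, total); // emit (word, total occurrences)
  }
}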
The job completes with a
numeric index of words found
within the original input.
A MapReduce Example
The Output
MapReduce Is Not Only Hadoop
http://blogs.oracle.com/datawarehousing/2009/10/in-database_map-reduce.html
MapReduce is a programming paradigm, not a language. You can do MapReduce
within an Oracle database; it’s just usually not a good idea. A large MapReduce
job would quickly exhaust the SGA of any Oracle environment.
Problem Solving With MapReduce
• The key feature is the Shared Nothing architecture.
• Any MapReduce program has to understand
and leverage that architecture.
• This is usually a paradigm shift for most
programmers and one that many cannot
overcome.
Programming With MapReduce
• HDFS & MapReduce Are
Written In Java
package org.myorg;

import java.io.*;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount2 extends Configured implements Tool {

  public static class Map
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();

    private long numRecords = 0;
    private String inputFile;

    public void setup(Context context) {
      Configuration conf = context.getConfiguration();
      caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
      inputFile = conf.get("mapreduce.map.input.file");

      if (conf.getBoolean("wordcount.skip.patterns", false)) {
        Path[] patternsFiles = new Path[0];
        try {
          patternsFiles = DistributedCache.getLocalCacheFiles(conf);
        } catch (IOException ioe) {
          System.err.println("Caught exception while getting cached files: "
              + StringUtils.stringifyException(ioe));
        }
        for (Path patternsFile : patternsFiles) {
          parseSkipFile(patternsFile);
        }
      }
    }

    private void parseSkipFile(Path patternsFile) {
      try { // … (listing truncated)
• Will Work With Any Language
Supporting STDIN/STDOUT
• Lots Of People Using Python,
R, Matlab, Perl, Ruby et al
• Is Still Very Immature &
Requires Low Level Coding
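To round out the WordCount2 excerpt above, here is a sketch of the driver boilerplate that wires a job together (the class names refer to the mapper and reducer sketches shown earlier; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordMapper.class);    // mapper sketched earlier
    job.setCombinerClass(WordReducer.class); // optional local pre-aggregation
    job.setReducerClass(WordReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}

Packaged into a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCountDriver /input /output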
What Are Some Big Data Use Cases?
• Inverse Frequency / Weighting
• Co-Occurrence
• Behavioral Discovery
• “The Internet Of Things”
• Classification / Machine Learning
• Sorting
• Indexing
• Data Intake
• Language Processing
Basically, Clustering And Targeting
Inverse Frequency Weighting
Recommendation
Systems
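As a reference point (this is the textbook tf-idf definition, not a formula spelled out on the slide), the inverse-document-frequency weight of a term t across N documents, and the combined weight, are:

\mathrm{idf}(t) = \log \frac{N}{\lvert \{\, d : t \in d \,\} \rvert},
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

Terms that appear everywhere score near zero and rare terms score high, which is what makes the weighting useful for recommendation and retrieval systems.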
Co-Occurrence
Fundamental Data Mining –
People Who Did This Also Do That
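A hedged sketch of the co-occurrence idea as a MapReduce mapper (the one-basket-per-line, comma-separated input format is an invented assumption); the emitted pairs can then be totaled by a summing reducer like the one shown earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CoOccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override
  public void map(LongWritable offset, Text basket, Context context)
      throws IOException, InterruptedException {
    String[] items = basket.toString().split(",");
    // Emit every ordered pair of distinct items in the basket:
    // "people who did A also did B".
    for (String a : items) {
      for (String b : items) {
        if (!a.equals(b)) {
          pair.set(a.trim() + "\t" + b.trim());
          context.write(pair, ONE);
        }
      }
    }
  }
}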
Behavioral Discovery
Behavioral Discovery
“The best minds of my generation are
thinking about how to make people
click ads.”
Jeff Hammerbacher,
Former Research Scientist at Facebook
Currently Chief Scientist at Cloudera
“The Internet Of Things”
“Data Exhaust”
Classification / Machine Learning
Sorting
Current Record Holder:
•10PB sort
•8000 nodes
•6 hours, 27 minutes
•September 7, 2011
Current Record Holder:
•1.5 TB
•2103 nodes
•59 seconds
•February 26, 2013
Indexing
Data Intake
Hadoop can be used as a massively parallel ETL tool:
Flume to ingest files, MapReduce to transform them.
Language Processing
Includes Sentiment
Analysis
How can you infer
meaning from
someone’s words?
Does that smile mean
happy? Sarcastic?
Bemusement?
Anticipation?
How Can Big Data Help You?
9 Use Cases:
• Natural Language Processing
• Internal Misconduct
• Fraud Detection
• Marketing
• Risk Management
• Compliance / Regulatory Reporting
• Portfolio Management
• IT Optimization
• Predictive Analysis
Compliance / Regulatory Reporting
Predictive Analysis
Think data mining on
steroids. One of the main
benefits Hadoop brings to
the enterprise is the ability
to analyze every piece of
data, not just a statistical
sample or an aggregated
form of the entire
datastream.
Risk Management
Photo credit: Guinness World
Records (88 catches, by the way)
When considering a new hire, an extended investigation may show risky behavior on the
applicant’s part which may exclude him or her from more sensitive positions.
Risk Management
Behavioral Analysis
Fraud Detection
“Dear Company: I hurt myself working on the line and now I can’t walk without a
cane.” Then he tells his Facebook friends he’s going to his house in Belize for
some waterskiing.
Internal Misconduct
One of the reasons why the FBI was able to close in on the identities of the people
involved is that they geolocated the sender and recipient of the Gmail emails and
connected those IP addresses with known users on those same IP addresses.
Portfolio Management
• Evaluate portfolio performance on existing holdings
• Evaluate portfolio for future activities
• High speed arbitrage trading
• Simply keeping up:
"Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th
straight year in a row”
10,000 credit card transactions per second
Statistics courtesy of ComputerWorld, April 2012
Sentiment Analysis – Social Network Analysis
Companies used to
rely on warranty cards
and the like to collect
demographic data.
People either did not
fill out the forms or did
so with inaccurate
information.
Sentiment Analysis – Social Network Analysis
People are much more likely to be truthful when talking to their friends.
Sentiment Analysis – Social Network Analysis
This person – and
20 of their friends
– are talking about
the NFL.
This person
is a runner
Someone
likes Kindle
Someone is
current with
pop music
Sentiment Analysis – Social Network Analysis
Even Where You Least Expect It.
You Might Be Thinking Something Like “My Customer Will Never Use Social
Media For Anything I Care About. No Sergeant Is Ever Going To Tweet “The Straps
On This New Rucksack Are So Comfortable!!!”
Sentiment Analysis – Social Network Analysis
Internal Social Networking At Customer Sites
• Oracle already uses an internal social network to facilitate work.
• The US Military is beginning to explore a similar type of environment.
• It is not unreasonable to plan for the DoD installing a network on base; your
company could incorporate feedback from end users into design decisions.
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Apple Released iOS6 with their
own version of Maps. It has had
some issues, to put it mildly.
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Over half of all trades in the US are initiated by a computer algorithm.
Source: Planet Money (NPR) Aug 2012
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
People started to
tweet about the
maps problem, and
it went viral (to the
point that someone
created a Tumblr
blog to make fun
of Apple’s fiasco).
Sentiment Analysis – Apple iOS6, Maps & Stock Price
Photo courtesy of
http://theamazingios6
maps.tumblr.com/
As the Twitter stream started to peak, Apple’s
stock price took a short dip. I believe it likely
that automatic trading algorithms started to
sell off Apple based on the negative sentiment
analysis from Twitter and Facebook.
Natural Language Processing
Big, Huge, Blooming, Ample, Blimp, Gigantic, Abundant, Broad, Bulky, Capacious,
Colossal, Comprehensive, Copious, Enormous, Excessive, Exorbitant, Extensive,
Extravagant, Full, Generous, Giant, Goodly, Grand, Grandiose, Great, Hefty,
Humongous, Immeasurable, Immense, Jumbo, Gargantuan, Massive, Monumental,
Mountainous, Plentiful, Populous, Roomy, Sizable, Spacious, Stupendous,
Substantial, Super, Sweeping, Vast, Voluminous, Whopping, Wide, Ginormous,
Mongo, Badonka, Booku, Doozy
Natural Language Processing
[The same word cloud as above, resolving to a single concept:]
Large
Natural Language Processing
Anticipate Customer Need
Natural Language Processing
React To Competitor’s Missteps
Natural Language Processing
Cultural Fit For Hires
As of Apr 22, there were 724 Hadoop
openings in the DC area. There will be
hundreds – if not thousands – of applicants
for each position. How can you determine
who is the most appropriate candidate, not
just technically, but culturally?
Natural Language Processing
Cultural Fit?
A good way to think of cultural fit is
the “airport test.” If you’re thinking
of hiring someone and you had to sit
with them in an airport for a few
hours because of a delayed flight,
would that make you happy? Or
would you cringe at the thought of
hours of forced conversation?
Natural Language Processing
Analyze Their Writings For Cultural Fit
Go beyond simple keyword searches to find out more about the person.
Regardless of what their resume says, language analysis can reveal details about
where they grew up and where they experienced their formative years.
Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or
“hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words
are synonyms, they carry cultural connotations with them. Find candidates with
the same markers as your existing team for a more cohesive unit.
Natural Language Processing
Analyze Their Writings For Cultural Fit
IT Optimization
IT Optimization – Enabling The Environment
I’m running
out of
supplies!
I’m overheating!
Everything
Is Fine.
Wheel 21
is out of
alignment.
I’m 42.4%
full.
IT Optimization – Enabling The Shop Floor
A More Specific Example
I’m 42.4%
full.
IT Optimization – Enabling The Shop Floor
Make The Trash Smart
We can make the trash bins “smart” by
putting a wifi-enabled scale beneath each
bin and using that to determine when the
bins are reaching capacity.
As of now, the custodian has to check each bin to see if
it is full. With a “smart” bin, the custodian can check his
smart phone and see what does and does not need to be done.
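As a toy sketch of the calculation involved (the tare and full weights are invented calibration constants, chosen so the example bin reports 42.4% full, matching the callout above):

public class SmartBin {

  private static final double TARE_KG = 12.0; // empty bin + platform (assumed)
  private static final double FULL_KG = 52.0; // weight at capacity (assumed)

  /** Returns the fill level as a fraction between 0 and 1. */
  public static double fillLevel(double measuredKg) {
    double level = (measuredKg - TARE_KG) / (FULL_KG - TARE_KG);
    return Math.max(0.0, Math.min(1.0, level));
  }

  public static void main(String[] args) {
    System.out.printf("%.1f%% full%n", 100 * fillLevel(28.96)); // prints 42.4% full
  }
}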
IT Optimization – Enabling The Shop Floor
Cut Down On Clean Up Labor
More importantly, we can now focus on what is happening
to the bins and how they are being used. For example, we
may find outliers where one bin is filling much faster than
all of the others.
IT Optimization – Enabling The Shop Floor
Cut Down On Clean Up Labor
“Data Exhaust”
We can drill into why that bin is filling faster, leverage
the Six Sigma efficiency processes already in place
and improve the overall performance of the line.
IT Optimization – Enabling The Shop Floor
Drilling Into Waste Production
IT Optimization – Classify Legacy Data
A customer can use a machine learning process
to take unknown data and sort it into useful data
elements. For example, a retail car part company
might use this process to sort photos – is that
circle a steering wheel, a hubcap or a tire?
So, All We Need Is Hadoop, Right?
Hadoop is amazing at processing, but lacks a number of features found in
traditional RDBMS platforms (like, say, Oracle):
• Security
• Ad-hoc query support
• SQL support
• Readily available technical resources
Then How Do We Fix Those Problems?
In general, do the data crunching in Hadoop, then import the results into a system
like Oracle for more traditional BI analysis.
Oracle’s Big Data Appliance
Oracle’s Big Data Appliance
In Depth
Big Data Appliance
The Specs Of The Machine
Hardware:
• 18 Compute/Storage Nodes
  • 2 six-core Intel processors
  • 48G Memory (up to 144G)
  • 12 x 3TB SAS Disks
• 3 InfiniBand Switches
• Ethernet Switch, KVM, PDU
• 42U rack
Software:
• Oracle Linux
• Java Virtual Machine
• Cloudera Hadoop Distribution
• R (statistical programming language)
• Oracle NoSQL Database
Environmental:
• 12.25 kVA (12.0 kW) Power Draw
• 41k BTU/hr (42k kJ/hr) Cooling
• 1886 CFM Airflow
Totals: 216 Cores, 864G RAM (2.5T Max), 648T Storage
Big Data Appliance
The Cloudera Distribution
The Analytics Evolution
What Is Happening In The Industry
[Chart: analytics capabilities plotted by degree of complexity against competitive
advantage; some organizations are partway up the curve, with growing investment
at the higher stages.]
Standard Reporting: What Happened?
Ad Hoc Reporting: How Many, How Often, Where?
Query/Drill Down: What Exactly Is The Problem?
Alerts: What Actions Are Needed?
Simulation: What Could Happen…?
Forecasting: What If These Trends Continue?
Predictive Modeling: What Will Happen Next If…?
Optimization: How Can We Achieve The Best Outcome?
Stochastic Optimization: How Can We Achieve The Best Outcome, Including The Effects Of Variability?
Descriptive: Analyzing Data To Determine What Has Happened Or Is Happening Now
Predictive: Examining Data To Discover Whether Trends Will Continue Into The Future
Prescriptive: Studying Data To Evaluate The Best Course Of Action For The Future
Source: Competing On Analytics: The New Science Of Winning; Thomas Davenport & Jeanne Harris, 2007
The Analytics Evolution
Where Big Data Fits On This Model
[The same chart as above, with Big Data positioned toward the upper end of the
curve: the predictive and prescriptive stages, where both complexity and
competitive advantage are highest.]
Typical Stages In Analytics
Choosing The Right Solutions For The Right Data Needs
[Chart: the same analytics stages, with "growing investment here" callouts at two
points on the curve.]
The Data Warehouse Evolution
What Are Oracle’s Customers Deploying Today?
[Chart: increasing business value plotted against information architecture maturity.]
• Data Marts: what happened yesterday
• Consolidated Data / Data Warehouse: what is happening today (most are here!)
• Data & Analytics Diversity / Big Data: what could happen tomorrow (some are here; growing investment here)
What Is Your Big Data Strategy?
Where Does Your Data Originate?
[Cycle diagram: ACQUIRE, ORGANIZE, ANALYZE, DECIDE, with ACQUIRE highlighted.]
How will you acquire live streams of unstructured data?
What Is Your Big Data Strategy?
What Do You Do With It Once You Have It?
[Same cycle diagram, with ORGANIZE highlighted.]
How will you organize big data so it can be integrated into your data center?
What Is Your Big Data Strategy?
How Do You Manipulate It Once You Have It?
[Same cycle diagram, with ANALYZE highlighted.]
What skill sets and tools will you use to analyze big data?
What Is Your Big Data Strategy?
What Do You Do After You’re Done?
[Same cycle diagram, with DECIDE highlighted.]
How will you share the analysis in real time?
Big Data In Action
Make Better Decisions Using Big Data
[Cycle diagram: ACQUIRE, ORGANIZE, ANALYZE, DECIDE.]
The Big Data Development Process
[Diagram contrasting Traditional BI (driven by change requests against a fixed
hypothesis) with Big Data (an iterative loop: identify data sources, explore
results, reduce ambiguity, refine models, improved hypothesis).]
Oracle’s Big Data Solution
[Pipeline diagram: Acquire, Organize & Discover, Analyze, Decide. Oracle Big Data
Appliance on the acquire/organize side, Oracle Exadata and Oracle Exalytics on the
analyze side, Oracle Real-Time Decisions on the decide side, with Endeca
Information Discovery for discovery, all linked by InfiniBand.]
Oracle’s Big Data Solution
Pre-Built And Optimized Out Of The Box
[Chart: performance achievement over time. A custom configuration takes months:
assembling dozens of components, multi-vendor finger pointing, testing and
debugging failure modes, then measuring, diagnosing, tuning and reconfiguring.
The pre-built appliance reaches 100% in days.]
Big Data Appliance Performance Comparisons
• 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs
• 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
Oracle Big Data Connectors
• Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database
• Oracle Direct Connector for HDFS: access data directly in HDFS using external tables
• ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH
• Oracle R Connector for Hadoop
[Diagram: the BDA connected to Oracle Exadata over InfiniBand, loading results
into Oracle Database at 12TB/hour.]
• The R open source environment for statistical computing and
graphics is growing in popularity for advanced analytics
• Widely taught in colleges and universities
• Popular among millions of statisticians
• R programs can run unchanged against
data residing in the Oracle Database
• Reduce latency
• Improve data security
• Augment results with powerful graphics
• Integrate R results and graphics with
OBIEE dashboards
Oracle Database Advanced Analytics Option
Oracle R Enterprise
Oracle Database Advanced Analytics Option
Oracle Data Mining
Problem, algorithm, and applicability:
• Classification: Logistic Regression (GLM), classical statistical technique; Decision Trees, popular / rules / transparency; Naïve Bayes, embedded app; Support Vector Machine, wide / narrow data / text
• Regression: Multiple Regression (GLM), classical statistical technique; Support Vector Machine, wide / narrow data / text
• Anomaly Detection: One Class Support Vector Machine (SVM), lack of examples
• Attribute Importance: Minimum Description Length (MDL), attribute reduction, identify useful data, reduce data noise
• Association Rules: Apriori, market basket analysis, link analysis
• Clustering: Hierarchical K-Means, Hierarchical O-Cluster, product grouping, text mining, gene and protein analysis
• Feature Extraction: Non-Negative Matrix Factorization (NMF), text analysis, feature reduction
• Ranking functions
• rank, dense_rank, cume_dist, percent_rank,
ntile
• Window Aggregate functions (moving and cumulative)
• Avg, sum, min, max, count, variance, stddev,
first_value, last_value
• LAG/LEAD functions
• Direct inter-row reference using offsets
• Reporting Aggregate functions
• Sum, avg, min, max, variance, stddev, count,
ratio_to_report
• Statistical Aggregates
• Correlation, linear regression family, covariance
• Linear regression
• Fitting of an ordinary-least-squares regression
line to a set of number pairs.
• Frequently combined with the COVAR_POP,
COVAR_SAMP, and CORR functions
Descriptive Statistics
• DBMS_STAT_FUNCS: summarizes numerical
columns of a table and returns count, min, max,
range, mean, median, stats_mode, variance,
standard deviation, quantile values, +/- n sigma
values, top/bottom 5 values
• Correlations
• Pearson’s correlation coefficients, Spearman's
and Kendall's (both nonparametric).
• Cross Tabs
• Enhanced with % statistics: chi squared, phi
coefficient, Cramer's V, contingency coefficient,
Cohen's kappa
• Hypothesis Testing
• Student t-test , F-test, Binomial test, Wilcoxon
Signed Ranks test, Chi-square, Mann Whitney
test, Kolmogorov-Smirnov test, One-way
ANOVA
• Distribution Fitting
• Kolmogorov-Smirnov Test, Anderson-Darling
Test, Chi-Squared Test, Normal, Uniform,
Weibull, Exponential
Oracle Database SQL Analytics
Included In The Oracle Database
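As a hedged illustration of calling one of these analytic functions from Java via JDBC (the connection string, credentials, and the trades table are all invented for the example; the Oracle JDBC driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LagExample {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@//dbhost:1521/orcl", "scott", "tiger"); // assumed
    Statement stmt = conn.createStatement();
    // LAG() compares each price with the previous one per symbol,
    // without a self-join.
    ResultSet rs = stmt.executeQuery(
        "SELECT symbol, trade_time, price, "
      + "       LAG(price, 1) OVER (PARTITION BY symbol "
      + "                           ORDER BY trade_time) AS prev_price "
      + "FROM trades");
    while (rs.next()) {
      System.out.printf("%s %s (prev %s)%n",
          rs.getString("symbol"), rs.getString("price"), rs.getString("prev_price"));
    }
    conn.close();
  }
}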
Oracle Big Data Ecosystem
[Diagram: the ACQUIRE, ORGANIZE, ANALYZE, DECIDE cycle, extended with
DISCOVER, VISUALIZE and STREAM.]
Having Said That…
Big Data Is More Than Just Hardware & Software
The Math Is The Hard Part
This is a very simple equation for a Fourier transformation of a wave kernel at 0.
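The equation itself was an image and did not survive extraction. As a stand-in (an assumption, not necessarily the slide's exact formula), the Fourier transform of a kernel K and its value at 0 are:

\hat{K}(\xi) = \int_{-\infty}^{\infty} K(x)\, e^{-2\pi i x \xi}\, dx,
\qquad
\hat{K}(0) = \int_{-\infty}^{\infty} K(x)\, dx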
The Math Is The Hard Part
This is a photograph of a data scientist’s white board at Bit.ly
Data Scientists Are Expensive And Hard To Find
• Typical Job Description:
“Ph.D. in data mining, machine
learning, statistical analysis,
applied mathematics or equivalent;
three-plus years hands-on practical
experience with large-scale data
analysis; and fluency in analytical
tools such as SAS, R, etc.”
• Looking For “baIT”:
• Business
• Analytics
• IT
All in the same person
These people exist, but are
very expensive.
Growing Your Own Data Scientist
• Business Acumen
• Familiarity/Likes Computational
Linear Algebra / Matrix Analysis
• Interest in SAS, R, Matlab
• Familiarity/Likes Lisp
Big Data Cannot Do Everything
Big Data Cannot Do Everything
Big Data Is A Great Tool
But Not A Silver Bullet
You would never run a POS system on Hadoop; Hadoop is far too batch oriented
to support this type of activity. Similarly, random access of data does not work
well in the Hadoop world.
When Big Data? When Relational?
[Chart: size of data (rough measure).]
When Big Data? When Relational?
RDBMS vs Hadoop: A Comparison
RDBMS | Hadoop
• Fully SQL compliant | Helper languages (Hive, Pig)
• Many RDBMS vendors extend SQL in useful ways | Very useful but not as robust as SQL
• Optimized for query performance; tunable (input vs output, long running queries, etc.) | Optimized for analytics operations, specifically those of a statistical nature
• Armies of trained and available resources | Resources are hard to find and expensive when found
• Requires more specialized hardware at performance extremes | Designed to work on commodity hardware at all levels
• OLTP, OLAP, ODS, DSS, hybrid; more general purpose | Basically only for analytics
• Expensive to implement over wide geographical distribution | Designed to span data centers
• Very mature technology | Very new technology
• Real time or batch processing | Batch operations only
• Nontrivial licensing costs | Open source ("free"-ish)
• About 2 PB as largest commercial cluster (a telecom company) | 100+ PB as largest commercial cluster (Facebook, as of March 2013)
• Ad hoc operations common, if not encouraged | Ad hoc operations possible with HBase, but nontrivial
It Is Not An “Either/Or” Choice
RDBMS and Hadoop Each Solve Different Problems
Where Are Things Heading?
A Quick Recap
GFS
Presented To The
Public In 2003
MapReduce
Presented To The
Public in 2004
Hadoop Is Already Dead?
Yes… Sort Of*
* = for a specific set of problems…
The New Stuff In Overview

Colossus (pub. n/a). Use: GFS for realtime systems. Open source: no.

Caffeine (2009). Use: real time search. What it does: incremental updates of analytics and indexes in real time. Impact: estimated to be 100x faster than Hadoop. Open source: no.

Pregel (2009). Use: social graphs, location graphs, learning & discovery, network optimization, the Internet of Things. What it does: analyzes next-neighbor problems. Impact: estimated to handle billions of nodes & trillions of edges. Open source: alpha (Apache Giraph).

Percolator (2010). Use: large scale incremental processing using distributed transactions. What it does: makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Impact: data in the environment remains much more up to date with less effort.

Dremel (2010). Use: SQL-like language for queries on the above technologies. What it does: interactive, ad hoc queries over trillion-row tables in subsecond time; works against Caffeine / Pregel / Colossus without requiring MapReduce. Impact: easier for analysts and non-technical people to be productive (i.e. not as many data scientists are required). Open source: very alpha (Apache Drill, Incubator).

Spanner (Oct 2012). Use: fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe. What it does: uses GPS sensors and atomic clocks to keep the clocks of servers in sync regardless of location or other factors. Impact: transactional support on a global scale at a fraction of the cost, and where (many times) not technically possible otherwise. Open source: no, and unlikely to ever be.

Storm (2012). Use: real time Hadoop-like processing (not from Google; from Twitter). What it does: the power of Hadoop in real time. Impact: eliminates the requirement for batch processing. Open source: yes (beta*).
One Last Thing
Hadoop Is Just The Start Of The Equation
One Last Thing
Hadoop For Analytics And Determining Boundary Conditions
Is Just The Start Of The Equation
Use Hadoop to analyze all of the data in your environment and then generate
mathematical models from that data.
One Last Thing
Acting On Boundary Conditions
Once the model has been built (and vetted), it can be used to resolve events in
real time, thereby getting around the batch bottleneck of Hadoop.
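A toy sketch of that pattern (all coefficients and the threshold are invented): the model is trained offline in Hadoop, but scoring an individual event reduces to cheap arithmetic that can run in real time.

public class BoundaryCheck {

  // Linear score learned offline in Hadoop (assumed):
  // score = w0 + w1*x1 + w2*x2
  private static final double W0 = -1.2, W1 = 0.8, W2 = 2.1;
  private static final double THRESHOLD = 3.0; // vetted boundary (assumed)

  /** True if the event crosses the boundary and needs action now. */
  public static boolean outOfBounds(double x1, double x2) {
    return W0 + W1 * x1 + W2 * x2 > THRESHOLD;
  }

  public static void main(String[] args) {
    System.out.println(outOfBounds(2.0, 2.0)); // score 4.6 > 3.0, prints true
  }
}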
No Really. One More Last Thing
Who Is Hilary Mason?
• Chief Data Scientist At bit.ly
• One of the major innovators
in data science
• Scary smart and fun to be around
• A heck of a teacher, to boot
Photo credit: Pinar Ozger, Strata 2011
Interpret
The end goal of any Big Data solution is to provide data which can be interpreted
into meaningful decisions. But, before we can interpret the data, we must first…
The Mason 5 Step Process For Big Data
In Reverse Order
Model
Model the data into a useful paradigm which will allow us to make sense of any
new data based on past experiences. But, before we can model the data, we must
first….
The Mason 5 Step Process For Big Data
In Reverse Order
Explore
Explore the data we have and look for meaningful patterns from which we could
extract a useful model. But, before we can look through the data for meaningful
patterns, we first have to…
The Mason 5 Step Process For Big Data
In Reverse Order
Scrub
Clean and clarify the data we have to make it as neat as possible and easier to
manipulate. But, before we can clean the data, we have to start with…
The Mason 5 Step Process For Big Data
In Reverse Order
Obtain
Obtaining as much data as possible. Advances in technology – coupled with
Moore’s law – means that DASD is very, very cheap these days. So much so that
you may as well hang on to as much data as you can, because you never know
when it will prove useful.
The Mason 5 Step Process For Big Data
In Reverse Order
Questions?
Some Resources
White Papers:
• An Architect’s Guide To Big Data
• Big Data For The Enterprise
• Big Data Gets Real Time
• Build vs. Buy For Hadoop
This Deck:
Slideshare
Web Resources:
• Oracle Big Data
• Oracle Big Data Appliance
• Oracle Big Data Connectors
Me:
charles dot scyphers at oracle dot com
@scyphers (twitter)
153

Weitere ähnliche Inhalte

Was ist angesagt?

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profilingShailja Khurana
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceDenodo
 
Big Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesBig Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesSlideTeam
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...HostedbyConfluent
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for DinnerKent Graziano
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality DashboardsWilliam Sharp
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
The data quality challenge
The data quality challengeThe data quality challenge
The data quality challengeLenia Miltiadous
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes Minio
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance WorkshopCCG
 
Classification of data mart
Classification of data martClassification of data mart
Classification of data martkhush_boo31
 
Power BI : A Detailed Discussion
Power BI : A Detailed DiscussionPower BI : A Detailed Discussion
Power BI : A Detailed DiscussionSwatiTripathi44
 

Was ist angesagt? (20)

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Data Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and GovernanceData Catalog for Better Data Discovery and Governance
Data Catalog for Better Data Discovery and Governance
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Big Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation SlidesBig Data Analytics Architecture PowerPoint Presentation Slides
Big Data Analytics Architecture PowerPoint Presentation Slides
 
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
How to Build the Data Mesh Foundation: A Principled Approach | Zhamak Dehghan...
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Data Mesh for Dinner
Data Mesh for DinnerData Mesh for Dinner
Data Mesh for Dinner
 
Data Quality Dashboards
Data Quality DashboardsData Quality Dashboards
Data Quality Dashboards
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
The data quality challenge
The data quality challengeThe data quality challenge
The data quality challenge
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Building modern data lakes
Building modern data lakes Building modern data lakes
Building modern data lakes
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Data Governance Workshop
Data Governance WorkshopData Governance Workshop
Data Governance Workshop
 
Classification of data mart
Classification of data martClassification of data mart
Classification of data mart
 
Power BI : A Detailed Discussion
Power BI : A Detailed DiscussionPower BI : A Detailed Discussion
Power BI : A Detailed Discussion
 

Andere mochten auch

Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewSivashankar Ganapathy
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionDataStax
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
Top 10 campus interview questions with answers
Top 10 campus interview questions with answersTop 10 campus interview questions with answers
Top 10 campus interview questions with answerstoddharry267
 
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsCambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsSmart Villages
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in HueRomain Rigaux
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course pptNjain85
 
Basic erp concepts
Basic erp conceptsBasic erp concepts
Basic erp conceptsmukki4u
 

Andere mochten auch (20)

Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
Top 10 campus interview questions with answers
Top 10 campus interview questions with answersTop 10 campus interview questions with answers
Top 10 campus interview questions with answers
 
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applicationsCambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
Cambridge | Jan-14 | Biomass-fuelled Stirling Engine for off-grid applications
 
Beeswax Hive editor in Hue
Beeswax Hive editor in HueBeeswax Hive editor in Hue
Beeswax Hive editor in Hue
 
Physical features of canada
Physical features of canadaPhysical features of canada
Physical features of canada
 
Hadoop for beginners free course ppt
Hadoop for beginners   free course pptHadoop for beginners   free course ppt
Hadoop for beginners free course ppt
 
Basic erp concepts
Basic erp conceptsBasic erp concepts
Basic erp concepts
 

Ähnlich wie Big Data: An Overview

Ähnlich wie Big Data: An Overview (20)

Anju
AnjuAnju
Anju
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big data analysis using hadoop cluster
Big data analysis using hadoop clusterBig data analysis using hadoop cluster
Big data analysis using hadoop cluster
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
1.demystifying big data & hadoop
1.demystifying big data & hadoop1.demystifying big data & hadoop
1.demystifying big data & hadoop
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big data
Big dataBig data
Big data
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 

Kürzlich hochgeladen

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Kürzlich hochgeladen (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Big Data: An Overview

  • 1. <Insert Picture Here> Big Data: An Overview
  • 2. What Is Big Data?
  • 3. What Is Big Data? • Big Data is not simply a huge pile of information • A good starting place is the following paraphrase: “Big Data describes datasets so large they become awkward to manage with traditional database tools at a reasonable cost.”
  • 4. VOLUME VELOCITY VARIETY VALUE SOCIAL BLOG SMART METER 101100101001 001001101010 101011100101 010100100101 A Breakdown Of What Makes Up Big Data
  • 5. Data Growth Explosion • 1 GB of stored content can create 1 PB of data in transit Data & Image courtesy of IDC • The totality of stored data is doubling about every 2 years • This meant 130 EB in 2005 • 1227 EB in 2010 (1.19 ZB) • 7910 EB in 2015 (7.72 ZB)
  • 6. 2005 20152010 • More than 90% is unstructured data and managed outside Relational Database • Approx. 500 quadrillion files • Quantity doubles every 2 years 1.8 trillion gigabytes of data was created in 2011… 10,000 0 GBofData (INBILLIONS) STRUCTURED DATA (MANAGED INSIDE RELATIONAL DATABASE) UNSTRUCTURED DATA (MANAGED OUTSIDE RELATIONAL DATABASE) Growth Of Big Data Harnessing Insight From Big Data Is Now Possible
  • 7. So, Just Any Dataset? • Big Data Can Work With Any Dataset • However, Big Data Shines When Dealing With Unstructured Data
  • 8. Structured Vs. Unstructured Structured Data is any data to which a pre-defined data model can be applied in an automated fashion, producing a semantically meaningful result without referencing outside elements. In other words, if you can apply some template to a data set and have it instantly make sense to the average person, it’s structured. If you can’t, it’s unstructured.
  • 9. Really? Only Two Categories? Okay, there’s also semi-structured data. Which basically means after the template is applied, some of the result will make sense and some will not. XML is a classic example of this kind of data.
  • 10. Formal Definitions Of Data Types Structured Data: Entities in the same group have the same descriptions (or attributes), while descriptions for all entities in a group (or schema): a) have the same defined format; b) have a predefined length; c) are all present; and d) follow the same order. Structured data are what is normally associated with conventional databases such as relational transactional ones where information is organized into rows and columns within tables. Spreadsheets are another example. Nearly all traditional database management systems (DBMS) are designed for structured data. Semi-Structured Data: Semi-structured data are intermediate between structured and unstructured data, wherein “tags” or “structure” are associated or embedded within unstructured data. Semi-structured data are organized in semantic entities, similar entities are grouped together, entities in the same group may not have the same attributes, the order of attributes is not necessarily important, not all attributes may be required, and the size or type of the same attributes in a group may differ. To be organized and searched, semi-structured data should be provided electronically from database systems, file systems (e.g., bibliographic data, Web data) or via data exchange formats (e.g., EDI, scientific data, XML). Unstructured Data: Data can be of any type and do not necessarily follow any format or sequence, do not follow any rules, are not predictable, and can generally be described as “free form.” Examples of unstructured data include text, images, video or sound (the latter two also known as “streaming media”). Generally, “search engines” are used for retrieval of unstructured data via querying on keywords or tokens that are indexed at time of the data ingest.
  • 11. Informal Definitions Of Data Types Structured Data: Fits neatly into a relational structure. Semi-Structured Data: Think documents or EDI. Unstructured Data: Can be anything. Text Video Sound Images
  • 12. Tools For Dealing With Semi/Un-Structured Data
  • 13. What Is Hadoop? “The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • 14. The Paradigm Shift Of Hadoop Centralized Processing Doesn’t Work Moving data to a central server for processing (like, say, Informatica) cannot scale. You can only buy a machine so big.
  • 15. The Paradigm Shift Of Hadoop Bandwidth Is The Bottleneck • Moving data around is expensive. • Bandwidth $$ > CPU $$
  • 16. The Paradigm Shift Of Hadoop Process The Data Locally Where It Lives
  • 17. The Paradigm Shift Of Hadoop Then Return Only The Results • You move much less data around this way • You also gain the advantage of greater parallel processing
  • 18. Where Did Hadoop Originate? GFS Presented To The Public In 2003 MapReduce Presented To The Public in 2004
  • 19. Spreading Out From Google Doug Cutting was working on “Nutch”, Yahoo’s next generation search engine, when he read the Google papers and reverse engineered the technology. The elephant was his son’s toy, named….
  • 20. Going Open Source HDFS MapReduce Released To Public 2006
  • 21. A Bit More In Depth, Then A Lot More In Depth HDFS MapReduce HDFS is primarily a data redundancy solution. MapReduce is where the work gets done.
  • 22. How Hadoop Works Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm
  • 23. GFS / HDFS HDFS Distributes Files At The Block Level Across Multiple Commodity Devices For Redundancy On The Cheap Not RAID: Distribution Is Across Machines/Racks
  • 24. Data Distribution By Default, HDFS Writes Into Blocks & The Blocks Are Distributed x3
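To make that distribution concrete, here is a toy Python sketch of HDFS-style block placement. This is illustrative only, not actual HDFS code; the block size, rack layout, and placement rule below are simplifying assumptions (real HDFS prefers the writer's own node for the first replica, then places two replicas on a single remote rack).

    # Toy illustration of HDFS-style block splitting and 3x placement.
    # Not real HDFS code; sizes and topology are assumptions.
    BLOCK_SIZE = 128 * 1024 * 1024   # bytes; the HDFS block size is configurable
    REPLICATION = 3

    racks = {
        "rack1": ["node1", "node2", "node3"],
        "rack2": ["node4", "node5", "node6"],
    }

    def block_count(file_size_bytes):
        # A file occupies ceil(size / block_size) blocks.
        return (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE

    def place_block(block_id):
        # One replica on the "local" rack, two on a single remote rack.
        local = racks["rack1"][block_id % 3]
        remote = racks["rack2"]
        return [local, remote[block_id % 3], remote[(block_id + 1) % 3]]

    for b in range(block_count(300 * 1024 * 1024)):   # a 300 MB file -> 3 blocks
        print("block", b, "->", place_block(b))

Losing any one node, or even a whole rack, still leaves at least one copy of every block, which is the point of the 3x distribution.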
  • 25. WORM Data Is Written Once & (Basically) Never Erased
  • 26. How Is The Data Manipulated? Not Random Reads Data Is Read From The Stream In Large, Contiguous Chunks
  • 27. The Key To Hadoop Is MapReduce In a Shared Nothing architecture, programmers must break the work down into distinct segments that are: • Autonomous • Digestible • Independently processable • Written with the expectation of incipient failure at every step
  • 28. A Canonical MapReduce Example Image Credit: Martijn van Groningen
  • 29. The data arrives into the system. A MapReduce Example The Input
  • 30. The data is moved into the HDFS system, divided into blocks, each of which is copied multiple times for redundancy. A MapReduce Example Splitting The Input Into Chunks
  • 31. The Mapper picks up a chunk for processing. The MR Framework ensures only one mapper will be assigned to a given chunk A MapReduce Example Mapping The Chunks
  • 32. In this case, the Mapper emits a word with the number of times it was found. A MapReduce Example Mapping The Chunks
  • 33. The Shuffler can do a rough sort of like items (optional) A MapReduce Example A Shuffle Sort
  • 34. The Reducer combines the Mapper’s output into a total A MapReduce Example Reducing The Emissions
  • 35. The job completes with a numeric index of words found within the original input. A MapReduce Example The Output
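The walkthrough above compresses into a few lines of Python. This is a single-process sketch of the map, shuffle, and reduce phases, not cluster code; on a real cluster the framework performs the sort/shuffle and runs many mappers and reducers in parallel.

    # Single-process sketch of the word-count flow described above.
    from itertools import groupby

    def mapper(line):
        for word in line.split():
            yield (word.lower(), 1)          # emit (word, 1) per occurrence

    def reducer(word, counts):
        return (word, sum(counts))           # total the emissions per word

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    pairs = [kv for line in lines for kv in mapper(line)]   # map phase
    pairs.sort(key=lambda kv: kv[0])                        # shuffle/sort phase

    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(reducer(word, (n for _, n in group)))         # reduce phase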
  • 36. MapReduce Is Not Only Hadoop http://blogs.oracle.com/datawarehousing/2009/10/in-database_map-reduce.html MapReduce is a programming paradigm, not a language. You can do MapReduce within an Oracle database; it’s just usually not a good idea. A large MapReduce job would quickly exhaust the SGA of any Oracle environment.
  • 37. Problem Solving With MapReduce • The key feature is the Shared Nothing architecture. • Any MapReduce program has to understand and leverage that architecture. • This is usually a paradigm shift for most programmers and one that many cannot overcome.
  • 38. Programming With MapReduce • HDFS & MapReduce Is Written In Java 1. package org.myorg; 2. 3. import java.io.*; 4. import java.util.*; 5. 6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.filecache.DistributedCache; 8. import org.apache.hadoop.conf.*; 9. import org.apache.hadoop.io.*; 10. import org.apache.hadoop.mapreduce.*; 11. import org.apache.hadoop.mapreduce.lib.input.*; 12. import org.apache.hadoop.mapreduce.lib.output.*; 13. import org.apache.hadoop.util.*; 14. 15. public class WordCount2 extends Configured implements Tool { 16. 17. public static class Map 18. extends Mapper<LongWritable, Text, Text, IntWritable> { 19. 20. static enum Counters { INPUT_WORDS } 21. 22. private final static IntWritable one = new IntWritable(1); 23. private Text word = new Text(); 24. 25. private boolean caseSensitive = true; 26. private Set<String> patternsToSkip = new HashSet<String>(); 27. 28. private long numRecords = 0; 29. private String inputFile; 30. 31. public void setup(Context context) { 32. Configuration conf = context.getConfiguration(); 33. caseSensitive = conf.getBoolean("wordcount.case.sensitive", true); 34. inputFile = conf.get("mapreduce.map.input.file"); 35. 36. if (conf.getBoolean("wordcount.skip.patterns", false)) { 37. Path[] patternsFiles = new Path[0]; 38. try { 39. patternsFiles = DistributedCache.getLocalCacheFiles(conf); 40. } catch (IOException ioe) { 41. System.err.println("Caught exception while getting cached files: " 42. + StringUtils.stringifyException(ioe)); 43. } 44. for (Path patternsFile : patternsFiles) { 45. parseSkipFile(patternsFile); 46. } 47. } 48. } 49. 50. private void parseSkipFile(Path patternsFile) { 51. try { ,,,,,, • Will Work With Any Language Supporting STDIN/STDOUT • Lots Of People Using Python, R, Matlab, Perl, Ruby et al • Is Still Very Immature & Requires Low Level Coding
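Since anything that reads STDIN and writes STDOUT will do, the same word count can be written as a pair of Hadoop Streaming scripts. A minimal Python sketch; the location of the streaming jar varies by distribution, so the run command in the final comment is an assumption.

    #!/usr/bin/env python
    # mapper.py: read raw text on STDIN, emit one "word<TAB>1" line per word.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py: streaming delivers mapper output sorted by key, so equal
    # words arrive contiguously and a running total is enough.
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

    # Run (jar location varies by distribution):
    # hadoop jar hadoop-streaming.jar -input /in -output /out \
    #   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py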
  • 39. What Are Some Big Data Use Cases? • Inverse Frequency / Weighting • Co-Occurrence • Behavioral Discovery • “The Internet Of Things” • Classification / Machine Learning • Sorting • Indexing • Data Intake • Language Processing Basically, Clustering And Targeting
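To make the first item on that list concrete: inverse frequency weighting scores a term higher the more often it appears in one document and the rarer it is across the whole collection. The standard TF-IDF formulation, with N documents in collection D, is:

    \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\left|\{\, d' \in D : t \in d' \,\}\right|}

A term that appears in every document gets weight \log(N/N) = 0, which is why common words like “the” carry no signal.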
  • 41. Co-Occurrence Fundamental Data Mining – People Who Did This Also Do That
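A minimal sketch of the counting behind “people who did this also do that”, assuming each transaction is just a list of item ids (the items here are invented):

    # Count how often item pairs occur together across transactions.
    from itertools import combinations
    from collections import Counter

    transactions = [
        ["brakes", "rotors", "pads"],
        ["pads", "rotors"],
        ["wipers", "pads"],
    ]

    pair_counts = Counter()
    for basket in transactions:
        # sorted() makes (a, b) and (b, a) count as the same pair
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1

    # The most frequent pairs are the "also bought" recommendations.
    print(pair_counts.most_common(3))

On a cluster the same logic becomes a mapper that emits the pairs and a reducer that sums them, exactly like the word count earlier.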
  • 43. Behavioral Discovery “The best minds of my generation are thinking about how to make people click ads.” Jeff Hammerbacher, Former Research Scientist at Facebook Currently Chief Scientist at Cloudera
  • 44. “The Internet Of Things” “Data Exhaust”
  • 46. Sorting Current Record Holder: •10PB sort •8000 nodes •6 hours, 27 minutes •September 7, 2011 Current Record Holder: •1.5 TB •2103 nodes •59 seconds •February 26, 2013
  • 48. Data Intake Hadoop can be used as a massive parallel ETL tool; Flume to ingest files, MapReduce to transform them.
  • 49. Language Processing Includes Sentiment Analysis How can you infer meaning from someone’s words? Does that smile mean happy? Sarcastic? Bemusement? Anticipation?
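The simplest possible starting point is a lexicon lookup. A toy sketch with invented word lists; it shows the mechanics, and everything it ignores (negation, sarcasm, context) is precisely why sentiment is hard:

    # Toy lexicon-based sentiment scorer. Real systems need far more.
    POSITIVE = {"comfortable", "great", "love", "happy"}
    NEGATIVE = {"broken", "hate", "awful", "sad"}

    def score(text):
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(score("the straps on this new rucksack are so comfortable"))  # 1
    print(score("the maps app is awful and broken"))                    # -2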
  • 50. How Can Big Data Help You? 9 Use Cases: • Natural Language Processing • Internal Misconduct • Fraud Detection • Marketing • Risk Management • Compliance / Regulatory Reporting • Portfolio Management • IT Optimization • Predictive Analysis
  • 52. Predictive Analysis Think data mining on steroids. One of the main benefits Hadoop brings to the enterprise is the ability to analyze every piece of data, not just a statistical sample or an aggregated form of the entire datastream.
  • 53. Risk Management Photo credit: Guinness World Records (88 catches, by the way)
  • 54. When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from more sensitive positions. Risk Management Behavioral Analysis
  • 55. Fraud Detection “Dear Company: I hurt myself working on the line and now I can’t walk without a cane.” Then he tells his Facebook friends he’s going to his house in Belize for some waterskiing.
  • 56. Internal Misconduct One of the reasons why the FBI was able to close in on the identities of the people involved is that they geolocated the sender and recipient of the Gmail emails and connected those IP addresses with known users on those same IP addresses.
  • 57. Portfolio Management • Evaluate portfolio performance on existing holdings • Evaluate portfolio for future activities • High speed arbitrage trading • Simply keeping up: "Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row” 10,000 credit card transactions per second Statistics courtesy of ComputerWorld, April 2012
  • 58. Sentiment Analysis – Social Network Analysis Companies used to rely on warranty cards and the like to collect demographic data. People either did not fill out the forms or did so with inaccurate information.
  • 59. Sentiment Analysis – Social Network Analysis People are much more likely to be truthful when talking to their friends.
  • 60. Sentiment Analysis – Social Network Analysis This person – and 20 of their friends – are talking about the NFL. This person is a runner Someone likes Kindle Someone is current with pop music
  • 61. Sentiment Analysis – Social Network Analysis Even Where You Least Expect It. You Might Be Thinking Something Like “My Customer Will Never Use Social Media For Anything I Care About. No Sergeant Is Ever Going To Tweet ‘The Straps On This New Rucksack Are So Comfortable!!!’”
  • 62. Sentiment Analysis – Social Network Analysis Internal Social Networking At Customer Sites • Oracle already uses an internal social network to facilitate work. • The US Military is beginning to explore a similar type of environment. • It is not unreasonable to plan for the DoD installing a network on base; your company could incorporate feedback from end users into design decisions.
  • 63. Sentiment Analysis – Apple iOS6, Maps & Stock Price Apple released iOS6 with their own version of Maps. It has had some issues, to put it mildly. Photo courtesy of http://theamazingios6maps.tumblr.com/
  • 64. Sentiment Analysis – Apple iOS6, Maps & Stock Price Over half of all trades in the US are initiated by a computer algorithm. Source: Planet Money (NPR) Aug 2012
  • 65. Sentiment Analysis – Apple iOS6, Maps & Stock Price Photo courtesy of http://theamazingios6maps.tumblr.com/ People started to tweet about the maps problem, and it went viral (to the point that someone created a Tumblr blog to make fun of Apple’s fiasco).
  • 66. Sentiment Analysis – Apple iOS6, Maps & Stock Price Photo courtesy of http://theamazingios6maps.tumblr.com/ As the Twitter stream started to peak, Apple’s stock price took a short dip. I believe it likely that automatic trading algorithms started to sell off Apple based on the negative sentiment analysis from Twitter and Facebook.
  • 70. Natural Language Processing React To Competitor’s Missteps
  • 71. Natural Language Processing Cultural Fit For Hires As of Apr 22, there were 724 Hadoop openings in the DC area. There will be hundreds – if not thousands – of applicants for each position. How can you determine who is the most appropriate candidate, not just technically, but culturally?
  • 72. Natural Language Processing Cultural Fit? A good way to think of cultural fit is the “airport test.” If you’re thinking of hiring someone and you had to sit with them in an airport for a few hours because of a delayed flight, would that make you happy? Or would you cringe at the thought of hours of forced conversation?
  • 73. Natural Language Processing Analyze Their Writings For Cultural Fit Go beyond simple keyword searches to find out more about the person. Regardless of what their resume says, language analysis can reveal details about where they grew up and where they experienced their formative years.
  • 74. Do they say “faucet” or “spigot”? “Wallet” or “billfold”? “Dog”, “hound” or “hound dog”? “Groovy”, “cool”, “sweet” or “off the hook”? While these words are synonyms, they carry cultural connotations with them. Find candidates with the same markers as your existing team for a more cohesive unit. Natural Language Processing Analyze Their Writings For Cultural Fit
  • 76. IT Optimization – Enabling The Environment I’m running out of supplies! I’m overheating! Everything Is Fine. Wheel 21 is out of alignment. I’m 42.4% full.
  • 77. IT Optimization – Enabling The Shop Floor A More Specific Example I’m 42.4% full.
  • 78. IT Optimization – Enabling The Shop Floor Make The Trash Smart We can make the trash bins “smart” by putting a wifi enabled scale beneath each bin and using that to determine when the bins are reaching capacity.
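A sketch of the check itself, assuming each scale reports a (bin id, current weight) pair; the bin capacities and the 80% alert threshold are invented for the example:

    # Flag bins approaching capacity from wifi scale readings.
    FULL_KG = {"bin-07": 40.0, "bin-12": 40.0, "bin-31": 55.0}  # hypothetical
    ALERT_AT = 0.80                                             # flag at 80% full

    def bins_needing_service(readings):
        for bin_id, kg in readings:
            fill = kg / FULL_KG[bin_id]
            if fill >= ALERT_AT:
                yield bin_id, fill

    readings = [("bin-07", 17.0), ("bin-12", 34.4), ("bin-31", 53.9)]
    for bin_id, fill in bins_needing_service(readings):
        print("%s is %.0f%% full" % (bin_id, fill * 100))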
  • 79. As of now, the custodian has to check each bin to see if it is full. With a “smart” bin, the custodian can check his smart phone and see what does and does not need to be done. IT Optimization – Enabling The Shop Floor Cut Down On Clean Up Labor
  • 80. More importantly, we can now focus on what is happening to the bins and how they are being used. For example, we may find outliers where one bin is filling much faster than all of the others. IT Optimization – Enabling The Shop Floor Cut Down On Clean Up Labor
  • 81. “Data Exhaust” We can drill into why that bin is filling faster, leverage the Six Sigma efficiency processes already in place and improve the overall performance of the line. IT Optimization – Enabling The Shop Floor Drilling Into Waste Production
  • 82. IT Optimization – Classify Legacy Data A customer can use a machine learning process to take unknown data and sort it into useful data elements. For example, a retail car part company might use this process to sort photos – is that circle a steering wheel, a hubcap or a tire?
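A minimal sketch of such a classifier using scikit-learn, assuming feature vectors have already been extracted from the photos (a real pipeline would compute them from pixels first); the feature names, labels, and values are all invented:

    # Classify part photos from precomputed feature vectors.
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical features: (roundness, spoke_count, rubber_ratio)
    X_train = [[0.90, 3, 0.0], [0.95, 8, 0.1], [0.99, 0, 0.9]]
    y_train = ["steering wheel", "hubcap", "tire"]

    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    print(clf.predict([[0.97, 1, 0.85]]))   # -> ['tire']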
  • 83. So, All We Need Is Hadoop, Right? Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle): • Security • Ad-hoc Query Support • SQL Support • Readily Available Technical Resources
  • 84. Then How Do We Fix Those Problems? In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis.
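One simple way to do that hand-off, sketched in Python with the cx_Oracle driver; the table, file, and connection details are hypothetical, and at scale the Oracle Big Data Connectors described below do this far faster:

    # Load summarized Hadoop output (a small CSV) into Oracle for BI.
    import csv
    import cx_Oracle

    conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl")  # hypothetical
    cur = conn.cursor()

    with open("wordcounts.csv") as f:
        rows = [(word, int(n)) for word, n in csv.reader(f)]

    cur.executemany("INSERT INTO word_counts (word, total) VALUES (:1, :2)", rows)
    conn.commit()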
  • 85. Oracle’s Big Data Appliance
  • 86. Oracle’s Big Data Appliance In Depth
  • 87. Big Data Appliance The Specs Of The Machine Hardware: • 18 Compute/Storage Nodes (216 cores, 864G RAM (2.5T max), 648T storage in total) • 2 six-core Intel processors per node • 48G Memory per node (up to 144G) • 12x3TB SAS Disks per node • 3 InfiniBand Switches • Ethernet Switch, KVM, PDU • 42U Rack Software: • Oracle Linux • Java Virtual Machine • Cloudera Hadoop Distribution • R (statistical programming language) • Oracle NoSQL Database Environmental: • 12.25 kVA (12.0 kW) Power Draw • 41k BTU/hr (42k kJ/hr) Cooling • 1886 CFM Airflow
  • 88. Big Data Appliance The Cloudera Distribution
  • 89. The Analytics Evolution What Is Happening In The Industry [chart: competitive advantage vs. degree of complexity; source: Competing On Analytics: The New Science Of Winning, Thomas Davenport & Jeanne Harris, 2007]
Descriptive (analyzing data to determine what has happened or is happening now): • Standard Reporting: What Happened? • Ad Hoc Reporting: How Many, How Often, Where? • Query/Drill Down: What Exactly Is The Problem?
Predictive (examining data to discover whether trends will continue into the future): • Alerts: What Actions Are Needed? • Simulation: What Could Happen…? • Forecasting: What If These Trends Continue? • Predictive Modeling: What Will Happen Next If…?
Prescriptive (studying data to evaluate the best course of action for the future): • Optimization: How Can We Achieve The Best Outcome? • Stochastic Optimization: How Can We Achieve The Best Outcome, Including The Effects Of Variability?
  • 90. The Analytics Evolution Where Big Data Fits On This Model [same chart as the previous slide, annotated to show that Big Data best fits the upper, more complex tiers: predictive modeling, optimization and stochastic optimization]
  • 91. Typical Stages In Analytics Choosing The Right Solutions For The Right Data Needs [chart: stages running from initial data discovery to predictive analytics; investment is growing at both ends of the spectrum]
  • 92. The Data Warehouse Evolution What Are Oracle’s Customers Deploying Today? [chart: increasing business value vs. information architecture maturity] • Data Marts: what happened yesterday • Consolidated Data Warehouse: what is happening today (most are here!) • Big Data & analytics diversity: what could happen tomorrow (some are here, with growing investment)
  • 93. What Is Your Big Data Strategy? Where Does Your Data Originate? (Acquire → Organize → Analyze → Decide) How will you acquire live streams of unstructured data?
  • 94. What Is Your Big Data Strategy? What Do You Do With It Once You Have It? How will you organize big data so it can be integrated into your data center?
  • 95. What Is Your Big Data Strategy? How Do You Manipulate It Once You Have It? What skill sets and tools will you use to analyze big data?
  • 96. What Is Your Big Data Strategy? What Do You Do After You’re Done? How will you share the analysis in real time?
  • 97. Big Data In Action Make Better Decisions Using Big Data (Acquire → Organize → Analyze → Decide)
  • 98. The Big Data Development Process [diagram] Traditional BI: requirements are known up front and evolve through change requests. Big Data: a fluid loop of Hypothesis → Identify Data Sources → Explore Results → Reduce Ambiguity → Refine Models → Improved Hypothesis.
  • 99. Oracle’s Big Data Solution Acquire: Oracle Big Data Appliance → (InfiniBand) → Organize & Discover: Oracle Exadata, Endeca Information Discovery → (InfiniBand) → Analyze: Oracle Exalytics → Decide: Oracle Real-Time Decisions
  • 100. Oracle’s Big Data Solution Pre-Built And Optimized Out Of The Box [chart: performance achievement over time] A custom configuration takes months to reach 100%: assemble dozens of components, endure multi-vendor finger pointing, test & debug failure modes, then measure, diagnose, tune and reconfigure. The pre-built appliance reaches the same point in days.
  • 101. Big Data Appliance Performance Comparisons • 6x faster than a custom 20-node Hadoop cluster for large batch transformation jobs • 2.5x faster than a 30-node Hadoop cluster for tagging and parsing text documents
  • 102. Oracle Big Data Connectors • Oracle Loader for Hadoop (OLH): a MapReduce utility to optimize data loading from HDFS into Oracle Database • Oracle Direct Connector for HDFS: access data directly in HDFS using external tables • ODI Application Adapter for Hadoop: ODI Knowledge Modules optimized for Hive and OLH • Oracle R Connector for Hadoop • Load results into Oracle Database at 12TB/hour (BDA → InfiniBand → Oracle Exadata)
  • 103. • The R open source environment for statistical computing and graphics is growing in popularity for advanced analytics • Widely taught in colleges and universities • Popular among millions of statisticians • R programs can run unchanged against data residing in the Oracle Database • Reduce latency • Improve data security • Augment results with powerful graphics • Integrate R results and graphics with OBIEE dashboards Oracle Database Advanced Analytics Option Oracle R Enterprise
  • 104. Oracle Database Advanced Analytics Option Oracle Data Mining [table: problem | algorithm | applicability]
Classification | Logistic Regression (GLM), Decision Trees, Naïve Bayes, Support Vector Machine | Classical statistical technique; popular / rules / transparency; embedded app; wide / narrow data / text
Regression | Multiple Regression (GLM), Support Vector Machine | Classical statistical technique; wide / narrow data / text
Anomaly Detection | One Class Support Vector Machine (SVM) | Lack of examples
Attribute Importance | Minimum Description Length (MDL) | Attribute reduction; identify useful data; reduce data noise
Association Rules | Apriori | Market basket analysis; link analysis
Clustering | Hierarchical K-Means, Hierarchical O-Cluster | Product grouping; text mining; gene and protein analysis
Feature Extraction | Non-Negative Matrix Factorization (NMF) | Text analysis; feature reduction
  • 105. • Ranking functions • rank, dense_rank, cume_dist, percent_rank, ntile • Window Aggregate functions (moving and cumulative) • Avg, sum, min, max, count, variance, stddev, first_value, last_value • LAG/LEAD functions • Direct inter-row reference using offsets • Reporting Aggregate functions • Sum, avg, min, max, variance, stddev, count, ratio_to_report • Statistical Aggregates • Correlation, linear regression family, covariance • Linear regression • Fitting of an ordinary-least-squares regression line to a set of number pairs. • Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions Descriptive Statistics • DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, median, stats_mode, variance, standard deviation, quantile values, +/- n sigma values, top/bottom 5 values • Correlations • Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric). • Cross Tabs • Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa • Hypothesis Testing • Student t-test , F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA • Distribution Fitting • Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential Oracle Database SQL Analytics Included In The Oracle Database
  • 108. Big Data Is More Than Just Hardware & Software
  • 109. The Math Is The Hard Part This is a very simple equation for a Fourier transformation of a wave kernel at 0.
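The equation itself is only an image on the slide; for reference, the usual Fourier transform convention, and its value at zero, look like this (assuming the slide used the standard convention):

    \hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx,
    \qquad
    \hat{f}(0) = \int_{-\infty}^{\infty} f(x)\, dx

That is, the transform at 0 is just the integral (the total mass) of the kernel.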
  • 110. The Math Is The Hard Part This is a photograph of a data scientist’s white board at Bit.ly
  • 111. Data Scientists Are Expensive And Hard To Find • Typical Job Description: “Ph.D. in data mining, machine learning, statistical analysis, applied mathematics or equivalent; three-plus years hands-on practical experience with large-scale data analysis; and fluency in analytical tools such as SAS, R, etc.” • Looking For “baIT”: Business, Analytics, IT, all in the same person. These people exist, but are very expensive.
  • 112. Growing Your Own Data Scientist • Business Acumen • Familiarity With (And A Liking For) Computational Linear Algebra / Matrix Analysis • Interest In SAS, R, Matlab • Familiarity With Lisp
  • 113. Big Data Cannot Do Everything
  • 114. Big Data Cannot Do Everything Big Data Is A Great Tool But Not A Silver Bullet You would never run a POS system on Hadoop; Hadoop is far too batch oriented to support this type of activity. Similarly, random access of data does not work well in the Hadoop world.
  • 115. When Big Data? When Relational? Size Of Data (rough measure)
  • 116. When Big Data? When Relational? RDBMS vs Hadoop: A Comparison
RDBMS | Hadoop
Fully SQL compliant; many RDBMS vendors extend SQL in useful ways | Helper languages (Hive, Pig): very useful but not as robust as SQL
Optimized for query performance; tunable (input vs output, long running queries, etc.) | Optimized for analytics operations, specifically those of a statistical nature
Armies of trained and available resources | Resources are hard to find and expensive when found
Requires more specialized hardware at performance extremes | Designed to work on commodity hardware at all levels
OLTP, OLAP, ODS, DSS, hybrid -- more general purpose | Basically only for analytics
Expensive to implement over wide geographical distribution | Designed to span data centers
Very mature technology | Very new technology
Real time or batch processing | Batch operations only
Nontrivial licensing costs | Open source (“free” --ish)
About 2 PB as largest commercial cluster (telecom company) | 100+ PB as largest commercial cluster (Facebook, as of March 2013)
Ad hoc operations common, if not encouraged | Ad hoc operations possible with HBase but nontrivial
  • 117. It Is Not An “Either/Or” Choice RDBMS and Hadoop Each Solve Different Problems
  • 118. Where Are Things Heading?
  • 119. A Quick Recap GFS Presented To The Public In 2003 MapReduce Presented To The Public in 2004
  • 120. Hadoop Is Already Dead? YES. Sort Of* (* = for a specific set of problems…)
  • 121. The New Stuff In Overview [table: name, publication year, use, what it does, impact, open source]
Colossus (n/a): GFS for realtime systems. Not open source.
Caffeine (2009): real time search; incremental updates of analytics and indexes in real time. Estimated to be 100x faster than Hadoop. Not open source.
Pregel (2009): social graphs, location graphs, learning & discovery, network optimization, the Internet of Things; analyzes next-neighbor problems. Estimated to handle billions of nodes & trillions of edges. Open source analog: Apache Giraph (alpha).
Percolator (2010): large scale incremental processing using distributed transactions; makes transactional, atomic updates in a widely distributed data environment, eliminating the need to rerun a batch for a (relatively) small update. Data in the environment remains much more up to date with less effort.
Dremel (2010): SQL-like language for queries on the above technologies; interactive, ad hoc queries over trillion-row tables in subsecond time, working against Caffeine / Pregel / Colossus without requiring MapReduce. Easier for analysts and non-technical people to be productive (i.e. not as many data scientists are required). Open source analog: Apache Drill (incubator, very alpha).
Spanner (Oct 2012): fully consistent (?), transactional, horizontally scalable, distributed database spanning the globe; uses GPS sensors and atomic clocks to keep server clocks in sync regardless of location or other factors. Transactional support on a global scale at a fraction of the cost, and where (many times) not technically possible otherwise. Not open source, and unlikely to ever be.
Storm (2012): real time Hadoop-like processing; the power of Hadoop in real time, eliminating the requirement for batch processing. Not from Google; from Twitter. Open source (beta).
  • 122. One Last Thing Hadoop Is Just The Start Of The Equation
  • 123. One Last Thing Hadoop For Analytics And Determining Boundary Conditions Is Just The Start Of The Equation Use Hadoop to analyze all of the data in your environment and then generate mathematical models from that data.
  • 124. One Last Thing Acting On Boundary Conditions Once the model has been built (and vetted), it can be used to resolve events in real time, thereby getting around the batch bottleneck of Hadoop.
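That split looks like this in miniature: train offline on the Hadoop-crunched history, then score each incoming event immediately. A sketch with scikit-learn; the features and data are invented stand-ins for whatever the batch job actually produced:

    # Batch-train once, then score events in real time.
    from sklearn.linear_model import LogisticRegression

    # Offline: fit on (feature, label) history reduced by the Hadoop job.
    X_hist = [[0.10], [0.35], [0.40], [0.70], [0.85], [0.90]]
    y_hist = [0, 0, 0, 1, 1, 1]
    model = LogisticRegression().fit(X_hist, y_hist)

    # Online: no batch job in the loop; each event is scored on arrival.
    def handle_event(features):
        return model.predict([features])[0]

    print(handle_event([0.80]))   # -> 1, act on this event now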
  • 125. No Really. One More Last Thing
  • 126. Who Is Hilary Mason? • Chief Data Scientist At bit.ly • One of the major innovators in data science • Scary smart and fun to be around • A heck of a teacher, to boot Photo credit: Pinar Ozger, Strata 2011
  • 127. Interpret The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first… The Mason 5 Step Process For Big Data In Reverse Order
  • 128. Model Model the data into a useful paradigm which will allow us to make sense of any new data based on past experiences. But, before we can model the data, we must first…. The Mason 5 Step Process For Big Data In Reverse Order
  • 129. Explore Explore the data we have and look for meaningful patterns from which we could extract a useful model. But, before we can look through the data for meaningful patterns, we first have to… The Mason 5 Step Process For Big Data In Reverse Order
  • 130. Scrub Clean and clarify the data we have to make it as neat as possible and easier to manipulate. But, before we can clean the data, we have to start with… The Mason 5 Step Process For Big Data In Reverse Order
  • 131. Obtain Obtaining as much data as possible. Advances in technology – coupled with Moore’s law – means that DASD is very, very cheap these days. So much so that you may as well hang on to as much data as you can, because you never know when it will prove useful. The Mason 5 Step Process For Big Data In Reverse Order
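Read forward, the five steps chain naturally. A purely illustrative Python skeleton, with each placeholder body standing in for the real work described above:

    # Obtain -> Scrub -> Explore -> Model -> Interpret, in forward order.
    def obtain():       return ["  Raw Record 1", "raw record 2  "]  # keep everything
    def scrub(raw):     return [r.strip().lower() for r in raw]      # clean & normalize
    def explore(data):  return {"records": len(data)}                # look for patterns
    def model(data):    return lambda new_item: "expected"           # build the paradigm
    def interpret(m, stats):
        return "decision based on %(records)d records" % stats       # act on it

    data = scrub(obtain())
    stats = explore(data)
    m = model(data)
    print(interpret(m, stats))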
  • 133. Some Resources White Papers: • An Architect’s Guide To Big Data • Big Data For The Enterprise • Big Data Gets Real Time • Build vs. Buy For Hadoop This Deck: Slideshare Web Resources: • Oracle Big Data • Oracle Big Data Appliance • Oracle Big Data Connectors Me: charles dot scyphers oracle dot com @scyphers (twitter)

Editor's Notes

  1. Not just a lot of information. [click] My working definition is anything so large it becomes very hard to manage with the usual tools. It’s not that you cannot work with big data using your traditional toolsets, it’s just that with Big Data tools you can do it faster and cheaper.
  2. CIOs see licensing as a barrier; focus pricing on researchers. The data management era brings a number of new challenges. Volume has always been a problem, but more so now because of the increased opportunity to gather data: equipment has more and more monitors in it, which generate more and more data. In the past, people typically grabbed the piece of information they wanted and ditched the rest; today people find these streams of data more interesting and want to keep hold of them, so the volume of data you would like to retain is growing rapidly. Linked to that is velocity: not only is the data growing, it is arriving a lot faster. Collecting data from a machine or any other source these days can come at a phenomenal rate, like terabytes per minute. And typically people are looking to dive into a lot more different data sources: data they generate themselves or data from outside sources such as LinkedIn, Twitter and others, scraping information and linking it into what they have. The types of data are not just text and numbers, but images, pictures, graphs, TV cameras. Linked to that is the challenge of value: you have this huge collection of data, in all these different types, and across multiple groups you get huge value, but only small pieces of data from each of these groups are relevant to your business or the research being done. These are the challenges, so how can Oracle help you get that value-add?
  3. Data in transit – your phone call or the email of your vacation photos while traveling over the network backbone 1 GB stored content can create 1 PB in transit Stored data is doubling about every 2 years. 130 Exabytes in 2005 1227 Exabytes in 2010 (1.19 Zettabytes) 7910 EB in 2015 (7.72 Zettabytes)
  4. Big Data is driving significant data volume in customers who are leveraging it. A wide variety of sources provide this type of data.
  5. Definitions are from Peter Wood, Professor of Computer Science at the University of London
  6. These definitions are solely my own
  7. There are lots, but the main one (and the one on which we are going to focus today) is Hadoop
  8. It costs a lot more money to build bandwidth than it does CPU
  9. Meanwhile at Yahoo, Doug Cutting was working on Nutch, Yahoo’s next generation search tool. The elephant is important; trust me
  10. Hadoop is basically a massively parallel, shared nothing, distributed processing algorithm
  11. HDFS Distributes Files At The Block Level Across Multiple Commodity Devices For Redundancy On The Cheap Not RAID: Distribution Is Across Machines/Racks
  12. By Default, HDFS Writes Into Blocks & The Blocks Are Distributed Three Times. The block size can be set by the user. Pay attention to the NameNode here; this server keeps track of where all the chunks have been distributed across the file system. If you lose it, you’re hosed and have to rebuild everything from scratch.
  13. Data Is Written Once & (Basically) Never Erased
  14. Data Is Read From The Stream In Large, Contiguous Chunks, Not Random Reads
  15. Hadoop is just a programming paradigm. You can do MapReduce inside an Oracle database; you generally just don’t want to do so.
  16. Basically, a way of measuring how important an attribute is to the whole. The number of times it appears within the item compared to the background environment.
  17. What does a given person do and how would they behave in a given situation
  18. “80% of all network traffic (internet or otherwise) is one machine talking with another machine.” Mike Olson, Cloudera
  19. Spam vs. Ham
  20. Flume as an intake device, MapReduce as a transformation engine. Instead of the classic hub & spoke of Informatica, you can run your ETL across a few thousand nodes and massively increase the throughput. Facebook uses Hadoop as an underlying architecture (through lots of filtering) in its messaging application: 1.5M ops/sec at peak, 75B+ ops/day.
  21. Includes Sentiment Analysis What is this person thinking? Is that a happy smile, a sarcastic smile, a sad smile?
  22. A customer can ingest all the logs from every machine in their environment and data mine the results to find any machine out of compliance.
  23. Monte Carlo simulations, complex derivative valuations, predicting when a customer is heading into credit problems (and shortening their terms before you get caught in their problems), demand forecasting.
  24. Credit risk, scoring and analysis; parallelizing data access as well as computation. “A large financial institution combined their data warehouses into a single Hadoop environment. They then used that information to more accurately score their customer portfolio risk.” Social networking activity, bill payments (cell phone, for example), how often have you moved.
  25. When considering a new hire, an extended investigation may show risky behavior on the applicant’s part which may exclude him or her from some of the more sensitive areas.
  26. I hurt myself on the yards and you have to pay me workers comp. Then he tells Twitter he’s going to his house in Belize for some waterskiing.
  27. Look for bad actors within NGC; Nick Leeson at Barings in 1995, for example. Shrinkage detection. Enable the security people to better do their jobs in monitoring the activities of people in sensitive positions. The Petraeus scandal: one of the reasons why the FBI was able to close in on the identities of the people involved is that they were able to geolocate the sender and receiver of the Gmail emails and then connect those IP addresses with known users having the same IP addresses.
  28. Portfolio evaluation for existing holdings, portfolio evaluation for future activities, high speed arbitrage trading, simply keeping up: "Options were 4.55B contracts in 2011 -- 17% over 2010 and the 9th straight year in a row", 10k credit card transactions per second. All stats here from ComputerWorld, 04/25/12.
  29. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  30. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  31. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  32. People either do not fill out these forms or they fill them out with inaccurate information. These same people usually will tell their friends not just the truth, but the whole truth. And they will do it on Facebook and Twitter.
  33. Social Networking is coming to NGC’s customers at some point in time. It won’t be Facebook, but it will be something internal for the Navy (and/or the military). Oracle uses a secured social network internally to great effect… Live Twitter demo: http://50.17.239.57:9704/analytics/saw.dll?dashboard&PortalPath=%2Fshared%2FSentiment%20Analysis%2F_portal%2FSentiment%20Analsysis weblogic/welcome1
  34. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  35. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  36. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  37. Over 50% of all trades are done at the behest of a computer. As the #io6maps #fail tags trended on Twitter, a sell off of Apple occurred.
  38. Advance Auto Parts
  39. Advance Auto Parts
  40. Use machine processing to “read” the press releases and blogs of your customers to learn when they are getting ready to cut their budget. NGC can then position themselves to best answer their customer needs. [click] This can also extend to picking opportunities [click] from other competitors when they fall short. For that matter, [click] have programs scouring your competitor’s site and then use their own information against them. “Gosh, Air Force, I don’t know if I’d trust Boeing right about now; aren’t they using some of the same Dreamliner tech on their avionics package? Maybe we could help out there….”
  41. Use machine processing to “read” the press releases and blogs of your customers to learn when they are getting ready to cut their budget. NGC can then position themselves to best answer their customer needs. [click] This can also extend to picking opportunities [click] from other competitors when they fall short. For that matter, [click] have programs scouring your competitor’s site and then use their own information against them. “Gosh, Air Force, I don’t know if I’d trust Boeing right about now; aren’t they using some of the same Dreamliner tech on their avionics package? Maybe we could help out there….”
  42. As of Monday, there are [click] 724 Hadoop postings open in the DC area. For each of those jobs, [click] you’ll have hundreds, if not thousands, of applicants. So, how can you determine [click] that she is the one you want? Not because she’s the most technically adept, but because she is going to fit with your corporate culture and existing team.
  43. What do I mean by Cultural Fit? Well, the easiest way to get this across is what I call the airport test. When you’re thinking of hiring someone [click] and you have to sit in an airport [click] with them while the flight is delayed [click] for a few hours, would that make you happy or would you cringe at the thought of hours of chit-chat and making conversation.
  44. Instead of doing a simple keyword match in the resume, go beyond the resume and find out more about the person. Regardless of where their resume says they worked or went to school, language analysis can reveal details about where they grew up and where they experienced their formative years. [click] is that a faucet or a spigot? [click] A wallet or a billfold? [click] A dog, a hound or a hound dog? And it’s more than just regional. All these words basically mean the same thing, but come from a different cultural point in time. You can use all of this information – and you can get from Facebook, twitter, blog posts and the like – to help determine if a potential hire is going to work well within your team. And you can do this all before they ever set foot on your property for an interview.
  45. Instead of doing a simple keyword match in the resume, go beyond the resume and find out more about the person. Regardless of where their resume says they worked or went to school, language analysis can reveal details about where they grew up and where they experienced their formative years. [click] is that a faucet or a spigot? [click] A wallet or a billfold? [click] A dog, a hound or a hound dog? And it’s more than just regional. All these words basically mean the same thing, but come from a different cultural point in time. You can use all of this information – and you can get from Facebook, twitter, blog posts and the like – to help determine if a potential hire is going to work well within your team. And you can do this all before they ever set foot on your property for an interview.
  46. Log analysis Improve uptimes through predictive failure analysis
  47. The machines on a manufacturing floor produce data exhaust: Use this exhaust to improve the efficiency of the production line.
  48. Trash bins are not an item most would consider when it comes to the internet of things. Here’s how they could provide valuable intelligence
  49. We make the trash bins smart. [advance] You can buy a consumer grade, wifi enabled scale for about $100 apiece; I’ve seen bulk quotes on the internet for as low as $40 a pop. Put one of these scales under each of the bins [advance] and now the bin will tell you when it’s full.
  50. Currently, the custodian has to go [just start advancing 13 times], check each bin in turn and then empty the bin if necessary. With a self-reporting bin, the custodian [advance to phone image] can check his smart phone [advance to next slide]. Walgreens did this, and cut $57M out of their bottom line in 2012.
  51. And see where he needs to go. Less time on the floor, lower costs for cleanup, a more efficient waste management process. But, more importantly, we can now focus on what is happening when these bins are filling up. [advance] We can create a histogram for the amount of waste ingested at each bin. If you look [advance], you can see an outlier on the high side and [advance] an outlier on the low side. Take this one. [advance]
  52. Why does this particular bin fill up so much faster than all the others? Is there something inefficient in the line which can be remedied? [advance] This is an example of data exhaust from before. Once we learn that this bin is filling up much faster than the other bins, we can start to look into the line around it and see if there is something about the manufacturing process which can be improved. After a bit of digging, we may discover that there is a problem with the machine cutting away too much metal; we refactor the line to send less metal down the pipe, saving on material costs and improving the efficiency of the line.
  53. Advance Auto Parts
  54. No. Hadoop is amazing at processing, but lacks a number of features found in traditional RDBMS platforms (like, say, Oracle). These features include (but are not limited to): security, ad-hoc query support, SQL support, readily available technical resources.
  55. In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis. Oracle Connectors; other options
  56. Storage is the primary limiting factor, with one exception
  57. Storage is the primary limiting factor, with one exception
  58. If you remember from before, the NameNode controls the file distribution. It’s also the bottleneck for growth; you can only add nodes and files to the system if the NameNode can hold that information with its available RAM.
  59. So, for the NameNode, load the machine up with as much memory as possible.
  60. FUSE-DFS is a utility that allows a user to mount the distributed file system as a traditional file system (e.g. you can mount it on another server as a remote disk)
  61. Hue is the Cloudera analog to OEM
  62. Here are some of the powerful capabilities of Cloudera Manager Service health and performance – Cloudera Manager is the only Hadoop management application that gives you the ability to get a real time view of the health of all the services running in the Hadoop stack. Competitive products tend to focus primarily on the file system, which is only 1 piece of the solution. Host-Level Snapshots – this gives you a view into that status of each host or node in your cluster Monitor and Diagnose Workloads – with Cloudera Manager, you can view and compare current and historical job performance for benchmarking, troubleshooting and optimization View/Search Hadoop Logs – Cloudera Manager is the only Hadoop management application that provides comprehensive log management. Each screen provides contextual log views, so you only need to view the logs that are relevant to what you’re looking at. You can also search logs by keyword, type and severity. Track Events – Cloudera Manager creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching Usage/Performance Reports – With Cloudera Manager you can visualize current and historical disk usage by user, group and directory. Track MapReduce activity on the cluster by job or user
  63. Mahout is a collection of machine learning libraries. Mahout is also the job title for elephant wranglers in India
  64. Oozie manages workflow and dependencies between MR jobs
  65. Flume supports massively fast intake of log files.
  66. Sqoop is a very simple connector between Hadoop and any ANSI SQL database using JDBC.
  67. Pig and Hive are helper languages to provide a more SQL like interface to the Hadoop environment. Both work with MapReduce behind the scenes.
  68. Hbase supports read/write access in a columnar store style
  69. Whirr is the deployment tool to push out new nodes into the Hadoop environment. Very similar to Chef or Puppet, if your customers are already familiar with either of those tools
  70. Zookeeper manages the coordination between all of the distributed services
  71. BigTop is a test harness for Hadoop – both the environment as well as specific MapReduce jobs
  72. Build slide. In Analytics, we start with [click] Standard report, move to [click] Ad Hoc, then [click] Drill Down. These are all [click] ways of analyzing what has happened or what is happening right now. Next, are alerts [click] to let me know that action must be taken, [click] simulation to experiment with ways to shape the action, [click] forecasting to take a look at what is happening now and projecting it into the future, and [click] prediction to play “what if?” All of these are [click] predictive in nature – what’s going to happen next. The top tier is when you get into [click] various forms of [click] optimization – both when you believe you have a good handle on the circumstances and when you do not. These areas are [click] prescriptive – given what we expect to be next, what is the best course of action.
  73. Big Data can play across of these areas, but it is better suited for the higher level, more complex operations. It’s not that Big Data cannot support a more standard approach to reporting, it’s just that those areas are probably better served by existing, lower cost options.
  74. This is what Oracle sees as the typical stages in analytics … ranges from initial data discovery to predictive analytics. [click] Many organizations are investing at the two ends of this spectrum today.
  75. Our customers continue to evolve. [click] While there is a lot of hype and promise from Big Data, most are continuing to focus on aligning data warehouses with business needs, etc. [click] However, investments in Big Data are becoming much more common, often starting with proof of concepts.
  76. &quot;Big Data is not only about analytics, it&apos;s about the entire value chain. So when you think about Big Data solutions you have to think about all the different steps. In the first step, you need to actually acquire and store the data.
  77. The next step is to organize the data – you will have acquired massive amounts of unstructured data, but it won’t be of use until you organize or transform and distill it such that it can be easily integrated into your data center.
  78. Next, you will want to analyze the data – slice it and dice it, do data mining on it , look at it in tables and cubes etc. Basically, you want to know what this means.
  79. And lastly, you want to turn this into something useful something that decision makers can see in their dashboards quickly so that they can act upon in near real-time.
  80. There are a lot of new technologies out there that address the challenges at each stage of the process we just talked about.
  81. Conquering Big Data with the Oracle Information Model. We typically look at capabilities through People, Process, and Tools. We had a lot of discussion this morning on tools and products, so let me direct your attention to a few other dimensions of big data capability. First, the Big Data process is different. The development of traditional BI and DW is entirely different from Big Data. With traditional BI, you know the answer you are looking for: you simply define requirements and build to your objective. With Big Data (of course, not in all cases), you may have an idea or interest, but you don’t know what will come out of it. The answer to your initial question will trigger the next set of questions, so the development process is more fluid. It requires that you explore the data as you develop and refine your hypothesis. So this might be a process you go through with big data: Hypothesis (the big idea); Data Sources (acquire, access, capture data: private weblogs, streams, public [data.gov]); Explore Results (simple MapReduce results with Hive/QL or SQL, interactive query through search, visualization); Reduce Ambiguity (apply statistical models: eliminate outliers, find concentrations, and make correlations). You interpret the outcome and continuously refine models and establish an improved hypothesis. In the end, this analysis might lead to the creation of new theories and predictions based upon the data. Again, it’s very fluid and very different from traditional SDLC and BI development.
  82. The comparison with the 30-node cloud based cluster is showing a single 18 node BDA being 2.5x faster than an almost twice as large Amazon cluster. The reason that this is only 2.5x is because a 30 node cluster has substantially more mappers and reducers running. On a normalized basis a BDA achieves 4x the throughput of the Amazon cluster.
  83. Direct Connect: Optimized version of External Tables for HDFS. Fast, Parallelized data movement with automatic load balancing Loader: A MapReduce utility to load data from Hadoop into Oracle. Handles data conversion on the Hadoop side, makes loads very fast and efficient ODI Adapter: Works with ODI, creates MapReduce jobs behind the scenes, uses Hive (qv) R Connector: Writes MR jobs behind the scenes, Connects R, Oracle, local file system and HDFS.
  84. Embedded analytics focus: Oracle R Enterprise enabling R statistics programs to be run against data in the Oracle Database eliminating latency and improving data security.
  85. Embedded analytics focus: Data Mining algorithms available via SQL as part of the Advanced Analytics Option.
  86. Embedded analytics focus: What’s included in the Oracle Database at no charge.
  87. Oracle Endeca Information Discovery provides the Endeca Server which provides a “multi-faceted” data model that automatically provides drill paths through structured and unstructured data that is loaded into the server.
  88. Support for mobile experience provided by the BI Foundation Suite for iOS (Apple) devices, here represented as being hosted on Exalytics.
  89. Oracle’s goal is to reduce the amount of time required to implement these solutions, simplify support, allow you to focus on delivering value rather than on maintaining infrastructure, and provide the tools you need to effectively analyze data and generate insights. Let’s look at this picture from left to right. Twitter data streamed into the ..
  90. This is a very simple equation for a Fourier transformation of a wave kernel at 0. If you think the data analysts at your customer would look at the above equation and cringe, or would hear the description I just gave and glaze over, then they are not ready for this.
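The slide’s equation is an image and is not reproduced in these notes; as an illustrative stand-in (an assumption, not the actual slide), a Fourier transform evaluated at frequency zero collapses to the plain integral of the signal:

\hat{f}(0) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \cdot 0 \cdot t}\, dt = \int_{-\infty}^{\infty} f(t)\, dt

If that notation makes an analyst flinch, the point of the slide stands.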
  91. A picture of one whiteboard at bit.ly
  92. The demand for people with programming skills, math skills and business acumen is out of this world.
  93. Many companies are opting to grow their own rather than hire from the outside. If this is your customer, they need to look for a programmer who liked Lisp in college, knows computational matrices, and knows his/her way around the business issues.
  94. Big Data is a very powerful tool, but it is not the right tool for every problem.
  95. You would never operate a POS system on Hadoop – you can only sell that widget once and only once, and the batch-processing nature of Hadoop doesn’t support this type of activity. If you remember from the technical overview, Hadoop reads data in contiguous streams, so random access of data does not work very well in a Hadoop world.
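To see why Hadoop favors contiguous streams, here is a minimal Hadoop Streaming-style mapper in Python (a hypothetical word-count-style job, not from the slides). Each record arrives once, in order, on stdin; there is no way to seek to an arbitrary record, which is exactly what a POS lookup would require:

```python
#!/usr/bin/env python
# mapper.py - Hadoop Streaming feeds each input split as one contiguous stream
import sys

for line in sys.stdin:                      # strictly sequential; no seek-to-record
    for word in line.split():
        sys.stdout.write(f"{word}\t1\n")    # emit key<TAB>count for the reducers
```

The framework then sorts and groups the emitted keys before the reduce phase, another inherently batch-oriented step.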
  96. The amount of data is the wrong measurement. <1 / 1-50 / 50-300 / 300-600 / 600+ is my yardstick, but only if I have to make a size determination.
  98. In general, do the data crunching in Hadoop, then import the results into a system like Oracle for more traditional BI analysis – via the Oracle connectors or other options.
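As a minimal illustration of that hand-off pattern (using Python’s built-in sqlite3 as a stand-in for the Oracle target; in practice you would use the connectors described earlier, and the table and data here are hypothetical):

```python
import sqlite3

# Pretend this is the aggregated output of a Hadoop job: key<TAB>count lines
hadoop_output = ["widget_a\t42", "widget_b\t17", "widget_c\t99"]

conn = sqlite3.connect(":memory:")   # stand-in for the BI-side database
conn.execute("CREATE TABLE sales_summary (product TEXT, units INTEGER)")

rows = ((p, int(c)) for p, c in (line.split("\t") for line in hadoop_output))
conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)

# Traditional BI-style query over the imported results
for product, units in conn.execute(
        "SELECT product, units FROM sales_summary ORDER BY units DESC"):
    print(product, units)
```

The heavy lifting happens in the batch layer; the relational side only ever sees the small, already-crunched summary.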
  99. Caffeine was built by Google to address real-time indexing (instant results when searching). This technology will be of high interest to organizations looking to access their quickly changing data in real time, but it is not as useful for longitudinal or historical introspection.
  100. Use Hadoop to analyze all of the data within your corpus and then generate a mathematical model. This model can be as simple as a hard-knee waveform or as complex as a multivariate linear regression.
  101. Once the model has been created (and properly vetted, of course), it can be used to determine the resolution of events in real time – thereby getting around the batch bottleneck of Hadoop. These real-time events can be handled quite well in a system like Oracle’s Complex Event Processing. (hand over)
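A hedged sketch of that two-phase pattern in Python, assuming NumPy and made-up feature names (neither the data nor the model comes from the slides): fit a multivariate linear regression offline on the batch output, then score individual events as they arrive.

```python
import numpy as np

def hard_knee(x, knee=50.0):
    """The 'simple' end of the spectrum: zero output until x crosses the knee."""
    return 0.0 if x < knee else x - knee

# --- Batch phase: fit the model on the output of a Hadoop run over the corpus ---
# Hypothetical features per event: [clicks, session_seconds]; target: purchases
X = np.array([[10, 120], [25, 300], [40, 480], [55, 600]], dtype=float)
y = np.array([1.0, 2.0, 4.0, 5.0])

X1 = np.hstack([X, np.ones((len(X), 1))])      # append an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit

# --- Real-time phase: score single events (e.g., inside a CEP engine) ---
def score(clicks, seconds):
    """Apply the pre-fitted coefficients to one live event; no batch job needed."""
    return coef[0] * clicks + coef[1] * seconds + coef[2]

print(score(30, 360))   # predicted purchases for one incoming event
```

The expensive fitting step stays in the batch world; the per-event scoring is a handful of multiplications, cheap enough for real time.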
  102. Hilary Mason is the chief data scientist at bit.ly (a web service which shortens links for social media). They handle ~80M new URLs per day and ~300M clicks per day. She’s an excellent lecturer and instructor – you really should find time to listen to her speak – and I’ve learned quite a bit from her over the years. She views Big Data projects as moving across 5 distinct stages. Let’s go through them…. in reverse order. In other words, let’s start at the end. What do we want as the end result of a Big Data project?
  103. The end goal of any Big Data solution is to provide data which can be interpreted into meaningful decisions. But, before we can interpret the data, we must first….
  104. Model the data into a useful paradigm which will allow us to make sense of any new data based upon past experiences. But before we can model the data, we must first…
  105. Explore the data we have and look for meaningful patterns from which we could extract a useful model. But before we can look through the data for a meaningful pattern, we first have to…
  106. Clean and clarify the data we have, to make it as neat as possible and as easy as possible to manipulate. But before we can clean the data, we have to start with…
  107. Obtaining as much data as possible. Advances in technology coupled with Moore’s law mean that DASD is very, very cheap these days – so much so that you might as well hang on to as much data as you can, because you never know when it will prove useful. And here’s where the BDA comes back into play: able to ingest terabytes of data per hour, with the disk to store it (particularly when coupled with ZFS), it’s a great starting place.
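Putting the five stages together, a minimal pipeline skeleton in Python (stage names taken from the notes above; the bodies are placeholders, not an actual bit.ly implementation) might look like:

```python
def obtain():
    """Acquire and keep as much raw data as is practical (storage is cheap)."""
    ...

def clean(raw):
    """Scrub the data: normalize formats, deduplicate, fix encodings."""
    ...

def explore(tidy):
    """Look for meaningful patterns worth turning into a model."""
    ...

def model(patterns):
    """Fit a model that makes sense of new data based on past experience."""
    ...

def interpret(fitted):
    """Turn model output into decisions people can act on."""
    ...

if __name__ == "__main__":
    interpret(model(explore(clean(obtain()))))
```

Read forward, the call chain runs in exactly the order the talk presented in reverse: obtain, clean, explore, model, interpret.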