© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 1
Mohammed Guller
Oct 02, 2016
Introduction to Big Data
Agenda
• Big Data
• Big Data Technologies
• Kafka
• Hadoop
• Spark
About Me
• Engineering Manager / Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and
machine learning
• www.linkedin.com/in/mohammedguller
• @MohammedGuller
Big Data Analytics with Spark
Available on Amazon
• Hands-on guide with lots of examples
• Covers both fundamental and advanced topics such as machine learning
• Includes a primer on functional programming and Scala
• Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
About Glassbeam
Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence
Cloud-based analytics platform that helps organizations turn raw machine data into insights
Making sense of multi-structured machine data
• Data center devices
• Medical devices
• Sensors
• ATMs
• Automobiles
• Data from any machine
Providing a comprehensive set of apps & tools for machine data analysis
• 50,000+ systems being tracked today
• 1,500+ different software rev codes
• 1.2 billion sensor readings per day
• 1+ trillion sensor readings tracked
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
Data Growing At a Faster Pace Than Ever
Internet of Things (IoT)
• Network of objects embedded with software for collecting and sending data over the Internet
• 5x more connected things than people by 2020
Industrial IoT
• Manufacturing
• Automotive
• Medical
• Data Center
• EVC
• Smart Meter
Glassbeam's target market is focused on driving operational & business analytics value for connected product companies in the Industrial IoT market
• IT & Networks
• Medical & Health Care
• Transportation
• EV Chargers & Smart Grid
• Industrial & Mfg
Key Attributes of Big Data
• Volume: Scale of Data
• Variety: Diversity of Data
• Velocity: Speed of Data
Big Data Comes with Big Challenges
• Storage
• Processing
• Value
Storage Challenges
• Legacy SAN / NAS storage devices are expensive
• Traditional RDBMS were not designed for Big Data
• Cannot handle volume, velocity, variety of Big Data
Processing Challenges
• Diverse processing
• Organizations want to do more than just BI / traditional analytics
• Go beyond SQL queries
• Timeliness
• Process data in reasonable amount of time
• Value of data decreases over time
How Much Data Can a Standard Server Process
Scale-up with Powerful High-end Server
• Large number of CPUs / cores
• Faster cores
• Large amount of memory
• Faster memory bus
• High-performance architecture
Disadvantages of Scale-up Architecture
• Proprietary
• Expensive
• Limited scalability
Scale-out Architecture
• Cluster of servers
• Commodity machines
• Pool together resources
• CPU
• Memory
• Disk
Benefits of Scale-out Architecture
• Relatively inexpensive
• Economical to scale
• No huge upfront investment
• Start small and expand cluster as workload increases
Challenges With Scale-out Architecture
• Writing distributed applications is very hard
• Split job into chunks that can be distributed across a cluster
• Schedule compute resources among different jobs
• Manage inter-node communication
• Handle network and node failures
• Hardware failures are more common at a cluster level
• Probability of a single node failing is low
• Probability of any one node in a large cluster failing is high
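The last two bullets can be checked with a little arithmetic. The sketch below assumes that node failures are independent and uses a hypothetical 1% chance that a single node fails in a given period; neither assumption comes from the slides.

```python
def prob_any_node_fails(p_single: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes independent nodes fails."""
    # All nodes survive with probability (1 - p)^n; the complement is
    # the chance that one or more nodes fail.
    return 1.0 - (1.0 - p_single) ** num_nodes


# Hypothetical figure, chosen only for illustration.
P_SINGLE = 0.01

for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes -> {prob_any_node_fails(P_SINGLE, n):.3f}")
```

Under these assumptions a single node rarely fails, yet a 100-node cluster sees at least one failure roughly 63% of the time, which is why a distributed application has to treat failures as routine.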
Getting Value Out of Big Data
• Traditional analytics / BI
• Custom processing
• Machine Learning
• Predictive analytics
• Automate complex tasks
• Stream processing
• Analyze in real-time/near real-time
• React in real-time
Traditional Analytics / BI
• What
• Customer growth for the last month/quarter/year
• Segmentation of customers by demographics
• Average time spent by mobile app users
• Why
• Sales growth slowed
• regional issue
• supply issue
• Profit dropped
• revenue dropped
• expenses increased
Custom Processing
• Index web pages
• Google
• Bing
• Process genome data
• Identify mutations linked to cancer, Alzheimer's, and other diseases
• Click analysis
• Log analysis
• 360-degree real time view of a customer
Predictive Analytics
• Advertisements that a visitor will most likely click
• Movies / songs / news that a customer will like
• Products that a customer will buy
• Whether a patient will have a heart attack
Automate Complex Tasks
• Virtual assistant
• Siri
• Google Now
• Autonomous machine
• Self-driving car
• Robots
• Tag Images
• Facebook
• Flickr
• Expert System
• Medical diagnosis
• Personalized medicine
• Security
• Fraud detection
• Network Security
• Music recognition
• Shazam
• SoundHound
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
File Formats
• Text
• CSV
• JSON
• XML
• Binary
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
Distributed SQL Query Engine
• Hive
• Spark SQL
• Impala
• Presto
• Drill
• Phoenix
• HAWQ
• Tajo
[Diagram labels: Data Warehouse, Distributed Storage, Distributed Query Engine]
Publish – Subscribe / Messaging Systems
• Kafka
• RabbitMQ
• ActiveMQ
• ZeroMQ
Big Data Computing Frameworks
• Batch
• Hadoop MapReduce
• HPCC
• Stream
• Kafka Streams
• Heron
• Storm
• Samza
• Batch and Stream
• Spark
• Flink
• Beam
• Apex
• Ignite
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
Kafka
• Distributed publish-subscribe messaging system
• Partitioned and replicated commit log service for building distributed datastore
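To make "partitioned commit log" more concrete, here is a toy in-memory sketch in plain Python. It is not the Kafka API: the class `PartitionedLog`, its methods, and the CRC32-based choice of partition are all assumptions made for illustration, and replication is left out entirely.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]


class PartitionedLog:
    """Toy append-only log split into partitions (hypothetical, no replication)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        # Records with the same key always land in the same partition,
        # so their relative order is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1

    def read(self, partition: int, offset: int) -> List[Record]:
        """Return every record in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
first = log.publish("sensor-1", "temp=70")
second = log.publish("sensor-1", "temp=71")
# A subscriber keeps track of its own offset and can re-read from any point.
print(log.read(first[0], 0))
```

Because records are only ever appended, each subscriber can consume at its own pace by remembering the last offset it read in each partition.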
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
Hadoop
Hadoop is Not a Single Product
Hadoop Core Components
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
Adoption of Spark is Growing Rapidly
Spark
Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simpler programming model
Benefits
• Scale
• Fault-tolerance
• Abstracts distributed computing
• Hides the messy details of writing distributed applications
• Allows developers to just focus on the data processing logic
• Same code works on a laptop or a cluster of servers
• Ease-of-use
• Speed
• Flexibility
Easy To Use
• Library with an expressive API
• Scala, Java, Python, R
• RDD API with 80+ operators (MR has only two)
• Dataset/DataFrame API
• Interactive development
• spark-shell
• notebooks
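For context on "MR has only two": the two MapReduce operators are map and reduce. The sketch below is plain Python rather than Hadoop or Spark code, and the function names are hypothetical. It shows a word count squeezed into those two steps, with the grouping that a real framework performs between them folded into the reduce function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: sum the counts emitted for each distinct word."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["big data", "big clusters process big data"]
counts = reduce_phase(map_phase(lines))
print(counts)
```

Anything beyond this shape, such as a join, a filter, or a sort, has to be expressed as one or more map and reduce steps, which is the gap a richer operator set is meant to close.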
Integrated Libraries For a Variety of DP Tasks
• Batch processing
• Interactive analytics
• Stream analysis
• Machine learning
• Graph analytics
[Diagram: Spark SQL, GraphX, Spark Streaming, and MLlib shown with Spark Core]
Benefits of a Unified Platform
• Solve a variety of problems with a single toolkit
• No need to learn different tools for each use case
• Avoid code and data duplication
• Achieve operational simplicity
Why is Spark Fast
• Advanced job execution engine
• Allows applications to cache data in memory
Advanced Job Execution Engine
• Directed Acyclic Graph (DAG) of stages
• simple job can contain just one stage
• complex job can contain many stages
• eliminates expensive operations between multiple jobs
• synchronization
• serialization/deserialization
• disk I/O
• Lazy operator evaluation
• Pipelined operations
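Lazy evaluation and pipelining can be illustrated with Python generators. This is a conceptual sketch only, not how Spark is implemented: the `LazyPipeline` class and its method names are hypothetical, chosen to resemble the ideas on this slide.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Records transformations and runs them only when collect() is called."""

    def __init__(self, source: Iterable[int]) -> None:
        self._source = source
        self._steps: List[Callable[[Iterator[int]], Iterator[int]]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Nothing is computed here; the step is only recorded.
        self._steps.append(lambda it: (fn(x) for x in it))
        return self

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        self._steps.append(lambda it: (x for x in it if pred(x)))
        return self

    def collect(self) -> List[int]:
        # Chain the generators so each element flows through every step
        # in turn, without building an intermediate list per step.
        it: Iterator[int] = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record when the function actually runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
calls_before = list(calls)   # still empty: no work has been done yet
result = pipeline.collect()  # the whole chain executes here
print(calls_before, result)
```

Deferring execution until a result is requested lets an engine see the whole chain of operators at once, so it can fuse consecutive steps into a single pass over the data.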
Allows Applications to Cache Data in Memory
• Minimize disk I/O
• Reading data from memory is orders of magnitude faster than reading from disk
• In-memory data sharing across DAGs
• different jobs can work with the same cached data
Why Caching Makes Applications Run Faster
[Figure: read throughput values of 100 MB/s, 500 MB/s, and 10 GB/s]
Read Latency Comparison
[Chart: time in minutes (axis 0 to 200) to read 1 TB of data from HDD, SSD, and RAM]
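The chart's values can be approximated with simple arithmetic. The sketch below assumes that the three throughput figures on the preceding slide (100 MB/s, 500 MB/s, 10 GB/s) belong to HDD, SSD, and RAM in that order, that 1 TB means 10^12 bytes, and that throughput stays constant; the slides do not state any of these explicitly.

```python
# Throughput figures from the preceding slide; pairing them with
# these devices is an assumption.
THROUGHPUT_BYTES_PER_SEC = {
    "HDD": 100e6,  # 100 MB/s
    "SSD": 500e6,  # 500 MB/s
    "RAM": 10e9,   # 10 GB/s
}
DATA_BYTES = 1e12  # 1 TB in decimal units


def read_minutes(data_bytes: float, bytes_per_sec: float) -> float:
    """Minutes needed to read data_bytes at a constant throughput."""
    return data_bytes / bytes_per_sec / 60


for device, rate in THROUGHPUT_BYTES_PER_SEC.items():
    print(f"{device}: {read_minutes(DATA_BYTES, rate):.1f} min")
```

Under these assumptions the HDD read takes about 167 minutes, the SSD about 33 minutes, and RAM under 2 minutes, which fits the chart's 0 to 200 minute axis.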
Spark Does Not Provide Storage
• Works with a variety of data sources
• No need to import data into Spark
• Scale compute and storage cluster independently
Process Data From a Variety Of Data Sources
And Many More
Spark Does Not Replace Hadoop
Hadoop is Optional
Ideal Applications
• Complex data processing
• multi-step pipeline
• Iterative algorithm
• Machine Learning
• Graph analytics
• Ad hoc analysis
• Interactive
© COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 110110

More Related Content

What's hot

Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Innovative Management Services
 

What's hot (19)

Hadoop Big Data Lakes Keynote
Hadoop Big Data Lakes KeynoteHadoop Big Data Lakes Keynote
Hadoop Big Data Lakes Keynote
 
Big Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and moreBig Data in Action : Operations, Analytics and more
Big Data in Action : Operations, Analytics and more
 
Operational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data StoresOperational Analytics Using Spark and NoSQL Data Stores
Operational Analytics Using Spark and NoSQL Data Stores
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
Top 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data SolutionTop 5 Considerations for a Big Data Solution
Top 5 Considerations for a Big Data Solution
 
Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
 
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
 
How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016How to get started in Big Data without Big Costs - StampedeCon 2016
How to get started in Big Data without Big Costs - StampedeCon 2016
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data Architecture and Deployment
Big Data Architecture and DeploymentBig Data Architecture and Deployment
Big Data Architecture and Deployment
 
Enterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big DataEnterprise Data Hub: The Next Big Thing in Big Data
Enterprise Data Hub: The Next Big Thing in Big Data
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
 
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
It Takes a Village: Organizational Alignment to Deliver Big Data Value in Hea...
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 

Viewers also liked

Viewers also liked (15)

Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014Wikibon Big Data Capital Markets Day 2014
Wikibon Big Data Capital Markets Day 2014
 
Create your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouseCreate your Big Data vision and Hadoop-ify your data warehouse
Create your Big Data vision and Hadoop-ify your data warehouse
 
Steps towards a Data Value Chain
Steps towards a Data Value ChainSteps towards a Data Value Chain
Steps towards a Data Value Chain
 
Becoming a Data Driven Organisation
Becoming a Data Driven OrganisationBecoming a Data Driven Organisation
Becoming a Data Driven Organisation
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...Usama Fayyad talk at IIT Madras on March 27, 2015:  BigData, AllData, Old Dat...
Usama Fayyad talk at IIT Madras on March 27, 2015: BigData, AllData, Old Dat...
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture Lecture on Data Science in a Data-Driven Culture
Lecture on Data Science in a Data-Driven Culture
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
How to reach a Data Driven culture
How to reach a Data Driven cultureHow to reach a Data Driven culture
How to reach a Data Driven culture
 
The big data value chain r1-31 oct13
The big data value chain r1-31 oct13The big data value chain r1-31 oct13
The big data value chain r1-31 oct13
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 

Similar to Introduction to Big Data

Linthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-serviceLinthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-service
David Linthicum
 

Similar to Introduction to Big Data (20)

Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2Datameer6 for prospects - june 2016_v2
Datameer6 for prospects - june 2016_v2
 
C1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyC1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategy
 
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data:  InterConnect 2016 Session on Getting Started with Big Data AnalyticsBig Data:  InterConnect 2016 Session on Getting Started with Big Data Analytics
Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics
 
Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?Data Lake, Virtual Database, or Data Hub - How to Choose?
Data Lake, Virtual Database, or Data Hub - How to Choose?
 
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020
 
Oracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and ArchitectureOracle Cloud : Big Data Use Cases and Architecture
Oracle Cloud : Big Data Use Cases and Architecture
 
How Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data CenterHow Cloud Providers are Playing with Traditional Data Center
How Cloud Providers are Playing with Traditional Data Center
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs Integrating Hyper-converged Systems with Existing SANs
Integrating Hyper-converged Systems with Existing SANs
 
Journey to the Cloud: What I Wish I Knew Before I Started
 Journey to the Cloud: What I Wish I Knew Before I Started Journey to the Cloud: What I Wish I Knew Before I Started
Journey to the Cloud: What I Wish I Knew Before I Started
 
Building the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing MicroservicesBuilding the Glue for Service Discovery & Load Balancing Microservices
Building the Glue for Service Discovery & Load Balancing Microservices
 
Journey to analytics in the cloud
Journey to analytics in the cloudJourney to analytics in the cloud
Journey to analytics in the cloud
 
Emerging trends in data analytics
Emerging trends in data analyticsEmerging trends in data analytics
Emerging trends in data analytics
 
DC/OS 1.8 Container Networking
DC/OS 1.8 Container NetworkingDC/OS 1.8 Container Networking
DC/OS 1.8 Container Networking
 
Erlang containers
Erlang containersErlang containers
Erlang containers
 
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018Data Engineering the Startup Way - AWS Startup Day Chicago 2018
Data Engineering the Startup Way - AWS Startup Day Chicago 2018
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Linthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-serviceLinthicum next generation-iaa s-paas-and-database-as-a-service
Linthicum next generation-iaa s-paas-and-database-as-a-service
 
The Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the cloudsThe Cloud and Microsoft Windows Azure - A Walk through the clouds
The Cloud and Microsoft Windows Azure - A Walk through the clouds
 
Introduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use CasesIntroduction To IPaaS: Drivers, Requirements And Use Cases
Introduction To IPaaS: Drivers, Requirements And Use Cases
 

Recently uploaded

Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 

Recently uploaded (20)

Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 

Introduction to Big Data

  • 1. Introduction to Big Data. Mohammed Guller, Oct 02, 2016. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE
  • 2. Agenda
      • Big Data
      • Big Data Technologies
      • Kafka
      • Hadoop
      • Spark
  • 3. About Me
      • Engineering Manager / Principal Architect at Glassbeam
      • Founded two startups
      • Passionate about building products, big data analytics, and machine learning
      • www.linkedin.com/in/mohammedguller
      • @MohammedGuller
  • 4. Big Data Analytics with Spark (available on Amazon)
      • Hands-on guide with lots of examples
      • Covers both fundamental and advanced topics such as machine learning
      • Includes a primer on functional programming and Scala
      • Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
  • 5. About Glassbeam
      Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence.
      • Cloud based analytics platform that helps organizations turn raw machine data to insights
      • Making sense of multi structured machine data
          • Data center devices
          • Medical devices
          • Sensors
          • ATMs
          • Automobiles
          • Data from any machine
      • Providing comprehensive set of apps & tools for machine data analysis
          • 50,000+ systems being tracked today
          • 1,500+ different software rev codes
          • 1.2 Billion sensor readings per day
          • 1+ Trillion sensor readings tracked
  • 7. Data Growing At a Faster Pace Than Ever
  • 8. Internet of Things (IoT)
      • Network of objects embedded with software for collecting and sending data over the Internet
      • 5x more connected things than people by 2020
  • 9. Industrial IoT
      • Manufacturing
      • Automotive
      • Medical
      • Data Center
      • EVC
      • Smart Meter
      Glassbeam's target market is focused on driving operational and business analytics value for connected-product companies in the Industrial IoT market: IT & Networks, Medical & Health Care, Transportation, EV Chargers & Smart Grid, Industrial & Mfg.
  • 10. Key Attributes of Big Data
      • Volume: scale of data
      • Variety: diversity of data
      • Velocity: speed of data
  • 11. Big Data Comes with Big Challenges
      • Storage
      • Processing
      • Value
  • 12. Storage Challenges
      • Legacy SAN / NAS storage devices are expensive
      • Traditional RDBMSs were not designed for Big Data
      • Cannot handle the volume, velocity, and variety of Big Data
  • 13. Processing Challenges
      • Diverse processing
          • Organizations want to do more than just BI / traditional analytics
          • Go beyond SQL queries
      • Timeliness
          • Process data in a reasonable amount of time
          • Value of data decreases over time
  • 14. How Much Data Can a Standard Server Process
  • 16. Scale-up with Powerful High-end Server
      • Large number of CPUs / cores
      • Faster cores
      • Large amount of memory
      • Faster memory bus
      • High-performance architecture
  • 17. Disadvantages of Scale-up Architecture
      • Proprietary
      • Expensive
      • Limited scalability
  • 18. Scale-out Architecture
      • Cluster of servers
      • Commodity machines
      • Pool together resources
          • CPU
          • Memory
          • Disk
  • 19. Benefits of Scale-out Architecture
      • Relatively inexpensive
      • Economical to scale
      • No huge upfront investment
      • Start small and expand cluster as workload increases
  • 20. Challenges With Scale-out Architecture
      • Writing distributed applications is very hard
          • Split job into chunks that can be distributed across a cluster
          • Schedule compute resources among different jobs
          • Manage inter-node communication
          • Handle network and node failures
      • Hardware failures are more common at a cluster level
          • Probability of a single node failing is low
          • Probability of any one node in a large cluster failing is high
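
The last point on slide 20 can be made concrete with a little arithmetic. The sketch below is not from the slides: it assumes nodes fail independently and uses a made-up per-node failure probability of 0.1% for some fixed period. Under those assumptions, the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_node_fails(p_single: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes fails, assuming each node
    fails independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_nodes


if __name__ == "__main__":
    # Illustrative assumption, not a measured figure: 0.1% per node per period.
    p = 0.001
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"{n:>6} nodes: {prob_any_node_fails(p, n):.4f}")
```

With these assumed numbers a single node fails 0.1% of the time, while a 1,000-node cluster sees at least one failure in roughly 63% of periods, which is why a distributed framework has to treat node failure as routine.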
  • 21. Getting Value Out of Big Data
      • Traditional analytics / BI
      • Custom processing
      • Machine Learning
          • Predictive analytics
          • Automate complex tasks
      • Stream processing
          • Analyze in real-time / near real-time
          • React in real-time
  • 22. Traditional Analytics / BI
      • What
          • Customer growth for the last month/quarter/year
          • Segmentation of customers by demographics
          • Average time spent by mobile app users
      • Why
          • Sales growth slowed
              • regional issue
              • supply issue
          • Profit dropped
              • revenue dropped
              • expenses increased
  • 23. Custom Processing
      • Index web pages
          • Google
          • Bing
      • Process genome data
          • Identify mutations linked to cancer, Alzheimer's, and other diseases
      • Click analysis
      • Log analysis
      • 360-degree real-time view of a customer
  • 24. Predictive Analytics
      • Advertisements that a visitor will most likely click
      • Movies / songs / news that a customer will like
      • Products that a customer will buy
      • Whether a patient will have a heart attack
  • 25. Automate Complex Tasks
      • Virtual assistant
          • Siri
          • Google Now
      • Autonomous machine
          • Self-driving car
          • Robots
      • Tag images
          • Facebook
          • Flickr
      • Expert system
          • Medical diagnosis
          • Personalized medicine
      • Security
          • Fraud detection
          • Network security
      • Music recognition
          • Shazam
          • SoundHound
  • 29. File Formats
      • Text
          • CSV
          • JSON
          • XML
      • Binary
          • Sequence File
          • Avro
          • Parquet
          • Optimized Row Columnar (ORC)
  • 30. Distributed SQL Query Engine
      • Hive
      • Spark SQL
      • Impala
      • Presto
      • Drill
      • Phoenix
      • HAWQ
      • Tajo
      Diagram labels: Data Warehouse, Distributed Storage, Distributed Query Engine
  • 34. Publish-Subscribe / Messaging Systems
      • Kafka
      • RabbitMQ
      • ActiveMQ
      • ZeroMQ
  • 35. Big Data Computing Frameworks
      • Batch
          • Hadoop MapReduce
          • HPCC
      • Stream
          • Kafka Streams
          • Heron
          • Storm
          • Samza
      • Batch and Stream
          • Spark
          • Flink
          • Beam
          • Apex
          • Ignite
  • 37. Kafka
      • Distributed publish-subscribe messaging system
      • Partitioned and replicated commit log service for building a distributed datastore
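
To illustrate what "partitioned commit log" means, here is a toy in-memory sketch. It is not the Kafka API: the class `ToyPartitionedLog` and its methods are hypothetical names invented for this example, replication is left out entirely, and the CRC32-based partition choice is only a stand-in for a real partitioner. The idea it shows is that each partition is an append-only sequence, every record gets an offset within its partition, and reading does not remove anything.

```python
import zlib
from typing import List, Tuple


class ToyPartitionedLog:
    """Hypothetical in-memory sketch of a partitioned, append-only log.

    Not Kafka: no brokers, no replication, no persistence.
    """

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        # Records with the same key always map to the same partition,
        # so their relative order is preserved within that partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition: int, offset: int) -> List[Tuple[str, str]]:
        # Reading never deletes records; each subscriber keeps its own offset
        # and can re-read from any earlier position.
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    log = ToyPartitionedLog(num_partitions=3)
    part, first = log.publish("sensor-1", "temp=20")
    log.publish("sensor-1", "temp=21")
    print(log.read(part, first))
```

Because subscribers only track an offset, several independent readers can consume the same partition at their own pace.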
  • 42. Hadoop
  • 48. Hadoop is Not a Single Product
  • 49. Hadoop Core Components
  • 53. Adoption of Spark is Growing Rapidly
  • 54. Spark
      Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simpler programming model
  • 55. Benefits
      • Scale
      • Fault-tolerance
      • Abstracts distributed computing
          • Hides the messy details of writing distributed applications
          • Allows developers to just focus on the data processing logic
          • Same code works on a laptop or a cluster of servers
      • Ease-of-use
      • Speed
      • Flexibility
  • 56. Easy To Use
      • Library with an expressive API
          • Scala, Java, Python, R
          • RDD API with 80+ operators (MapReduce has only two)
          • Dataset/DataFrame API
      • Interactive development
          • spark-shell
          • notebooks
  • 57. Integrated Libraries For a Variety of Data Processing Tasks
      • Batch processing
      • Interactive analytics
      • Stream analysis
      • Machine learning
      • Graph analytics
      Diagram labels: Spark Core, Spark SQL, GraphX, Spark Streaming, MLlib
  • 58. Benefits of a Unified Platform
      • Solve a variety of problems with a single toolkit
      • No need to learn different tools for each use case
      • Avoid code and data duplication
      • Achieve operational simplicity
  • 59. Why is Spark Fast
      • Advanced job execution engine
      • Allows applications to cache data in memory
  • 60. Advanced Job Execution Engine
      • Directed Acyclic Graph (DAG) of stages
          • simple job can contain just one stage
          • complex job can contain many stages
          • eliminates expensive operations between multiple jobs
              • synchronization
              • serialization/deserialization
              • disk I/O
      • Lazy operator evaluation
      • Pipelined operations
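
Lazy evaluation and pipelining can be sketched without Spark, using plain Python generators. The `LazyPipeline` class below is a hypothetical single-machine illustration, not Spark's API or implementation: `map` and `filter` only record what to do, and nothing runs until `collect` is called, at which point each element flows through all recorded steps in one pass with no intermediate collection.

```python
from typing import Any, Callable, Iterable, Iterator, List


class LazyPipeline:
    """Hypothetical sketch of lazy, pipelined operators (not the Spark API)."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._ops: List[Callable[[Iterator[Any]], Iterator[Any]]] = []

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Only records the operation; no element is touched yet.
        self._ops.append(lambda it: (fn(x) for x in it))
        return self

    def filter(self, pred: Callable[[Any], bool]) -> "LazyPipeline":
        self._ops.append(lambda it: (x for x in it if pred(x)))
        return self

    def collect(self) -> List[Any]:
        # Chain the generators, then pull elements through all steps at once.
        it: Iterator[Any] = iter(self._source)
        for op in self._ops:
            it = op(it)
        return list(it)


if __name__ == "__main__":
    seen = []

    def double(x: int) -> int:
        seen.append(x)  # records when the function actually runs
        return x * 2

    pipeline = LazyPipeline(range(5)).map(double).filter(lambda x: x > 2)
    print(seen)                # [] because nothing has executed yet
    print(pipeline.collect())  # [4, 6, 8]
```

Deferring execution until a result is requested is what gives an engine the chance to look at the whole chain of operations and plan it as a single unit.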
  • 61. Allows Applications to Cache Data in Memory
      • Minimize disk I/O
      • Reading data from memory is orders of magnitude faster than reading from disk
      • In-memory data sharing across DAGs
          • different jobs can work with the same cached data
  • 62. Why Caching Makes Applications Run Faster (figure; throughput labels: 100 MB/s, 500 MB/s, 10 GB/s)
  • 63. Read Latency Comparison (chart: time in minutes to read 1 TB of data from HDD, SSD, and RAM)
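
The read-time chart can be roughly reproduced from the throughput labels on the preceding slide. The pairing of 100 MB/s with HDD, 500 MB/s with SSD, and 10 GB/s with RAM is an assumption here, since the transcript does not say which figure belongs to which medium; decimal units (1 TB = 10^12 bytes) are also assumed.

```python
TB = 10**12  # bytes, decimal units assumed
GB = 10**9
MB = 10**6

# Assumed mapping of the slide's throughput labels to storage media.
THROUGHPUT_BYTES_PER_S = {
    "HDD": 100 * MB,
    "SSD": 500 * MB,
    "RAM": 10 * GB,
}


def read_time_minutes(data_bytes: float, throughput_bytes_per_s: float) -> float:
    """Minutes needed to read data_bytes sequentially at the given throughput."""
    return data_bytes / throughput_bytes_per_s / 60


if __name__ == "__main__":
    for medium, rate in THROUGHPUT_BYTES_PER_S.items():
        print(f"{medium}: {read_time_minutes(TB, rate):.1f} min to read 1 TB")
```

Under these assumptions, reading 1 TB takes about 167 minutes from HDD, 33 minutes from SSD, and under 2 minutes from RAM, which fits the chart's 0 to 200 minute axis.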
  • 64. Spark Does Not Provide Storage
      • Works with a variety of data sources
      • No need to import data into Spark
      • Scale compute and storage cluster independently
  • 65. Process Data From a Variety of Data Sources, And Many More
  • 66. Spark Does Not Replace Hadoop
  • 67. Hadoop is Optional
  • 68. Ideal Applications
      • Complex data processing
          • multi-step pipeline
      • Iterative algorithm
          • Machine Learning
          • Graph analytics
      • Ad hoc analysis
          • Interactive