Introduction to Big Data
- 1. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE 1
Mohammed Guller
Oct 02, 2016
Introduction to Big Data
- 2.
Agenda
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 3.
About Me
• Engineering Manager / Principal Architect at Glassbeam
• Founded two startups
• Passionate about building products, big data analytics, and
machine learning
• www.linkedin.com/in/mohammedguller
• @MohammedGuller
- 4.
Big Data Analytics with Spark
• Hands-on guide with lots of examples
• Covers both fundamental and advanced topics such as machine learning
• Includes a primer on functional programming and Scala
• Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
Available on Amazon
- 5.
About Glassbeam
Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence
Cloud-based analytics platform that helps organizations turn raw machine data into insights
Making sense of multi-structured machine data
Data from any machine
• Data center devices
• Medical devices
• Sensors
• ATMs
• Automobiles
Providing a comprehensive set of apps and tools for machine data analysis
50,000+ systems being tracked today
1,500+ different software rev codes
1.2 billion sensor readings per day
1+ trillion sensor readings tracked
- 6.
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 7.
Data Growing At a Faster Pace Than Ever
- 8.
Internet of Things (IoT)
• Network of objects embedded with
software for collecting and sending data
over the Internet
• 5x more connected things than people by
2020
- 9.
Industrial IoT
• Manufacturing
• Automotive
• Medical
• Data Center
• EVC
• Smart Meter
Glassbeam's target market is focused on driving operational and business analytics value for connected product companies in the Industrial IoT market
IT & Networks
Medical & Health Care
Transportation
EV Chargers & Smart Grid
Industrial & Mfg
- 10.
Key Attributes of Big Data
Volume
Scale of Data
Variety
Diversity of Data
Velocity
Speed of Data
- 11.
Big Data Comes with Big Challenges
• Storage
• Processing
• Value
- 12.
Storage Challenges
• Legacy SAN / NAS storage devices are expensive
• Traditional RDBMS were not designed for Big Data
• Cannot handle volume, velocity, variety of Big Data
- 13.
Processing Challenges
• Diverse processing
  • Organizations want to do more than just BI / traditional analytics
  • Go beyond SQL queries
• Timeliness
  • Process data in a reasonable amount of time
  • Value of data decreases over time
- 14.
How Much Data Can a Standard Server Process
- 16.
Scale-up with Powerful High-end Server
• Large number of CPUs / cores
• Faster cores
• Large amount of memory
• Faster memory bus
• High-performance architecture
- 17.
Disadvantages of Scale-up Architecture
• Proprietary
• Expensive
• Limited scalability
- 18.
Scale-out Architecture
• Cluster of servers
• Commodity machines
• Pool together resources
  • CPU
  • Memory
  • Disk
- 19.
Benefits of Scale-out Architecture
• Relatively inexpensive
• Economical to scale
  • No huge upfront investment
  • Start small and expand cluster as workload increases
- 20.
Challenges With Scale-out Architecture
• Writing distributed applications is very hard
  • Split job into chunks that can be distributed across a cluster
  • Schedule compute resources among different jobs
  • Manage inter-node communication
  • Handle network and node failures
• Hardware failures are more common at a cluster level
  • Probability of a single node failing is low
  • Probability of any one node in a large cluster failing is high
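The last two bullets follow from basic probability. As a rough sketch, assume every node fails independently with the same probability p over some period (an assumption made for illustration; the 1% value below is hypothetical and not from the slides). The chance that at least one of n nodes fails is then 1 - (1 - p)^n:

```python
def prob_any_node_fails(p_single: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes fails.

    Assumes node failures are independent and equally likely,
    which is a simplification of real clusters.
    """
    return 1.0 - (1.0 - p_single) ** num_nodes


if __name__ == "__main__":
    p = 0.01  # hypothetical per-node failure probability
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} nodes: {prob_any_node_fails(p, n):.4f}")
```

Under these assumptions a single node rarely fails, while a cluster of a thousand such nodes almost certainly sees at least one failure in the same period.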
- 21.
Getting Value Out of Big Data
• Traditional analytics / BI
• Custom processing
• Machine Learning
  • Predictive analytics
  • Automate complex tasks
• Stream processing
  • Analyze in real time / near real time
  • React in real time
- 22.
Traditional Analytics / BI
• What
  • Customer growth for the last month/quarter/year
  • Segmentation of customers by demographics
  • Average time spent by mobile app users
• Why
  • Sales growth slowed
    • regional issue
    • supply issue
  • Profit dropped
    • revenue dropped
    • expenses increased
- 23.
Custom Processing
• Index web pages
  • Google
  • Bing
• Process genome data
  • Identify mutations linked to cancer, Alzheimer's, and other diseases
• Click analysis
• Log analysis
• 360-degree real-time view of a customer
- 24.
Predictive Analytics
• Advertisements that a visitor will most likely click
• Movies / songs / news that a customer will like
• Products that a customer will buy
• Whether a patient will have a heart attack
- 25.
Automate Complex Tasks
• Virtual assistant
  • Siri
  • Google Now
• Autonomous machine
  • Self-driving car
  • Robots
• Tag Images
  • Facebook
  • Flickr
• Expert System
  • Medical diagnosis
  • Personalized medicine
• Security
  • Fraud detection
  • Network Security
• Music recognition
  • Shazam
  • SoundHound
- 26.
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 27.
- 28.
- 29.
File Formats
• Text
  • CSV
  • JSON
  • XML
• Binary
  • Sequence File
  • Avro
  • Parquet
  • Optimized Row Columnar (ORC)
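To make the difference between these formats concrete, here is a minimal sketch using only the Python standard library. It writes the same records as CSV and as JSON lines, then pivots them into a column-oriented layout, which is the general idea behind columnar formats such as Parquet and ORC. The sample records and function names are hypothetical, and the real binary formats add schemas, compression, and encoding that this sketch leaves out.

```python
import csv
import io
import json

# Hypothetical sensor records, used only for illustration.
records = [
    {"device": "a1", "temp": 20.5},
    {"device": "b2", "temp": 21.0},
]


def to_csv(rows):
    """Serialize rows as CSV text: compact, but field types are not preserved."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["device", "temp"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_lines(rows):
    """Serialize rows as one JSON object per line: self-describing, but more verbose."""
    return "\n".join(json.dumps(row) for row in rows)


def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    return {key: [row[key] for row in rows] for key in rows[0]}


if __name__ == "__main__":
    print(to_csv(records))
    print(to_json_lines(records))
    print(to_columns(records))
```

In the column-oriented layout, a query that needs only `temp` can read that one list and skip the other fields, which is why columnar formats suit analytical queries.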
- 30.
Distributed SQL Query Engine
• Hive
• Spark SQL
• Impala
• Presto
• Drill
• Phoenix
• HAWQ
• Tajo
[Diagram: Data Warehouse, Distributed Storage, Distributed Query Engine]
- 31.
- 34.
Publish – Subscribe / Messaging Systems
• Kafka
• RabbitMQ
• ActiveMQ
• ZeroMQ
- 35.
Big Data Computing Frameworks
• Batch
  • Hadoop MapReduce
  • HPCC
• Stream
  • Kafka Streams
  • Heron
  • Storm
  • Samza
• Batch and Stream
  • Spark
  • Flink
  • Beam
  • Apex
  • Ignite
- 36.
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 37.
Kafka
• Distributed publish-subscribe messaging system
• Partitioned and replicated commit log service for building a distributed datastore
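The "partitioned commit log" idea can be illustrated with a small in-memory sketch. This is not Kafka's API: the class and method names are hypothetical, replication is left out, and the CRC32-based partitioner is chosen here only because it is deterministic. Each partition is an append-only list, and a record's position in that list is its offset.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (no replication, no persistence)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record; records with the same key land in the same partition."""
        # Hypothetical partitioner: a deterministic hash of the key.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_records=10):
        """Read records from a partition starting at offset; the log is not modified."""
        return self.partitions[partition][offset:offset + max_records]


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=3)
    p, first = log.append("device-1", "temp=20.5")
    log.append("device-1", "temp=21.0")
    print(log.read(p, first))
```

Because reading does not remove records, several subscribers can each keep their own offset and consume the same partition independently, which is what lets a log serve as a publish-subscribe system.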
- 38.
- 39.
- 40.
- 41.
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 43.
- 44.
- 48.
Hadoop is Not a Single Product
- 49.
Hadoop Core Components
- 50.
Big Data
Big Data Technologies
Kafka
Hadoop
Spark
- 52.
- 53.
Adoption of Spark is Growing Rapidly
- 54.
Spark
Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simpler programming model
- 55.
Benefits
• Scale
• Fault-tolerance
• Abstracts distributed computing
  • Hides the messy details of writing distributed applications
  • Allows developers to just focus on the data processing logic
  • Same code works on a laptop or a cluster of servers
• Ease-of-use
• Speed
• Flexibility
- 56.
Easy To Use
• Library with an expressive API
  • Scala, Java, Python, R
  • RDD API with 80+ operators (MapReduce has only two)
  • Dataset/DataFrame API
• Interactive development
  • spark-shell
  • notebooks
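For context on the "only two operators" remark, here is a plain-Python sketch of a word count written in the map and reduce style. It is a stand-in for illustration only and uses neither the Hadoop nor the Spark API; all function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: reduce(lambda a, b: a + b, counts) for word, counts in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big Data tools"]))
```

With only map and reduce available, steps such as filtering, joining, or sorting have to be expressed through these two functions; an API with a richer operator set lets each of those steps be a single call.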
- 57.
Integrated Libraries For a Variety of DP Tasks
• Batch processing
• Interactive analytics
• Stream analysis
• Machine learning
• Graph analytics
[Diagram: Spark Core, Spark SQL, GraphX, Spark Streaming, MLlib]
- 58.
Benefits of a Unified Platform
• Solve a variety of problems with a single toolkit
• No need to learn different tools for each use case
• Avoid code and data duplication
• Achieve operational simplicity
- 59.
Why is Spark Fast
• Advanced job execution engine
• Allows applications to cache data in memory
- 60.
Advanced Job Execution Engine
• Directed Acyclic Graph (DAG) of stages
  • simple job can contain just one stage
  • complex job can contain many stages
  • eliminates expensive operations between multiple jobs
    • synchronization
    • serialization/deserialization
    • disk I/O
• Lazy operator evaluation
• Pipelined operations
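Lazy evaluation and pipelining can be sketched with Python generators. This is an analogy, not Spark code: the function names and the `trace` list are hypothetical and exist only to show when work actually happens. Building the pipeline does nothing; work starts when a result is requested, and each element flows through all operators before the next element is read.

```python
def traced_source(n, trace):
    """Yield 0..n-1, recording each read in trace."""
    for i in range(n):
        trace.append(("read", i))
        yield i


def lazy_map(fn, items, trace):
    """Apply fn to each item on demand, recording each call."""
    for x in items:
        trace.append(("map", x))
        yield fn(x)


def lazy_filter(pred, items, trace):
    """Pass through only the items that satisfy pred, recording each kept item."""
    for x in items:
        if pred(x):
            trace.append(("keep", x))
            yield x


trace = []
pipeline = lazy_filter(
    lambda x: x % 2 == 0,
    lazy_map(lambda x: x * 3, traced_source(4, trace), trace),
    trace,
)
steps_before_action = len(trace)  # building the pipeline runs no operator
result = list(pipeline)           # the "action" that triggers execution

if __name__ == "__main__":
    print(steps_before_action, result)
    print(trace)
```

The trace shows element 0 being read, mapped, and kept before element 1 is read at all, so no intermediate collection is materialized between the operators.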
- 61.
Allows Applications to Cache Data in Memory
• Minimize disk I/O
• Reading data from memory is orders of magnitude faster than reading from disk
• In-memory data sharing across DAGs
  • different jobs can work with the same cached data
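The data-sharing point can be sketched in a few lines of plain Python. This is not Spark's caching API: `CachedDataset` and `slow_loader` are hypothetical names, and the loader merely stands in for an expensive read from disk.

```python
class CachedDataset:
    """Loads data once on first use and keeps it in memory for later jobs."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loads = 0  # how many times the underlying source was read

    def get(self):
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data


def slow_loader():
    """Stand-in for an expensive read from disk."""
    return range(1, 6)


dataset = CachedDataset(slow_loader)
total = sum(dataset.get())    # first job: triggers the load
maximum = max(dataset.get())  # second job: reuses the in-memory copy

if __name__ == "__main__":
    print(total, maximum, dataset.loads)
```

Both jobs run against the same in-memory copy, so the expensive read happens once regardless of how many jobs follow.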
- 62.
Why Caching Makes Applications Run Faster
[Figure: throughput of 100 MB/s, 500 MB/s, and 10 GB/s]
- 63.
Read Latency Comparison
[Chart: time in minutes (axis from 0 to 200) to read 1 TB of data from HDD, SSD, and RAM]
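The chart's values can be approximated with simple arithmetic. The sketch below assumes that the three throughput figures from the caching slide correspond to HDD, SSD, and RAM in that order (an inference from their ordering, not stated in the slides), and that units are decimal (1 TB = 10^12 bytes).

```python
TB = 10**12
GB = 10**9
MB = 10**6

# Assumed pairing of medium and sequential throughput (see note above).
throughput = {
    "HDD": 100 * MB,
    "SSD": 500 * MB,
    "RAM": 10 * GB,
}


def read_minutes(num_bytes, bytes_per_second):
    """Time in minutes to read num_bytes at a constant throughput."""
    return num_bytes / bytes_per_second / 60


if __name__ == "__main__":
    for medium, rate in throughput.items():
        print(f"{medium}: {read_minutes(TB, rate):.1f} min")
```

Under these assumptions, reading 1 TB takes roughly 167 minutes from HDD, 33 minutes from SSD, and under 2 minutes from RAM, which is consistent with a chart axis running from 0 to 200 minutes.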
- 64.
Spark Does Not Provide Storage
• Works with a variety of data sources
• No need to import data into Spark
• Scale compute and storage cluster independently
- 65.
Process Data From a Variety Of Data Sources
And Many More
- 66.
Spark Does Not Replace Hadoop
- 67.
Hadoop is Optional
- 68.
Ideal Applications
• Complex data processing
  • multi-step pipeline
• Iterative algorithm
  • Machine Learning
  • Graph analytics
• Ad hoc analysis
  • Interactive