The document introduces big data concepts and provides an overview of tools for working with big data. It discusses the characteristics of big data including volume, variety, velocity, and value. Popular tools like Hadoop, HDFS, and the Hadoop ecosystem are explained. The document also covers database history and why many database options now exist for big data.
Big data
1. Introduction to Big Data Survival Guide!
Luan Cestari
February 28, 2014
RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD
2. Please, let me ask ...
● Who has already tested a product or project related to Big Data?
● Who works with Big Data?
3. What are we going to see here
● Demystifying the term "Big Data" and beyond!
  ● What people claim Big Data to be
  ● What the relationship between Big Data and databases is
    ● Some facts about database history
    ● Why are there so many databases available?
    ● How to glue all this stuff together?
      ● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues
4. Why Big Data is important
● Many companies are already dealing with Big Data using open source tools
  ● There is demand for people to work with those tools as developers and analysts
  ● You can also work on the integration between those systems, improve an existing tool, or build the next Big Data tool
5. Why Big Data is important
● When a company uses Big Data tools, its environment can grow very fast and become complex:
  ● Many different clusters (per tenant, per geographic location, or for different versions)
  ● Different technologies for closely related purposes (also due to different team skills or use cases)
  ● Many software integrations, layers to segregate the different aspects, and refactoring due to the fast pace
6. Cool ... but what is Big Data after all?
● Just tons of information isn't enough; it also needs to have:
  ● Velocity
  ● Value
  ● Variety
  ● And Volume
7. More about Volume: How big can it be?
● What is the size of Facebook's daily batch job? 100 GB? 10,000 GB? 100,000 GB?
● Answer: 104,857,600 gigabytes of user logs
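To put the figure above in perspective, the short sketch below converts it to larger units. It assumes binary (1024-based) units, which the slide does not state; the variable names are illustrative only.

```python
# Assumption: binary units (1024 GB per TB, 1024 TB per PB).
# The slide does not say which convention it uses.
GB_PER_TB = 1024
TB_PER_PB = 1024

daily_log_gb = 104_857_600  # figure quoted on the slide

daily_log_tb = daily_log_gb / GB_PER_TB
daily_log_pb = daily_log_tb / TB_PER_PB

print(f"{daily_log_gb:,} GB = {daily_log_tb:,.0f} TB = {daily_log_pb:,.0f} PB")
```

Under that assumption the daily batch works out to a round 100 PB, which suggests the slide's number was derived from a petabyte figure.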
8. More about Variety: Where does the data come from?
● Customer-generated content
● M2M
● Sensors
● B2B
● B2C
● Social networks
● And other devices: mobile phones, set-top boxes, security cameras
9. More about Value
● Value comes from processing the data in a reasonable period of time, so that you can forecast something. For that you will need some data scientists, so they can do:
  ● Analysis (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)
10. More about Value
● Value comes from processing the data in a reasonable period of time, so that you can forecast something. For that you will need some data scientists, so they can:
  ● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)
11. More about Value
● So the value is in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over your competitors
● Open source also helps deliver that value in a cost-effective way, instead of buying tons of expensive tools
12. ... and the Velocity
● This is a very interesting point, because different analyses may require different time frames:
  ● A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes through the city
  ● The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job
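The traffic example above contrasts two processing styles. The sketch below is a minimal illustration in plain Python, not tied to any streaming or batch framework; the readings, `batch_average`, and `StreamingAverage` are hypothetical names and values made up for the example.

```python
from statistics import mean

# Hypothetical speed readings (km/h) for one road segment; illustrative values only.
readings = [42.0, 38.5, 15.0, 12.5, 40.0, 44.0]


def batch_average(history):
    """Offline batch: scan the whole stored history in one pass."""
    return mean(history)


class StreamingAverage:
    """Streaming: keep a running mean, updated as each reading arrives."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, reading):
        self.count += 1
        # Incremental mean: past readings do not need to be stored.
        self.value += (reading - self.value) / self.count
        return self.value


stream = StreamingAverage()
for r in readings:
    current = stream.update(r)  # an answer is available after every reading

batch_result = batch_average(readings)  # an answer only after all data is collected
print(current, batch_result)
```

Both paths reach the same average here; the difference is when the answer becomes available and how much data has to be stored to compute it.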
13. ... and the Velocity
● The main point is that there is no silver bullet for this: different storage systems may be required for the different services you aim to provide
14. SQL History
● Hierarchical databases in the 1960s
● Then relational databases in the 1980s, which until a couple of years ago were the only solution used in most enterprises
  ● Big companies used to buy expensive, specialized data warehouse (DW) database systems to analyze their data
15. ... and now
16. ... and now
17. Again the reason for that
● For example, web analysis at Facebook:
  ● 240+ billion photos
  ● 1+ trillion connections
  ● 1+ billion users
  ● 22% of references of the Internet
● Harvard Business Review:
  ● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours
  ● 2012: 2.5 exabytes created per day
18. We need to avoid the Golden Hammer/Silver Bullet anti-pattern
19. The Hadoop ecosystem saves the day
● Open source projects that help you deal with Big Data
● No need for vertical scaling (big machines); you can use a cluster of commodity machines and achieve even better results
● Parallel processing
● Fault-tolerant jobs
● Redundant and distributed data (to survive disk failures and to avoid moving data around)
● Less complex programming model
● Low-level native libraries for high performance
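The programming model mentioned above is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) in a single Python process to show the shape of the model; it does not use the Hadoop API, and the function names and sample input are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or file block) could be mapped on a different machine.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


lines = ["big data tools", "big data big value"]
counts = word_count(lines)
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, the framework is free to run them in parallel and to re-run a failed piece without restarting the whole job.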
20. The Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
  ● This is why Hadoop is not alone: many different projects integrate with it
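One reason small files hurt is that the HDFS NameNode keeps the metadata for every file and block in memory. The back-of-the-envelope sketch below compares storing the same 1 TiB as large files and as small files. The 128 MiB block size and the figure of roughly 150 bytes per metadata object are assumptions (a common configuration and a commonly quoted rule of thumb), not values from the slides, and `namenode_objects` is a hypothetical helper.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # assumed block size: 128 MiB
BYTES_PER_OBJECT = 150      # assumed rule of thumb, not a measured value


def namenode_objects(file_count, file_size):
    """Each file costs one file object plus one object per block it occupies."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


total = 1024**4  # 1 TiB of data in both scenarios

large = namenode_objects(total // 1024**3, 1024**3)          # 1 GiB files
small = namenode_objects(total // (100 * 1024), 100 * 1024)  # 100 KiB files

for label, objects in (("1 GiB files", large), ("100 KiB files", small)):
    print(f"{label}: {objects:,} objects, ~{objects * BYTES_PER_OBJECT / 1024**2:,.1f} MiB of metadata")
```

Under these assumptions the same amount of data needs over two thousand times more metadata objects when it is stored as small files.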
21. The Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
  ● This is why Hadoop is not alone: many different projects integrate with it
  ● Several big companies offer Hadoop and other projects as one big product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
22. The Hadoop ecosystem saves the day
● Cloudera: CDH
23. The Hadoop ecosystem saves the day
● Cloudera:
  ● How to create this whole stack with minimum effort: Cloudera Manager
24. The Hadoop ecosystem saves the day
● Hortonworks: HDP
25. The Hadoop ecosystem saves the day
● Hortonworks:
  ● They use Ambari to manage the cluster, as Cloudera Manager does
  ● They also have Tez to speed up workloads
26. The Hadoop ecosystem saves the day
● And more tools:
  ● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example, across tenants or clouds)
  ● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more
27. The Hadoop ecosystem saves the day
● There are more tools for specific cases, such as the Spark ecosystem for low latency
28. The Hadoop ecosystem saves the day
● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, and Google MillWheel
29. Integration with other systems will be complex
● An overview:
30. A different approach: Lambda Architecture
● An idea from the Twitter team (such as Nathan Marz) about how to deal with Big Data systems
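The slide names the Lambda Architecture without describing it. As generally presented, it combines a batch layer that periodically recomputes views from an append-only master dataset with a speed layer that covers the events received since the last batch run, and queries merge the two. The sketch below is a toy, single-process illustration of that idea; `LambdaCounter` and its methods are hypothetical names, and no real Big Data tool is involved.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch view and a real-time view."""

    def __init__(self):
        self.master = []             # append-only log of every event ever seen
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, page):
        # Every event goes to the master dataset and to the speed layer.
        self.master.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Batch layer: recompute the view over all data, then reset the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, page):
        # Serving: merge the (possibly stale) batch view with the recent events.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaCounter()
for page in ("home", "home", "about"):
    views.ingest(page)
views.run_batch()
views.ingest("home")  # arrives after the batch run
print(views.query("home"))
```

The batch recomputation keeps the system simple and tolerant of bugs in the incremental path, while the speed layer keeps answers fresh between batch runs.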
33. Please, let me ask ... (notes)
Scalable, portable, on-demand, resource management, measurable
34. What are we going to see here (notes)
The difference in http://www.slideshare.net/CAinc/cloud-expo-session-fromvirtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud
53. The Hadoop ecosystem saves the day (notes)
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
55. The Hadoop ecosystem saves the day (notes)
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
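The Oozie notes above describe a workflow as a DAG of actions. The sketch below shows what that means in practice: given which actions depend on which, a scheduler can derive a valid run order. It uses Python's standard `graphlib` module rather than Oozie itself, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
    "archive_raw": {"import_logs"},
}

# A topological order runs every action only after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Because the graph is acyclic, such an order always exists; a cycle would make `TopologicalSorter` raise `graphlib.CycleError`, which is why workflow definitions are restricted to DAGs.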
57. The Hadoop ecosystem saves the day (notes)
Apache Whirr is a set of libraries for running cloud services.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Open MPI is an open source implementation of MPI, a standardized API typically used for parallel and/or distributed computing.