Introduction to Big Data Survival Guide!

Luan Cestari
February 28, 2014

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD
Please, let me ask ...
● Who has already tested a product or project related to Big Data?
● Who works with Big Data?

What we are going to see here
● Demystifying the term "Big Data" and beyond!
  ● What people claim Big Data to be
  ● The relationship between Big Data and databases
    ● Some facts about database history
    ● Why are there so many databases available?
● How to glue all this stuff together?
  ● Some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues

Why Big Data is important
● Many companies are already dealing with Big Data using open source tools
  ● There is demand for people to work with those tools as developers and analysts
  ● You can also work on integrating those systems, on improving an existing tool, or on building the next Big Data tool

Why Big Data is important
● When a company uses Big Data tools, its environment can grow very fast and become complex:
  ● Many different clusters (due to tenants, geographic location, or different versions)
  ● Different technologies for closely related purposes (also due to different team skills or use cases)
  ● Many software integrations, layers to segregate the different aspects, and refactoring due to the fast pace

Cool ... but what is Big Data after all?
● Just tons of information isn't enough; it also needs to have:
  ● Velocity
  ● Value
  ● Variety
  ● And Volume

More about Volume: how big can it be?
● What is the size of Facebook's daily batch job? 100 GB? 10,000 GB? 100,000 GB?
  ● Answer: 104,857,600 gigabytes of user logs

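To put the Facebook figure in perspective, it can be converted to larger units. This is a minimal sketch that assumes binary units (1 TB = 1024 GB, 1 PB = 1024 TB); the constant and function names are illustrative and do not come from the slides.

```python
# Illustrative unit conversion; 1024-based multiples are an assumption.
DAILY_LOG_GB = 104_857_600  # figure quoted on the slide


def gb_to_pb(gigabytes: int) -> float:
    """Convert gigabytes to petabytes using 1024-based units."""
    return gigabytes / 1024 / 1024


print(gb_to_pb(DAILY_LOG_GB))  # 100.0
```

Under that assumption the quoted number is exactly 100 petabytes, far beyond any of the three guesses offered in the question.
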
More about Variety: where does the data come from?
● Customer-generated content
● M2M
● Sensors
● B2B
● B2C
● Social networks
● And other devices: mobile phones, set-top boxes, security cameras

More about Value
● Value is about processing the data in a reasonable period of time, so you can forecast something. For that you will need data scientists, who can do:
  ● Analysis (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing)

More about Value
● Value is about processing the data in a reasonable period of time, so you can forecast something. For that you will need data scientists, who can:
  ● Find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing)

More about Value
● So the value lies in the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over your competitors
● Open source also helps deliver that value in a cost-effective way, instead of buying tons of expensive tools

... and the Velocity
● This is a very interesting point, because different analyses may require different time frames:
  ● A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes through the city
  ● The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job

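The two traffic cases differ mainly in when an answer becomes available. The sketch below contrasts them on a toy list of speed readings: the batch function answers once after reading everything, while the streaming function emits an updated estimate after each reading. The function names and the sample readings are hypothetical and only illustrate the idea.

```python
from typing import Iterable, Iterator, List


def batch_average(speeds: Iterable[float]) -> float:
    """Offline batch: read the whole history, then answer once."""
    values: List[float] = list(speeds)
    return sum(values) / len(values)


def streaming_average(speeds: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated estimate after every new reading."""
    count = 0
    total = 0.0
    for speed in speeds:
        count += 1
        total += speed
        yield total / count


readings = [40.0, 60.0, 20.0, 80.0]  # hypothetical speeds in km/h
print(batch_average(readings))            # 50.0
print(list(streaming_average(readings)))  # [40.0, 50.0, 40.0, 50.0]
```

Both end at the same value, but only the streaming version has something useful to say while the data is still arriving.
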
... and the Velocity
● The main point is that there is no silver bullet for this; different storage systems may be required for the different services you aim to provide

SQL History
● Hierarchical databases in the '60s
● Then relational databases in the '80s, which until a couple of years ago were the only solution used in most enterprises
  ● Big companies used to buy expensive, specialized DW (data warehouse) database systems to analyze their data

... and now

Again, the reason for that
● For example, web analytics at Facebook:
  ● 240+ billion photos
  ● 1+ trillion connections
  ● 1+ billion users
  ● 22% of references on the Internet
● Harvard Business Review:
  ● A change from a DW to a Big Data system made a 96-hour job run in just 4 hours
  ● In 2012, 2.5 exabytes of data were created per day

We need to avoid the golden hammer / silver bullet anti-pattern

The Hadoop ecosystem saves the day
● Open source projects that help you deal with Big Data
● No need for vertical scaling (big machines); you can use a cluster of commodity machines and achieve even better results
  ● Parallel processing
  ● Fault-tolerant jobs
  ● Redundant and distributed data (to survive disk failures and to avoid moving data around)
● Less complex programming model
● It has low-level native libraries for high performance

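The "less complex programming model" refers to MapReduce: you write a map function and a reduce function, and the framework handles distribution, grouping, and retries. The following is a minimal single-process sketch of that model in plain Python. It does not use any Hadoop API, and all function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory data."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is everywhere"]
print(word_count(lines))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls would run in parallel on the machines that already hold each block of input, which is the point of "avoid moving data around".
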
The Hadoop ecosystem saves the day
● But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =(
● Well, there is no silver bullet; we need more tools
  ● This is why Hadoop is not alone: many different projects integrate with it
  ● Several big companies offer Hadoop and other projects as one big product, and they help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support

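One reason small files hurt HDFS is that the NameNode keeps the metadata for every file and block in memory. The rough estimate below assumes a commonly quoted rule of thumb of about 150 bytes per metadata object; that number is an assumption for illustration, not a figure from the slides, and real usage varies by version and configuration.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb assumption, not a measured value


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate NameNode memory: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Ten million small files, each fitting in a single block.
print(namenode_bytes(10_000_000) / 1e9)  # 3.0 (gigabytes)
```

The estimate depends only on the number of files and blocks, not on their size, so the same data stored as fewer, larger files needs far less NameNode memory.
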
The Hadoop ecosystem saves the day
● Cloudera: CDH

The Hadoop ecosystem saves the day
● Cloudera:
  ● How to create this whole stack with minimum effort: Cloudera Manager

The Hadoop ecosystem saves the day
● Hortonworks: HDP

The Hadoop ecosystem saves the day
● Hortonworks:
  ● They use Ambari to manage the cluster, like Cloudera Manager does
  ● They also have Tez to speed up workloads

The Hadoop ecosystem saves the day
● And more tools:
  ● You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example, across tenants or clouds)
  ● Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more

The Hadoop ecosystem saves the day
● There are more tools for specific cases, like low latency with the Spark ecosystem

The Hadoop ecosystem saves the day
● But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, Google MillWheel

The integration with other systems will be complex
● An overview:

A different approach: Lambda Architecture
● An idea from the Twitter team (Nathan Marz, for example) about how to deal with Big Data systems

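The core idea of the Lambda Architecture is to answer queries by merging a view precomputed in batch over all historical data with a small real-time view covering only what arrived since the last batch run. The sketch below illustrates that merge with page-view counts; the function names and sample events are hypothetical, and a real system would use separate batch and stream processing tools for the two layers.

```python
from collections import Counter
from typing import Iterable


def build_batch_view(master_events: Iterable[str]) -> Counter:
    """Batch layer: recompute the view from the full, immutable dataset."""
    return Counter(master_events)


def build_realtime_view(recent_events: Iterable[str]) -> Counter:
    """Speed layer: cover only events that arrived after the last batch run."""
    return Counter(recent_events)


def query(page: str, batch_view: Counter, realtime_view: Counter) -> int:
    """Serving step: merge both views to answer a query."""
    return batch_view[page] + realtime_view[page]


batch_view = build_batch_view(["/home", "/about", "/home"])
realtime_view = build_realtime_view(["/home"])
print(query("/home", batch_view, realtime_view))  # 3
```

Once the next batch run absorbs the recent events, the real-time view can be discarded and rebuilt from scratch, which keeps the speed layer small.
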
Questions?

Speaker notes:
● Scalable
● Portable
● On-demand
● Resource Management
● Measurable

The difference in http://www.slideshare.net/CAinc/cloud-expo-session-fromvirtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud

Speaker notes (Cloudera: CDH): Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Speaker notes (Hortonworks: HDP): Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Speaker notes (more tools):
● Apache Whirr is a set of libraries for running cloud services.
● The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
● MPI is a standardized API typically used for parallel and/or distributed computing; Open MPI is an open source implementation of it.
26
Hadoop ecosystem save the day
●

27

There more tools for specific cases, like low latency
with Spark ecosystem

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD

Apache Whirr is a set of libraries for running cloud
services.

27
Hadoop ecosystem save the day
●

28

But you can also use other tools for low latency such
as Twitter Storm, Yahoo S4, Linkedin Samza (or
Kafka), Amazon Kinesis, Google Millwheel

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD

Apache Whirr is a set of libraries for running cloud
services.

28
The integration with other system will be complex
●

29

An overview:

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD

29
A different approach: Lambda Architecture
●

30

Idea from Twitter Team (like Nathan Marz) about how
to deal with Big Data Systems

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD

30
Questions?

31

RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD

presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Big data

  • 1. Introduction to Big Data Survival Guide! Luan Cestari, February 28, 2014. RED HAT ENTERPRISE LINUX – FOUNDATION FOR THE OPEN HYBRID CLOUD
  • 2. Please, let me ask ... Who has already tested a product or project related to Big Data? Who works with Big Data?
  • 3. What we are going to see here: Demystifying the term "Big Data" and beyond. What people claim Big Data to be. The relationship between Big Data and databases: some facts about database history, and why there are so many databases available. How to glue all of this together: some well-known Hadoop ecosystem tools that cover a very wide range of Big Data issues.
  • 4. Why Big Data is important: Many companies are already dealing with Big Data using Open Source tools. There is demand for people to work with those tools as developers and analysts. You can also work on integration between those systems, on improving an existing tool, or on building the next Big Data tool.
  • 5. Why Big Data is important: When a company uses Big Data tools, its environment can grow very fast and become complex: many different clusters (due to tenancy, geographic location, or different versions); different technologies for closely related purposes (also due to different team skills or use cases); many software integrations, layers to segregate the different aspects, and refactoring due to the fast pace.
  • 6. Cool ... but what is Big Data after all? Just tons of information isn't enough; it also needs to have Variety, Velocity, Value, and Volume.
  • 7. More about Volume: how big can it be? What is the size of a daily batch job at Facebook: 100 GB, 10,000 GB, or 100,000 GB? Answer: 104,857,600 gigabytes of user logs.
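To put the answer on slide 7 in perspective, the figure can be converted into larger units. The sketch below assumes the slide uses binary units (1 TB = 1024 GB, 1 PB = 1024 TB), which is an assumption and not something the slide states; under that assumption the quoted number works out to exactly 100 PB.

```python
# Assumption (not stated on the slide): binary units, 1 TB = 1024 GB and 1 PB = 1024 TB.
GB_PER_TB = 1024
TB_PER_PB = 1024

daily_log_gb = 104_857_600  # figure quoted on slide 7

daily_log_tb = daily_log_gb / GB_PER_TB
daily_log_pb = daily_log_tb / TB_PER_PB

# Compare with the largest option offered in the quiz (100,000 GB).
ratio_to_largest_guess = daily_log_gb / 100_000

print(f"{daily_log_gb:,} GB = {daily_log_tb:,.0f} TB = {daily_log_pb:,.0f} PB")
print(f"About {ratio_to_largest_guess:,.0f} times the largest quiz option")
```

Even the largest option in the quiz underestimates the real figure by roughly three orders of magnitude.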
  • 8. More about Variety: where does the data come from? Customer-generated content; M2M; sensors; B2B; B2C; social networks; and other devices: mobile phones, set-top boxes, security cameras.
  • 9. More about Value: The value lies in processing the data in a reasonable period of time, so that you can forecast something. Because of that you will need some data scientists to do analysis (finding correlations using statistics, signal processing, machine learning, personas, etc.) using different kinds of tools (SQL, search engines, stream processing).
  • 10. More about Value: The value lies in processing the data in a reasonable period of time, so that you can forecast something. Because of that you will need some data scientists, so they can find correlations using statistical or predictive analytics, signal processing, machine learning, natural language processing, BI, visualization, etc., using different kinds of tools (SQL, search engines, stream processing).
  • 11. More about Value: So the value is the insights generated, which may help you build a better product, make better decisions, or gain a competitive advantage over competitors. Open Source also helps by delivering this value in a cost-effective way, instead of buying tons of expensive tools.
  • 12. ... and the Velocity: This is a very interesting point, because different analyses may require different time frames. A traffic system may need a streaming system to analyze and predict the current traffic and suggest better routes across the city. The same traffic system may need to process several weeks of data to get a good prediction of the average traffic on a road, so that could be an offline batch job.
  • 13. ... and the Velocity: The main point is that there is no silver bullet for this; different storage systems may be required for the different services you aim to provide.
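The traffic example from the Velocity slides can be sketched in a few lines. This is only an illustration under stated assumptions: the road names, the speed readings, and the `StreamingAverage` and `batch_average` helpers are hypothetical and do not come from the talk. The streaming path keeps a small window of recent readings and answers immediately; the batch path scans the whole history at once.

```python
from collections import defaultdict, deque


class StreamingAverage:
    """Hypothetical helper: mean of the most recent `window` readings for one road."""

    def __init__(self, window):
        self.readings = deque(maxlen=window)  # old readings fall out automatically

    def update(self, speed_kmh):
        self.readings.append(speed_kmh)
        return sum(self.readings) / len(self.readings)


def batch_average(history):
    """Hypothetical helper: mean speed per road over the full history (offline job)."""
    totals = defaultdict(lambda: [0.0, 0])  # road -> [sum of speeds, count]
    for road, speed_kmh in history:
        totals[road][0] += speed_kmh
        totals[road][1] += 1
    return {road: total / count for road, (total, count) in totals.items()}


# Made-up readings: (road, speed in km/h), oldest first.
history = [("A1", 60.0), ("A1", 40.0), ("B2", 30.0), ("A1", 20.0)]

live = StreamingAverage(window=2)
latest = [live.update(speed) for road, speed in history if road == "A1"]
long_term = batch_average(history)

print(latest)     # the recent-window view reacts quickly to the slowdown on A1
print(long_term)  # the long-term view smooths it out
```

The two paths answer different questions from the same data, which is why a single storage or processing system rarely fits both.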
  • 14. SQL History: Hierarchical databases in the 60s. Then relational databases in the 80s, which until a couple of years ago were the only solution used in most enterprises. Big companies used to buy expensive, specialized data warehouse (DW) systems to analyze their data.
  • 15. ... and now
  • 16. ... and now
  • 17. Again the reason for that: For example, web analysis at Facebook: +1 billion users; +240 billion photos; +1 trillion connections; 22% of references of the Internet. Harvard Business Review: a change from a DW to a Big Data system made a 96-hour job run in just 4 hours; in 2012, 2.5 exabytes were created per day.
  • 18. We need to avoid the Golden Hammer/Silver Bullet anti-pattern.
  • 19. Hadoop ecosystem saves the day: Open Source projects that help you deal with Big Data. No need for vertical scaling (big machines); you can use a cluster of commodity machines and achieve even better results. Parallel processing. Fault-tolerant jobs. Redundant and distributed data (to survive disk failures and to avoid moving data around). Less complex programming model. It has low-level native libraries for high performance.
  • 20. Hadoop ecosystem saves the day: But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =( Well, there is no silver bullet; we need more tools. This is why Hadoop is not alone: many different projects integrate with it.
  • 21. Hadoop ecosystem saves the day: But the Hadoop file system (HDFS) doesn't handle low-latency requests and small files well =( Well, there is no silver bullet; we need more tools. This is why Hadoop is not alone: many different projects integrate with it. There are several big companies that offer Hadoop and other projects as one large product and that help the community. I will talk a little more about the Hortonworks and Cloudera project sets, as they are very well known, and about how they integrate. Find more at http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support
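The "less complex programming model" on slide 19 refers to MapReduce, which can be illustrated with a word count. The sketch below runs in a single Python process and does not use Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are assumptions made for illustration. On a real cluster the map and reduce calls would run in parallel on different machines, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Made-up input; each line could live on a different machine.
lines = ["big data big tools", "data tools"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across machines without the author of the job writing any coordination code.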
  • 22. Hadoop ecosystem saves the day: Cloudera: CDH
  • 23. Hadoop ecosystem saves the day: Cloudera: how to create this whole stack with minimum effort: Cloudera Manager.
  • 24. Hadoop ecosystem saves the day: Hortonworks: HDP
  • 25. Hadoop ecosystem saves the day: Hortonworks: They use Ambari to manage the cluster, as Cloudera Manager does. They also have Tez to speed up workloads.
  • 26. Hadoop ecosystem saves the day: And more tools: You may use Apache Mesos or Hadoop 2 YARN to better manage and share your services (for example, across tenants or clouds). Apache Bigtop, Fuse-DFS, Apache Crunch, Apache Whirr, Apache Hama, Apache Giraph, Open MPI, Cascading (and its extensions), Weave, and more.
  • 27. Hadoop ecosystem saves the day: There are more tools for specific cases, like low latency with the Spark ecosystem.
  • 28. Hadoop ecosystem saves the day: But you can also use other tools for low latency, such as Twitter Storm, Yahoo S4, LinkedIn Samza (or Kafka), Amazon Kinesis, and Google MillWheel.
  • 29. The integration with other systems will be complex. An overview:
  • 30. A different approach: Lambda Architecture. An idea from the Twitter team (for example, Nathan Marz) about how to deal with Big Data systems.
  • 31. Questions?
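The Lambda Architecture is usually described as three parts: a batch layer that recomputes views from the full master dataset, a speed layer that covers events arriving since the last batch run, and a serving step that merges both at query time. The slide gives no detail, so the sketch below is only a minimal illustration of that general idea; the page-view events and the names `build_batch_view`, `SpeedLayer`, and `query` are hypothetical.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full, immutable dataset."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving step: merge the precomputed batch view with the realtime view."""
    return batch_view[page] + realtime_view[page]


# Made-up events already stored in the master dataset.
master_dataset = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = build_batch_view(master_dataset)

# One new event arrives before the next batch run.
speed = SpeedLayer()
speed.on_event({"page": "/home"})

total_home = query("/home", batch_view, speed.realtime_view)
total_about = query("/about", batch_view, speed.realtime_view)
print(total_home, total_about)
```

When the next batch run absorbs the new events, the realtime view can be discarded, which keeps the incremental code small and lets the batch layer correct any errors it introduced.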
  • 33. Please, let me ask ... Scalable, portable, on-demand, resource management, measurable.
  • 34. What we are going to see here. The difference in http://www.slideshare.net/CAinc/cloud-expo-session-fromvirtualization-to-cloud-computing-building-an-effective-pragmatic-reliable-cloud
  • 53. Hadoop ecosystem saves the day: Cloudera: CDH. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • 55. Hadoop ecosystem saves the day: Hortonworks: HDP. Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
  • 57. Hadoop ecosystem saves the day: And more tools. Apache Whirr is a set of libraries for running cloud services. The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Open MPI is a standardized API typically used for parallel and/or distributed computing.