Making Hadoop Work for Everybody

©MapR Technologies - Confidential

Allen Day, MapR Technologies, Principal Data Scientist

Contact:
– Email: allenday@maprtech.com
– Twitter: @allenday

Slides:
– http://slideshare.net/allenday

Hash tags: #

Hadoop adoption is widespread. What happens next?

Big Data Trends: Where Are We Going?

Big Data Storage → Big Data Applications: don't just store data, mine it to extract the full benefit.
– Real-time processing requirements make low latency important
– Many exciting new technologies are available

Going from academic → practical: take machine learning and advanced analytics from the research lab into business environments.

"Simple algorithms and lots of data trump complex models" [1]: elaborate designs may not give the best business benefits in production. Simplicity is the key to success.

Re-usability shortens time-to-market: use pre-existing components and familiar architectural design patterns to reduce development cost and time.

[1] Halevy, Norvig, and Pereira (Google), "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems

Big Data Trends: Who is Driving?

Web camp
– everything is a service with a URL or a DOM

Big data camp
– non-traditional scalable file systems

Everybody else
– files and databases

But they don't easily work together

This is not a problem. It's an opportunity.

New Technologies: Can They Play Together?

Examples of excellent modern technologies:
– d3.js [1] does real-time, interactive visualization that produces excellent images of data
– node.js [2] makes it simple to write servers (not just web servers)
– Apache Storm [3] does real-time processing
– Hadoop does big data distributed storage really well

But HDFS makes Hadoop stand somewhat alone
– Special steps are needed to ingest and access data on a Hadoop cluster

MapR has changed that . . .

[1] http://d3js.org [2] http://nodejs.org [3] http://incubator.apache.org/storm

Evolution of Data Storage

[Figure: storage systems compared on three axes: Scalability, Functionality, and Compatibility. Labeled systems: Linux/POSIX and Hadoop.]

– Over decades of progress, Unix-based systems have set the standard for compatibility and functionality.
– Hadoop achieves much higher scalability by trading away essentially all of this compatibility.
– MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance.

MapR Data Storage: How it's done
Vertical Integration = High Performance

[Diagram: "implements" and "depends" relationships between three interfaces (HBase NoSQL Tables API, POSIX NFS, Hadoop HDFS API) and three implementations (Apache HBase, Apache Hadoop HDFS, MapR Filesystem).]

Hadoop on MapR No Longer Stands Apart

[Diagram: a MapR cluster connected to]
– Legacy code & applications
– New technologies: d3, node.js, Apache Storm
– Multiple types of data sources
– New custom applications

What does this compatibility mean for you?

Example: visualization of big data

Visualization Gives Data Impact

New technologies include visualization tools like D3.js [1] and server platforms like Node.js [2]
– POSIX tools that run and scale easily on MapR
– http://trends.truliablog.com/vis/metro-movers/

[1] http://d3js.org/ [2] http://nodejs.org

Example: real-time on Hadoop

Sentiment Analysis In Real-time

Business Goal: Who is having a bad experience with my brand and how can I fix it?

What does the result look like?
– Show me now
– On Twitter
– Who it is
– How they feel
– And what product/service they're interacting with
– ALSO show me patterns of feelings related to my products/services

How: real-time big data analytics

Business Goals: From Data to Insight

[Diagram: Twitter, etc. → Processing → Visualization and Reporting → Interpret → Find Value & Execute. The stages are grouped under "Machine analysis" and "Human insight".]

Analytics Architecture: How to Process?

[Diagram repeated: Twitter, etc. → Processing → Visualization and Reporting → Interpret → Find Value & Execute]

Aggregation and Queuing: depends on whether you use MapR or another Hadoop distribution (explained later)

Real-time processing: Apache Storm [1]
– Established open source project for robust, distributed real-time processing

Analytics Architecture: How to Display?

[Diagram repeated: Twitter, etc. → Processing → Visualization and Reporting → Interpret → Find Value & Execute]

Visualization: many choices, e.g. D3.js, Tableau, Processing

Web server: many choices, e.g. node.js, Twisted Web, etc.

Analytics Architecture: End-to-End

[Diagram components: Twitter, Twitter API, TweetLogger, Catcher, Topic Queue, Storm, MapR, NFS, Web Data, Web-server (http)]

Aggregation and Queuing Layer Design

Apache Storm provides the real-time processing framework
– Record-oriented model
– Function()s transform record streams into new record streams
– Distributed, failure-tolerant, and scalable
– No inherent state

MapR provides the real-time processing storage
– Process records and emit values (optionally writing to the file system)
– Records have to be acknowledged or else they will be retransmitted
– Provides failure tolerance

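The record-oriented model with acknowledgement can be illustrated with a minimal sketch in plain Python. This is not the Storm API: AckingSource, split_words, and run are hypothetical names invented here, and the whole thing runs in one process. It only shows the two ideas from the slide: a stateless function turns one record stream into another, and a record that is not acknowledged is sent again.

```python
from collections import deque


def split_words(record):
    """Stateless transform: one tweet text in, zero or more word records out."""
    return [w.lower() for w in record.split() if w.isalpha()]


class AckingSource:
    """Hypothetical in-memory source that re-emits unacknowledged records."""

    def __init__(self, records):
        self.pending = deque(enumerate(records))  # (record id, record)
        self.in_flight = {}                       # emitted but not yet acked

    def next(self):
        if not self.pending:
            return None
        rec_id, rec = self.pending.popleft()
        self.in_flight[rec_id] = rec
        return rec_id, rec

    def ack(self, rec_id):
        # Processing succeeded, so the record can be forgotten.
        self.in_flight.pop(rec_id, None)

    def fail(self, rec_id):
        # Processing failed, so the record is queued for retransmission.
        self.pending.append((rec_id, self.in_flight.pop(rec_id)))


def run(source, transform, flaky_ids=()):
    """Drain the source; ids in flaky_ids fail once to simulate a crash."""
    flaky = set(flaky_ids)
    out = []
    while True:
        item = source.next()
        if item is None:
            break
        rec_id, rec = item
        if rec_id in flaky:
            flaky.discard(rec_id)
            source.fail(rec_id)
            continue
        out.extend(transform(rec))
        source.ack(rec_id)
    return out
```

Because the transform keeps no state, retrying a failed record is safe here; any state that must survive a restart would have to live outside the function, for example on the file system.
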
Real-time on Hadoop demo

Demo for Real-time on Hadoop/MapR

What the application does
– Reads tweets as they happen
– Remembers the top few words
– Makes engaging pictures

Technical requirements
– Handle restarts well
– Be fault tolerant

Best practice design tip:
– Keep it really simple

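As a rough illustration of "remembers the top few words" combined with "handle restarts well", the sketch below keeps word counts in memory and checkpoints them to a file, so a restarted process picks up where it left off. It is an assumption-laden sketch, not the demo's actual code: the TopWords class, its method names, and the JSON state file are all hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


class TopWords:
    """Hypothetical word counter that survives restarts via a state file."""

    def __init__(self, state_path, k=3):
        self.state_path = state_path
        self.k = k
        self.counts = Counter()
        # On restart, reload whatever the previous run checkpointed.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.counts.update(json.load(f))

    def add_tweet(self, text):
        self.counts.update(w.lower() for w in text.split())

    def top(self):
        return self.counts.most_common(self.k)

    def checkpoint(self):
        # Write to a temporary file, then rename it into place, so a crash
        # mid-write never leaves a half-written state file behind.
        directory = os.path.dirname(self.state_path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.counts, f)
        os.replace(tmp, self.state_path)
```

In keeping with the "keep it really simple" tip, the state is a single small file; if that file lives on a replicated file system, the counter needs no fault-tolerance logic of its own.
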
[DEMO:cached]
[DEMO:live]

Hadoop on MapR No Longer Stands Apart

[Diagram components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization]

Importance of a Real-time File System

This application design provides
– A distributed, partitioned, multi-subscriber commit log
– With replication and failure tolerance

This application design is easy to implement, because the hard problems are solved at the platform layer
– No need for replication in the queuing layer
– Failure tolerance is trivial, well-hardened in production
– Performance even with replication is very high

But not all Hadoop distributions include a real-time file system

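To make the idea of a multi-subscriber commit log on a file system concrete, here is a minimal single-partition sketch. It assumes the log is an ordinary append-only file that every subscriber can read (for instance over an NFS mount); FileLog and Subscriber are hypothetical names, messages are assumed not to contain newlines, and replication is assumed to be handled by the file system rather than by this code.

```python
import os


class FileLog:
    """Hypothetical append-only log: one message per line in a single file."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, message):
        with open(self.path, "ab") as f:
            f.write(message.encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())  # push the bytes to storage before returning


class Subscriber:
    """Each subscriber tracks its own byte offset, so readers are independent."""

    def __init__(self, log_path, offset=0):
        self.log_path = log_path
        self.offset = offset

    def poll(self):
        with open(self.log_path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Only consume complete lines; a partially written tail is left
        # for the next poll.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return [m.decode("utf-8") for m in data[:end].split(b"\n")[:-1]]
```

A partitioned version could use one such file per partition. Saving each subscriber's offset to its own small file would let a subscriber resume after a restart.
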
Alternative Queuing Layer - Kafka

Apache Kafka
– Provides distributed, partitioned, multi-subscriber commit log
– As of 0.8 beta, also supports replication of data
– Is well-tested. It is used extensively in production at high volumes

But Kafka requires a separate cluster (not needed with MapR)
– Data must be persisted in multiple clusters (Storm & Kafka)
– Replication capability is new and not well-tested for mission-critical environments
– Failure tolerance is implemented at the application layer. This means

This design does not generalize to other

Without MapR: many clusters

[Diagram components: Twitter, Twitter API, Twitter Scraper, Flume cluster, Hadoop cluster (HDFS Data), Kafka cluster (Kafka API), Storm, Report Data, Web Service NAS, Web-server (http)]

Analytics Architecture: End-to-End

[Diagram components, with MapR: Twitter, Twitter API, TweetLogger, Catcher, Topic Queue, Storm, MapR, NFS, Web Data, Web-server (http)]

Sentiment Analysis In Real-time

Business Goal: Who is having a bad experience with my brand and how can I fix it?

What does the result look like?
– Show me now
– On Twitter
– Who it is
– How they feel
– And what product/service they're interacting with
– ALSO show me patterns of feelings related to my products/services
– ALSO allow retrospective analysis for R&D

How: real-time big data analytics

MapR Data Platform Advantage

[Diagram components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization, R&D Batch Analytics, Other Applications]

Such as . .

Data Warehouse / Hadoop ETL Offload

[Diagram, five numbered steps: source data (Billing Systems, Accounts Receivable, Structured & Unstructured Data) flows into the MapR Distribution for Apache Hadoop, where it goes through Extract, Clean, Transform, and Conform; the resulting Structured Data is loaded into Teradata (Data Warehouse and Analytics), which serves the front-ends: Financial Reporting, Revenue Accounting, Other BI.]

When Hadoop Looks Like a NAS

Data ingestion is easy
– A popular online gaming company changed data ingestion from a complex Flume cluster to a 17-line Python script

Database bulk import/export with standard vendor tools
– A large telecom saved $30M on Teradata costs by pre-processing with MapR

1000s of applications/tools
– A large credit card company uses MapR volumes as the user home directories on the Hadoop gateway servers

[Diagram labels: Logs, Application servers]

$ find . | grep log
$ cp
$ vi results.csv
$ scp
$ tail -f part-00000

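The gaming company's script is not shown in the slides, so the following is only a guess at what a short ingestion script could look like when the cluster is reachable as an ordinary mounted directory. The function name ingest_logs, the "*.log" pattern, and any paths passed to it are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def ingest_logs(source_dir, cluster_dir, pattern="*.log"):
    """Copy new log files into a directory on the mounted cluster.

    Returns the names of the files copied on this run.
    """
    src = Path(source_dir)
    dst = Path(cluster_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for log_file in sorted(src.glob(pattern)):
        target = dst / log_file.name
        # Skip files that already arrived with the same size.
        if target.exists() and target.stat().st_size == log_file.stat().st_size:
            continue
        # Copy under a temporary name, then rename, so readers on the
        # cluster side never see a half-copied file.
        tmp = target.with_name(target.name + ".tmp")
        shutil.copyfile(log_file, tmp)
        tmp.replace(target)
        copied.append(target.name)
    return copied
```

Run from cron or a loop on each application server, a script like this uses only plain file operations, with no ingestion-specific client library.
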
Recap

Exciting times in Hadoop, lots of new and exciting capabilities

Exciting times in the world of web servers, visualization, and real-time

Integration of both worlds is easier than it looks at first
– if you have a real-time filesystem

MapR Data Platform is widely compatible
– Use legacy code & applications without having to re-write
– Access traditional data stores more easily
– Connect to new technologies directly

Thank You!!


20131118 - Seoul - Advanced Computing Conference 10 - New Trends In Hadoop

  • 1. Making Hadoop Work for Everybody ©MapR Technologies - Confidential 1
  • 2. Making Hadoop Work for Everybody  Allen Day, MapR Technologies, Principal Data Scientist  Contact: – –  Slides: –  Email: allenday@maprtech.com Twitter: @allenday http://slideshare.net/allenday Hash tags: # ©MapR Technologies - Confidential 2
  • 3. Hadoop adoption is widespread. What happens next? ©MapR Technologies - Confidential 3
  • 4. Big Data Trends: Where Are We Going?  Big Data Storage  Big Data Applications – don’t just store data but mine it to extract the full benefits. – Real-time processing requirements make low latency important – Many exciting new technologies are available  Going from academic  practical – take machine learning & advanced analytics from the research lab into business environments.  “Simple algorithms and lots of data trump complex models” [1] – elaborate designs may not give the best business benefits in production. Simplicity is the key to success.  Re-usability shortens time-to-market – use pre-existing components and familiar architectural design patterns to reduce development cost and time [1] Halevy, Norvig, and Pereira, Google IEEE Intelligent Systems ©MapR Technologies - Confidential 4
  • 5. Big Data Trends: Who is Driving?  Web camp –  Big data camp –  non-traditional scalable file systems Everybody else –  everything is a service with a URL or a DOM files and databases But
 They don’t easily work together ©MapR Technologies - Confidential 5
  • 6. This is not a problem. It’s an opportunity ©MapR Technologies - Confidential 6
  • 7. New Technologies: Can They Play Together? Examples of excellent modern technologies  d3.js[1] does real-time, interactive visualization for excellent images of data  node.js[2] allows simple (not just web) servers  Apache Storm[3] does real-time processing  Hadoop does big data distributed storage really well But HDFS makes Hadoop stand somewhat alone  Special steps are needed to ingest and access data on a Hadoop cluster MapR has changed that . . . [1] http://d3js.org [2] http://nodejs.org [3] http://incubator.apache.org/storm ©MapR Technologies - Confidential 7
  • 8. Evolution of Data Storage. Over decades of progress, Unix-based systems have set the standard for compatibility and functionality. [Chart axes: Scalability; Functionality, Compatibility. Plotted: Linux, POSIX]
  • 9. Evolution of Data Storage. Hadoop achieves much higher scalability by trading away essentially all of this compatibility. [Chart axes: Scalability; Functionality, Compatibility. Plotted: Hadoop, Linux, POSIX]
  • 10. Evolution of Data Storage. MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance. [Chart axes: Scalability; Functionality, Compatibility. Plotted: Hadoop, Linux, POSIX]
  • 11. MapR Data Storage: How it’s done. [Diagram: interfaces (HBase NoSQL Tables API, POSIX NFS, Hadoop HDFS API) and implementations (Apache HBase, Apache Hadoop HDFS, MapR Filesystem), connected by edges labeled “implements” and “depends”]
  • 12. MapR Data Storage: How it’s done. Vertical Integration = High Performance. [Diagram: interfaces (HBase NoSQL Tables API, POSIX NFS, Hadoop HDFS API) and implementations (Apache HBase, Apache Hadoop HDFS, MapR Filesystem), connected by edges labeled “implements” and “depends”]
  • 13. Hadoop on MapR No Longer Stands Apart. [Diagram: a MapR cluster surrounded by legacy code & applications; new technologies (d3, node.js, Apache Storm); multiple types of data sources; new custom applications]
  • 14. What does this compatibility mean for you?
  • 15. Example: visualization of big data
  • 16. Visualization Gives Data Impact
      New technologies include visualization tools like D3.js [1] and Node.js [2]
      – POSIX tools that run and scale easily on MapR
      – http://trends.truliablog.com/vis/metro-movers/
      [1] http://d3js.org/ [2] http://nodejs.org
  • 17. Visualization Gives Data Impact
      New technologies include visualization tools like D3.js [1] and Node.js [2]
      – POSIX tools that run and scale easily on MapR
      – http://trends.truliablog.com/vis/metro-movers/
      [1] http://d3js.org/ [2] http://nodejs.org
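The point of the slide above is that ordinary file-based tools can serve data that lives on the cluster. As a rough illustration only, the sketch below serves a JSON file from a directory over HTTP using Python's standard library. It is not the Node.js/D3 setup from the talk: `make_server` and `fetch_json` are hypothetical helper names, and a temporary directory stands in for an NFS-mounted cluster path.

```python
import functools
import http.client
import http.server
import json
import os
import tempfile
import threading


def make_server(data_dir, port=0):
    """Build a static file server rooted at data_dir (port 0 picks a free port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=data_dir
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def fetch_json(server, name):
    """Fetch /name from the server and decode the body as JSON."""
    host, port = server.server_address
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/" + name)
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


# Assumption: a temporary directory stands in for the NFS mount point.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "top_words.json"), "w", encoding="utf-8") as f:
    json.dump({"storm": 4, "hadoop": 2}, f)

server = make_server(data_dir)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = fetch_json(server, "top_words.json")
server.shutdown()
server.server_close()
```

The server code contains nothing specific to the storage system; it only needs the data to appear as files under a directory, which is what a browser-side visualization would then request.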
  • 19. Sentiment Analysis In Real-time
      – Business Goal: Who is having a bad experience with my brand and how can I fix it?
      – What does the result look like?
          – Show me now
          – On Twitter
          – Who it is
          – how they feel
          – and what product/service they’re interacting with
          – ALSO show me patterns of feelings related to my products/services
      – How: real-time big data analytics
  • 20. Business Goals: From Data to Insight. [Diagram: sources (Twitter, etc.) → Processing → Visualization and Reporting (machine analysis) → Interpret → Find Value & Execute (human insight)]
  • 21. Analytics Architecture: How to Process? [Diagram: sources (Twitter, etc.) → Processing → Visualization and Reporting (machine analysis) → Interpret → Find Value & Execute (human insight)]
      – Aggregation and Queuing: depends on whether you use MapR or other Hadoop distro (explain later)
      – Real-time processing: Apache Storm [1]
          – Established open source project for robust, distributed RT processing
  • 22. Analytics Architecture: How to Display? [Diagram: sources (Twitter, etc.) → Processing → Visualization and Reporting (machine analysis) → Interpret → Find Value & Execute (human insight)]
      – Visualization: many choices, e.g. D3.js, Tableau, Processing
      – Web server: many choices, e.g. node.js, Twisted Web etc.
  • 23. Analytics Architecture: End-to-End. [Diagram components: Twitter, Twitter API, TweetLogger, MapR, Topic Queue, Catcher, Storm, NFS, Web Data, Web-server, http]
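One way to picture the logger-to-catcher leg of the end-to-end flow on slide 23 is a logger that appends records to a topic file and a catcher that reads forward from a saved byte offset. The following is a minimal sketch under assumptions, not the code used in the talk: `append_record` and `read_new_records` are hypothetical names, records are assumed to be JSON lines, and a temporary directory stands in for a path on the NFS-mounted cluster.

```python
import json
import os
import tempfile


def append_record(topic_path, record):
    """Append one record as a JSON line and force it to storage."""
    with open(topic_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def read_new_records(topic_path, offset):
    """Return (records, new_offset) for complete lines at or after offset."""
    records = []
    with open(topic_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line that is still being written
            records.append(json.loads(line))
            offset += len(line)
    return records, offset


# Hypothetical stand-in for a topic file on the NFS-mounted cluster.
topic = os.path.join(tempfile.mkdtemp(), "tweets.log")
append_record(topic, {"user": "a", "text": "hello"})
append_record(topic, {"user": "b", "text": "world"})

first_batch, position = read_new_records(topic, 0)
append_record(topic, {"user": "c", "text": "again"})
second_batch, position = read_new_records(topic, position)
```

Because each reader tracks its own offset, several subscribers can consume the same file independently, and a reader that persists its offset can resume where it left off after a restart.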
  • 24. Aggregation and Queuing Layer Design
      – Apache Storm provides real-time processing framework
          – Record-oriented model
          – Function()s transform record streams into new record streams
          – Distributed, failure-tolerant, and scalable
          – No inherent state
      – MapR provides the real-time processing storage
          – Process records and emit values (optionally writing to the file system)
          – Records have to be acknowledged or else they will be retransmitted
          – Provides failure tolerance
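The acknowledge-or-retransmit rule on the slide above can be illustrated with a small in-memory loop. This is a sketch of the general at-least-once idea only; it does not use Storm's API, and `run_stream`, `flaky_upper`, and the `max_attempts` limit are hypothetical names and choices made for the example.

```python
from collections import deque


def run_stream(records, transform, max_attempts=3):
    """Apply transform to each record, retrying records that are not acknowledged.

    A record counts as acknowledged when transform returns without raising.
    """
    pending = deque((record, 1) for record in records)
    emitted, failed = [], []
    while pending:
        record, attempt = pending.popleft()
        try:
            emitted.append(transform(record))
        except Exception:
            if attempt < max_attempts:
                pending.append((record, attempt + 1))  # retransmit later
            else:
                failed.append(record)
    return emitted, failed


_already_failed = set()


def flaky_upper(text):
    """Uppercase the text, but fail the first time the record "b" is seen."""
    if text == "b" and text not in _already_failed:
        _already_failed.add(text)
        raise RuntimeError("simulated processing failure")
    return text.upper()


emitted, failed = run_stream(["a", "b", "c"], flaky_upper)
```

In this run the record "b" fails once, is queued again, and succeeds on its second delivery, so it is emitted after "c". Retransmission of this kind gives at-least-once delivery but does not preserve the original order, which is one reason stateless record transforms are easier to reason about.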
  • 26. Demo for Real-time on Hadoop/MapR
      – What application does
          – Reads tweets as they happen
          – Remembers the top few words
          – Makes engaging pictures
      – Technical requirements
          – Handle restarts well
          – Be fault tolerant
      – Best practice design tip:
          – Keep it really simple
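The "remembers the top few words" and "handle restarts well" requirements above can be sketched with a word counter that checkpoints its state to a file. This is an illustrative assumption about how such a demo could work, not the demo's actual code: the `TopWords` class, the stop-word list, and the JSON checkpoint format are all hypothetical.

```python
import json
import os
import re
import tempfile
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")
STOP_WORDS = {"a", "an", "and", "i", "is", "the", "to", "rt"}  # assumed list


class TopWords:
    """Keep running word counts and report the k most frequent words."""

    def __init__(self, k=3, counts=None):
        self.k = k
        self.counts = Counter(counts or {})

    def add_tweet(self, text):
        words = WORD_RE.findall(text.lower())
        self.counts.update(w for w in words if w not in STOP_WORDS)

    def top(self):
        return self.counts.most_common(self.k)

    def save(self, path):
        """Write a checkpoint, replacing the old one in a single rename."""
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.counts, f)
        os.replace(tmp, path)

    @classmethod
    def load(cls, path, k=3):
        """Rebuild the counter from a checkpoint after a restart."""
        with open(path, encoding="utf-8") as f:
            return cls(k=k, counts=json.load(f))


tracker = TopWords(k=2)
for tweet in ["Hadoop is great", "I love Hadoop and Storm", "Storm storm STORM"]:
    tracker.add_tweet(tweet)

checkpoint = os.path.join(tempfile.mkdtemp(), "top_words.json")
tracker.save(checkpoint)
restored = TopWords.load(checkpoint, k=2)
```

Writing the checkpoint to a temporary file and renaming it keeps a reader from seeing a half-written file, which fits the slide's advice to keep the design simple.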
  • 28. Hadoop on MapR No Longer Stands Apart. [Diagram components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization]
  • 29. Importance of a Real-time File System
      – This application design provides
          – A distributed, partitioned, multi-subscriber commit log
          – With replication and failure tolerance
      – This application design is easy to implement
      – Because the hard problems are solved at the platform layer
          – No need for replication in the queuing layer
          – Failure tolerance is trivial, well-hardened in production
          – Performance even with replication is very, very high
      – But
          – Not all Hadoop distributions include a real-time file system
  • 30. Alternative Queuing Layer - Kafka
      – Apache Kafka
          – Provides distributed, partitioned, multi-subscriber commit log
          – As of 0.8 beta, also supports replication of data
          – Is well-tested. It is used extensively in production at high volumes
      – But
          – Kafka requires a separate cluster (not needed with MapR)
          – Data must be persisted in multiple clusters (Storm & Kafka)
          – Replication capability is new and not well-tested for mission-critical environments
          – Failure tolerance is implemented at the application layer. This means
          – This design does not generalize to other
  • 31. Without MapR: many clusters. [Diagram components: Twitter, Twitter API, Twitter Scraper, Flume cluster, Kafka cluster (Kafka API), Storm, Hadoop cluster (HDFS data), Report Data, NAS, Web-server (http), Web Service; asterisks mark several of the components]
  • 32. Analytics Architecture: End-to-End. [Diagram components: Twitter, Twitter API, TweetLogger, MapR, Topic Queue, Catcher, Storm, NFS, Web Data, Web-server, http; one component is marked with an asterisk]
  • 33. Sentiment Analysis In Real-time
      – Business Goal: Who is having a bad experience with my brand and how can I fix it?
      – What does the result look like?
          – Show me now
          – On Twitter
          – Who it is
          – how they feel
          – and what product/service they’re interacting with
          – ALSO show me patterns of feelings related to my products/services
          – ALSO allow retrospective analysis for R&D
      – How: real-time big data analytics
  • 34. MapR Data Platform Advantage. [Diagram components: Twitter, Twitter API, TweetLogger, MapR cluster, Apache Storm, D3 Visualization, R&D Batch Analytics, Other Applications] Such as . .
  • 35. Data Warehouse / Hadoop ETL Offload. [Diagram labels: Billing Systems, Accounts Receivable, Revenue Accounting, Other; Structured & Unstructured Data; MapR Distribution for Apache Hadoop (Extract, Clean, Transform, Conform; nodes N1); Structured Data; Teradata; Data Warehouse and Analytics; Financial Reporting; BI Front-ends; numbered steps 1 to 5]
  • 36. When Hadoop Looks Like a NAS
      – Data ingestion is easy
          – Popular online gaming company changed data ingestion from a complex Flume cluster to a 17-line Python script
      – Database bulk import/export with standard vendor tools
          – Large telecom saved $30M on Teradata costs by pre-processing with MapR
      – 1000s of applications/tools
          – Large credit card company uses MapR volumes as the user home directories on the Hadoop gateway servers
      [Diagram labels: Logs, Application servers]
      $ find . | grep log
      $ cp
      $ vi results.csv
      $ scp
      $ tail -f part-00000
  • 37. Recap
      – Exciting times in Hadoop, lots of new and exciting capabilities
      – Exciting times in world of webservers, visualization and real-time
      – Integration of both worlds is easier than it looks at first
          – if you have a real-time filesystem
      – MapR Data Platform is widely compatible
          – Use legacy code & applications without having to re-write
          – Access traditional data stores more easily
          – Connect to new technologies directly

Editor's Notes

  1. Add your contact information including “MapR Job Title” starting with the word “MapR” such as MapR Solutions Architect etc. Add the #hashtag of the particular meet-up and choose one or more of the others BUT NOT ALL OF THEM
  2. Reduce cost and time
  3. This slide and the following demos set the stage to introduce the overall big idea of the talk: that there are many cool advances in these different areas, but without the ideas introduced here, it’s hard to connect these technologies together easily and reliably.
  4. This slide and the following demos set the stage to introduce the overall big idea of the talk: that there are many cool advances in these different areas, but without the ideas introduced here, it’s hard to connect these technologies together easily and reliably.
  5. Gives up random access read on files. Gives up strong authentication / authorization model. Gives up random access write / append on files.
  6. http://trends.truliablog.com/vis/metro-movers/
  7. Main point: these applications can be served directly from the MapR filesystem because it presents as a POSIX filesystem using the NFS protocol. Non-big-data applications like Node.js and D3.js can work directly off of MapR, and don’t care that it is also a highly scalable DFS.
  8. Talk track: Let’s look at another example of new technologies on Hadoop in more detail. Real-time analysis has many uses, such as Sentiment Analysis. What is sentiment analysis and why would you use it? (then start into slide details)
  9. Talk track: This diagram shows what you are doing at each step in the process. Of course the final steps are interpretation of the output of your application by humans to gain insight that will power business decisions.
  10. Talk track: This diagram shows what you are doing at each step in the process. Of course the final steps are interpretation of the output of your application by humans to gain insight that will power business decisions.
  11. Talk track: This diagram shows what you are doing at each step in the process. Of course the final steps are interpretation of the output of your application by humans to gain insight that will power business decisions.
  12. Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  13. Note to speaker: listen to explanation at about 23min into Ted’s video for this slide and next slide showing the sliding window
  14. Note to speaker: here is the option if you don’t have a distributed system with a real-time replicated filesystem (such as what MapR has). Alternative is to use something like Kafka. Much harder to build etc.
  15. “redundant redundancy” is about too many data copies = inefficient design and risk of inconsistency between systems
  16. Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  17. Talk track: Let’s look at another example of new technologies on Hadoop in more detail. Real-time analysis has many uses, such as Sentiment Analysis. What is sentiment analysis and why would you use it? (then start into slide details)
  18. Allen: Is this useful? Is it MapR specific?
  19. What is NAS? Should the little elephant say MapR? Remember this slide was from a sales pitch, so you may have to make clear it’s talking about MapR
  20. Allen: Do we want to use the images including Korean bowl or better to avoid possible negative reaction and just use a text slide for transition to thinking about legacy applications?
  21. Another example of the same thing (this slide can be suppressed – just another view of the inaccuracies)
  22. I hid this one and went with the other format of your slide.
  23. Again, I’ve hidden this version and gone with the one that is formatted differently. Same content, though.
  24. Talk track: In this presentation, we are going to look in more detail at the tools and technologies used for this middle stage, which is done by machine: the processing and visualization/reporting steps. (Allen: don’t know if you want to say more details here, but I suspect this slide is just a transition to get the audience focused on the processing & the viz/reporting part of the work flow.)
  25. Note to speaker: Slide just sets up the transition from business goals to the architecture diagram slide. Don’t need a lot of detail, but introduce Storm, say a little about the project and that it is new to Apache, AND MENTION that MapR’s Ted Dunning is one of the PROJECT MENTORS. Talk track: Now that you see the business goal of using real-time sentiment analysis, let’s look at the architecture for a sample project. Here are some technologies you’ll need [explain], and then say “and on the next slide we see a diagram of the architectural design” (or call it work flow?).
  26. Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  27. Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  28. Allen: This is the MapR view. You can do it first or the non-mapr view first. I think you said you preferred to do non-mapR first?
  29. Allen: This is the MapR view. You can do it first or the non-mapr view first. DOES THE WEB SerVER/WEB DATA ALSO COME OFF if not a MapR cluster?????
  30. Allen: This is the MapR view. You can do it first or the non-mapr view first. DOES THE WEB SerVER/WEB DATA ALSO COME OFF if not a MapR cluster?????
  31. The distinction is not clear to me
 Attempt 2 meaning different slide to show same idea? Confusing to mix symbols of actions/ ideas with components in work flow
  32. This is my start on the non-MapR view
  33. Do you want to include this information?
  34. MapR enables integration by providing industry-standard interfaces. More 3rd party solutions work with MapR than any other distribution. Proprietary connectors not needed.
      – NFS: All file-based applications can read and write data. Examples: Linux utilities, file browsers, Informatica UltraMessaging
      – ODBC 3.52: All BI applications can leverage Hive. Examples: Excel, Crystal Reports, Tableau, MicroStrategy
      – Linux PAM: Any authentication provider can be used. Examples: LDAP, Kerberos, 3rd party
  35. Note from Ted’s video hints: HBase storage here can be handy but is much more of a sideline to the main idea.