This webinar on Big Data and Hadoop, titled "Ways to Succeed with Hadoop in 2015", was conducted by Edureka in association with TechGig.com on 29th December 2014.
Webinar: Ways to Succeed with Hadoop in 2015
1. www.edureka.co/big-data-and-hadoop
View Big Data and Hadoop Course at: http://www.edureka.co/big-data-and-hadoop
For more details please contact us:
US : 1800 275 9730 (toll free)
INDIA : +91 88808 62004
Email Us : sales@edureka.co
For Queries:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
2. Slide 2 www.edureka.co/big-data-and-hadoop
Objectives
At the end of this module, you will be able to understand:
Hadoop the Swiss Army Knife – integration with tools and frameworks
» Spark Integration with Hadoop
» Cassandra Integration with Hadoop
» Pentaho Integration with Hadoop
From Batch to Real-time Processing
Lambda Architecture
New and Upcoming Tools
4. Slide 4 www.edureka.co/big-data-and-hadoop
Monte Zweben
Co-founder and CEO of Splice Machine
There will be "strong demand" for
Hadoop to become more real-time
and transactional, as it becomes a
viable alternative to traditional
databases like Oracle and MySQL.
Gary Nakamura
CEO of Concurrent, Inc
As the market continues to catch up
to the hype, 2015 will be the year
that Hadoop becomes a worldwide
phenomenon. As part of this, expect
to see more Hadoop-related
acquisitions, IPOs and the rise of new
jobs.
Neil Mendelson
Oracle's VP of Big Data and Advanced Analytics
Hadoop and NoSQL will graduate
from mostly experimental pilots to
standard components of enterprise
data management, taking their place
alongside relational databases.
Predictions for Hadoop in 2015
5. Slide 5 www.edureka.co/big-data-and-hadoop
Predictions for Hadoop in 2015
The Big Data movement will generate 4.4 million new IT
jobs globally by 2015.
SQL, the data-querying language used by application
developers, will become one of the most prolific use
cases in the Hadoop ecosystem.
6. Slide 6 www.edureka.co/big-data-and-hadoop
Hadoop – The Swiss Army Knife of the 21st Century
Hadoop can be integrated with multiple analytics tools to get the best out of it, such as machine learning tools, R,
Python, Spark, MongoDB, etc.
7. Slide 7 www.edureka.co/big-data-and-hadoop
Spark can be used along with Hadoop 2.x
Spark can use YARN as the cluster resource manager in Spark-on-YARN mode
Spark has a separate build for YARN-specific integration
YARN spawns the Spark program as a YARN process, where the Application Master process is actually the driver
program and the YARN child processes are the Spark workers
This integration is the preferable mode if the data size per node is far greater than the memory available to cache
the RDDs
On lower volumes of data per node, launching Spark without YARN yields better results
Spark Integration with Hadoop
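The trade-off in the last two bullets can be sketched as a simple heuristic. This is illustrative only: the function name, the threshold, and the standalone master URL `spark://master:7077` are assumptions for the sketch, not Spark APIs.

```python
def choose_spark_master(data_gb_per_node, cache_gb_per_node):
    """Illustrative heuristic (not a Spark API): prefer YARN when the data
    per node far exceeds the memory available to cache RDDs; otherwise a
    standalone Spark cluster manager tends to have lower overhead."""
    if data_gb_per_node > cache_gb_per_node:
        return "yarn"                  # let YARN manage cluster resources
    return "spark://master:7077"       # standalone cluster manager URL (example)

print(choose_spark_master(500, 64))    # data far exceeds cache -> "yarn"
print(choose_spark_master(16, 64))     # data fits in memory -> standalone
```

In a real deployment this choice is made via the `--master` option when submitting the job.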
9. Slide 9 www.edureka.co/big-data-and-hadoop
» Ease of development
» Combined workflows
» In-memory performance
» Unlimited scale
» Wide range of applications
» Enterprise platform
The combination of Spark on Hadoop: operational applications augmented by in-memory performance
Spark + Hadoop
10. Slide 10 www.edureka.co/big-data-and-hadoop
Spark can use Hadoop as Storage
» Spark can use Hadoop as storage as well as the cluster manager
» HDFS provides distributed storage of large datasets
» High availability is assured natively through HDFS
» No extra software installation is required
» Compatible with Hadoop 1.x as well; using HDFS as storage doesn't require Hadoop 2.x
» Data loss during computation is handled by HDFS itself
Using Hadoop as Storage
11. Slide 11 www.edureka.co/big-data-and-hadoop
Spark can use Hadoop as an execution engine
» Spark can be integrated with YARN for its execution
» Spark can also be used with other cluster managers (like Mesos or the Spark standalone cluster manager)
» YARN integration automatically provides processing scalability to Spark
» Spark needs Hadoop 2.0+ in order to use it for execution
» Every node in the Hadoop cluster also needs Spark installed
» Using the Hadoop cluster for Spark processes may require upgrading the RAM of the data nodes
» The YARN-integrated distribution of Spark is quite new and still stabilizing
Using Hadoop as Execution Engine
14. Slide 14 www.edureka.co/big-data-and-hadoop
Cassandra Integration with Hadoop
Stand Alone Model
» Stand Alone Independent Clusters
» Existing Cassandra and Hadoop Platforms
» Different Environments
» Different Business Units
» Exposing For B2B Consumption
[Diagram: a Master Node running the Job Tracker, with Slave Nodes 1–3 each running a Task Tracker and MapReduce]
15. Slide 15 www.edureka.co/big-data-and-hadoop
Real-time Application and Analytics in One Cluster with
Resource Isolation
Cassandra Integration with Hadoop
Hybrid Model
» Single & Hybrid Cluster
» Shared Infrastructure
» Shared Workload
» Dedicated groups
» Run Cassandra & Hadoop on same cluster
» No SPOF
[Diagram: one cluster of Cassandra nodes divided into Replica Group 1 and Replica Group 2 with write replication between them; the Hadoop Job Tracker and Task Trackers run on the same nodes]
16. Slide 16 www.edureka.co/big-data-and-hadoop
Integrating Hadoop with Cassandra gives remarkable performance for business improvement in
companies using big data
Hadoop integration with Cassandra includes support for MapReduce, Pig, Hive and Oozie
» Hadoop provides distributed processing and high scalability
» Cassandra gives us linear scalability and high availability
» Together, Hadoop and Cassandra help us process and manage big data easily
Cassandra Integration with Hadoop
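Cassandra's linear scalability comes from distributing rows around a token ring. A minimal consistent-hashing sketch in plain Python can make that concrete; this is illustrative only, not Cassandra's actual partitioner (the MD5-based token is merely loosely analogous to its RandomPartitioner, and the node names are made up).

```python
import hashlib
from bisect import bisect_right

def token(key):
    # MD5-based token, loosely analogous to Cassandra's RandomPartitioner
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class TokenRing:
    def __init__(self, nodes):
        # each node owns the arc of the ring ending at its token
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def owner(self, row_key):
        # the first node clockwise from the row's token owns the row
        i = bisect_right(self.tokens, token(row_key)) % len(self.ring)
        return self.ring[i][1]

ring = TokenRing(["node1", "node2", "node3"])
print(ring.owner("user:42"))  # always the same node for this row key
```

Adding a node only moves the keys on one arc of the ring, which is why capacity grows roughly linearly with node count.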
17. Slide 17 www.edureka.co/big-data-and-hadoop
Pentaho distributes the Big Data plugin along with its standard products
Hadoop configurations within PDI are collections of the Hadoop libraries
required to communicate with a specific version of Hadoop and Hadoop-
related tools, such as Hive, HBase, Sqoop, or Pig
The Hadoop distribution configuration can be found at this
location: plugins/pentaho-big-data-plugin/plugin.properties
As of PDI 5.1, it supports the standard Hadoop distributions CDH 4.2 and 5.0,
MapR 3.1 and HDP 2.0.
Pentaho Integration with Hadoop
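As a hedged sketch, selecting the active Hadoop distribution in that file typically looks like the fragment below. The `active.hadoop.configuration` property comes from PDI's Big Data plugin; the exact shim folder name (`cdh50` here) varies by PDI release and install, so treat it as an example.

```properties
# plugins/pentaho-big-data-plugin/plugin.properties
# Select which bundled Hadoop configuration (shim) PDI should load,
# e.g. a CDH, MapR or HDP shim -- folder names vary by PDI release.
active.hadoop.configuration=cdh50
```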
18. Slide 18 www.edureka.co/big-data-and-hadoop
Pentaho Integration with Hadoop
MapReduce jobs can be written in traditional languages like Java, but that requires a specific skill set
PDI provides a powerful alternative for creating MapReduce jobs with minimal technical skill
Compared to traditional coding and ETL approaches, Pentaho's visual development
tools reduce the time to design, develop and deploy Hadoop analytics solutions by 15x
Joe Nicholson, Pentaho's Vice President of Product Marketing:
"Our goal is Hadoop with practically zero programming, so we can simplify the use of Hadoop for
analytics, including file input and output steps as well as managing Hadoop jobs."
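To give a sense of what PDI's visual designer abstracts away, here is the kind of map/reduce logic one would otherwise hand-code. This is a single-process Python sketch of word count, not Hadoop API code; the function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # mapper: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle/sort then reduce: group pairs by word and sum the counts
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["big data and hadoop", "hadoop and spark"]
print(dict(reduce_phase(map_phase(data))))
# {'and': 2, 'big': 1, 'data': 1, 'hadoop': 2, 'spark': 1}
```

On a real cluster the sort-and-group step is the distributed shuffle; PDI lets you express the mapper and reducer as visual transformations instead.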
19. Slide 19 www.edureka.co/big-data-and-hadoop
Real-time Analytics in…
» Financial Services: fraud detection and prevention
» Retail: brand sentiment analysis; localized, personalized promotions
» Telecom: cell tower diagnostics; bandwidth allocation
» Manufacturing: proactive maintenance
» Healthcare: monitoring patient vitals; patient care and safety; reducing re-admittance rates
» Utilities, Oil & Gas: smart meter stream analysis; proactive equipment repair; power and consumption matching
» Public Sector: network intrusion detection and prevention; disease outbreak detection
» Transportation: unsafe driving detection and monitoring
20. Slide 20 www.edureka.co/big-data-and-hadoop
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
[Diagram: (1) New Data is dispatched to both the Batch Layer and the Speed Layer; the Batch Layer feeds the Serving Layer]
Lambda Architecture
21. Slide 21 www.edureka.co/big-data-and-hadoop
The batch layer has two functions:
» managing the master dataset (an immutable, append-only set of raw data), and
» pre-computing the batch views. The serving layer indexes the batch views so that they can be queried in a
low-latency, ad-hoc way.
[Diagram: (1) New Data enters the Master Dataset in the Batch Layer; (2) the Batch Layer pre-computes the Batch Views; (3) the Serving Layer indexes them; the Speed Layer runs alongside]
Lambda Architecture
22. Slide 22 www.edureka.co/big-data-and-hadoop
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
[Diagram: (4) the Speed Layer maintains Real-time Views over recent data, alongside the Batch Layer (1), the Batch Views (2) and the Serving Layer (3)]
Lambda Architecture
23. Slide 23 www.edureka.co/big-data-and-hadoop
Any incoming query can be answered by merging results from batch views and real-time views.
[Diagram: (5) a Query merges results from the Batch Views in the Serving Layer (1–3) and the Real-time Views in the Speed Layer (4)]
Lambda Architecture
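The three layers above can be sketched end-to-end in a few lines. This is a toy, single-process illustration of the Lambda pattern for event counts (the class and method names are invented for the sketch), not a production design.

```python
class LambdaCounter:
    """Toy Lambda architecture for event counts (illustrative only):
    an immutable master dataset, a batch view recomputed from scratch,
    and a speed-layer real-time view covering data since the last batch run."""

    def __init__(self):
        self.master = []         # batch layer: append-only raw events
        self.batch_view = {}     # serving layer: precomputed on each batch run
        self.realtime_view = {}  # speed layer: incremental, recent data only

    def ingest(self, event):
        # new data is dispatched to BOTH the batch layer and the speed layer
        self.master.append(event)
        self.realtime_view[event] = self.realtime_view.get(event, 0) + 1

    def run_batch(self):
        # recompute the batch view from the full master dataset, then
        # discard the real-time state that the batch run has absorbed
        view = {}
        for e in self.master:
            view[e] = view.get(e, 0) + 1
        self.batch_view = view
        self.realtime_view = {}

    def query(self, event):
        # answer any query by merging batch and real-time views
        return self.batch_view.get(event, 0) + self.realtime_view.get(event, 0)

lam = LambdaCounter()
lam.ingest("click"); lam.ingest("click")
lam.run_batch()
lam.ingest("click")            # arrives after the batch run
print(lam.query("click"))      # 3 = 2 (batch view) + 1 (real-time view)
```

In practice the batch layer would be a Hadoop job over HDFS and the speed layer a stream processor; the merge-at-query-time idea is the same.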
24. Slide 24 www.edureka.co/big-data-and-hadoop
New and Upcoming Tools
Apache Tez: an application framework which allows a complex directed acyclic graph of tasks for
processing data
Apache Accumulo: a sorted, distributed key/value store; a robust, scalable, high-performance data
storage and retrieval system
Apache Kafka: a distributed, partitioned, replicated commit log service. It provides the functionality of
a messaging system, but with a unique design
Apache Nutch: a highly extensible and highly scalable web crawler
Apache Knox Gateway: a system that provides a single point of authentication and access for
Apache Hadoop services in a cluster
Apache S4: a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows
programmers to easily develop applications for processing continuous, unbounded streams of data
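To make "partitioned, replicated commit log" concrete, here is a toy in-memory sketch. It is illustrative only and is not the Kafka API: the class, its methods, and the key names are invented, and replication is omitted for brevity.

```python
class CommitLog:
    """Toy partitioned commit log (illustrative, not Kafka's API):
    messages are appended to a partition chosen by key, and consumers
    read sequentially from a per-partition offset they track themselves."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # the same key always lands in the same partition,
        # which preserves per-key ordering
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # consumers can replay from any offset they remember
        return self.partitions[partition][offset:]

log = CommitLog()
p, _ = log.produce("sensor-1", "reading A")
log.produce("sensor-1", "reading B")
print(log.consume(p, 0))   # ['reading A', 'reading B']
```

Because the log is append-only and offsets are consumer-side, many independent consumers can read the same stream at their own pace, which is the core of Kafka's design.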
27. Slide 27 www.edureka.co/big-data-and-hadoop
Module 1
» Understanding Big Data and Hadoop
Module 2
» Hadoop Architecture and HDFS
Module 3
» Hadoop MapReduce Framework - I
Module 4
» Hadoop MapReduce Framework - II
Module 5
» Advanced MapReduce
Module 6
» PIG
Module 7
» HIVE
Module 8
» Advanced HIVE and HBase
Module 9
» Advanced HBase
Module 10
» Oozie and Hadoop Project
Course Topics
28. LIVE Online Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz
Project Work
Verifiable Certificate
Slide 28 www.edureka.co/big-data-and-hadoop
How It Works