Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
How YARN Enables Multiple Data
Processing Engines in Hadoop
We Do Hadoop
Eric Mizell - Director, Solution Engineering
Agenda
• HDFS Overview - Storage
• YARN 101 - Compute
– Yet Another Resource Negotiator
• Enabling a Modern Data Architecture
• YARN in action
– Demo of streaming application
• Hadoop Tools
– Demos
• Sample Code - https://github.com/emizell/HBase-Code-Samples
HDFS Overview
•  Typical Hardware for DataNodes
–  2@8 Core
–  256GB RAM
–  2@24TB Disk
–  10 GbE
•  Hadoop is rack aware
–  Data is replicated across racks so that losing a single rack does not lose data
•  Scale up or down
–  Add or remove DataNodes; the HDFS balancer redistributes blocks across the cluster
•  HDFS is a file system
–  Store any kind of data
–  Inexpensive storage
–  Replication factor of 3 by default (configurable)
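The rack awareness and replication factor above can be sketched in a few lines. This is a simplified illustration of the default HDFS placement idea (first replica on the writer's node, second on another rack, third on a different node of that second rack), not the NameNode's actual code. The rack and node names, the function name and the `seed` parameter are all hypothetical.

```python
import random


def place_replicas(topology, writer_node, replication=3, seed=0):
    """Pick nodes for up to three replicas of one block.

    topology maps a rack name to its list of DataNode names (hypothetical names).
    Assumes at least two racks and at least two nodes in every rack.
    """
    rng = random.Random(seed)
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the node the client is writing from.
    chosen = [writer_node]
    if replication >= 2:
        # Replica 2: a node on a different rack, so one rack failure cannot lose the block.
        remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
        second = rng.choice(topology[remote_rack])
        chosen.append(second)
    if replication >= 3:
        # Replica 3: a different node on the same rack as replica 2.
        third = rng.choice([node for node in topology[remote_rack] if node != second])
        chosen.append(third)
    return chosen
```

Calling `place_replicas({"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}, "dn1")` returns three distinct nodes spread over both racks.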
YARN Concepts
• Application
– Application is a job submitted to the framework
– Example – MapReduce Job
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU, etc.)
– container_0 = 2 GB, 1 CPU
– container_1 = 1 GB, 6 CPU
– Replaces the fixed map/reduce slots
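A minimal sketch of what "container as the unit of allocation" means, assuming only two resource types (memory and CPU). The `Resource` and `Node` classes are hypothetical illustrations, not YARN API classes; the two requests mirror container_0 and container_1 above.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, other):
        # A request fits only if every resource type fits.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores


@dataclass
class Node:
    name: str
    free: Resource
    containers: list = field(default_factory=list)

    def allocate(self, container_id, request):
        """Grant the container if the node still has room; return True on success."""
        if not request.fits_in(self.free):
            return False
        self.free = Resource(self.free.memory_mb - request.memory_mb,
                             self.free.vcores - request.vcores)
        self.containers.append(container_id)
        return True
```

On a node with 4096 MB and 8 cores, a 2 GB / 1 CPU container and a 1 GB / 6 CPU container both fit, which fixed-size map/reduce slots could not express.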
YARN Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
– Application management
• Node Manager
– Per-machine agent
– Manages the container life cycle
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
YARN – Running Apps
[Diagram: Hadoop Client 1 and Hadoop Client 2 create and submit app1 and app2 to the ResourceManager (Scheduler, ASM, queues). NodeManagers on Rack1, Rack2 through RackN host the Application Masters (AM1, AM2) and their containers (C1.1 to C1.4, C2.1 to C2.3). Arrow labels: negotiates, reports to, partitions, status report.]
Hadoop 2.x Stack – Enabled by YARN
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
[Diagram: the HDP stack]
• Batch, interactive & real-time data access, on YARN: Data Operating System (Cluster Resource Management)
– Script: Pig (Tez)
– SQL: Hive (Tez)
– Java, Scala: Cascading (Tez)
– Stream: Storm
– Search: Solr
– NoSQL: HBase, Accumulo
– In-Memory: Spark
– Slider
– Others: ISV Engines
• Storage: HDFS (Hadoop Distributed File System)
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Security (Authentication, Authorization, Accounting, Data Protection)
– Storage: HDFS
– Resources: YARN
– Access: Hive, …
– Pipeline: Falcon
– Cluster: Knox
– Cluster: Ranger
• Operations
– Provision, Manage & Monitor: Ambari, Zookeeper
– Scheduling: Oozie
• Deployment Choice: Linux, Windows, On-Premises, Cloud
Hadoop 2.2.x Stack – Versions
Enabling a Modern Data Architecture
with Apache Hadoop
Hortonworks. We do Hadoop.
Existing Siloed Data Architectures Under Pressure
[Diagram: Applications (Business Analytics, Custom Applications, Packaged Applications) sit on a data system of siloed RDBMS, EDW and MPP stores, fed by existing sources (CRM, ERP, Clickstream, Logs).]
• Data growth: New Data Types, 85% (Source: IDC)
– OLTP, ERP, CRM Systems
– Unstructured docs, emails
– Clickstream
– Server logs
– Social/Web Data
– Sensor, Machine Data
– Geolocation
• Can’t manage new data paradigm
• Constrains data to specific schema
• Siloed data
• Limited scalability
• Economically unfeasible
• Limited analytics
HDP2 and YARN enable the Modern Data Architecture
Hortonworks architected and led development of YARN
Common data set, multiple applications
•  Optionally land all data in a single cluster
•  Batch, interactive & real-time use cases
•  Support multi-tenant access, processing & segmentation of data
YARN: Architectural center of Hadoop
•  Consistent security, governance & operations
•  Ecosystem applications certified by Hortonworks to run natively in Hadoop
[Diagram: Sources (Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured) feed a data system of RDBMS, EDW, MPP and Hadoop (YARN: Data Operating System over HDFS, with Batch, Interactive and Real-Time access), which serves the applications (Business Analytics, Custom Applications, Packaged Applications).]
YARN in Action
Hortonworks. We do Hadoop.
Trucking Company’s YARN-enabled Architecture
[Diagram components]
• Truck Sensors
• Inbound Messaging (Kafka)
• Stream Processing (Storm)
• Real-time Serving (HBase)
• Alerts & Events (ActiveMQ)
• Real-Time User Interface
• Interactive Query (Hive on Tez)
• Microsoft Excel
• Many Workloads: YARN
• Distributed Storage: HDFS
Components of the Topology
• 9 Node HDP 2.2 Cluster with Storm and HBase on YARN
• 4 Node 0.8 Kafka Cluster
• 1 Node ActiveMQ with STOMP Protocol Enabled
• Spring 4.0 WebMVC web app using SockJS & ActiveMQ over STOMP
Topology Architecture
[Diagram components]
• Truck Stream Generator via Akka
– Truck Simulator: T(1), T(2) ... T(N)
– Kafka Collector
• Kafka Grid - Captures all Driving Events
– Brokers BR(1), BR(2), BR(3), BR(4), BR(5)
– ZK
– truck_events topic
• Storm on YARN on HDP
– Kafka Spout
– HBase Bolt
– Monitoring Bolt
– WebSocket Bolt
• HBase on HDP
– driver dangerous events
– driver dangerous events count
• ActiveMQ
– Alert Topic (Email Alerts)
– Violation Events Topic
• Spring WebApp with SockJS WebSockets
– Real-Time Streaming Driver Monitoring App
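The flow through the topology can be sketched as a plain Python pipeline. This is an illustrative stand-in, not the demo's Storm code: the message format (`driver_id|event_type`), the set of dangerous event types, the alert threshold and every function name are assumptions, and in-memory lists and counters replace Kafka, HBase and ActiveMQ.

```python
from collections import Counter

# Hypothetical event types treated as dangerous.
DANGEROUS = {"speeding", "lane_departure", "unsafe_following"}


def kafka_spout(messages):
    """Stand-in for the Kafka spout: parse 'driver_id|event_type' strings into tuples."""
    for message in messages:
        driver_id, event_type = message.split("|")
        yield driver_id, event_type


def run_topology(messages, alert_threshold=3):
    stored = []         # stand-in for the HBase "driver dangerous events" table
    counts = Counter()  # stand-in for the HBase "driver dangerous events count" table
    alerts = []         # stand-in for the ActiveMQ alert topic
    for driver_id, event_type in kafka_spout(messages):
        if event_type not in DANGEROUS:
            continue  # normal driving events are not persisted in this sketch
        stored.append((driver_id, event_type))      # HBase bolt
        counts[driver_id] += 1
        if counts[driver_id] == alert_threshold:    # monitoring bolt (assumed rule)
            alerts.append(driver_id)
    return stored, counts, alerts
```

In a real deployment each step would run as a separate spout or bolt in parallel across the cluster; the single loop here only shows the order in which an event passes through them.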
Demo
Hadoop Tools
Hortonworks. We do Hadoop.
Agenda
•  The Basics
•  MapReduce & Java
•  Pig
•  Hive
•  HBase, Solr & Spark
•  Abstractions: .NET, Cascading and Spring XD
•  Intro to the Sandbox
•  Basic Hello World Using Hive and Pig
•  HBase and Phoenix demo and code discussion
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
YARN is the architectural center of HDP
•  Common data set across all applications
•  Batch, interactive & real-time workloads
•  Multi-tenant access & processing
Provides comprehensive enterprise capabilities
•  Governance
•  Security
•  Operations
Enables broad ecosystem adoption
•  ISVs can plug directly into Hadoop
The widest range of deployment options
•  Linux & Windows
•  On premises & cloud
[Diagram: the HDP 2.2 stack]
• Batch, interactive & real-time data access, on YARN: Data Operating System (Cluster Resource Management)
– Script: Pig (Tez)
– SQL: Hive (Tez)
– Java, Scala: Cascading (Tez)
– Stream: Storm
– Search: Solr
– NoSQL: HBase, Accumulo
– In-Memory: Spark
– Slider
– Others: ISV Engines
• Storage: HDFS (Hadoop Distributed File System)
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Security (Authentication, Authorization, Audit, Data Protection)
– Storage: HDFS
– Resources: YARN
– Access: Hive
– Pipeline: Falcon
– Cluster: Ranger
– Cluster: Knox
• Operations
– Provision, Manage & Monitor: Ambari, Zookeeper
– Scheduling: Oozie
• Deployment Choice: Linux, Windows, On-Premises, Cloud
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
[Diagram: the HDP 2.2 stack: data access engines on YARN over HDFS, with Governance, Security and Operations, and Deployment Choice (Linux, Windows, On-Premises, Cloud).]
We will cover:
•  What it is & where it is used
•  Basic elements
MapReduce
MapReduce is a framework for writing
applications that process large amounts
of structured and unstructured data in
parallel across a cluster of thousands of
machines, in a reliable and fault-tolerant
manner
Developers use it to…
•  Developers no longer have to write MapReduce directly
•  Many tools have been created
to abstract this complexity
[Diagram: chains of map (M) and reduce (R) tasks, with HDFS between stages.]
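The map, shuffle and reduce stages can be illustrated with the classic word count, written here as a single-process Python sketch. It only shows the programming model; on a cluster the framework runs many map and reduce tasks in parallel and handles the shuffle itself. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["We do Hadoop", "we do YARN"])` counts "we" and "do" twice and the other two words once.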
Pig
•  Apache™ Pig allows you to write complex
MapReduce transformations using a simple
scripting language.
•  Pig Latin (the language) defines a set of
transformations on a data set such as
aggregate, join and sort.
•  Pig Latin can be extended with UDFs
(User Defined Functions), which are written in Java
or a scripting language and then called directly
from Pig Latin.
Developers use Pig for
•  ETL
•  Basic “spreadsheet” functions
•  Prepare data for data science
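To make the aggregate and sort transformations concrete, here is what a typical group, sum and order step computes, written as a Python sketch rather than Pig Latin. The field names reuse those of the example load statement; the function name and the idea of summing one field per channel are assumptions for illustration.

```python
from collections import defaultdict


def total_by_key(records, key, value):
    """Group records by `key`, sum `value` per group, and sort by the total, largest first."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[value]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

In Pig Latin the same result would come from a GROUP, a FOREACH ... GENERATE with SUM, and an ORDER ... DESC, with Pig turning those statements into jobs on the cluster.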
Example
RAW_LOGS = LOAD '/user/paul/data/apache/access' USING TextLoader
as (line:chararray);

CLICKS_RAW = LOAD '$input' USING PigStorage('|') as
(sls_key:chararray, sls_item_ln_id:int, chn_id:int, loc_id:int,
all_chnl_rpt_chn_id:int, all_chnl_rpt_loc_id:int,
sls_bsns_dt:chararray, sku_id:int);

RECORDS = load 'config' using
org.apache.hcatalog.pig.HCatLoader();
Pig Operators
Hive
•  Apache Hive is the de facto standard for SQL
queries over petabytes of data in Hadoop
•  Created by a team at Facebook.
•  Provides a standard SQL interface to data
stored in Hadoop.
•  Quickly find value in raw data files.
•  Proven at petabyte scale.
•  Compatible with popular BI tools such
as Tableau, Excel, MicroStrategy, Business
Objects, etc.
Developers use it to:
•  Perform SQL queries
•  Interface with existing tools via
JDBC/ODBC
Sample SQL with Hive
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
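DISTRIBUTE BY and SORT BY are the least familiar clauses in this syntax, so here is a small Python sketch of their effect: rows with the same distribute key go to the same reducer, and each reducer sorts only its own rows (CLUSTER BY is the shorthand for distributing and sorting on the same columns). The modulo partitioning and the assumption of integer keys are simplifications, not Hive's actual partitioner.

```python
def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Return one sorted list of rows per reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumes an integer key; rows with equal keys always land on the same reducer.
        reducers[row[distribute_key] % num_reducers].append(row)
    # Each reducer sorts its own partition; there is no global ordering across reducers.
    return [sorted(partition, key=lambda row: row[sort_key]) for partition in reducers]
```

The output is sorted within each partition but not across partitions, which is the difference between SORT BY and a global ordering.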
Hive - Select Syntax
Hive Demonstration
HDP Sandbox
•  Up and running with a Hadoop
environment in minutes
•  Basic and advanced tutorials with
discrete learning paths
•  Ecosystem partner tutorials
hortonworks.com/sandbox
HBase
•  Apache™ HBase is a non-relational (NoSQL)
database that runs on top of the Hadoop®
Distributed File System (HDFS).
•  It is columnar and provides fault-tolerant
storage and quick access to large quantities
of sparse data.
•  It also adds record-level update capabilities to
Hadoop, allowing users to perform updates,
inserts and deletes, each atomic within a single row.
•  HBase was created for hosting very large
tables with billions of rows and millions of
columns.
Developers use it to:
•  Provide low latency access to
massive amounts of data (e.g.
recommendation engine
results)
•  Document store
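A toy model of the HBase data layout may help: a sorted map from row key to columns, where each column keeps timestamped versions and absent columns cost nothing. This is a conceptual sketch only, not the HBase client API; the class name, the method names and the `family:qualifier` strings are illustrative, and deletes are simplified (real HBase writes tombstone markers).

```python
class SparseTable:
    def __init__(self):
        # row key -> {"family:qualifier" -> {timestamp -> value}}
        self._rows = {}

    def put(self, row_key, column, value, timestamp):
        self._rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of one cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def delete(self, row_key, column):
        self._rows.get(row_key, {}).pop(column, None)

    def scan(self, start_row, stop_row):
        """Yield rows in key order, start inclusive and stop exclusive."""
        for row_key in sorted(self._rows):
            if start_row <= row_key < stop_row:
                cells = self._rows[row_key]
                yield row_key, {col: vers[max(vers)] for col, vers in cells.items()}
```

Because rows are ordered by key, a range scan such as `scan("driver#", "driver$")` touches only the rows sharing that prefix, which is what makes low latency lookups on very large tables possible.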
Phoenix
•  Apache™ Phoenix is a high performance
relational database layer over HBase for low
latency applications.
•  SQL queries are compiled into a series of
HBase scans producing regular JDBC result
sets.
•  Table metadata is stored in an HBase table
and versioned and can be queried by version.
•  Query performance in the millisecond to low
seconds range.
•  Largest known table size is over a trillion rows
with query response times in the 30 second
range.
Developers use it for:
•  Low latency queries
•  SQL skin on HBase
Phoenix Functions
HBase/Phoenix Demonstration
HDP Sandbox
•  Up and running with a Hadoop
environment in minutes
•  Basic and advanced tutorials with
discrete learning paths
•  Ecosystem partner tutorials
hortonworks.com/sandbox
Storm
•  Apache™ Storm is a distributed real-time
computation system for processing fast, large
streams of data. Storm adds reliable real-time
data processing capabilities to Hadoop.
•  Storm is extremely fast, with the ability to
process over a million records per second per
node on a cluster of modest size.
•  Apache Kafka is a publish-subscribe
messaging system that works well with
Storm.
Developers use it to:
•  Analyze streaming data in real time
Solr
•  Apache Solr provides full-text search and
near real-time indexing for data stored in
Hadoop.
•  Whether users search for tabular, text, geo-
location or sensor data in Hadoop, they find it
quickly with Apache Solr.
•  Apache Solr indexes via XML, JSON, CSV or
binary over HTTP. Users can query petabytes
of data via HTTP GET and receive XML, JSON,
CSV or binary results.
Developers use it to:
•  Provide search capability for a
cluster
•  Data scientists often use it to
explore data found in HDFS
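Querying over HTTP GET amounts to building a URL against Solr's select handler. The sketch below only constructs that URL with the standard `q`, `rows` and `wt` parameters; the host, port, collection name and field name used in the example are assumptions, and actually sending the request is left out.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10, fmt="json"):
    """Build a Solr select URL; `fmt` sets the response writer (for example json or xml)."""
    params = urlencode({"q": query, "rows": rows, "wt": fmt})
    return f"{base_url}/solr/{collection}/select?{params}"
```

For instance, `solr_select_url("http://localhost:8983", "truck_events", "event_type:speeding")` yields a URL that can be fetched with any HTTP client, assuming a collection with that hypothetical name and field exists.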
Spark
•  Spark is a general-purpose engine for ad-hoc
interactive analytics, iterative machine-
learning, and other use cases well-suited to
interactive, in-memory data processing of GB
to TB sized datasets.
•  Spark loads data into memory so it can be
queried repeatedly. It can create a “shadow”
of data that can be used in the next iteration
of a query
•  Spark provides simple APIs for data scientists
and engineers familiar with Scala
(programming language) to build applications
•  Spark is YARN-ready – another engine on
YARN!
Developers use it for:
•  Data Science: machine learning
and iterative analytics
Cascading
•  Cascading is an application development
framework for building data applications.
Converts applications into MapReduce jobs.
•  The Cascading SDK provides a collection of
tools, documentation, libraries, tutorials and
example projects.
•  Lingual. Simplifies systems integration through ANSI
SQL compatibility and a JDBC driver
•  Pattern. Enables various machine learning scoring
algorithms through PMML compatibility
•  Scalding. Enables development with Scala, a
powerful language for solving functional problems
•  Cascalog. Enables development with Clojure, a Lisp
dialect
Developers use it to:
•  Build complex native Hadoop
applications without getting
into MapReduce.
.NET
•  The Microsoft .NET SDK for Hadoop provides
API access to HDP and Microsoft HDInsight
including HDFS, HCatalog, Oozie and Ambari,
and also some PowerShell scripts for cluster
management.
•  There are also libraries for MapReduce and
LINQ to Hive. The latter builds on LINQ, the
established technology .NET developers use to
access most data sources, and brings it to Hive,
the de facto standard for Hadoop data query.
Developers use it to:
•  Build complex Microsoft .NET
Hadoop applications.
Java & Spring XD
•  Spring for Apache Hadoop (SHDP) provides a
developer API for Pig, Hive, Cascading and
provides extensions to Spring Batch for
orchestrating Hadoop based workflows.
•  It integrates with other Spring ecosystem
projects such as Spring Integration and Spring
Batch.
•  These foundational parts of Spring IO
platform make Hadoop development more
accessible to a wider range of Java
developers.
Developers use it to:
•  Build complex Hadoop
applications using Java and the
Spring framework
Hadoop Summit 2015
Thank You!
Eric Mizell – Director, Solutions Engineering
emizell@hortonworks.com
@ericmizell

Weitere ähnliche Inhalte

Was ist angesagt?

Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
Realtime Analytics in Hadoop
Realtime Analytics in HadoopRealtime Analytics in Hadoop
Realtime Analytics in HadoopRommel Garcia
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...StampedeCon
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Mich Talebzadeh (Ph.D.)
 
Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureVinod Kumar Vavilapalli
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureVinod Kumar Vavilapalli
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoophitesh1892
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesDataWorks Summit
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to HadoopHortonworks
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 

Was ist angesagt? (20)

Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
Realtime Analytics in Hadoop
Realtime Analytics in HadoopRealtime Analytics in Hadoop
Realtime Analytics in Hadoop
 
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - Stamped...
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
Hadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and FutureHadoop Summit Europe 2015 - YARN Present and Future
Hadoop Summit Europe 2015 - YARN Present and Future
 
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 

Andere mochten auch

"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013Kai Wähner
 
Search Engine Optimization (SEO) Trends 2015
Search Engine Optimization (SEO) Trends 2015Search Engine Optimization (SEO) Trends 2015
Search Engine Optimization (SEO) Trends 2015Venchito Tampon
 
Exorcise the NIMBY Within
Exorcise the NIMBY WithinExorcise the NIMBY Within
Exorcise the NIMBY Withinacohenhnk
 
Acceptable behaviour? Government intervention on unhealthy foods
Acceptable behaviour? Government intervention on unhealthy foodsAcceptable behaviour? Government intervention on unhealthy foods
Acceptable behaviour? Government intervention on unhealthy foodsIpsos UK
 
What happens to the artist when you pirate
What happens to the artist when you pirateWhat happens to the artist when you pirate
What happens to the artist when you pirateUtsab Bandopadhyay
 
Does Your Business Need to be Using Social Media
Does Your Business Need to be Using Social MediaDoes Your Business Need to be Using Social Media
Does Your Business Need to be Using Social MediaHall Internet Marketing
 
What is Google+ and why should we care? (2013 edition)
What is Google+ and why should we care? (2013 edition) What is Google+ and why should we care? (2013 edition)
What is Google+ and why should we care? (2013 edition) Kamber
 
Taylor Milbun Estate Agents In Essex Who Help With Mortgage
Taylor Milbun Estate Agents In Essex Who Help With MortgageTaylor Milbun Estate Agents In Essex Who Help With Mortgage
Taylor Milbun Estate Agents In Essex Who Help With MortgageMark Joseph
 
Enseñanza de la me canica
Enseñanza de la me canicaEnseñanza de la me canica
Enseñanza de la me canicamvaldes0127
 
Year 13 parents' evening presentation - October 2015
Year 13 parents' evening presentation - October 2015Year 13 parents' evening presentation - October 2015
Year 13 parents' evening presentation - October 2015rpalmerratcliffe
 
Presentazione turismo pellegrino
Presentazione turismo pellegrinoPresentazione turismo pellegrino
Presentazione turismo pellegrinoClaudio Cheirasco
 
1 plan del buen vivir 2009 2013-octubre 20_2010
1 plan del buen vivir 2009 2013-octubre 20_20101 plan del buen vivir 2009 2013-octubre 20_2010
1 plan del buen vivir 2009 2013-octubre 20_2010ubertocortez
 

Andere mochten auch (20)

"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
"Big Data beyond Apache Hadoop - How to Integrate ALL your Data" - JavaOne 2013
 
Search Engine Optimization (SEO) Trends 2015
Search Engine Optimization (SEO) Trends 2015Search Engine Optimization (SEO) Trends 2015
Search Engine Optimization (SEO) Trends 2015
 
Exorcise the NIMBY Within
Exorcise the NIMBY WithinExorcise the NIMBY Within
Exorcise the NIMBY Within
 
Acceptable behaviour? Government intervention on unhealthy foods
Acceptable behaviour? Government intervention on unhealthy foodsAcceptable behaviour? Government intervention on unhealthy foods
Acceptable behaviour? Government intervention on unhealthy foods
 
Historiadeladn
HistoriadeladnHistoriadeladn
Historiadeladn
 
What happens to the artist when you pirate
What happens to the artist when you pirateWhat happens to the artist when you pirate
What happens to the artist when you pirate
 
شكر
شكرشكر
شكر
 
Angola
AngolaAngola
Angola
 
Earthsoft-Collection-Apr 2011
Earthsoft-Collection-Apr 2011Earthsoft-Collection-Apr 2011
Earthsoft-Collection-Apr 2011
 
Does Your Business Need to be Using Social Media
Does Your Business Need to be Using Social MediaDoes Your Business Need to be Using Social Media
Does Your Business Need to be Using Social Media
 
What is Google+ and why should we care? (2013 edition)
What is Google+ and why should we care? (2013 edition) What is Google+ and why should we care? (2013 edition)
What is Google+ and why should we care? (2013 edition)
 
Taylor Milbun Estate Agents In Essex Who Help With Mortgage
Taylor Milbun Estate Agents In Essex Who Help With MortgageTaylor Milbun Estate Agents In Essex Who Help With Mortgage
Taylor Milbun Estate Agents In Essex Who Help With Mortgage
 
Renevela16
Renevela16Renevela16
Renevela16
 
Enseñanza de la me canica
Enseñanza de la me canicaEnseñanza de la me canica
Enseñanza de la me canica
 
Year 13 parents' evening presentation - October 2015
Year 13 parents' evening presentation - October 2015Year 13 parents' evening presentation - October 2015
Year 13 parents' evening presentation - October 2015
 
จรรยาวิชาชีพวิจัย
จรรยาวิชาชีพวิจัยจรรยาวิชาชีพวิจัย
จรรยาวิชาชีพวิจัย
 
Presentazione turismo pellegrino
Presentazione turismo pellegrinoPresentazione turismo pellegrino
Presentazione turismo pellegrino
 
lingkaran
lingkaranlingkaran
lingkaran
 
1 plan del buen vivir 2009 2013-octubre 20_2010
1 plan del buen vivir 2009 2013-octubre 20_20101 plan del buen vivir 2009 2013-octubre 20_2010
1 plan del buen vivir 2009 2013-octubre 20_2010
 
Sesion 5
Sesion 5Sesion 5
Sesion 5
 


How YARN Enables Multiple Data Processing Engines in Hadoop

  • 1. How YARN Enables Multiple Data Processing Engines in Hadoop
      We Do Hadoop
      Eric Mizell - Director, Solution Engineering
      © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  • 2. Agenda
      • HDFS Overview - Storage
      • YARN 101 - Compute
        – Yet Another Resource Negotiator
      • Enabling a Modern Data Architecture
      • YARN in action
        – Demo of streaming application
      • Hadoop Tools
        – Demos
      • Sample Code - https://github.com/emizell/HBase-Code-Samples
  • 3. HDFS Overview
  • 4. HDFS Overview
      • Typical Hardware for DataNodes
        – 2@8 Core
        – 256GB RAM
        – 2@24TB Disk
        – 10 GbE
      • Hadoop is rack aware
        – Data is replicated across racks, so losing a node or a whole rack does not lose data
      • Scale up or down
        – Add or remove DataNodes; HDFS re-replicates under-replicated blocks automatically, and the HDFS balancer can be run to even out disk usage
      • HDFS is a file system
        – Store any kind of data
        – Inexpensive storage
        – Replication factor of 3 by default (can be changed)
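
The rack-aware replication described on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: it assumes the commonly documented default placement for a replication factor of 3 (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). The rack and node names and the `place_replicas` helper are hypothetical.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Choose three DataNodes for one block (toy model, replication factor 3).

    topology maps a rack name to its list of node names. The sketch assumes
    at least two racks and at least two nodes per rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the writer's own node (assumes the writer runs on a DataNode).
    first = writer_node
    # Replica 2: a node on a different rack, so a rack failure cannot lose the block.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    second = rng.choice(topology[remote_rack])
    # Replica 3: another node on that same remote rack, limiting cross-rack traffic.
    third = rng.choice([node for node in topology[remote_rack] if node != second])
    return [first, second, third]


# Hypothetical cluster layout.
topology = {
    "Rack1": ["node1", "node2", "node3"],
    "Rack2": ["node4", "node5", "node6"],
    "RackN": ["node7", "node8", "node9"],
}
print(place_replicas(topology, "node1"))
```

Whichever nodes are picked, the three copies always span two racks, which is why a single rack failure leaves at least one replica available.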
  • 5. YARN Concepts
      • Application
        – Application is a job submitted to the framework
        – Example: MapReduce Job
      • Container
        – Basic unit of allocation
        – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
        – container_0 = 2GB, 1 CPU
        – container_1 = 1GB, 6 CPU
        – Replaces the fixed map/reduce slots
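
To make the container idea concrete, here is a minimal Python sketch of fitting differently shaped containers onto one node. It is an illustration only and does not use any YARN API: the `Resource` type, the `allocate` helper, the 8 GB / 8 core node size and the third request are all assumptions; only the shapes of container_0 and container_1 come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, free):
        """True if this request fits into the free capacity on every dimension."""
        return self.memory_mb <= free.memory_mb and self.vcores <= free.vcores

    def minus(self, used):
        return Resource(self.memory_mb - used.memory_mb, self.vcores - used.vcores)


def allocate(node_capacity, requests):
    """Grant requests in order while they fit; return (granted names, free capacity)."""
    free = node_capacity
    granted = []
    for name, ask in requests:
        if ask.fits_in(free):
            granted.append(name)
            free = free.minus(ask)
    return granted, free


node = Resource(memory_mb=8192, vcores=8)  # hypothetical node size
requests = [
    ("container_0", Resource(2048, 1)),  # 2GB, 1 CPU (from the slide)
    ("container_1", Resource(1024, 6)),  # 1GB, 6 CPU (from the slide)
    ("container_2", Resource(4096, 2)),  # hypothetical extra request
]
granted, free = allocate(node, requests)
print(granted, free)
```

In this run the third request is refused because the node has run out of CPU, not memory. A fixed map/reduce slot model cannot express that distinction, since every slot has the same shape.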
  • 6. YARN Architecture
      • Resource Manager
        – Global resource scheduler
        – Hierarchical queues
        – Application management
      • Node Manager
        – Per-machine agent
        – Manages the life-cycle of containers
        – Container resource monitoring
      • Application Master
        – Per-application
        – Manages application scheduling and task execution
        – E.g. MapReduce Application Master
  • 7. YARN – Running Apps
      [Diagram: Hadoop Client 1 and Hadoop Client 2 create and submit app1 and app2 to the ResourceManager (Scheduler, queues, ASM). NodeManagers on Rack1, Rack2 … RackN host the ApplicationMasters AM1 and AM2 and their containers C1.1–C1.4 and C2.1–C2.3. Arrow labels: negotiates, reports to, partitions, status report.]
  • 8. Hadoop 2.x Stack – Enabled by YARN
      • YARN is the architectural center of HDP
      • Enables batch, interactive and real-time workloads
      • Provides comprehensive enterprise capabilities
      • The widest range of deployment options
      [Diagram: HDP stack]
      • Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
      • Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), each over Tez; Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo); Slider; In-Memory (Spark); Others (ISV Engines)
      • YARN: Data Operating System (Cluster Resource Management)
      • HDFS (Hadoop Distributed File System)
      • Security: Authentication, Authorization, Accounting, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger)
      • Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
      • Deployment choice: Linux, Windows, On-Premises, Cloud
  • 9. Hadoop 2.2.x Stack – Versions
  • 10. Enabling a Modern Data Architecture with Apache Hadoop
      Hortonworks. We do Hadoop.
  • 11. Existing Siloed Data Architectures Under Pressure
      [Diagram: siloed stack]
      • Applications: Business Analytics, Custom Applications, Packaged Applications
      • Data system: RDBMS, EDW, MPP, each in its own silo
      • Sources: Existing Sources (CRM, ERP, Clickstream, Logs)
      • Data growth: New Data Types
        – OLTP, ERP, CRM Systems
        – Unstructured docs, emails
        – Clickstream
        – Server logs
        – Social/Web Data
        – Sensor, Machine Data
        – Geolocation
        – 85% (Source: IDC)
      • Can’t manage new data paradigm
      • Constrains data to specific schema
      • Siloed data
      • Limited scalability
      • Economically unfeasible
      • Limited analytics
  • 12. HDP2 and YARN enable the Modern Data Architecture
      Hortonworks architected and led development of YARN
      • Common data set, multiple applications
        – Optionally land all data in a single cluster
        – Batch, interactive & real-time use cases
        – Support multi-tenant access, processing & segmentation of data
      • YARN: Architectural center of Hadoop
        – Consistent security, governance & operations
        – Ecosystem applications certified by Hortonworks to run natively in Hadoop
      [Diagram: modern data architecture]
      • Sources: Existing Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
      • Applications: Business Analytics, Custom Applications, Packaged Applications
      • Data system: RDBMS, EDW, MPP, alongside YARN: Data Operating System (Batch, Interactive, Real-Time) over HDFS (Hadoop Distributed File System)
  • 13. YARN in Action
      Hortonworks. We do Hadoop.
  • 14. Trucking Company’s YARN-enabled Architecture
      [Diagram components]
      • Truck Sensors
      • Inbound Messaging (Kafka)
      • Stream Processing (Storm)
      • Real-time Serving (HBase)
      • Interactive Query (Hive on Tez)
      • Alerts & Events (ActiveMQ)
      • Real-Time User Interface
      • Microsoft Excel
      • Many Workloads: YARN
      • Distributed Storage: HDFS
  • 15. Components of the Topology
      • 9-node HDP 2.2 cluster with Storm and HBase on YARN
      • 4-node Kafka 0.8 cluster
      • 1-node ActiveMQ with the STOMP protocol enabled
      • Spring 4.0 WebMVC web app using SockJS and ActiveMQ over STOMP
  • 16. Topology Architecture
      [Diagram: real-time streaming driver monitoring app]
      • Truck Simulator: T(1), T(2) … T(N); Truck Stream Generator via AKKA
      • Kafka Collector
      • Kafka Grid - Captures all Driving Events: brokers BR(1)–BR(5), ZK, truck_events TOPIC
      • Storm on YARN on HDP: Kafka Spout, HBase Bolt, Monitoring Bolt, WebSocket Bolt
      • HBase on HDP: driver dangerous events, driver dangerous events count
      • Email Alerts; ActiveMQ Alert Topic; ActiveMQ Violation Events Topic
      • Spring WebApp with SockJS WebSockets
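
The slide does not show what the Monitoring Bolt does internally. As a rough illustration of the kind of logic such a bolt might hold, the sketch below counts non-normal driving events per driver and emits an alert once a threshold is reached. Everything here is an assumption made for illustration: the class is plain Python rather than the Storm API, and the event fields, event type names and threshold of 3 are hypothetical.

```python
from collections import Counter

NORMAL = "Normal"  # hypothetical event type for uneventful driving


class MonitoringBolt:
    """Toy stand-in for a stream-processing bolt; not the Storm API."""

    def __init__(self, alert_threshold=3):
        self.alert_threshold = alert_threshold
        self.dangerous_counts = Counter()  # driver_id -> dangerous events seen

    def process(self, event):
        """Handle one truck event; return an alert message or None."""
        if event["event_type"] == NORMAL:
            return None
        driver = event["driver_id"]
        self.dangerous_counts[driver] += 1
        count = self.dangerous_counts[driver]
        if count >= self.alert_threshold:
            return f"driver {driver}: {count} dangerous events"
        return None


# Made-up sample events, standing in for messages read from the truck_events topic.
events = [
    {"driver_id": 11, "event_type": "Normal"},
    {"driver_id": 11, "event_type": "Overspeed"},
    {"driver_id": 12, "event_type": "Lane Departure"},
    {"driver_id": 11, "event_type": "Unsafe following distance"},
    {"driver_id": 11, "event_type": "Overspeed"},
]
bolt = MonitoringBolt(alert_threshold=3)
alerts = [alert for alert in map(bolt.process, events) if alert]
print(alerts)
```

In the deck's architecture the running count would live in HBase and the alert would go to an ActiveMQ topic; the in-memory `Counter` and the returned string only stand in for those two sinks.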
  • 17. Demo
  • 18. Hadoop Tools
      Hortonworks. We do Hadoop.
  • 19. Agenda
      • The Basics
      • MapReduce & Java
      • Pig
      • Hive
      • HBase, Solr & Spark
      • Abstractions: .NET, Cascading and Spring XD
      • Intro to the Sandbox
      • Basic Hello World Using Hive and Pig
      • HBase and Phoenix demo and code discussion
  • 20. Hortonworks Data Platform 2.2: HDP Delivers Enterprise Hadoop
      • YARN is the architectural center of HDP
        – Common data set across all applications
        – Batch, interactive & real-time workloads
        – Multi-tenant access & processing
      • Provides comprehensive enterprise capabilities
        – Governance
        – Security
        – Operations
      • Enables broad ecosystem adoption
        – ISVs can plug directly into Hadoop
      • The widest range of deployment options
        – Linux & Windows
        – On premises & cloud
      [Diagram: HDP 2.2 stack]
      • Governance: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS)
      • Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), each over Tez; Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo); Slider; In-Memory (Spark); Others (ISV Engines)
      • YARN: Data Operating System (Cluster Resource Management)
      • HDFS (Hadoop Distributed File System)
      • Security: Authentication, Authorization, Audit, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox)
      • Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
      • Deployment choice: Linux, Windows, On-Premises, Cloud
  • 21. Hortonworks Data Platform 2.2: HDP Delivers Enterprise Hadoop
      [Diagram: the HDP 2.2 stack diagram from the previous slide]
      We will cover:
      • What it is & where it is used
      • Basic elements
  • 22. MapReduce MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. Developers use it to… • They don’t have to anymore • Many tools have been created to abstract this complexity [Diagram: map (M) and reduce (R) tasks chained between HDFS stages.]
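The map and reduce steps named on the slide above can be illustrated with a minimal, single-process sketch. This is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made up for illustration, and a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Made-up input lines, standing in for file splits read from HDFS.
counts = word_count(["YARN runs MapReduce", "YARN runs Storm"])
```

Each mapper only sees its own line and each reducer only sees one key, which is what lets the framework spread the work over many machines.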
  • 23. Pig • Apache™ Pig allows you to write complex MapReduce transformations using a simple scripting language. • Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort. • Pig Latin can be extended with UDFs (User Defined Functions), which are written in Java or a scripting language and then called directly from Pig Latin. Developers use Pig for • ETL • Basic “spreadsheet” functions • Preparing data for data science
  • 24. Example
      RAW_LOGS = LOAD '/user/paul/data/apache/access' USING TextLoader AS (line:chararray);
      CLICKS_RAW = LOAD '$input' USING PigStorage('|') AS (
          sls_key:chararray, sls_item_ln_id:int, chn_id:int, loc_id:int,
          all_chnl_rpt_chn_id:int, all_chnl_rpt_loc_id:int,
          sls_bsns_dt:chararray, sku_id:int);
      RECORDS = LOAD 'config' USING org.apache.hcatalog.pig.HCatLoader();
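To make the LOAD statements above more concrete, here is a rough Python analogue of loading pipe-delimited records with a schema and then aggregating and sorting them. It is only a sketch: the three-field `SCHEMA` is a hypothetical subset of the slide's columns, the sample rows are invented, and the helper names `load` and `clicks_per_channel` are not part of Pig.

```python
from collections import Counter

# Hypothetical subset of the slide's schema: (field name, Python type).
SCHEMA = [("sls_key", str), ("chn_id", int), ("sku_id", int)]


def load(lines, delimiter="|"):
    """Rough analogue of LOAD ... USING PigStorage('|') AS (...)."""
    for line in lines:
        values = line.rstrip("\n").split(delimiter)
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, values)}


def clicks_per_channel(records):
    """Rough analogue of grouping by chn_id, counting, and sorting descending."""
    counts = Counter(record["chn_id"] for record in records)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


sample = ["a1|10|500", "a2|20|501", "a3|10|502"]  # made-up rows
result = clicks_per_channel(load(sample))
```

In Pig the same pipeline would be a few relational statements, and the engine would decide how to turn them into distributed jobs.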
  • 25. Pig Operators
  • 26. Hive • Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop. • Created by a team at Facebook. • Provides a standard SQL interface to data stored in Hadoop. • Quickly find value in raw data files. • Proven at petabyte scale. • Compatible with popular BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc. Developers use it to: • Perform SQL queries • Interface with existing tools via JDBC/ODBC
  • 27. Sample SQL with Hive
      SELECT [ALL | DISTINCT] select_expr, select_expr, ...
      FROM table_reference
      [WHERE where_condition]
      [GROUP BY col_list]
      [HAVING having_condition]
      [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
      [LIMIT number];
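The clauses in the SELECT template above can be tried out locally with a small sketch. Note the swap: this uses Python's built-in `sqlite3` module as a stand-in, not Hive, so only the clauses the two dialects share (WHERE, GROUP BY, HAVING, LIMIT) appear; the Hive-specific CLUSTER BY, DISTRIBUTE BY and SORT BY are replaced by a plain ORDER BY to keep the output deterministic. The `clicks` table and its rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a local stand-in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (chn_id INTEGER, sku_id INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [(10, 500), (10, 501), (10, 502), (20, 500), (20, 503), (30, 502), (30, 100)],
)

rows = conn.execute(
    """
    SELECT chn_id, COUNT(*) AS n
    FROM clicks
    WHERE sku_id >= 500          -- filter rows before grouping
    GROUP BY chn_id              -- one output row per channel
    HAVING COUNT(*) >= 2         -- filter groups after aggregation
    ORDER BY n DESC, chn_id      -- stand-in for Hive's SORT BY / CLUSTER BY
    LIMIT 2
    """
).fetchall()
conn.close()
```

Against a real cluster the same query text, minus the stand-in ORDER BY choice, would typically be submitted through a Hive JDBC/ODBC connection instead.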
  • 28. Hive - Select Syntax
  • 29. Hive Demonstration HDP Sandbox • Up and running with a Hadoop environment in minutes • Basic and advanced tutorials with discrete learning paths • Ecosystem partner tutorials hortonworks.com/sandbox
  • 30. HBase • Apache™ HBase is a non-relational (NoSQL) database that runs on top of the Hadoop® Distributed File System (HDFS). • It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data. • It also adds transactional capabilities to Hadoop, allowing users to perform updates, inserts and deletes. • HBase was created for hosting very large tables with billions of rows and millions of columns. Developers use it to: • Provide low-latency access to massive amounts of data (e.g. recommendation engine results) • Serve as a document store
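The sparse, column-oriented model described above can be pictured with a toy in-memory table. This is a sketch under stated assumptions, not the HBase client API: the class `SparseTable`, its method names and the sample keys are hypothetical. It only illustrates that rows are addressed by key, that a row stores just the cells that were written, and that a scan walks a sorted key range with an inclusive start and exclusive stop.

```python
class SparseTable:
    """Toy model of a sparse table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Insert or update one cell; absent cells take no space."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        """Return one cell, or the whole row as a dict when column is None."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def delete(self, row_key, column):
        """Remove one cell; drop the row once it has no cells left."""
        row = self._rows.get(row_key)
        if row is not None:
            row.pop(column, None)
            if not row:
                del self._rows[row_key]

    def scan(self, start_key, stop_key):
        """Yield rows in key order for start_key <= key < stop_key."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


table = SparseTable()
table.put("user1", "rec:item1", "0.9")   # made-up recommendation score
table.put("user2", "rec:item7", "0.4")
table.put("user2", "info:name", "Ann")
table.delete("user2", "info:name")
scanned = list(table.scan("user1", "user2"))
```

Because each row carries only its own columns, two rows in the same table can have entirely different column sets, which is what makes very wide, mostly empty tables practical.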
  • 31. Phoenix • Apache™ Phoenix is a high-performance relational database layer over HBase for low-latency applications. • SQL queries are compiled into a series of HBase scans, producing regular JDBC result sets. • Table metadata is stored in an HBase table; it is versioned and can be queried by version. • Query performance is in the millisecond to low-seconds range. • The largest known table size is a trillion+ rows, with query response times in the 30-second range. Developers use it for: • Low-latency queries • A SQL skin on HBase
  • 32. Phoenix Functions
  • 33. HBase/Phoenix Demonstration HDP Sandbox • Up and running with a Hadoop environment in minutes • Basic and advanced tutorials with discrete learning paths • Ecosystem partner tutorials hortonworks.com/sandbox
  • 34. Storm • Apache™ Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Hadoop. • Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. • Apache Kafka is a publish-subscribe messaging system that works well with Storm. Developers use it to: • Analyze stream data in real time
  • 35. Solr • Apache Solr provides full-text search and near real-time indexing for data stored in Hadoop. • Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr. • Apache Solr indexes via XML, JSON, CSV or binary over HTTP. Users can query petabytes of data via HTTP GET and receive XML, JSON, CSV or binary results. Developers use it to: • Provide search capability for a cluster • Explore data found in HDFS (a common use among data scientists)
  • 36. Spark • Spark is a general-purpose engine for ad-hoc interactive analytics, iterative machine learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets. • Spark loads data into memory so it can be queried repeatedly. It can create a “shadow” of data that can be used in the next iteration of a query. • Spark provides simple APIs for data scientists and engineers familiar with Scala (programming language) to build applications. • Spark is YARN-ready – another engine on YARN! Developers use it for: • Data science: machine learning and iterative analytics
  • 37. Cascading • Cascading is an application development framework for building data applications. It converts applications into MapReduce jobs. • The Cascading SDK provides a collection of tools, documentation, libraries, tutorials and example projects. • Lingual. Simplifies systems integration through ANSI SQL compatibility and a JDBC driver. • Pattern. Enables various machine learning scoring algorithms through PMML compatibility. • Scalding. Enables development with Scala, a powerful language for solving functional problems. • Cascalog. Enables development with Clojure, a Lisp dialect. Developers use it to: • Build complex native Hadoop applications without getting into MapReduce.
  • 38. .NET • The Microsoft .NET SDK for Hadoop provides API access to HDP and Microsoft HDInsight, including HDFS, HCatalog, Oozie and Ambari, as well as some PowerShell scripts for cluster management. • There are also libraries for MapReduce and LINQ to Hive. The latter is particularly interesting: it builds on LINQ, the established technology .NET developers use to access most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query. Developers use it to: • Build complex Microsoft .NET Hadoop applications.
  • 39. Java & Spring XD • Spring for Apache Hadoop (SHDP) provides a developer API for Pig, Hive and Cascading, plus extensions to Spring Batch for orchestrating Hadoop-based workflows. • It integrates with other Spring ecosystem projects such as Spring Integration and Spring Batch. • These foundational parts of the Spring IO platform make Hadoop development more accessible to a wider range of Java developers. Developers use it to: • Build complex Hadoop applications using Java and the Spring framework
  • 40. Hadoop Summit 2015
  • 41. Thank You! Eric Mizell – Director, Solutions Engineering emizell@hortonworks.com @ericmizell