Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Crash Course
Summer 2015
Version 1.0
Hadoop Interest Group
Jules Damji
jdamji@hortonworks.com
@2twitme
Rafael Coss
rafael@hortonworks.com
@racoss
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Crash Course
• Why Hadoop?
• Hadoop Ecosystem & Distribution
• Store Data (HDFS)
• Process Data in Hadoop 1 (MapReduce)
• Process Data in Hadoop 2 (YARN + MapReduce/Tez)
• Data Access
• Lab
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What disrupted the data center?
?
Data?
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
New Data Paradigm Opens Up New Opportunity
2.8 zettabytes in 2012
44 zettabytes in 2020
1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research
TRADITIONAL: ERP, CRM, SCM
NEW: Clickstream; Web & social; Geolocation; Internet of Things; Server logs; Files, emails
Opportunity: Transform every industry via full fidelity of data and analytics
[Figure: ability to consume data, LAGGARDS vs. LEADERS; the gap is the Enterprise Blind Spot]
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop YARN-based Architecture Unlocks Opportunity
Consolidates all data sets
Delivers real-time insights
Integrates with data center
Scalable and affordable
TURN ALL OF YOUR DATA INTO VALUE
| Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Two Paths in a Customer’s Journey to a Data Lake
Journey to the Data Lake with Hadoop
[Figure axes: SCALE, SCOPE]
Goal (DATA LAKE):
• Centralized Architecture
• Data-driven Business
Systems of Insight
The journey begins with either:
1. Cost Optimization (Data Architecture Optimization)
2. Advanced Analytic Applications
Leaders are Data Driven
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Ecosystem
runs on
ETL
RDBMS Import/Export
Distributed Storage & Processing Framework
Secure NoSQL DB
SQL on HBase
NoSQL DB
Workflow Management
SQL
Streaming Data Ingestion
Cluster System Operations
Secure Gateway
Distributed Registry
ETL
Search & Indexing
Even Faster Data Processing
Data Management
Machine Learning
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Architecture
Data Access Engines
Distributed Reliable Storage
Distributed Compute Framework
Resource Mgt, Data Locality
Data Operating System
Batch Interactive Streaming
Governance Security
Apps
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Key Services
Hortonworks Data Platform
Multi-tenant data platform built on a centralized architecture of shared enterprise services
[Figure: Existing applications, New analytics, and Partner applications use Data access (batch, interactive, real-time) on YARN: data operating system (Resource management) and Storage, with shared Governance, Security, and Operations]
Key Services
• Resource and workload management
• Scalable tiered storage
• Consistent operations
• Comprehensive security
• Trusted data governance
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HORTONWORKS DATA PLATFORM
Hadoop & YARN
HDP 2.3 is Apache Hadoop; not “based on” Hadoop
DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY
HDP release | Date | Apache Hadoop version
HDP 2.0 | Oct 2013 | 2.2.0
HDP 2.1 | April 2014 | 2.4.0
HDP 2.2 | Dec 2014 | 2.6.0
HDP 2.3 | July 2015 | 2.7.1
Ongoing Innovation in Apache
HDFS
YARN
MapReduce
Hadoop Core
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HORTONWORKS DATA PLATFORM
Hadoop & YARN
HDP 2.3 is Apache Hadoop; not “based on” Hadoop
Flume
Oozie
Pig
Hive
Tez
Sqoop
Cloudbreak
Ambari
Slider
Kafka
Knox
Solr
Zookeeper
Spark
Falcon
Ranger
HBase
Atlas
Accumulo
Storm
Phoenix
DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY
[Figure: version of each component shipped in HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), and HDP 2.3 (July 2015)]
Ongoing Innovation in Apache
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
Ensure consistent enterprise services are applied across the Hadoop stack
Vertical Integration with YARN and HDFS
Ensure engines can run reliably and respectfully in a YARN-based cluster
GOVERNANCE
Load data and manage according to policy
SECURITY
Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS
Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig
• SQL: Hive
• Java, Scala: Cascading
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo
• In-Memory: Spark
• Others: ISV Engines
Tez, Slider
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Core
NameNode (NN), HDFS: /directory/structure/in/memory.txt
ResourceManager (RM), YARN: Resource management + scheduling
Worker Node (Disk, CPU, Memory): DataNode (HDFS) + NodeManager (YARN)
Legend: Hadoop daemon, User application
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Joys of Real Hardware (Jeff Dean)
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Distributed File System (HDFS)
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies of each block across different nodes in the cluster
• Data locality: move the processing to the nodes that hold the data
• Not just storage but computation
[Figure: a Logical File is split into Blocks 1-4; each block is stored three times on different nodes of the Cluster]
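The block-and-replica idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the node names, and the uniform random placement are hypothetical, and real HDFS uses a rack-aware placement policy rather than a plain random choice. The 128 MB block size and the replication factor of 3 are the Hadoop 2 defaults.

```python
import random

MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3, seed=None):
    """Pick `replication` distinct nodes for every block (simplified, not rack-aware)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    rng = random.Random(seed)
    return {
        block_id: rng.sample(nodes, replication)
        for block_id in range(1, num_blocks + 1)
    }


# Hypothetical 400 MB file on a hypothetical five-node cluster.
blocks = split_into_blocks(400 * MB, 128 * MB)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes, seed=42)

for block_id, holders in placement.items():
    print(f"block {block_id} ({blocks[block_id - 1] // MB} MB) -> {holders}")
```

With these numbers the file becomes four blocks (three of 128 MB and one of 16 MB), and losing any single node still leaves two copies of every block.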
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
The DataNodes
DataNode: “I’m still here! This is my latest heartbeat.”
DataNode: “I’m here too! And here is my latest heartbeat.”
NameNode: “Hey DataNode 1, replicate block 123 to DataNode 3.”
[Figure: DataNode 1, DataNode 3, and DataNode 4 report to the NameNode; block 123 is held by DataNode 1 and copied to DataNode 3]
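The heartbeat exchange on this slide could be modeled roughly as follows. It is a minimal sketch, not the NameNode's actual logic: the function names, the timestamps, the timeout value, and the failed DataNode2 are all hypothetical and chosen only to reproduce the "replicate block 123 to DataNode 3" instruction.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose latest heartbeat is older than `timeout` seconds."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_rereplication(block_locations, live_nodes, replication=3):
    """Return (block_id, source, target) copy orders for under-replicated blocks."""
    orders = []
    for block_id, holders in sorted(block_locations.items()):
        live_holders = sorted(h for h in holders if h in live_nodes)
        missing = replication - len(live_holders)
        if not live_holders or missing <= 0:
            continue  # nothing to copy from, or already fully replicated
        candidates = sorted(n for n in live_nodes if n not in holders)
        for target in candidates[:missing]:
            orders.append((block_id, live_holders[0], target))
    return orders


# Hypothetical cluster state: DataNode2 stopped sending heartbeats.
last_heartbeat = {"DataNode1": 100, "DataNode2": 40, "DataNode3": 98, "DataNode4": 99}
block_locations = {123: {"DataNode1", "DataNode2", "DataNode4"}}

dead = find_dead_nodes(last_heartbeat, now=103, timeout=30)
live = set(last_heartbeat) - dead
orders = plan_rereplication(block_locations, live)

for block_id, source, target in orders:
    print(f"Hey {source}, replicate block {block_id} to {target}.")
```

The point of the design is that DataNodes only report in; the NameNode alone decides when a block has too few copies and which node should copy it where.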
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Batch Processing in Hadoop
MapReduce
Batch Access to Data
Original data access mechanism for Hadoop
• Framework
Made for developing distributed applications to process vast amounts of data in parallel on large clusters
• Proven
Reliable interface to Hadoop that works from GB to PB. But it is batch oriented; speed is not its strong point.
• Ecosystem
Ported to Hadoop 2 to run on YARN. Supports original investments in Hadoop by customers and partner ecosystem.
MapReduce Job Lifecycle
Map Phase: a Mapper runs on each of DataNode1, DataNode2, and DataNode3
Shuffle/Sort: data is shuffled across the network & sorted
Reduce Phase: a Reducer runs on each of DataNode1, DataNode2, and DataNode3
Saying that MapReduce is dead is preposterous
- Would limit us to only new workloads
- ALL Hadoop clusters use MapReduce
- Why rewrite everything immediately?
YARN: Data Operating System
Batch Interactive Real-Time
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is MapReduce?
Break a large problem into sub-solutions
Map
• Iterate over a large number of records
• Extract something of interest from each record
Shuffle
• Sort intermediate results
Reduce
• Aggregate, summarize, filter or transform intermediate results
• Generate final output
[Figure: input Data flows through parallel Map processes (Read & ETL), then Shuffle & Sort, then Reduce processes (Aggregation) that write the output Data]
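The three phases above can be illustrated with a single-process word count in Python. This is a sketch of the programming model only, assuming hypothetical function names and input lines; it is not the Hadoop MapReduce API, and a real job would run the map and reduce functions in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map: iterate over the records and emit a (word, 1) pair for each word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group the intermediate values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into the final output."""
    return {key: sum(values) for key, values in grouped.items()}


records = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread over many machines without the functions knowing about each other.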
Page22 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
1st Gen Hadoop: Cost Effective Batch at Scale
HADOOP 1.0
Built for Web-Scale Batch Apps
Single App
BATCH
HDFS
Single App
INTERACTIVE
Single App
BATCH
HDFS
Silos created for distinct use cases
Single App
BATCH
HDFS
Single App
ONLINE
Page23 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop emerged as foundation of new data architecture
Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data
• Built by Yahoo! to be the heartbeat of its ad & search business
• Donated to Apache Software Foundation in 2005 with rapid adoption by
large web properties & early adopter enterprises
• Incredibly disruptive to current platform economics
Traditional Hadoop Advantages
• Manages new data paradigm
• Handles data at scale
• Cost effective
• Open source
Traditional Hadoop Had Limitations
Batch-only architecture
Single purpose clusters, specific data sets
Difficult to integrate with existing investments
Not enterprise-grade
Application
Storage
HDFS
Batch Processing
MapReduce
Page24 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What do iOS 6 and Windows 3.1 have in common?
Page26 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hadoop Beyond Batch with YARN
HDFS
MapReduce
Pig
(data flow)
Hive
(SQL)
Others
API,
Engine, and
System
Hadoop 1
MapReduce as the Base
HDFS
(redundant, reliable storage)
YARN
(Data Operating System: resource management, etc.)
Tez
(modern execution engine)
Data Flow
Pig
SQL
Hive
Java Apps
Cascading
Batch
MapReduce
Hadoop 2
Apache YARN as a Base
System
Engine
APIs
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, …
A shift from the old to the new…
Page27 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Tez is a critical innovation of the Stinger Initiative.
• Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading…
• More efficient processing than MapReduce
• Reduces operations and complexity of back-end processing
• Allows for Map-Reduce-Reduce, which saves hard disk operations
• Implements a “service” which is always on, decreasing start times of jobs
• Allows caching of data in memory
Why is Tez Important?
[Figure: Scripting (Pig), SQL (Hive), and Dev (Cascading/Scalding) applications run on Tez, on top of YARN: Data Operating System and HDFS (Hadoop Distributed File System)]
Batch Interactive Real-Time
Page28 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Tez
Hive – MapReduce Hive – Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Figure: Hive – MapReduce runs the query as a chain of separate MapReduce jobs (JOIN(a, b); JOIN(a, c); GROUP BY a.state with COUNT(*) and AVG(c.price)) and writes intermediate results to HDFS between jobs. Hive – Tez runs the same steps as a single graph of map (M) and reduce (R) tasks.]
Tez avoids unneeded writes to HDFS
Page29 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP delivers a Centralized Architecture
Other Pure Play Vendors
A siloed “with” YARN architecture
Disjoint, Siloed Clusters
• Inefficient use of resources, single tenant, duplicate storage & processing
• Multiple implementations of governance, security and operations
• New applications require new clusters
[Figure: Cluster1, Cluster2, … ClusterN, each with its own Application, Security, Storage, Governance, and Operations; workloads shown: Batch (YARN), Interactive (dedicated resource mgt), Real-time (dedicated resource mgt)]
Hortonworks Data Platform
A centralized architecture built on YARN
Single cluster, multiple applications
• Efficient storage, processing
• Centralized Security, Operations, Governance
• Run a variety of applications simultaneously
[Figure: Existing Applications, New Analytics, and Partner Applications (e.g., SAS) share Data Access: Batch, Interactive & Real-time, on YARN: Data Operating System (Resource Management) and Storage, with common Governance, Security, and Operations]
Page30 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
{Processing + Storage}
=
{MapReduce/YARN + HDFS}
=
{Core Hadoop}
Page31 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Modern Data Architecture emerges to unify data & processing
Modern Data Architecture
• Enable applications to have access to all your enterprise data through an efficient centralized platform
• Supported with a centralized approach to governance, security and operations
• Versatile to handle any applications and datasets no matter the size or type
[Figure: SOURCES (Existing Systems: ERP, CRM, SCM; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured) feed Hadoop (HDFS; YARN: Data Operating System; Batch, Interactive, Real-Time, Partner ISV) alongside EDW, MPP, and Data Marts, serving ANALYTICS (Applications, Business Analytics, Visualization & Dashboards)]
Page32 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Data Access?
Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive and Real-time.
Multiple engines provide optimized access to your mission-critical data.
Page33 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Access patterns enabled by YARN
Batch
Needs to happen, but with no timeframe limitations
Interactive
Needs to happen at human time
Real-Time
Needs to happen at machine execution time
YARN: Data Operating System
HDFS (Hadoop Distributed File System)
Batch Interactive Real-Time
Page34 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Projects Enable Access Patterns
• Various open source projects have incubated in order to meet these access pattern needs
• Today, they can all run on a single cluster on a single set of data because of YARN!
• ALL powered by a BROAD open community
YARN: Data Operating System
HDFS (Hadoop Distributed File System)
Batch: MapReduce, Pig, Hive
Interactive: Solr, Spark, Hive, Kafka
Real-Time: HBase, Accumulo, Storm
Page35 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Scripting Data Flow & ETL
Apache Pig
• Data flow engine and scripting language (Pig Latin)
• Allows you to transform data and datasets
Advantages over MapReduce
• Reduces time to write jobs
• Community support
• Piggybank has a significant number of UDFs to help adoption
• There are a large number of existing shops using Pig
YARN: Data Operating System
Batch Interactive Real-Time
Page36 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pig Latin
• Pig executes in a unique fashion:
o During execution, each statement is processed by the Pig interpreter
o If a statement is valid, it gets added to a logical plan built by the interpreter
o The steps in the logical plan do not actually execute until a DUMP or STORE command is used
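The deferred execution described above can be mimicked with a small Python class. This is only an analogy under stated assumptions: `LazyPlan` and its methods are hypothetical names, not part of Pig, and the "validity check" is reduced to confirming that each step is callable.

```python
class LazyPlan:
    """Collects steps into a plan and runs them only when dump() is called."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._steps = []      # the "logical plan": (name, function) pairs
        self.executions = 0   # how many times the plan has actually run

    def _add(self, name, step, fn):
        # A statement is only added to the plan if it is valid.
        if not callable(fn):
            raise TypeError(f"{name} needs a callable")
        self._steps.append((name, step))
        return self

    def filter(self, predicate):
        return self._add(
            "FILTER", lambda rows: [r for r in rows if predicate(r)], predicate
        )

    def foreach(self, fn):
        return self._add("FOREACH", lambda rows: [fn(r) for r in rows], fn)

    def plan(self):
        return [name for name, _step in self._steps]

    def dump(self):
        # Only here do the planned steps execute, in order.
        rows = self._rows
        for _name, step in self._steps:
            rows = step(rows)
        self.executions += 1
        return rows


pipeline = LazyPlan([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
runs_before_dump = pipeline.executions
result = pipeline.dump()
print(pipeline.plan(), runs_before_dump, result)
```

Building the whole plan before running anything is what gives an engine room to optimize the steps as a group instead of one statement at a time.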
Page37 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Why use Pig?
• Maybe we want to join two datasets from different sources on a common value, then filter, sort, and get the top 5 sites
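To make the scenario concrete, here is the same join, filter, sort, and top-N flow written in plain Python. The datasets, field names, and age range are hypothetical; in Pig each step would be one short statement and the work would run across the cluster.

```python
from collections import Counter


def top_sites(users, visits, min_age, max_age, n=5):
    """Join visits to users on the user name, filter by age, and rank the sites."""
    ages = dict(users)  # user name -> age
    # Join: keep only visits whose user appears in the users dataset.
    joined = [(user, site, ages[user]) for user, site in visits if user in ages]
    # Filter: keep visits by users inside the age range.
    kept = [site for _user, site, age in joined if min_age <= age <= max_age]
    # Group and count, then sort by count (descending) and site name.
    counts = Counter(kept)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Hypothetical data from two different sources.
users = [("ann", 22), ("bob", 41), ("cy", 19)]
visits = [
    ("ann", "a.example"),
    ("ann", "b.example"),
    ("cy", "a.example"),
    ("bob", "c.example"),
    ("dee", "a.example"),  # no matching user, dropped by the join
]

top = top_sites(users, visits, min_age=18, max_age=25)
print(top)
```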
Page38 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive: THE de facto standard for SQL in Hadoop
• What?
• Treat your data in Hadoop as tables
• Provides a standard SQL-92 interface to data in Hadoop
• Why?
• Shipped in every distribution… you already have it (although some do not ship complete versions)
• Quickly find value in raw data files
• Proven at petabyte scale for both batch and interactive queries
• Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
Page39 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive Architecture
1. User issues SQL query (through the Web UI, JDBC / ODBC, or CLI) to HiveServer2
2. Hive parses and plans query (Compiler, Optimizer, Executor), using the Hive MetaStore (MySQL, PostgreSQL, Oracle)
3. Query converted to MapReduce/Tez and executed on Hadoop as a MapReduce or Tez job with data-local processing
Page40 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Using Tez for Hive Queries
Set the following property in either hive-site.xml or in
your script:
set hive.execution.engine=tez;
Page41 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
SQL Compliance
Evolution of SQL Compliance in Hive
SQL Datatypes | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE | GROUP BY, ORDER BY, HAVING
BOOLEAN | JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION | Inner, outer, cross and semi joins
STRING | Sub-queries in the FROM clause
BINARY | ROLLUP and CUBE
TIMESTAMP | UNION
DECIMAL | Standard aggregations (sum, avg, etc.)
DATE | Custom Java UDFs
VARCHAR | Windowing functions (OVER, RANK, etc.)
CHAR | Advanced UDFs (ngram, XPath, URL)
Interval Types | Sub-queries for IN/NOT IN, HAVING
 | JOINs in WHERE Clause
 | INSERT/UPDATE/DELETE
Legend (color coding in the slide): Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap
YARN: Data Operating System
Batch Interactive Real-Time
Page42 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview of Stinger
100X+ Faster Time to Insight + Deeper Analytical Capabilities
Performance Optimizations
• Base Optimizations: generate simplified DAGs; in-memory hash joins
• Vector Query Engine: optimized for modern processor architectures
• Tez: express tasks more simply; eliminate disk writes; pre-warmed containers
• ORCFile: column store; high compression; predicate / filter pushdowns
• YARN: next-gen Hadoop data processing framework
• Query Planner: intelligent cost-based optimizer
Page43 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
YARN: Resource Manager for Hadoop 2.0
Flexible
Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient
Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared
Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
[Figure: three layers (API, Engine, System). APIs and engines: Batch (MapReduce), Scripting (Pig), SQL (Hive), Cascading (Java, Scala), NoSQL (HBase, Accumulo), Stream (Storm), Spark, Direct (Java, .NET), Real-Time (Slider), and other ISV applications; several run on Tez and several are marked HDP 2.2. System: YARN: Data Operating System on HDFS (Hadoop Distributed File System)]
Page44 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Hive & Pig
Hive & Pig work well together and many customers use both
Hive is a good choice:
• If you are familiar with SQL
• When you want to query data
• When you need an answer to specific questions
Pig is a good choice:
• For ETL (Extract, Transform, Load)
• For preparing data for analysis
• When you have a long series of steps to perform
YARN: Data Operating System
Batch Interactive Real-Time
Page45 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pig and Hive Sample Scenario
Hadoop Distributed File System: Raw Data, Structured Data
1. Put the data into HDFS in its raw format
2. Use Pig to explore and transform
3. Data analysts use Hive to query the data (Answers to questions = $$)
4. Data scientists use MapReduce, R, Mahout and Spark to mine the data (Hidden gems = $$)
Page46 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Big Data ETL Life Cycle
Sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails
1. Load or archive batch data
2. Replicate changed data & schemas
3. Stream real-time data
4. Mask sensitive data
5. Access customer “golden record” (MDM)
6. Transform & refine data
7. Move results to EDW
8. Explore & validate data
9. Govern & enrich with metadata
10. Correlate real-time events with historical patterns & trends
11. Subscribe to datasets
Consumers: Data Mart; Data Access & Query; Visualization & Analytics
Page47 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
HDP: Any Data, Any Application, Anywhere
Any Application
• Deep integration with ecosystem partners to extend existing investments and skills
• Broadest set of applications through the stable of YARN-Ready applications
• Over 70 Hortonworks Certified YARN Apps
Any Data
Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets.
Sources: Clickstream; Web & Social; Geolocation; Internet of Things; Server Logs; Files, emails; ERP, CRM, SCM
Anywhere
Implement HDP naturally across the complete range of deployment options: commodity, appliance, cloud, hybrid
Page48 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What next? -> developer.hortonworks.com
Page49 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Thank you!
rafael@hortonworks.com
@racoss
Page50 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
IoT Data Discovery Lab
• A trucking company has over 100 trucks.
• The geolocation data collected from the trucks contains events generated while the truck drivers are driving.
• The company’s goal with Hadoop is to mitigate risk:
o Understand correlations between miles driven and events
o Compute the risk factor for each driver based on mileage & events
• Lab Env
o Sandbox 2.3 TP
• Lab Doc
o URL: http://ow.ly/Qv1JM
• Load Data
• Query Data
• Process Data
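As a rough picture of the "risk factor per driver" goal, the sketch below combines mileage and event records in Python. The slides do not give the lab's formula, so everything here is an assumption made for illustration: the metric (non-normal events per 1,000 miles), the event names, the driver IDs, and the function name. The lab itself performs this step with Hive and Pig.

```python
from collections import defaultdict


def risk_factors(mileage_rows, event_rows, per_miles=1000.0):
    """Return {driver: non-normal events per `per_miles` miles driven}."""
    miles = defaultdict(float)
    for driver, distance in mileage_rows:
        miles[driver] += distance

    risky_events = defaultdict(int)
    for driver, event in event_rows:
        if event != "normal":  # assumed marker for uneventful readings
            risky_events[driver] += 1

    return {
        driver: risky_events[driver] * per_miles / total
        for driver, total in miles.items()
        if total > 0  # skip drivers with no recorded mileage
    }


# Hypothetical records: (driver id, miles) and (driver id, event type).
mileage = [("A1", 6000.0), ("A1", 4000.0), ("A2", 5000.0)]
events = [
    ("A1", "overspeed"),
    ("A1", "normal"),
    ("A2", "lane departure"),
    ("A2", "overspeed"),
]

risk = risk_factors(mileage, events)
for driver, score in sorted(risk.items()):
    print(driver, round(score, 3))
```

Normalizing by mileage keeps a driver who covers long distances from looking riskier simply because they spend more time on the road.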
Page51 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Move Data Into Hadoop
[Figure: Geolocation.csv and trucks.csv are moved into Hadoop and loaded (LOAD) into the csv tables Geolocation_stage and Trucks_stage; SQL then populates the ORC tables Geolocation and Trucks]
Page52 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Trucking Risk Analysis – Hadoop ELT
[Figure: the Geolocation and Trucks tables (ORC) are processed with SQL and Pig into the ORC tables Truck_mileage, Avg_mileage, DriverMileage, Events, and RiskFactor; the Risk Calculation step produces RiskFactor]
Page53 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Calculate Risk
Page54 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cautionary Statement Regarding Forward-Looking Statements
This presentation contains forward-looking statements involving risks and uncertainties. Such
forward-looking statements in this presentation generally relate to future events, our ability to
increase the number of support subscription customers, the growth in usage of the Hadoop
framework, our ability to innovate and develop the various open source projects that will enhance
the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general
business outlook. In some cases, you can identify forward-looking statements because they
contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,”
“target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or
similar terms or expressions that concern our expectations, strategy, plans or intentions. You
should not rely upon forward-looking statements as predictions of future events. We have based
the forward-looking statements contained in this presentation primarily on our current expectations
and projections about future events and trends that we believe may affect our business, financial
condition and prospects. We cannot assure you that the results, events and circumstances
reflected in the forward-looking statements will be achieved or occur, and actual results, events, or
circumstances could differ materially from those described in the forward-looking statements.
The forward-looking statements made in this prospectus relate only to events as of the date on
which the statements are made and we undertake no obligation to update any of the information in
this presentation.
Trademarks
Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other
names used herein may be trademarks of their respective owners.
Page55 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
A Definition of Open Enterprise Hadoop
GOVERNANCE
Load data and manage according to policy
SECURITY
Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS
Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Page57 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
[Figure: many Data sources pass through chains of ETL jobs and a fixed Schema on their way into the EDW]
Fragile workflows make supporting the analytical models you want expensive and time-consuming.
Page58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Options for Data Input
MapReduce
WebHDFS
hadoop fs -put
Vendor Connectors
Hadoop
NFS Gateway
Hue Explorer
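Of the options listed, WebHDFS is the one that works over plain HTTP. The sketch below only builds the URL for a file-creation request; the host name, user, and file path are hypothetical, and 50070 is the default NameNode HTTP port in Hadoop 2. In the WebHDFS REST API the client sends an HTTP PUT to this URL, the NameNode answers with a redirect to a DataNode, and the file content is then sent in a second PUT to that location. The command-line equivalent from the list is `hadoop fs -put`.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the WebHDFS URL for the first step of creating a file (op=CREATE)."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute path")
    query = urlencode(
        {
            "op": "CREATE",
            "user.name": user,
            "overwrite": str(overwrite).lower(),
        }
    )
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical sandbox host, user, and target path.
url = webhdfs_create_url("sandbox.example", 50070, "/tmp/trucks.csv", user="hdfs")
print(url)
```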
Page59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Risk Factors Viewed in a Graph
Page60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Risk Factors Viewed on a Map

Weitere ähnliche Inhalte

Was ist angesagt?

Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
State of the Union with Shaun Connolly
State of the Union with Shaun ConnollyState of the Union with Shaun Connolly
State of the Union with Shaun ConnollyHortonworks
 
Internet of things Crash Course Workshop
Internet of things Crash Course WorkshopInternet of things Crash Course Workshop
Internet of things Crash Course WorkshopDataWorks Summit
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...Hortonworks
 
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveDiscover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache HiveHortonworks
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ...Hortonworks
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...
Starting Small and Scaling Big with Hadoop (Talend and Hortonworks webinar)) ...Hortonworks
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramHortonworks
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to HadoopHortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 
Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]Discover.hdp2.2.h base.final[2]
Discover.hdp2.2.h base.final[2]Hortonworks
 
Enrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopEnrich a 360-degree Customer View with Splunk and Apache Hadoop
Enrich a 360-degree Customer View with Splunk and Apache HadoopHortonworks
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksData Con LA
 









Kürzlich hochgeladen (20)

CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

Hadoop crashcourse v3

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Hadoop Crash Course Summer 2015 Version 1.0 Hadoop Interest Group Jules Damji jdamji@hortonworks.com @2twitme Rafael Coss rafael@hortonworks.com @racoss
  • 2. Hadoop Crash Course • Why Hadoop? • Hadoop Ecosystem & Distribution • Store Data (HDFS) • Process Data in Hadoop 1 (MapReduce) • Process Data in Hadoop 2 (YARN + MapReduce/Tez) • Data Access • Lab
  • 3. What disrupted the data center? Data?
  • 4. New Data Paradigm Opens Up New Opportunity. 2.8 zettabytes in 2012; 44 zettabytes in 2020. 1 zettabyte (ZB) = 1 million petabytes (PB); Sources: IDC, IDG Enterprise, and AMR Research. NEW: Clickstream, Web & social, Geolocation, Internet of Things, Server logs, Files, emails. TRADITIONAL: ERP, CRM, SCM. Opportunity: Transform every industry via full fidelity of data and analytics. Ability to Consume Data: LAGGARDS, LEADERS. Enterprise Blind Spot
  • 5. Hadoop YARN-based Architecture Unlocks Opportunity • Consolidates all data sets • Delivers real-time insights • Integrates with data center • Scalable and affordable. TURN ALL OF YOUR DATA INTO VALUE | Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation
  • 6. Two Paths in a Customer’s Journey to a Data Lake. Journey to the Data Lake with Hadoop (axes: SCALE, SCOPE). Goal: • Centralized Architecture • Data-driven Business. DATA LAKE: Systems of Insight. The journey begins with either: 1. Cost Optimization (Data Architecture Optimization) 2. Advanced Analytic Applications. Leaders are Data Driven
  • 7. Hadoop Ecosystem runs on ETL RDBMS Import/Export Distributed Storage & Processing Framework Secure NoSQL DB SQL on HBase NoSQL DB Workflow Management SQL Streaming Data Ingestion Cluster System Operations Secure Gateway Distributed Registry ETL Search & Indexing Even Faster Data Processing Data Management Machine Learning
  • 8. Hadoop Architecture Data Access Engines Distributed Reliable Storage Distributed Compute Framework Resource Mgt, Data Locality Data Operating System Batch Interactive Streaming Governance Security Apps
  • 9. Hadoop Key Services Hortonworks Data Platform Multi-tenant data platform built on a centralized architecture of shared enterprise services YARN: data operating system Governance Security Operations Resource management Existing applications New analytics Partner applications Data access: batch, interactive, real-time Storage Key Services Resource and workload management Scalable tiered storage Consistent operations Comprehensive security Trusted data governance
  • 10. HORTONWORKS DATA PLATFORM: Hadoop & YARN. HDP 2.3 is Apache Hadoop; not “based on” Hadoop. Categories: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY. Hadoop Core (HDFS, YARN, MapReduce) version by release: HDP 2.0 (Oct 2013): 2.2.0; HDP 2.1 (April 2014): 2.4.0; HDP 2.2 (Dec 2014): 2.6.0; HDP 2.3 (July 2015): 2.7.1. Ongoing Innovation in Apache
  • 11. HORTONWORKS DATA PLATFORM: Hadoop & YARN. HDP 2.3 is Apache Hadoop; not “based on” Hadoop. Components: Flume, Oozie, Pig, Hive, Tez, Sqoop, Cloudbreak, Ambari, Slider, Kafka, Knox, Solr, Zookeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm, Phoenix. Categories: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY. Releases: HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (July 2015). [Figure: grid of component version numbers per HDP release] Ongoing Innovation in Apache
  • 12. Hortonworks Development Investment for the Enterprise. Horizontal Integration for Enterprise Services: Ensure consistent enterprise services are applied across the Hadoop stack. Vertical Integration with YARN and HDFS: Ensure engines can run reliably and respectfully in a YARN based cluster. GOVERNANCE: Load data and manage according to policy. SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez and Slider. YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System)
  • 13. [Diagram: core Hadoop daemons] NameNode (NN, HDFS): /directory/structure/in/memory.txt. ResourceManager (RM, YARN): resource management + scheduling. Worker Node: DataNode (HDFS), NodeManager (YARN); Disk, CPU, Memory, Core. Legend: Hadoop daemon, User application
  • 14. Joys of Real Hardware (Jeff Dean). Typical first year for a new cluster: • ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) • ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) • ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours) • ~1 network rewiring (rolling ~5% of machines down over 2-day span) • ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) • ~5 racks go wonky (40-80 machines see 50% packet loss) • ~8 network maintenances (4 might cause ~30-minute random connectivity losses) • ~12 router reloads (takes out DNS and external vips for a couple minutes) • ~3 router failures (have to immediately pull traffic for an hour) • ~dozens of minor 30-second blips for DNS • ~1000 individual machine failures • ~thousands of hard drive failures • slow disks, bad memory, misconfigured machines, flaky machines, etc.
  • 15. Hadoop Distributed File System (HDFS): Fault Tolerant Distributed Storage • Divide files into big blocks and distribute 3 copies randomly across the cluster • Processing: Data Locality • Not just storage but computation. [Diagram: a logical file is split into blocks 1, 2, 3, 4; each block appears three times across the nodes of the cluster]
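The block-splitting idea on the slide above can be sketched in a few lines of Python. This is a toy model rather than HDFS code: the 128 MB block size, the node names, and the helpers `split_into_blocks` and `place_replicas` are assumptions made for illustration, and the purely random placement follows the slide's wording (a real HDFS cluster also takes rack topology into account when placing replicas).

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (hypothetical for this sketch)
REPLICATION = 3                 # the "3 copies" from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is divided into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION, seed=0):
    """Pick `replication` distinct nodes at random for every block (toy placement)."""
    rng = random.Random(seed)
    return {block: rng.sample(nodes, replication)
            for block in range(1, num_blocks + 1)}


if __name__ == "__main__":
    blocks = split_into_blocks(500 * 1024 * 1024)  # a 500 MB logical file
    nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]    # hypothetical DataNode names
    for block, holders in place_replicas(len(blocks), nodes).items():
        print(f"block {block}: {holders}")
```

With these assumed numbers a 500 MB file yields four blocks, as in the slide's diagram, and losing any single node still leaves two copies of every block.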
  • 16. The DataNodes. DataNode 1: “I’m still here! This is my latest heartbeat.” DataNode 4: “I’m here too! And here is my latest heartbeat.” NameNode: “Hey DataNode1, Replicate block 123 to DataNode 3.” [Diagram: NameNode with DataNode 1, DataNode 3 and DataNode 4; block 123 is copied from DataNode 1 to DataNode 3]
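The heartbeat exchange on the slide above can be modelled roughly as follows. This is a simplified sketch, not the NameNode's actual logic: the class `NameNodeSketch`, the 30-second timeout, and the node names are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value for this sketch


class NameNodeSketch:
    """Toy NameNode: tracks heartbeats and orders re-replication."""

    def __init__(self, replication=3):
        self.replication = replication
        self.last_heartbeat = {}   # node -> time of latest heartbeat
        self.block_locations = {}  # block id -> set of nodes reporting it

    def heartbeat(self, node, blocks, now):
        """Record that `node` is alive at time `now` and holds `blocks`."""
        self.last_heartbeat[node] = now
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def live_nodes(self, now):
        return {node for node, seen in self.last_heartbeat.items()
                if now - seen <= HEARTBEAT_TIMEOUT}

    def replication_orders(self, now):
        """Return (block, source, target) orders for under-replicated blocks."""
        live = self.live_nodes(now)
        orders = []
        for block, holders in sorted(self.block_locations.items()):
            alive_holders = sorted(holders & live)
            candidates = sorted(live - holders)
            missing = self.replication - len(alive_holders)
            if alive_holders and missing > 0:
                for target in candidates[:missing]:
                    orders.append((block, alive_holders[0], target))
        return orders


if __name__ == "__main__":
    nn = NameNodeSketch()
    for node, blocks in [("dn1", [123]), ("dn2", [123]), ("dn3", []), ("dn4", [123])]:
        nn.heartbeat(node, blocks, now=0.0)
    # dn2 goes silent; the other nodes keep reporting.
    for node, blocks in [("dn1", [123]), ("dn3", []), ("dn4", [123])]:
        nn.heartbeat(node, blocks, now=40.0)
    print(nn.replication_orders(now=40.0))
```

Once `dn2` misses its heartbeat window, block 123 has only two live copies, so the sketch asks `dn1` to copy it to `dn3`, which mirrors the message shown on the slide.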
  • 17. Batch Processing in Hadoop. MapReduce: Batch Access to Data. Original data access mechanism for Hadoop • Framework: Made for developing distributed applications to process vast amounts of data in parallel on large clusters • Proven: Reliable interface to Hadoop which works from GB to PB. But batch oriented; speed is not its strong point. • Ecosystem: Ported to Hadoop 2 to run on YARN. Supports original investments in Hadoop by customers and partner ecosystem. MapReduce Job Lifecycle: Map Phase (a Mapper on each of DataNode1, DataNode2, DataNode3), Shuffle/Sort (data is shuffled across the network & sorted), Reduce Phase (a Reducer on each of DataNode1, DataNode2, DataNode3). Saying that MapReduce is dead is preposterous - Would limit us to only new workloads - ALL Hadoop clusters use MapReduce - Why rewrite everything immediately? YARN: Data Operating System (Batch, Interactive, Real-Time)
  • 18. What is MapReduce? Break a large problem into sub-solutions. Map • Iterate over a large # of records • Extract something of interest from each record. Shuffle • Sort intermediate results. Reduce • Aggregate, summarize, filter or transform intermediate results • Generate final output. [Diagram: input data feeds Map processes (Read & ETL), then Shuffle & Sort, then Reduce processes (Aggregation), which write the output data]
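The three phases on the slide above can be illustrated with a single-process word count in Python. This is only a sketch of the programming model, with none of Hadoop's distribution or fault tolerance; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: iterate over records and emit a (word, 1) pair for each word."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into the final output."""
    return {key: sum(values) for key, values in grouped}


def word_count(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real job the map and reduce functions run on many nodes and the shuffle moves data across the network; only the division of work into the three phases carries over from this sketch.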
  • 19. 1st Gen Hadoop: Cost Effective Batch at Scale. HADOOP 1.0: Built for Web-Scale Batch Apps. Silos created for distinct use cases. [Diagram: separate single-app clusters (BATCH, INTERACTIVE, ONLINE), each on its own HDFS]
  • 20. Hadoop emerged as foundation of new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data • Built by Yahoo! to be the heartbeat of its ad & search business • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises • Incredibly disruptive to current platform economics. Traditional Hadoop Advantages • Manages new data paradigm • Handles data at scale • Cost effective • Open source. Traditional Hadoop Had Limitations • Batch-only architecture • Single purpose clusters, specific data sets • Difficult to integrate with existing investments • Not enterprise-grade. [Diagram: Application; Batch Processing: MapReduce; Storage: HDFS]
  • 21. What do iOS 6 and Windows 3.1 have in common?
  • 22. Hadoop Beyond Batch with YARN. A shift from the old to the new… Hadoop 1, MapReduce as the Base (Single Use System: Batch Apps): HDFS; MapReduce (API, Engine, and System); Pig (data flow); Hive (SQL); Others. Hadoop 2, Apache YARN as a Base (Multi Use Data Platform: Batch, Interactive, Online, Streaming, …): System: HDFS (redundant, reliable storage), YARN (Data Operating System: resource management, etc.); Engine: Tez (modern execution engine), Batch MapReduce; APIs: Data Flow (Pig), SQL (Hive), Java Apps (Cascading)
  • 23. Why is Tez Important? Apache Tez is a critical innovation of the Stinger Initiative. • Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading… • More efficient processing than MapReduce • Reduces operations and complexity of back-end processing • Allows for Map-Reduce-Reduce, which saves hard disk operations • Implements a “service” which is always on, decreasing start times of jobs • Allows caching of data in memory. [Diagram: Scripting (Pig), SQL (Hive), YARN Dev, Cascading/Scalding and other Tez applications run on Tez, on top of YARN: Data Operating System (Batch, Interactive, Real-Time) and HDFS (Hadoop Distributed File System)]
  • 24. Tez: Hive – MapReduce vs. Hive – Tez. Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. [Diagram: on MapReduce the plan runs as a chain of separate map (M) and reduce (R) jobs (SELECT a.state; SELECT c.price; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*), AVG(c.price)) with intermediate results written to HDFS between jobs; on Tez the stages (SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*), AVG(c.price)) run as one graph of M and R tasks] Tez avoids unneeded writes to HDFS
  • 25. HDP delivers a Centralized Architecture. Other Pure Play Vendors: A siloed “with” YARN architecture. Disjoint, Siloed Clusters • Inefficient use of resources, single tenant, duplicate storage & processing • Multiple implementations of governance, security and operations • New applications require new clusters. [Diagram: Cluster1, Cluster2, … ClusterN, each with its own Application, Security, Storage, Governance and Operations; Batch on YARN, Interactive and Real-time each with dedicated resource mgt] Hortonworks Data Platform: A centralized architecture built on YARN. Single cluster, multiple applications • Efficient storage, processing • Centralized Security, Operations, Governance • Run a variety of applications simultaneously. [Diagram: Existing Applications, New Analytics, Partner Applications (e.g., SAS); Data Access: Batch, Interactive & Real-time; YARN: Data Operating System; Governance, Security, Operations, Resource Management; Storage]
  • 26. {Processing + Storage} = {MapReduce/YARN + HDFS} = {Core Hadoop}
  • 27. Modern Data Architecture emerges to unify data & processing. Modern Data Architecture • Enable applications to have access to all your enterprise data through an efficient centralized platform • Supported with a centralized approach to governance, security and operations • Versatile enough to handle any applications and datasets no matter the size or type. SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured; Existing Systems: ERP, CRM, SCM. ANALYTICS: Data Marts, Applications, Business Analytics, Visualization & Dashboards. [Diagram: HDFS (Hadoop Distributed File System) and YARN: Data Operating System (Batch, Interactive, Real-Time, Partner ISV) alongside MPP and EDW]
  • 28. What is Data Access? Data Access defines ALL the channels through which data can be accessed, analyzed, cleansed and consumed within Hadoop. Each channel can be categorized into THREE core patterns: Batch, Interactive and Real-time. Multiple engines provide optimized access to your mission-critical data.
  • 29. Access patterns enabled by YARN • Batch: Needs to happen, but with no timeframe limitations • Interactive: Needs to happen at human time • Real-Time: Needs to happen at machine execution time. [Diagram: YARN: Data Operating System (Batch, Interactive, Real-Time) on HDFS (Hadoop Distributed File System), nodes 1 to N]
  • 30. Apache Projects Enable Access Patterns • Various open source projects have incubated in order to meet these access pattern needs • Today, they can all run on a single cluster on a single set of data because of YARN! • ALL powered by a BROAD open community. Batch: MapReduce, Pig, Hive. Interactive: Solr, Spark, Hive, Kafka. Real-Time: HBase, Accumulo, Storm. [Diagram: YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1 to N]
  • 31. Scripting Data Flow & ETL. Apache Pig • Data flow engine and scripting language (Pig Latin) • Allows you to transform data and datasets. Advantages over MapReduce • Reduces time to write jobs • Community support • Piggybank has a significant number of UDFs to help adoption • There are a large number of existing shops using Pig. YARN: Data Operating System (Batch, Interactive, Real-Time)
  • 32. Pig Latin • Pig executes in a unique fashion: o During execution, each statement is processed by the Pig interpreter o If a statement is valid, it gets added to a logical plan built by the interpreter o The steps in the logical plan do not actually execute until a DUMP or STORE command is used
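The deferred execution described on the slide above can be mimicked in Python. The sketch below is an analogy, not Pig's implementation: `LazyPlan` and its methods are hypothetical names, the validity check is reduced to "is the argument callable", and `dump` stands in for Pig's DUMP command.

```python
class LazyPlan:
    """Toy logical plan: statements are recorded, not executed, until dump()."""

    def __init__(self, source):
        self.source = source
        self.steps = []      # the logical plan built statement by statement
        self.executions = 0  # how many times the plan has actually run

    def _add(self, kind, func):
        # "If a statement is valid, it gets added to a logical plan."
        if not callable(func):
            raise TypeError(f"{kind} needs a callable")
        self.steps.append((kind, func))
        return self

    def filter(self, predicate):
        return self._add("filter", predicate)

    def foreach(self, func):
        return self._add("foreach", func)

    def dump(self):
        """Execute every recorded step in order and return the rows."""
        self.executions += 1
        rows = list(self.source)
        for kind, func in self.steps:
            if kind == "filter":
                rows = [row for row in rows if func(row)]
            else:
                rows = [func(row) for row in rows]
        return rows


if __name__ == "__main__":
    plan = LazyPlan([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
    print(plan.executions)  # nothing has run yet
    print(plan.dump())
```

Deferring execution this way is what gives an engine the chance to look at the whole plan before running it, which is the design choice the slide is pointing at.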
  • 33. Why use Pig? • Maybe we want to join two datasets, from different sources, on a common value, and want to filter, and sort, and get top 5 sites
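To make the scenario on the slide above concrete, here is the same join, filter, sort and top-5 pipeline in plain Python. The datasets, the age range, and the function `top_sites` are invented for illustration; the slide does not say which fields are joined or filtered, so treat every name here as an assumption.

```python
from collections import Counter


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Join users with visits on the user name, filter by age, return the top n sites.

    users:  iterable of (name, age) pairs        (hypothetical first dataset)
    visits: iterable of (user_name, url) pairs   (hypothetical second dataset)
    """
    # Filter: keep only users inside the age range.
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # Join on the common value (the user name) and keep the visited URL.
    joined = [url for user, url in visits if user in eligible]
    # Group and count, sort by count descending (ties by URL), keep the top n.
    counts = Counter(joined)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    users = [("ann", 20), ("bob", 40), ("cat", 22)]
    visits = [("ann", "a.com"), ("bob", "b.com"), ("cat", "a.com"), ("cat", "c.com")]
    print(top_sites(users, visits))
```

On a single machine this is a handful of lines; the appeal of Pig is that a comparably short script runs the same steps across a cluster without hand-written MapReduce jobs.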
  • 34. Apache Hive: THE de facto standard for SQL in Hadoop • What? • Treat your data in Hadoop as tables • Provides a standard SQL-92 interface to data in Hadoop • Why? • Shipped in every distribution… you already have it (although some do not ship complete versions) • Quickly find value in raw data files • Proven at petabyte scale for both batch and interactive queries • Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc.
  • 35. Hive Architecture 1. User issues SQL query (Hive SQL via Web UI, JDBC / ODBC, or CLI) 2. Hive parses and plans query (HiveServer2: Compiler, Optimizer, Executor; Hive MetaStore on MySQL, PostgreSQL, or Oracle) 3. Query converted to MapReduce/Tez and executed on Hadoop (MapReduce or Tez job; data-local processing)
  • 36. Using Tez for Hive Queries. Set the following property in either hive-site.xml or in your script: set hive.execution.engine=tez;
  • 37. SQL Compliance: Evolution of SQL Compliance in Hive. SQL Datatypes: INT/TINYINT/SMALLINT/BIGINT; FLOAT/DOUBLE; BOOLEAN; ARRAY, MAP, STRUCT, UNION; STRING; BINARY; TIMESTAMP; DECIMAL; DATE; VARCHAR; CHAR; Interval Types. SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in the FROM clause; ROLLUP and CUBE; UNION; Standard aggregations (sum, avg, etc.); Custom Java UDFs; Windowing functions (OVER, RANK, etc.); Advanced UDFs (ngram, XPath, URL); Sub-queries for IN/NOT IN, HAVING; JOINs in WHERE Clause; INSERT/UPDATE/DELETE. Legend: Hive 10 or earlier, Hive 11, Hive 12, Hive 13, Roadmap. YARN: Data Operating System (Batch, Interactive, Real-Time)
  • 38. Overview of Stinger. Performance Optimizations: Base Optimizations (Generate simplified DAGs, In-memory Hash Joins); Vector Query Engine (Optimized for modern processor architectures); Tez (Express tasks more simply, Eliminate disk writes, Pre-warmed Containers); ORCFile (Column Store, High Compression, Predicate / Filter Pushdowns); YARN (Next-gen Hadoop data processing framework); Query Planner (Intelligent Cost-Based Optimizer). 100X+ Faster Time to Insight + Deeper Analytical Capabilities
  • 39. YARN: Resource Manager for Hadoop 2.0. Data Processing Engines Run Natively IN Hadoop • Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming • Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service • Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. [Diagram: System, Engine and API layers. Batch (MapReduce), Scripting (Pig), SQL (Hive), Cascading (Java, Scala), NoSQL (HBase, Accumulo), Stream (Storm), Spark, Real-Time (Slider), Direct (Java, .NET), other ISV applications; Tez; several engines marked HDP 2.2; all on YARN: Data Operating System and HDFS (Hadoop Distributed File System)]
  • 40. Hive & Pig. Hive & Pig work well together and many customers use both. Hive is a good choice: • if you are familiar with SQL • when you want to query data • when you need an answer to specific questions. Pig is a good choice: • for ETL (Extract, Transform, Load) • for preparing data for analysis • when you have a long series of steps to perform. YARN: Data Operating System (Batch, Interactive, Real-Time)
  • 41. Pig and Hive Sample Scenario (Hadoop Distributed File System: Raw Data, Structured Data) 1. Put the data into HDFS in its raw format 2. Use Pig to explore and transform 3. Data analysts use Hive to query the data (Answers to questions = $$) 4. Data scientists use MapReduce, R, Mahout and Spark to mine the data (Hidden gems = $$)
  • 42. Big Data ETL Life Cycle. Sources: Mobile Apps; Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails. Steps: 1. Load or archive batch data 2. Replicate changed data & schemas 3. Stream real-time data 4. Mask sensitive data 5. Access customer “golden record” 6. Transform & refine data 7. Move results to EDW 8. Explore & validate data 9. Govern & enrich with metadata 10. Correlate real-time events with historical patterns & trends 11. Subscribe to datasets. Targets: Visualization & Analytics, Data Mart, Data Access & Query, MDM
  • 43. HDP: Any Data, Any Application, Anywhere. Any Application • Deep integration with ecosystem partners to extend existing investments and skills • Broadest set of applications through the stable of YARN-Ready applications. Any Data: Deploy applications fueled by clickstream, sensor, social, mobile, geo-location, server log, and other new paradigm datasets with existing legacy datasets. Anywhere: Implement HDP naturally across the complete range of deployment options (commodity, appliance, cloud, hybrid). Data sources: Clickstream, Web & Social, Geolocation, Internet of Things, Server Logs, Files, emails, ERP, CRM, SCM. Over 70 Hortonworks Certified YARN Apps
  • 44. What next? -> developer.hortonworks.com
  • 45. Thank you! rafael@hortonworks.com @racoss
  • 46. IoT Data Discovery Lab • A trucking company has over 100 trucks. • The geolocation data collected from the trucks contains events generated while the truck drivers are driving. • The company’s goal with Hadoop is to mitigate risk: o Understand correlations between miles driven and events o Compute the risk factor for each driver based on mileage & events • Lab Env: Sandbox 2.3 TP • Lab Doc: URL: http://ow.ly/Qv1JM • Steps: Load Data, Query Data, Process Data
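One way to picture the "risk factor for each driver based on mileage & events" goal is the small Python sketch below. The slide does not give the formula used in the lab, so the definition here (risky events per 1,000 miles driven), the event names, and the function `risk_factors` are all assumptions chosen for illustration.

```python
def risk_factors(mileage, events, per_miles=1000):
    """Return a hypothetical risk factor per driver: risky events per `per_miles` miles.

    mileage: dict mapping driver id -> total miles driven
    events:  iterable of (driver id, event type); "normal" events are not counted
    """
    risky = {}
    for driver, event_type in events:
        if event_type != "normal":
            risky[driver] = risky.get(driver, 0) + 1
    return {driver: risky.get(driver, 0) * per_miles / miles
            for driver, miles in mileage.items() if miles > 0}


if __name__ == "__main__":
    mileage = {"A1": 2000, "A2": 1000}  # made-up totals per driver
    events = [("A1", "overspeed"), ("A1", "normal"),
              ("A2", "lane departure"), ("A2", "overspeed")]
    print(risk_factors(mileage, events))
```

Normalising by mileage is what lets a driver with many miles and a few events be compared fairly with a driver who drove little; the lab's own calculation may weight events differently.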
  • 47. Move Data Into Hadoop Geolocation.csv trucks.csv Geolocation_stage Geolocation Trucks_stage Trucks csv csv ORC ORC SQL SQL move LOAD
  • 48. Trucking Risk Analysis – Hadoop ELT Geolocation Trucks ORC ORC SQL SQL PIG Risk Calculation Truck_mileage ORC Avg_mileage ORC DriverMileage ORC RiskFactor ORC Events ORC
  • 49. Calculate Risk
  • 50. Cautionary Statement Regarding Forward-Looking Statements This presentation contains forward-looking statements involving risks and uncertainties. Such forward-looking statements in this presentation generally relate to future events, our ability to increase the number of support subscription customers, the growth in usage of the Hadoop framework, our ability to innovate and develop the various open source projects that will enhance the capabilities of the Hortonworks Data Platform, anticipated customer benefits and general business outlook. In some cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expects,” “plans,” “anticipates,” “could,” “intends,” “target,” “projects,” “contemplates,” “believes,” “estimates,” “predicts,” “potential” or “continue” or similar terms or expressions that concern our expectations, strategy, plans or intentions. You should not rely upon forward-looking statements as predictions of future events. We have based the forward-looking statements contained in this presentation primarily on our current expectations and projections about future events and trends that we believe may affect our business, financial condition and prospects. We cannot assure you that the results, events and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements. The forward-looking statements made in this prospectus relate only to events as of the date on which the statements are made and we undertake no obligation to update any of the information in this presentation. Trademarks Hortonworks is a trademark of Hortonworks, Inc. in the United States and other jurisdictions. Other names used herein may be trademarks of their respective owners.
  • 51. A Definition of Open Enterprise Hadoop. GOVERNANCE: Load data and manage according to policy. SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS. YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System)
  • 52. Page56 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Big Data ETL Life Cycle Mobile Apps Transactions, OLTP, OLAP Social Media, Web Logs Machine Device, Scientific Documents and Emails 9. Govern & enrich with metadata 3. Stream real-time data 8. Explore & validate data 4. Mask sensitive data 2. Replicate changed data & schemas Visualization & Analytics 11. Subscribe to datasets Data Mart 1. Load or archive batch data Data Access & Query 5. Access customer “golden record MDM 10. Correlate real-time events with historical patterns & trends 6. Transform & refine data 7. Move results to EDW
  • 53. Page57 © Hortonworks Inc. 2011 – 2015. All Rights Reserved EDW Data Data Data Data Data Data Data Data DataSchemaData Data Data ETL ETL ETL ETL EDW Data Data Data Data Data Data Data Data DataSchemaData Data Data ETL ETL ETL ETL Fragile workflows make supporting the analytical models you want expensive and time-consuming.
  • 54. Page58 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Options for Data Input MapReduce WebHDFS hadoop fs -put Vendor Connectors Hadoop nfs gateway Hue Explorer
  • 55. Page59 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Risk Factors Viewed in a Graph
  • 56. Page60 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Risk Factors Viewed on a Map
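
Slide 54 lists WebHDFS among the options for data input. As a minimal sketch of what that option involves, the Python helper below builds the URL for the first step of a WebHDFS file upload (an HTTP PUT with `op=CREATE` against the NameNode, which then redirects to a DataNode). The function name, host, port, user, and file path are hypothetical values chosen for illustration; none of them come from the slides.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the first-step WebHDFS CREATE URL for an absolute HDFS path.

    host, port, and user are placeholders for your own cluster settings.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    # WebHDFS exposes the file system under the /webhdfs/v1 prefix.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Hypothetical NameNode address and target path.
url = webhdfs_create_url("namenode.example.com", 50070, "/tmp/trucks.csv", "hdfs")
print(url)
```

The helper only builds the request URL. An actual upload would send a PUT to this URL and then a second PUT, with the file contents, to the DataNode location returned in the redirect.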

Editor's Notes

  1. Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
  2. The majority of enterprise data has traditionally come from large scale ERP, CRM, and other applications. Each application has become siloed without the ability to gain insights across ALL the data. Now the enterprise must rationalize existing data silos but also gain value from the explosion of data that is being generated from the new paradigm sources. The challenge is that the existing data management platforms have become both architecturally and financially impractical. Architecturally, these systems were not designed to store or process vast quantities of data. Financially, the licensing structures of the traditional approach are no longer feasible. These challenges and the rate at which data is being produced require a completely new approach to managing data. If we fast-forward another 3 to 5 years, more than 50% of the data under management within the enterprise will be from these new data paradigm sources. We have come to an inflection point in how the enterprise can manage its data. [NEXT SLIDE]
  3. What has created this inflection point is the growth and value from the new paradigm data. New data paradigm sources have put tremendous pressure on existing platforms but have also created tremendous opportunities. Exponential Growth. 85% year over year growth. Varied Nature. The incoming data can have little or no structure, or structure that changes too frequently for reliable schema creation at time of ingest. Value at High Volumes. The incoming data can have little or no value as individual records or small groups of records. But at high volumes and over longer historical perspectives, it can be inspected for patterns and used for advanced analytic applications. This New Data Paradigm opens up the Opportunity for both an architectural and business transformation that applies to virtually every industry. [NEXT SLIDE]
  4. In today’s data-rich world, overlooked insight translates into missed opportunity.   The opportunities afforded by the age of Big Data have given rise to a new ultra-competitive breed of business that consumes the full spectrum of its data, transforming immense volumes and varieties of data into currency.   Our customers are investing in next-generation “systems of insight,” with advanced analytic apps providing a single, holistic view of customers and processes, and delivering predictive analytics around business performance and discovery through machine learning.   Underpinning these capabilities is a YARN-based architecture that delivers huge new processing power, scale, and efficiency especially when it’s properly integrated with existing operational and data warehousing systems.   HDP usage typically begins by creating new analytic applications fueled by the data that was not previously being captured.   As more and more applications are created, more opportunity is unlocked across ALL data sets, from the new types of data from sensors/machines, server logs, clickstreams, and other traditional sources like ERP and CRM.   Ultimately, HDP’s YARN-based architecture acts as a shared service for delivering deep insight across a large, broad, diverse set of data at efficient scale in a way that existing enterprise systems and tools can integrate with.   [NEXT SLIDE]
  5. Ultimately, most organizations that adopt Hadoop aspire to create a data lake where multiple applications use a shared set of resources for both storage and processing, all with a consistent level of service. The value in the data lake ultimately results in delivery of “systems of insight,” where advanced algorithms and applications that access multiple data sets allow organizations to derive brand new value from data that once could not be investigated or was simply too complex to combine and analyze. Hadoop doesn’t just create a Data Lake; it opens the platform for analysts to view multiple data sources in multiple dimensions and reduce time to insight. This journey from apps to lake is only possible with HDP and its YARN-based architecture.
  6. http://hortonworks.com/solutions/data-architecture-optimization/ http://hortonworks.com/solutions/advanced-analytic-apps/#single-view-customer http://hortonworks.com/solutions/advanced-analytic-apps/#predictive-analytics http://hortonworks.com/solutions/advanced-analytic-apps/#data-discovery BAWAG Bank, KPN, Daimler, ING, British Ga
  7. Since starting the company, one of our core missions was to make Hadoop an enterprise viable data platform. With HDP and its YARN-based architecture, the market now has a multi-tenant data platform built on a centralized architecture that provides the shared enterprise services of Resource Management, Operations, Security, Governance in a consistent manner for all Data Access patterns, for batch, interactive, or real-time applications.   These enterprise readiness capabilities help enable HDP to be used everywhere.   While it’s clear that HDP is ready for the enterprise, that doesn’t mean that we stop our work on enterprise readiness.   In fact, it’s just the opposite. There are more security, governance and operational advancements taking place in the Hadoop ecosystem now than ever before. And we continue to advance all of the services with the community.   [NEXT SLIDE]
  8. From Jeff Dean http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
  9. The outline stays the same; Map and Reduce change to fit the problem.
  10. Enter Hadoop. Faced with this challenge, the team at Yahoo conceived and created Apache Hadoop to address it. They were then convinced that contributing this platform to an open community would speed innovation. They open sourced the technology and did so within the governance of the Apache Software Foundation (ASF). This introduced two distinct, significant advantages: not only could they manage new data types at scale, but they now had a commercially feasible approach. However, there were still significant challenges. The first generation of Hadoop was designed and optimized for batch-only workloads, it required dedicated clusters for each application, and it didn’t integrate easily with many of the existing technologies present in the data center. Also, like any emerging technology, Hadoop was required to meet a certain level of readiness required by the enterprise. After running Hadoop at scale at Yahoo, the team spun out to form Hortonworks with the intent to address these challenges and make Hadoop enterprise ready.
  11. Access, Execution, Resource Mgt
  12. Since HDP provides a centralized architecture that is built on YARN with common services for security, operations, and governance, it enables the enterprise to run a wide range of applications simultaneously with well managed service levels. More applications and more data can run in the same shared cluster which simplifies the security, operations, and governance. Since the other pure play vendors have NOT built their products from the ground-up on a centralized YARN architecture, their platform architectures are disjoint. Without a consistent set of services applied to all applications and workloads, users are forced to silo their clusters in order to achieve predictable performance and service levels – which is more complex and costly. And since the critical services for security, operations, and governance are implemented as bolt-ons, the deployment architecture is further complicated.
  13. In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo! This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to have access to all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop that provides the versatility to handle any application and dataset no matter the size or type. Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrated with existing enterprise policies. This work allowed for a new approach to data to emerge, the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
  14. Pig Latin, a language intended to sit between the two. Provides standard relational transforms (join, sort, etc.). Schemas are optional, used when available, and can be defined at runtime. User Defined Functions are first-class citizens. Pig is an engine for executing programs on top of Hadoop. It provides a language, Pig Latin, to specify these programs.
  15. Pig executes in a unique fashion: some commands build on previous commands, while certain commands trigger a MapReduce job.
  16. Interactive queries at scale. Originally created by a team at Facebook.
  17. HDP 2.x ships with HiveServer2, a Thrift-based implementation that allows multiple concurrent connections and also supports Kerberos authentication.
  18. Note that this property is set to mr by default.
  19. The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it. The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the resource manager for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  20. You have talked about the components of Hadoop, now this slide talks about the various roles of Hadoop professionals.
  21. HDP is versatile enough to handle any data, for any application, anywhere. ANY DATA: Hadoop was initially designed to store and process vast quantities of data and is still the optimal platform to do so. With YARN and the introduction of all types of access methods, from batch to interactive and real time, processing and analyzing this data has become even easier. ANY APPLICATION: YARN also opens up Hadoop so that it can extend the value of linear-scale storage and processing to existing applications. This also allows you to reuse your existing skillsets and resources, but with Hadoop as a foundation. To date, Hortonworks has certified over 70 ISVs to be YARN ready and the list is growing. ANYWHERE: As a key part of the modern data architecture, Hadoop needs to be available across a wide range of deployment choices, and we enable the widest choice in the industry. In 2011, we established our partnership with Microsoft based on a shared vision of a hybrid world where Hadoop can run on-premises on Windows Server or Linux, within turnkey appliances, and in the cloud as a fully managed service or simply running within virtual machines on infrastructure-as-a-service clouds. Our work with Microsoft brought Hadoop to the Windows Server ecosystem and we’re the only vendor serving that market opportunity today. While most of our customers are deploying on-premises Hadoop clusters, we are uniquely positioned to support a hybrid architecture as enterprises embrace cloud for specific use cases.
  22. This is a great use case, but only spend 3-4 minutes on it. Run Hive queries to refine the trucks data to get the average mileage. Compute the risk factor for each driver (mileage
  23. truck_mileage, avg_mileage
  24. Power Pivot again, this time demonstrating which drivers had the most incidents.
  25. Power Pivot map again – this time showing the areas where the incidents occurred.
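
Note 9 says that the outline stays the same while Map and Reduce change to fit the problem. The following is a minimal, single-process Python sketch of that idea, not Hadoop's actual API: `run_mapreduce` is a hypothetical driver that fixes the outline (map, group by key, reduce), and the two pairs of map and reduce functions are illustrative examples with made-up input data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Fixed outline: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Sort keys so the result order is deterministic.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Problem 1: word count.
def word_count_map(line):
    for word in line.lower().split():
        yield word, 1


def sum_reduce(key, values):
    return sum(values)


# Problem 2: maximum reading per station, from "station,reading" lines.
def max_reading_map(line):
    station, reading = line.split(",")
    yield station, int(reading)


def max_reduce(key, values):
    return max(values)


counts = run_mapreduce(["a b a", "b c"], word_count_map, sum_reduce)
maxima = run_mapreduce(["sfo,61", "nyc,55", "sfo,70"], max_reading_map, max_reduce)
print(counts)
print(maxima)
```

Only the functions passed to `run_mapreduce` differ between the two problems; the driver itself is untouched. In a real cluster the grouping step is a distributed shuffle rather than an in-memory dictionary.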