Presentation on how to chat with PDF using ChatGPT code interpreter
OOP 2014
1. Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
Emil A. Siemes
esiemes@hortonworks.com
Solution Engineer
January 2014
2. Enable your Modern Data Architecture by
Our Mission: Delivering Enterprise Apache Hadoop
Our Commitment
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Open Leadership
Drive innovation in the open exclusively via the
Apache community-driven open source process
Trusted Partners
Enterprise Rigor
Engineer, test and certify Apache Hadoop with
the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data
center technologies and skills
Page 2
3. APPLICATIONS
A Traditional Approach Under Pressure
Custom
Applications
Business
Analytics
Packaged
Applications
DATA SYSTEM
2.8 ZB in 2012
85% from New Data Types
RDBMS
EDW
MPP
REPOSITORIES
15x Machine Data by 2020
40 ZB by 2020
SOURCES
Source: IDC
Existing Sources
Emerging Sources
(CRM, ERP, Clickstream, Logs)
(Sensor, Sentiment, Geo, Unstructured)
Page 3
4. APPLICATIONS
Emerging Modern Data Architecture
Custom
Applications
Business
Analytics
Packaged
Applications
DEV & DATA
TOOLS
SOURCES
DATA SYSTEM
BUILD &
TEST
OPERATIONAL
TOOLS
RDBMS
EDW
MANAGE &
MONITOR
MPP
REPOSITORIES
Existing Sources
Emerging Sources
(CRM, ERP, Clickstream, Logs)
(Sensor, Sentiment, Geo, Unstructured)
Page 4
5. Drivers of Hadoop Adoption
New Business
Applications
From NEW types of
Data (or existing
types for longer)
Page 5
6. Most Common NEW TYPES OF DATA
1. Sentiment
Understand how your customers feel about your brand and
products – right now
2. Clickstream
Capture and analyze website visitors’ data trails and
optimize your website
3. Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines
4. Geographic
Value
Analyze location-based data to manage operations where
they occur
5. Server Logs
Research logs to diagnose process failures and prevent
security breaches
6. Unstructured (txt, video, pictures, etc..)
Understand patterns in files across millions of web pages,
emails, and documents
+ Keep existing
data longer!
7. Drivers of Hadoop Adoption
Architectural
A Modern Data
Architecture
New Business
Applications
Complement your existing data
systems: the right workload in the
right place
Page 7
8. Let’s build a Data Lake…
Instructions on:
hadoopwrangler.com
Page 8
9. HDP Data Lake Solution Architecture
Manage Steps 1-4: Data Lifecycle with Falcon
FALCON (data pipeline & flow management)
Downstream
Data Sources
Oozie (Batch scheduler)
Step 4: Schedule and Orchestrate
HIVE
SOURCE
DATA
PIG
Step 3: Transform, Aggregate & Materialize
EDW
ClickStream
Data
HCATALOG
File
Sales
Transaction/
Data
JMS
Ingestion
Step 1:Extract & Load
REST
HTTP
Social Data
Sqoop/Hiv
e
EDW
(Teradata)
Step 2: Model/Apply Metadata
INTERACTIVE
SQOOP
compute
&
storage
.
Storm
Web
HDFS
Query/
Analytics/Repor
ting Tools
Hive Server
(Tez/Stinger)
Tableau/Excel
YARN
.
MR2
.
NFS
Marketing/I
nventory
HBase
Client
OLTP
HBase
Use Case Type 1:
Materialize & Exchange
(table & user-defined metadata)
FLUME
Product
Data
Mahout
(data processing)
Exchange
.
.
.
Elastic
Search
.
TEZ
.
.
.
SAS
compute
&
storage
Use Case Type 2:
Explore/Visualize
Datameer/Platfo
ra/SAP
Stream Processing,
Real-time Search,
MPI
AMBARI
Streaming
YARN
Apps
Data Lake HDP Grid
Knox – Perimeter Level Security
Opens up Hadoop to
many new use cases
Page 9
10. Hadoop 2: The Introduction of YARN
Store all date in a single place, interact in multiple ways
Single Use System
Multi Use Data Platform
Batch Apps
Batch, Interactive, Online, Streaming, …
1st Gen of
Hadoop
HADOOP 2
Standard Query
Processing
(cluster resource management
& data processing)
HDFS
(redundant, reliable storage)
Real Time Stream
Processing
Hive, Pig
MapReduce
Online Data
Processing
HBase, Accumulo
Storm
Batch
…
Interactive
MapReduce
others
Tez
Efficient Cluster Resource
Management & Shared Services
(YARN)
Redundant, Reliable Storage
(HDFS)
Page 10
11. Let’s start simple…
• A solution unifying all data sources of a mobile App
– Allowing analytics over all data in one place
– In real time and long term
• Mobile Apps have multiple channels for data:
– Data created on the handset (e.g. geo location)
– Data created on servers accessed by the mobile app (e.g. app
data, logs)
– Data from backend services (e.g. RDBMS)
– Store data (e.g. iTunes Connect, Google Play)
– Social data (Twitter, App Reviews, etc.)
Page 11
12. Why Should We Care?
• How much revenue did I made?
(Not that easy to answer as one could think)
• Where are my customers now?
• Can you fulfill requirements from the business like: ”Tell me when our
customers are in a coffee shop so we can offer them e.g. Wifi”
• What are my customers thinking about my app/brand?
• Are the ones complaining really using it (correct)?
• How can I support marketing activities?
• How can I evaluate local marketing activities?
• Does positive/negative sentiment effect my downloads?
• Will my servers be able to deal with the load in 3 months
• …
Page 12
13. Design Goals
• Use as much as we have in our stack as possible
• Minimize dependencies on stacks beyond Hadoop
– Still make it useful and complete
• Make it fit into a 8GB MacBook/Laptop
• Release early & release often
Page 13
15. Types Of Data For iiCaptain
• Geo location data
• Store Data
• iTunes Connect, Google Play, Amazon via AppAnnie
• Twitter
• RDBMS (Sqoop)
• Logs
Page 15
19. SQL Interactive Query & Apache Hive
Key Services
Apache Hive
Platform, operational and
data services essential for
the enterprise
• The defacto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
Skills
SQL
Leverage your existing
skills: development,
analytics, operations
Stinger Initiative
Integration
Interoperable with existing
data center investments
Broad, community based effort to deliver the
next generation of Apache Hive
Speed
Scale
SQL
Improve Hive query
performance by 100X to
allow for interactive
query times (seconds)
The only SQL interface
to Hadoop designed for
queries that scale from
TB to PB
Support broadest range
of SQL semantics for
analytic applications
against Hadoop
Page 19
21. Roadmap
-
Servlet Engine in YARN
Project Savanna: Continuous Delivery end-2-end
Sentiment Analysis with Flume/Hive and App Reviews
Knox
Falcon
Phoenix
Page 21
22. HDP 2.0: Enterprise Hadoop Platform
OPERATIONAL
OPERATIONAL
SERVICES
SERVICES
AMBARI
Cluster
AMBARI Dataset
Mgmnt FALCON
FALCON*
Mgmnt
Schedule
OOZIE
OOZIE
Hortonworks
Data Platform (HDP)
DATA
DATA
SERVICES
SERVICES
FLUME
FLUME
Data
Movement
SQOOP
SQOOP
LOAD &
LOAD &
EXTRACT
EXTRACT
NFS
NFS
CORE
CORE SERVICES
WebHDFS
CORE
CORE SERVICES
SERVICES
KNOX*
KNOX*
WebHDFS
HIVE
HBASEData Access HIVE&
PIG
HCATALOG
HBASE
MAP
Process
REDUCE
TEZ
TEZ
ResourceYARN
Management
Cloud
• Integrates full range of
enterprise-ready services
HDFS
Storage
HDFS
Enterprise Readiness
High Availability, Disaster
Recovery, Rolling Upgrades,
Security and Snapshots
HORTONWORKS
DATA PLATFORM (HDP)
OS/VM
• The ONLY 100% open source
and most current platform
• Certified and tested at scale
• Engineered for deep
ecosystem interoperability
Appliance
Page 22
23. Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials
3. Investigate a business
case using the step-bystep business cases
scenarios
4. Validate YOUR business
case using your data in
the sandbox
Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community
Avoid Vendor Lock-In
Hortonworks Data Platform remain as close to the open source trunk as
possible and is developed 100% in the open so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience
Page 23
Editor's Notes
Hello Today I’m going to talk to you about HW and how we deliver an Enterprise Ready Hadoop to enable your modern data architecture.
Founded just 2.5 years ago from the original hadoop team members a yahoo.Hortonworks emerged as the leader in open source Hadoop.We are commited to ensure H is an enterprise viable data platform ready for your modern data architectureOur team is probably the largest assembled team of Hadoop experts and active leaders in the communityWe not only make sure Hadoop meets all your enterprise requirements likeOperations, reliablity & SecurityIt also needs to bePackaged & Tested and we do this.It has to work with what you have Make Hadoop an enterprise data platform. Make the market function.Innovate core platform, data, & operational servicesIntegrate deeply with enterprise ecosystemProvide world-class enterprise supportDrive 100% open source software development and releases through the core Apache projectsAddress enterprise needs in community projectsEstablish Apache foundation projects as “the standard”Promote open community vs. vendor control / lock-inEnable the Hadoop market to functionMake it easy for enterprises to deploy at scaleBe the best at enabling deep ecosystem integrationCreate a pull market with key strategic partners
On left hands side you see the traditional sources of data you have. Data base or Web applications may be growing by 8% year over year. And while there is also a lot of innovation happening in this space Over the last couple of years we see more and more pressure on these traditional systems as new datasources bringing a massive growth of new data:Smartphones, Sensors, Internet of things, logs. No efficient way to store and analyze this data.
Then Hadoop entered the scene. What we’re seeing in most organizations is that they bring Hadoop in to the datacenter not to replace the existing systems but to augment or support them. They use the right tool at the right place for the right type of data.Hadoop really is the Landing spot for these new data sources we discussed before. It provides a way to store and process these types of new data in a very cost effective manor. While it’s very cost effective it also scales horizontally and linear which was a key requirement when it was invented at Yahoo: When you need to index the web you better know how to scale and you better can handle the distributed nature of a cluster.
Net New Analytic applications.How to extract value from the new sources60-70% of Hadoop installations are of this type.
Make Hadoop an enterprise data platformInnovate core platform, data, & operational servicesIntegrate deeply with enterprise ecosystemProvide world-class enterprise supportDrive 100% open source software development and releases through the core Apache projectsAddress enterprise needs in community projectsEstablish Apache foundation projects as “the standard”Promote open community vs. vendor control / lock-inEnable the Hadoop market to functionMake it easy for enterprises to deploy at scaleBe the best at enabling deep ecosystem integrationCreate a pull market with key strategic partners