[Hadoop Summit 2016 Tokyo] Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations either having already moved to the cloud or having started planning their expansion into it, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. Join this session to understand the nitty-gritty of implementing Hadoop in the cloud and the various options available. Hadoop + Cloud is definitely a deadly combination.
Speaker: Naoki SATO, Microsoft Japan
http://hadoopsummit.org/tokyo/
https://satonaoki.wordpress.com/2016/11/04/hadoop-summit/
https://docs.com/satonaoki/8352/hadoop-summit-2016-tokyo-hadoop-in-the-cloud-the
http://www.slideshare.net/satonaoki/20161026hadoopsummithadoopinthecloud
http://www.slideshare.net/HadoopSummit/hadoop-in-the-cloud-the-what-why-and-how-from-the-experts-67926008
8. Hadoop Clusters
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted Administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource Utilization
• Cost-Efficient method calls
9. Cloud
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted Administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource Utilization
• Cost-Efficient method calls
10. Hadoop in the Cloud
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted Administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource Utilization
• Cost-Efficient method calls
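The "Distributed Storage" and "Automated Failover" bullets above (files split, files replicated, failover to replicated data) can be illustrated with a toy sketch. This is not HDFS or any Azure storage implementation: the function names, the round-robin placement rule, and the node names are all hypothetical, and real systems use more elaborate placement policies (for example rack awareness).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes using simple round-robin.

    Hypothetical placement rule chosen for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def pick_replica(block_id: int, placement: dict[int, list[str]], failed: set[str]) -> str:
    """Return the first healthy node holding the block, skipping failed nodes."""
    for node in placement[block_id]:
        if node not in failed:
            return node
    raise RuntimeError(f"all replicas of block {block_id} are unavailable")


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
# If node n1 fails, a read of block 0 falls over to the next replica.
survivor = pick_replica(0, placement, failed={"n1"})
```

The read path never needs operator intervention as long as one replica survives, which is the sense in which the slide calls failover "unmonitored".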
16. Azure HDInsight
Hadoop and Spark as a Service on Azure
Fully managed Hadoop and Spark for the cloud
100% Open Source Hortonworks Data Platform
Clusters up and running in minutes
Managed, monitored and supported by Microsoft with the industry’s best enterprise SLA
Use familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower total cost of ownership than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
21. Azure Data Lake Store
A hyper-scale repository for big data analytics workloads
Hadoop File System (HDFS) for the cloud
No limits to scale
Store any data in its native format
Enterprise grade access control and encryption
Optimized for analytic workload performance
22. HDInsight cluster provisioning states
States shown in the diagram:
• Ready for deployment
• Accepted
• Cluster storage provisioned
• AzureVM configuration
• Configuring HDInsight
• Customize cluster? (Yes / No)
• Cluster customization (custom script running)
• Running
• Cluster operational
• Timed Out
• Error
Cluster customization options
• Via Azure portal: Hive/Oozie Metastore, Storage accounts & VNETs, ScriptAction
• Via scripting / SDK: Config values, JAR file placement in cluster
• Ad hoc: RDP to cluster, update config files (non-durable)
23. Cluster integration options
Each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL
/thrift – ODBC & JDBC
/Templeton – Job Submission, Metadata management
/ambari – Cluster health, monitoring
/oozie – Job orchestration, scheduling
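As a minimal sketch of what "basic authN over SSL" against one of these endpoints looks like, the snippet below builds (but does not send) an HTTPS request with a Basic Authorization header. Several things here are assumptions not stated on the slide: the host pattern `<cluster>.azurehdinsight.net`, the WebHCat status path `/templeton/v1/status`, the cluster name `mycluster`, and the credentials are all placeholders, and the helper name `build_request` is hypothetical.

```python
import base64
import urllib.request


def build_request(cluster: str, path: str, user: str, password: str) -> urllib.request.Request:
    """Build an HTTPS request carrying HTTP Basic credentials.

    The host pattern below is an assumption for illustration; check the
    cluster's actual gateway address before using it.
    """
    url = f"https://{cluster}.azurehdinsight.net{path}"
    # Basic auth is base64("user:password"); it is only safe over SSL/TLS.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder cluster name and credentials; nothing is sent over the network here.
req = build_request("mycluster", "/templeton/v1/status", "admin", "secret")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would actually issue the call; the same header construction would apply to the `/ambari` and `/oozie` paths listed above.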
26. Introducing Cortana Intelligence Suite
Data → Intelligence → Action
Data Sources
• Apps
• Sensors and devices
Information Management
• Event Hubs
• Data Catalog
• Data Factory
Big Data Stores
• SQL Data Warehouse
• Data Lake Store
Machine Learning and Analytics
• HDInsight (Hadoop and Spark)
• Stream Analytics
• Data Lake Analytics
• Machine Learning
Intelligence
• Cortana
• Bot Framework
• Cognitive Services
Dashboards & Visualizations
• Power BI
Action
• People
• Automated Systems
• Apps (Web, Mobile, Bots)
27. Where Big Data is a cornerstone
Data → Intelligence → Action
Data Sources
• Apps
• Sensors and devices
Information Management
• Event Hubs
• Data Catalog
• Data Factory
Big Data Stores
• SQL Data Warehouse
• Data Lake Store
Machine Learning and Analytics
• HDInsight (Hadoop and Spark)
• Stream Analytics
• Data Lake Analytics
• Machine Learning
Intelligence
• Cortana
• Bot Framework
• Cognitive Services
Dashboards & Visualizations
• Power BI
Action
• People
• Automated Systems
• Apps (Web, Mobile, Bots)
28. [Architecture diagram]
Stage labels: Data Generation, Streaming, Storage, Processing, Consumption
Tier labels: Operational, Analytical/Exploratory
Components shown:
• Big Data Sources (Raw Unstructured), Log files, Transactional systems, Azure Website
• Azure Event Hubs, Kafka for Azure HDInsight (future), Storm for Azure HDInsight, Azure Stream Analytics, Spark Streaming for Azure HDInsight
• Azure Data Lake Store, HBase on Azure HDInsight
• HIVE (HiveQL), Pig, Sqoop, Mahout, Spark SQL, Spark MLlib
• Azure Data Lake Analytics (U-SQL)
• Azure Machine Learning, R Server, SQL Server R Services
• Data Orchestration/Workflow: Azure Data Factory, Oozie for Azure HDInsight
• ETL, SQL Server Integration Services
• Data Warehouse: Azure SQL DW, SQL Server APS
• Excel BI, Power BI, SSRS, SSAS, SharePoint BI