Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.
Presented @ MSDEVMTL on Saturday February , 2015
2. Who am I?
My name is Stéphane Fréchette
SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data
| NoSQL | Data Science. Drums, good food and fine wine.
Founder @TEDxGatineau
I have a passion for architecting, designing and building solutions that
matter.
Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
3. Topics
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Microsoft Azure HDInsight
• Demos
• Summary
• Resources
• Q&A
4. “Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process
data within a tolerable elapsed time…”
- Wikipedia
8. Hadoop
• Apache Hadoop is for big data
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage
10. HDFS
• Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable and reliable data storage that is designed to span large
clusters of commodity servers.
HDFS ≠ Database
11. MapReduce
• MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner.
Processing function:
- Mapping
- Reducing
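As a rough illustration of the two processing functions above, the following Python sketch runs a word count entirely in memory. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducing: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop is for big data"]))
```

Because each call to map_phase looks at a single line and each call to reduce_phase looks at a single key, both steps can be spread across machines without coordination, which is the property the framework relies on.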
15. HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop
solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
18. Now what?
Working with your HDInsight cluster - running jobs, import/export data,
viewing and consuming data…
• .NET
• Java
• Pig
• Hive
• Sqoop
• Excel
• Others
19. What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools
http://hive.apache.org
20. What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)
• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles Pig Latin scripts into MapReduce jobs on the fly
http://pig.apache.org
21. What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores
http://sqoop.apache.org
Key attributes:
Open source
Highly scalable
Runs on commodity hardware
Redundant and reliable (no data loss)
Batch-processing centric, using the “MapReduce” processing paradigm
HDFS can replicate the data to multiple nodes, and it uses a NameNode daemon to track where each piece of data is stored and how it is (or isn't) replicated.
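The bookkeeping described above can be pictured with a toy Python model. This is only a sketch under stated assumptions: the round-robin placement is a simplification chosen for illustration (real HDFS placement is rack-aware), the function and node names are hypothetical, and the default of three replicas mirrors the usual HDFS default.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Return {block_id: [node, ...]}, the kind of map a name node keeps.

    Assumption: simple round-robin placement, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def under_replicated(placement, live_nodes, replication=3):
    """List the blocks that have fewer live replicas than requested."""
    return [
        block_id
        for block_id, holders in placement.items()
        if sum(node in live_nodes for node in holders) < replication
    ]


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", 4)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
    print(placement)
    # Simulate losing node n4 and ask which blocks need a new replica.
    print(under_replicated(placement, {"n1", "n2", "n3"}))
```

Losing one node leaves every block readable from its remaining replicas, and the tracking map tells the system which blocks to copy again.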
HDFS allows data to be split across multiple systems, which solves one problem in a large-scale data environment. But moving the data into various places creates another problem. How do you move the computing function to where the data is?
Along comes MapReduce…
The HDInsight service can access two types of storage: HDFS (as in standard Hadoop) and the Azure Storage system. When you store your data using HDFS, it is held on the nodes of the cluster and must be accessed through the HDFS API. When the cluster is decommissioned, the data is lost as well. Using Azure Storage instead provides several advantages: you can load the data using standard tools, the data is retained when you decommission the cluster, storage costs less, and other processes in Azure or even from other cloud providers can access the data.
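One practical consequence is that the two storage types are addressed with different URI schemes. The sketch below builds both forms in Python; it assumes the wasb:// scheme that HDInsight uses for Azure Blob storage, and the container, account and path names are placeholders rather than values from this talk.

```python
def hdfs_uri(path):
    """Address a file on the cluster's own HDFS (default name node)."""
    return "hdfs:///" + path.lstrip("/")


def wasb_uri(container, account, path, secure=False):
    """Address a file in Azure Blob storage as an HDInsight cluster sees it.

    Assumption: the wasb/wasbs scheme; all argument values are placeholders.
    """
    scheme = "wasbs" if secure else "wasb"
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, path.lstrip("/")
    )


if __name__ == "__main__":
    print(hdfs_uri("/example/data/sample.log"))
    print(wasb_uri("mycontainer", "myaccount", "/example/data/sample.log"))
```

Data addressed through the second form lives outside the cluster, which is why it survives when the cluster is deleted.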