Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | ttr@hmbnet.com | 614.432.9492
“Torture the data, and it will confess to anything.”
– Ronald Coase, economist and Nobel Prize laureate
"The goal is to turn data into information, and information
into insight." – Carly Fiorina
"Data are becoming the new raw material of business." – Craig Mundie,
Senior Advisor to the CEO at Microsoft.
“In God we trust. All others must bring data.” – W. Edwards
Deming, statistician, professor, author, lecturer, and consultant.
Agenda
• Data
• Big Data
• Hadoop
• Microsoft Azure HDInsight
• Hadoop Use Cases
• Demo
• Configure Hadoop Cluster / Azure Storage
• C# MapReduce
• Load and Analyze with Hive
• Use Pig script to analyze data
• Excel Power Query
Houston, we have a “data” problem.
• IDC estimates put the size of the “digital universe” at 40 zettabytes (ZB) by
2020, a 50-fold growth from the beginning of 2010.
• By 2020, emerging markets will supplant the developed world as the main
producer of the world’s data.
• This flood of data is coming from many sources.
• The New York Stock Exchange generates about one terabyte of trade data per day
• Facebook hosts approximately one petabyte of storage
• The Large Hadron Collider produces about 15 petabytes of data per year
• The Internet Archive stores around 2 petabytes of data, growing at a rate of 20
terabytes per month
• Mobile devices and social networks contribute to the exponential growth of
the data.
85% Unstructured, 15% Structured
• The data as we traditionally know it is structured.
• Structured data refers to information with a high degree of organization, such that its inclusion in a
relational database is seamless and it is readily searchable.
• Not all the data we collect conforms to a specific, pre-defined data model.
• It tends to be the human-generated, people-oriented content that does not fit neatly into
database tables.
• 85 percent of business-relevant information originates in unstructured form, primarily
text.
• Lack of structure makes compilation a time- and energy-consuming task.
• These data sets are so large and complex that they become difficult to process using on-hand
management tools or traditional data processing applications.
• This type of data is being generated by everything around us at all times.
• Every digital process and social media exchange produces it. Systems, sensors and mobile devices
transmit it.
Data Types
• Relational data – SQL data
• Semi-structured data – JSON
• Un-structured data – Twitter feed
• Un-structured data – Amazon review
So What is Big Data?
• Big Data is a popular term used to describe the exponential growth and
availability of data, both structured and unstructured.
• Capturing and managing lots of information; working with many new types of data.
• Exploiting these masses of information and new types of data in applications to
extract meaningful value from big data.
• The process of applying serious computing to seriously massive and often highly
complex sets of information.
• Big data is arriving from multiple sources at an alarming velocity, volume and
variety.
• More data leads to more accurate analyses. More accurate analyses may lead to
more confident decision making.
[Infographic: the 4 V’s of Big Data – Volume (scale of data), Variety (different forms of data), Velocity (analysis of data), Veracity (uncertainty of data) – surrounded by statistics:]
• The Hadron Collider generates 1 petabyte of data per year
• An estimated 100 terabytes of data per US company
• IDC estimates 40 zettabytes of data by 2020
• 500 million tweets per day
• 100 million videos – 600 years of video, 13 hours of video uploaded per minute
• 20 billion network connections by 2016
• The NY Stock Exchange generates 1 terabyte of trade data per day
• Poor data quality costs businesses $600 billion a year
• 30% of data collected by marketers is not usable for real-time decision making
• Poor data across business and government costs the US economy $3.1 trillion a year
• 1 in 3 leaders don’t trust the information they use to make decisions
• Facebook has 200 billion photos and 1 petabyte of storage
• An estimated 1.8 billion smartphones; 6 billion people have a cell phone
• Global healthcare data: 150 exabytes, growing by 2.4 exabytes per year
• 2.5 quintillion bytes of data are created each day
The 4 V’s of Big Data
Volume: We currently see exponential growth in data storage, because
data is now much more than text: there are videos, music and large
images on our social media channels. Terabyte- and petabyte-scale
storage systems are now very common for enterprises.
Velocity: Velocity describes the frequency at which data is
generated, captured and shared. Recent developments mean that
not only consumers but also businesses generate more data in
much shorter cycles.
Variety: Today’s data no longer fits into neat, easy-to-consume
structures. New types include content, geo-spatial data, hardware data
points, location-based data, log data, machine data, metrics, mobile
data, physical data points, process data, RFID, etc.
Veracity: This refers to the uncertainty of the data available.
Veracity isn’t just about data quality; it’s about data
understandability. Veracity affects the confidence we can place in the
data.
Big Data vs Traditional Data
             Traditional                  Big Data
Data Size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Data Storage
• The storage capacity of hard drives has increased massively over the years.
• On the other hand, the access speeds of the drives have not kept up.
• A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s
• all the data could be read in about 5 minutes.
• Today one-terabyte drives are the norm, but the transfer rate is around 100 MB/s
• it takes more than two and a half hours to read all the data
• Writing is even slower
• The obvious way to reduce the time is to read from multiple disks at once.
• With 100 disks, each holding one hundredth of the data and working in parallel, we could read all the
data in under 2 minutes (see the sketch below).
• Move computation to the data rather than bringing the data to the computation.
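A quick sanity check of those read-time numbers, as a small illustrative C# calculation (the figures come straight from the bullets above):

```csharp
using System;

class DriveReadTimes
{
    static void Main()
    {
        // 1990 drive: 1,370 MB at 4.4 MB/s -> roughly 5 minutes to read it all
        Console.WriteLine("1990 drive: {0:F1} minutes", 1370 / 4.4 / 60);

        // Today: 1 TB (~1,000,000 MB) at 100 MB/s on a single disk -> ~2.8 hours
        double oneDiskSeconds = 1000000 / 100.0;
        Console.WriteLine("1 TB, one disk: {0:F1} hours", oneDiskSeconds / 3600);

        // The same 1 TB striped across 100 disks read in parallel -> under 2 minutes
        Console.WriteLine("1 TB, 100 disks: {0:F1} minutes", oneDiskSeconds / 100 / 60);
    }
}
```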
Why big data should matter to you
• The real issue is not that you are acquiring large amounts of data. It's what you do with
the data that counts. The hopeful vision is that organizations will be able to take data
from any source, harness relevant data and analyze it to find answers that enable
• cost reductions
• time reductions
• new product development and optimized offerings
• smarter business decision making.
• By combining big data and high-powered analytics, it is possible to:
• Determine root causes of failures, issues and defects in near-real time, potentially saving billions of
dollars annually.
• Send tailored recommendations to mobile devices while customers are in the right area to take
advantage of offers.
• Quickly identify customers who matter the most.
• Generate retail coupons at the point of sale based on the customer's current and past purchases.
OK, I’ve Got Big Data. Now What?
• The huge influx of data raises many challenges.
• Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the
goal of discovering useful information.
• To analyze and extract meaningful value from these massive amounts of
data, we need optimal processing power.
• We need parallel processing, which requires many pieces of hardware.
• When we use many pieces of hardware, the chance that one will fail is fairly high (see the
illustrative calculation below).
• The common way to avoid data loss is through replication:
• redundant copies of the data are kept.
• Data analysis tasks need to combine data:
• the data from one disk may need to be combined with data from 99 other disks.
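To see why a failure somewhere in the cluster is “fairly high”, here is a small illustrative calculation; the 1% per-machine daily failure rate is an assumed number for the example, not a figure from this deck:

```csharp
using System;

class FailureOdds
{
    static void Main()
    {
        double pFail = 0.01;   // assumed daily failure probability per machine
        int machines = 100;    // a modest cluster size

        // P(at least one machine fails) = 1 - P(no machine fails)
        double pAnyFailure = 1 - Math.Pow(1 - pFail, machines);
        Console.WriteLine("P(at least one of {0} machines fails today) = {1:P1}",
                          machines, pAnyFailure); // ~63.4%
    }
}
```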
Challenges of Big Data
• Information Growth
• Over 80% of the data in the enterprise is unstructured, and it is growing at a much
faster pace than traditional data
• Processing Power
• The approach of using a single, expensive, powerful computer to crunch information
doesn’t scale for Big Data
• Physical Storage
• Capturing and managing all this information can consume enormous resources
• Data Issues
• Lack of data mobility, proprietary formats and interoperability obstacles can make
working with Big Data complicated
• Costs
• Extract, transform and load (ETL) processes for Big Data can be expensive and
time-consuming
Hadoop
• Apache™ Hadoop® is an open source software project that enables the
distributed processing of large data sets across clusters of commodity
servers.
• It is designed to scale up from a single server to thousands of machines, with
a very high degree of fault tolerance.
• All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be automatically handled in software by the framework.
• Hadoop consists of the Hadoop Common package, which provides filesystem
and OS-level abstractions, a MapReduce engine and the Hadoop Distributed
File System (HDFS).
History of Hadoop
• Hadoop is not an acronym; it’s a made-up name.
• Named after the stuffed yellow elephant of Doug Cutting’s (the project creator’s) son.
• 2002–2004: Nutch Project – a web-scale, open source, crawler-based search
engine.
• 2003–2004: Google published the GFS (Google File System) & MapReduce papers
• 2005–2006: GFS- and MapReduce-style implementations were added to Nutch
• 2006–2008: Yahoo hired Doug Cutting and his team. They spun out the storage and
processing parts of Nutch to form Hadoop.
• 2009: Sorted 500 GB in 59 seconds (on 1,400 nodes) and 100 TB in 173
minutes (on 3,400 nodes)
Hadoop Modules
• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets.
• Other related modules:
• Cassandra - A scalable multi-master database with no single point of failure.
• HBase - A scalable, distributed database that supports structured data storage for large tables.
• Pig - A high-level data-flow language and execution framework for parallel computation.
• Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Zookeeper - A high-performance coordination service for distributed applications.
HDFS – Hadoop Distributed File System
• The heart of Hadoop is the HDFS.
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
• HDFS is designed on the following assumptions and goals:
• Hardware failure is the norm rather than the exception.
• HDFS is designed more for batch processing than for interactive use by users.
• Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes
to terabytes in size.
• HDFS applications use a write-once-read-many access model. A file, once created,
written and closed, need not be changed.
• A computation requested by an application is much more efficient if it is executed near
the data it operates on. In other words, moving computation is cheaper than moving
data.
• HDFS is easily portable from one platform to another.
Hadoop Architecture
• NameNode:
• The NameNode is the node that stores the filesystem metadata,
i.e. which file maps to which block locations and which blocks
are stored on which DataNode.
• Secondary NameNode:
• The NameNode is a single point of failure. The secondary NameNode periodically
checkpoints the NameNode’s metadata (merging the edit log into the filesystem
image), but it is not a hot standby.
• DataNode:
• The data node is where the actual data resides.
• All datanodes send a heartbeat message to the namenode
every 3 seconds to say that they are alive.
• The data nodes can talk to each other to rebalance data,
move and copy data around and keep the replication high.
• Job Tracker/Task Tracker:
• The primary function of the job tracker is resource
management (managing the task trackers), tracking resource
availability and task life cycle management (tracking its
progress, fault tolerance etc.)
• The task tracker has a simple function of following the
orders of the job tracker and updating the job tracker with
its progress status periodically.
[Diagram: HDFS architecture – clients send metadata operations to the NameNode (with a secondary NameNode alongside) and read data blocks (B1–B6) directly from DataNodes spread across Rack 1 and Rack 2; blocks are replicated across DataNodes.]
HDFS - InputSplit
• InputFormat
• Splits the input files into logical
chunks of type InputSplit, each of which is
assigned to a map task for processing.
• RecordReader
• A RecordReader uses the data within the
boundaries created by the input split to
generate key/value pairs (see the sketch below).
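As a rough sketch of the RecordReader idea for line-oriented text (not Hadoop’s actual implementation, which also handles records that straddle split boundaries): the key is the byte offset where each line starts within the split, and the value is the line itself.

```csharp
using System;
using System.Collections.Generic;

class LineRecordReaderSketch
{
    // Emits (byteOffset, lineText) pairs from the text of one split.
    static IEnumerable<KeyValuePair<long, string>> ReadRecords(string splitText)
    {
        long offset = 0;
        foreach (string line in splitText.Split('\n'))
        {
            yield return new KeyValuePair<long, string>(offset, line);
            offset += line.Length + 1; // +1 for the '\n' delimiter
        }
    }

    static void Main()
    {
        foreach (var record in ReadRecords("first line\nsecond line"))
            Console.WriteLine("({0}, {1})", record.Key, record.Value);
        // (0, first line)
        // (11, second line)
    }
}
```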
MapReduce
• Hadoop MapReduce is a software framework for easily
writing applications which process vast amounts of data
(multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• It is this programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster.
• The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform.
• The first is the map job, which takes a set of data and
converts it into another set of data, where individual
elements are broken down into tuples (key/value
pairs).
• The reduce job takes the output from a map as input
and combines those data tuples into a smaller set of
tuples.
[Diagram: Hadoop MapReduce – big data flows through the map and reduce stages to produce a result, e.g. the key/value pair (tattoo, 1).]
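As a minimal word-count illustration in C#, sketched against the Microsoft .NET SDK for Hadoop (the Microsoft.Hadoop.MapReduce NuGet package used with HDInsight at the time; treat the exact class and method names as assumptions rather than a definitive implementation):

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce; // NuGet: Microsoft.Hadoop.MapReduce

// Map: break each input line into words and emit (word, "1") for every word.
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (string word in inputLine.Split(' ', '\t'))
        {
            if (word.Length > 0)
                context.EmitKeyValue(word.ToLowerInvariant(), "1");
        }
    }
}

// Reduce: sum the 1s emitted for each word into a single count.
public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(v => long.Parse(v)).ToString());
    }
}
```

For example, the line “big data big insight” maps to (big, 1), (data, 1), (big, 1), (insight, 1) and reduces to (big, 2), (data, 1), (insight, 1).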
Hadoop Distributions
• Microsoft Azure HDInsight
• IBM InfoSphere BigInsights
• Hortonworks
• Amazon Elastic MapReduce
• Cloudera CDH
Hadoop Meets The Mainframe
• BMC
• Control-M for Hadoop is an extension of BMC’s larger Control-M product suite that was
born in 1987 as an automated mainframe job scheduler.
• Compuware
• APM is an application performance management suite that spans the arc of enterprise
data center computing, from the mainframe to distributed commodity servers.
• Syncsort
• Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms
including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform
data integration.
• Informatica
• HParser can run its data transformation services as a distributed application on Hadoop’s
MapReduce engine.
Azure HDInsight
• HDInsight makes Apache Hadoop available as a service in the cloud.
• Process, analyze, and gain new insights from big data using the power
of Apache Hadoop
• Drive decisions by analyzing unstructured data with Azure HDInsight,
a big data solution powered by Apache Hadoop.
• Build and run Hadoop clusters in minutes.
• Analyze results with Power Pivot and Power View in Excel.
• Choose your language, including Java and .NET. Query and transform
data through Hive.
Azure HDInsight
• Scale elastically on demand
• Crunch all data – structured, semi-structured, unstructured
• Develop in your favorite language
• No hardware to acquire or maintain
• Connect on-premises Hadoop clusters with the cloud
• Use Excel to visualize your Hadoop data
• Includes NoSQL transactional capabilities
HDInsight Ecosystem
[Diagram: ETL tools, BI tools and RDBMSs sit above Pig (data flow), Hive (SQL) and Sqoop, which run on MapReduce (job scheduling/execution) over HDFS (Hadoop Distributed File System).]
HDInsight
• The combination of Azure Storage and HDInsight provides a powerful
framework for running MapReduce jobs.
• Creating an HDInsight cluster is quick and easy: log in to Azure,
select the number of nodes, name the cluster, and set
permissions.
• The cluster is available on demand, and once a job is completed,
the cluster can be deleted but the data remains in Azure Storage.
• Use PowerShell to submit MapReduce jobs.
• Use C# to create MapReduce programs (a job-definition sketch follows below).
• Supports Pig Latin, Avro, Sqoop and more.
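Continuing the word-count sketch from the MapReduce section, here is how a job could be defined and submitted with the same SDK; the wasb:/// paths are hypothetical placeholders, and the exact API shape is an assumption based on the (since-archived) Microsoft .NET SDK for Hadoop:

```csharp
using Microsoft.Hadoop.MapReduce;

// Bundles the WordCountMapper and WordCountReducer with job settings.
public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration();
        config.InputPath = "wasb:///example/data/input";     // hypothetical path
        config.OutputFolder = "wasb:///example/data/output"; // hypothetical path
        return config;
    }
}

// Submission (shown for the local one-box case; the cluster overloads of
// Hadoop.Connect additionally take the cluster URI and storage credentials):
// var hadoop = Hadoop.Connect();
// hadoop.MapReduceJob.ExecuteJob<WordCountJob>();
```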
Use cases
• A 360 degree view of the customer
• Businesses want to know how to utilize social media postings to improve
revenue.
• Utilities: Predict power consumption
• Marketing: Sentiment analysis
• Customer service: Call monitoring
• Retail and marketing: Mobile data and location-based targeting
• Internet of Things (IoT)
• Big Data Service Refinery
Demo
• Configure HDInsight Cluster
• Create Mapper and Reducer programs using Visual Studio C#
• Upload Data to Blob Storage using Azure Storage Explorer
• Run Hadoop Job
• Export output to Power Query for Excel
• Hive Example with HDInsight
• Pig Script with HDInsight
Resources for HDInsight for Windows Azure
Microsoft: HDInsight
• Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure.
• Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation.
• Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure.
Microsoft: Windows and SQL Database
• Windows Azure home page - scenarios, free trial sign-up, and the development tools and documentation that you need to get started building applications.
• MSDN SQL - MSDN documentation for SQL Database
• Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
• Adventure Works for SQL Database - Download page for SQL Database sample database.
Microsoft: Business Intelligence
• Microsoft BI PowerPivot - a powerful data mashup and data exploration tool.
• SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights.
• SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise.
Apache Hadoop:
• Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers.
• HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• Map Reduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
Hortonworks:
• Sandbox - Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
About Me
Tomy Rhymond
Sr. Consultant, HMB, Inc.
ttr@hmbnet.com
http://tomyrhymond.wordpress.com
@trhymond
614.432.9492 (m)