Presented by: Jon Bloom
Senior Consultant, Agile Bay, Inc.
Jon Bloom
Blog: http://www.bloomconsultingbi.com
Twitter: @sqljon
Email: jbloom@AgileBay.com
Certification: Microsoft Certified Solutions Associate (MCSA), SQL Server 2012
Customers & Partners
www.agilebay.com
Session Agenda
 What is Hadoop?
 Version
 1.0
 2.0
 Demo:
Hadoop
 Apache Foundation
 Open Source
 Batch Processing
 Parallel, Reliable, Scalable
 Distributed Storage (3 copies of each block)
 Commodity Hardware
 Large Unstructured Data Sets
 Eventually Consistent
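To make the "3 copies" bullet concrete, here is a minimal sketch of how replication multiplies raw storage. The 64 MB block size and the function name are assumptions for illustration (block size is configurable per cluster); only the replication factor of 3 comes from the slide.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw MB stored across the cluster).

    block_size_mb is an assumed value; replication=3 matches the slide.
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different data nodes.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)
```

Under these assumptions, planning cluster capacity means budgeting roughly three times the size of the data you intend to keep.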
What is Hadoop
 Ecosystem
 Comprised of multiple Projects
• MapReduce
• Hive
• Pig
• Sqoop
• Oozie
• Flume
• ZooKeeper
• Tez
• Mahout
• HBase
• Ambari
• Impala
Hadoop v1.0
 2004 Yahoo
 Doug Cutting (Cloudera)
• MapReduce
• Written 100% in Java
• Mappers
• Splits Rows into Chunks
• Reducers
• Aggregates the Chunks
• HDFS
• Distributed File System
• Java code is complex
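The mapper and reducer roles above can be sketched in a few lines. This is a single-process illustration of the idea using the classic word count, not Hadoop's Java API; the function names are hypothetical, and on a real cluster the framework runs many mappers and reducers in parallel and performs the grouping step itself.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Aggregate all values collected for one key."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```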
Reason for Hadoop
 Data gets ingested into HDFS
 Java MapReduce Jobs run
 Parse out the Data
 Creates Output files
 Jobs can be re-run against Output files
 Run algorithms
 Handle Large, Complex Data Sets
 Look for “Insights”
 Raw Data (CSV, TXT, Binary, XML)
Name Nodes
 The “Brains” of Hadoop
 “Master” Server
 Single Point of Failure
Data Nodes
 “Slaves”
 Commodity Hardware
 Stores the Actual Data
 Runs Java Jar Files
Secondary NameNode
 Checkpoints the NameNode metadata
 Not a hot standby for the NameNode
 Typically on own Server
Job Tracker
 Keeps track of “Jobs” Resources
 Heartbeat
Task Tracker
 Keeps track of “Task” Resources
 Communicates with Job Tracker
Ingest Data
 With Hadoop, the first questions are about
data: how to get it into HDFS and how to
get it back out. Luckily, the Hadoop
ecosystem has some popular tools for
both.
SQOOP
 Sqoop was created to move data back and forth
easily between an external database or flat file and
HDFS or Hive, using standard import and export
commands. When data is imported into HDFS, Sqoop
creates files in HDFS directories, which can be
partitioned in a variety of ways. Sqoop jobs can
append data to the existing files, and you can add a
WHERE clause to pull just certain data. For example,
bring in only yesterday's data and run the Sqoop job
daily to populate Hadoop.
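A daily import with a WHERE clause could be assembled roughly as follows. This sketch only builds the command line and does not run it; the JDBC URL, table, column and directory names are hypothetical, and credentials are left out. The flags used (--connect, --table, --where, --target-dir, --append) are standard Sqoop import arguments.

```python
import shlex
from datetime import date, timedelta

def build_sqoop_import(jdbc_url, table, target_dir, date_column, run_date):
    """Build the argument list for a `sqoop import` of yesterday's rows."""
    yesterday = run_date - timedelta(days=1)
    where = f"{date_column} = '{yesterday.isoformat()}'"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # source database
        "--table", table,            # source table
        "--where", where,            # restrict to yesterday's rows
        "--target-dir", target_dir,  # HDFS directory to write into
        "--append",                  # add to the files already there
    ]

# All values below are hypothetical placeholders.
args = build_sqoop_import(
    "jdbc:mysql://dbhost/sales", "orders", "/data/orders",
    "order_date", date(2014, 6, 2))
print(shlex.join(args))
```

Scheduling this once a day (for example from Oozie or cron) gives the incremental load described above.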
Hive
 Once data is in HDFS, you can add a layer of
Hive on top, which structures the data into a
relational format so it can be queried with Hive
SQL. When creating a table in the Hive database
schema, you can make it an External table, which
is basically a pass-through metadata layer that
points to the actual data. If you drop the
External table, the data remains intact.
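The External table behaviour can be illustrated with a toy model. Plain dictionaries stand in for HDFS and the Hive metastore here; none of this is Hive's API, and the paths, table and column names are made up for the example.

```python
# Toy model only: dictionaries stand in for HDFS and the Hive metastore.
hdfs = {"/data/orders/part-00000": ["1,widget,9.99", "2,gadget,24.50"]}
metastore = {}

def create_external_table(name, location, columns):
    """Register metadata that points at existing files; no data is copied."""
    metastore[name] = {"location": location, "columns": columns}

def select_all(name):
    """Apply the table's column names to the raw lines at read time."""
    table = metastore[name]
    rows = []
    for path, lines in hdfs.items():
        if path.startswith(table["location"]):
            for line in lines:
                rows.append(dict(zip(table["columns"], line.split(","))))
    return rows

def drop_table(name):
    """Dropping an external table removes only its metadata."""
    del metastore[name]

create_external_table("orders", "/data/orders/", ["id", "item", "price"])
rows = select_all("orders")
drop_table("orders")
print(rows[0], len(hdfs))
```

After the drop, the table definition is gone but the file under /data/orders/ is still present, which is the point of the External table.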
PIG
 In addition, you can use a Hadoop language
called Pig (not making this up) to massage
the data through a structured series of steps, a
form of ETL.
MapReduce
 HIVE and PIG allow easier access to the data
 However, they still get translated to M/R
ODBC
 Through Hive SQL, the tables are exposed over
ODBC so the data can be accessed by
reports, databases, ETL tools, etc.
As the basic description above shows, you can
move data back and forth easily between Hadoop
and your relational database (or flat files).
Connect to Data
 Once data is stored in the Hybrid Data Warehouse
(HDW), users can consume it via Hive ODBC,
Microsoft Power BI, Tableau, QlikView,
SAP HANA or a variety of other tools sitting
on top of the data layer, including self-service
tools.
HCatalog
 When developing, users sometimes don't know
where data is stored, and the data can be
stored in a variety of formats, because
Hive, Pig and MapReduce can each have separate
data models. HCatalog was created to
alleviate some of that frustration. It's a table
abstraction layer, metadata service and
shared schema for Pig, Hive and MapReduce that
exposes information about the data to applications.
HBase
 HBase is a separate database that allows
random read/write access to data in HDFS,
and it runs on the same cluster as HDFS.
Data can be ingested into HBase and
interpreted on read (schema on read), whereas
relational databases enforce a schema on write.
Accumulo
 A high-performance data storage and
retrieval system with cell-level access
control, similar to Google’s “Bigtable”
design.
OOZIE
 A Java Web application used to schedule
Hadoop jobs. Combines multiple jobs
sequentially into one logical unit of work.
Flume
 Distributed, reliable and available service for
efficiently collecting, aggregating and
moving large amounts of streaming data
into HDFS (fault tolerant).
Solr
 Open Source platform for searches of data
stored in HDFS Hadoop including full text
search and near real time indexing.
Streaming
 You can also receive streaming data.
HUE
 Open Source Web Interface
 Aggregates most common components into
single web interface
 View HDFS File Structure
 Simplify user experience
WebHDFS
 A REST API
 Interface to expose complete File System
 Provides Read & Write access
 Supports all HDFS parameters
 Allows remote access via many languages
 Uses Kerberos for Authentication
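WebHDFS requests are plain HTTP calls against paths under /webhdfs/v1/ with an op query parameter, so any language with an HTTP client can use them. The sketch below only builds such a URL; the host name is a placeholder, port 50070 is an assumed NameNode HTTP port that varies by version and configuration, and authentication is left out.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Hypothetical host; LISTSTATUS lists the contents of a directory.
url = webhdfs_url("namenode.example.com", 50070, "/data/orders", "LISTSTATUS")
print(url)
```

The same helper covers other operations, such as OPEN for reading a file, by changing the op argument and passing any extra parameters as keywords.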
Monitor
 ZooKeeper is a centralized, high-performance
coordination service that keeps track of
configuration and state for distributed
applications.
Machine Learning
 In addition, you can apply Mahout
machine learning algorithms to your
Hadoop cluster for clustering, classification
and collaborative filtering. You can also
run statistical analysis with R, for example
through the Revolution Analytics R
integration for Hadoop.
Machine Learning
 Clustering
 Similarities between data points in Clusters
 Classification
 Learns from existing categories to assign
unassigned categories
 User Based Recommendations
 Predict future behavior based on user
preferences and behavior
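User-based recommendation can be sketched on a tiny in-memory data set. This is plain Python rather than Mahout's API, and the users, items and ratings are invented for the example: items a user has not rated are scored by the ratings of other users, weighted by how similar those users are.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"hive": 5, "pig": 3, "hbase": 4},
    "bob": {"hive": 5, "pig": 3, "spark": 5},
    "cat": {"pig": 5, "flume": 4},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Rank unseen items by similarity-weighted ratings from other users."""
    scores = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], their_ratings)
        for item, rating in their_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann"))
```

Here "ann" shares more preferences with "bob" than with "cat", so the item only "bob" rated ranks first.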
Hadoop 2.0
 With Hadoop 2.0 comes YARN, a new layer that sits
between HDFS and the application layers. MapReduce
was originally the sole, batch-oriented way to process
data in HDFS; it no longer is. Hive SQL has been sped
up by Impala, which bypasses MapReduce completely,
and by the Stinger initiative, which runs Hive on Tez
and stores data in a compressed columnar format to
speed up interactive queries.
YARN
 Allows the separation of MapReduce layers
of Service and Framework
 Resource Manager
 Application Manager
 Node Manager
 Containers
 Separates Resources
YARN
 Traditional MapReduce
 Expensive
 Original M/R spawned many processes
 Wrote intermediate data to Disk
 Sort / Shuffle
 Now we have Applications
 M/R, Tez, Giraph, Spark, Storm, etc.
 Compiled down to a lower level
 Single Strand w/ More Complexity
Tez
 Generalized data flow programming
framework, built on Hadoop YARN for batch
and interactive use cases, such as Pig, HIVE
and other frameworks. It has the potential
to replace the MapReduce execution engine.
Impala
 Cloudera Impala is a massively parallel
processing (MPP) SQL query engine that
runs natively in Hadoop.
 Allows data querying without the need for
data movement or transformation
 It by-passes MapReduce
Graph
 Giraph gives Hadoop the ability
to process graph connections between
nodes.
Ambari
 Ambari allows Hadoop Cluster
administration and has an API layer for 3rd
party tools to hook into.
Spark
 Spark provides a simple and expressive
programming model that supports ETL,
Machine Learning, stream processing and
graph computation.
Knox
 Provides a single point of authentication
and access to Hadoop services. It serves both
Hadoop users, who access the cluster data
and execute jobs, and operators, who control
access and manage the cluster.
Falcon
 Framework for simplifying data management
and pipeline processing in Hadoop. Enables
users to automate the movement and
processing of datasets for ingest, pipelines,
disaster recovery and data retention use cases.
It simplifies data management by removing
complex coding (out of the box).
More Apache
Projects
 Apache Kafka
 Next Generation Distributed Messaging
System
 Apache Avro
 Data Serialization System
 Apache Chukwa
 Data Collection System for Monitoring large
distributed systems
Cloud
 You can run your Hybrid Data Warehouse in
the Cloud with Microsoft Azure (Blob storage,
HDInsight) or Amazon Web Services.
On Premise
 You can run On Premise with IBM
InfoSphere BigInsights, Cloudera,
Hortonworks and MapR.
Hybrid Data
Warehouse
 You can build a Hybrid Data Warehouse.
Data Warehousing is a concept: a
documented framework to follow, with
guidelines and rules. Storing the data
in both Hadoop and relational databases is
typically known as a Hybrid Data
Warehouse.
BI vs. Hadoop
 Hadoop not a replacement of BI
 Extends BI capabilities
 BI = Scale up to 100s of Gigabytes
 Hadoop = From 100s of Gigabytes to Terabytes
(1,000s of Gigabytes) and Petabytes (1,000,000
Gigabytes)
Hadoop
Where’s Hadoop
Headed?
 Transactional Data?
 More Real Time?
 Integrate with Traditional Data Warehouses?
 Hadoop for the Masses?
 Artificial Intelligence?
 Turing Test
 Neural Networks
 Internet of Things
Basic Hadoop
Blog: www.bloomconsultingbi.com
Twitter: @sqljon
LinkedIn: http://www.linkedin.com/in/BloomConsultingBI
Email: jbloom@agilebay.com

Intro to Hadoop

  • 1. Presented by: Jon Bloom Senior Consultant, Agile Bay, Inc.
  • 2. Jon Bloom Blog: http://www.bloomconsultingbi.com Twitter: @sqljon Email: jbloom@AgileBay.com Certification: Microsoft Certified Solutions Associate MCSA SQL Server 2012 Customers & Partners
  • 3. w w w . a g i l e b a y. c o m
  • 4. Session Agenda  What is Hadoop?  Version  1.0  2.0  Demo:
  • 5.
  • 6. Hadoop  Apache Foundation  Open Source  Batch Processing  Parallel, Reliable, Scalable  Distributed Stores 3 copies  Commodity Hardware  Large Unstructured Data Sets  Eventually Consistent
  • 7. What is Hadoop  Ecosystem  Comprised of multiple Projects • MapReduce • Hive • Pig • Scoop • Oozie • Flume • ZooKeeper • Tez • Mahout • HBase • Ambari • Impala
  • 8. Hadoop v1.0  2004 Yahoo  Doug Cutting (Cloudera) • MapReduce • Written 100% in Java • Mappers • Splits Rows into Chunks • Reducers • Aggregates the Chunks • HDFS • Distributed File System • Java code is complex
  • 9. Reason for Hadoop  Data gets ingested into HDFS  Java MapReduce Jobs run  Parse out the Data  Creates Output files  Jobs can be re-run against Output files  Run algorithms  Handle Large, Complex Data Sets  Look for “Insights”  Raw Data (CSV, TXT, Binary, XML)
  • 10. Name Nodes  The “Brains” of Hadoop  “Master” Server  Single Point of Failure
  • 11. Data Nodes  “Slaves”  Commodity Hardware  Stores the Actual Data  Runs Java Jar Files
  • 12. Secondary Name Nodes  Redundant Server  Serves as Backup  Typically on own Server
  • 13. Job Tracker  Keeps track of “Jobs” Resources  Heartbeat
  • 14. Task Tracker  Keeps track of “Task” Resources  Communicates with Job Tracker
  • 15. Ingest Data  When thinking about Hadoop, we think of data. How to get data into HDFS and how to get data out of HDFS. Luckily, Hadoop has some popular processes to accomplish this.
  • 16. SQOOP  SQOOP was created to move data easily back and forth between an external database or flat file and HDFS or HIVE. There are standard commands for importing and exporting data. When data is moved to HDFS, SQOOP creates files in the HDFS folder system, and those folders can be partitioned in a variety of ways. Data can be appended to the files through SQOOP jobs. You can also add a WHERE clause to pull only certain data: for example, bring in just yesterday's data and run the SQOOP job daily to populate Hadoop.
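As a rough sketch of the daily incremental import described on this slide, the helper below assembles a `sqoop import` command line that filters on yesterday's date and appends to an existing HDFS directory. The function name, JDBC URL, table, column and directory are all hypothetical; only the Sqoop options (`--connect`, `--table`, `--where`, `--target-dir`, `--append`) are taken from the Sqoop 1 command line. The command is only built here, not executed.

```python
from datetime import date, timedelta


def build_sqoop_import(jdbc_url, table, target_dir, date_column, run_date):
    """Return the argument list for a Sqoop import of the previous day's rows."""
    yesterday = run_date - timedelta(days=1)
    # Restrict the import to rows stamped with yesterday's date.
    where = f"{date_column} = '{yesterday.isoformat()}'"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the source database (assumed)
        "--table", table,            # source table to import
        "--where", where,            # pull only yesterday's rows
        "--target-dir", target_dir,  # HDFS directory receiving the files
        "--append",                  # add to files already in the directory
    ]


if __name__ == "__main__":
    command = build_sqoop_import(
        "jdbc:mysql://dbhost.example.com/sales",  # hypothetical database
        "orders",
        "/data/orders",
        "order_date",
        date.today(),
    )
    print(" ".join(command))
```

On a machine with Sqoop installed, the resulting list could be handed to `subprocess.run` from a daily scheduler; credentials are left out of this sketch.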
  • 17. Hive  Once data has been moved to Hadoop HDFS, you can add a layer of HIVE on top, which structures the data into a relational format. Once applied, the data can be queried with HIVE SQL. When creating a table in the HIVE database schema, you can create an External table, which is basically a pass-through metadata layer that points to the actual data. So if you drop the External table, the data remains intact.
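To make the External table idea concrete, here is a small sketch that generates a HiveQL `CREATE EXTERNAL TABLE` statement pointing at files already sitting in HDFS. The helper name, table, columns and path are hypothetical, and it assumes comma-delimited text files; the statement is only produced as a string and would still need to be run through a Hive client.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL statement for an external table over delimited text files."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        # LOCATION points at the existing HDFS folder; dropping the table
        # removes only this metadata, not the files stored there.
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(external_table_ddl(
        "orders",                                       # hypothetical table
        [("order_id", "INT"), ("order_date", "STRING")],
        "/data/orders",                                 # hypothetical HDFS path
    ))
```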
  • 18. PIG  In addition, you can use a Hadoop language called PIG (not making this up) to massage the data into a structured form through a series of steps, a form of ETL.
  • 19. MapReduce  HIVE and PIG allow easier access to the data  However, they still get translated to M/R
  • 20. ODBC  From HIVE SQL, the tables are exposed to ODBC to allow data to be accessed via Reports, Databases, ETL, etc. So, as you can see from the basic description above, you can move data back and forth easily between Hadoop and your Relational Database (or flat files).
  • 21. Connect to Data  Once data is stored in HDW, it can be consumed by users via HIVE ODBC or Microsoft PowerBI, Tableau, Qlikview or SAP HANA or a variety of other tools sitting on top of the data layer, including Self Service tools.
  • 22. HCatalog  Sometimes when developing, users don't know where data is stored. And sometimes the data can be stored in a variety of formats, because HIVE, PIG and Map Reduce can have separate data model types. So HCatalog was created to alleviate some of the frustration. It's a table abstraction layer, meta data service and a shared schema for Pig, Hive and M/R. It exposes info about the data to applications.
  • 23. HBase  HBase is a separate database that allows random read/write access to the HDFS data, and surprisingly it too sits within the HDFS cluster. Data can be ingested into HBase and interpreted On Read, which Relational Databases do not offer.
  • 24. Accumulo  A High performance Data Storage and retrieval system with cell-level access control, similar to Google’s “Big Table” design.
  • 25. OOZIE  A Java Web application used to schedule Hadoop jobs. Combines multiple jobs sequentially into one logical unit of work.
  • 26. Flume  Distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data into HDFS (fault tolerant).
  • 27. Solr  Open Source platform for searches of data stored in HDFS Hadoop including full text search and near real time indexing.
  • 28. Streaming  And you can receive Streaming Data.
  • 29. HUE  Open Source Web Interface  Aggregates most common components into single web interface  View HDFS File Structure  Simplify user experience
  • 30. WebHDFS  A REST API  Interface to expose complete File System  Provides Read & Write access  Supports all HDFS parameters  Allows remote access via many languages  Uses Kerberos for Authentication
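Because WebHDFS is a REST API, each file system operation is just an HTTP request against a URL of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs; the helper name, host name and path are hypothetical, port 50070 is assumed as the NameNode HTTP port used by Hadoop 1.x and 2.x, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for one file system operation."""
    query = urlencode({"op": op, **(params or {})})
    # quote() escapes unsafe characters in the HDFS path but keeps the slashes.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # List a directory, then open a file as a given user (no Kerberos here).
    print(webhdfs_url("namenode.example.com", 50070, "/data/orders", "LISTSTATUS"))
    print(webhdfs_url("namenode.example.com", 50070, "/data/orders/part-00000",
                      "OPEN", {"user.name": "jon"}))
```

Any language with an HTTP client can issue these requests, which is what the slide means by remote access via many languages. On a Kerberos-secured cluster the client would also need to authenticate, which this sketch does not cover.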
  • 31. Monitor  There's Zookeeper which is a centralized service to keep track of things. A high performance coordination service for distributed applications.
  • 32. Machine Learning  In addition, you could apply MAHOUT Machine Learning algorithms to your Hadoop cluster for Clustering, Classification and Collaborative Filtering. And you can run statistical analysis with the R language, using the Revolution Analytics version of R for Hadoop.
  • 33. Machine Learning  Clustering  Similarities between data points in Clusters  Classification  Learns from existing categories to assign unassigned categories  User Based Recommendations  Predict future behavior based on user preferences and behavior
  • 34. Hadoop 2.0  And with the latest Hadoop 2.0, there's the addition of YARN, a new layer that sits between HDFS2 and the application layers. Although MapReduce was originally designed as the sole batch-oriented approach to getting data from HDFS, it's no longer the only way. HIVE SQL has been sped up through Impala, which completely bypasses MapReduce, and through the Stinger initiative, which sits atop Tez. Tez has the ability to compress data with column stores, which allows the interaction to be sped up.
  • 35. YARN  Allows the separation of MapReduce layers of Service and Framework  Resource Manager  Application Manager  Node Manager  Containers  Separates Resources
  • 36. YARN  Traditional MapReduce  Expensive  Original M/R spawned many processes  Wrote intermediate data to Disk  Sort / Shuffle  Now we have Applications  M/R, Tez, Giraph, Spark, Storm, etc.  Compiled down to a lower level  Single Strand w/ More Complexity
  • 37. Tez  Generalized data flow programming framework, built on Hadoop YARN for batch and interactive use cases, such as Pig, HIVE and other frameworks. It has the potential to replace the MapReduce execution engine.
  • 38. Impala  Cloudera Impala is a massively parallel processing (MPP) SQL query engine that runs natively in Hadoop.  Allows data querying without the need for data movement or transformation  It bypasses MapReduce
  • 39. Graph  And Giraph, which gives Hadoop the ability to process Graph connections between nodes.
  • 40. Ambari  Ambari allows Hadoop Cluster administration and has an API layer for 3rd party tools to hook into.
  • 41. Spark  And Spark, provides a simple and expressive programming model that supports ETL, Machine Learning, stream processing and graph computation.
  • 42. Knox  Provides a single point of authentication and access to Hadoop services, both for users who access the cluster data and execute jobs and for operators who control access and manage the cluster.
  • 43. Falcon  Framework for simplifying data management and pipeline processing in Hadoop. Enables users to automate the movement and processing of datasets for ingest, pipelines, disaster recovery and data retention use cases. It simplifies data management by removing complex coding (out of the box).
  • 44. More Apache Projects  Apache Kafka  Next Generation Distributed Messaging System  Apache Avro  Data Serialization System  Apache Chukwa  Data Collection System for Monitoring large distributed systems
  • 45. Cloud  You can run your Hybrid Data Warehouse in the Cloud with Microsoft Azure Blob Storage and HDInsight or Amazon Web Services.
  • 46. On Premise  You can run On Premise with IBM Infosphere BigInsights, Cloudera, Hortonworks and MapR.
  • 47. Hybrid Data Warehouse  You can build a Hybrid Data Warehouse. Data Warehousing is a concept: a documented framework to follow, with guidelines and rules. Storing the data in both Hadoop and Relational Databases is typically known as a Hybrid Data Warehouse.
  • 48. BI vs. Hadoop  Hadoop is not a replacement for BI  Extends BI capabilities  BI = Scale up to 100s of Gigabytes  Hadoop = From 100s of Gigabytes to Terabytes (1,000s of Gigabytes) and Petabytes (1,000,000 Gigabytes)
  • 50. Where’s Hadoop Headed?  Transactional Data?  More Real Time?  Integrate with Traditional Data Warehouses?  Hadoop for the Masses?  Artificial Intelligence?  Turing Test  Neural Networks  Internet of Things
  • 52. Blog: www.bloomconsultingbi.com Twitter: @sqljon Linked-in: http://www.linkedin.com/in/BloomConsultingBI Email: jbloom@agilebay.com