A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
1. Hadoop Demystified
What is it? How does Microsoft fit in?
and… of course… some demos!
Presentation for ATL .NET User Group
(July, 2014)
Lester Martin
Page 1
2. Agenda
• Hadoop 101
– Fundamentally, What is Hadoop?
– How is it Different?
– History of Hadoop
• Components of the Hadoop Ecosystem
• MapReduce, Pig, and Hive Demos
– Word Count
– Open Georgia Dataset Analysis
Page 2
3. Connection before Content
• Lester Martin
• Hortonworks – Professional Services
• lmartin@hortonworks.com
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 3
4. What is Core Apache Hadoop?
© Hortonworks Inc. 2012
Scalable, Fault-Tolerant, Open Source Data Storage and Processing
• Flexibility to Store and Mine Any Type of Data
– Ask questions that were previously impossible to ask or solve
– Not bound by a single, fixed schema
• Excels at Processing Complex Data
– Scale-out architecture divides workloads across multiple nodes
– Eliminates ETL bottlenecks
• Scales Economically
– Deployed on "commodity" hardware
– Open source platform guards against vendor lock
[Diagram: Scale-Out Storage (HDFS) | Scale-Out Resource Mgt (YARN) | Scale-Out Processing (MapReduce)]
Page 7
5. The Need for Hadoop
• Store and use all types of data
• Process ALL the data; not just a sample
• Scalability to 1000s of nodes
• Commodity hardware
Page 5
6. Relational Database vs. Hadoop
Relational | Hadoop
• Schema: required on write | required on read
• Speed: reads are fast | writes are fast
• Governance: standards and structure | loosely structured
• Data processing: limited to none | processing coupled with data
• Data types: structured | multi- and unstructured
• Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store | data discovery, processing unstructured data, massive storage/processing
7. Fundamentally, a Simple Algorithm
1. Review stack of quarters
2. Count each year that ends
in an even number
Page 7
9. Distributed Algorithm – Map:Reduce
Page 9
Map (total number of quarters)
Reduce (sum each person's total)
10. A Brief History of Apache Hadoop
Page 10
(Timeline: 2004–2013)
• 2005: Hadoop created at Yahoo! – focus on INNOVATION
• 2006: Apache project established
• 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters – Yahoo! begins to operate at scale; focus on OPERATIONS and STABILITY
• 2011: Hortonworks created to focus on "Enterprise Hadoop"; starts with 24 key Hadoop engineers from Yahoo!
• 2012: Hortonworks Data Platform – Enterprise Hadoop
• 2013: focus on INNOVATION
12. HDP: Enterprise Hadoop Platform
Page 12
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
[Stack diagram: HORTONWORKS DATA PLATFORM (HDP)]
• HADOOP CORE: HDFS, YARN, MapReduce, Tez
• DATA SERVICES: Hive & HCatalog, Pig, HBase; Load & Extract via Sqoop, Flume, NFS, WebHDFS
• OPERATIONAL SERVICES: Oozie, Ambari, Falcon*, Knox*
• PLATFORM SERVICES: Enterprise Readiness – High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• Deployable on OS/VM, Cloud, or Appliance
15. Hive
• Data warehousing package built on top of Hadoop
• Bringing structure to unstructured data
• Query petabytes of data with HiveQL
• Schema on read
16. Hive: SQL-Like Interface to Hadoop
• Provides basic SQL functionality using MapReduce to execute queries
• Supports standard SQL clauses:
– INSERT INTO
– SELECT
– FROM … JOIN … ON
– WHERE
– GROUP BY
– HAVING
– ORDER BY
– LIMIT
• Supports basic DDL:
– CREATE/ALTER/DROP TABLE, DATABASE
Page 17
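As a sketch of how those clauses combine in HiveQL (the tables and columns here are hypothetical, purely to illustrate the syntax, not part of the demo):

```sql
-- Hypothetical tables used only to illustrate the clauses above
CREATE TABLE emp (name STRING, dept_id INT, salary FLOAT);
CREATE TABLE dept (id INT, dept_name STRING);

-- SELECT, FROM ... JOIN ... ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT
SELECT d.dept_name, COUNT(*) AS headcount
FROM emp e JOIN dept d ON (e.dept_id = d.id)
WHERE e.salary > 50000
GROUP BY d.dept_name
HAVING COUNT(*) >= 5
ORDER BY headcount DESC
LIMIT 10;
```

Hive compiles a query like this into one or more MapReduce jobs behind the scenes.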
17. Hortonworks Investment in Apache Hive
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
Page 18
Goals:
• Speed – improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale – the only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL – support the broadest range of SQL semantics for analytic applications running against Hadoop
Stinger Phase 1:
• Base optimizations
• SQL types
• SQL analytic functions
• ORCFile modern file format
Stinger Phase 2:
• SQL types
• SQL analytic functions
• Advanced optimizations
• Performance boosts via YARN
Stinger Phase 3:
• Hive on Apache Tez
• Query service (always on)
• Buffer cache
• Cost-based optimizer (Optiq)
…70% complete in 6 months…all IN Hadoop
18. Stinger: Enhancing SQL Semantics
Page 19
Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY/MAP/STRUCT/UNION, CHAR, VARCHAR, DATE
Hive SQL Semantics: SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; sub-queries in the FROM clause; CLUSTER BY, DISTRIBUTE BY; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); INTERSECT, EXCEPT, UNION DISTINCT; sub-queries in HAVING; sub-queries in WHERE (IN/NOT IN, EXISTS/NOT EXISTS)
(On the original slide, features are color-coded by the release that added them: Hive 0.10, 0.11, 0.12, and 0.13)
19. Pig
• Pig was created at Yahoo! to analyze data in HDFS without writing MapReduce code.
• Two components:
– SQL-like processing language called "Pig Latin"
– Pig execution engine producing MapReduce code
• Popular uses:
– ETL at scale (offloading)
– Text parsing and processing to Hive or HBase
– Aggregating data from multiple sources
20. Pig
Sample code to find dropped-call data:
FDR_4G_Data = LOAD '/archive/FDR_4G.txt' USING TextLoader();
Customer_Master = LOAD 'masterdb.customer_data' USING HCatLoader();
FDR_4G_Data_Full = JOIN FDR_4G_Data BY customerID, Customer_Master BY customerID;
X = FILTER FDR_4G_Data_Full BY State == 'call_dropped';
22. Powering the Modern Data Architecture
Page 23
HADOOP 1.0 – Single-Use System (Batch Apps)
• HDFS 1 (redundant, reliable storage)
• MapReduce (distributed data processing & cluster resource management)
• Data processing frameworks (Hive, Pig, Cascading, …)
HADOOP 2.0 – Multi-Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Interact with all data in multiple ways simultaneously
• Redundant, reliable storage: HDFS 2
• Cluster resource management: YARN
• Standard SQL processing: Hive
• Batch: MapReduce
• Interactive: Tez
• Online data processing: HBase, Accumulo
• Real-time stream processing: Storm
• … and others
23. Word Counting Time!!
Hadoop's "Hello Whirled" Example
A quick refresher of core elements of
Hadoop and then code walk-thrus with
Java MapReduce and Pig
Page 25
24. Core Hadoop Concepts
• Applications are written in high-level code
– Developers need not worry about network programming, temporal dependencies or low-level infrastructure
• Nodes talk to each other as little as possible
– Developers should not write code which communicates between nodes
– "Shared nothing" architecture
• Data is spread among machines in advance
– Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased availability and reliability
Page 26
25. Hadoop: Very High-Level Overview
• When data is loaded into the system, it is split into "blocks"
– Typically 64MB or 128MB
• Map tasks (the first part of MapReduce) work on relatively small portions of data
– Typically a single block
• A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible
– Many nodes work in parallel, each on their own part of the overall dataset
Page 27
26. Fault Tolerance
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
– Results from the first to finish will be used
– Known as "speculative execution"
Page 28
27. Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce
• Many other projects are based around core Hadoop (the "Ecosystem")
– Pig, Hive, HBase, Flume, Oozie, Sqoop, Datameer, etc.
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster
– Individual machines are known as nodes
– A cluster can have as few as one node, or as many as several thousand
– More nodes = better performance!
Page 29
28. Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
• Data is split into blocks and distributed across multiple nodes in the cluster
– Each block is typically 64MB (the default) or 128MB in size
• Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
Page 30
30. HDFS *is* a File System
• Screenshot of the "Name Node UI"
Page 32
31. Accessing HDFS
• Applications can read and write HDFS files directly via a Java API
• Typically, files are created on a local filesystem and must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved to a machine's local filesystem
• Access to HDFS from the command line is achieved with the hdfs dfs command
– Provides various shell-like commands as you find on Linux
– Replaces the hadoop fs command
• Graphical tools are available, like the Sandbox's Hue File Browser and Red Gate's HDFS Explorer
Page 33
32. hdfs dfs Examples
• Copy file fooLocal.txt from local disk to the user's home directory in HDFS
– This will copy the file to /user/username/fooHDFS.txt
• Get a directory listing of the user's home directory in HDFS
• Get a directory listing of the HDFS root directory
Page 34
hdfs dfs -put fooLocal.txt fooHDFS.txt
hdfs dfs -ls
hdfs dfs -ls /
33. hdfs dfs Examples (continued)
• Display the contents of a specific HDFS file
• Move that file back to the local disk
• Create a directory called input under the user's home directory
• Delete the HDFS directory input and all its contents
Page 35
hdfs dfs -cat /user/fred/fooHDFS.txt
hdfs dfs -get /user/fred/fooHDFS.txt barLocal.txt
hdfs dfs -mkdir input
hdfs dfs -rm -r input
34. Hadoop Components: MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
– Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the overall dataset
– Typically one HDFS block of data
• After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
– Source code examples and live demo coming!
Page 36
35. Features of MapReduce
• Hadoop attempts to run tasks on nodes which hold their portion of the data locally, to avoid network traffic
• Automatic parallelization, distribution, and fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
– MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
– With "housekeeping" taken care of by the framework, developers can concentrate simply on writing Map and Reduce functions
Page 37
38. MapReduce: The Mapper
• The Mapper reads data in the form of key/value pairs (KVPs)
• It outputs zero or more KVPs
• The Mapper may use or completely ignore the input key
– For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant with this pattern
• If the Mapper writes anything out, it must be in the form of KVPs
– This "intermediate data" is NOT stored in HDFS (local storage only, without replication)
Page 40
39. MapReduce: The Reducer
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– All values associated with a particular intermediate key are guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed in sorted order
• The Reducer outputs zero or more KVPs
– These are written to HDFS
– In practice, the Reducer often emits a single KVP for each input key
Page 41
40. MapReduce Example: Word Count
• Count the number of occurrences of each word in a large amount of input data
Page 42
map(String input_key, String input_value)
  foreach word in input_value:
    emit(word, 1)

reduce(String output_key, Iter<int> intermediate_vals)
  set count = 0
  foreach val in intermediate_vals:
    count += val
  emit(output_key, count)
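The pseudocode above can be sketched as a small, self-contained Java simulation of the three stages (plain in-memory code with illustrative class and method names; no Hadoop APIs involved):

```java
import java.util.*;

// Toy in-memory simulation of MapReduce word count: map emits (word, 1) pairs,
// shuffle groups the values by key in sorted order, and reduce sums each list.
public class WordCountSim {

    // Map phase: split a line on whitespace and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort: group all intermediate values by key; TreeMap keeps keys sorted,
    // mirroring the sorted order Reducers see.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list for each key and emit (word, count).
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) intermediate.addAll(map(line));
        return reduce(shuffle(intermediate));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList(
                "I will not eat green eggs and ham",
                "I will not eat them Sam I am")));
    }
}
```

Running main with the two "green eggs and ham" lines from the next slides reproduces the counts shown there; the real framework does the same thing, but with the map and reduce calls spread across many nodes.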
41. MapReduce Example: Map Phase
Page 43
• Input to the Mapper
– Ignoring the key; it is just an offset
• Output from the Mapper
– No attempt is made to optimize within a record in this example
– This is a great use case for a "Combiner"

Input:
(8675, "I will not eat green eggs and ham")
(8709, "I will not eat them Sam I am")

Output:
("I", 1), ("will", 1), ("not", 1), ("eat", 1),
("green", 1), ("eggs", 1), ("and", 1), ("ham", 1),
("I", 1), ("will", 1), ("not", 1), ("eat", 1),
("them", 1), ("Sam", 1), ("I", 1), ("am", 1)
42. MapReduce Example: Reduce Phase
Page 44
• Input to the Reducer
– Notice keys are sorted and associated values for the same key are in a single list
– Shuffle & Sort did this for us
• Output from the Reducer
– All done!

Input:
("I", [1, 1, 1]), ("Sam", [1]), ("am", [1]), ("and", [1]),
("eat", [1, 1]), ("eggs", [1]), ("green", [1]), ("ham", [1]),
("not", [1, 1]), ("them", [1]), ("will", [1, 1])

Output:
("I", 3), ("Sam", 1), ("am", 1), ("and", 1),
("eat", 2), ("eggs", 1), ("green", 1), ("ham", 1),
("not", 2), ("them", 1), ("will", 2)
43. Code Walkthru & Demo Time!!
• Word Count Example
– Java MapReduce
– Pig
Page 45
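The Pig version of the Word Count demo can be sketched roughly as follows (the input path, output path, and alias names are illustrative assumptions, not the demo's actual code):

```pig
-- Load raw lines, split each line into words, group by word, and count.
lines  = LOAD '/user/demo/input.txt' AS (line:chararray);   -- hypothetical path
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO '/user/demo/wordcount_out';               -- hypothetical path
```

The GROUP … BY statement performs the same grouping that MapReduce's shuffle & sort does in the Java version; Pig compiles the whole script down to MapReduce jobs.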
45. Dataset: Open Georgia
• Salaries & Travel Reimbursements
– Organization
  – Local Boards of Education (several Atlanta-area districts; multiple years)
  – State Agencies, Boards, Authorities and Commissions (Dept of Public Safety; 2010)
Page 47
46. Format & Sample Data
Page 48
NAME (String) | TITLE (String) | SALARY (float) | ORG TYPE (String) | ORG (String) | YEAR (int)
ABBOTT,DEEDEE W | GRADES 9-12 TEACHER | 52,122.10 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010
ALLEN,ANNETTE D | SPEECH-LANGUAGE PATHOLOGIST | 92,937.28 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010
BAHR,SHERREEN T | GRADE 5 TEACHER | 52,752.71 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
BAILEY,ANTOINETTE R | SCHOOL SECRETARY/CLERK | 19,905.90 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
BAILEY,ASHLEY N | EARLY INTERVENTION PRIMARY TEACHER | 43,992.82 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
CALVERT,RONALD MARTIN | STATE PATROL (SP) | 51,370.40 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010
CAMERON,MICHAEL D | PUBLIC SAFETY TRN (AL) | 34,748.60 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010
DAAS,TARWYN TARA | GRADES 9-12 TEACHER | 41,614.50 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011
DABBS,SANDRA L | GRADES 9-12 TEACHER | 79,801.59 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011
E'LOM,SOPHIA L | IS PERSONNEL - GENERAL ADMIN | 75,509.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
EADDY,FENNER R | SUBSTITUTE | 13,469.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
EADY,ARNETTA A | ASSISTANT PRINCIPAL | 71,879.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
47. Simple Use Case
• For all loaded State of Georgia salary information
– Produce statistics for each specific job title
  – Number of employees
  – Salary breakdown: minimum, maximum, average
– Limit the data to investigate
  – Fiscal year 2010
  – School district employees
Page 49
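The statistics above map naturally onto a single HiveQL aggregate query. A sketch (the table and column names below are assumptions based on the format shown on the sample-data slide, not the demo's actual schema):

```sql
SELECT title,
       COUNT(*)    AS employees,
       MIN(salary) AS min_salary,
       MAX(salary) AS max_salary,
       AVG(salary) AS avg_salary
FROM   open_georgia              -- hypothetical table over the loaded dataset
WHERE  year = 2010               -- fiscal year 2010 only
  AND  org_type = 'LBOE'         -- Local Boards of Education (school districts)
GROUP BY title;
```

The same shape of computation appears in the Java MapReduce and Pig versions of the demo: the WHERE clause is the map-side filter, and the GROUP BY with its aggregates is the reduce.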
48. Code Walkthru & Demo; Part Deux!
• Word Count Example
– Java MapReduce
– Pig
– Hive
Page 50
49. Demo Wrap-Up
• All code, test data, wiki pages, and blog postings can be found, or linked to, from
– https://github.com/lestermartin/hadoop-exploration
• This deck can be found on SlideShare
– http://www.slideshare.net/lestermartin
• Questions?
Page 51
50. Thank You!!
• Lester Martin
• Hortonworks – Professional Services
• lmartin@hortonworks.com
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc.)
Page 52
Editor's notes
Hadoop fills several important needs in your data storage and processing infrastructure
Store and use all types of data: Allows semi-structured, unstructured and structured data to be processed in a way to create new insights of significant business value.
Process all the data: Instead of looking at samples of data or small sections of data, organizations can look at large volumes of data to get new perspective and make business decisions with higher degree of accuracy.
Scalability: Reducing latency in business is critical for success. The massive scalability of Big Data systems allow organizations to process massive amounts of data in a fraction of the time required for traditional systems.
Commodity hardware: Self-healing, extremely scalable, highly available environment with cost-effective commodity hardware.
KEY CALLOUT: Schema on Read
IMPORTANT NOTE: Hadoop is not meant to replace your relational database. Hadoop is for storing Big Data, which is often the type of data that you would otherwise not store in a database due to size or cost constraints. You will still have your database for relational, transactional data.
I can't really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.
What we now know of as Hadoop really started back in 2005, when the team at Yahoo! started to work on a project to build a large-scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo!'s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.
By 2008, Hadoop usage had greatly expanded inside of Yahoo!, to the point that many applications were now using this data management platform, and as a result the team's focus extended to include operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo!, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications.
In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo! – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop.
[Note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo!]
SQL is a query language
Declarative, what not how
Oriented around answering a question
Requires uniform schema
Requires metadata
Known by everyone
A great choice for answering queries, building reports, use with automated tools
With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we've put Hive on a clear roadmap to SQL compliance.
That includes adding critical datatypes like character and date types, as well as implementing common SQL semantics seen in most databases.
"hdfs dfs" is the *new* "hadoop fs"
Blank acts like ~
These two slides were just to make folks feel at home with CLI access to HDFS
See https://martin.atlassian.net/wiki/x/FwAvAQ for more details
Surely not the typical Volume/Velocity/Variety definition of "Big Data", but gives us a controlled environment to do some simple prototyping and validating with
See https://martin.atlassian.net/wiki/x/NYBmAQ for more details
See https://martin.atlassian.net/wiki/x/FwAvAQ for more information