Hadoop Demystified
What is it? How does Microsoft fit in?
and… of course… some demos!
Presentation for ATL .NET User Group
(July, 2014)
Lester Martin
Page 1
Agenda
• Hadoop 101
–Fundamentally, What is Hadoop?
–How is it Different?
–History of Hadoop
• Components of the Hadoop Ecosystem
• MapReduce, Pig, and Hive Demos
–Word Count
–Open Georgia Dataset Analysis
Page 2
Connection before Content
• Lester Martin
• Hortonworks – Professional Services
• lmartin@hortonworks.com
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc)
Page 3
© Hortonworks Inc. 2012
What is Core Apache Hadoop?
Scalable, Fault Tolerant, Open Source Data Storage and Processing
• Scale-Out Storage – HDFS
• Scale-Out Processing – MapReduce
• Scale-Out Resource Mgt – YARN
Flexibility to Store and Mine Any Type of Data
–Ask questions that were previously impossible to ask or solve
–Not bound by a single, fixed schema
Excels at Processing Complex Data
–Scale-out architecture divides workloads across multiple nodes
–Eliminates ETL bottlenecks
Scales Economically
–Deployed on “commodity” hardware
–Open source platform guards against vendor lock-in
The Need for Hadoop
• Store and use all types of data
• Process ALL the data; not just a sample
• Scalability to 1000s of nodes
• Commodity hardware
Page 5
Relational Database vs. Hadoop
Relational:
• Schema required on write
• Reads are fast
• Standards and structure (governance)
• Limited to no data processing
• Structured data types
• Best-fit uses: interactive OLAP analytics, complex ACID transactions, operational data store
Hadoop:
• Schema required on read
• Writes are fast
• Loosely structured
• Processing coupled with data
• Multi- and unstructured data types
• Best-fit uses: data discovery, processing unstructured data, massive storage/processing
Fundamentally, a Simple Algorithm
1. Review stack of quarters
2. Count each year that ends
in an even number
Page 7
Processing at Scale
Page 8
Distributed Algorithm – Map:Reduce
Page 9
Map
(total number of quarters)
Reduce
(sum each person’s total)
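The coin-counting analogy maps directly onto the two phases; a tiny simulation (hypothetical code, not from the deck) makes it concrete:

```python
# Hypothetical simulation of the quarters analogy: each "person" (a mapper)
# counts the even-year quarters in their own stack, then a reducer sums
# every person's partial total into the grand total.

def map_count_even_years(stack):
    """Map: one person counts quarters whose year is an even number."""
    return sum(1 for year in stack if year % 2 == 0)

def reduce_sum(per_person_counts):
    """Reduce: sum each person's total into the overall answer."""
    return sum(per_person_counts)

stacks = [[1998, 2001, 2004], [1977, 1980], [2002, 2003, 2006]]
partials = [map_count_even_years(s) for s in stacks]  # one count per person
total = reduce_sum(partials)
```

The point of the split is that each stack is counted independently, so the map step parallelizes trivially across people (nodes).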
A Brief History of Apache Hadoop
• 2005: Hadoop created at Yahoo!
• 2006: Apache project established
• 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters; Yahoo! begins to operate at scale
• 2011: Hortonworks created to focus on “Enterprise Hadoop” (Hortonworks Data Platform); starts with 24 key Hadoop engineers from Yahoo!
• 2013: Focus shifts from operations and stability to innovation
Page 10
HDP / Hadoop Components
Page 11
HDP: Enterprise Hadoop Platform
Page 12
Hortonworks
Data Platform (HDP)
• The ONLY 100% open source
and complete platform
• Integrates full range of
enterprise-ready services
• Certified and tested at scale
• Engineered for deep
ecosystem interoperability
HORTONWORKS DATA PLATFORM (HDP)
• Hadoop Core: HDFS, YARN, MapReduce, Tez
• Data Services: Hive & HCatalog, Pig, HBase
• Operational Services: Oozie, Ambari, Falcon*
• Load & Extract: Sqoop, Flume, NFS, WebHDFS, Knox*
• Platform Services – Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• Deployment options: OS/VM, Cloud, Appliance
Typical Hadoop Cluster
Page 13
HDFS - Writing Files
• Diagram: the Hadoop client requests a write from the Name Node, which returns the target Data Nodes (DNs)
• The client writes blocks directly to Data Nodes spread across racks (Rack1 … RackN); each worker runs a Data Node (DN) and Node Manager (NM)
• Data Nodes send block reports back to the Name Node
• A Backup NN keeps a checkpoint per NN via fs sync
Hive
• Data warehousing package built on top of Hadoop
• Bringing structure to unstructured data
• Query petabytes of data with HiveQL
• Schema on read
Hive: SQL-Like Interface to Hadoop
• Provides basic SQL functionality using MapReduce to
execute queries
• Supports standard SQL clauses
INSERT INTO
SELECT
FROM … JOIN … ON
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
• Supports basic DDL
CREATE/ALTER/DROP TABLE, DATABASE
Page 17
Hortonworks Investment
in Apache Hive
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of Hive
…70% complete in 6 months…all IN Hadoop
Page 18
Goals:
• Speed – improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale – the only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL – support the broadest range of SQL semantics for analytic applications running against Hadoop
Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format
Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN
Stinger Phase 3:
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)
Stinger: Enhancing SQL Semantics
Page 19
Hive SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, CHAR, VARCHAR, DATE
Hive SQL Semantics:
• SELECT, LOAD, INSERT from query
• Expressions in WHERE and HAVING
• GROUP BY, ORDER BY, SORT BY
• Sub-queries in FROM clause
• CLUSTER BY, DISTRIBUTE BY
• ROLLUP and CUBE
• UNION
• LEFT, RIGHT and FULL INNER/OUTER JOIN
• CROSS JOIN, LEFT SEMI JOIN
• Windowing functions (OVER, RANK, etc.)
• INTERSECT, EXCEPT, UNION DISTINCT
• Sub-queries in HAVING
• Sub-queries in WHERE (IN/NOT IN, EXISTS/NOT EXISTS)
These features arrived incrementally across Hive 0.10, 0.11, 0.12, and 0.13 – a complete subset of SQL semantics.
Pig
• Pig was created at Yahoo! to analyze data in HDFS without writing
Map/Reduce code.
• Two components:
– A SQL-like processing language called “Pig Latin”
– The Pig execution engine, which produces MapReduce code
• Popular uses:
– ETL at scale (offloading)
– Text parsing and processing to Hive or HBase
– Aggregating data from multiple sources
Pig
Sample code to find dropped-call data:
4G_Data = LOAD '/archive/FDR_4G.txt' USING TextLoader();
Customer_Master = LOAD 'masterdb.customer_data' USING HCatLoader();
4G_Data_Full = JOIN 4G_Data BY customerID, Customer_Master BY customerID;
X = FILTER 4G_Data_Full BY State == 'call_dropped';
Typical Data Analysis Workflow
Powering the Modern Data Architecture
Page 23
HADOOP 1.0 – Single Use System (batch apps):
• HDFS 1 (redundant, reliable storage)
• MapReduce (distributed data processing & cluster resource management)
• Data processing frameworks on top (Hive, Pig, Cascading, …)
HADOOP 2.0 – Multi Use Data Platform (batch, interactive, online, streaming, …):
• HDFS 2 (redundant, reliable storage)
• YARN (cluster resource management)
• Interact with all data in multiple ways simultaneously: Hive (standard SQL processing), MapReduce (batch), Tez (interactive), HBase & Accumulo (online data processing), Storm (real-time stream processing), and others
Word Counting Time!!
Hadoop’s “Hello Whirled” Example
A quick refresher of core elements of
Hadoop and then code walk-thrus with
Java MapReduce and Pig
Page 25
Core Hadoop Concepts
• Applications are written in high-level code
–Developers need not worry about network programming, temporal
dependencies or low-level infrastructure
• Nodes talk to each other as little as possible
–Developers should not write code which communicates between
nodes
–“Shared nothing” architecture
• Data is spread among machines in advance
–Computation happens where the data is stored, wherever possible
– Data is replicated multiple times on the system for increased
availability and reliability
Page 26
Hadoop: Very High-Level Overview
• When data is loaded in the system, it is split into
“blocks”
–Typically 64MB or 128MB
• Map tasks (first part of MapReduce) work on relatively
small portions of data
–Typically a single block
• A master program allocates work to nodes such that a
Map tasks will work on a block of data stored locally
on that node whenever possible
–Many nodes work in parallel, each on their own part of the overall
dataset
Page 27
Fault Tolerance
• If a node fails, the master will detect that failure and
re-assign the work to a different node on the system
• Restarting a task does not require communication
with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back
to the system and assigned new tasks
• If a node appears to be running slowly, the master
can redundantly execute another instance of the same
task
–Results from the first to finish will be used
–Known as “speculative execution”
Page 28
Hadoop Components
• Hadoop consists of two core components
–The Hadoop Distributed File System (HDFS)
–MapReduce
• Many other projects based around core Hadoop (the
“Ecosystem”)
–Pig, Hive, HBase, Flume, Oozie, Sqoop, Datameer, etc.
• A set of machines running HDFS and MapReduce is
known as a Hadoop Cluster
–Individual machines are known as nodes
–A cluster can have as few as one node, as many as several
thousand
– More nodes = better performance!
Page 29
Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Data is split into blocks and distributed across
multiple nodes in the cluster
–Each block is typically 64MB (the default) or 128MB in size
• Each block is replicated multiple times
–Default is to replicate each block three times
–Replicas are stored on different nodes
– This ensures both reliability and availability
Page 30
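The arithmetic behind blocks and replicas can be sketched in a few lines (using the 128MB block size and 3x replication defaults cited above; the helper names are illustrative):

```python
import math

# Illustrative only: how many blocks a file occupies and how much raw
# cluster storage its replicas consume, assuming 128MB blocks and the
# default replication factor of three.

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB per block
REPLICATION = 3                  # each block stored three times

def block_count(file_bytes):
    """Number of HDFS blocks a file of this size is split into."""
    return max(1, math.ceil(file_bytes / BLOCK_SIZE))

def raw_storage(file_bytes):
    """Raw bytes consumed across the cluster once every block is replicated."""
    return file_bytes * REPLICATION

one_gb = 1024 ** 3
blocks = block_count(one_gb)   # a 1 GB file splits into 8 blocks of 128 MB
```

So a 1GB file costs 3GB of raw disk across the cluster, with its 8 blocks (and their replicas) spread over different nodes.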
HDFS Replicated Blocks Visualized
Page 31
HDFS *is* a File System
• Screenshot for “Name Node UI”
Page 32
Accessing HDFS
• Applications can read and write HDFS files directly via
a Java API
• Typically, files are created on a local filesystem and
must be moved into HDFS
• Likewise, files stored in HDFS may need to be moved
to a machine’s local filesystem
• Access to HDFS from the command line is achieved
with the hdfs dfs command
–Provides various shell-like commands as you find on Linux
–Replaces the hadoop fs command
• Graphical tools available like the Sandbox’s Hue File
Browser and Red Gate’s HDFS Explorer
Page 33
hdfs dfs Examples
• Copy file fooLocal.txt from local disk to the user’s home directory in HDFS
–This will copy the file to /user/username/fooHDFS.txt
hdfs dfs -put fooLocal.txt fooHDFS.txt
• Get a directory listing of the user’s home directory in HDFS
hdfs dfs -ls
• Get a directory listing of the HDFS root directory
hdfs dfs -ls /
Page 34
hdfs dfs Examples (continued)
• Display the contents of a specific HDFS file
hdfs dfs -cat /user/fred/fooHDFS.txt
• Move that file back to the local disk
hdfs dfs -get /user/fred/fooHDFS.txt barLocal.txt
• Create a directory called input under the user’s home
directory
hdfs dfs -mkdir input
• Delete the HDFS directory input and all its contents
hdfs dfs -rm -r input
Page 35
Hadoop Components: MapReduce
• MapReduce is the system used to process data in the
Hadoop cluster
• Consists of two phases: Map, and then Reduce
–Between the two is a stage known as the shuffle and sort
• Each Map task operates on a discrete portion of the
overall dataset
–Typically one HDFS block of data
• After all Maps are complete, the MapReduce system
distributes the intermediate data to nodes which
perform the Reduce phase
–Source code examples and live demo coming!
Page 36
Features of MapReduce
• Hadoop attempts to run tasks on nodes which hold
their portion of the data locally, to avoid network
traffic
• Automatic parallelization, distribution, and fault-tolerance
• Status and monitoring tools
• A clean abstraction for programmers
–MapReduce programs are usually written in Java
– Can be written in any language using Hadoop Streaming
– All of Hadoop is written in Java
–With “housekeeping” taken care of by the framework, developers
can concentrate simply on writing Map and Reduce functions
Page 37
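Since Hadoop Streaming is mentioned above: a minimal word-count mapper/reducer pair might look like the sketch below. Streaming passes input lines on stdin and expects tab-separated key/value pairs on stdout, with the reducer's input already sorted by key; the function names here are illustrative, not part of any Hadoop API.

```python
def mapper(lines):
    """Streaming map step: emit 'word<TAB>1' for every word on every line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reduce step: input is sorted by key, so sum the counts
    for each consecutive run of equal keys."""
    current, count = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

# In a real job these would be two scripts reading sys.stdin, wired up
# with something like:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
```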
MapReduce Visualized
Page 38
Detailed Administrative Console
• Screenshot from “Job Tracker UI”
Page 39
MapReduce: The Mapper
• The Mapper reads data in the form of key/value pairs
(KVPs)
• It outputs zero or more KVPs
• The Mapper may use or completely ignore the input
key
–For example, a standard pattern is to read a line of a file at a time
– The key is the byte offset into the file at which the line starts
– The value is the contents of the line itself
– Typically the key is considered irrelevant with this pattern
• If the Mapper writes anything out, it must be in the form
of KVPs
–This “intermediate data” is NOT stored in HDFS (local storage only
without replication)
Page 40
MapReduce: The Reducer
• After the Map phase is over, all the intermediate
values for a given intermediate key are combined
together into a list
• This list is given to a Reducer
–There may be a single Reducer, or multiple Reducers
–All values associated with a particular intermediate key are
guaranteed to go to the same Reducer
–The intermediate keys, and their value lists, are passed in sorted
order
• The Reducer outputs zero or more KVPs
–These are written to HDFS
–In practice, the Reducer often emits a single KVP for each input
key
Page 41
MapReduce Example: Word Count
• Count the number of occurrences of each word in a
large amount of input data
Page 42
map(String input_key, String input_value)
foreach word in input_value:
emit(word, 1)
reduce(String output_key, Iter<int> intermediate_vals)
set count = 0
foreach val in intermediate_vals:
count += val
emit(output_key, count)
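The pseudocode above runs as-is in plain Python if we also simulate the shuffle-and-sort step in between (a sketch of the idea, not the real MapReduce API):

```python
from collections import defaultdict

def map_phase(records):
    """map(input_key, input_value): emit (word, 1) per word, ignoring the key."""
    for _offset, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Group all values for each intermediate key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """reduce(output_key, intermediate_vals): sum the 1s into a count."""
    for key, values in grouped:
        yield (key, sum(values))

# The same (offset, line) records used in the walkthrough on the next slides
records = [(8675, "I will not eat green eggs and ham"),
           (8709, "I will not eat them Sam I am")]
counts = dict(reduce_phase(shuffle_sort(map_phase(records))))
```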
MapReduce Example: Map Phase
Page 43
• Input to the Mapper
• Ignoring the key
– It is just an offset
• Output from the Mapper
• No attempt is made to optimize
within a record in this example
– This is a great use case for a
“Combiner”
(8675, ‘I will not eat
green eggs and ham’)
(8709, ‘I will not eat
them Sam I am’)
(‘I’, 1), (‘will’, 1),
(‘not’, 1), (‘eat’, 1),
(‘green’, 1), (‘eggs’, 1),
(‘and’, 1), (‘ham’, 1),
(‘I’, 1), (‘will’, 1),
(‘not’, 1), (‘eat’, 1),
(‘them’, 1), (‘Sam’, 1),
(‘I’, 1), (‘am’, 1)
MapReduce Example: Reduce Phase
Page 44
• Input to the Reducer
• Notice keys are sorted and
associated values for same key
are in a single list
– Shuffle & Sort did this for us
• Output from the Reducer
• All done!
(‘I’, [1, 1, 1])
(‘Sam’, [1])
(‘am’, [1])
(‘and’, [1])
(‘eat’, [1, 1])
(‘eggs’, [1])
(‘green’, [1])
(‘ham’, [1])
(‘not’, [1, 1])
(‘them’, [1])
(‘will’, [1, 1])
(‘I’, 3)
(‘Sam’, 1)
(‘am’, 1)
(‘and’, 1)
(‘eat’, 2)
(‘eggs’, 1)
(‘green’, 1)
(‘ham’, 1)
(‘not’, 2)
(‘them’, 1)
(‘will’, 2)
Code Walkthru & Demo Time!!
• Word Count Example
–Java MapReduce
–Pig
Page 45
Additional Demonstrations
A Real-World Analysis Example
Compare/contrast solving the same
problem with Java MapReduce, Pig,
and Hive
Page 46
Dataset: Open Georgia
• Salaries & Travel Reimbursements
–Organization
– Local Boards of Education
– Several Atlanta-area districts; multiple years
– State Agencies, Boards, Authorities and Commissions
– Dept of Public Safety; 2010
Page 47
Format & Sample Data
Page 48
Columns: NAME (String), TITLE (String), SALARY (float), ORG TYPE (String), ORG (String), YEAR (int)
• ABBOTT,DEEDEE W | GRADES 9-12 TEACHER | 52,122.10 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010
• ALLEN,ANNETTE D | SPEECH-LANGUAGE PATHOLOGIST | 92,937.28 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010
• BAHR,SHERREEN T | GRADE 5 TEACHER | 52,752.71 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
• BAILEY,ANTOINETTE R | SCHOOL SECRETARY/CLERK | 19,905.90 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
• BAILEY,ASHLEY N | EARLY INTERVENTION PRIMARY TEACHER | 43,992.82 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010
• CALVERT,RONALD MARTIN | STATE PATROL (SP) | 51,370.40 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010
• CAMERON,MICHAEL D | PUBLIC SAFETY TRN (AL) | 34,748.60 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010
• DAAS,TARWYN TARA | GRADES 9-12 TEACHER | 41,614.50 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011
• DABBS,SANDRA L | GRADES 9-12 TEACHER | 79,801.59 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011
• E'LOM,SOPHIA L | IS PERSONNEL - GENERAL ADMIN | 75,509.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
• EADDY,FENNER R | SUBSTITUTE | 13,469.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
• EADY,ARNETTA A | ASSISTANT PRINCIPAL | 71,879.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012
Simple Use Case
• For all loaded State of Georgia salary information
–Produce statistics for each specific job title
– Number of employees
– Salary breakdown
– Minimum
– Maximum
– Average
–Limit the data to investigate
– Fiscal year 2010
– School district employees
Page 49
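Outside of Hadoop, the same analysis fits in a few lines of Python over rows like those on the sample-data slide. The deck solves it with Java MapReduce, Pig, and Hive; this pipe-delimited inline layout is an assumption made purely for illustration.

```python
import csv
from io import StringIO
from collections import defaultdict

# A few rows mirroring the sample-data slide (pipe-delimited for this sketch)
data = """ABBOTT,DEEDEE W|GRADES 9-12 TEACHER|52122.10|LBOE|ATLANTA INDEPENDENT SCHOOL SYSTEM|2010
BAHR,SHERREEN T|GRADE 5 TEACHER|52752.71|LBOE|COBB COUNTY SCHOOL DISTRICT|2010
DAAS,TARWYN TARA|GRADES 9-12 TEACHER|41614.50|LBOE|FULTON COUNTY BOARD OF EDUCATION|2011
CALVERT,RONALD MARTIN|STATE PATROL (SP)|51370.40|SABAC|PUBLIC SAFETY, DEPARTMENT OF|2010"""

# Limit to fiscal year 2010 and school district (LBOE) employees,
# then collect salaries per job title
salaries = defaultdict(list)
for name, title, salary, org_type, org, year in csv.reader(StringIO(data), delimiter="|"):
    if int(year) == 2010 and org_type == "LBOE":
        salaries[title].append(float(salary))

# Per-title statistics: employee count plus min/max/average salary
stats = {title: {"employees": len(v), "min": min(v), "max": max(v),
                 "avg": round(sum(v) / len(v), 2)}
         for title, v in salaries.items()}
```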
Code Walkthru & Demo; Part Deux!
• Word Count Example
–Java MapReduce
–Pig
–Hive
Page 50
Demo Wrap-Up
• All code, test data, wiki pages, and blog posting can
be found, or linked to, from
–https://github.com/lestermartin/hadoop-exploration
• This deck can be found on SlideShare
–http://www.slideshare.net/lestermartin
• Questions?
Page 51
Thank You!!
• Lester Martin
• Hortonworks – Professional Services
• lmartin@hortonworks.com
• http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc)
Page 52
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalHortonworks
 

Ähnlich wie Hadoop Demystified: What is it? How does Microsoft fit in? and... of course... some demos (20)

Hadoop
HadoopHadoop
Hadoop
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Overview of big data & hadoop version 1 - Tony Nguyen
Overview of big data & hadoop   version 1 - Tony NguyenOverview of big data & hadoop   version 1 - Tony Nguyen
Overview of big data & hadoop version 1 - Tony Nguyen
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Topic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptxTopic 9a-Hadoop Storage- HDFS.pptx
Topic 9a-Hadoop Storage- HDFS.pptx
 
Hadoop jon
Hadoop jonHadoop jon
Hadoop jon
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Discover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.finalDiscover.hdp2.2.storm and kafka.final
Discover.hdp2.2.storm and kafka.final
 

Kürzlich hochgeladen

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 

Kürzlich hochgeladen (20)

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 

Hadoop Demystified: What is it? How does Microsoft fit in? and... of course... some demos

  • 1. Hadoop Demystified What is it? How does Microsoft fit in? and… of course… some demos! Presentation for ATL .NET User Group (July, 2014) Lester Martin Page 1
  • 2. Agenda • Hadoop 101 –Fundamentally, What is Hadoop? –How is it Different? –History of Hadoop • Components of the Hadoop Ecosystem • MapReduce, Pig, and Hive Demos –Word Count –Open Georgia Dataset Analysis Page 2
  • 3. Connection before Content • Lester Martin • Hortonworks – Professional Services • lmartin@hortonworks.com • http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc) Page 3
  • 4. © Hortonworks Inc. 2012 What is Core Apache Hadoop? Scalable, Fault Tolerant, Open Source Data Storage and Processing: Scale-Out Storage (HDFS), Scale-Out Processing (MapReduce), Scale-Out Resource Mgt (YARN). Flexibility to Store and Mine Any Type of Data:  Ask questions that were previously impossible to ask or solve  Not bound by a single, fixed schema. Excels at Processing Complex Data:  Scale-out architecture divides workloads across multiple nodes  Eliminates ETL bottlenecks. Scales Economically:  Deployed on “commodity” hardware  Open source platform guards against vendor lock. Page 7
  • 5. The Need for Hadoop • Store and use all types of data • Process ALL the data; not just a sample • Scalability to 1000s of nodes • Commodity hardware Page 5
  • 6. Relational Database vs. Hadoop. Schema: required on write (Relational) vs. required on read (Hadoop). Speed: reads are fast vs. writes are fast. Governance: standards and structure vs. loosely structured. Processing: limited to no data processing vs. processing coupled with data. Data types: structured vs. multi- and unstructured. Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing.
  • 7. Fundamentally, a Simple Algorithm 1. Review stack of quarters 2. Count each year that ends in an even number Page 7
  • 9. Distributed Algorithm – Map:Reduce Page 9 Map (total number of quarters) Reduce (sum each person’s total)
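The quarter-counting slide can be sketched as a tiny local simulation: each "person" maps over their own stack, emitting a 1 per even minting year, and the reduce step sums the per-person totals. The names and sample years below are made up for illustration; this is the dataflow idea, not Hadoop itself.

```python
# Hypothetical local simulation of the quarter-counting example:
# each "person" maps over their own stack of quarters, then the
# reduce step sums the per-person totals into a grand total.

stacks = {  # minting years per person's stack (made-up sample data)
    "alice": [1998, 2001, 2004, 2007],
    "bob":   [1992, 1993, 2000],
}

def map_phase(years):
    # emit a 1 for each quarter whose year ends in an even number
    return [1 for y in years if y % 2 == 0]

def reduce_phase(mapped):
    # sum each person's total, then the grand total
    return sum(sum(ones) for ones in mapped.values())

mapped = {person: map_phase(years) for person, years in stacks.items()}
total_even = reduce_phase(mapped)  # 2 from alice + 2 from bob = 4
```

The point of the split is that each map runs independently on its own stack, so adding more people (nodes) scales the counting without any coordination until the final sum.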
  • 10. A Brief History of Apache Hadoop Page 10. 2005: Hadoop created at Yahoo!; Apache project established (focus on STABILITY). 2008: Yahoo! begins to operate at scale; the Yahoo team extends focus to OPERATIONS to support multiple projects & growing clusters. 2011: Hortonworks created to focus on “Enterprise Hadoop”, starting with 24 key Hadoop engineers from Yahoo. 2013: focus on INNOVATION (Hortonworks Data Platform).
  • 11. HDP / Hadoop Components Page 11
  • 12. HDP: Enterprise Hadoop Platform Page 12 Hortonworks Data Platform (HDP) • The ONLY 100% open source and complete platform • Integrates full range of enterprise-ready services • Certified and tested at scale • Engineered for deep ecosystem interoperability OS/VM Cloud Appliance PLATFORM SERVICES HADOOP CORE Enterprise Readiness High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots HORTONWORKS DATA PLATFORM (HDP) OPERATIONAL SERVICES DATA SERVICES HDFS SQOOP FLUME NFS LOAD & EXTRACT WebHDFS KNOX* OOZIE AMBARI FALCON* YARN MAPREDUCE TEZ HIVE & HCATALOG PIG HBASE
  • 14. HDFS - Writing Files (diagram: a Hadoop Client requests a write from the Name Node and gets back a list of DataNodes; it writes blocks directly to DataNodes (DN | NM) spread across Rack1…RackN; DataNodes send block reports and fs sync back to the Name Node, while a Backup NN takes a periodic checkpoint per NN)
  • 15. Hive • Data warehousing package built on top of Hadoop • Bringing structure to unstructured data • Query petabytes of data with HiveQL • Schema on read
  • 16. Hive: SQL-Like Interface to Hadoop • Provides basic SQL functionality using MapReduce to execute queries • Supports standard SQL clauses INSERT INTO SELECT FROM … JOIN … ON WHERE GROUP BY HAVING ORDER BY LIMIT • Supports basic DDL CREATE/ALTER/DROP TABLE, DATABASE Page 17
  • 17. Hortonworks Investment in Apache Hive: Batch AND Interactive SQL-IN-Hadoop. Stinger Initiative: a broad, community-based effort to drive the next generation of HIVE (…70% complete in 6 months…all IN Hadoop). Goals: Speed (improve Hive query performance by 100X to allow for interactive query times, in seconds), Scale (the only SQL interface to Hadoop designed for queries that scale from TB to PB), SQL (support broadest range of SQL semantics for analytic applications running against Hadoop). Stinger Phase 1: • Base Optimizations • SQL Types • SQL Analytic Functions • ORCFile Modern File Format. Stinger Phase 2: • SQL Types • SQL Analytic Functions • Advanced Optimizations • Performance Boosts via YARN. Stinger Phase 3: • Hive on Apache Tez • Query Service (always on) • Buffer Cache • Cost Based Optimizer (Optiq). Page 18
  • 18. Stinger: Enhancing SQL Semantics Page 19 Hive SQL Datatypes Hive SQL Semantics INT SELECT, LOAD, INSERT from query TINYINT/SMALLINT/BIGINT Expressions in WHERE and HAVING BOOLEAN GROUP BY, ORDER BY, SORT BY FLOAT Sub-queries in FROM clause DOUBLE GROUP BY, ORDER BY STRING CLUSTER BY, DISTRIBUTE BY TIMESTAMP ROLLUP and CUBE BINARY UNION DECIMAL LEFT, RIGHT and FULL INNER/OUTER JOIN ARRAY, MAP, STRUCT, UNION CROSS JOIN, LEFT SEMI JOIN CHAR Windowing functions (OVER, RANK, etc.) VARCHAR INTERSECT, EXCEPT, UNION DISTINCT DATE Sub-queries in HAVING Sub-queries in WHERE (IN/NOT IN, EXISTS/NOT EXISTS) (Legend: Hive 0.10, Hive 0.11, Hive 0.12, Hive 0.13; Complete vs. Subset)
  • 19. Pig • Pig was created at Yahoo! to analyze data in HDFS without writing Map/Reduce code. • Two components: – SQL like processing language called “Pig Latin” – PIG execution engine producing Map/Reduce code • Popular uses: – ETL at scale (offloading) – Text parsing and processing to Hive or HBase – Aggregating data from multiple sources
  • 20. Pig Sample Code to find dropped call data: 4G_Data = LOAD ‘/archive/FDR_4G.txt’ using TextLoader(); Customer_Master = LOAD ‘masterdb.customer_data’ using HCatLoader(); 4G_Data_Full = JOIN 4G_Data by customerID, Customer_Master by customerID; X = FILTER 4G_Data_Full BY State == ‘call_dropped’;
  • 22. Powering the Modern Data Architecture HADOOP 2.0 Multi Use Data Platform Batch, Interactive, Online, Streaming, … Page 23 Interact with all data in multiple ways simultaneously Redundant, Reliable Storage HDFS 2 Cluster Resource Management YARN Standard SQL Processing Hive Batch MapReduce Interactive Tez Online Data Processing HBase, Accumulo Real Time Stream Processing Storm others … HADOOP 1.0 HDFS 1 (redundant, reliable storage) MapReduce (distributed data processing & cluster resource management) Single Use System Batch Apps Data Processing Frameworks (Hive, Pig, Cascading, …)
  • 23. Word Counting Time!! Hadoop’s “Hello Whirled” Example A quick refresher of core elements of Hadoop and then code walk-thrus with Java MapReduce and Pig Page 25
  • 24. Core Hadoop Concepts • Applications are written in high-level code –Developers need not worry about network programming, temporal dependencies or low-level infrastructure • Nodes talk to each other as little as possible –Developers should not write code which communicates between nodes –“Shared nothing” architecture • Data is spread among machines in advance –Computation happens where the data is stored, wherever possible – Data is replicated multiple times on the system for increased availability and reliability Page 26
  • 25. Hadoop: Very High-Level Overview • When data is loaded in the system, it is split into “blocks” –Typically 64MB or 128MB • Map tasks (first part of MapReduce) work on relatively small portions of data –Typically a single block • A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node whenever possible –Many nodes work in parallel, each on their own part of the overall dataset Page 27
  • 26. Fault Tolerance • If a node fails, the master will detect that failure and re-assign the work to a different node on the system • Restarting a task does not require communication with nodes working on other portions of the data • If a failed node restarts, it is automatically added back to the system and assigned new tasks • If a node appears to be running slowly, the master can redundantly execute another instance of the same task –Results from the first to finish will be used –Known as “speculative execution” Page 28
  • 27. Hadoop Components • Hadoop consists of two core components –The Hadoop Distributed File System (HDFS) –MapReduce • Many other projects based around core Hadoop (the “Ecosystem”) –Pig, Hive, HBase, Flume, Oozie, Sqoop, Datameer, etc • A set of machines running HDFS and MapReduce is known as a Hadoop Cluster –Individual machines are known as nodes –A cluster can have as few as one node, as many as several thousand – More nodes = better performance! Page 29
  • 28. Hadoop Components: HDFS • HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster • Data is split into blocks and distributed across multiple nodes in the cluster –Each block is typically 64MB (the default) or 128MB in size • Each block is replicated multiple times –Default is to replicate each block three times –Replicas are stored on different nodes – This ensures both reliability and availability Page 30
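The block math above is simple enough to sketch directly. This is back-of-the-envelope arithmetic, not an HDFS API: a file is split into fixed-size blocks, and each block is stored replication-factor times across different nodes, so raw cluster consumption is a multiple of the logical file size.

```python
# Back-of-the-envelope sketch of HDFS block math (not an HDFS API):
# files are split into fixed-size blocks; each block is replicated.
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    # number of blocks needed; the last block may be partially full
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication_factor=3):
    # total cluster storage consumed, counting every replica
    return file_size_mb * replication_factor

# A 1 GB file with 128 MB blocks occupies 8 blocks and, with the
# default 3x replication, 3 GB of raw cluster storage.
blocks = hdfs_block_count(1024)
raw = raw_storage_mb(1024)
```

Note that a 130 MB file still costs two blocks (one full, one nearly empty), which is why HDFS favors large files over many small ones.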
  • 29. HDFS Replicated Blocks Visualized Page 31
  • 30. HDFS *is* a File System • Screenshot for “Name Node UI” Page 32
  • 31. Accessing HDFS • Applications can read and write HDFS files directly via a Java API • Typically, files are created on a local filesystem and must be moved into HDFS • Likewise, files stored in HDFS may need to be moved to a machine’s local filesystem • Access to HDFS from the command line is achieved with the hdfs dfs command –Provides various shell-like commands as you find on Linux –Replaces the hadoop fs command • Graphical tools available like the Sandbox’s Hue File Browser and Red Gate’s HDFS Explorer Page 33
  • 32. hdfs dfs Examples • Copy file fooLocal.txt from local disk to the user’s directory in HDFS –This will copy the file to /user/username/fooHDFS.txt • Get a directory listing of the user’s home directory in HDFS • Get a directory listing of the HDFS root directory Page 34 hdfs dfs -put fooLocal.txt fooHDFS.txt hdfs dfs -ls hdfs dfs -ls /
  • 33. hdfs dfs Examples (continued) • Display the contents of a specific HDFS file • Move that file back to the local disk • Create a directory called input under the user’s home directory • Delete the HDFS directory input and all its contents Page 35 hdfs dfs -cat /user/fred/fooHDFS.txt hdfs dfs -get /user/fred/fooHDFS.txt barLocal.txt hdfs dfs -mkdir input hdfs dfs -rm -r input
  • 34. Hadoop Components: MapReduce • MapReduce is the system used to process data in the Hadoop cluster • Consists of two phases: Map, and then Reduce –Between the two is a stage known as the shuffle and sort • Each Map task operates on a discrete portion of the overall dataset –Typically one HDFS block of data • After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase –Source code examples and live demo coming! Page 36
  • 35. Features of MapReduce • Hadoop attempts to run tasks on nodes which hold their portion of the data locally, to avoid network traffic • Automatic parallelization, distribution, and fault- tolerance • Status and monitoring tools • A clean abstraction for programmers –MapReduce programs are usually written in Java – Can be written in any language using Hadoop Streaming – All of Hadoop is written in Java –With “housekeeping” taken care of by the framework, developers can concentrate simply on writing Map and Reduce functions Page 37
  • 37. Detailed Administrative Console • Screenshot from “Job Tracker UI” Page 39
  • 38. MapReduce: The Mapper • The Mapper reads data in the form of key/value pairs (KVPs) • It outputs zero or more KVPs • The Mapper may use or completely ignore the input key –For example, a standard pattern is to read a line of a file at a time – The key is the byte offset into the file at which the line starts – The value is the contents of the line itself – Typically the key is considered irrelevant with this pattern • If the Mapper writes anything out, it must be in the form of KVPs –This “intermediate data” is NOT stored in HDFS (local storage only without replication) Page 40
  • 39. MapReduce: The Reducer • After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list • This list is given to a Reducer –There may be a single Reducer, or multiple Reducers –All values associated with a particular intermediate key are guaranteed to go to the same Reducer –The intermediate keys, and their value lists, are passed in sorted order • The Reducer outputs zero or more KVPs –These are written to HDFS –In practice, the Reducer often emits a single KVP for each input key Page 41
  • 40. MapReduce Example: Word Count • Count the number of occurrences of each word in a large amount of input data Page 42 map(String input_key, String input_value) foreach word in input_value: emit(word, 1) reduce(String output_key, Iter<int> intermediate_vals) set count = 0 foreach val in intermediate_vals: count += val emit(output_key, count)
  • 41. MapReduce Example: Map Phase Page 43 • Input to the Mapper • Ignoring the key – It is just an offset • Output from the Mapper • No attempt is made to optimize within a record in this example – This is a great use case for a “Combiner” (8675, ‘I will not eat green eggs and ham’) (8709, ‘I will not eat them Sam I am’) (‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘green’, 1), (‘eggs’, 1), (‘and’, 1), (‘ham’, 1), (‘I’, 1), (‘will’, 1), (‘not’, 1), (‘eat’, 1), (‘them’, 1), (‘Sam’, 1), (‘I’, 1), (‘am’, 1)
  • 42. MapReduce Example: Reduce Phase Page 44 • Input to the Reducer • Notice keys are sorted and associated values for same key are in a single list – Shuffle & Sort did this for us • Output from the Reducer • All done! (‘I’, [1, 1, 1]) (‘Sam’, [1]) (‘am’, [1]) (‘and’, [1]) (‘eat’, [1, 1]) (‘eggs’, [1]) (‘green’, [1]) (‘ham’, [1]) (‘not’, [1, 1]) (‘them’, [1]) (‘will’, [1, 1]) (‘I’, 3) (‘Sam’, 1) (‘am’, 1) (‘and’, 1) (‘eat’, 2) (‘eggs’, 1) (‘green’, 1) (‘ham’, 1) (‘not’, 2) (‘them’, 1) (‘will’, 2)
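The map, shuffle & sort, and reduce steps from the word-count walkthrough can be reproduced as a small local simulation. This is just the same dataflow run in one process, not Hadoop itself, using the two "green eggs and ham" lines from the slides.

```python
# Local simulation of the word-count flow from the slides:
# map emits (word, 1) pairs, shuffle & sort groups values by key,
# reduce sums each value list.
from collections import defaultdict

lines = [
    "I will not eat green eggs and ham",
    "I will not eat them Sam I am",
]

def map_phase(line):
    # one (word, 1) pair per word; the input key (offset) is ignored
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # group all values for the same key into one list, keys sorted
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    # sum each key's value list, emitting one (word, count) pair
    return {key: sum(values) for key, values in grouped.items()}

mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_and_sort(mapped))
# counts matches the slide: ('I', 3), ('eat', 2), ('will', 2), ...
```

In real Hadoop the shuffle & sort is done by the framework between the Map and Reduce phases; only the map and reduce functions are user code.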
  • 43. Code Walkthru & Demo Time!! • Word Count Example –Java MapReduce –Pig Page 45
  • 44. Additional Demonstrations A Real-World Analysis Example Compare/contrast solving the same problem with Java MapReduce, Pig, and Hive Page 46
  • 45. Dataset: Open Georgia • Salaries & Travel Reimbursements –Organization – Local Boards of Education – Several Atlanta-area districts; multiple years – State Agencies, Boards, Authorities and Commissions – Dept of Public Safety; 2010 Page 47
  • 46. Format & Sample Data Page 48 NAME (String) TITLE (String) SALARY (float) ORG TYPE (String) ORG (String) YEAR (int) ABBOTT,DEEDEE W GRADES 9-12 TEACHER 52,122.10 LBOE ATLANTA INDEPENDENT SCHOOL SYSTEM 2010 ALLEN,ANNETTE D SPEECH-LANGUAGE PATHOLOGIST 92,937.28 LBOE ATLANTA INDEPENDENT SCHOOL SYSTEM 2010 BAHR,SHERREEN T GRADE 5 TEACHER 52,752.71 LBOE COBB COUNTY SCHOOL DISTRICT 2010 BAILEY,ANTOINETTE R SCHOOL SECRETARY/CLERK 19,905.90 LBOE COBB COUNTY SCHOOL DISTRICT 2010 BAILEY,ASHLEY N EARLY INTERVENTION PRIMARY TEACHER 43,992.82 LBOE COBB COUNTY SCHOOL DISTRICT 2010 CALVERT,RONALD MARTIN STATE PATROL (SP) 51,370.40 SABAC PUBLIC SAFETY, DEPARTMENT OF 2010 CAMERON,MICHAEL D PUBLIC SAFETY TRN (AL) 34,748.60 SABAC PUBLIC SAFETY, DEPARTMENT OF 2010 DAAS,TARWYN TARA GRADES 9-12 TEACHER 41,614.50 LBOE FULTON COUNTY BOARD OF EDUCATION 2011 DABBS,SANDRA L GRADES 9-12 TEACHER 79,801.59 LBOE FULTON COUNTY BOARD OF EDUCATION 2011 E'LOM,SOPHIA L IS PERSONNEL - GENERAL ADMIN 75,509.00 LBOE FULTON COUNTY BOARD OF EDUCATION 2012 EADDY,FENNER R SUBSTITUTE 13,469.00 LBOE FULTON COUNTY BOARD OF EDUCATION 2012 EADY,ARNETTA A ASSISTANT PRINCIPAL 71,879.00 LBOE FULTON COUNTY BOARD OF EDUCATION 2012
  • 47. Simple Use Case • For all loaded State of Georgia salary information –Produce statistics for each specific job title – Number of employees – Salary breakdown – Minimum – Maximum – Average –Limit the data to investigate – Fiscal year 2010 – School district employees Page 49
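Before looking at the MapReduce, Pig, and Hive versions of this use case, the filtering and per-title statistics can be sketched in plain Python. The rows and field layout are taken from the sample-data slide above; everything else (variable names, the in-memory list standing in for the HDFS file) is illustrative, not the demo code itself:

```python
from collections import defaultdict

# A few sample records: (name, title, salary, org_type, org, year)
rows = [
    ("ABBOTT,DEEDEE W", "GRADES 9-12 TEACHER", 52122.10, "LBOE",
     "ATLANTA INDEPENDENT SCHOOL SYSTEM", 2010),
    ("BAHR,SHERREEN T", "GRADE 5 TEACHER", 52752.71, "LBOE",
     "COBB COUNTY SCHOOL DISTRICT", 2010),
    ("CALVERT,RONALD MARTIN", "STATE PATROL (SP)", 51370.40, "SABAC",
     "PUBLIC SAFETY, DEPARTMENT OF", 2010),
    ("DAAS,TARWYN TARA", "GRADES 9-12 TEACHER", 41614.50, "LBOE",
     "FULTON COUNTY BOARD OF EDUCATION", 2011),
]

# Limit the data: fiscal year 2010, school district (LBOE) employees only
salaries_by_title = defaultdict(list)
for name, title, salary, org_type, org, year in rows:
    if year == 2010 and org_type == "LBOE":
        salaries_by_title[title].append(salary)

# Per-title statistics: employee count, minimum, maximum, average salary
stats = {title: (len(s), min(s), max(s), sum(s) / len(s))
         for title, s in salaries_by_title.items()}
```

With the sample rows above, the 2011 teacher and the SABAC state patrol record drop out of the result, leaving one qualifying employee per remaining title.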
  • 48. Code Walkthru & Demo; Part Deux! • Open Georgia Dataset Analysis –Java MapReduce –Pig –Hive Page 50
  • 49. Demo Wrap-Up • All code, test data, wiki pages, and blog posting can be found, or linked to, from –https://github.com/lestermartin/hadoop-exploration • This deck can be found on SlideShare –http://www.slideshare.net/lestermartin • Questions? Page 51
  • 50. Thank You!! • Lester Martin • Hortonworks – Professional Services • lmartin@hortonworks.com • http://about.me/lestermartin (links to blog, github, twitter, LI, FB, etc) Page 52

Editor's Notes

  1. Hadoop fills several important needs in your data storage and processing infrastructure:
     – Store and use all types of data: allows semi-structured, unstructured and structured data to be processed in a way that creates new insights of significant business value.
     – Process all the data: instead of looking at samples or small sections of data, organizations can look at large volumes of data to gain new perspective and make business decisions with a higher degree of accuracy.
     – Scalability: reducing latency in business is critical for success. The massive scalability of Big Data systems allows organizations to process massive amounts of data in a fraction of the time required for traditional systems.
     – Commodity hardware: a self-healing, extremely scalable, highly available environment built on cost-effective commodity hardware.
  2. KEY CALLOUT: Schema on Read. IMPORTANT NOTE: Hadoop is not meant to replace your relational database. Hadoop is for storing Big Data, which is often the type of data you would otherwise not store in a database due to size or cost constraints. You will still have your database for relational, transactional data.
  3. I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know as Hadoop really started back in 2005, when the team at Yahoo! started work on a project to build a large-scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo’s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application. By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications. In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop. [note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
  4. SQL is a query language:
     – Declarative – you say what, not how
     – Oriented around answering a question
     – Requires a uniform schema and metadata
     – Known by everyone
     A great choice for answering queries, building reports, and use with automated tools.
  5. With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
  6. “hdfs dfs” is the *new* “hadoop fs”. A blank path argument acts like ~ (the user’s HDFS home directory).
  7. These two slides were just to make folks feel at home with CLI access to HDFS
  8. See https://martin.atlassian.net/wiki/x/FwAvAQ for more details Surely not the typical Volume/Velocity/Variety definition of “Big Data”, but gives us a controlled environment to do some simple prototyping and validating with
  9. See https://martin.atlassian.net/wiki/x/NYBmAQ for more details
  10. See https://martin.atlassian.net/wiki/x/FwAvAQ for more information