Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Apache Hadoop Masterclass
An introduction to concepts & ecosystem for all audiences
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Who We Are
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Agenda
1. What is Hadoop? -- Why is it so popular?
2. Hardware
3. Ecosystem Tools
4. Summary & Recap
3
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1. What is Hadoop?
How is it different? Why is it so popular?
4
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.1 First, a question…
Conceptual design exercise
5
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
What is Big Data?
Why people are turning to Hadoop
6
[Diagram: data size (MB → GB → TB → PB) plotted against data complexity (variety and velocity). At the small, structured end: ERP/CRM, payables, payroll, inventory, contacts, deal tracking, sales pipeline. In the middle: web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce. At the large, complex "Big Data" end: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, clickstream, wikis/blogs, sensors/RFID devices, social sentiment, audio/video, Web 2.0.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Problem:
Legacy database, experiencing huge growth in data volumes
7
[Diagram: a large application database or data warehouse scaling UP. As data volume grows from TB towards PB, cost climbs from € / GB to €€€€ / GB while performance falls away; beyond a point, further growth is prohibitive or simply not possible.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
How would you re-design from the basics?
8
Scale OUT

[Diagram: the dataset is split into shards (1/x of the data per node, e.g. key ranges 0-5, 6-10, 11-15, 16-20, 21-25, with each range stored on more than one node) and queries (Q) run against all shards in parallel, each node costing the same €.]

Requirements: less cost, less complexity, less risk; more scalable, more reliable, more flexible.

How do we store data? ✓ Commodity hardware ✓ Low, predictable cost per unit ✓ Distributed filesystem
How do we query data? ✓ Distributed computation
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Engineering for efficiency & reliability
What issues exist for coordinating between nodes?
9

▪ Limited network bandwidth ⇒ do more locally before crossing the network
▪ Data loss/re-transmission ⇒ use reliable transfers
▪ Server death ⇒ make servers repeatable elsewhere
▪ Failover scenarios ⇒ make nodes identity-less (any other can take over), with job co-ordination

These are our system requirements
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Engineering for efficiency & reliability
What issues exist within each node?
10

▪ Hardware: MTBF (Mean Time Between Failures) ⇒ expect failure: nodes have no identity, data is replicated
▪ Software crash ⇒ as above
▪ Disk seek time ⇒ avoid random seeks; read sequentially (10x faster than seeking)
▪ Disk contention ⇒ use multiple disks in parallel, with multiple cores

These are our system requirements
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Engineering for flexibility
What issues exist for a query interface?
11

▪ Schema not obvious up front ⇒ support both schema and schema-less operations
▪ Schema changes after implementation ⇒ as above, plus sparse/columnar storage
▪ Language may not suit all users ⇒ plug-in query framework and/or programming API
▪ Language may not support advanced use cases ⇒ generic programming API

These are our system requirements
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
To sum up:
▪ Original legacy assumptions, traditional enterprise systems -
▪ Hardware can be made reliable by spending more on it
▪ Machines have a unique identity each
▪ Datasets must be centralised
▪ New paradigm for data management with Hadoop -
▪ Distribute + duplicate chunks of each data file across many nodes
▪ Use local compute resource to process each chunk in parallel
▪ Minimise elapsed seek time
▪ Expect failure
▪ Handle failover elegantly, and automatically
12
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.2 What is Hadoop?
Architecture Overview
13
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
What is Hadoop?
Logical Architecture
14
[Diagram: the core distribution pairs storage (HDFS, the Hadoop Distributed Filesystem, for data management) with computation (MapReduce, for data analysis), with user tools layered on top. Together they form one distributed system spanning many nodes joined by a fast, private network.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
What is Hadoop?
Physical Architecture
15
[Diagram: each worker node runs a DataNode (storage), a TaskTracker (computation) and, where HBase is deployed, a RegionServer. Master services run on their own nodes: the NameNode (HDFS), the JobTracker (MapReduce) and the HMaster (HBase). User tools reach the cluster through client libraries.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.3 HDFS
Distributed Storage
16
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Storage: writing data
17
[Diagram: six nodes of 8 TB each pooled into a single 48 TB HDFS filesystem; a file being written is split into chunks and spread across the nodes.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Storage: reading data
18
[Diagram: the same 48 TB pool built from six 8 TB nodes; a read streams a file's chunks back from the nodes that hold them, in parallel.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.4 MapReduce
Distributed Computation
19
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Computation
20
[Diagram: the same six-node, 48 TB HDFS pool, with MapReduce running a function F(x) locally on each node that holds a chunk of the data.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Computation: MapReduce
High-level MapReduce process
21
Input file (split into chunks) → Map → Shuffle & sort → Reduce → Output files (one per Reducer)
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Computation: MapReduce
High-level MapReduce process
22
Input (a fragment of Shakespeare, split into chunks):

  Then there were two cousins laid up; when the one should be lamed with
  reasons and the other mad without any. But is all this for your father?
  No, some of it is for my child’s father. O, how full of briers is this
  working day world! They are but burs, cousin, thrown upon thee in
  holiday foolery: if we walk not in the trodden paths our very ...

Map (each chunk emits a (word, 1) pair per word):

  the,1  other,1  ...    were,1  the,1  ...    this,1  some,1  ...
  father,1  this,1  ...  they,1  upon,1  ...   the,1  paths,1  ...

Shuffle & sort (groups all values for each word together):

  father,{1}  other,{1}  paths,{1}  some,{1}  the,{1,1,1}
  they,{1}  this,{1,1}  upon,{1}  where,{1}

Reduce (sums each group):

  father,1  other,1  paths,1  some,1  the,3  they,1  this,2  upon,1  where,1
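Expressed against the Hadoop 2.x MapReduce Java API, the same job is a short program. A minimal sketch, as our own illustration rather than anything from the deck (the class names and tokenisation rule are assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in this node's chunk, emit (word, 1)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);            // e.g. (the, 1)
        }
      }
    }
  }

  // Reduce: the shuffle delivers (word, {1,1,...}); sum the group
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(word, new IntWritable(sum)); // e.g. (the, 3)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would run as hadoop jar wordcount.jar WordCount /in /out (paths hypothetical), producing one output file per reducer as on the slide above.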
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.5 Key Concepts
Points to Remember
23
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Linear Scalability
24
[Diagram and table: an HDFS + MapReduce cluster grown to 150% of its original size by adding nodes. Storage capacity and compute capacity grow by the same 50%, compute time for a fixed workload falls correspondingly, and hardware €, software €, staffing € and total € grow predictably, so € / TB stays flat.]
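A worked example with illustrative numbers: six 8 TB nodes pool 48 TB of storage. Adding three identical nodes (+50%) grows the pool to 72 TB, and a job over a fixed dataset now runs across nine nodes instead of six, so its elapsed time drops to roughly two-thirds, ignoring coordination overhead. Costs grow by at most the same 50%, so € / TB does not rise.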
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
HDFS for Storage
What do I need to remember?
Functional Qualities:
▪ Store many files into a large pooled filesystem
▪ Familiar interface: standard UNIX filesystem
▪ Organise by any arbitrary naming scheme: files/directories
Non-Functional Qualities:
▪ Files can be very, very large (>> a single disk)
▪ Distributed, fault-tolerant, reliable – replicas of chunks are stored
▪ High aggregate throughput: parallel reads/writes
▪ Scalable
25
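Because the interface mimics a standard filesystem, client code reads like ordinary Java I/O. A minimal sketch against the org.apache.hadoop.fs API, as our own illustration (the path is hypothetical; the cluster address is assumed to come from core-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Connects to whatever fs.defaultFS names (an HDFS cluster in production)
    FileSystem fs = FileSystem.get(new Configuration());

    Path file = new Path("/user/demo/hello.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, hdfs");      // chunking and replication happen underneath
    }
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF()); // each block is read back from a replica
    }
  }
}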
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Computation: Not only MapReduce
▪ With Hadoop 2.x, Hadoop introduced YARN
▪ Enhanced MapReduce:
▪ Makes MapReduce just one of many pluggable framework
modules
▪ Existing MapReduce programs continue to run as before
▪ Previous MapReduce API versions run side-by-side
▪ Takes us way beyond MapReduce:
▪ Any new programming model can now be plugged into a
running Hadoop cluster on demand
▪ Including: MPI, BSP, graph, others …
26
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
1.6 Why is it so popular?
Key Differentiators
27
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Why is Hadoop so popular?
The story so far
1. Scalability
▪ Democratisation - the ability to store and manage petabytes
of files and analyse them across thousands of nodes has
never been available to the general population before
▪ Why should the HPC world care?
More users = more problems solved for you by others
▪ Predictable scalability curve
2. Cost
▪ It can be done on cheap hardware, with minimal staff
▪ Free software
3. Ease of use
▪ Familiar Filesystem interface
▪ Simple programming model
28
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Why is Hadoop so popular?
▪ What about flexibility?
▪ How does it compare to a traditional database system?
▪ Tables consisting of rows, columns => 2D
▪ Foreign keys linking tables together
▪ One-to-one, one-to-many, many-to-many relations
29
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Traditional Database Approach
▪ Data model tightly coupled to storage
▪ Have to model like this up-front before anything can be inserted
▪ Only choice of model
▪ May not be appropriate for your application
▪ We’ve grown up with this as the normal general-purpose approach, but is it really?

30

[Diagram: Customers → Orders → Order Lines, a schema defined ON WRITE.]
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store graph data?
▪ Network of nodes & edges
▪ e.g. social graph
▪ Tricky!
▪ Need to store lots of self-references to far away rows
▪ Slow performance
▪ What if we need changing attributes for different node types?
▪ Loss of intuitive schema design
31
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store a sparse matrix?
▪ To avoid a very complicated schema, would need to store a
table with NULL for all empty matrix cells
▪ Lots of NULLs
▪ Waste memory + disk space
▪ Very poor performance
▪ Loss of intuitive schema design
32
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store binary files from a scientific
instrument… images… videos?
▪ Could store metadata in first few cols, then pack binary
data into one huge column?
▪ Inefficient
▪ Slow
▪ Might run out of space in the binary column
▪ No internal structure
▪ Database driver is a bottleneck
33
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hadoop data model
Types of data
34
Structured
• Pre-defined schema
• Highly structured
• Example: relational database systems

Semi-structured
• Inconsistent structure
• Cannot be stored in rows and tables in a typical database
• Examples: logs, tweets, sensor feeds

Unstructured
• Lacks structure, or parts of it lack structure
• Examples: free-form text, reports, customer feedback forms

SCHEMA ON READ
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hadoop approach
A more flexible option?
▪ Just load your data files into the cluster as they are
▪ Don’t worry about the format right now
▪ Leave files in their native/application format
▪ Since they are now in a cluster with distributed computing
power local to each storage shard anyway:
▪ Query the data files in-place
▪ Bring the computation to the data
▪ Saves shipping data around
▪ Use a more flexible model – don’t worry about the schema up-front, figure it out later
▪ Use alternative programming models to query data – MapReduce
is more abstract than SQL so can support wider range of
customisation
▪ Beyond that, can replace MapReduce altogether
35
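As a tiny illustration of schema on read (our own sketch, with a hypothetical log line): the bytes stay untyped on disk, and each reader applies only the structure it needs at query time.

public class SchemaOnRead {
  public static void main(String[] args) {
    // One raw record, stored as-is in the cluster (hypothetical format)
    String line = "2014-04-01\tGB\t/index.html\t200";
    String[] fields = line.split("\t");

    // Reader A applies one "schema" on read...
    String country = fields[1];

    // ...reader B applies another, with no migration of the stored data
    int httpStatus = Integer.parseInt(fields[3]);

    System.out.println(country + " " + httpStatus);
  }
}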
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
2. Hardware Considerations
50,000 ft picture
36
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hardware Considerations
The Basics
▪ Cluster infrastructure design is a complex topic
▪ Full treatment is beyond the scope of this talk
▪ But, essentially:
▪ Prefer cheap commodity hardware
▪ Failure is planned and expected – Hadoop will deal with it
gracefully
▪ The only exception to this is a service called the
“NameNode” which coordinates the filesystem – for some
distributions this requires a high-availability node
▪ The more memory the better
37
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hardware Considerations
Performance
▪ Try to pack as many disks and cores into a node as is
reasonable:
▪ 8 or more disks
▪ Cheap SATA drives – NOT expensive SAS drives
▪ Cheap CPUs, not expensive server-grade high-performance chips

▪ Match one disk to one CPU core, generally
▪ So for 12 disks, get 12 CPU cores
▪ Adding more cores => add more memory
38
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hardware Considerations
Bottlenecks
▪ You will hit bottlenecks in the following order:
1. Disk I/O throughput
2. Network throughput
3. Running out of memory
4. CPU speed
▪ Therefore, money spent on expensive high-speed CPUs is wasted, and they increase power consumption costs
▪ Disk I/O is best addressed by following the one-disk-per-CPU-core ratio guideline
39
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
3. Hadoop Tools
Ecosystem tools & utilities
40
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hadoop Subprojects & Tools
Overview
41

▪ Distributed Storage (HDFS)
▪ Distributed Processing (MapReduce)
▪ Scripting (Pig)
▪ Query (Hive)
▪ Metadata Management (HCatalog)
▪ NoSQL Column DB (HBase)
▪ Data Extraction & Load (Sqoop, WebHDFS, 3rd-party tools)
▪ Workflow & Scheduling (Oozie)
▪ Machine Learning & Predictive Analytics (Mahout)
▪ Management & Monitoring (Ambari, ZooKeeper, Nagios, Ganglia)
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Pig
What do I need to remember?
▪ High-level dataflow scripting tool
▪ Best for exploring data quickly or prototyping
▪ When you load data, you can:
▪ Specify a schema at load time
▪ Not specify one and let Pig guess
▪ Not specify one and tell Pig later once you work it out
▪ Pig eats any kind of data (hence the name “Pig”)
▪ Pluggable load/save adapters available (e.g. Text, Binary, HBase)
42
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Pig
Sample script
> A = LOAD 'student' USING PigStorage() AS (name:chararray, year:int, gpa:float);
> DUMP A;
(John,2005,4.0F)
(John,2006,3.9F)
(Mary,2006,3.8F)
(Mary,2007,2.7F)
(Bill,2009,3.9F)
(Joe,2010,3.8F)
> B = GROUP A BY name;
> C = FOREACH B GENERATE group, AVG(A.gpa);
> STORE C INTO 'grades' USING PigStorage();
43
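(A script like this can be tried without a cluster using Pig’s local mode, e.g. pig -x local grades.pig, where the file name is hypothetical.)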
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hive
What do I need to remember?
▪ Data Warehouse-like front end & SQL for Hadoop
▪ Best for managing known data with fixed schema
▪ Stores data in HDFS, metadata in a local database (e.g. MySQL)
▪ Supports two table types:
▪ Internal tables (Hive manages the data files & format)
▪ External tables (you manage the data files & format)
▪ Not entirely safe to hand to inexperienced native-SQL users (HiveQL is not fully standard SQL)
44
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Hive
Sample script
> CREATE TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User')
> COMMENT 'This is the page view table'
> PARTITIONED BY(dt STRING, country STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\001'
> STORED AS SEQUENCEFILE;

> SELECT * FROM page_view WHERE viewTime > 1400;

> SELECT * FROM page_view
> JOIN users ON page_view.userid = users.userid
> ORDER BY users.subscription_date;
45
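(Because the table is partitioned by dt and country, a predicate such as WHERE dt = '2014-04-01' AND country = 'GB' (hypothetical values) lets Hive prune to just the matching partitions instead of scanning everything.)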
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Oozie
What do I need to remember?
▪ Workflow engine & scheduler for Hadoop
▪ Best for building repeatable production workflows of
common Hadoop jobs
▪ Great for composing smaller jobs into larger more
complex ones
▪ Stores job metadata in a local database (e.g. MySQL)
▪ Complex recurring schedules easy to create
▪ Some counter-intuitive cluster setup steps required, due
to the way job actions for Hive and Pig are launched
46
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Sqoop
What do I need to remember?
▪ Data import/export tool for Hadoop + external database
▪ Suitable for importing/exporting flat simple tables
▪ Suitable for importing arbitrary query as table
▪ Does not have any configuration or state
▪ Can easily overload a remote database if too much parallelism is requested (think DDoS attack)
▪ Can load directly to/from Hive tables as well as HDFS files
47
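(In practice, cap the parallelism explicitly: Sqoop’s -m / --num-mappers option sets how many parallel map tasks, and therefore concurrent database connections, an import uses.)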
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
4. Summary
48
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Summary
Hadoop as a concept
▪ Driving factors:
▪ Why it was designed the way it was
▪ Why that is important to you as a user
▪ What makes it popular
▪ What it isn’t so suitable for
49
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Summary
Hadoop as a platform
▪ Key concepts:
▪ Distributed Storage
▪ Distributed Computation
▪ Bring the computation to the data
▪ Variety in data formats (store files, analyse in-place later)
▪ Variety in data analysis (more abstract programming
models)
▪ Linear Scalability
50
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Summary
Hadoop Ecosystem
▪ Multiple vendors
▪ Brings choice and competition
▪ Feature differentiators
▪ Variety of licensing models
▪ Rich set of user tools
▪ Exploration
▪ Querying
▪ Workflow & ETL
▪ Distributed Database
▪ Machine Learning
51
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Further Training
▪ Apache Hadoop 2.0 Developing Java Applications (4 days)
▪ Apache Hadoop 2.0 Development for Data Analysts (4
days)
▪ Apache Hadoop 2.0 Operations Management (3 days)
▪ MapR Hadoop Fundamentals of Administration (3 days)
▪ Apache Cassandra DevOps Fundamentals (3 days)
▪ Apache Hadoop Masterclass (1 day)
▪ Big Data Concepts Masterclass (1 day)
▪ Machine Learning at scale with Apache Mahout (1 day)
52
Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Contact Details
Tim Seears
CTO
Big Data Partnership

info@bigdatapartnership.com
@BigDataExperts

Editor’s notes

  1. EDIT
2. Imagine you have a legacy database or DW:
   => Data volumes growing rapidly => Running out of space
   => We scale up a little, cost goes up a little; nothing too serious
   => Scale up a little more, cost/GB starts to look a little concerning
   => Scale up again, suddenly we hit a discontinuity => cost/GB is prohibitive => Or performance drops dramatically => Or maybe you want to go to the TB-PB range and it’s not even possible
   => The problem is, as data increases we have to SCALE UP because of the architecture choices => Cost per GB also increases, so it becomes exponentially more expensive to grow
   => Interestingly, this also broadly holds for the complexity of integrating more business data sources (Variety, a BI problem), not just adding more Volume (a DW problem) => Because again the architecture dictates an increasingly higher cost to add more Variety
   The net result of storing large schema-driven databases is that the individual machines they’re housed on must be high quality, redundant and highly available. This translates into high costs for servers, infrastructure and support.
3. Let’s imagine we could scale OUT, rather than UP. => What would that look like?
   * Assume the database is going to be clustered (but not in the usual sense)
   * What if we cluster only part of the data on each node?
   * What effect would this have on the system? What problems would ensue?
   => sharding => each node only has part of the data => local queries on each node
4. One of the most important properties of Hadoop is that it scales completely linearly, in stark contrast to traditional database systems, which usually suffer from uneven scaling properties. If we start with a cluster of a certain size and increase it by 50% simply by adding new nodes, the distributed filesystem capacity increases by exactly the same percentage, and so does the compute capacity (i.e., how much data we can process *without* increasing the computation time).
   Think about this for a second: the emergence of Hadoop is the first time that virtually unlimited, linear scalability has become available to everyone, for free. That’s actually really important. The only other place you might have achieved something comparable was a proprietary grid computing platform, so those of you with that kind of background may wonder why this matters. It matters because Hadoop makes this mainstream, and that irreversibly changes it from a highly specialist field, where you develop every application from scratch over a long time, into one where other people are producing applications and libraries for you and solving problems for you, so you can leapfrog ahead on your own schedule by leveraging the work of others rather than starting from nothing. History has shown that to be a really positive impact for everyone, regardless of sector.