Scaling Out With Hadoop And HBase


A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.


An Introduction to Dealing with

Big Data

My Current Project...

IP Address Registration for
Europe, Middle East, Russia

IPv4: 2^32 (4.3×10^9) addresses
IPv6: 2^128 (3.4×10^38) addresses
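
The two address-space sizes can be checked directly. A minimal sketch; the variable names are illustrative only:

```python
# Address-space sizes quoted on the slide, computed exactly.
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

# Scientific notation makes the comparison with the slide's figures easy.
print(f"IPv4: {ipv4_addresses:,} addresses (about {ipv4_addresses:.1e})")
print(f"IPv6: about {ipv6_addresses:.1e} addresses")
```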

Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)
80 million per day / 1,000 per second

Make it searchable...
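
As a quick sanity check on these figures, the daily and per-second rates follow from the yearly total. The sketch below assumes a 365-day year and decimal terabytes (10^12 bytes); neither assumption is stated on the slide.

```python
RECORDS_PER_YEAR = 30_000_000_000   # from the slide
BYTES_PER_YEAR = 4 * 10**12         # 4 TB, assuming decimal terabytes

records_per_day = RECORDS_PER_YEAR / 365
records_per_second = records_per_day / (24 * 60 * 60)
bytes_per_record = BYTES_PER_YEAR / RECORDS_PER_YEAR

print(f"{records_per_day / 1e6:.0f} million records per day")
print(f"{records_per_second:.0f} records per second")
print(f"{bytes_per_record:.0f} bytes per record on average")
```

Both rates round to the slide's "80 million per day / 1,000 per second".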

Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube
Facebook: 300M users
MySpace: 264M users
LinkedIn: 50M users
Twitter: 45M users
Flickr: 32M users
Marktplaats: 6.5M users, 5.5M ads

Scalability:

Handling more load / requests
Handling more data
Handling more types of data

...without anything breaking or falling over
...and without going bankrupt

Scaling Up vs Scaling Out

Scaling Out, Part 1

Processing Data
a.k.a. Data Crunching

Map/Reduce

Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
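
The four steps above can be sketched in a few lines. This is a toy illustration, not Hadoop's API: threads stand in for cluster nodes, the function names are made up for this example, and the sample records use documentation-only IP addresses.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the data into chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count how often each address occurs within one chunk."""
    return Counter(address for address, _timestamp in chunk)


def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def count_records_per_address(records, chunk_size=2, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Threads stand in for the nodes a real cluster would distribute chunks to.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partial_counts, Counter())


records = [
    ("192.0.2.1", "2009-01-01"),
    ("192.0.2.2", "2009-01-01"),
    ("192.0.2.1", "2009-06-01"),
    ("192.0.2.3", "2009-06-01"),
    ("192.0.2.1", "2009-11-11"),
]
totals = count_records_per_address(records)
print(totals)
```

Because each chunk is processed independently and the merge is associative, adding more workers (or machines) does not change the result, only the elapsed time.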

Hadoop: Reliable, Scalable, Distributed Computing

(written in Java)

Distributed File System (DFS)

Foundation for all Hadoop projects
Automatic file replication
Automatic checksumming / error correction
Based on the Google File System (GFS)
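
To make replication and checksumming concrete, here is a toy in-memory sketch. It is not how Hadoop DFS is implemented: the round-robin placement, the tiny block size, and all function names are assumptions for illustration, and CRC32 from `zlib` is just a convenient checksum.

```python
import zlib


def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index, nodes, replication):
    """Pick `replication` distinct nodes round-robin (hypothetical policy)."""
    return [nodes[(block_index + r) % len(nodes)] for r in range(replication)]


def store(data, nodes, block_size=8, replication=3):
    """Write every block to several nodes and remember its checksum."""
    storage = {node: {} for node in nodes}
    checksums = []
    for index, block in enumerate(split_into_blocks(data, block_size)):
        checksums.append(zlib.crc32(block))
        for node in place_replicas(index, nodes, replication):
            storage[node][index] = block
    return storage, checksums


def read_block(index, storage, checksums):
    """Return the first replica whose checksum matches, skipping corrupt copies."""
    for blocks in storage.values():
        block = blocks.get(index)
        if block is not None and zlib.crc32(block) == checksums[index]:
            return block
    raise IOError(f"no valid replica for block {index}")


data = b"historical routing data"
nodes = ["node-a", "node-b", "node-c", "node-d"]
storage, checksums = store(data, nodes)

# Simulate a corrupted replica: the read falls back to another copy.
storage["node-a"][0] = b"garbage!"
recovered = b"".join(read_block(i, storage, checksums) for i in range(len(checksums)))
print(recovered == data)
```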

Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages

4TB of raw image TIFF data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240
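
The cost figure can be unpacked with simple arithmetic. The hourly rate below is not stated on the slide; it is merely what the other numbers imply.

```python
instances = 100          # from the slide
hours = 24               # from the slide
total_cost_usd = 240     # "about $240"
finished_pdfs = 11_000_000

instance_hours = instances * hours
implied_rate_usd = total_cost_usd / instance_hours      # derived, not quoted
pdfs_per_instance_hour = finished_pdfs / instance_hours

print(f"{instance_hours:,} instance-hours")
print(f"${implied_rate_usd:.2f} per instance-hour (implied)")
print(f"{pdfs_per_instance_hour:,.0f} PDFs per instance-hour")
```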

Scaling Out, Part 2

Storing & Retrieving Data
Reads and Writes

Relational Databases
are hard to scale out

Ways to Scale out an RDBMS (1)

Replication
Good for scaling reads
Master-Slave: single point of failure, single point of bottleneck
Master-Master: limited scaling of writes, complicated
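
A toy model shows why master-slave replication helps reads but not writes. The class and method names are hypothetical, and real replication runs asynchronously in the background rather than as an explicit call.

```python
import itertools


class ReplicatedStore:
    """Toy master-slave setup: one writable master, several read-only replicas."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Every write goes through the single master: the write bottleneck.
        self.master[key] = value

    def replicate(self):
        # Copy the master's state to every replica.
        for replica in self.replicas:
            replica.update(self.master)

    def read(self, key):
        # Reads are spread round-robin over the replicas.
        return self.replicas[next(self._next_replica)].get(key)


store = ReplicatedStore(replica_count=3)
store.write("192.0.2.1", "record-v1")
stale = store.read("192.0.2.1")    # replicas have not caught up yet
store.replicate()
fresh = store.read("192.0.2.1")
print(stale, fresh)
```

Adding replicas increases read capacity, but every write still lands on the one master, and reads can be stale until replication catches up.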

Ways to Scale out an RDBMS (2)

Partitioning
Vertical: by function / table
Horizontal: by key / id (Sharding)

Not truly Relational anymore (application joins)
Limited Scalability (relocating, resharding)
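
The resharding problem is easy to demonstrate. The sketch below assumes the simplest scheme, hash of the key modulo the number of shards; the key names and shard counts are made up.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard using a stable hash (built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = [f"user-{n}" for n in range(10_000)]
before = {key: shard_for(key, 4) for key in keys}
after = {key: shard_for(key, 5) for key in keys}

# Count the keys that end up on a different shard after adding one shard.
moved = sum(1 for key in keys if before[key] != after[key])
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

With plain modulo sharding, adding a single shard relocates most of the data, which is why resharding limits scalability in practice.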

Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance
...pick any two

Relational vs Non-Relational

ACID
Atomic
Consistent
Isolated
Durable

BASE
Basic Availability
Soft State
Eventual Consistency

NoSQL NO-SQL

Non-Relational Databases

Better Different

Types of NoSQL

(Distributed) Key-Value
Redis
Voldemort
Scalaris (D)

Document Oriented
CouchDB
MongoDB
Riak (D)

Column Oriented
Cassandra (D)
HBase (D)

Graph Oriented
Neo4J

(D) = Distributed (automatic scaling out)

Those Big Numbers Again...

10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)
80 million per day / 1,000 per second

Make it searchable...

~ 200 000 000 000 records

Map / Reduce

~ 15 000 000 000 records

Our Data is 3D

IP Address 1 : 0..* Record
Record 1 : 0..* Timestamp

Best fit & performance:
Column Oriented

Row Column Name (!) Values (!)
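
One plausible reading of this slide is that the IP address becomes the row key, each record becomes a column name, and the timestamps become the values. That mapping is an assumption, and the nested dictionaries below only mimic the shape of a column-oriented table; they are not the HBase or Cassandra API. The sample addresses and AS numbers are documentation-only values.

```python
from collections import defaultdict

# Assumed layout: row key = IP address, column name = record, values = timestamps.
table = defaultdict(lambda: defaultdict(list))


def put(address, record, timestamp):
    """Store one observation of `record` for `address` at `timestamp`."""
    table[address][record].append(timestamp)


def records_seen(address, start, end):
    """Return records for `address` with at least one timestamp in [start, end]."""
    row = table.get(address, {})
    return sorted(
        record
        for record, timestamps in row.items()
        if any(start <= stamp <= end for stamp in timestamps)
    )


put("192.0.2.1", "origin AS64496", "2008-03-01")
put("192.0.2.1", "origin AS64496", "2009-03-01")
put("192.0.2.1", "origin AS64511", "2009-10-01")

in_2008 = records_seen("192.0.2.1", "2008-01-01", "2008-12-31")
in_2009 = records_seen("192.0.2.1", "2009-01-01", "2009-12-31")
print(in_2008, in_2009)
```

Using the record itself as the column name means a row can grow an arbitrary number of columns, which is the property that sets column-oriented stores apart from fixed-schema tables.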

Cassandra
Users: Facebook, Twitter, Digg
Tunable: Availability vs Consistency
Very active community
Version 0.4.1
No documentation
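
The "tunable" trade-off is commonly explained with quorum arithmetic: with N replicas, a write acknowledged by W nodes and a read that consults R nodes are guaranteed to overlap when R + W > N. The sketch below checks that rule by brute force; the function names are illustrative and not part of Cassandra's API.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """Brute force: does every write set of size w share a node with every read set of size r?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def rule_of_thumb(n, w, r):
    """The usual shortcut: overlap is guaranteed when r + w > n."""
    return r + w > n


for w, r in [(1, 1), (2, 2), (3, 1)]:
    print(f"N=3 W={w} R={r}: overlap guaranteed = {quorums_always_overlap(3, w, r)}")
```

Small W and R favour availability and latency; larger values favour consistency.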

HBase
Users: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy
Built on top of Hadoop DFS
Very active community
Version 0.20.1
Good documentation

Initial Results:
Tested on an EC2 cluster of 8 XLarge instances

3.8 B records (23 GB) → 33 M records (1 GB): 5 hours
33 M records (1 GB) → 15 GB: 75 minutes, 44,000 inserts/second
Record duplication: 6x
“Needle in a haystack” full on-disk table scan: 0.5 M records/second
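
These numbers are consistent with each other if the 75 minutes is read as the time needed to insert the 33 M records, each stored 6 times. That reading is an interpretation of the slide layout rather than something the slide states.

```python
records_loaded = 33_000_000      # 33 M records (from the slide)
duplication_factor = 6           # "Record duplication: 6x"
load_seconds = 75 * 60           # 75 minutes

inserts = records_loaded * duplication_factor
inserts_per_second = inserts / load_seconds

print(f"{inserts:,} inserts in {load_seconds:,} s = {inserts_per_second:,.0f} inserts/second")
```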

In order to choose the right
scaling tools, you need to:
Understand your data
Know what you want to query and how

val shameless = <SelfPromotion>

Try some Scala in the basement !

</SelfPromotion>

Scaling Out With Hadoop And HBase

1. Scaling Out Hadoop and NoSQL Age Mooij

2. An Introduction to Dealing with Big Data

3. About me... @agemooij

4. Big Data ...and me

5. My Current Project... IP Address Registration for Europe, Middle East, Russia. IPv4: 2^32 (4.3×10^9) addresses. IPv6: 2^128 (3.4×10^38) addresses

6. Challenge: 10 years of historical registration/routing data in flat files; 200+ billion (!) historical data records (25 TB); 30 billion records per year (4 TB); 80 million per day / 1,000 per second. Make it searchable...

7. Big Data ...and you

8. Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube; Facebook: 300M users; MySpace: 264M users; LinkedIn: 50M users; Twitter: 45M users; Flickr: 32M users; Marktplaats: 6.5M users, 5.5M ads

9. Scalability: Handling more load / requests Handling more data Handling more types of data ...without anything breaking or falling over ...and without going bankrupt

10. Scaling Up vs Scaling Out

11. Scaling Out, Part 1 Processing Data a.k.a. Data Crunching

12. Map/Reduce Parallel Batch Processing of Data Break the data into chunks Distribute the chunks Process the chunks in parallel Merge the results

13. Reliable, Scalable, Distributed Computing (written in Java)

14. Distributed File System (DFS): Foundation for all Hadoop projects; Automatic file replication; Automatic checksumming / error correction; Based on the Google File System (GFS)

15. Map / Reduce: Simple Java API; Powerful supporting framework; Powerful tools; Good support for non-Java languages

16.

17. 4TB of raw image TIFF data (stored in S3); 100 Amazon EC2 instances; Hadoop Map/Reduce; 11 million finished PDFs; 24 hours, about $240

18. Scaling Out, Part 2: Storing & Retrieving Data, Reads and Writes

19. Relational Databases are hard to scale out

20. Ways to Scale out an RDBMS (1): Replication. Good for scaling reads. Master-Slave: single point of failure, single point of bottleneck. Master-Master: limited scaling of writes, complicated

21. Ways to Scale out an RDBMS (2): Partitioning. Vertical: by function / table. Horizontal: by key / id (Sharding). Not truly Relational anymore (application joins). Limited Scalability (relocating, resharding)

22. Why are RDBMSs so hard to scale out

23. Brewer’s CAP Theorem Consistency Availability Partition Tolerance ...pick any two

24. Relational vs Non-Relational: ACID (Atomic, Consistent, Isolated, Durable) vs BASE (Basic Availability, Soft State, Eventual Consistency)

25. NoSQL NO-SQL Non-Relational Databases Better Different

26. Types of NoSQL: (Distributed) Key-Value: Redis, Voldemort, Scalaris (D). Document Oriented: CouchDB, MongoDB, Riak (D). Column Oriented: Cassandra (D), HBase (D). Graph Oriented: Neo4J. (D) = Distributed (automatic scaling out)

27. RIPE NCC Experiences so far...

28. Those Big Numbers Again... 10 years of historical data in flat files; 200+ billion (!) historical data records (25 TB); 30 billion records per year (4 TB); 80 million per day / 1,000 per second. Make it searchable...

29. ~ 200 000 000 000 records Map / Reduce ~ 15 000 000 000 records

30. Our Data is 3D: IP Address 1 : 0..* Record; Record 1 : 0..* Timestamp. Best fit & performance: Column Oriented. Row, Column Name (!), Values (!)

31. Cassandra. Users: Facebook, Twitter, Digg. Tunable: Availability vs Consistency; Very active community; Version 0.4.1; No documentation

32. HBase. Users: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy. Built on top of Hadoop DFS; Very active community; Version 0.20.1; Good documentation

33. Initial Results: Tested on an EC2 cluster of 8 XLarge instances. 3.8 B records (23 GB) → 33 M records (1 GB): 5 hours. 33 M records (1 GB) → 15 GB: 75 minutes, 44,000 inserts/second (record duplication: 6x). “Needle in a haystack” full on-disk table scan: 0.5 M records/second

34. In order to choose the right scaling tools, you need to: Understand your data Know what you want to query and how

35. Big Data ...Be Prepared !

36. val shameless = <SelfPromotion> Try some Scala in the basement ! </SelfPromotion>
