Scaling Out
Hadoop and NoSQL


    Age Mooij
An Introduction to Dealing with
Big Data
About me...




              @agemooij
Big Data
  ...and me
My Current Project...




           IP Address Registration for
           Europe, Middle East, Russia

           IPv4: 2^32 (4.3×10^9) addresses
           IPv6: 2^128 (3.4×10^38) addresses
Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

                30 billion records per year (4 TB)
                80 million per day / 1,000 per second




        Make it searchable...
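
The per-day and per-second figures follow from the yearly volume. A quick check, assuming a 365-day year and records arriving at a uniform rate (both are assumptions, not stated on the slide):

```python
# Back-of-the-envelope check of the ingest rates quoted on the slide.
# Assumptions: 365-day year, records arriving at a uniform rate.
RECORDS_PER_YEAR = 30_000_000_000

records_per_day = RECORDS_PER_YEAR / 365
records_per_second = records_per_day / (24 * 60 * 60)

print(f"{records_per_day / 1e6:.0f} million records/day")
print(f"{records_per_second:.0f} records/second")
```

The results land close to the rounded "80 million per day / 1,000 per second" on the slide.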
Big Data
  ...and you
Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube
Facebook: 300M users
MySpace: 264M users
LinkedIn: 50M users
Twitter: 45M users
Flickr: 32M users
Marktplaats: 6.5M users, 5.5M ads
Scalability:

         Handling more load / requests
             Handling more data
          Handling more types of data



  ...without anything breaking or falling over
         ...and without going bankrupt
Scaling Up vs Scaling Out
Scaling Out, Part 1

Processing Data
  a.k.a. Data Crunching
Map/Reduce

 Parallel Batch Processing of Data
     Break the data into chunks
       Distribute the chunks
    Process the chunks in parallel
         Merge the results
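
The four steps can be sketched on a single machine. This is a minimal illustration, not Hadoop's API: the word-count task, the function names, and the thread pool standing in for a cluster are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Break the data into chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    chunks = split_into_chunks(lines, chunk_size)
    # A thread pool stands in for the cluster that would receive the chunks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


lines = ["big data", "big clusters", "data data", "scale out"]
totals = word_count(lines)
print(totals["data"], totals["big"])  # prints "3 2"
```

Because each chunk is processed without looking at any other chunk, adding workers (or machines) speeds up the map step without changing the code.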
Hadoop

Reliable, Scalable, Distributed Computing
(written in Java)
Distributed File System (DFS)

    Foundation for all Hadoop projects
        Automatic file replication
Automatic checksumming / error correction
   Based on Google’s File System (GFS)
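
How replication and checksumming work together can be shown with a toy model. This is a sketch of the idea only, not the Hadoop DFS implementation: the block size, the in-memory "replicas", and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4  # bytes; tiny on purpose, real block sizes are far larger


def write_blocks(data, replicas=3):
    """Split data into blocks; keep `replicas` copies and one CRC32 per block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    copies = [[bytes(b) for b in blocks] for _ in range(replicas)]
    checksums = [zlib.crc32(b) for b in blocks]
    return copies, checksums


def read_blocks(copies, checksums):
    """Read each block from the first replica whose checksum still matches."""
    out = bytearray()
    for index, expected in enumerate(checksums):
        for replica in copies:
            if zlib.crc32(replica[index]) == expected:
                out += replica[index]
                break
        else:
            raise IOError(f"block {index} is corrupt on every replica")
    return bytes(out)


copies, checksums = write_blocks(b"hello hadoop")
copies[0][1] = b"XXXX"  # simulate silent corruption on the first replica
recovered = read_blocks(copies, checksums)
print(recovered)  # prints b'hello hadoop'
```

The corrupted copy is detected by its checksum and the read falls back to another replica, which is the behaviour the slide summarises as "error correction".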
Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages
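
The slide does not say which mechanism it has in mind. One commonly used route is Hadoop Streaming, where the mapper and reducer are ordinary programs that exchange tab-separated key/value lines over stdin and stdout. The sketch below imitates that style locally without Hadoop; the function names and the in-process sort are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'key<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort by key, then reduce.
text = ["hadoop hbase", "hadoop"]
result = list(reducer(sorted(mapper(text))))
print(result)  # prints ['hadoop\t2', 'hbase\t1']
```

In a real streaming job the two functions would live in separate scripts reading `sys.stdin`, and the framework would do the sorting between them.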
4TB of raw image TIFF data (stored in S3)
       100 Amazon EC2 instances
          Hadoop Map/Reduce
        11 million finished PDFs
         24 hours, about $240
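
The cost figure can be broken down per machine. A small calculation, assuming all 100 instances ran for the full 24 hours (the slide does not say so explicitly):

```python
# Figures from the slide; the even 24-hour usage per instance is an assumption.
instances = 100
hours = 24
total_cost_usd = 240
pdfs = 11_000_000

instance_hours = instances * hours
cost_per_instance_hour = total_cost_usd / instance_hours
pdfs_per_instance_hour = pdfs / instance_hours

print(f"${cost_per_instance_hour:.2f} per instance-hour")
print(f"{pdfs_per_instance_hour:.0f} PDFs per instance-hour")
```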
Scaling Out, Part 2

Storing & Retrieving Data
       Reads and Writes
Relational Databases
are hard to scale out
Ways to Scale out an RDBMS (1)

    Replication

      Master-Slave:    Good for scaling reads
                       Single point of failure
                       Single bottleneck for writes
      Master-Master:   Limited scaling of writes
                       Complicated
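
A toy model shows why master-slave replication scales reads but not writes. Everything here is hypothetical (class name, in-memory dictionaries, the explicit `replicate` call standing in for asynchronous replication); it is not how any particular RDBMS implements it.

```python
import itertools


class ReplicatedStore:
    """Toy master-slave setup: one writable master, N read-only replicas."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Every write goes through the single master (the write bottleneck).
        self.master[key] = value

    def replicate(self):
        # Stand-in for asynchronous replication: copy master state to replicas.
        for replica in self.replicas:
            replica.update(self.master)

    def read(self, key):
        # Reads are spread round-robin, so adding replicas adds read capacity.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


store = ReplicatedStore(replica_count=3)
store.write("192.0.2.1", "example-record")
stale = store.read("192.0.2.1")  # replicas have not caught up yet
store.replicate()
fresh = store.read("192.0.2.1")
print(stale, fresh)  # prints "None example-record"
```

Adding replicas adds read capacity, but every write still lands on the one master, and a read issued before replication completes can return stale data.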
Ways to Scale out an RDBMS (2)

    Partitioning

      Vertical:     by function / table
      Horizontal:   by key / id (Sharding)

      Not truly relational anymore (application joins)
      Limited scalability (relocating, resharding)
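
The resharding problem is easy to demonstrate. The sketch assumes the simplest possible scheme, `id % shard_count`, with numeric ids; the function names are hypothetical and real systems often use other placement schemes.

```python
def shard_for(record_id, shard_count):
    """Horizontal partitioning: pick a shard from the record's numeric id."""
    return record_id % shard_count


def moved_on_reshard(record_ids, old_count, new_count):
    """Return the ids that land on a different shard after resharding."""
    return [
        rid for rid in record_ids
        if shard_for(rid, old_count) != shard_for(rid, new_count)
    ]


ids = range(1000)
moved = moved_on_reshard(ids, old_count=4, new_count=5)
print(f"{len(moved)} of {len(ids)} records must be relocated")
# prints "800 of 1000 records must be relocated"
```

Going from four shards to five relocates 80% of the records under this scheme, which is why adding capacity to a sharded RDBMS is expensive.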
Why are RDBMSs so hard to scale out?
Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance   ...pick any two
Relational      Non-Relational

ACID     vs     BASE

Atomic          Basic Availability
Consistent      Soft State
Isolated        Eventual Consistency
Durable
NoSQL, not "NO-SQL"

 Non-Relational Databases

    Not better, different
Types of NoSQL

(Distributed) Key-Value:   Redis, Voldemort, Scalaris (D)
Document Oriented:         CouchDB, MongoDB, Riak (D)
Column Oriented:           Cassandra (D), HBase (D)
Graph Oriented:            Neo4J

(D) = Distributed (automatic scaling out)
RIPE NCC
Experiences so far...
Those Big Numbers Again...


10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

                  30 billion records per year (4 TB)
                  80 million per day / 1,000 per second




                       Make it searchable...
~ 200 000 000 000 records
        |
   Map / Reduce
        |
        v
~ 15 000 000 000 records
Our Data is 3D

IP Address  1 ---- 0..*  Record  1 ---- 0..*  Timestamp

Best fit & performance: Column Oriented

Row | Column Name (!) | Values (!)
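
One way to read the slide is that the IP address becomes the row key, each record becomes a column name, and the timestamps become the values. That mapping is an interpretation of the layout, and the sketch below models it with plain dictionaries rather than any real database client; the helper names and sample data (documentation-range address and AS numbers) are made up.

```python
from collections import defaultdict

# row key -> column name -> list of values
# Assumed mapping: IP address = row, record = column name, timestamps = values.
table = defaultdict(lambda: defaultdict(list))


def put(ip_address, record, timestamp):
    """Store one timestamped sighting of a record under its IP address row."""
    table[ip_address][record].append(timestamp)


def get_row(ip_address):
    """Fetch everything known about one IP address with a single row lookup."""
    return {record: sorted(times) for record, times in table[ip_address].items()}


put("192.0.2.1", "route-via-AS64496", 1254355200)
put("192.0.2.1", "route-via-AS64496", 1254441600)
put("192.0.2.1", "route-via-AS64511", 1254528000)
row = get_row("192.0.2.1")
print(row)
```

Using the record itself as the column name means new records need no schema change, and one row lookup returns the full history for an address.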
Cassandra

Used by: Facebook, Twitter, Digg

  Tunable: Availability vs Consistency
  Very active community
  Version 0.4.1
  No documentation
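
The availability/consistency trade-off is usually explained with replica quorums: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap when R + W > N. The sketch shows that general rule only; it is not Cassandra's API, and the function name is hypothetical.

```python
def reads_see_latest_write(n, w, r):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    return r + w > n


N = 3  # replicas per key
for w, r in [(1, 1), (2, 2), (3, 1)]:
    if reads_see_latest_write(N, w, r):
        label = "reads overlap the latest write"
    else:
        label = "eventual consistency only"
    print(f"N={N} W={w} R={r}: {label}")
```

Small W and R keep the system available when replicas are down; larger values buy consistency at the cost of waiting for more replicas.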
HBase

Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy

Built on top of Hadoop DFS
Very active community
Version 0.20.1
Good documentation
Initial Results:
   Tested on an EC2 cluster of 8 XLarge instances

   3.8 B records (23 GB)  ->  33 M records (1 GB):   5 hours
   33 M records (1 GB)    ->  15 GB:                 75 minutes
       Record duplication: 6x
       44,000 inserts/second

   "Needle in a haystack" full on-disk table scan: 0.5 M records/second
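
The insert rate can be cross-checked against the other numbers on the slide. The calculation assumes that each of the 33 M records is written six times (one insert per duplicate) over the 75 minutes; that reading is an interpretation, not something the slide states.

```python
# Figures from the slide; "one insert per duplicated record" is an assumption.
records = 33_000_000
duplication_factor = 6
load_seconds = 75 * 60

inserts = records * duplication_factor
inserts_per_second = inserts / load_seconds

print(f"{inserts:,} inserts at {inserts_per_second:,.0f} inserts/second")
# prints "198,000,000 inserts at 44,000 inserts/second"
```

Under that reading the quoted 44,000 inserts/second is an exact match.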
In order to choose the right
  scaling tools, you need to:
       Understand your data
Know what you want to query and how
Big Data
   ...Be Prepared!
val shameless = <SelfPromotion>
    Try some Scala in the basement!
</SelfPromotion>

Scaling Out With Hadoop And HBase
