SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Analyzing the internet
using Hadoop and Hbase

               Friso van Vollenhoven
               fvanvollenhoven@xebia.com
+


Friso van Vollenhoven
fvanvollenhoven@xebia.com
+



             =      One of ve Regional Internet
                    Registrars




  Allocate IP address ranges to organizations
      (e.g. ISPs, network operators, etc.)

   Provide near real time network
information based on measurements
WEB 5.0 SERVICE

            PowerBook G4




   THE
INTERNET
WEB 5.0 SERVICE
                 AS

              T-Mobile                                                        PowerBook G4




                            BG
 AS                             P
                                                AS
KPN
                                              AMS-IX

                 BGP
       BG
         P




                                                            BG
       BG




         P                                                       P
                            B   GP


                 AS                                                      AS

               Level3                                                XS4All

                                 P
        BGP

                                BG
  AS

 DT                                  BG
                                          P
                   BGP




                                                       AS

                       AS                            BT
                 AT&T
Our Data
  BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||




                                  I can take trafc for anything in the
                                    range 192.37.0.0 - 192.37.255.255
                                   and deliver it through the network
                                    route 3333 -> 5378 -> 286 -> 1836


                                                                      Got
                              ROUTER
                                                                      it!

                AS3333                                               ROUTER
                                                                              PEER
                                           Thanks!
                                                                               AS
                                               RIPE NCC
                                               ROUTER
                                                              PEER
                                                               AS
(not really a router!)
Our Data

• 15 active data collection points (the not
  really routers), collecting updates from
  ~600 peers
• BGP updates every 5 minutes, ~80M / day
• Routing table dumps every 8 hours, ~10M
  entries / dump (160M index updates)
FAQ
•   What are all updates that are somehow related to
    AS3333 for the past two weeks?
•   How many updates per second were announcing
    prexes originated by AS286 between 11:00 and
    12:00 on February 2nd in 2005?
•   Which AS adjacencies have existed for AS3356 in
    the past year?
•   What was the size of the routing table of the peer
    router at 193.148.15.140 two years ago?
Not so frequently asked:
What happened in Egypt?




     http://stat.ripe.net/egypt
THE INTERNET


                                                          DATA COLLECTOR
                         DATA COLLECTOR




                                 DATA COLLECTOR                 DATA COLLECTOR                ONLINE QUERY SYSTEM
                                                                                         -+                             RIPE.net                                              +
                                                                                                                                                      Home | Login | Register |
                                                                                                                                                                      Contact

                                                                                                RIS DATA
                                                                                              “RDX Wall Art: The
                                                                                                                     - “RDX Wall Art: The Making
                                                                                                                     Of” iand new short
                                                                                              Making Of” is a new
                                                                                                                     documentary iand new short isa
                                                                                              short documentary
                                                                                              Blog
                                                                                              highlighting
                                                                                                                     Forun
                                                                                                                     new short
                                                                                                                     - isa new short documentary
push to central server                                                                                               - highlighting iand new sho
                                                                                                                     documentary
                                                                                              some of the pioneers
                                                                                                                     - some of the pioneers
                                                                                              highlighting
                                                                                                                     highlighting iand new sho
                                                                                              more ...
                                                                                                                     more ...



                 raw data
                repository
               (le system                                                       query
                  based)
                                                                                                           Pages 1 2 3 4 5 6 7 . . .
                                                                                                                 120 121 122

  copy to HDFS



                                          CLUSTER
                  source data
                   on HDFS



                             <<MapReduce>>
                              create datasets        <<MapReduce>>
                                                     insert into HBase
                                            intermediate
                                              intermediate
                                               data on
                                                 intermediate
                                                HDFS on
                                                  data
                                                   HDFS on
                                                    data
                                                     HDFS
Import
                                           <<mapreduce>>
                                           HBASE INSERT
                                               JOB




 UPDATES                   <<mapreduce>>
 SOURCE     COPY TO HDFS      UPDATE
  FILES                     IMPORT JOB




                                           <<mapreduce>>
                                                           <<mapreduce>>
                                           ADJACENCIES
                                                           HBASE INSERT
                                            DERIVED SET
                                                               JOB
                                            CREATE JOB




TABLEDUMP                  <<mapreduce>>   <<mapreduce>>                   HBase
  SOURCE    COPY TO HDFS    TABLEDUMP      HBASE INSERT
   FILES                    IMPORT JOB         JOB




                           <<mapreduce>>
                                           <<mapreduce>>
                              PEERS
                                           HBASE INSERT
                            DERIVED SET
                                               JOB
                            CREATE JOB
Import

• Hadoop MapReduce for parallelization and
  fault tolerance
• Chain jobs to create derived datasets
• Database store operations are idempotent
BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||
       980099497 A|193.148.15.68|3333|192.37.0.0/16|3333          1836|IGP|193.148.15.140




                      RECORD
                prex: 193.92.140/21                            TIMELINE
              origin IP: 193.148.15.68                T1, T2, T3, T4, T5, T6, T7, etc.
            AS path: 3333 5378 286 1836




    Inserting new data points means read-modify-write:
    • find record
    • fetch timeline
    • merge timeline with new data
    • write merged timeline
Schema and
      representation
 ROW KEY           CF: DATA           CF: META

                [record]:[timeline]
                [record]:[timeline]
[index bytes]   [record]:[timeline]
                                      exists = Y
                [record]:[timeline]
                [record]:[timeline]
                [record]:[timeline]
[index bytes]   [record]:[timeline]
                                      exists = Y
                [record]:[timeline]
Schema and
            representation
•   Multiple indexes; data is duplicated for each index
•   Protobuf encoded records and timelines
•   Timelines are timestamps or pairs of timestamps denoting validity
    intervals
•   Data is partitioned over time segments (index contains time part)
•   Indexes have an additional discriminator byte to avoid extremely
    wide rows
•   Keep an in-memory representation of the key space for
    backwards traversal
•   Keep data hot spot (most recent two to three weeks) in HBase
    block cache
Schema and
          representation
  Strategy:
• Make sure data hotspot fits in HBase block
  cache
• Scale out seek operations for (higher
  latency) access to old data
  Result:
• 10x 10K RPM disk per worker node
• 16GB heap space for region servers
{ "select": ["META_DATA", "BLOB", "TIMELINE"],
                                                           HTTP POST

                                                                        Query
    "data_class": "RIS_UPDATE",
    "where": {
        "composite": {
            "primary": {


                                                                        Server
                "time_index": {
                    "from": "08:00:00.000 05-02-2008",
                    "to": "18:00:00.000 05-05-2008"}
            },
            "secondary": {
                "ipv4": {
                    "match": "ALL_LEVELS_MORE_SPECIFIC",               PROTOBUF
                    "exact_match": "INCLUDE_EXACT",
                    "resource": "193.0.0.0/8"                              or
                }                                                        JSON
            }
        }
    },


                                                                         Client
    "combine_timeline": true
}




• In-memory index of all resources (IPs and AS numbers)
• Internally generates lists ranges to scan that match query predicates
• Combines time partitions into a single timeline before returning to
    client
•   Runs on every RS behind a load balancer
•   Knows which data exists
Hadoop and HBase:
 some experiences
Blocking writes because of memstore
flushing and compactions.
Tuned these (amongst others):
hbase.regionserver.global.memstore.upperLimit
hbase.regionserver.global.memstore.lowerLimit
hbase.hregion.memstore.flush.size
hbase.hregion.max.filesize
hbase.hstore.compactionThreshold
hbase.hstore.blockingStoreFiles
hbase.hstore.compaction.max
Hadoop and HBase:
       some experiences
      Dev machines: dual quad core, 32GB RAM,
      4x 750GB 5.4K RPM data disks, 150GB OS
      disk
      Prod machines: dual quad core, 64GB RAM,
      10x 600GB 10K RPM data disks,
      150GB OS disk

See a previously IO bound job become CPU bound
because of expanded IO capacity.
Hadoop and HBase:
        some experiences
       Garbage collecting a 16GB heap can take
       quite a long time.


Using G1*. Full collections still take seconds, but no
missed heartbeats.


                                 * New HBase versions have a x / feature that
                                 works with the CMS collector. RIPE NCC is
                                 looking into upgrading.
Hadoop and HBase:
         some experiences
       Estimating required capacity and machine
       proles is hard.


Pick a starting point. Monitor. Tune. Scale. In that order.
Hadoop and HBase:
         some experiences
        Other useful tools

•   Ganglia
•   CFEngine / Puppet
•   Splunk
•   Bash
Hadoop and HBase:
  some experiences

 Awesome
community!
(mailing lists are a big help when in trouble)
Q&A?
  Friso van Vollenhoven
  fvanvollenhoven@xebia.com

Weitere ähnliche Inhalte

Ähnlich wie Berlin Buzzwords preso

Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Kapil Pendse
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組みRyousei Takano
 
Ip Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerIp Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerJim Jenkins
 
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolBGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolAPNIC
 
RPKI and Me
RPKI and MeRPKI and Me
RPKI and MeMyNOG
 
evolution towards NGN
evolution towards NGNevolution towards NGN
evolution towards NGNAJAL A J
 

Ähnlich wie Berlin Buzzwords preso (9)

IPTV lecture
IPTV lectureIPTV lecture
IPTV lecture
 
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
Remote Monitoring and Surveillance as a Value Added Service - Proposed By 3 J...
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み
 
Ip Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course SamplerIp Networking Over Satelite Course Sampler
Ip Networking Over Satelite Course Sampler
 
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and ToolBGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
BGP Scanner - Isolario BGP-MRT Data Reader C Library and Tool
 
RPKI and Me
RPKI and MeRPKI and Me
RPKI and Me
 
Observer ts
Observer tsObserver ts
Observer ts
 
Observer ts
Observer tsObserver ts
Observer ts
 
evolution towards NGN
evolution towards NGNevolution towards NGN
evolution towards NGN
 

Mehr von fvanvollenhoven

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!fvanvollenhoven
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collectorfvanvollenhoven
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentationfvanvollenhoven
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupfvanvollenhoven
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jfvanvollenhoven
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksfvanvollenhoven
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshopfvanvollenhoven
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 

Mehr von fvanvollenhoven (9)

Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!Xebicon 2015 - Go Data Driven NOW!
Xebicon 2015 - Go Data Driven NOW!
 
Prototyping online ML with Divolte Collector
Prototyping online ML with Divolte CollectorPrototyping online ML with Divolte Collector
Prototyping online ML with Divolte Collector
 
Divolte Collector - meetup presentation
Divolte Collector - meetup presentationDivolte Collector - meetup presentation
Divolte Collector - meetup presentation
 
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup groupApache Spark talk @ The Amsterdam Applied Machine Learning meetup group
Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
 
RuG Guest Lecture
RuG Guest LectureRuG Guest Lecture
RuG Guest Lecture
 
Network analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4jNetwork analysis with Hadoop and Neo4j
Network analysis with Hadoop and Neo4j
 
NoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networksNoSQL War Stories preso: Hadoop and Neo4j for networks
NoSQL War Stories preso: Hadoop and Neo4j for networks
 
JFall 2011 no sql workshop
JFall 2011 no sql workshopJFall 2011 no sql workshop
JFall 2011 no sql workshop
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 

KĂźrzlich hochgeladen

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vĂĄzquez
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

KĂźrzlich hochgeladen (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Berlin Buzzwords preso

  • 1. Analyzing the internet using Hadoop and Hbase Friso van Vollenhoven fvanvollenhoven@xebia.com
  • 3. + = One of ve Regional Internet Registrars Allocate IP address ranges to organizations (e.g. ISPs, network operators, etc.) Provide near real time network information based on measurements
  • 4. WEB 5.0 SERVICE PowerBook G4 THE INTERNET
  • 5. WEB 5.0 SERVICE AS T-Mobile PowerBook G4 BG AS P AS KPN AMS-IX BGP BG P BG BG P P B GP AS AS Level3 XS4All P BGP BG AS DT BG P BGP AS AS BT AT&T
  • 6. Our Data BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG|| I can take trafc for anything in the range 192.37.0.0 - 192.37.255.255 and deliver it through the network route 3333 -> 5378 -> 286 -> 1836 Got ROUTER it! AS3333 ROUTER PEER Thanks! AS RIPE NCC ROUTER PEER AS (not really a router!)
  • 7. Our Data • 15 active data collection points (the not really routers), collecting updates from ~600 peers • BGP updates every 5 minutes, ~80M / day • Routing table dumps every 8 hours, ~10M entries / dump (160M index updates)
  • 8. FAQ • What are all updates that are somehow related to AS3333 for the past two weeks? • How many updates per second were announcing prexes originated by AS286 between 11:00 and 12:00 on February 2nd in 2005? • Which AS adjacencies have existed for AS3356 in the past year? • What was the size of the routing table of the peer router at 193.148.15.140 two years ago?
  • 9. Not so frequently asked: What happened in Egypt? http://stat.ripe.net/egypt
  • 10. THE INTERNET DATA COLLECTOR DATA COLLECTOR DATA COLLECTOR DATA COLLECTOR ONLINE QUERY SYSTEM -+ RIPE.net + Home | Login | Register | Contact RIS DATA “RDX Wall Art: The - “RDX Wall Art: The Making Of” iand new short Making Of” is a new documentary iand new short isa short documentary Blog highlighting Forun new short - isa new short documentary push to central server - highlighting iand new sho documentary some of the pioneers - some of the pioneers highlighting highlighting iand new sho more ... more ... raw data repository (le system query based) Pages 1 2 3 4 5 6 7 . . . 120 121 122 copy to HDFS CLUSTER source data on HDFS <<MapReduce>> create datasets <<MapReduce>> insert into HBase intermediate intermediate data on intermediate HDFS on data HDFS on data HDFS
  • 11. Import <<mapreduce>> HBASE INSERT JOB UPDATES <<mapreduce>> SOURCE COPY TO HDFS UPDATE FILES IMPORT JOB <<mapreduce>> <<mapreduce>> ADJACENCIES HBASE INSERT DERIVED SET JOB CREATE JOB TABLEDUMP <<mapreduce>> <<mapreduce>> HBase SOURCE COPY TO HDFS TABLEDUMP HBASE INSERT FILES IMPORT JOB JOB <<mapreduce>> <<mapreduce>> PEERS HBASE INSERT DERIVED SET JOB CREATE JOB
  • 12. Import • Hadoop MapReduce for parallelization and fault tolerance • Chain jobs to create derived datasets • Database store operations are idempotent
  • 13. BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG|| 980099497 A|193.148.15.68|3333|192.37.0.0/16|3333 1836|IGP|193.148.15.140 RECORD prex: 193.92.140/21 TIMELINE origin IP: 193.148.15.68 T1, T2, T3, T4, T5, T6, T7, etc. AS path: 3333 5378 286 1836 Inserting new data points means read-modify-write: • nd record • fetch timeline • merge timeline with new data • write merged timeline
  • 14. Schema and representation ROW KEY CF: DATA CF: META [record]:[timeline] [record]:[timeline] [index bytes] [record]:[timeline] exists = Y [record]:[timeline] [record]:[timeline] [record]:[timeline] [index bytes] [record]:[timeline] exists = Y [record]:[timeline]
  • 15. Schema and representation • Multiple indexes; data is duplicated for each index • Protobuf encoded records and timelines • Timelines are timestamps or pairs of timestamps denoting validity intervals • Data is partitioned over time segments (index contains time part) • Indexes have an additional discriminator byte to avoid extremely wide rows • Keep an in-memory representation of the key space for backwards traversal • Keep data hot spot (most recent two to three weeks) in HBase block cache
  • 16. Schema and representation Strategy: • Make sure data hotspot ts in HBase block cache • Scale out seek operations for (higher latency) access to old data Result: • 10x 10K RPM disk per worker node • 16GB heap space for region servers
  • 17. { "select": ["META_DATA", "BLOB", "TIMELINE"], HTTP POST Query "data_class": "RIS_UPDATE", "where": { "composite": { "primary": { Server "time_index": { "from": "08:00:00.000 05-02-2008", "to": "18:00:00.000 05-05-2008"} }, "secondary": { "ipv4": { "match": "ALL_LEVELS_MORE_SPECIFIC", PROTOBUF "exact_match": "INCLUDE_EXACT", "resource": "193.0.0.0/8" or } JSON } } }, Client "combine_timeline": true } • In-memory index of all resources (IPs and AS numbers) • Internally generates lists ranges to scan that match query predicates • Combines time partitions into a single timeline before returning to client • Runs on every RS behind a load balancer • Knows which data exists
  • 18. Hadoop and HBase: some experiences Blocking writes because of memstore flushing and compactions. Tuned these (amongst others): hbase.regionserver.global.memstore.upperLimit hbase.regionserver.global.memstore.lowerLimit hbase.hregion.memstore.flush.size hbase.hregion.max.filesize hbase.hstore.compactionThreshold hbase.hstore.blockingStoreFiles hbase.hstore.compaction.max
  • 19. Hadoop and HBase: some experiences Dev machines: dual quad core, 32GB RAM, 4x 750GB 5.4K RPM data disks, 150GB OS disk Prod machines: dual quad core, 64GB RAM, 10x 600GB 10K RPM data disks, 150GB OS disk See a previously IO bound job become CPU bound because of expanded IO capacity.
  • 20. Hadoop and HBase: some experiences Garbage collecting a 16GB heap can take quite a long time. Using G1*. Full collections still take seconds, but no missed heartbeats. * New HBase versions have a x / feature that works with the CMS collector. RIPE NCC is looking into upgrading.
  • 21. Hadoop and HBase: some experiences Estimating required capacity and machine proles is hard. Pick a starting point. Monitor. Tune. Scale. In that order.
  • 22. Hadoop and HBase: some experiences Other useful tools • Ganglia • CFEngine / Puppet • Splunk • Bash
  • 23. Hadoop and HBase: some experiences Awesome community! (mailing lists are a big help when in trouble)
  • 24. Q&A? Friso van Vollenhoven fvanvollenhoven@xebia.com

Hinweis der Redaktion

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n