SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
HADOOP
  AT
DATASIFT
ABOUT ME
  Jairam Chandar
   Big Data Engineer
         Datasift
        @jairamc
 http://about.me/jairam
    http://jairam.me
And I’m a Formula 1 Fan!
OUTLINE
• What   is Datasift ?

• Where   do we use Hadoop ?

 • The   Numbers

 • The   Use-cases

 • The   Lessons
!! SALES PITCH ALERT !!
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
WHAT IS DATASIFT?
THE NUMBERS

• Machines

 • HBase

   • 60   Machines as RegionServers

   •1   HMaster

   •3   Zookeeper nodes
THE NUMBERS
• Machines

 • Hadoop

   • 135   Machines divided into 2 clusters

     • Datanodes/Tasktrakers

   • Namenodes     with High-Availability Failover

   •1   Jobtracker each
THE NUMBERS
•   Machines

    •   DL380 Gen8

    •   2 * Intel Xeon E5646 @ 2.40GHz (24 core total)

    •   48GB RAM

    •   6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is
        storage)

    •   1 Gigabit network links
THE NUMBERS
•   Data

    •   Average load of 7500 interactions per second

    •   Peak loads of 15000 interactions per second sustained over a min

    •   Peak of 21000 interactions per second during superbowl

    •   Total current capacity ~ 1.6 PB; Total current usage ~ 800 TB

    •   Avg size of interaction 2 KB – thats ~ 1GB a min or ~ 2 TB a day with
        replication (RF = 3)

    •   And that’s not it!
THE USE CASES
•   HBase

    •   Recordings

    •   Archive

•   Map/Reduce

    •   Exports

    •   Historics

    •   Migration
THE USE CASES
•   Recordings

    •   User defined streams

    •   Stored in HBase for later retrieval

    •   Export to multiple output formats and stores

    •   <recording-id><interaction-uuid>

        •   Recording-id is a SHA-1 hash

        •   Allows recordings to be distributed by their key without generating hot-
            spots.
THE RECORDER
THE USE CASES
• Exporter

 • Export    data from HBase for customer

 • Export    files ~ 5 – 10 GB or ~ 3-6 million records

 • MR    over HBase using TableInputFormat

 • But   the data needs to be sorted

   • TotalOrderPartioner
EXPORTER
HISTORICS
THE USE CASES

• Twitter   Import

  •2   years of Tweets

  • About    95,000,000,000 tweets

  • Over    300 TB with added augmentation

  • Import   was not as simple as you would imagine
THE USE CASES
•   Archive

    •   Not just the Firehose but the Ultrahose

    •   Stored in HBase as well

    •   HBase architecture (BigTable) creates Hotspots with Time Series data

    •   Leading randomizing bit (see HBaseWD)

    •   Pre-split regions

    •   Concurrent writes
THE USE CASES
•   Historics

    •   Export archive data

    •   Slightly different from Exporter

    •   Much larger time lines (1 – 3 months)

    •   Controlled access to Hadoop cluster with efficient job scheduling

    •   Unfiltered Input Data

        •   Therefore longer processing time

    •   Hence more optimizations required
HISTORICS
THE LESSONS
•   Tune Tune Tune (Default == BAD)

•   Based on use case tune -

    •   Heap

    •   Block Size

    •   Memstore size

•   Keep number of column families low

•   Be aware of hot-spotting issue when writing time-series data
THE LESSONS
• Use   compression (eg. Snappy)

• Ops   need intimate understanding of system

• Monitor  system metrics (GC, CPU, Compaction, I/O) and
 application metrics (writes/sec etc)

• Don't   be afraid to fiddle with HBase code

• Using   a distribution is advisable
QUESTIONS?


            We are hiring
http://datasift.com/about-us/careers

Weitere ähnliche Inhalte

Was ist angesagt?

Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...
Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...
Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...Zarafa
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
Foss evolution cos-boudnik
Foss evolution cos-boudnikFoss evolution cos-boudnik
Foss evolution cos-boudnikData Con LA
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2VIVEKVANAVAN
 
Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017HashedIn Technologies
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGPradeep MG
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Cloudera, Inc.
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of dataPiyush Katariya
 
Big data solution capacity planning
Big data solution capacity planningBig data solution capacity planning
Big data solution capacity planningRiyaz Shaikh
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAmir Sedighi
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction葵慶 李
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloudAvinash Ramineni
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
ImpalaToGo design explained
ImpalaToGo design explainedImpalaToGo design explained
ImpalaToGo design explainedDavid Groozman
 
ImpalaToGo introduction
ImpalaToGo introductionImpalaToGo introduction
ImpalaToGo introductionDavid Groozman
 

Was ist angesagt? (20)

Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...
Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...
Zarafa SummerCamp 2012 - Tips & tricks for running Zarafa is larger scale env...
 
Introduce to spark
Introduce to sparkIntroduce to spark
Introduce to spark
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
Foss evolution cos-boudnik
Foss evolution cos-boudnikFoss evolution cos-boudnik
Foss evolution cos-boudnik
 
Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
 
Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017
 
Apache Kudu
Apache KuduApache Kudu
Apache Kudu
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MG
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a Flurry
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of data
 
Big data solution capacity planning
Big data solution capacity planningBig data solution capacity planning
Big data solution capacity planning
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Barcamp MySQL
Barcamp MySQLBarcamp MySQL
Barcamp MySQL
 
ImpalaToGo design explained
ImpalaToGo design explainedImpalaToGo design explained
ImpalaToGo design explained
 
ImpalaToGo introduction
ImpalaToGo introductionImpalaToGo introduction
ImpalaToGo introduction
 

Andere mochten auch

Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in CassandraJairam Chandar
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging ChallengesAaron Irizarry
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesNed Potter
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017Drift
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 

Andere mochten auch (6)

Hadoop Integration in Cassandra
Hadoop Integration in CassandraHadoop Integration in Cassandra
Hadoop Integration in Cassandra
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Ähnlich wie Hadoop at datasift

Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectureshypertable
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014Hassan Islamov
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchJoe Alex
 
August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation Yahoo Developer Network
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Deep Dive into DynamoDB
Deep Dive into DynamoDBDeep Dive into DynamoDB
Deep Dive into DynamoDBAWS Germany
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Scaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesScaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesHaohui Mai
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresDataWorks Summit
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHPAndrew Goodwin
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQueryCsaba Toth
 
Running & Scaling Large Elasticsearch Clusters
Running & Scaling Large Elasticsearch ClustersRunning & Scaling Large Elasticsearch Clusters
Running & Scaling Large Elasticsearch ClustersFred de Villamil
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tigerElizabeth Smith
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsSpeedment, Inc.
 

Ähnlich wie Hadoop at datasift (20)

Hadoop and friends
Hadoop and friendsHadoop and friends
Hadoop and friends
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Managing Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using ElasticsearchManaging Security At 1M Events a Second using Elasticsearch
Managing Security At 1M Events a Second using Elasticsearch
 
August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation August 2013 HUG: Removing the NameNode's memory limitation
August 2013 HUG: Removing the NameNode's memory limitation
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Deep Dive into DynamoDB
Deep Dive into DynamoDBDeep Dive into DynamoDB
Deep Dive into DynamoDB
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Scaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of FilesScaling HDFS to Manage Billions of Files
Scaling HDFS to Manage Billions of Files
 
Scaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value StoresScaling HDFS to Manage Billions of Files with Key-Value Stores
Scaling HDFS to Manage Billions of Files with Key-Value Stores
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHP
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
Running & Scaling Large Elasticsearch Clusters
Running & Scaling Large Elasticsearch ClustersRunning & Scaling Large Elasticsearch Clusters
Running & Scaling Large Elasticsearch Clusters
 
Taming the resource tiger
Taming the resource tigerTaming the resource tiger
Taming the resource tiger
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
 

Kürzlich hochgeladen

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Kürzlich hochgeladen (20)

So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

Hadoop at datasift

  • 2. ABOUT ME Jairam Chandar Big Data Engineer Datasift @jairamc http://about.me/jairam http://jairam.me And I’m a Formula 1 Fan!
  • 3. OUTLINE • What is Datasift ? • Where do we use Hadoop ? • The Numbers • The Use-cases • The Lessons
  • 4. !! SALES PITCH ALERT !!
  • 14. THE NUMBERS • Machines • HBase • 60 Machines as RegionServers •1 HMaster •3 Zookeeper nodes
  • 15. THE NUMBERS • Machines • Hadoop • 135 Machines divided into 2 clusters • Datanodes/Tasktrakers • Namenodes with High-Availability Failover •1 Jobtracker each
  • 16. THE NUMBERS • Machines • DL380 Gen8 • 2 * Intel Xeon E5646 @ 2.40GHz (24 core total) • 48GB RAM • 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage) • 1 Gigabit network links
  • 17. THE NUMBERS • Data • Average load of 7500 interactions per second • Peak loads of 15000 interactions per second sustained over a min • Peak of 21000 interactions per second during superbowl • Total current capacity ~ 1.6 PB; Total current usage ~ 800 TB • Avg size of interaction 2 KB – thats ~ 1GB a min or ~ 2 TB a day with replication (RF = 3) • And that’s not it!
  • 18. THE USE CASES • HBase • Recordings • Archive • Map/Reduce • Exports • Historics • Migration
  • 19. THE USE CASES • Recordings • User defined streams • Stored in HBase for later retrieval • Export to multiple output formats and stores • <recording-id><interaction-uuid> • Recording-id is a SHA-1 hash • Allows recordings to be distributed by their key without generating hot- spots.
  • 21. THE USE CASES • Exporter • Export data from HBase for customer • Export files ~ 5 – 10 GB or ~ 3-6 million records • MR over HBase using TableInputFormat • But the data needs to be sorted • TotalOrderPartioner
  • 24. THE USE CASES • Twitter Import •2 years of Tweets • About 95,000,000,000 tweets • Over 300 TB with added augmentation • Import was not as simple as you would imagine
  • 25. THE USE CASES • Archive • Not just the Firehose but the Ultrahose • Stored in HBase as well • HBase architecture (BigTable) creates Hotspots with Time Series data • Leading randomizing bit (see HBaseWD) • Pre-split regions • Concurrent writes
  • 26. THE USE CASES • Historics • Export archive data • Slightly different from Exporter • Much larger time lines (1 – 3 months) • Controlled access to Hadoop cluster with efficient job scheduling • Unfiltered Input Data • Therefore longer processing time • Hence more optimizations required
  • 28. THE LESSONS • Tune Tune Tune (Default == BAD) • Based on use case tune - • Heap • Block Size • Memstore size • Keep number of column families low • Be aware of hot-spotting issue when writing time-series data
  • 29. THE LESSONS • Use compression (eg. Snappy) • Ops need intimate understanding of system • Monitor system metrics (GC, CPU, Compaction, I/O) and application metrics (writes/sec etc) • Don't be afraid to fiddle with HBase code • Using a distribution is advisable
  • 30. QUESTIONS? We are hiring http://datasift.com/about-us/careers