14. THE NUMBERS
• Machines
• HBase
• 60 Machines as RegionServers
• 1 HMaster
• 3 ZooKeeper nodes
15. THE NUMBERS
• Machines
• Hadoop
• 135 Machines divided into 2 clusters
• DataNodes/TaskTrackers
• NameNodes with High-Availability failover
• 1 JobTracker each
16. THE NUMBERS
• Machines
• DL380 Gen8
• 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)
• 48 GB RAM
• 6 × 2 TB disks in JBOD (small partition on the first disk for the OS; the rest is storage)
• 1 Gigabit network links
17. THE NUMBERS
• Data
• Average load of 7,500 interactions per second
• Peak loads of 15,000 interactions per second sustained over a minute
• Peak of 21,000 interactions per second during the Super Bowl
• Total current capacity ~1.6 PB; total current usage ~800 TB
• Average interaction size of 2 KB – that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
• And that’s not it!
18. THE USE CASES
• HBase
• Recordings
• Archive
• Map/Reduce
• Exports
• Historics
• Migration
19. THE USE CASES
• Recordings
• User defined streams
• Stored in HBase for later retrieval
• Export to multiple output formats and stores
• <recording-id><interaction-uuid>
• Recording-id is a SHA-1 hash
• Allows recordings to be distributed by their key without generating hot-spots (row-key sketch below)
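A minimal sketch of how such a composite row key could be assembled, assuming a small helper of our own (the class and method names are hypothetical illustrations, not DataSift's code): 20 bytes of SHA-1 over the recording identifier, followed by the 16-byte interaction UUID.

// Hypothetical sketch of the <recording-id><interaction-uuid> row key described above.
// The SHA-1 prefix spreads recordings evenly across regions; the UUID suffix keeps each
// interaction unique within its recording.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class RecordingRowKey {

    // Build a 36-byte row key: 20 bytes of SHA-1(recording id) + 16 bytes of interaction UUID.
    public static byte[] buildRowKey(String recordingId, UUID interactionId) throws Exception {
        byte[] recordingHash = MessageDigest.getInstance("SHA-1")
                .digest(recordingId.getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.allocate(recordingHash.length + 16);
        key.put(recordingHash);
        key.putLong(interactionId.getMostSignificantBits());
        key.putLong(interactionId.getLeastSignificantBits());
        return key.array();
    }
}

Because the SHA-1 output is effectively uniform, writes for different recordings land on different regions, while all interactions of one recording stay contiguous and can be read back with a single prefix scan.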
21. THE USE CASES
• Exporter
• Export data from HBase for customer
• Export files of ~5–10 GB, or ~3–6 million records
• MR over HBase using TableInputFormat
• But the data needs to be sorted
• TotalOrderPartitioner (job-setup sketch below)
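A sketch of what such an export job setup could look like, using the standard Hadoop/HBase MapReduce APIs; the table name, mapper, partition-file path, sampler settings and reducer count are illustrative assumptions, not DataSift's configuration.

// Hypothetical Exporter job: MapReduce over HBase via TableInputFormat, with
// TotalOrderPartitioner so the exported files come out globally sorted by row key.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class ExportJob {

    // Emits each row key plus a textual rendering of the row; a real exporter would
    // serialise the stored interaction into the required output format instead.
    public static class ExportMapper extends TableMapper<ImmutableBytesWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(row.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-export");
        job.setJarByClass(ExportJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);         // batch rows per RPC for the long scan
        scan.setCacheBlocks(false);   // don't churn the block cache during a full scan

        TableMapReduceUtil.initTableMapperJob(
                "recordings", scan, ExportMapper.class,
                ImmutableBytesWritable.class, Text.class, job);

        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(32);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        // Globally sort across reducers: sample row keys to build the partition file,
        // then hand it to TotalOrderPartitioner.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/export-partitions"));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<ImmutableBytesWritable, Result>(0.01, 1000));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The sampling pass reads a small fraction of the row keys up front to pick the partition boundaries, so every reducer receives a contiguous, already-sorted slice of the key space.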
24. THE USE CASES
• Twitter Import
• 2 years of Tweets
• About 95,000,000,000 tweets
• Over 300 TB with added augmentation
• Import was not as simple as you would imagine
25. THE USE CASES
• Archive
• Not just the Firehose but the Ultrahose
• Stored in HBase as well
• The HBase architecture (BigTable) creates hot-spots with time-series data
• Leading randomizing bit (see HBaseWD; salting sketch below)
• Pre-split regions
• Concurrent writes
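A sketch of the leading-randomizing-bit idea, written against plain Java rather than the HBaseWD API; the bucket count and class name are assumptions for illustration only.

// Illustrative key salting in the spirit of HBaseWD: prepend a small bucket prefix derived
// from the key itself, so time-ordered writes fan out across pre-split regions instead of
// all hammering the newest region.
public class SaltedArchiveKey {

    private static final int BUCKETS = 16;   // assumed bucket count; keep in sync with the pre-split

    // Salt a time-series key: one prefix byte chosen by hashing the original key.
    public static byte[] salt(byte[] originalKey) {
        byte bucket = (byte) ((java.util.Arrays.hashCode(originalKey) & 0x7fffffff) % BUCKETS);
        byte[] salted = new byte[originalKey.length + 1];
        salted[0] = bucket;
        System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
        return salted;
    }

    // Split points for pre-creating the table (e.g. via HBaseAdmin.createTable(desc, splits)):
    // one region per bucket prefix, so concurrent writers start out spread across the cluster.
    public static byte[][] splitPoints() {
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        return splits;
    }
}

The trade-off sits on the read side: a time-range scan now has to be issued once per bucket and the results merged, which is the part HBaseWD's distributed scanner automates.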
26. THE USE CASES
• Historics
• Export archive data
• Slightly different from Exporter
• Much larger timelines (1–3 months)
• Controlled access to Hadoop cluster with efficient job scheduling
• Unfiltered input data
• Therefore longer processing times
• Hence more optimizations required (scan sketch below)
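As a rough illustration of the scan-level tuning a month-scale export tends to need, here is a minimal sketch; the time bounds and tuning values are assumptions, not DataSift's actual settings.

// A hypothetical Scan configuration for a long Historics export over the archive table.
import org.apache.hadoop.hbase.client.Scan;

public class HistoricsScan {

    // Bound the scan by HBase cell timestamp and keep the RegionServers healthy while
    // millions of rows stream through MapReduce.
    public static Scan buildScan(long startMillis, long endMillis) throws java.io.IOException {
        Scan scan = new Scan();
        scan.setTimeRange(startMillis, endMillis); // e.g. a 1-3 month window
        scan.setCaching(1000);        // larger batches per RPC for a throughput-bound job
        scan.setCacheBlocks(false);   // a one-off bulk read should not evict the block cache
        scan.setMaxVersions(1);       // only the latest version of each interaction
        return scan;
    }
}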
28. THE LESSONS
• Tune, tune, tune (defaults == BAD)
• Based on the use case, tune (table-definition sketch after this list):
• Heap
• Block size
• Memstore size
• Keep the number of column families low
• Be aware of the hot-spotting issue when writing time-series data
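A hypothetical table definition reflecting this advice, using the HBase admin API; the family name, block size and flush size are illustrative, and cluster-wide heap and global memstore limits would live in hbase-env.sh / hbase-site.xml rather than here.

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;

public class TunedTable {

    // One short-named column family, an explicit HFile block size, Snappy compression
    // (next slide), and a per-table memstore flush threshold.
    public static HTableDescriptor interactionsTable() {
        HColumnDescriptor family = new HColumnDescriptor("d");    // keep column families few and short
        family.setBlocksize(64 * 1024);                           // HFile block size, tuned to the read pattern
        family.setCompressionType(Compression.Algorithm.SNAPPY);  // cheap CPU for a large I/O saving
        family.setMaxVersions(1);                                 // interactions are written once

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("interactions"));
        table.addFamily(family);
        table.setMemStoreFlushSize(256L * 1024 * 1024);           // flush this table's memstores at 256 MB
        return table;
    }
}

The descriptor would then be passed to HBaseAdmin.createTable(...), together with pre-split points for time-series tables as in the archive sketch above.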
29. THE LESSONS
• Use compression (e.g. Snappy)
• Ops need an intimate understanding of the system
• Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec, etc.)
• Don't be afraid to fiddle with the HBase code
• Using a distribution is advisable
30. QUESTIONS?
We are hiring
http://datasift.com/about-us/careers