HADOOP DEMO 
Esther Kundin 
Bloomberg LP
About Me
Big Data – What Is It?
Outline 
• What Is Big Data? 
• A History Lesson 
• Hadoop – Dive into the details 
• HDFS 
• MapReduce 
• HBase 
• Industry Trends 
• Questions
What is Big Data?
A History Lesson
Big Data Origins 
• Indexing the web requires lots of storage 
• Petabytes of data! 
• Economic problem – reliable servers expensive! 
• Solution: 
• Cram in as many cheap machines as possible 
• Replace them when they fail 
• Solve reliability via software!
Big Data Origins Cont’d 
• DBs are slow and expensive 
• Lots of unneeded features 
RDBMS             NoSQL
ACID              Eventual consistency
Strongly-typed    No type checking
Complex Joins     Get/Put
RAID storage      Commodity hardware
Big Data Origins Cont’d 
• Google publishes papers about: 
• GFS (2003) 
• MapReduce (2004) 
• BigTable (2006) 
• Hadoop, originally developed at Yahoo, accepted as 
Apache top-level project in 2008
Translation 
Google        Hadoop
GFS           HDFS
MapReduce     Hadoop MapReduce
BigTable      HBase
Why Hadoop? 
• Huge and growing ecosystem of services 
• Pace of development is swift 
• Tons of money and talent pouring in
Diving into the details!
Hadoop Ecosystem 
• HDFS – Hadoop Distributed File System 
• Pig: a scripting language that simplifies the creation of MapReduce 
jobs and excels at exploring and transforming data. 
• Hive: provides SQL-like access to your Big Data. 
• HBase: the Hadoop database. 
• HCatalog: for defining and sharing schemas. 
• Ambari: for provisioning, managing, and monitoring Apache Hadoop 
clusters. 
• ZooKeeper: an open-source server which enables highly reliable 
distributed coordination. 
• Sqoop: for efficiently transferring bulk data between Hadoop and 
relational databases. 
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs. 
• Mahout: a scalable machine learning library.
HDFS 
• Hadoop Distributed File System 
• Basis for all other tools, built on top of it 
• Allows for distributed workloads
HDFS details
HDFS Demo
MapReduce
MapReduce demo 
• To run a job, you can use: 
• A custom Java application 
• Pig – nice interface 
• Hadoop Streaming + any executable, such as a Python script 
• Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ 
• Hive – SQL over MapReduce – “we put the SQL in NoSQL”
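The Hadoop Streaming option can be illustrated with a word count written in Python. This is a minimal local sketch, not the code from the demo: the function names `mapper`, `reducer`, and `run_job` are made up for illustration, and the `sorted` call stands in for the shuffle/sort phase that Hadoop performs between map and reduce. In a real Streaming job the mapper and reducer would be separate executables that read lines from stdin and write tab-separated key/value lines to stdout.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts per word. Input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(lines):
    """Local stand-in for map -> shuffle/sort -> reduce (assumption: one process)."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_job(["big data is big", "Hadoop handles big data"]))
```

The reducer relies on its input being grouped by key, which is the guarantee the framework's sort phase provides.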
HBase 
• Database running on top of HDFS 
• NoSQL – key/value store 
• Distributed 
• Good for sparse, random-access requests, rather than the full scans typical of MapReduce 
• Sorted 
• Strongly consistent for single-row reads and writes (unlike many eventually consistent NoSQL stores)
HBase Architecture 
[Diagram: Client; ZK Quorum with three ZK Peers; two HMasters; Meta Region Server; three RegionServers; all on top of HDFS]
HBase Read 
[Diagram: Client, ZK Quorum, HMasters, Meta Region Server, RegionServers, HDFS]
• Client requests the Meta Region Server address
HBase Architecture 
[Diagram: Client, ZK Quorum, HMasters, Meta Region Server, RegionServers, HDFS]
• Client determines which RegionServer to contact and caches that data
HBase Architecture 
[Diagram: Client, ZK Quorum, HMasters, Meta Region Server, RegionServers, HDFS]
• Client requests data from the Region Server, which gets data from HDFS
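The read path (find the right RegionServer, cache the answer, then go to that server directly) can be sketched as follows. The classes `MetaTableSketch` and `ClientSketch` are hypothetical and are not the HBase client API; the sketch assumes each region is described by a start key, an exclusive end key (`None` for the last region), and a server name.

```python
import bisect


class MetaTableSketch:
    """Hypothetical stand-in for the meta region: maps key ranges to servers."""

    def __init__(self, regions):
        # regions: list of (start_key, end_key, server); end_key None = open-ended
        self.regions = sorted(regions, key=lambda r: r[0])
        self.start_keys = [r[0] for r in self.regions]
        self.lookups = 0  # counts how often the meta table is consulted

    def locate(self, row_key):
        self.lookups += 1
        # The owning region is the last one whose start key is <= row_key.
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[idx]


class ClientSketch:
    """Caches region locations so repeat reads skip the meta lookup."""

    def __init__(self, meta):
        self.meta = meta
        self.cached_regions = []

    def server_for(self, row_key):
        for start, end, server in self.cached_regions:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit
        region = self.meta.locate(row_key)  # cache miss: ask the meta table
        self.cached_regions.append(region)
        return region[2]


if __name__ == "__main__":
    meta = MetaTableSketch([("", "m", "rs1"), ("m", None, "rs2")])
    client = ClientSketch(meta)
    print(client.server_for("apple"), client.server_for("banana"), meta.lookups)
```

After the first lookup for a region, further reads in the same key range never touch the meta table, which is why the HMaster and meta lookups stay off the hot path.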
HBase Demo
HMaster 
• Only one active master at a time – ensured by ZooKeeper 
• Keeps track of all table metadata 
• Used in table creation, modification, and deletion. 
• Not used for reads
Region Server 
• This is the worker node of HBase 
• Performs Gets, Puts, and Scans for the regions it handles 
• Multiple regions are handled by each Region Server 
• On startup 
• Registers with ZooKeeper 
• HMaster assigns it regions 
• Physical blocks on HDFS may or may not be on the same machine 
• Regions are split if they get too big 
• Data is stored in a format called HFile 
• Caching data is what gives good performance; the cache 
holds blocks, not rows
HBase Write – step 1 
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
• Region Server persists the write at the end of the WAL
HBase Write – step 2 
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
• Region Server saves the write in a sorted map in memory, in the MemStore
HBase Write – offline 
[Diagram: Region Server containing the WAL (on HDFS), the MemStore, and three HFiles]
• When the MemStore reaches a configurable size, it is flushed to an HFile
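The three write steps can be sketched as a toy model. `RegionServerSketch` is a hypothetical class, not HBase code, and it makes two simplifying assumptions: the flush threshold counts entries rather than bytes, and the WAL and HFiles are plain in-memory lists instead of files on HDFS.

```python
class RegionServerSketch:
    """Toy model of the write path: WAL append, MemStore update, flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log (lives on HDFS in the real system)
        self.memstore = {}   # in-memory writes; sorted when flushed
        self.hfiles = []     # each flush produces one immutable, sorted file
        self.flush_threshold = flush_threshold  # assumption: entry count, not bytes

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 1: persist at the end of the WAL
        self.memstore[row_key] = value     # step 2: record the write in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # offline step: MemStore is full

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        if row_key in self.memstore:       # newest data is still in memory
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # then newest file first
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


if __name__ == "__main__":
    rs = RegionServerSketch()
    for key, value in [("b", 2), ("a", 1), ("c", 3)]:
        rs.put(key, value)
    print(rs.hfiles, rs.memstore)
```

Because every write lands in the WAL first, the in-memory MemStore can be rebuilt after a crash by replaying the log.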
Minor Compaction 
• Writing a MemStore to an HFile may trigger a Minor 
Compaction 
• Combines many small HFiles into one large one 
• Saves disk reads 
• May block further MemStore flushes, so try to keep to a 
minimum
Major Compaction 
• Happens at configurable times for the system 
• E.g., once a week on weekends 
• Default to once every 24 hrs 
• Resource-intensive 
• Don’t set it to “never” 
• Reads in all HFiles and makes sure there is one HFile per 
Region per column family 
• Purges deleted records 
• Ensures that HDFS files are local
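Both kinds of compaction come down to merging sorted files. The sketch below is a simplified illustration under stated assumptions: `compact` and the `TOMBSTONE` delete marker are hypothetical names, files are lists of (key, value) pairs sorted by key and passed oldest first, and `purge_deletes=True` models the extra work a major compaction does. A real minor compaction would also choose only a subset of the files.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker standing in for a delete record


def compact(hfiles, purge_deletes=False):
    """Merge sorted HFiles (oldest first) into one sorted file; newest value wins."""
    # Tag each entry with its file's age so equal keys come out oldest to newest.
    tagged = (
        [(key, age, value) for key, value in hfile]
        for age, hfile in enumerate(hfiles)
    )
    merged = {}
    for key, _age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        merged[key] = value  # a newer entry overwrites the older one
    if purge_deletes:
        # Major-compaction behavior: drop deleted records for good.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return list(merged.items())


if __name__ == "__main__":
    old_file = [("a", 1), ("b", 2)]
    new_file = [("b", TOMBSTONE), ("c", 3)]
    print(compact([old_file, new_file], purge_deletes=True))
```

Fewer files after the merge means a read has fewer places to look, which is where the saved disk reads come from.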
Tuning your DB - HBase Keys 
• Row Key – byte array 
• Best performance for Single Row Gets 
• Best Caching Performance 
• Key Design – 
• Distributes well – usually accomplished by hashing natural key 
• MD5 
• SHA1
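One way to get keys that distribute well is to prefix the natural key with a few characters of its hash. This is a sketch of one variant, not necessarily the scheme used in the talk: the function name `hashed_row_key`, the 4-character prefix, and the `|` separator are all assumptions made for illustration.

```python
import hashlib


def hashed_row_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Build a row key whose leading bytes come from the MD5 of the natural key."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    # Keep the natural key after the prefix so the row stays human-readable.
    return (digest[:prefix_len] + "|" + natural_key).encode("utf-8")


if __name__ == "__main__":
    for n in range(3):
        print(hashed_row_key(f"PKG-{n:06d}"))
```

The trade-off is that sequential natural keys no longer sit next to each other, so single-row Gets spread across regions while range scans over the natural key are no longer possible.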
Tuning your DB - BlockCache 
• Each region server has a BlockCache where it stores file 
blocks that it has already read 
• Every read served from a cached block improves performance 
• Blocks should not be much bigger than your rows 
• Modes of caching: 
• 2-level LRU cache, by default 
• Other options: BucketCache – can use DirectByteBuffers to 
manage off-heap RAM – better Garbage Collection stats on the 
region server
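The idea of caching whole blocks can be sketched with a small LRU cache. This is a single-level LRU for illustration only, not the 2-level default or BucketCache; `BlockCacheSketch` and the `read_block_from_disk` callback are hypothetical names.

```python
from collections import OrderedDict


class BlockCacheSketch:
    """Single-level LRU cache keyed by block id (a simplification)."""

    def __init__(self, capacity_blocks, read_block_from_disk):
        self.capacity = capacity_blocks
        self.read_block = read_block_from_disk  # called only on a cache miss
        self.blocks = OrderedDict()             # least recently used first
        self.hits = 0
        self.misses = 0

    def get_block(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)   # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        block = self.read_block(block_id)
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used
        return block


if __name__ == "__main__":
    cache = BlockCacheSketch(2, lambda block_id: f"data-{block_id}")
    for block_id in [1, 2, 1, 3, 2]:
        cache.get_block(block_id)
    print(cache.hits, cache.misses)
```

Since the unit of caching is the block, a block much larger than a row fills the cache with bytes that the requested row did not need.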
Tuning your DB - Columns and Column 
Families 
• All columns in a column family are accessed together for 
reads 
• Different column families are stored in different HFiles 
• All column families are flushed together when any one MemStore is 
full 
• Example: 
• Storing package tracking information: 
• Need package shipping info 
• Need to store each location in the path
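One possible layout for the package-tracking example puts the shipping info and the location history in separate column families. The sketch below models a row as nested dictionaries; the family names `info` and `loc`, the qualifier names, and the helper functions are hypothetical, and the talk may have settled on a different design.

```python
def new_package_row(sender, recipient):
    """One row per package: 'info' is written once, 'loc' grows over time."""
    return {
        "info": {"sender": sender, "recipient": recipient},
        "loc": {},  # qualifier = timestamp, value = place
    }


def add_location(row, timestamp, place):
    """Record one stop on the package's path in the 'loc' family."""
    row["loc"][timestamp] = place


def read_family(row, family):
    """Reading one family does not touch the other family's data."""
    return dict(row[family])


def path(row):
    """Locations in timestamp order (ISO timestamps sort lexicographically)."""
    return [place for _, place in sorted(row["loc"].items())]


if __name__ == "__main__":
    row = new_package_row("Alice", "Bob")
    add_location(row, "2015-01-02T10:00", "Newark")
    add_location(row, "2015-01-01T09:00", "New York")
    print(read_family(row, "info"), path(row))
```

Splitting the families keeps a shipping-info read from pulling in the location history, at the cost of both families being flushed whenever either MemStore fills.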
Tuning your DB – Bloom Filters 
• Can be set on rows or columns 
• Keep an extra index of available keys 
• Slows down reads and writes a bit 
• Increases storage 
• Saves time checking if keys exist 
• Turn on if it is likely that client will request missing data
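How a Bloom filter saves time on missing keys can be sketched generically. This is a textbook Bloom filter, not HBase's implementation; the class name, the bit-array size, and the use of salted SHA-1 digests as hash functions are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Answers 'definitely absent' or 'possibly present' for a key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key: bytes):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


if __name__ == "__main__":
    bloom = BloomFilter()
    bloom.add(b"row-1")
    print(bloom.might_contain(b"row-1"))
```

A negative answer lets a read skip a file entirely, which is why the filter pays off when clients often ask for keys that do not exist.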
Tuning your DB – Short-Circuit Reads 
• HDFS exposes service interface 
• If the file is actually local, it is much faster to just read the HFile 
directly off of the disk
Current Industry Trends
Big Data in Finance – the challenges 
• Real-Time financial analysis 
• Reliability 
• “medium-data”
What Bloomberg is Working on 
• Working with Hortonworks on fixing real-time issues in 
Hadoop 
• Creating a framework for reliably serving real-time data 
• Presenting at Hadoop World and Hadoop Summit 
• Open source Chef recipes for running a Hadoop cluster on 
OpenStack-managed VMs
Questions? 
• Thank you!

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

  • 1. HADOOP DEMO Esther Kundin Bloomberg LP
  • 4. Outline • What Is Big Data? • A History Lesson • Hadoop – Dive in to the details • HDFS • MapReduce • HBase • Industry Trends • Questions
  • 5. What is Big Data?
  • 7. Big Data Origins • Indexing the web requires lots of storage • Petabytes of data! • Economic problem – reliable servers expensive! • Solution: • Cram in as many cheap machines as possible • Replace them when they fail • Solve reliability via software!
  • 8. Big Data Origins Cont’d • DBs are slow and expensive • Lots of unneeded features • RDBMS vs. NoSQL: ACID vs. eventual consistency; strongly typed vs. no type checking; complex joins vs. Get/Put; RAID storage vs. commodity hardware
  • 9. Big Data Origins Cont’d • Google publishes papers about: • GFS (2003) • MapReduce (2004) • BigTable (2006) • Hadoop, originally developed at Yahoo, accepted as Apache top-level project in 2008
  • 10. Translation • GFS → HDFS • MapReduce → Hadoop MapReduce • BigTable → HBase
  • 11. Why Hadoop? • Huge and growing ecosystem of services • Pace of development is swift • Tons of money and talent pouring in
  • 12. Diving into the details!
  • 13. Hadoop Ecosystem • HDFS – Hadoop Distributed File System • Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data. • Hive: provides SQL-like access to your Big Data. • HBase: Hadoop database. • HCatalog: for defining and sharing schemas. • Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters. • ZooKeeper: an open-source server which enables highly reliable distributed coordination. • Sqoop: for efficiently transferring bulk data between Hadoop and relational databases. • Oozie: a workflow scheduler system to manage Apache Hadoop jobs. • Mahout: scalable machine learning library.
  • 14. HDFS • Hadoop Distributed File System • Basis for all other tools, built on top of it • Allows for distributed workloads
  • 18. MapReduce demo • To run, can use: • Custom Java application • Pig – nice interface • Hadoop Streaming + any executable, like Python • Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ • Hive – SQL over MapReduce – “we put the SQL in NoSQL”
  • 19. HBase • Database running on top of HDFS • NoSQL – key/value store • Distributed • Good for sparse requests, rather than scans like MapReduce • Sorted • Eventually Consistent
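
The Hadoop Streaming option on slide 18 can be illustrated with a word count. This is a minimal sketch, not the demo code from the talk: in a real Streaming job the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines, and Hadoop would do the sort between them. Here the function names (`mapper`, `reducer`, `run_local`) are assumptions, and the shuffle is simulated with `sorted` so the whole thing runs in one process.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts per word; input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort (the shuffle) -> reduce in a single process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


counts = run_local(["Hadoop stores data", "hadoop processes data"])
print(counts)
```

The reducer relies on its input being grouped by key, which is why the sort step cannot be skipped in the local simulation.
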
  • 20. HBase Architecture (diagram: Client; ZK Quorum of three ZK Peers; two HMasters; Meta Region Server; three RegionServers; HDFS)
  • 21. HBase Read (same diagram as slide 20) • Client requests the Meta Region Server address
  • 22. HBase Architecture (same diagram as slide 20) • Client determines which RegionServer to contact and caches that data
  • 23. HBase Architecture (same diagram as slide 20) • Client requests data from the Region Server, which gets data from HDFS
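
The read path on slides 21 to 23 can be sketched as a toy lookup with a client-side cache. This is an illustration under stated assumptions, not the HBase client API: `ToyMeta` and `ToyClient` are hypothetical names, regions are identified only by their start keys, and the initial step of finding the Meta Region Server is left out.

```python
import bisect


class ToyMeta:
    """Hypothetical stand-in for the meta table: maps key ranges to servers."""

    def __init__(self, regions):
        # regions: list of (start_key, server) sorted by start_key; first start is "".
        self.starts = [start for start, _ in regions]
        self.servers = [server for _, server in regions]
        self.lookups = 0  # counts round trips to meta

    def locate(self, row_key):
        self.lookups += 1
        i = bisect.bisect_right(self.starts, row_key) - 1
        end = self.starts[i + 1] if i + 1 < len(self.starts) else None
        return self.starts[i], end, self.servers[i]


class ToyClient:
    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # cached (start, end, server) tuples

    def server_for(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no meta round trip
        region = self.meta.locate(row_key)
        self.cache.append(region)
        return region[2]


meta = ToyMeta([("", "rs1"), ("m", "rs2")])
client = ToyClient(meta)
first = client.server_for("apple")    # cache miss, asks meta
second = client.server_for("banana")  # same region, served from the cache
third = client.server_for("zebra")    # different region, asks meta again
print(first, second, third, meta.lookups)
```

Once a region's location is cached, later reads in the same key range go straight to the Region Server, which is the point of the caching step on slide 22.
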
  • 25. HMaster • Only one main master at a time – ensured by ZooKeeper • Keeps track of all table metadata • Used in table creation, modification, and deletion. • Not used for reads
  • 26. Region Server • This is the worker node of HBase • Performs Gets, Puts, and Scans for the regions it handles • Multiple regions are handled by each Region Server • On startup • Registers with ZooKeeper • HMaster assigns it regions • Physical blocks on HDFS may or may not be on the same machine • Regions are split if they get too big • Data stored in a format called HFile • Cache of data is what gives good performance. Cache based on blocks, not rows
  • 27. HBase Write – step 1 (diagram: Region Server; WAL on HDFS; MemStore; three HFiles) • Region Server persists the write at the end of the WAL
  • 28. HBase Write – step 2 (same diagram as slide 27) • Region Server saves the write in a sorted map in memory in the MemStore
  • 29. HBase Write – offline (same diagram as slide 27) • When the MemStore reaches a configurable size, it is flushed to an HFile
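
The three write steps on slides 27 to 29 can be sketched in a few lines. This is a toy model, not HBase code: `ToyRegionServer` is a hypothetical name, the WAL is a Python list rather than a file on HDFS, the MemStore is a plain dict that is sorted only when flushed, and the flush threshold counts entries rather than bytes.

```python
class ToyRegionServer:
    """Toy write path: WAL append, MemStore update, flush to an HFile."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log; stands in for the WAL on HDFS
        self.memstore = {}   # in-memory writes, sorted at flush time
        self.hfiles = []     # each flush adds one immutable, sorted list
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 1: persist to the WAL
        self.memstore[row_key] = value     # step 2: save in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # "offline" step: flush when full

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):  # newest HFile first
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


rs = ToyRegionServer(flush_threshold=2)
rs.put("b", "1")
rs.put("a", "2")   # second entry reaches the threshold and triggers a flush
rs.put("c", "3")   # lands in a fresh MemStore
print(rs.hfiles, rs.memstore)
```

Writing to the WAL before the MemStore is what lets a write survive a crash: the in-memory copy can be rebuilt from the log.
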
  • 30. Minor Compaction • When writing a MemStore to an HFile, may trigger a Minor Compaction • Combines many small HFiles into one large one • Saves disk reads • May block further MemStore flushes, so try to keep to a minimum
  • 31. Major Compaction • Happens at configurable times for the system • E.g., once a week on weekends • Defaults to once every 24 hrs • Resource-intensive • Don’t set it to “never” • Reads in all HFiles and makes sure there is one HFile per Region per column family • Purges deleted records • Ensures that HDFS files are local
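
The difference between the two compaction types on slides 30 and 31 can be sketched as a merge of sorted files. This is a simplified illustration, not the HBase algorithm: `compact` and `TOMBSTONE` are hypothetical names, each HFile is a sorted list of `(key, value)` pairs, and the only behavioral difference modeled is that a major compaction purges delete markers.

```python
TOMBSTONE = object()  # hypothetical delete marker


def compact(hfiles, major=False):
    """Merge HFiles (ordered oldest to newest) into one sorted file."""
    merged = {}
    for hfile in hfiles:            # newer files overwrite older cells
        for key, value in hfile:
            merged[key] = value
    if major:
        # Only a major compaction actually drops deleted records.
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items(), key=lambda kv: kv[0])


old_file = [("a", "1"), ("b", "2")]
new_file = [("b", TOMBSTONE), ("c", "3")]

minor_result = compact([old_file, new_file])
major_result = compact([old_file, new_file], major=True)
print(len(minor_result), len(major_result))
```

After the minor merge the marker for `b` is still present, so a read must still skip it; after the major merge the record is gone from disk.
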
  • 32. Tuning your DB - HBase Keys • Row Key – byte array • Best performance for Single Row Gets • Best Caching Performance • Key Design – • Distributes well – usually accomplished by hashing natural key • MD5 • SHA1
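
The key-design advice on slide 32 (hash the natural key so rows distribute well) might look like the sketch below. The function name `salted_row_key`, the four-character prefix, and the `-` separator are assumptions for illustration; the slide only names MD5 and SHA1 as hash choices.

```python
import hashlib


def salted_row_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Prefix the natural key with part of its MD5 digest to spread rows out."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{natural_key}".encode("utf-8")


key = salted_row_key("package-42")
print(key)
```

The prefix is derived from the key itself, so a client can recompute it for a single-row Get. The trade-off is that neighboring natural keys no longer sort next to each other, which makes range scans over the natural key less efficient.
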
  • 33. Tuning your DB - BlockCache • Each region server has a BlockCache where it stores file blocks that it has already read • Every read that is in the block increases performance • Don’t want your blocks to be much bigger than your rows • Modes of caching: • 2-level LRU cache, by default • Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM – better Garbage Collection stats on the region server
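
To show why slide 33 says reads within an already-cached block are cheap, here is a toy block cache. It is a single-level LRU, simpler than the 2-level default the slide mentions, and `ToyBlockCache`, `ROWS_PER_BLOCK`, and `load_block` are hypothetical names.

```python
from collections import OrderedDict

ROWS_PER_BLOCK = 4  # assumed block size, in rows, for this illustration


class ToyBlockCache:
    """Single-level LRU cache keyed by block id, not by row."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_block(self, block_id, load_block):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        block = load_block(block_id)           # stands in for a disk read
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return block


def load_block(block_id):
    start = block_id * ROWS_PER_BLOCK
    return list(range(start, start + ROWS_PER_BLOCK))


cache = ToyBlockCache(capacity=2)
for row in [0, 1, 2, 4, 8, 0]:
    cache.get_block(row // ROWS_PER_BLOCK, load_block)
print(cache.hits, cache.misses)
```

Rows 1 and 2 hit because they share a block with row 0. The final read of row 0 misses because its block was evicted to make room, which is one reason oversized blocks waste cache space when only one row per block is needed.
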
  • 34. Tuning your DB - Columns and Column Families • All columns in a column family are accessed together for reads • Different column families are stored in different HFiles • All column families are written at once when any MemStore is full • Example: • Storing package tracking information: • Need package shipping info • Need to store each location in the path
  • 35. Tuning your DB – Bloom Filters • Can be set on rows or columns • Keep an extra index of available keys • Slows down reads and writes a bit • Increases storage • Saves time checking if keys exist • Turn on if it is likely that client will request missing data
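
The "extra index of available keys" on slide 35 is a Bloom filter, which can answer "definitely absent" or "possibly present". The sketch below is a generic Bloom filter, not HBase's implementation; the class name, the bit-array size, and the use of seeded SHA-1 digests as hash functions are all assumptions.

```python
import hashlib


class ToyBloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


bloom = ToyBloomFilter()
empty = ToyBloomFilter()
bloom.add("row-1")
print(bloom.might_contain("row-1"), empty.might_contain("row-1"))
```

A `False` answer lets a read skip a file entirely, which is where the time saving for missing keys comes from; a `True` answer still requires checking the file.
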
  • 36. Tuning your DB – Short-Circuit Reads • HDFS exposes a service interface • If the file is actually local, it is much faster to just read the HFile directly off of the disk
  • 38. Big Data in Finance – the challenges • Real-Time financial analysis • Reliability • “medium-data”
  • 39. What Bloomberg is Working on • Working with Hortonworks on fixing real-time issues in Hadoop • Creating a framework for reliably serving real-time data • Presenting at Hadoop World and Hadoop Summit • Open source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs

Editor's Notes

  1. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  2. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  3. Name node is the manager, data node is the worker
  4. Job Tracker = Resource Manager; Task Tracker = Node Manager. Number of jobs depends on the range of keys. Number of reducers is set by the user – you’d want it to correspond to the set of possible values. So, if the values are ASCII, you won’t want reducers to exceed 256. You also don’t want them to exceed the number of data nodes you have.
  5. Remember, HBase treats everything as a file system
  6. ZooKeeper quorum size should be odd, as a majority is needed for consensus. A znode is the name of each attribute that is managed by ZooKeeper.
  7. ZooKeeper quorum size should be odd, as a majority is needed for consensus. A znode is the name of each attribute that is managed by ZooKeeper.
  8. ZooKeeper quorum size should be odd, as a majority is needed for consensus. A znode is the name of each attribute that is managed by ZooKeeper.
  9. ZooKeeper quorum size should be odd, as a majority is needed for consensus. A znode is the name of each attribute that is managed by ZooKeeper.
  10. All columns in a column family are read for a get – but not all column families unless specified
  11. Although there is a separate MemStore per column family, as soon as one is full, all of them are written to HFiles. Note also that deletes are handled with a marker, and only really purged at a major compaction
  12. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
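
The odd-quorum rule in the ZooKeeper notes above follows from simple majority arithmetic, sketched here. The function names are hypothetical; the only assumption is that progress requires a strict majority of the ensemble to be available.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


table = {n: (majority(n), tolerated_failures(n)) for n in range(3, 7)}
print(table)
```

Going from three servers to four raises the majority from two to three but still tolerates only one failure, so the extra even-numbered server adds cost without adding fault tolerance.
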