HADOOP DEMO 
Esther Kundin 
Bloomberg LP
About Me
Big Data – What Is It?
Outline 
• What Is Big Data? 
• A History Lesson 
• Hadoop – Dive into the details
• HDFS 
• MapReduce 
• HBase 
• Industry Trends 
• Questions
What is Big Data?
A History Lesson
Big Data Origins 
• Indexing the web requires lots of storage 
• Petabytes of data! 
• Economic problem – reliable servers expensive! 
• Solution: 
• Cram in as many cheap machines as possible 
• Replace them when they fail 
• Solve reliability via software!
Big Data Origins Cont’d 
• DBs are slow and expensive 
• Lots of unneeded features 
RDBMS            | NoSQL
ACID             | Eventual consistency
Strongly-typed   | No type checking
Complex Joins    | Get/Put
RAID storage     | Commodity hardware
Big Data Origins Cont’d 
• Google publishes papers about:
• GFS (2003)
• MapReduce (2004)
• BigTable (2006)
• Hadoop, developed largely at Yahoo, accepted as Apache top-level project in 2008
Translation 
GFS → HDFS
MapReduce → Hadoop MapReduce
BigTable → HBase
Why Hadoop? 
• Huge and growing ecosystem of services 
• Pace of development is swift 
• Tons of money and talent pouring in
Diving into the details!
Hadoop Ecosystem
• HDFS – Hadoop Distributed File System
• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data.
• Hive: provides SQL-like access to your Big Data.
• HBase: Hadoop database.
• HCatalog: for defining and sharing schemas.
• Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters.
• ZooKeeper: an open-source server which enables highly reliable distributed coordination.
• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases.
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
• Mahout: scalable machine learning library.
HDFS 
• Hadoop Distributed File System 
• Basis for all other tools, built on top of it 
• Allows for distributed workloads
HDFS details
HDFS Demo
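A minimal sketch of the idea behind HDFS, not its actual implementation: a file is cut into fixed-size blocks, each block is copied to several data nodes, and the name node only keeps the map of which block lives where. The 128 MB block size and replication factor of 3 mirror common HDFS defaults; the round-robin placement and every function name here are illustrative assumptions (real HDFS placement is rack-aware).

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # bytes; common HDFS default, configurable
REPLICATION = 3                 # common HDFS default, configurable


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to hold file_size bytes."""
    return max(1, math.ceil(file_size / block_size))


def place_blocks(file_size, data_nodes, replication=REPLICATION):
    """Name-node view: map each block index to the data nodes holding a replica.

    Round-robin placement is a simplification for illustration only.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for block in range(split_into_blocks(file_size)):
        placement[block] = [
            data_nodes[(block + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a live replica."""
    return all(
        any(node != failed_node for node in nodes) for nodes in placement.values()
    )
```

With four data nodes and a 300 MB file, the sketch produces three blocks, each on three distinct nodes, so losing any single node leaves every block readable.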
MapReduce
MapReduce demo 
• To run, can use:
• Custom Java application
• Pig – nice interface
• Hadoop Streaming + any executable, like Python
• Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hive – SQL over MapReduce – “we put the SQL in NoSQL”
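As a rough illustration of what a MapReduce job does, here is a word count written as a mapper and a reducer in Python. With Hadoop Streaming the two would be separate executables reading stdin and writing tab-separated lines to stdout, and the framework would do the sort between them; in this sketch `run_job` is a hypothetical local stand-in for that framework so the example runs on its own.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Reduce phase: sum counts per word; input must be sorted by key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_job(lines):
    """Local stand-in for the framework: map, sort (shuffle), reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    print(run_job(["Hello Hadoop", "hello HBase"]))
```

The sort step matters: the reducer relies on all pairs for one key arriving together, which is what the shuffle phase guarantees on a real cluster.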
HBase 
• Database running on top of HDFS 
• NOSQL – key/value store 
• Distributed 
• Good for sparse requests, rather than scans like MapReduce 
• Sorted 
• Strongly consistent for single-row reads and writes
HBase Architecture
[Diagram: Client; ZK Quorum (three ZK Peers); two HMasters; Meta Region Server; three RegionServers; HDFS]
HBase Read
[Diagram: same components as above]
Client requests Meta Region Server address
HBase Architecture
[Diagram: same components as above]
Client determines which RegionServer to contact and caches that data
HBase Architecture
[Diagram: same components as above]
Client requests data from the Region Server, which gets data from HDFS
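The three read steps above can be sketched as a toy client that looks up the region for a row key, caches the answer, and then talks to the region server directly. This is an assumption-laden model, not the HBase client API: `MetaTable`, `Client`, and the dict-based region servers are hypothetical names, and the first region is assumed to start at the empty key.

```python
import bisect


class MetaTable:
    """Toy meta table: sorted region start keys mapped to region servers."""

    def __init__(self, regions):
        # regions: {start_key: server_name}; "" must be the first start key
        self.starts = sorted(regions)
        self.servers = [regions[s] for s in self.starts]
        self.lookups = 0

    def locate(self, row_key):
        """Return (start, end, server) for the region containing row_key."""
        self.lookups += 1
        i = bisect.bisect_right(self.starts, row_key) - 1
        end = self.starts[i + 1] if i + 1 < len(self.starts) else None
        return self.starts[i], end, self.servers[i]


class Client:
    def __init__(self, meta, region_servers):
        self.meta = meta                      # step 1: address of the meta table
        self.region_servers = region_servers  # {server_name: {row_key: value}}
        self.cache = []                       # cached (start, end, server) tuples

    def _find_server(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server                 # cache hit: no meta lookup
        region = self.meta.locate(row_key)    # step 2: ask meta, then cache
        self.cache.append(region)
        return region[2]

    def get(self, row_key):
        server = self._find_server(row_key)   # step 3: read from that server
        return self.region_servers[server].get(row_key)
```

After the first lookup for a region, later reads in the same key range skip the meta table entirely, which is why the master and meta server are not on the hot read path.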
HBase Demo
HMaster 
• Only one main master at a time – ensured by ZooKeeper
• Keeps track of all table metadata 
• Used in table creation, modification, and deletion. 
• Not used for reads
Region Server 
• This is the worker node of HBase 
• Performs Gets, Puts, and Scans for the regions it handles 
• Multiple regions are handled by each Region Server 
• On startup 
• Registers with ZooKeeper
• HMaster assigns it regions
• Physical blocks on HDFS may or may not be on the same machine
• Regions are split if they get too big
• Data stored in a format called HFile
• Cache of data is what gives good performance. Cache 
based on blocks, not rows
HBase Write – step 1
[Diagram: Region Server with WAL (on HDFS), MemStore, and three HFiles]
Region Server persists write at the end of the WAL
HBase Write – step 2
[Diagram: Region Server with WAL (on HDFS), MemStore, and three HFiles]
Region Server saves write in a sorted map in memory in the MemStore
HBase Write – offline
[Diagram: Region Server with WAL (on HDFS), MemStore, and three HFiles]
When MemStore reaches a configurable size, it is flushed to an HFile
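The write steps above can be modelled in a few lines. This is a simplified sketch, not HBase code: `RegionServerStore` is a hypothetical class, the flush threshold counts entries rather than bytes, and the MemStore is a plain dict that is sorted at flush time instead of a continuously sorted map.

```python
class RegionServerStore:
    """Toy write path: WAL append, MemStore insert, flush to an HFile."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log; lives on HDFS in real HBase
        self.memstore = {}   # in-memory buffer, sorted when flushed
        self.hfiles = []     # each HFile is an immutable sorted list of pairs
        self.flush_threshold = flush_threshold  # entries here, bytes in HBase

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # step 1: persist to the WAL
        self.memstore[row_key] = value      # step 2: update the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                    # offline step: write an HFile

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        """Check the MemStore first, then HFiles from newest to oldest."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None
```

Because the WAL is written before the MemStore, a crashed region server can rebuild its unflushed writes by replaying the log.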
Minor Compaction 
• When writing a MemStore to an HFile, may trigger a Minor Compaction
• Combines many small HFiles into one large one
• Saves disk reads
• May block further MemStore flushes, so try to keep to a minimum
Major Compaction 
• Happens at configurable times for the system
• E.g., once a week on weekends
• Defaults to once every 24 hrs
• Resource-intensive
• Don’t set it to “never”
• Reads in all HFiles and makes sure there is one HFile per Region per column family
• Purges deleted records
• Ensures that HDFS files are local
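A small sketch of the difference between the two compaction types, under simplifying assumptions: each HFile is a sorted list of (key, value) pairs holding one version per key, files are passed oldest first, and `TOMBSTONE` is a hypothetical stand-in for a delete marker. Both kinds merge files; only the major one drops deleted rows.

```python
TOMBSTONE = object()  # hypothetical delete marker used only in this sketch


def compact(hfiles, major=False):
    """Merge sorted HFiles (ordered oldest to newest) into one sorted HFile.

    Newer values overwrite older ones. Only a major compaction drops rows
    whose latest value is a delete marker.
    """
    merged = {}
    for hfile in hfiles:       # later (newer) files win on duplicate keys
        merged.update(hfile)
    rows = sorted(merged.items(), key=lambda kv: kv[0])
    if major:
        rows = [(k, v) for k, v in rows if v is not TOMBSTONE]
    return rows
```

Keeping the marker through minor compactions is deliberate in this model: an older HFile outside the merged set could still hold the deleted row, so the marker has to survive until every file is rewritten together.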
Tuning your DB - HBase Keys 
• Row Key – byte array 
• Best performance for Single Row Gets 
• Best Caching Performance 
• Key Design – 
• Distributes well – usually accomplished by hashing natural key 
• MD5 
• SHA1
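One way to make keys distribute well is to prefix the natural key with part of its hash, so that sequential natural keys land in different regions. The sketch below uses MD5 from Python's standard library; the function name and the 4-byte prefix length are assumptions chosen for illustration.

```python
import hashlib


def hashed_row_key(natural_key: str, prefix_len: int = 4) -> bytes:
    """Prefix the natural key with the first bytes of its MD5 digest."""
    raw = natural_key.encode("utf-8")
    prefix = hashlib.md5(raw).digest()[:prefix_len]
    return prefix + raw


if __name__ == "__main__":
    for key in ("user0001", "user0002", "user0003"):
        print(hashed_row_key(key).hex())
```

The trade-off is that hashed keys no longer sort in natural order, so range scans over neighbouring natural keys become scattered single-row reads.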
Tuning your DB - BlockCache 
• Each region server has a BlockCache where it stores file 
blocks that it has already read 
• Every read that is in the block increases performance 
• Don’t want your blocks to be much bigger than your rows 
• Modes of caching: 
• 2-level LRU cache, by default 
• Other options: BucketCache – can use DirectByteBuffers to 
manage off-heap RAM – better Garbage Collection stats on the 
region server
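To show why block-based caching rewards reads that fall in an already loaded block, here is a single-level LRU cache keyed by (file, block index). It is a deliberately reduced model: HBase's default cache has more than one level, and the class name, the rows-per-block sizing, and the `read_block` callback are all assumptions for this sketch.

```python
from collections import OrderedDict


class BlockCache:
    """Single-level LRU cache keyed by (file, block index)."""

    def __init__(self, capacity, block_size, read_block):
        self.capacity = capacity      # max number of cached blocks
        self.block_size = block_size  # rows per block in this toy model
        self.read_block = read_block  # callable(file, block_index) -> rows
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def get_row(self, file, row_index):
        key = (file, row_index // self.block_size)
        block = self.blocks.get(key)
        if block is None:
            self.disk_reads += 1                 # cache miss: go to disk
            block = self.read_block(*key)
            self.blocks[key] = block
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        else:
            self.blocks.move_to_end(key)         # mark as most recently used
        return block[row_index % self.block_size]
```

Every miss loads a whole block, so if blocks are much larger than the rows being requested, most of each cached block is data nobody asked for.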
Tuning your DB - Columns and Column 
Families 
• All columns in a column family are accessed together for reads
• Different column families are stored in different HFiles
• All column families are flushed at once when any MemStore is full
• Example: 
• Storing package tracking information: 
• Need package shipping info 
• Need to store each location in the path
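One possible layout for the package-tracking example, shown as plain Python data rather than real HBase calls. The family names `info` and `path`, the column names, and the sample values are all hypothetical; the point is that shipping info and the location history sit in separate column families, so a read of one does not touch the other.

```python
# Hypothetical table: one row per package, two column families.
table = {
    "pkg-1001": {
        "info": {"sender": "Alice", "recipient": "Bob", "weight_kg": "2.5"},
        "path": {"2014-01-01T09:00": "New York", "2014-01-02T14:30": "Chicago"},
    }
}


def get(table, row_key, families=None):
    """Return the requested column families of a row (all if none are named)."""
    row = table.get(row_key, {})
    wanted = row.keys() if families is None else families
    return {cf: dict(row[cf]) for cf in wanted if cf in row}


if __name__ == "__main__":
    # Reads only the shipping info; the location history is not returned.
    print(get(table, "pkg-1001", families=["info"]))
```

Using timestamps as column names in `path` lets each new location be added as a new column without rewriting the existing ones.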
Tuning your DB – Bloom Filters 
• Can be set on rows or columns 
• Keep an extra index of available keys 
• Slows down reads and writes a bit 
• Increases storage 
• Saves time checking if keys exist 
• Turn on if it is likely that client will request missing data
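A Bloom filter answers "might this key exist?" with no false negatives and a small rate of false positives, which is why it helps most when clients often ask for missing keys. The sketch below is a generic textbook version, not HBase's implementation; the bit count, the number of hashes, and the use of SHA-1 with a one-byte seed are assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: bytes):
        """Derive num_hashes bit positions by seeding SHA-1 with a counter."""
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A `False` answer lets the reader skip a file without touching disk, while a `True` answer still requires the real lookup.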
Tuning your DB – Short-Circuit Reads 
• HDFS exposes service interface 
• If the file is actually local, it is much faster to read the HFile directly off the disk
Current Industry Trends
Big Data in Finance – the challenges 
• Real-Time financial analysis 
• Reliability 
• “medium-data”
What Bloomberg is Working on 
• Working with Hortonworks on fixing real-time issues in 
Hadoop 
• Creating a framework for reliably serving real-time data 
• Presenting at Hadoop World and Hadoop Summit 
• Open source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs
Questions? 
• Thank you!

Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

Editor's Notes

  1. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  2. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
  3. Name node is the manager, data node is the worker
  4. Job Tracker = Resource Manager; Task Tracker = Node Manager. Number of jobs depends on the range of keys. Number of reducers is set by the user – you’d want it to correspond to the set of possible values. So, if the values are ASCII, you won’t want reducers to exceed 256. You also don’t want them to exceed the number of data nodes you have.
  5. Remember, HBase treats everything as a file system
  6. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  7. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  8. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  9. ZooKeeper quorum should be odd, as a majority is needed for consensus. Znode is the name of each attribute that is managed by ZooKeeper.
  10. All columns in a column family are read for a get – but not all column families unless specified
  11. Although there is a separate MemStore per column family, as soon as one is full, all of them are written to HFiles. Note also that deletes are handled with a marker, and only really purged at a major compaction.
  12. Thanks to Matt Hunt for this slide: http://www.slideshare.net/MatthewHunt1/hadoop-at-bloombergmedium-data-for-the-financial-industry
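The notes above on ZooKeeper quorum size can be checked with a little arithmetic. The helper names are made up for this sketch; the rule it encodes is only that a strict majority of peers must be alive to reach consensus.

```python
def majority(peers: int) -> int:
    """Smallest number of peers that forms a strict majority."""
    return peers // 2 + 1


def tolerated_failures(peers: int) -> int:
    """How many peers can fail while a majority is still alive."""
    return peers - majority(peers)


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, "peers tolerate", tolerated_failures(n), "failures")
```

Going from 3 to 4 peers, or from 5 to 6, adds a machine without adding any failure tolerance, which is the reason for choosing an odd quorum size.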