Hadoop in SIGMOD 2011
2011/5/20
Papers
- LCI: a social channel analysis platform for live customer intelligence
- Bistro data feed management system
- Apache Hadoop goes realtime at Facebook
- Nova: continuous Pig/Hadoop workflows
- A Hadoop-based distributed loading approach to parallel data warehouses
- A batch of PNUTS: experiences connecting cloud batch and serving systems
Papers (Continued)
- Turbocharging DBMS buffer pool using SSDs
- Online reorganization in read optimized MMDBS
- Automated partitioning design in parallel database systems
- Oracle database filesystem
- Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse
- Efficient processing of data warehousing queries in a split execution environment
- SQL Server column store indexes
- An analytic data engine for visualization in Tableau
Apache Hadoop Goes Realtime at Facebook
Workload Types
- Facebook Messaging: high write throughput, large tables, data migration
- Facebook Insights: realtime analytics, high-throughput increments
- Facebook Metrics System (ODS): automatic sharding, fast reads of recent data and table scans
Why Hadoop & HBase
- Elasticity
- High write throughput
- Efficient and low-latency strong consistency semantics within a data center
- Efficient random reads from disk
- High availability and disaster recovery
- Fault isolation
- Atomic read-modify-write primitives
- Range scans
- Tolerance of network partitions within a single data center
- Zero downtime in case of individual data center failure
- Active-active serving capability across different data centers
Realtime HDFS
- High availability: AvatarNode
  - Hot standby AvatarNode
  - Enhancements to HDFS transaction logging
  - Transparent failover: DAFS (client enhancement + ZooKeeper); a discovery sketch follows this list
  - Hadoop RPC compatibility
- Block availability: a pluggable block placement policy
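The ZooKeeper side of transparent failover can be illustrated with a short sketch: the client learns which AvatarNode is active from a znode instead of a hard-coded address, and re-resolves when the znode changes. This is not Facebook's DAFS code; the znode path, its payload, and the use of the kazoo client library are assumptions for illustration.

```python
# Sketch of the transparent-failover idea behind DAFS: clients discover the
# active AvatarNode through ZooKeeper rather than a fixed address.
# Requires the kazoo ZooKeeper client (pip install kazoo).
from kazoo.client import KazooClient

PRIMARY_ZNODE = "/hdfs/avatarnode/primary"   # hypothetical znode path

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

current_primary = {"addr": None}

@zk.DataWatch(PRIMARY_ZNODE)
def on_primary_change(data, stat):
    """Re-resolve the active AvatarNode whenever the znode changes."""
    if data is not None:
        current_primary["addr"] = data.decode("utf-8")
        print("active AvatarNode is now", current_primary["addr"])
    # A real client would also retry in-flight RPCs against the new primary.
```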
Realtime HDFS (Cont.)
- Performance improvements for a realtime workload
  - RPC timeout
  - Recover file lease (HDFS-append, recoverLease)
  - Reads from local replicas
- New features
  - HDFS sync
  - Concurrent readers (last chunk of data)
Production HBase
- ACID compliance (RWCC: Read Write Consistency Control)
  - Atomicity (WALEdit)
  - Consistency
- Availability improvements
  - HBase Master rewrite: region assignment state moved from master memory into ZooKeeper
  - Online upgrades
  - Distributed log splitting
- Performance improvements
  - Compaction
  - Read optimizations
Deployment and Operational Experiences
- Testing: auto-testing tool, HBase Verify
- Monitoring and tools: HBCK, more metrics
- Manual versus automatic splitting: add new RegionServers rather than splitting regions
- Dark launch (gray release)
- Dashboards / ODS integration
- Backups at the application layer
- Schema changes
- Importing data: LZO & zip compression to reduce network IO; major compaction
Nova: Continuous Pig/Hadoop Workflows
Nova Overview
- Scenarios
  - Ingesting and analyzing user behavior logs
  - Building and updating a search index from a stream of crawled web pages
  - Processing semi-structured data feeds
- Two-layer programming model (Nova over Pig)
  - Continuous processing
  - Independent scheduling
  - Cross-module optimization
  - Manageability features
Abstract Workflow Model
- Workflow
  - Two kinds of vertices: tasks (processing steps) and channels (data containers)
  - Edges connect tasks to channels and channels to tasks
  - Edge annotations (all, new, B and Δ)
- Four common patterns of processing
  - Non-incremental (template detection)
  - Stateless incremental (shingling)
  - Stateless incremental with lookup table (template tagging)
  - Stateful incremental (de-duping)
Abstract Workflow Model (Cont.)
- Data and update model
  - Blocks: base blocks and delta blocks
  - Channel functions: merge, chain and diff (see the sketch after this list)
- Task/data interface
  - Consumption mode: all or new
  - Production mode: B or Δ
- Workflow programming and scheduling
- Data compaction and garbage collection
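To make the block model concrete, here is a minimal Python sketch of the three channel functions. It is not Nova's actual API: the record layout (key/value upserts, with None marking a deletion) and the function signatures are assumptions for illustration.

```python
# Minimal sketch of Nova-style channel blocks. Assumed record model:
# a base block is a key->value dict; a delta block is a list of
# (key, value) upserts, where value None means "delete this key".

def merge(base, deltas):
    """Fold delta blocks into a base block, producing a new base."""
    state = dict(base)
    for delta in deltas:                 # deltas are applied in order
        for key, value in delta:
            if value is None:
                state.pop(key, None)     # deletion record
            else:
                state[key] = value       # upsert record
    return state

def chain(deltas):
    """Collapse consecutive delta blocks into one equivalent delta."""
    combined = {}
    for delta in deltas:
        for key, value in delta:
            combined[key] = value        # later records win
    return list(combined.items())

def diff(new_base, old_base):
    """Compute the delta that transforms old_base into new_base."""
    delta = [(k, v) for k, v in new_base.items() if old_base.get(k) != v]
    delta += [(k, None) for k in old_base if k not in new_base]
    return delta

# Example: one base block and two deltas
base = {"url1": "pageA", "url2": "pageB"}
d1 = [("url2", "pageB2"), ("url3", "pageC")]
d2 = [("url1", None)]                    # url1 deleted
print(merge(base, [d1, d2]))             # {'url2': 'pageB2', 'url3': 'pageC'}
print(chain([d1, d2]))
```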
Nova System Architecture
Efficient Processing of Data Warehousing Queries in a Split Execution Environment
Introduction
- Two approaches
  - Starting with a parallel database system and adding some MapReduce features
  - Starting with MapReduce and adding database system technology
  - HadoopDB follows the second approach
- Two heuristics for HadoopDB optimizations
  - Database systems can process data at a faster rate than Hadoop, so push work into them
  - Minimize the number of MapReduce jobs in a SQL execution plan
HadoopDB
- Architecture: Database Connector, Data Loader, Catalog, Query Interface
- VectorWise/X100 database (SIMD) vs. PostgreSQL
- Query execution
  - Selection, projection, and partial aggregation (Map and Combine) pushed into the database system
  - Co-partitioned tables; MR for redistributing data
  - SideDB (a "database task done on the side")
Split Query Execution
- Referential partitioning
  - Join in the database engine: local foreign-key joins enabled by referential partitioning
- Split MR/DB joins (a broadcast-join sketch follows this list)
  - Directed join: one of the tables is already partitioned by the join key
  - Broadcast join: a small table is shipped to every node
  - Adding specialized joins to the MR framework: the map-side join; tradeoff: a temporary table for the join
  - Another type of join: MR redistributes the data, then a directed join
- Split MR/DB semijoin
  - Like 'foreignKey IN (listOfValues)'
  - Can be split into two MapReduce jobs
  - SideDB eliminates the first MapReduce job
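The broadcast (map-side) join is the easiest of these to illustrate. Below is a minimal Python sketch of the idea outside any real MR framework; the tables and field names are invented. The small table is replicated into memory on every node, so each mapper joins its partition of the large table without a redistribution step.

```python
# Sketch of a broadcast (map-side) join: the small table is replicated
# to every node and joined in the map phase, avoiding a shuffle.
# Data and schema here are invented for illustration.

small_table = {                      # dimension table, broadcast to all nodes
    1: "books", 2: "music", 3: "games",
}

big_table_partition = [              # one node's partition of the fact table
    {"sale_id": 10, "dept_id": 2, "amount": 9.99},
    {"sale_id": 11, "dept_id": 1, "amount": 24.50},
    {"sale_id": 12, "dept_id": 9, "amount": 3.00},   # no matching dept
]

def map_side_join(rows, broadcast):
    """Each mapper probes the in-memory copy of the small table."""
    for row in rows:
        dept = broadcast.get(row["dept_id"])
        if dept is not None:         # inner join: drop non-matching rows
            yield {**row, "dept_name": dept}

for joined in map_side_join(big_table_partition, small_table):
    print(joined)
```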
Split Query Execution (Cont.)
- Post-join aggregation
  - Two MapReduce jobs
  - Hash-based partial aggregation saves significant I/O (see the sketch after this list)
  - A similar technique applies to TOP N selections
- Pre-join aggregation
  - For MR-based joins
  - Pays off when the cardinality of the group-by and join-key columns is smaller than the cardinality of the entire table
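A minimal Python sketch of hash-based partial aggregation (the combiner pattern); keys and values are invented. Each map task collapses its output into per-key partials in a hash table, so only one record per distinct key is written and shuffled, which is where the I/O savings come from.

```python
# Sketch of hash-based partial (combiner-style) aggregation: each map task
# pre-aggregates its output in a hash table before the shuffle.
from collections import defaultdict

def partial_aggregate(mapper_output):
    """Run on each map task: collapse (key, value) pairs before the shuffle."""
    partials = defaultdict(int)
    for key, value in mapper_output:
        partials[key] += value           # e.g. SUM; COUNT/MIN/MAX work the same way
    return list(partials.items())        # one record per distinct key

def final_aggregate(all_partials):
    """Run in the reduce phase: combine partial sums from every map task."""
    totals = defaultdict(int)
    for partials in all_partials:
        for key, value in partials:
            totals[key] += value
    return dict(totals)

# Two map tasks, each emitting many rows but shuffling only per-key partials
task1 = [("books", 1), ("music", 1), ("books", 1)]
task2 = [("books", 1), ("games", 1)]
print(final_aggregate([partial_aggregate(task1), partial_aggregate(task2)]))
# {'books': 3, 'music': 1, 'games': 1}
```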
A Query Plan in HadoopDB
Performance
- No hash partition feature in Hive
Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
DB2 and Hadoop/Jaql Interactions
A Hadoop-Based Distributed Loading Approach to Parallel Data Warehouses
Introduction
- Why Hadoop for Teradata EDW
  - More disk space, and space can be easily added
  - HDFS as storage
  - MapReduce
- Assigning distributed HDFS blocks to Teradata EDW nodes is an assignment problem
  - Parameters: n blocks, k copies, m nodes
  - Goal: assign HDFS blocks to nodes evenly while minimizing network traffic
Block Assignment Problem
- HDFS file F on a cluster of P nodes (each node is uniquely identified by an integer i, where 1 ≤ i ≤ P)
- The problem is defined by assignment(X, Y, n, m, k, r)
  - X is the set of n blocks of F: X = {1, ..., n}
  - Y is the set of m nodes running the PDBMS (called PDBMS nodes): Y ⊆ {1, ..., P}
  - k is the number of copies of each block
  - r is the mapping recording the replica locations: r(i) returns the set of nodes holding a copy of block i
- An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j
Block Assignment Problem (Cont.)
- An even assignment g is an assignment such that ∀i ∈ Y, ∀j ∈ Y: | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1, i.e., any two nodes receive block counts differing by at most one
- The cost of an assignment g is cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, the number of blocks assigned to remote nodes. |g| denotes the number of blocks assigned to local nodes, so |g| = n − cost(g)
- The optimal assignment problem is to find an even assignment with the smallest cost
OBA Algorithm
- Example instance: (X, Y, n, m, k, r) = ({1, 2, 3}, {1, 2}, 3, 2, 1, {1 → {1}, 2 → {1}, 3 → {2}})
- A greedy sketch of the assignment problem follows below.
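The slide does not reproduce the OBA algorithm itself, so as a hedged illustration of the problem statement, here is a simple greedy Python sketch: fill each node's even quota with locally replicated blocks first, then spill the remainder to under-loaded nodes. It happens to find an optimal (cost-0) assignment on the slide's instance, but it is not claimed to be the paper's OBA algorithm, and greedy placement is not optimal in general.

```python
# Greedy sketch of the even block-assignment problem (not the paper's OBA
# algorithm): fill each node up to its even quota with local blocks first,
# then place the remaining blocks on under-loaded nodes (remote reads).

def greedy_even_assignment(X, Y, r):
    n, m = len(X), len(Y)
    # Even quota: each node gets floor(n/m) or ceil(n/m) blocks.
    quota = {j: n // m + (1 if idx < n % m else 0)
             for idx, j in enumerate(sorted(Y))}
    g, load = {}, {j: 0 for j in Y}
    # Pass 1: local placement where a replica holder still has quota.
    for i in X:
        for j in r[i]:
            if j in Y and load[j] < quota[j]:
                g[i], load[j] = j, load[j] + 1
                break
    # Pass 2: remaining blocks go to any node with spare quota.
    for i in X:
        if i not in g:
            j = next(j for j in Y if load[j] < quota[j])
            g[i], load[j] = j, load[j] + 1
    cost = sum(1 for i in X if g[i] not in r[i])  # blocks read remotely
    return g, cost

# The slide's instance: 3 blocks, 2 PDBMS nodes, 1 copy per block.
X, Y = [1, 2, 3], [1, 2]
r = {1: {1}, 2: {1}, 3: {2}}
g, cost = greedy_even_assignment(X, Y, r)
print(g, cost)   # {1: 1, 2: 1, 3: 2} with cost 0: even and all-local
```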
