© Copyright 2010 EMC Corporation. All rights reserved.
RDBMS and Hadoop
A Powerful Combination
Jacque Istok
You Know Hadoop, But What Is Greenplum?
EMC/Greenplum is a massively parallel processing (MPP) data warehouse system, based on PostgreSQL, with the full capabilities of a traditional RDBMS. Alongside SQL-99 compliance for structured analysis, Greenplum also offers a MapReduce implementation for unstructured analysis. In short:
Greenplum ~ Hadoop/Hive
Data in a Typical Enterprise
• Data is everywhere: corporate EDW, 100s of data marts, ‘shadow’ databases, spreadsheets, logs, etc.
• The goal of centralizing all data in a single EDW has proven untenable
[Diagram: share of enterprise data]
EDW: ~10% of data
Data marts and ‘personal databases’: ~90% of data
Today’s Big Data Challenges
• The number of data sources and the amount of data to analyze are growing exponentially
• Data goes stale because DW solutions cannot ingest the vast amounts of data fast enough
• Advanced analytics and complex queries lack performance
• The number of users and their concurrency are increasing rapidly
• Security and privacy around the data are both desired and often mandated
Architecture of HDFS/Hadoop/Hive
Process large datasets with support for both SQL and MapReduce
• Hive Server accepts SQL (a subset) and dynamically generates and executes MapReduce code
• Flexible framework for processing large datasets
• Data subsets are materialized to reduce the impact of node failure
• DataNode servers process analytics close to the data, in parallel
[Diagram labels: SQL (subset), Hive, MapReduce, NameNode, DataNode (repeated)]
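The Hive flow on this slide, SQL in and MapReduce out, can be illustrated with a small pure-Python sketch. It is not Hive's generated code; it only shows how a query such as SELECT action, COUNT(*) ... GROUP BY action decomposes into map, shuffle, and reduce steps. The sample rows and the column names (user, action) are assumptions made up for the illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table; column names are assumptions.
ROWS = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "login"},
    {"user": "alice", "action": "query"},
    {"user": "alice", "action": "login"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for row in rows:
        yield row["action"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_action(rows):
    # Roughly: SELECT action, COUNT(*) FROM rows GROUP BY action
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_action(ROWS))
```

In a real cluster the map and reduce steps run on many nodes at once; here they run in one process so the data flow stays visible.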
Architecture of Greenplum
Process large datasets with support for both SQL and MapReduce
• Master servers optimize queries for the most efficient query execution
• MPP Scatter/Gather streaming for fast loading of data
• Flexible framework for processing large datasets
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data, in parallel
[Diagram labels: SQL, MapReduce, Master, Segment (repeated)]
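The master/segment split described on this slide can be pictured with a minimal sketch: rows are scattered to segments by hashing a distribution key, each segment aggregates only its local rows, and the master gathers the partial results. This is an illustration under stated assumptions, not Greenplum's implementation. The segment count, the CRC32 hash, the sample events, and every function name below are hypothetical, and the segments run one after another here rather than in parallel.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # assumed segment count, for illustration only

# Hypothetical event rows; field names are assumptions.
EVENTS = [
    {"host": "web1", "severity": "high"},
    {"host": "web2", "severity": "low"},
    {"host": "db1", "severity": "low"},
    {"host": "web1", "severity": "low"},
    {"host": "db2", "severity": "high"},
]


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Choose a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def scatter(rows, key_field, num_segments=NUM_SEGMENTS):
    """Distribute rows across segments by the hash of key_field."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def segment_count(rows, group_field):
    """Work done on one segment: a partial aggregate over local rows only."""
    return Counter(row[group_field] for row in rows)


def gather(partials):
    """Work done on the master: merge the per-segment partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def parallel_count(rows, key_field, group_field):
    segments = scatter(rows, key_field)
    partials = [segment_count(seg, group_field) for seg in segments]
    return gather(partials)


if __name__ == "__main__":
    print(parallel_count(EVENTS, "host", "severity"))
```

Because each segment returns only a small partial aggregate, the master merges a handful of counters instead of scanning every row, which is the point of processing "close to the data".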
RDBMS Advantages
Common Real World Implementation
Lots o’ Data
A Cyber-Analytics Data Mart Use Case
• Commercial SIEM products struggle with the volumes of data generated in a large enterprise. Non-parallel event processing systems can’t keep up with ingest, user load, etc.
• Greenplum provides the ability to cost-effectively ingest and store large volumes of sensor data.
• Greenplum provides the parallel analytics that support data mining, event correlation, etc., over datasets from terabytes to petabytes in size.
[Diagram labels: Access and Events; GPLoad; Greenplum Analytics Data Mart; SQL; MapReduce (Perl), (Python Math Lib), (R); SoR; ETL; ODS; BI]
Coexistence Approach – Use Case
[Diagram labels: Compute, Storage, Analytics, Network; General Purpose X86 Cluster of Systems]
• Provides true, complete SQL-compliant analytics
• Data can be read from and written to Hadoop via Greenplum
• Store your data structured or unstructured, column- or row-oriented, and compressed, leveraging index support where appropriate
• SQL can be executed, through Greenplum, on data residing within Greenplum as well as data residing within HDFS
• MapReduce can be executed through Greenplum in Java, C, Perl, or Python, or through Java in Hadoop
• Designed for rapid analysis of data volumes scaling from less than a terabyte into the petabytes
Big Data is Complementary to EDW
[Diagram labels: Commodity Hardware, Virtual Machines, Public Cloud, Greenplum]
Enterprise Data Warehouse
• Single Source of Truth
• 1 Logical Model
• Heavy data governance and quality
• Operational Reporting
• Financial Consolidation
MapReduce Analytics Cloud
• Source of all raw data (often 10X size of EDW)
• Self-service infrastructure to support multiple marts and sandboxes
• Rapid analytic iteration and business-owned solutions

Greenplum - Jacque Istok - Hadoop World 2010
