MetaScale is a subsidiary of
Sears Holdings Corporation
The 3 Ts of Hadoop
Wuheng Luo
Ankur Gupta
06.2013
The 3 Ts of Hadoop
3-Stage Circular Process of Enterprise Big Data
What are the 3Ts?
3Ts = Transfer, Transform, and Translate
A new enterprise big data pattern:
• To bring disruptive change to conventional ETL
• To leverage Hadoop for streamlining data processes
• To move toward real-time analytics
The 3Ts Goal
To simplify enterprise data processing and reduce the latency of turning
enterprise data from raw form into products of discovery, so as to better
support business decisions.
The 3Ts One-Liners
Transfer
Once the Hadoop system is in place, a mandate is needed to
immediately and continuously capture and deliver all enterprise data,
from all data sources, through all data systems, to Hadoop, and store the
data under HDFS.
Transform
Once source data is in, clean, standardize, and convert the data through
dimensional modeling. Data transformation should be performed in place
within Hadoop, without moving the data out again for integration purposes.
Translate
Finish the data flow cycle by turning analytical data aggregated in
Hadoop into data products of business wisdom. Use batch and streaming
tools built on top of Hadoop to interact with data scientists and end users.
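To make the circular flow concrete, here is a minimal sketch assuming (hypothetically) that each T owns a zone under HDFS; the zone names and partitioned path layout below are illustrative assumptions, not something prescribed in this deck.

# Hypothetical HDFS zones for the 3Ts flow; names and layout are assumptions.
TRANSFER_ZONE = "/data/raw"            # Transfer: landing area for all source data
TRANSFORM_ZONE = "/data/standardized"  # Transform: cleaned, dimensionally modeled data
TRANSLATE_ZONE = "/data/products"      # Translate: aggregated data products for end users

def zone_path(zone: str, source: str, dataset: str, dt: str) -> str:
    """Build a partitioned HDFS path, e.g. /data/raw/pos/sales/dt=2013-06-27."""
    return f"{zone}/{source}/{dataset}/dt={dt}"

if __name__ == "__main__":
    # The same dataset moves through all three zones in one flow cycle.
    for zone in (TRANSFER_ZONE, TRANSFORM_ZONE, TRANSLATE_ZONE):
        print(zone_path(zone, "pos", "sales", "2013-06-27"))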
Hadoop as Enterprise Data Hub
“Data Hub” is not a new concept, but:
Conventional Data Hub → Hadoop Enterprise Data Hub
• RDBMS or EDW based → Hadoop ecosystem based
• No consistent architectural style (ODS, MDM, messaging or publish-subscribe, etc.) → 3-phase architecture covering the full enterprise data flow cycle from data source to data products
• Heavily rely on ETL → 3Ts-driven
• Intermediate, partial, siloed → True center of enterprise data
• …
TRANSFER
Sourcing Data into Hadoop
Intent
Continuously capture all enterprise data at the earliest possible touch
points, deliver the data from all sources, through all source data
systems, to Hadoop, and store the data under HDFS.
TRANSFER
Motivation
To gain a distinctive competitive capability, enterprises need to
build an integrated data infrastructure as the foundation
for big data analytics. Use Hadoop as THE centralized
enterprise data repository, and make it the grand
destination for all enterprise source data.
TRANSFER
(3 Ts’) Transfer vs. (ETL’s) Extract
Traditional ETL (Extract) → Hadoop 3Ts (Transfer)
• Bottom-up → Top-down
• Task/project specific → Enterprise-wide mandate
• Passive → Proactive
• Data is not available when needed → Data is ready when needed
• Same datasets are moved around again and again, with no value added → Move the data once and use it many times, with value increased each time
TRANSFER
Consequences
Before → After
• Isolated, disconnected in various siloed data/file systems → Consolidated and centralized in Hadoop
• Monolithically segmented → Heterogeneous, diverse, huge
• Separated and partial → Federated and holistic
TRANSFER
Implementation
• Always do a data gap analysis first
• Fork the ingestion into both batch and streaming paths if needed
• Have a delivery plan for each data feed
• Synchronize data changes between source systems and Hadoop (a minimal ingestion sketch follows this list)
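As a hedged illustration of the batch leg of Transfer, the sketch below shells out to the standard hadoop fs CLI; the local extract path, HDFS landing partition, and the idea of reusing the same layout for streaming feeds are assumptions for the example, not details from the slides.

import subprocess

# Hypothetical locations; adjust to the real source extract and landing zone.
LOCAL_EXTRACT = "/staging/pos/sales_2013-06-27.tsv"
HDFS_PARTITION = "/data/raw/pos/sales/dt=2013-06-27"

def hdfs(*args: str) -> None:
    """Run a 'hadoop fs' subcommand and raise if it fails."""
    subprocess.run(["hadoop", "fs", *args], check=True)

# Batch leg: land the extract in the raw zone as-is.
hdfs("-mkdir", "-p", HDFS_PARTITION)
hdfs("-put", LOCAL_EXTRACT, HDFS_PARTITION)

# A streaming leg (Flume, Kafka consumers, etc.) would write into the same
# partition layout, so downstream transforms see one consistent raw zone.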
TRANSFORM
Integrating Data within Hadoop
Intent
Keep the data flow beyond the ingest phase by transforming
the data from dirty to clean, from raw to standardized, and
from transactional to analytical, all within Hadoop.
TRANSFORM
Motivation
As the latency from raw data to business insight
becomes the focal point of enterprise data analytics, use
Hadoop as the data integration platform to perform in-place
data transformation.
TRANSFORM
Implementation
• Partition enterprise-wide standardized data and job-specific analytical data in HDFS, and retain history
• Use dimensional modeling to transform and standardize, making dimensional data the atomic unit of enterprise data
• Identify all enterprise data entities, and add the finest-grain attributes to each entity as dimensional data
• Take a bottom-up approach, but also consider data usage across the enterprise rather than binding data to a specific task (a minimal transform sketch follows this list)
TRANSFORM
(3 Ts’) Transform vs. (ETL’s) Transform
Transform in ETL/ELT → Transform in 3Ts
• In vitro, outside Hadoop → In vivo, within Hadoop
• Use Hadoop as rental space → Use Hadoop as integration platform
• Non-value-adding data movement between data storage and transformation → Data is transformed while flowing from one partition to another under HDFS
• High latency → Low latency
• Network bottleneck → Data locality
TRANSLATE
Making Data Products out of Hadoop
Intent
Turn analytical data into data products of business wisdom
using home-grown or commercial analytics tools built on
top of Hadoop. Business decisions supported by data
products will help generate more new data, starting a new
round of the enterprise data flow cycle…
TRANSLATE
Motivation
Low-latency big data analytics requires the right platform and tools.
Use Hadoop as the platform of choice for enterprise data
analytics because of its openness and flexibility.
Choose analytical tools that are flexible, agile, interactive,
and user friendly.
TRANSLATE
Implementation
• Big data analytics takes a team effort
• Include statisticians, data scientists, and developers
• Utilize both generic and Hadoop-specific technologies
• Consider both batch- and streaming-based approaches
• Provide access to both pre-computed views and on-the-fly queries (a minimal sketch of a pre-computed view follows this list)
• Use both home-grown and Hadoop-based commercial tools
• Use web-based, mobile-friendly UIs
• Visualize
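A hedged sketch of the pre-computed-view path of Translate, again through the Hive CLI; the product table and aggregation are illustrative assumptions, and interactive or streaming access would sit alongside this batch job.

import csv
import io
import subprocess

# Hypothetical data product: a daily sales-by-store aggregate that a web or
# mobile UI can read directly instead of scanning detail data on the fly.
PRODUCT_QUERY = """
INSERT OVERWRITE TABLE products.daily_store_sales PARTITION (dt='2013-06-27')
SELECT store_id, SUM(sale_amount) AS total_sales, COUNT(*) AS txn_count
FROM standardized.sales
WHERE dt = '2013-06-27'
GROUP BY store_id;
"""
subprocess.run(["hive", "-e", PRODUCT_QUERY], check=True)

# Read the pre-computed view back for a report or visualization layer
# (the Hive CLI prints tab-separated rows to stdout).
result = subprocess.run(
    ["hive", "-e",
     "SELECT store_id, total_sales FROM products.daily_store_sales WHERE dt='2013-06-27'"],
    check=True, capture_output=True, text=True)
rows = list(csv.reader(io.StringIO(result.stdout), delimiter="\t"))
print(rows[:5])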
The 3 Ts of Hadoop
Continuous Iteration of Enterprise Data Flow
Thank You!
For further information
email: contact@metascale.com
visit: www.metascale.com
MetaScale is a subsidiary of
Sears Holdings Corporation