Introduction to Map/Reduce Data Transformations. Tasso Argyros, CTO and Co-Founder, Aster Data Systems. [email_address]
A Brief History of MapReduce Confidential and proprietary. Copyright © 2008 Aster Data Systems
What is MapReduce?
Why is MapReduce Useful?
Diagram: four servers (Server A, Server B, Server C, Server D), connected by a switch, store the input documents: "The quick brown fox jumps over the lazy dog." "To be or not to be: that is the question." "The world only needs five computers." "Hello world." "In-Database MapReduce is the future." "MapReduce is a very powerful programming paradigm."
Goal: We want to count the number of times each word occurs.
1st Approach: No MapReduce
Diagram: Servers A, B, C, and D, connected by a switch, with their documents: "The quick brown fox jumps over the lazy dog" "To be or not to be: that is the question." "The world only needs five computers." "Hello world." "In-Database MapReduce is the future." "MapReduce is a very powerful concept." The same documents as lowercase words without punctuation: the quick brown fox jumps over the lazy dog; in database mapreduce is the future; the world only needs five computers; hello world; mapreduce is a very powerful concept; to be or not to be that is the question.
Server 4 Final Result File: the 5, is 3, mapreduce 2, …
What Did We Do?
2nd Approach: No MapReduce, Fully Distributed
Diagram: Servers A, B, C, and D, connected by a switch, with their documents: "The quick brown fox jumps over the lazy dog" "To be or not to be: that is the question." "The world only needs five computers." "Hello world." "In-Database MapReduce is the future." "MapReduce is a very powerful concept." The same documents as lowercase words without punctuation: the quick brown fox jumps over the lazy dog; in database mapreduce is the future; the world only needs five computers; hello world; mapreduce is a very powerful concept; to be or not to be that is the question. Words after redistribution, with identical words placed together: the the the the the database database future world world powerful lazy brown mapreduce mapreduce be be to jumps computers hello is is is question over a that
Server 1 Final Result File: the 5, … Server 2 Final Result File: world 2, … Server 3 Final Result File: mapreduce 2, … Server 4 Final Result File: is 3, …
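The fully distributed approach only works if every copy of a given word ends up on the same server. The slides do not say how words are assigned to servers; the sketch below assumes a stable hash of the word, taken modulo the number of servers, picks the owner. The names `normalize`, `owner_of`, and `distributed_count`, and the placement of documents on servers, are illustrative assumptions rather than anything from the deck.

```python
import re
import zlib
from collections import Counter

NUM_SERVERS = 4  # assumption: four servers, as in the diagram


def normalize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def owner_of(word, num_servers=NUM_SERVERS):
    """Hypothetical routing rule: stable hash of the word modulo server count."""
    return zlib.crc32(word.encode("utf-8")) % num_servers


def distributed_count(documents_per_server):
    # Step 1: every server routes each of its words to that word's owner.
    inboxes = [[] for _ in range(NUM_SERVERS)]
    for documents in documents_per_server:
        for doc in documents:
            for word in normalize(doc):
                inboxes[owner_of(word)].append(word)
    # Step 2: every server counts only the words it received.
    return [Counter(inbox) for inbox in inboxes]


# Assumed placement of the example documents on the four servers.
documents_per_server = [
    ["The quick brown fox jumps over the lazy dog."],
    ["To be or not to be: that is the question."],
    ["The world only needs five computers.", "Hello world."],
    ["In-Database MapReduce is the future.", "MapReduce is a very powerful concept."],
]
results = distributed_count(documents_per_server)
```

Each entry of `results` plays the role of one server's final result file. Writing the routing, the network transfer, and the failure handling by hand is the part the deck calls a pain.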
2nd Approach: No MapReduce, Distributed
Does it work? Yes. Is it a pain? Yes!! Does it take lots of time? Yes! Would you do it? No!!!
Moreover…
Map(): Input: any file (e.g. documents). Output: stream of <key, value> pairs (e.g. <word, count> pairs).
Data Redistribution and Grouping
Reduce(): Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = "the"). Output: anything (e.g. sum of counts for a specific word).
Map() and Redistribution Phase (Servers A, B, C, and D, connected by a switch). Map() on "The quick brown fox jumps over the lazy dog" emits <the,1> <quick,1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1>. Map() on "In-Database MapReduce is the future." emits <in,1> <database,1> <mapreduce,1> <is,1> <the,1> <future,1>. After redistribution through the switch, identical keys sit together: <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1>, and <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1>.
Grouping and Reduce() Phase (on Server 1). Input pairs: <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1>. Three Reduce() calls produce the Server 1 Final Result File: the 5, database 2, future 1.
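The three phases on these slides (Map(), redistribution and grouping, Reduce()) can be sketched in a single Python process. This is only an illustration under simplifying assumptions: a real MapReduce system runs the map and reduce functions on many servers and moves the pairs over the network, and the function names `map_fn`, `group_by_key`, `reduce_fn`, and `word_count` are made up for this sketch.

```python
import re
from collections import defaultdict


def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in one document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield word, 1


def group_by_key(pairs):
    """Redistribution and grouping: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce(): sum the counts for one word."""
    return word, sum(counts)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(word, counts)
                for word, counts in group_by_key(pairs).items())


documents = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]
totals = word_count(documents)
```

Only `map_fn` and `reduce_fn` are specific to word counting. Everything in between is generic, which is why a framework can take over distribution and grouping.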
What Just Happened?
Word Count was Only an Example! "The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce." (Google 2004 MapReduce paper)
Word Count was Only an Example! "We adapt Google's MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN)." (Stanford 2006 AI Lab paper)
Result?
But…
Beyond SQL and MapReduce
SQL vs MapReduce: Two different worlds?
Implementing MR in the Database
The SQL/MR Process
SQL/MR Function: Syntax. Annotated parts of the query: (1) source table or sub-select; (2) <key> for data redistribution; (3) sort before the MR function; (4) Java/Python/… MR function; (5) select output (e.g. count). Also: optional MR_Function arguments; optional conditions and filters.
Example 1: Tokenization
Example 2: Sessionization
Example 2: Sessionization. Session Timeout = 60 seconds.

INPUT (Clickstream)
timestamp  userid
10:00:00   Shawn1
00:58:24   PrezBush
10:00:24   Shawn1
02:30:33   PrezBush
10:01:23   Shawn1
10:02:40   Shawn1

OUTPUT
timestamp  userid    sessionid
10:00:00   Shawn1    0
10:00:24   Shawn1    0
10:01:23   Shawn1    0
10:02:40   Shawn1    1

timestamp  userid    sessionid
00:58:24   PrezBush  0
02:30:33   PrezBush  1
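The sessionization example can be reproduced with a short Python sketch: partition the clickstream by userid, sort each user's clicks by timestamp, and start a new session whenever the gap to the previous click exceeds the timeout. The function name `sessionize` is illustrative, and two details are assumptions because the slide does not state them: a gap of exactly 60 seconds stays in the same session, and all timestamps fall on the same day.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout_seconds=60):
    """Assign per-user session ids; a gap longer than the timeout starts a new session."""
    timeout = timedelta(seconds=timeout_seconds)
    parsed = [(datetime.strptime(ts, "%H:%M:%S"), ts, user) for ts, user in clicks]
    # Partition by user, and sort by time within each user.
    parsed.sort(key=lambda row: (row[2], row[0]))
    output = []
    for user, rows in groupby(parsed, key=itemgetter(2)):
        session_id, previous = 0, None
        for when, ts, _ in rows:
            # Assumption: only a gap strictly greater than the timeout splits a session.
            if previous is not None and when - previous > timeout:
                session_id += 1
            output.append((ts, user, session_id))
            previous = when
    return output


clickstream = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]
sessions = sessionize(clickstream)
```

The per-user loop needs the rows in time order and carries state from one row to the next, which is awkward to express in plain SQL and natural in a procedural function applied to each partition.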
MR Applications in the Database
Summary. Contact: [email_address] (questions, comments); asterdata.com/blog (lots of technical details); 1.888.Aster.Data (any other information).

Weitere ähnliche Inhalte

Was ist angesagt?

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Updatevithakur
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemAdarsh Pannu
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Latent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkLatent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkSandy Ryza
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...Databricks
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxSumant Tambe
 

Was ist angesagt? (20)

Boston Spark Meetup event Slides Update
Boston Spark Meetup event Slides UpdateBoston Spark Meetup event Slides Update
Boston Spark Meetup event Slides Update
 
Apache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating SystemApache Spark: The Analytics Operating System
Apache Spark: The Analytics Operating System
 
Spark at-hackthon8jan2014
Spark at-hackthon8jan2014Spark at-hackthon8jan2014
Spark at-hackthon8jan2014
 
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Latent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkLatent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with Spark
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Reactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and RxReactive Stream Processing Using DDS and Rx
Reactive Stream Processing Using DDS and Rx
 

Andere mochten auch

MapReduce for Idiots
MapReduce for IdiotsMapReduce for Idiots
MapReduce for Idiotspetewarden
 
Big data vccorp
Big data vccorpBig data vccorp
Big data vccorpTuan Hoang
 
Bfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryBfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryGlobalsion Software Sdn Bhd
 
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...AIIM International
 
Technology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesTechnology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesChris Reynolds
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Non-Relational Databases & Key/Value Stores
Non-Relational Databases & Key/Value StoresNon-Relational Databases & Key/Value Stores
Non-Relational Databases & Key/Value StoresJoël Perras
 
A Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsA Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsScott Abel
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementDATAVERSITY
 
Alfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco Software
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions Alfresco Software
 
Intro To Alfresco Part 1
Intro To Alfresco Part 1Intro To Alfresco Part 1
Intro To Alfresco Part 1Jeff Potts
 
EDRMS Pre implementation project plan
EDRMS Pre implementation project planEDRMS Pre implementation project plan
EDRMS Pre implementation project planDonna_Maree_Findlay
 
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le Dat
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le DatBig data 5Vs 2014 - View from World to Vietnam by Dinh Le Dat
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le DatDinh Le Dat (Kevin D.)
 
Alfresco 5.2 REST API
Alfresco 5.2 REST APIAlfresco 5.2 REST API
Alfresco 5.2 REST APIJ V
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
On business capabilities, functions and application features
On business capabilities, functions and application featuresOn business capabilities, functions and application features
On business capabilities, functions and application featuresJörgen Dahlberg
 
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)Trieu Nguyen
 

Andere mochten auch (20)

MapReduce for Idiots
MapReduce for IdiotsMapReduce for Idiots
MapReduce for Idiots
 
Big data vccorp
Big data vccorpBig data vccorp
Big data vccorp
 
DMAvatar
DMAvatarDMAvatar
DMAvatar
 
Bfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryBfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare Industry
 
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
 
Technology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesTechnology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance Companies
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Non-Relational Databases & Key/Value Stores
Non-Relational Databases & Key/Value StoresNon-Relational Databases & Key/Value Stores
Non-Relational Databases & Key/Value Stores
 
A Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsA Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your Documents
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data Management
 
Alfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture Overview
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Intro To Alfresco Part 1
Intro To Alfresco Part 1Intro To Alfresco Part 1
Intro To Alfresco Part 1
 
EDRMS Pre implementation project plan
EDRMS Pre implementation project planEDRMS Pre implementation project plan
EDRMS Pre implementation project plan
 
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le Dat
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le DatBig data 5Vs 2014 - View from World to Vietnam by Dinh Le Dat
Big data 5Vs 2014 - View from World to Vietnam by Dinh Le Dat
 
Alfresco 5.2 REST API
Alfresco 5.2 REST APIAlfresco 5.2 REST API
Alfresco 5.2 REST API
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
On business capabilities, functions and application features
On business capabilities, functions and application featuresOn business capabilities, functions and application features
On business capabilities, functions and application features
 
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)
TỔNG QUAN VỀ DỮ LIỆU LỚN (BIGDATA)
 

Ähnlich wie Introduction to MapReduce Data Transformations

What's New in ArcGIS 10.1 Data Interoperability Extension
What's New in ArcGIS 10.1 Data Interoperability ExtensionWhat's New in ArcGIS 10.1 Data Interoperability Extension
What's New in ArcGIS 10.1 Data Interoperability ExtensionSafe Software
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Cloud Computing ...changes everything
Cloud Computing ...changes everythingCloud Computing ...changes everything
Cloud Computing ...changes everythingLew Tucker
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7Paul Lo
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystemGrzegorz Kolpuc
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReducecoolmirza143
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?String Comparison Surprises: Did Postgres lose my data?
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
Sql on hadoop the secret presentation.3pptx
Sql on hadoop  the secret presentation.3pptxSql on hadoop  the secret presentation.3pptx
Sql on hadoop the secret presentation.3pptxPaulo Alonso
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The CloudsJacky Chu
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...confluent
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 

Ähnlich wie Introduction to MapReduce Data Transformations (20)

What's New in ArcGIS 10.1 Data Interoperability Extension
What's New in ArcGIS 10.1 Data Interoperability ExtensionWhat's New in ArcGIS 10.1 Data Interoperability Extension
What's New in ArcGIS 10.1 Data Interoperability Extension
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Cloud Computing ...changes everything
Cloud Computing ...changes everythingCloud Computing ...changes everything
Cloud Computing ...changes everything
 
Big Data Meetup #7
Big Data Meetup #7Big Data Meetup #7
Big Data Meetup #7
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystem
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReduce
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Introduction to MapReduce Data Transformations

  • 1. Introduction to Map/Reduce Data Transformations. Tasso Argyros, CTO and Co-Founder, Aster Data Systems
  • 2. A Brief History of MapReduce Confidential and proprietary. Copyright © 2008 Aster Data Systems
  • 3.
  • 4.
  • 5. Diagram: Servers A, B, C, and D, connected through a switch, hold the following documents:
      • The quick brown fox jumps over the lazy dog.
      • To be or not to be: that is the question.
      • The world only needs five computers.
      • Hello world.
      • In-Database MapReduce is the future.
      • MapReduce is a very powerful programming paradigm.
  • 6. Goal: We want to count the number of times each word occurs.
  • 7. 1st Approach: No MapReduce
  • 8. Diagram: Servers A, B, C, and D, connected through a switch.
      Original documents:
      • The quick brown fox jumps over the lazy dog
      • To be or not to be: that is the question.
      • The world only needs five computers.
      • Hello world.
      • In-Database MapReduce is the future.
      • MapReduce is a very powerful concept.
      Normalized text (lowercased, punctuation removed):
      • the quick brown fox jumps over the lazy dog
      • in database mapreduce is the future
      • the world only needs five computers
      • hello world
      • mapreduce is a very powerful concept
      • to be or not to be that is the question
  • 9. Server 4, Final Result File:
      the        5
      is         3
      mapreduce  2
      …          …
  • 10.
  • 11. 2nd Approach: No MapReduce, Fully Distributed
  • 12. Diagram: Servers A, B, C, and D, connected through a switch.
      Original documents:
      • The quick brown fox jumps over the lazy dog
      • To be or not to be: that is the question.
      • The world only needs five computers.
      • Hello world.
      • In-Database MapReduce is the future.
      • MapReduce is a very powerful concept.
      Normalized text (lowercased, punctuation removed):
      • the quick brown fox jumps over the lazy dog
      • in database mapreduce is the future
      • the world only needs five computers
      • hello world
      • mapreduce is a very powerful concept
      • to be or not to be that is the question
      Words after redistribution across the servers, with identical words placed together:
      the the the the the database database future world world powerful lazy brown mapreduce mapreduce be be to jumps computers hello is is is question over a that
  • 13. Final result files, one per server:
      Server 1, Final Result File: the 5, …
      Server 2, Final Result File: world 2, …
      Server 3, Final Result File: mapreduce 2, …
      Server 4, Final Result File: is 3, …
  • 14. 2nd Approach: No MapReduce, Distributed
  • 15. Does it work? Yes. Is it a pain? Yes!! Does it take lots of time? Yes! Would you do it? No!!!
  • 16.
  • 17. Map(), Data Redistribution and Grouping, Reduce()
      Map()
        Input: any file (e.g. documents)
        Output: stream of <key, value> pairs (e.g. <word, count> pairs)
      Data Redistribution and Grouping
      Reduce()
        Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = "the")
        Output: anything (e.g. sum of counts for a specific word)
  • 18. Map() and Redistribution Phase (Servers A, B, C, and D, connected through a switch)
      Map() input: The quick brown fox jumps over the lazy dog
      Map() output: <the,1> <quick,1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1>
      Map() input: In-Database MapReduce is the future.
      Map() output: <in,1> <database,1> <mapreduce,1> <is,1> <the,1> <future,1>
      Pairs after redistribution through the switch, with identical keys placed together:
      <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1> <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1>
  • 19. Grouping and Reduce() Phase (on Server 1)
      Reduce() input: <the,1> <the,1> <the,1> <the,1> <the,1> <database,1> <database,1> <future,1>
      Reduce() output, Server 1 Final Result File:
      the       5
      database  2
      future    1
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25. Beyond SQL and MapReduce
  • 26.
  • 27.
  • 28. The SQL/MR Process
  • 29.
  • 30.
  • 31.
  • 32. Example 2: Sessionization (Session Timeout = 60 seconds)
      INPUT (Clickstream):
        timestamp  userid
        10:00:00   Shawn1
        00:58:24   PrezBush
        10:00:24   Shawn1
        02:30:33   PrezBush
        10:01:23   Shawn1
        10:02:40   Shawn1
      OUTPUT:
        timestamp  userid    sessionid
        10:00:00   Shawn1    0
        10:00:24   Shawn1    0
        10:01:23   Shawn1    0
        10:02:40   Shawn1    1

        timestamp  userid    sessionid
        00:58:24   PrezBush  0
        02:30:33   PrezBush  1
  • 33.
  • 34.
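
The Map(), redistribution and grouping, and Reduce() phases on slides 17 to 19 can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not Aster Data's implementation: the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, the "servers" are plain in-memory dictionaries, and routing keys by a CRC32 hash is an assumed partitioning choice that the slides do not specify.

```python
import re
import zlib
from collections import defaultdict


def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in one document."""
    for word in re.findall(r"[a-z]+", document.lower()):
        yield word, 1


def shuffle(pairs, num_servers):
    """Redistribution and grouping: send every pair with the same key to
    the same (simulated) server and collect its values in one list."""
    servers = [defaultdict(list) for _ in range(num_servers)]
    for key, value in pairs:
        # Assumed partitioning rule: a stable hash of the key picks the server.
        target = zlib.crc32(key.encode("utf-8")) % num_servers
        servers[target][key].append(value)
    return servers


def reduce_fn(key, values):
    """Reduce(): sum the counts that were grouped under one word."""
    return key, sum(values)


def word_count(documents, num_servers=4):
    """Run map, shuffle, and reduce over all documents; return {word: count}."""
    pairs = (pair for doc in documents for pair in map_fn(doc))
    result = {}
    for server in shuffle(pairs, num_servers):
        for key, values in server.items():
            word, total = reduce_fn(key, values)
            result[word] = total
    return result


# The six example documents from the slides.
DOCUMENTS = [
    "The quick brown fox jumps over the lazy dog.",
    "To be or not to be: that is the question.",
    "The world only needs five computers.",
    "Hello world.",
    "In-Database MapReduce is the future.",
    "MapReduce is a very powerful concept.",
]

counts = word_count(DOCUMENTS)
print(counts["the"], counts["is"], counts["mapreduce"])
```

Only `map_fn` and `reduce_fn` contain problem-specific logic; everything in `shuffle` is generic, which is the part a MapReduce framework takes over from the programmer.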
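
The sessionization example on slide 32 fits the same pattern: group clicks by user, then assign session ids inside each group. The sketch below is a minimal Python illustration, not the SQL/MR function from the talk. The names `map_click`, `reduce_sessions`, and `sessionize` are hypothetical, and it assumes that a new session starts when the gap between two consecutive clicks of the same user is strictly greater than the timeout, which is consistent with the slide's output but not stated on it.

```python
from collections import defaultdict
from datetime import datetime

SESSION_TIMEOUT_SECONDS = 60  # value taken from the slide


def map_click(row):
    """Map(): key each click by userid so a user's clicks are grouped together."""
    timestamp, userid = row
    return userid, datetime.strptime(timestamp, "%H:%M:%S")


def reduce_sessions(userid, times, timeout=SESSION_TIMEOUT_SECONDS):
    """Reduce(): walk one user's clicks in time order and start a new
    session whenever the gap to the previous click exceeds the timeout."""
    session_id = 0
    previous = None
    rows = []
    for current in sorted(times):
        if previous is not None and (current - previous).total_seconds() > timeout:
            session_id += 1
        rows.append((current.strftime("%H:%M:%S"), userid, session_id))
        previous = current
    return rows


def sessionize(clicks):
    """Group clicks by user, then reduce each group independently."""
    grouped = defaultdict(list)
    for row in clicks:
        userid, clicked_at = map_click(row)
        grouped[userid].append(clicked_at)
    result = []
    for userid, times in grouped.items():
        result.extend(reduce_sessions(userid, times))
    return result


# The clickstream from the slide, in its original order.
CLICKSTREAM = [
    ("10:00:00", "Shawn1"),
    ("00:58:24", "PrezBush"),
    ("10:00:24", "Shawn1"),
    ("02:30:33", "PrezBush"),
    ("10:01:23", "Shawn1"),
    ("10:02:40", "Shawn1"),
]

sessions = sessionize(CLICKSTREAM)
for row in sessions:
    print(row)
```

Because each user's clicks are reduced independently, the groups could be processed on different servers without any coordination between them.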