A Brief Discussion on: Hadoop MapReduce, Pig, FlumeJava, Cascading & Dremel

Presented by: Somnath Mazumdar
29th Nov 2011
MapReduce
- Based on Google's MapReduce programming framework
- File system: GFS for MapReduce; HDFS for Hadoop
- Language: Google's MapReduce is written in C++, while Hadoop is written in Java
- Basic functions: Map and Reduce, inspired by similar primitives in LISP and other languages

Why should we use it?
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring
MapReduce
Map function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate key/value pairs
Syntax: map (key, value) -> list(key, inter_value)

Reduce function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values
Syntax: reduce (out_key, list(inter_value)) -> list(out_value)
Programming Model (word count)
Pipeline: HDFS -> Map phase -> intermediate result -> Reduce phase -> HDFS

Map phase:
M1: "Hello World, Bye World!" -> (Hello, 1), (Bye, 1), (World, 1), (World, 1)
M2: "Welcome to UCD, Goodbye to UCD." -> (Welcome, 1), (to, 1), (to, 1), (Goodbye, 1), (UCD, 1), (UCD, 1)
M3: "Hello MapReduce, Goodbye to MapReduce." -> (Hello, 1), (to, 1), (Goodbye, 1), (MapReduce, 1), (MapReduce, 1)

Reduce phase:
R1: (Hello, 2), (Bye, 1), (Welcome, 1), (to, 3)
R2: (World, 2), (UCD, 2), (Goodbye, 2), (MapReduce, 2)
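The word-count dataflow above can be reproduced in a few lines of plain Python. This is a conceptual sketch of the map/shuffle/reduce stages, not the Hadoop API; the function names are illustrative:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.replace(",", " ").replace("!", " ").replace(".", " ").split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: combine all intermediate values for one key."""
    return (key, sum(values))

def map_reduce(documents):
    # Map phase: run map_fn over every input split.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)  # shuffle: group values by key
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

docs = ["Hello World, Bye World!",
        "Welcome to UCD, Goodbye to UCD.",
        "Hello MapReduce, Goodbye to MapReduce."]
counts = map_reduce(docs)
# e.g. counts["to"] == 3 and counts["MapReduce"] == 2, matching the figure
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here the same grouping happens in one in-memory dictionary.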
MapReduce
Applications:
(1) Distributed grep & distributed sort
(2) Web link-graph reversal
(3) Web access log stats
(4) Document clustering
(5) Machine learning, and so on...

To know more:
- MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
- Hadoop: The Definitive Guide, O'Reilly Media
PIG
- Pig was first developed at Yahoo! Research around 2006 and later moved to the Apache Software Foundation.
- Pig is a data-flow programming environment for processing large files, based on MapReduce/Hadoop.
- A high-level platform for creating MapReduce programs, used with Hadoop and HDFS.
- An Apache library that interprets scripts written in Pig Latin and runs them on a Hadoop cluster.

At Yahoo!, 40% of all Hadoop jobs are run with Pig.
PIG
Workflow:
First step: Load the input data.
Second step: Manipulate the data with functions such as filter, foreach, distinct, or any user-defined function.
Third step: Group the data.
Final step: Write the data into the DFS, or repeat the steps if another dataset arrives.

Scripts written in Pig Latin --(Pig library/engine)--> Hadoop-ready jobs

Take-away point: Do more with data, not with functions.
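Pig Latin itself runs on a Hadoop cluster, but the load -> manipulate -> group -> store workflow above can be mimicked in plain Python. A minimal sketch, with made-up data, standing in for a FILTER, GROUP BY, and STORE sequence:

```python
from itertools import groupby

# First step: load the input data (in-memory records instead of an HDFS file).
records = [("alice", 3), ("bob", 7), ("alice", 5), ("carol", 2), ("bob", 1)]

# Second step: manipulate the data -- a FILTER-like pass keeping values > 1.
filtered = [r for r in records if r[1] > 1]

# Third step: group the data by key, like Pig's GROUP ... BY.
filtered.sort(key=lambda r: r[0])          # groupby needs sorted input
grouped = {k: [v for _, v in g]
           for k, g in groupby(filtered, key=lambda r: r[0])}

# Final step: aggregate and "store" the result (stand-in for STORE ... INTO).
result = {k: sum(vs) for k, vs in grouped.items()}
```

Pig compiles exactly this kind of step sequence into one or more MapReduce jobs, which is why the script author can focus on the data transformations rather than the job plumbing.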
Cascading
A query API and query planner for defining, sharing, and executing data processing workflows.

Supports creating and executing complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.).

Originally authored by Chris Wensel (founder of Concurrent, Inc.)

What does it offer?
- Data Processing API (core)
- Process Planner
- Process Scheduler

How to use it?
1. Install Hadoop.
2. Deploy a Hadoop job .jar that contains the Cascading .jars.
Cascading: ‘Source-Pipe-Sink’
How does it work?
Source: Data is captured from sources.
Pipes: Created independently of the data they will process; supports a reusable ‘pipes’ concept.
Sinks: Results are stored in output files, or ‘sinks’.
The Data Processing API provides this source-pipe-sink mechanism.
Once a pipe assembly is tied to data sources and sinks, it is called a ‘flow’ (topological scheduler). Flows can be grouped into a ‘cascade’ (the CascadeConnector class), and the process scheduler will ensure a given flow does not execute until all of its dependencies are satisfied.
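The source-pipe-sink idea can be sketched in a few lines of Python. The class and function names here are mine, not Cascading's API; the point is only that pipes are defined independently of any particular data and then tied to a source and a sink to form a flow:

```python
# Pipes are defined independently of the data they will process,
# so the same pipe functions can be reused across many flows.
def non_empty(stream):
    """Pipe: drop blank lines."""
    for line in stream:
        if line.strip():
            yield line

def uppercase(stream):
    """Pipe: transform each line."""
    for line in stream:
        yield line.upper()

class Flow:
    """Ties a source and a sink to a chain of reusable pipes."""
    def __init__(self, source, pipes, sink):
        self.source, self.pipes, self.sink = source, pipes, sink

    def complete(self):
        stream = iter(self.source)
        for pipe in self.pipes:      # apply each pipe in order
            stream = pipe(stream)
        self.sink.extend(stream)     # results land in the sink

sink = []
Flow(source=["hello", "", "world"],
     pipes=[non_empty, uppercase],
     sink=sink).complete()
# sink is now ["HELLO", "WORLD"]
```

In Cascading the analogous flow is planned into MapReduce jobs rather than run as an in-process generator chain, but the separation of pipe assembly from endpoints is the same.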
Cascading
Pipe assembly --(MR Job Planner)--> graph of dependent MapReduce jobs.
Also provides external data interfaces for data...

It efficiently supports splits, joins, grouping, and sorting.

Usages: log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, etc.

Cascading was cited as one of the top five most powerful Hadoop projects by SD Times in 2011.
FlumeJava
A Java library API that makes it easy to develop, test, and run efficient data-parallel pipelines.
Born in May 2009 at Google.
The library is a collection of immutable parallel collection classes.
FlumeJava:
1. Abstracts how data is represented, whether as an in-memory data structure or as a file.
2. Abstracts away implementation details, such as whether an operation runs as a local loop or as a remote MR job.
3. Implements parallel operations using deferred evaluation.
FlumeJava
How does it work?
Step 1: Invoke a parallel operation.
Step 2: Do not run it yet. Instead:
       2.1. Record the operation and its arguments.
       2.2. Save them into an internal execution-plan graph structure.
       2.3. Construct the execution plan for the whole computation.
Step 3: Optimize the execution plan.
Step 4: Execute it.
Faster than a typical MR pipeline with the same logical structure, and easier to write.
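The four steps above can be sketched with a toy deferred-evaluation class in Python. This is not FlumeJava's implementation; it only illustrates the record -> optimize -> execute pattern, with fusion of adjacent map operations as the one optimization:

```python
class DeferredCollection:
    """Records operations instead of running them (deferred evaluation)."""
    def __init__(self, data):
        self.data = data
        self.plan = []                  # internal execution-plan structure

    def map(self, fn):
        self.plan.append(fn)            # step 2: record the operation, do not run
        return self

    def optimize(self):
        # Step 3: fuse all recorded maps into a single pass over the data.
        if len(self.plan) > 1:
            fns = self.plan
            def fused(x):
                for f in fns:
                    x = f(x)
                return x
            self.plan = [fused]
        return self

    def run(self):
        # Step 4: execute the (optimized) plan.
        result = self.data
        for fn in self.plan:
            result = [fn(x) for x in result]
        return result

out = (DeferredCollection([1, 2, 3])
       .map(lambda x: x + 1)
       .map(lambda x: x * 10)
       .optimize()
       .run())
# out == [20, 30, 40], computed in one pass instead of two
```

FlumeJava's real optimizer does much more (e.g. fusing chains of parallelDo calls into a single MapReduce), but the payoff is the same: fewer passes over the data than the naive plan implies.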
FlumeJava
Data model:
PCollection<T>: the central class, an immutable bag of elements of type T.
Can be unordered (a collection, which is more efficient) or ordered (a sequence).
PTable<K, V>: the second central class.
An immutable multi-map with keys of class K and values of class V.
Operators:
parallelDo(PCollection<T>): the core parallel primitive
groupByKey(PTable<K, V>)
combineValues(PTable<K, Collection<V>>)
flatten(): a logical view of multiple PCollections as one PCollection
join()
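The approximate semantics of these operators can be shown on plain Python lists. This is illustrative pseudo-FlumeJava, not the library's Java API:

```python
from collections import defaultdict

def parallel_do(pcollection, do_fn):
    """parallelDo: apply do_fn elementwise; each call may emit 0..n outputs."""
    return [out for elem in pcollection for out in do_fn(elem)]

def group_by_key(ptable):
    """groupByKey: (K, V) pairs -> (K, collection-of-V) pairs."""
    groups = defaultdict(list)
    for k, v in ptable:
        groups[k].append(v)
    return list(groups.items())

def combine_values(grouped, combiner):
    """combineValues: reduce each key's collection with a combiner."""
    return [(k, combiner(vs)) for k, vs in grouped]

def flatten(*pcollections):
    """flatten: view several PCollections as one logical PCollection."""
    return [e for pc in pcollections for e in pc]

pairs = flatten([("a", 1), ("b", 2)], [("a", 3)])
totals = dict(combine_values(group_by_key(pairs), sum))
# totals == {"a": 4, "b": 2}
```

groupByKey followed by combineValues is essentially the shuffle-plus-reduce half of a MapReduce job, which is why the optimizer can map chains of these operators directly onto MR stages.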
Dremel
A distributed system for interactive analysis of large datasets, in use at Google since 2006.
Provides a custom, scalable data management solution built over shared clusters of commodity machines.
Three key aspects:
1. Storage format: a column-striped storage representation for non-relational nested data (a lossless representation).
Why nested?
It backs a platform-neutral, extensible mechanism for serializing structured data at Google.
What is the main aim?
Store all values of a given field consecutively to improve retrieval efficiency.
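For flat records, the main aim is easy to illustrate in Python: store all values of a field consecutively so a query touches only the columns it needs. (Dremel additionally handles nested data with repetition and definition levels, which this sketch omits; the field names are made up.)

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"name": "a", "clicks": 10, "country": "IE"},
    {"name": "b", "clicks": 25, "country": "DE"},
    {"name": "c", "clicks": 5,  "country": "IE"},
]

# Column-striped layout: all values of a given field stored consecutively.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# A query over one field now scans a single contiguous list
# instead of reading every full record.
total_clicks = sum(columns["clicks"])
# total_clicks == 40; the "name" and "country" columns were never touched
```

On disk the savings are larger still: a column of one field compresses far better than interleaved records, and untouched columns are never read at all.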
Dremel
2. Query language: Provides a high-level, SQL-like language to express ad hoc queries.
It is efficiently implementable on columnar nested storage.
Fields are referenced using path expressions.
Supports nested subqueries, inter- and intra-record aggregation, joins, etc.
3. Execution: A multi-level serving tree concept (borrowed from distributed search engines).
Several queries can execute simultaneously.
A query dispatcher schedules queries based on priorities and balances load.
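The serving-tree execution model can be sketched as partial aggregation up a tree: leaf servers scan their local data tablets, intermediate servers merge the partial results, and the root merges the subtree answers. The structure and names below are a simplified illustration, not Dremel's internals:

```python
def leaf_query(tablet):
    """Leaf server: scan its local tablet and return a partial sum."""
    return sum(tablet)

def intermediate_server(tablets):
    """Intermediate level: aggregate partial results from its leaf servers."""
    return sum(leaf_query(t) for t in tablets)

def root_server(partitions):
    """Root: rewrite the query for each subtree and merge their answers."""
    return sum(intermediate_server(p) for p in partitions)

# Data partitioned across two subtrees, each holding two leaf tablets.
partitions = [[[1, 2], [3, 4]], [[5], [6, 7]]]
answer = root_server(partitions)
# answer == 28
```

Because each level only merges small partial aggregates, most of the work happens in parallel at the leaves, which is what makes interactive latencies possible over very large tables.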
I am lost... Are MR and Dremel the same?

Features                  | MapReduce (MR)                               | Dremel
Birth year & place        | Since 2004 @ Google                          | Since 2006 @ Google
Type                      | Distributed & parallel programming framework | Distributed interactive ad hoc query system
Scalable & fault tolerant | Yes                                          | Yes
Data processing           | Record-oriented                              | Column-oriented
Batch processing          | Yes                                          | No
In situ processing        | No                                           | Yes

Take-away point: Dremel complements MapReduce-based computing.