Intro to Big Data using Hadoop




                       Sergejus Barinovas
                       sergejus.blogas.lt
                       fb.com/ITishnikai
                       @sergejusb
Information is powerful…
but it is how we use it that will define us
Data Explosion

[Figure: data types shown are text, audio, video, images and relational. Picture from Big Data Integration.]
Big Data (globally)


– creates over 30 billion pieces of content per day

– stores 30 petabytes of data



– produces over 90 million tweets per day
Big Data (our example)



– logs over 300 gigabytes of transactions per day

– stores more than 1.5 terabytes of aggregated data
4 Vs of Big Data


 – volume
 – velocity
 – variety
 – variability
Big Data Challenges


Sort 10 TB on 1 node = 2.5 days

   100-node cluster = 35 mins
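As a rough sanity check on these two figures, the sketch below assumes ideal linear scaling: the work divides evenly over the nodes and coordination costs nothing. That assumption is not from the slides, and the function name idealClusterMinutes is made up for illustration. Under it, 2.5 days on one node becomes 36 minutes on 100 nodes, which is close to the 35 minutes quoted.

```javascript
// Back-of-the-envelope check, assuming ideal linear scaling (an assumption,
// not a claim from the slides): time on N nodes = single-node time / N.
function idealClusterMinutes(singleNodeDays, nodeCount) {
    var singleNodeMinutes = singleNodeDays * 24 * 60;
    return singleNodeMinutes / nodeCount;
}

var estimate = idealClusterMinutes(2.5, 100);
console.log(estimate + " minutes"); // 36 minutes
```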
Big Data Challenges

“Fat” servers imply high cost
 – use cheap commodity nodes instead


A large number of cheap nodes implies frequent failures
 – leverage automatic fault-tolerance
Big Data Challenges



We need a new data-parallel programming
model for clusters of commodity machines
MapReduce
to the rescue!
MapReduce


Published in 2004 by Google
 – MapReduce: Simplified Data Processing on Large Clusters




Popularized by Apache Hadoop project
 – used by Yahoo!, Facebook, Twitter, Amazon, …
MapReduce
Word Count Example
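 Input → Map → Shuffle & Sort → Reduce → Output

 Input splits (one Map task each):
  – the quick brown fox
  – the fox ate the mouse
  – how now brown cow

 Output of the first Reduce task:
  – the, 3
  – brown, 2
  – fox, 2
  – how, 1
  – now, 1

 Output of the second Reduce task:
  – quick, 1
  – ate, 1
  – mouse, 1
  – cow, 1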
Word Count Example
 Input → Map → Shuffle & Sort → Reduce → Output

 Map output (one pair per word):
  – the quick brown fox → the, 1 | quick, 1 | brown, 1 | fox, 1
  – the fox ate the mouse → the, 1 | fox, 1 | ate, 1 | the, 1 | mouse, 1
  – how now brown cow → how, 1 | now, 1 | brown, 1 | cow, 1

 Pairs routed to the first Reduce task:
  – the, 1 | brown, 1 | fox, 1 | the, 1 | fox, 1 | the, 1 | how, 1 | now, 1 | brown, 1

 Pairs routed to the second Reduce task:
  – quick, 1 | ate, 1 | mouse, 1 | cow, 1
Word Count Example
 Input → Map → Shuffle & Sort → Reduce → Output

 First Reduce task (grouped input → output):
  – the, [1,1,1] → the, 3
  – brown, [1,1] → brown, 2
  – fox, [1,1] → fox, 2
  – how, [1] → how, 1
  – now, [1] → now, 1

 Second Reduce task (grouped input → output):
  – quick, [1] → quick, 1
  – ate, [1] → ate, 1
  – mouse, [1] → mouse, 1
  – cow, [1] → cow, 1
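The three phases of the word count example can be reproduced in memory with a few lines of plain JavaScript. This is only a sketch of the data flow: it runs in one process, uses a single reducer instead of the two shown on the slides, and the names mapPhase, shuffleAndSort and reducePhase are illustrative, not part of any Hadoop API.

```javascript
// Minimal in-memory sketch of the three phases; no Hadoop involved.
function mapPhase(line) {
    // Emit a [word, 1] pair for every word in the line.
    return line.split(/\s+/)
        .filter(function (w) { return w !== ""; })
        .map(function (w) { return [w.toLowerCase(), 1]; });
}

function shuffleAndSort(pairs) {
    // Group values by key: three "the, 1" pairs become the, [1,1,1].
    var groups = {};
    pairs.forEach(function (pair) {
        var key = pair[0];
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
        }
        groups[key].push(pair[1]);
    });
    return groups;
}

function reducePhase(groups) {
    // Sum the list of ones collected for each word.
    var counts = {};
    Object.keys(groups).sort().forEach(function (key) {
        counts[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
    });
    return counts;
}

var input = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];
var pairs = [];
input.forEach(function (line) { pairs = pairs.concat(mapPhase(line)); });
var groups = shuffleAndSort(pairs);
var counts = reducePhase(groups);

console.log(groups.the); // [ 1, 1, 1 ]
console.log(counts);
```

On a real cluster each Map call would run on a different node and the grouped keys would be partitioned across several Reduce tasks; only the grouping logic stays the same.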
MapReduce philosophy
 – hide complexity

 – make it scalable

 – make it cheap
MapReduce popularized by

 Apache Hadoop project
Hadoop Overview

Open source implementation of
 – Google MapReduce paper

 – Google File System (GFS) paper


First release in 2008 by Yahoo!
 – wide adoption by Facebook, Twitter, Amazon, etc.
Hadoop Core



MapReduce (Job Scheduling / Execution System)


    Hadoop Distributed File System (HDFS)
Hadoop Core (HDFS)



     MapReduce (Job Scheduling / Execution System)

          Hadoop Distributed File System (HDFS)

• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
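To make the three points above concrete, here is a small sketch of how a file could be cut into 64 MB blocks with 3 replicas each. The function planBlocks, the node names and the round-robin placement are assumptions made for illustration; real HDFS uses its own placement policy.

```javascript
// Hypothetical sketch: planBlocks and the round-robin placement are
// illustrative assumptions, not the actual HDFS placement policy.
var BLOCK_SIZE_MB = 64;   // block size from the slide
var REPLICATION = 3;      // replicas per block from the slide

function planBlocks(fileSizeMb, dataNodes) {
    var blockCount = Math.ceil(fileSizeMb / BLOCK_SIZE_MB);
    var plan = [];
    for (var b = 0; b < blockCount; b++) {
        var replicas = [];
        for (var r = 0; r < REPLICATION; r++) {
            // Round-robin keeps the replicas of one block on distinct nodes
            // as long as there are at least REPLICATION nodes.
            replicas.push(dataNodes[(b + r) % dataNodes.length]);
        }
        plan.push({ block: b, replicas: replicas });
    }
    return plan; // the kind of block-to-node metadata a Name Node keeps
}

var plan = planBlocks(200, ["dn1", "dn2", "dn3", "dn4"]);
console.log(plan.length);       // 4 (blocks of 64 + 64 + 64 + 8 MB)
console.log(plan[3].replicas);  // [ 'dn4', 'dn1', 'dn2' ]
```

Because every block lives on three nodes, losing one Data Node still leaves two copies of each of its blocks.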
Hadoop Core (HDFS)



MapReduce (Job Scheduling / Execution System)

    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks


     MapReduce (Job Scheduling / Execution System)
          Hadoop Distributed File System (HDFS)



       Name Node                      Data Node
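Assignment by data locality can be sketched as follows: a task goes to a Task Tracker on a host that already stores its block, and falls back to any host with a free slot otherwise. The function assignTasks and the data shapes are hypothetical, and failure handling is left out; this is not the Job Tracker's actual scheduler.

```javascript
// Hypothetical sketch of locality-aware assignment; assignTasks and the
// data shapes are illustrative, not the Job Tracker's real scheduler.
function assignTasks(tasks, trackers) {
    // tasks: [{ id, blockHosts: [...] }], trackers: { host: freeSlots }
    var assignments = {};
    tasks.forEach(function (task) {
        // Prefer a tracker on a host that already stores the task's block.
        var local = task.blockHosts.filter(function (h) { return trackers[h] > 0; });
        var anyFree = Object.keys(trackers).filter(function (h) { return trackers[h] > 0; });
        var host = local.length > 0 ? local[0] : anyFree[0];
        if (host === undefined) { return; } // no free slot: task stays pending
        trackers[host] -= 1;
        assignments[task.id] = { host: host, dataLocal: local.length > 0 };
    });
    return assignments;
}

var trackers = { dn1: 2, dn2: 1, dn3: 1 };   // a tracker can run several tasks
var tasks = [
    { id: "map-0", blockHosts: ["dn1", "dn2"] },
    { id: "map-1", blockHosts: ["dn1"] },
    { id: "map-2", blockHosts: ["dn1"] },    // dn1 has no free slot by now
    { id: "map-3", blockHosts: ["dn3"] }
];
var result = assignTasks(tasks, trackers);
console.log(result);
```

In this run map-2 ends up on dn2 without a local copy of its block, so its input would have to travel over the network; the other three tasks read from local disk.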
Hadoop Core (MapReduce)

  Job Tracker                 Task Tracker




MapReduce (Job Scheduling / Execution System)
    Hadoop Distributed File System (HDFS)



  Name Node                    Data Node
Hadoop Core (Job submission)

                           Task Tracker
Client




            Job Tracker




            Name Node      Data Node
Hadoop Ecosystem

 – Pig (ETL), Hive (BI), Sqoop (RDBMS)
 – MapReduce (Job Scheduling / Execution System)
 – HBase
 – Hadoop Distributed File System (HDFS)
 – Zookeeper and Avro (shown beside the layers)
JavaScript MapReduce
// Map: emit (word, 1) for every word in the input line.
var map = function (key, value, context) {
    // Split on anything that is not a letter.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
// Reduce: sum the counts collected for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
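The two functions can be tried without a cluster by handing them stand-ins for the objects the runtime normally supplies. In the sketch below, makeContext and makeIterator are hypothetical helpers that mimic only what the slide's code uses (context.write, values.hasNext, values.next); the map and reduce bodies repeat the logic from the slide.

```javascript
// Same logic as the slide's map and reduce.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical stand-ins for the objects the runtime would supply.
function makeContext() {
    var pairs = [];
    return {
        pairs: pairs,
        write: function (k, v) { pairs.push([k, v]); }
    };
}
function makeIterator(items) {
    var i = 0;
    return {
        hasNext: function () { return i < items.length; },
        next: function () { return items[i++]; }
    };
}

var mapContext = makeContext();
map(0, "The fox ate the mouse", mapContext);
console.log(mapContext.pairs.length); // 5

var reduceContext = makeContext();
reduce("the", makeIterator([1, 1]), reduceContext);
console.log(reduceContext.pairs); // [ [ 'the', 2 ] ]
```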
Pig

words = LOAD '/example/count' AS (
      word: chararray,
      count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
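For readers who have not seen Pig Latin, the ORDER and LIMIT steps above amount to a descending sort followed by taking the first rows. The sketch below does the same on a small in-memory array; topWords and the sample rows are made up for illustration, whereas Pig would run the equivalent as MapReduce jobs over the files under /example/count.

```javascript
// Plain JavaScript equivalent of ORDER ... BY count DESC followed by LIMIT.
function topWords(words, limit) {
    // Copy before sorting so the input array is left untouched.
    return words.slice()
        .sort(function (a, b) { return b.count - a.count; })
        .slice(0, limit);
}

var words = [
    { word: "cow", count: 1 },
    { word: "the", count: 3 },
    { word: "fox", count: 2 }
];
var top = topWords(words, 2);
console.log(top); // the (3) first, then fox (2)
```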
Hive
CREATE EXTERNAL TABLE WordCount (
      word string,
      count int
)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
Über Demo
Demo
Hadoop in the Cloud
Thanks!
Questions?

Finology Group – Insurtech Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Intro to Big Data using Hadoop

  • 1. Intro to Big Data using Hadoop Sergejus Barinovas sergejus.blogas.lt fb.com/ITishnikai @sergejusb
  • 2. Information is powerful… but it is how we use it that will define us
  • 3. Data Explosion: text, audio, video, images, relational (picture from Big Data Integration)
  • 4. Big Data (globally) – creates over 30 billion pieces of content per day – stores 30 petabytes of data – produces over 90 million tweets per day
  • 5. Big Data (our example) – logs over 300 gigabytes of transactions per day – stores more than 1.5 terabytes of aggregated data
  • 6. 4 Vs of Big Data: volume, velocity, variety, variability
  • 7. Big Data Challenges: sorting 10 TB on 1 node takes 2.5 days; on a 100-node cluster it takes 35 minutes
  • 8. Big Data Challenges: “Fat” servers imply high cost – use cheap commodity nodes instead. A large number of cheap nodes implies frequent failures – leverage automatic fault-tolerance
  • 9. Big Data Challenges: We need a new data-parallel programming model for clusters of commodity machines
  • 11. MapReduce Published in 2004 by Google – MapReduce: Simplified Data Processing on Large Clusters Popularized by Apache Hadoop project – used by Yahoo!, Facebook, Twitter, Amazon, …
  • 13. Word Count Example (Input → Map → Shuffle & Sort → Reduce → Output). Input splits, one per Map task: "the quick brown fox", "the fox ate the mouse", "how now brown cow". Output of the first Reduce task: the, 3; brown, 2; fox, 2; how, 1; now, 1. Output of the second Reduce task: quick, 1; ate, 1; mouse, 1; cow, 1
  • 14. Word Count Example (Map output). Each Map task emits (word, 1) for every word: the, 1; quick, 1; brown, 1; fox, 1 from the first split; the, 1; fox, 1; ate, 1; the, 1; mouse, 1 from the second; how, 1; now, 1; brown, 1; cow, 1 from the third. Shuffle & Sort routes the, brown, fox, how and now to the first Reduce task and quick, ate, mouse and cow to the second
  • 15. Word Count Example (Reduce input). After Shuffle & Sort the first Reduce task receives the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1] and the second receives quick, [1]; ate, [1]; mouse, [1]; cow, [1]. Each Reduce task sums the list for each key to produce the output shown on slide 13
  • 16. MapReduce philosophy – hide complexity – make it scalable – make it cheap
  • 17. MapReduce popularized by Apache Hadoop project
  • 18. Hadoop Overview Open source implementation of – Google MapReduce paper – Google File System (GFS) paper First release in 2008 by Yahoo! – wide adoption by Facebook, Twitter, Amazon, etc.
  • 19. Hadoop Core MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS)
  • 20. Hadoop Core (HDFS): Name Node stores file metadata; files are split into 64 MB blocks; blocks are replicated across 3 Data Nodes
  • 21. Hadoop Core (HDFS), diagram: MapReduce (Job Scheduling / Execution System) on top of Hadoop Distributed File System (HDFS), which consists of Name Node and Data Node
  • 22. Hadoop Core (MapReduce): Job Tracker distributes tasks and handles failures; tasks are assigned based on data locality; Task Trackers can execute multiple tasks
  • 23. Hadoop Core (MapReduce), diagram: Job Tracker and Task Tracker in the MapReduce layer; Name Node and Data Node in HDFS
  • 24. Hadoop Core (Job submission) Task Tracker Client Job Tracker Name Node Data Node
  • 25. Hadoop Ecosystem Pig (ETL) Hive (BI) Sqoop (RDBMS) MapReduce (Job Scheduling / Execution System) Zookeeper Avro HBase Hadoop Distributed File System (HDFS)
  • 26. JavaScript MapReduce
        var map = function (key, value, context) {
            var words = value.split(/[^a-zA-Z]/);
            for (var i = 0; i < words.length; i++) {
                if (words[i] !== "") {
                    context.write(words[i].toLowerCase(), 1);
                }
            }
        };
        var reduce = function (key, values, context) {
            var sum = 0;
            while (values.hasNext()) {
                sum += parseInt(values.next());
            }
            context.write(key, sum);
        };
  • 27. Pig
        words = LOAD '/example/count' AS ( word: chararray, count: int );
        popular_words = ORDER words BY count DESC;
        top_popular_words = LIMIT popular_words 10;
        DUMP top_popular_words;
  • 28. Hive
        CREATE EXTERNAL TABLE WordCount ( word string, count int )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
        STORED AS TEXTFILE LOCATION "/example/count";
        SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
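
The map and reduce functions from slide 26 can be tried without a cluster. The sketch below runs them over the three input splits from the Word Count slides. `makeContext`, `makeIterator` and `runLocal` are hypothetical in-memory stand-ins written for this illustration, not Hadoop APIs; the real framework distributes these phases across nodes.

```javascript
// map and reduce as on slide 26 (a radix is added to parseInt).
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical stand-in for the framework's output collector.
function makeContext() {
  var pairs = [];
  return {
    pairs: pairs,
    write: function (k, v) { pairs.push([k, v]); }
  };
}

// Hypothetical stand-in for the value iterator handed to reduce.
function makeIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; }
  };
}

// Runs Map, Shuffle & Sort and Reduce in a single process.
function runLocal(lines) {
  // Map: one call per input line; the key here is just the line index.
  var mapCtx = makeContext();
  lines.forEach(function (line, index) { map(index, line, mapCtx); });

  // Shuffle & Sort: group emitted values by key.
  var groups = new Map();
  mapCtx.pairs.forEach(function (p) {
    if (!groups.has(p[0])) { groups.set(p[0], []); }
    groups.get(p[0]).push(p[1]);
  });

  // Reduce: one call per distinct key, in sorted key order.
  var reduceCtx = makeContext();
  Array.from(groups.keys()).sort().forEach(function (k) {
    reduce(k, makeIterator(groups.get(k)), reduceCtx);
  });
  return reduceCtx.pairs;
}

var counts = runLocal([
  "the quick brown fox",
  "the fox ate the mouse",
  "how now brown cow"
]);
console.log(counts);
```

Unlike the slides, this sketch uses a single reducer, so all nine words come back in one sorted list instead of being split across two Reduce tasks.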

Editor's notes

  1. So, which is really the “enterprise” now?
  2. Volume – exceeds physical limits of vertical scalability. Velocity – decision window is small compared to data change rate. Variety – many different formats make integration expensive. Variability – many options or variable interpretations confound analysis.
  3. -- run MapReduce
     runJs("/example/mr/WordCount.js", "/example/data/davinci.txt", "/example/count");
     -- create Hive table for the existing data
     CREATE EXTERNAL TABLE WordCount ( word string, count int )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
     STORED AS TEXTFILE LOCATION "/example/count";
     -- select top words
     SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
     -- execute Hive select
     hive.exec("select * from WordCount order by count desc limit 10;");
     -- execute LINQ style Hive query
     hive.from("WordCount").orderBy("count DESC").take(10).run();
     -- execute Pig script
     words = LOAD '/example/count' AS ( word: chararray, count: int );
     popular_words = ORDER words by count DESC;
     top_popular_words = LIMIT popular_words 10;
     DUMP top_popular_words;
     -- execute LINQ style Pig script
     pig.from("/example/count", "word: chararray, count: int").orderBy("count DESC").take(10).run();
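
The Hive and Pig queries in these notes both sort the word counts in descending order and keep the first ten rows. A minimal plain JavaScript sketch of the same selection follows, assuming the tab-separated word/count layout declared for the Hive table. `topWords` and the sample rows are hypothetical, made up for illustration, and are not part of the demo environment.

```javascript
// Hypothetical helper: parse "word<TAB>count" lines and return the n rows
// with the highest counts, like ORDER BY count DESC LIMIT n.
function topWords(text, n) {
  return text
    .split("\n")
    .filter(function (line) { return line !== ""; })
    .map(function (line) {
      var fields = line.split("\t");
      return { word: fields[0], count: parseInt(fields[1], 10) };
    })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, n);
}

// Sample rows in the same layout as the job output (illustrative values).
var sample = "how\t1\nthe\t3\nbrown\t2\nfox\t2\ncow\t1\n";
var top = topWords(sample, 2);
console.log(top);
```

On real job output the file would be read from storage first; the sorting and truncation steps stay the same.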