8. Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
9. Big Data Challenges
We need a new data-parallel programming
model for clusters of commodity machines
11. MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by Apache Hadoop project
– used by Yahoo!, Facebook, Twitter, Amazon, …
13. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: three input splits ("the quick brown fox", "the fox ate the mouse", "how now brown cow") flow through Map and Reduce tasks to the final counts: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1]
14. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: each Map task emits a (word, 1) pair for every word in its split, e.g. the, 1; quick, 1; brown, 1; fox, 1 for the first split; the shuffle then routes all pairs with the same word to the same Reduce task]
15. Word Count Example
Input → Map → Shuffle & Sort → Reduce → Output
[Diagram: the shuffle groups values by key (the, [1,1,1]; brown, [1,1]; fox, [1,1]; quick, [1]; ate, [1]; mouse, [1]; how, [1]; now, [1]; cow, [1]); each Reduce task sums its list to emit the final counts]
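The three slides above trace the same job through Map, Shuffle & Sort, and Reduce. A minimal in-memory sketch of those three phases (plain JavaScript for illustration; this is not the Hadoop API) looks like:

```javascript
// Illustrative word count mirroring the three MapReduce phases above.
var inputs = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"];

// Map: emit a (word, 1) pair for every word (slide 14).
var pairs = [];
inputs.forEach(function (line) {
  line.split(" ").forEach(function (word) {
    pairs.push([word, 1]);
  });
});

// Shuffle & Sort: group values by key (slide 15).
var groups = {};
pairs.forEach(function (pair) {
  (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
});

// Reduce: sum each group's values to get the final counts (slide 13).
var counts = {};
Object.keys(groups).forEach(function (key) {
  counts[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
});
// counts.the === 3, counts.brown === 2, counts.fox === 2
```

In a real cluster the three stages run on different machines; the only coordination point is the shuffle, which is exactly why the model scales.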
18. Hadoop Overview
Open source implementation of
– Google MapReduce paper
– Google File System (GFS) paper
First release in 2008 by Yahoo!
– wide adoption by Facebook, Twitter, Amazon, etc.
20. Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
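The two defaults above fix how much cluster storage a file consumes. A small illustrative helper (a hypothetical function, not an HDFS API) makes the arithmetic concrete:

```javascript
// How a file maps onto HDFS blocks and replicas, using the defaults
// from the slide: 64 MB block size, replication factor 3.
var BLOCK_MB = 64;
var REPLICATION = 3;

function hdfsFootprint(fileSizeMB) {
  var blocks = Math.ceil(fileSizeMB / BLOCK_MB);  // last block may be partial
  return { blocks: blocks, replicas: blocks * REPLICATION };
}

// A 200 MB file splits into 4 blocks, stored as 12 block replicas in total.
var f = hdfsFootprint(200);
```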
21. Hadoop Core (HDFS)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: one Name Node and multiple Data Nodes make up the HDFS layer]
22. Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
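Data-local scheduling can be sketched in a few lines. This is a hypothetical simplification of what the Job Tracker does, not Hadoop code: prefer an idle Task Tracker whose Data Node already holds a replica of the task's input block.

```javascript
// Locality-aware task assignment (simplified sketch, not the Hadoop scheduler).
// blockReplicaNodes: nodes holding a replica of the task's input block.
// idleTrackers: nodes with a free task slot.
function assignTask(blockReplicaNodes, idleTrackers) {
  for (var i = 0; i < idleTrackers.length; i++) {
    if (blockReplicaNodes.indexOf(idleTrackers[i]) !== -1) {
      return idleTrackers[i];  // data-local: input needs no network copy
    }
  }
  return idleTrackers[0];      // fall back to any idle tracker
}

var chosen = assignTask(["nodeB", "nodeD", "nodeF"], ["nodeA", "nodeD"]);
// chosen === "nodeD": the tracker co-located with a replica wins
```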
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: Name Node and Data Nodes beneath the MapReduce layer]
23. Hadoop Core (MapReduce)
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
[Diagram: the Job Tracker and Task Trackers in the MapReduce layer, paired with the Name Node and Data Nodes in the HDFS layer]
24. Hadoop Core (Job submission)
[Diagram: the Client submits a job to the Job Tracker, which schedules tasks on Task Trackers near the Data Nodes holding the input blocks, with block locations obtained from the Name Node]
26. JavaScript MapReduce
var map = function (key, value, context) {
    // split on any non-letter character and emit (word, 1) per word
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // sum all the 1s emitted for this word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
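To check the pair outside a cluster, a tiny local harness can stand in for the framework. The `runLocal` function below is an assumption for illustration; in Hadoop the real `context` object and `values` iterator are supplied by the runtime.

```javascript
// The map/reduce pair from the slide, unchanged.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness: fakes context.write and the values iterator.
function runLocal(lines) {
  var groups = {};
  var mapCtx = { write: function (k, v) { (groups[k] = groups[k] || []).push(v); } };
  lines.forEach(function (line, i) { map(i, line, mapCtx); });  // map phase

  var output = {};
  var redCtx = { write: function (k, v) { output[k] = v; } };
  Object.keys(groups).forEach(function (k) {                    // reduce phase
    var vals = groups[k], pos = 0;
    var iter = {
      hasNext: function () { return pos < vals.length; },
      next: function () { return vals[pos++]; }
    };
    reduce(k, iter, redCtx);
  });
  return output;
}

var result = runLocal(["The quick brown fox", "the fox ate the mouse"]);
// result.the === 3, result.fox === 2
```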
27. Pig
words = LOAD '/example/count' AS (
    word: chararray,
    count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;
28. Hive
CREATE EXTERNAL TABLE WordCount (
word string,
count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";
SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
Volume – exceeds physical limits of vertical scalability
Velocity – decision window small compared to data change rate
Variety – many different formats make integration expensive
Variability – many options or variable interpretations confound analysis
-- run MapReduce
runJs("/example/mr/WordCount.js", "/example/data/davinci.txt", "/example/count");

-- create Hive table for the existing data
CREATE EXTERNAL TABLE WordCount (
    word string,
    count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "/example/count";

-- select top words
SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;

-- execute Hive select
hive.exec("select * from WordCount order by count desc limit 10;");

-- execute LINQ-style Hive query
hive.from("WordCount").orderBy("count DESC").take(10).run();

-- execute Pig script
words = LOAD '/example/count' AS (
    word: chararray,
    count: int
);
popular_words = ORDER words BY count DESC;
top_popular_words = LIMIT popular_words 10;
DUMP top_popular_words;

-- execute LINQ-style Pig script
pig.from("/example/count", "word: chararray, count: int").orderBy("count DESC").take(10).run();