INTRODUCTION TO HADOOP 
BreizhJug 
Rennes – 2014-11-06 
David Morin - @davAtBzh
Me 
David Morin 
@davAtBzh 
Solutions Engineer at
3 
What is Hadoop ?
4 
An elephant – This one ?
5 
No, this one !
6 
The father
7 
Let's go !
8 
Let's go !
9 
Timeline
10 
How did the story begin ?
=> Dealing with high volumes of data
11 
Big Data – Big Server ?
12 
Big Data – Big Server ?
13 
Big Data – Big Problems ?
14 
Big Data – Big Problems ?
15 
Split is the key
16 
How to find data ?
17 
Define a master
18 
Try again
19 
Not so bad
20 
Hadoop fundamentals 
● Distributed file system for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable and fault tolerant
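The three points above can be sketched in a few lines. This is a minimal in-memory sketch, not HDFS code: the class `BlockPlacementSketch`, the round-robin placement and the tiny block size are assumptions made for illustration (real HDFS uses much larger blocks and its own placement policy). It shows a master that records which nodes hold each block of a file, so that losing one node does not lose data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a "master" that maps each block of a file
// to the nodes holding a copy of it. Not the HDFS API.
final class BlockPlacementSketch {

    // Number of blocks needed to store fileSize bytes (ceiling division).
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Assigns each block to `replication` distinct nodes, round-robin.
    static Map<Integer, List<String>> place(long fileSize, long blockSize,
                                            List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("not enough nodes for the requested replication");
        }
        Map<Integer, List<String>> blockToNodes = new LinkedHashMap<>();
        int blocks = blockCount(fileSize, blockSize);
        for (int b = 0; b < blocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add(nodes.get((b + r) % nodes.size()));
            }
            blockToNodes.put(b, holders);
        }
        return blockToNodes;
    }

    // True if every block still has at least one copy when failedNode is down.
    static boolean survivesFailure(Map<Integer, List<String>> blockToNodes, String failedNode) {
        for (List<String> holders : blockToNodes.values()) {
            boolean alive = false;
            for (String node : holders) {
                if (!node.equals(failedNode)) {
                    alive = true;
                }
            }
            if (!alive) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        Map<Integer, List<String>> placement = place(1000, 256, nodes, 3);
        placement.forEach((block, holders) -> System.out.println("block " + block + " -> " + holders));
        System.out.println("survives loss of node2: " + survivesFailure(placement, "node2"));
    }
}
```

With a replication factor of 1 the same check fails as soon as the only holder of a block goes down, which is why replication is what makes cheap servers acceptable.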
21 
HDFS 
HDFS
22 
Hadoop Distributed FileSystem
23 
Hadoop fundamentals 
● Distributed file system for high volumes of data
● Runs on commodity servers (limits costs)
● Scalable and fault tolerant ??
24 
Hadoop Distributed FileSystem
25 
MapReduce 
HDFS MapReduce
26 
MapReduce
27 
MapReduce : word count
Map Reduce
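The word count flow can be sketched as three explicit phases. This is a minimal single-JVM sketch that does not use the Hadoop API; the class `WordCountPhases` and its method names are assumptions made for illustration. It shows what each phase contributes: map emits (word, 1) pairs, shuffle groups the pairs by word, reduce sums each group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the three MapReduce phases.
// It does not use the Hadoop API and runs in a single JVM.
final class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every word of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key, with keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values received for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Chains the three phases over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(emitted).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick brown fox", "the lazy dog", "the fox")));
    }
}
```

On a cluster, each call to `map` would run close to the block holding its input line, and each key group would be sent to one reducer.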
28 
Data Locality Optimization
29 
MapReduce in action
30 
Hadoop v1 : drawbacks
– One NameNode : SPOF
– One JobTracker : SPOF and not scalable (limit on the number of nodes)
– MapReduce only : the platform needs to be opened to non-MR applications
– MapReduce v1 : does not fit well with the iterative algorithms used in Machine Learning
31 
Hadoop v2
Improvements :
– HDFS v2 : NameNode High Availability (active / standby NameNode)
– YARN (Yet Another Resource Negotiator)
● JobTracker => ResourceManager + ApplicationMaster (one per application)
● Can be used by non-MapReduce applications
– MapReduce v2 : runs on YARN
32 
Hadoop v2
33 
YARN
34 
YARN
35 
YARN
36 
YARN
37 
YARN
38 
YARN
39 
What about monitoring ?
● Command line : hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to node activity log files from the web UI
40 
What about monitoring ?
41 
What can we do with Hadoop ? 
(Me) 2 projects at Credit Mutuel Arkea :
– LAB : Anti-money laundering 
– Operational reporting for a B2B customer
42 
LAB : Context 
● Tracfin : French financial intelligence unit, under the authority of the Ministry for the Economy and Finance
43 
LAB : Context 
● Difficulty providing accurate alerts : the system is complex to maintain and to extend with new features
44 
LAB : Context 
● COBOL batch (z/OS) : runs from 19:00 until 09:00 the next day
45 
LAB : Migration to Hadoop 
● Pig : its dataflow model fits this kind of process well (a lot of data manipulation)
46 
LAB : Migration to Hadoop 
● A lot of input data : +1 for Pig
47 
LAB : Migration to Hadoop 
● Many of the job's tasks can be parallelized : +1 for Hadoop
48 
LAB : Migration to Hadoop 
● Time spent on data manipulation reduced by more than 50%
49 
LAB : Migration to Hadoop 
● The previous job was already a batch : MapReduce is OK
50 
Operational Reporting
Context :
– Provide a wide variety of reports to a B2B partner
Why Hadoop :
– New project
– Large number of different data sources as input : Pig, help me !
– Batch is OK
51
52 
Pig – Why a new language ?
● With Pig, writing MR jobs becomes easy
● Dataflow model : data is the key !
● Language : Pig Latin
● No limit : User Defined Functions (UDFs)
http://pig.apache.org/docs/r0.13.0/ 
https://github.com/linkedin/datafu 
https://github.com/twitter/elephant-bird 
https://cwiki.apache.org/confluence/display/PIG/PiggyBank
53 
Pig “Hello world”
● Pig-Wordcount
-- Load the file from HDFS
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);
-- Iterate over each line
-- TOKENIZE splits the line into a bag of words, FLATTEN turns it into one record per word
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group by word
grouped = GROUP words BY word;
-- Count the number of occurrences in each group (word)
wordcount = FOREACH grouped GENERATE group, COUNT(words);
-- Display the results on standard output
DUMP wordcount;
54 
Pig vs MapReduce
import …
public class WordCount2 {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    static enum CountersEnum { INPUT_WORDS }
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();
    private Configuration conf;
    private BufferedReader fis;
    ...
=> 130 lines of code !
55 
Hive
● SQL-like language : HiveQL (HQL)
● Metastore : data abstraction and data discovery
● UDFs
56 
Hive “Hello world”
● Hive-Wordcount
-- Create the table structure (DDL)
CREATE TABLE docs (line STRING);
-- Load the data
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
-- Create the result table :
-- select the lines from the previous table, split them into words,
-- group by word and count the records in each group
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
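The `split` call in the query above takes a regular expression as its second argument, so the pattern decides what counts as a word. The sketch below uses Java's `String.split` as a stand-in to show the effect; the class `SplitPatternSketch` is an assumption made for illustration, and Hive's behavior on edge cases such as trailing empty strings may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing how the split pattern changes the tokens.
final class SplitPatternSketch {

    // Splits a line with the given regular expression, as String.split does.
    static List<String> tokens(String line, String regex) {
        return Arrays.asList(line.split(regex));
    }

    public static void main(String[] args) {
        String line = "hadoop is scalable";
        // The pattern "s" matches the letter s, so words are cut in the wrong places.
        System.out.println(tokens(line, "s"));
        // The pattern "\\s+" matches runs of whitespace, giving one token per word.
        System.out.println(tokens(line, "\\s+"));
    }
}
```

A pattern that lost its backslash still runs without error, which makes this kind of mistake easy to miss in the resulting counts.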
57 
ZooKeeper
Purpose : coordinate the different actors of the cluster and give them a shared view of the configuration pushed to it.
58 
ZooKeeper
● Distributed coordination service
59 
ZooKeeper
● Dynamic configuration 
● Distributed locking
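Distributed locking is commonly built on sequentially numbered, session-bound entries: each client takes a number, and the lowest number owns the lock. The sketch below simulates that idea in memory. It does not use the ZooKeeper client API, and the class `LockQueueSketch` and its methods are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of a lock built on sequence numbers.
// It does not use the ZooKeeper client API.
final class LockQueueSketch {
    // sequence number -> client name, kept sorted by sequence number
    private final TreeMap<Long, String> waiting = new TreeMap<>();
    private long nextSequence = 0;

    // A client asks for the lock and receives the next sequence number.
    long enqueue(String client) {
        long sequence = nextSequence++;
        waiting.put(sequence, client);
        return sequence;
    }

    // The client holding the lowest sequence number owns the lock.
    String owner() {
        Map.Entry<Long, String> first = waiting.firstEntry();
        return first == null ? null : first.getValue();
    }

    // Releasing the lock, or losing the session, removes the client's entry.
    void remove(long sequence) {
        waiting.remove(sequence);
    }

    public static void main(String[] args) {
        LockQueueSketch lock = new LockQueueSketch();
        long a = lock.enqueue("worker-a");
        long b = lock.enqueue("worker-b");
        System.out.println("owner: " + lock.owner());
        lock.remove(a); // worker-a releases the lock or its session ends
        System.out.println("owner: " + lock.owner());
        lock.remove(b);
        System.out.println("owner: " + lock.owner());
    }
}
```

Because an entry disappears when its owner's session ends, a crashed worker cannot hold the lock forever.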
60 
Kafka
● Messaging system with a specific design : a partitioned, replicated commit log
● Publish-subscribe (topic) and point-to-point (queue) at the same time
● Suitable for high volumes of data
https://kafka.apache.org/
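Offering topic and point-to-point delivery at the same time comes from consumer groups: every group receives every message, and inside a group each message goes to exactly one consumer. The sketch below models that in memory. It is not the Kafka client API; the class `ConsumerGroupSketch`, the round-robin partitioning and the partition-to-consumer assignment are assumptions made for illustration, and consumer names are assumed to be unique across groups.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of consumer groups. Not the Kafka client API.
final class ConsumerGroupSketch {

    // Delivers every message to each group (publish-subscribe) and, inside a
    // group, to exactly one consumer (queue) chosen from the message's partition.
    static Map<String, List<String>> deliver(List<String> messages, int partitions,
                                             Map<String, List<String>> groups) {
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (List<String> consumers : groups.values()) {
            for (String consumer : consumers) {
                received.put(consumer, new ArrayList<>());
            }
        }
        for (int i = 0; i < messages.size(); i++) {
            int partition = i % partitions; // assumption: round-robin partitioning
            for (List<String> consumers : groups.values()) {
                // assumption: partition p is owned by consumer (p mod group size)
                String owner = consumers.get(partition % consumers.size());
                received.get(owner).add(messages.get(i));
            }
        }
        return received;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("billing", List.of("b1", "b2")); // two consumers share the work
        groups.put("audit", List.of("a1"));         // one consumer sees everything
        Map<String, List<String>> received =
                deliver(List.of("m0", "m1", "m2", "m3"), 2, groups);
        received.forEach((consumer, msgs) -> System.out.println(consumer + " <- " + msgs));
    }
}
```

A single group with several consumers behaves like a queue, while several groups with one consumer each behave like a classic topic.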
61 
Hadoop : batch, but not only
62 
Tez 
● Faster, more interactive execution engine for Hive and Pig
63 
HBase 
● Online database (realtime querying) 
● NoSQL : column-oriented database
● Based on Google BigTable 
● Storage on HDFS
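The data model behind these points can be pictured as a sorted, sparse map: row key, then column, then value. The sketch below is an in-memory model only, not the HBase client API; the class `SortedTableSketch`, the `family:qualifier` column naming and the sample row keys are assumptions made for illustration, and cell versions (timestamps) are left out.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a sorted, sparse table. Not the HBase client API.
final class SortedTableSketch {
    // row key -> ("family:qualifier" -> value), both levels kept sorted
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Writes one cell; rows only store the columns they actually have.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Point lookup by row key and column; null when the cell does not exist.
    String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan: rows with startKey <= key < stopKey, in key order.
    NavigableMap<String, NavigableMap<String, String>> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, true, stopKey, false);
    }

    public static void main(String[] args) {
        SortedTableSketch table = new SortedTableSketch();
        table.put("user#002", "info:name", "Bob");
        table.put("user#001", "info:name", "Alice");
        table.put("user#001", "stats:logins", "42");
        table.put("video#001", "info:title", "Intro");
        System.out.println(table.get("user#001", "info:name"));
        System.out.println(table.scan("user#", "user$").keySet());
    }
}
```

Keeping rows sorted by key is what makes both single-row reads and short range scans cheap, so row key design matters a great deal.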
64 
Storm 
● Streaming mode 
● Plugs in well with Apache Kafka
● Allows data manipulation as the data comes in
http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos 
http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign
65 
Cascading 
● Application development platform on Hadoop 
● APIs in Java : standard API, data processing, data 
integration, scheduler API
66 
Scalding 
● Scala API for Cascading
67 
Phoenix 
● Relational DB layer over HBase
● HBase access delivered as a JDBC client 
● Perf : on the order of milliseconds for small 
queries, or seconds for tens of millions of rows
68 
Spark 
● Big data analytics, in memory or on disk
● Complements Hadoop
● Faster and more flexible than MapReduce
https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark 
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
69 
Questions ?
