Introduction to Big Data

Byung-Won On, PhD
Seoul National University
December 18, 2012
Outline
 Big Data
 MapReduce
    ◦ Word Count
    ◦ k-means Clustering Algorithm
   NoSQL
    ◦ Neptune
   Demo
    ◦ Hadoop Installation
    ◦ Word Count using MapReduce
                                     2
Big data in my opinion




                         3
Main keywords related to Big Data
 Data in TB or PB
 Volume
 Velocity
 Variety
 Value
 Complexity


                     Source: Gartner 2012.11


                                               4
Big Data Platform




Source: 안창원, 황승구, "Big Data Technologies and Key Issues," 정보과학회지
             (Communications of KIISE) 30(6), pp. 10-17, 2012
                                          5
Big Data Technology
   Distributed File Systems
    ◦ GFS, HDFS
   Databases
    ◦ Oracle, DB2, MySQL (RDBMS)
    ◦ Bigtable, HBase, Cassandra, MongoDB (NoSQL)
   Parallel Programming Model
    ◦ MapReduce, Hive, Pig
   Analytics & Visualization
    ◦ Mahout, R, Tableau, Nutch




                                                    6
HADOOP & MAPREDUCE




                 7
Motivating Example
   20 billion web pages * 20KB per web page
    ◦ About 400TB
   A computer reads 30-35MB/sec from disk
    ◦ It takes about 4 months to read
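
The arithmetic behind this slide can be checked with a short sketch. It assumes decimal units (1KB = 1,000 bytes, 1MB = 1,000,000 bytes) and a single disk read sequentially; the names `PAGES`, `PAGE_SIZE`, and `days_to_read` are illustrative only.

```python
PAGES = 20 * 10**9        # 20 billion web pages
PAGE_SIZE = 20 * 10**3    # 20KB per page, in bytes (decimal units assumed)

total_bytes = PAGES * PAGE_SIZE
total_tb = total_bytes / 10**12   # 400 TB


def days_to_read(num_bytes, mb_per_sec):
    """Days needed to read num_bytes sequentially at mb_per_sec."""
    seconds = num_bytes / (mb_per_sec * 10**6)
    return seconds / 86_400


for rate in (30, 35):
    print(f"{rate} MB/sec: {days_to_read(total_bytes, rate):.0f} days")
```

Under these assumptions one machine needs roughly 130 to 155 days, which is where the "about 4 months" figure comes from.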




                                          8
Cluster Architecture




         Source: http://cs246.stanford.edu
                                     9
Challenges
 How do we distribute computation?
 How can we make it easy to write
  parallel programs?
 How can we handle machine failure?




                                       10
Approach
   Distributed File Systems
    ◦ Google File System, Hadoop Distributed
      File System
   Parallel Programming Framework
    ◦ MapReduce in Hadoop




                                               11
Distributed File System
   Problem
    ◦ If nodes fail, how do we store data persistently?
   Solution
    ◦ Hadoop Distributed File System (HDFS),
      which provides a global file namespace
   Properties of Data for HDFS
    ◦ Huge files ~ xx TB
    ◦ Data is rarely updated in place (i.e.,
      immutable files)
    ◦ Read/append operations are common

                                                  12
Distributed File System
   Name Node
    ◦ Stores metadata
    ◦ Active/standby
   Data Nodes
    ◦   File is split into contiguous chunks
    ◦   Each chunk ~ 64MB
    ◦   Each chunk is replicated ~ 3x
    ◦   Keep replicas in different racks
   Client (to access files)
    ◦ Contact the name node to find data nodes
    ◦ Directly connect to data nodes to access
      data
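
A minimal sketch of the chunking and replication idea on this slide. It is a toy model, not HDFS code: the round-robin placement policy, the rack and node names, and the functions `split_into_chunks` and `place_replicas` are all assumptions made for illustration, and real HDFS uses its own rack-aware placement policy.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64MB per chunk, as on the slide
REPLICATION = 3                 # each chunk is replicated 3x


def split_into_chunks(file_size):
    """Return (offset, length) pairs of contiguous chunks covering the file."""
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]


def place_replicas(num_chunks, racks):
    """Toy policy: put each replica of a chunk on a different rack.

    racks maps a rack name to the list of data nodes in that rack.
    """
    rack_names = sorted(racks)
    if len(rack_names) < REPLICATION:
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        nodes = []
        for r in range(REPLICATION):
            rack = rack_names[(chunk_id + r) % len(rack_names)]
            members = racks[rack]
            nodes.append(members[chunk_id % len(members)])
        placement[chunk_id] = nodes   # what a name node would keep as metadata
    return placement


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
chunks = split_into_chunks(200 * 1024 * 1024)      # a 200MB file
placement = place_replicas(len(chunks), racks)
```

A client would ask the name node for `placement[chunk_id]` and then read the chunk directly from one of the listed data nodes.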
                                                 13
MapReduce Programming Architecture




                        14
Overview
 Sequentially read big data
 Map
    ◦ Extract something you care about
   Group by key
    ◦ Sort and shuffle
   Reduce
    ◦ Aggregate, summarize, filter, transform
   Write the result

                                                15
Map Step




       Source: http://cs246.stanford.edu
                                   16
Reduce Step




       Source: http://cs246.stanford.edu
                                   17
Algorithm
   Input
    ◦ A set of key/value pairs
   A programmer specifies two methods
    ◦ Map(k,v) => <k’,v’>*
      Takes a key/value pair and outputs a set of
       key/value pairs
      E.g., the key is the filename and the value is a
       single line in the file
    ◦ Reduce(k’,<v’>*) => <k’,v’’>*
      All values v’ with the same key k’ are reduced
       together and processed in v’ order

                                                          18
Word Counting using MapReduce




         Source: http://cs246.stanford.edu
                                     19
Word Counting using MapReduce
   Map(key, value)
    // key: document name; value: text of the document
    for each word w in value:
     emit(w, 1)

   Reduce(key, values)
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
     result += v
    emit(key, result)
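
The pseudocode above can be tried on a single machine with a small in-memory simulation. This is a sketch of the programming model only, not Hadoop's API: `run_mapreduce` is a hypothetical helper that stands in for the framework's shuffle and sort, and the two sample documents are made up.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document text."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group by key, then reduce, all in memory."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)          # group by key (the shuffle step)
    output = []
    for k2 in sorted(groups):              # keys reach reducers in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(counts["the"])
```

In a real cluster the grouping step moves data between machines; here it is just a dictionary.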
                                                    20
MapReduce Task
 Partition the input data
 Schedule the program execution
  across nodes
 Handle machine failures
 Manage inter-node
  communication




                                   21
Parallel Processing




                      22
Master Node (JobTracker)
   Task status
    ◦ Idle, in-progress, completed
   Idle tasks are scheduled as workers
    become available
   When a map task finishes, it sends the
    master the locations and sizes of its
    intermediate files, one for each reduce
    worker
   The master pushes this information to
    the reducers
   The master regularly pings workers to detect
    failures
   In Hadoop 1.x this role belongs to the JobTracker;
    the name node only manages HDFS metadata
                                                  23
Failover
   Map worker failure
    ◦ Map tasks that are completed or in progress on
      the failed worker are reset to idle
    ◦ Reduce workers are notified when a task is
      rescheduled on another worker
   Reduce worker failure
    ◦ Only in-progress tasks are reset to idle
   Master failure
    ◦ MapReduce task is aborted and client is
      notified
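
The worker-failure rules above can be written as a small state update. This is a sketch under assumptions: the `Task` record and `handle_worker_failure` are hypothetical names, and a real scheduler tracks far more state than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in-progress", or "completed"
    worker: Optional[str] = None   # worker the task ran or is running on


def handle_worker_failure(tasks, failed_worker):
    """Reset tasks on a failed worker following the rules on the slide."""
    for t in tasks:
        if t.worker != failed_worker:
            continue
        # Map tasks: completed and in-progress ones are both redone.
        # Reduce tasks: only in-progress ones are redone.
        if t.kind == "map" or t.state == "in-progress":
            t.state, t.worker = "idle", None
    return tasks


tasks = [
    Task("map", "completed", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in-progress", "w1"),
    Task("map", "completed", "w2"),
]
handle_worker_failure(tasks, "w1")
```

Completed map tasks are redone because their intermediate files sit on the failed worker's local disk, while completed reduce output is already in the distributed file system.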

                                                 24
Set-up
 Map tasks: M
 Reduce tasks: R
 M and R are usually much larger than the number
  of nodes in the cluster
    ◦ One chunk of data per map task is common
    ◦ Improves dynamic load balancing
    ◦ Speeds up recovery from worker failures
 R is often smaller than M
    ◦ Note that output is spread across R files
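
One common way to decide which of the R reduce tasks receives a key is to hash the key modulo R. The sketch below assumes that scheme; `partition` is an illustrative name, and `zlib.crc32` is used only because it gives the same value on every machine and every run.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to one of num_reducers reduce tasks (hash mod R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


R = 4
words = ["hadoop", "hdfs", "mapreduce", "hadoop", "neptune"]
buckets = {r: [] for r in range(R)}
for w in words:
    buckets[partition(w, R)].append(w)
```

Every occurrence of a key lands in the same bucket, so each reduce task sees all values for its keys and writes one of the R output files.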

                                                  25
Combiners
 A map task often produces many pairs
  (k,v1), (k,v2), … for the same key k
    ◦ E.g., popular words in word count
 Pre-aggregate values at the mapper
    ◦ Combine(k, list(v1)) -> v2
    ◦ The combiner is usually the same as the reduce
      function (valid when the reduce operation is
      commutative and associative)
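
A sketch of what a combiner saves in word count. The function names and the sample text are made up for illustration; the point is that summing is commutative and associative, so partial sums on the mapper give the same final result.

```python
from collections import Counter


def map_without_combiner(text):
    """Emit one (word, 1) pair per occurrence."""
    return [(w, 1) for w in text.split()]


def map_with_combiner(text):
    """Pre-aggregate on the mapper: one (word, count) pair per distinct word."""
    return sorted(Counter(text.split()).items())


text = "to be or not to be"
plain = map_without_combiner(text)
combined = map_with_combiner(text)
print(len(plain), len(combined))
```

Fewer pairs leave the mapper, so less data crosses the network during the shuffle.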



                                           26
Map Step




           27
Reduce Step




              28
Reduce Step




              29
K-MEANS USING MAPREDUCE




                30
Mixed Entities in Web
                                   The search result includes a
                                   mixture of web pages with
                                   different Tom Mitchells




                                   Separate different web pages
                                   into different groups (called
                                   clusters)

Byung-Won On, Ingyu Lee and Dongwon Lee, Scalable
clustering methods for the name disambiguation problem,
Knowledge and Information Systems 31(1):129-151 (2012)
                                                                   31
Clustering

Web pages of two different persons whose names are spelled the same are all mixed in the pool

(Figure: web pages a1, a2, a3 and b1, b2, shown in two panels.)




                                                                                            32
k-means
1. Randomly select the initial cluster centroids
2. Measure the distance between each centroid and each object
3. Assign each object to the centroid at the shortest distance
4. Choose a new centroid for each cluster as the mean of that cluster
5. Repeat steps 2 to 4 until the convergence criterion is met

(Figure: objects w1 to w5 at each step, with distances measured from w3 and from w4.)
                                                                                           33
k-means using MapReduce
   Do
    ◦ Map
      Input is a data point; the k centers are
       broadcast to every mapper
      Find the closest of the k centers for the
       input point
    ◦ Reduce
      Input is one of the k centers and all data points
       that have it as their closest center
      Calculate the new center from these data points
   Until none of the centers
    change
                                                       34
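
The Do/Until loop for k-means can be sketched on one machine as follows. This simulates the map and reduce phases in memory and is not Hadoop code; `closest`, `kmeans_mapreduce`, and the sample points are assumptions for illustration, with squared Euclidean distance as the assumed distance measure.

```python
from collections import defaultdict


def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))


def kmeans_mapreduce(points, centers, max_iters=100):
    """Repeat map (assign) and reduce (re-center) until centers stop changing."""
    centers = list(centers)
    for _ in range(max_iters):
        # Map: emit (index of closest center, point); group by that index.
        groups = defaultdict(list)
        for p in points:
            groups[closest(p, centers)].append(p)
        # Reduce: the new center is the mean of the points assigned to it.
        new_centers = list(centers)
        for i, members in groups.items():
            dim = len(members[0])
            new_centers[i] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim))
        if new_centers == centers:     # no center changed: converged
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
result = kmeans_mapreduce(points, [(0.0, 0.0), (10.0, 0.0)])
```

Each pass of the loop corresponds to one MapReduce job, so the number of iterations is the number of jobs submitted.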
Map Step




           35
Reduce Step




              36
NOSQL




        37
Relational DB

 Manage data in GB or TB
 Store important data
  (transaction, personnel, …)
 Guarantee both consistency and
  availability
 Oracle, DB2, MS SQL Server, MySQL



                                      38
Not Only SQL (NoSQL)
   Manage unstructured data such as text
    in TB or PB
   Guarantee partition tolerance
   Guarantee either consistency or
    availability
   Flexible schema
   No SQL and no join operations
   Bigtable (Google), Dynamo (Amazon),
    HBase (Apache), Cassandra (Facebook),
    MongoDB, Neptune (NHN)
    ◦ HBase and Neptune follow the Bigtable design
                                            39
Neptune: For Managing Big Data
 Analyzing log data from Internet portal
  or online game service
 Calculating PageRank or similarities
  between web pages
 Search for personalization
 Social network analysis, recommender
  systems, blog clustering, etc.



                                        40
System Architecture




                      41
Component Nodes
   Master Node
    ◦ Assigns tablets to TabletServers, considering the
      number and size of tablets
   TabletServers
    ◦ Serve insert/delete requests from clients
    ◦ Store a few thousand tablets (100~200MB per
      tablet) => a few hundred GB
    ◦ In-memory & disk DB
    ◦ Merge tablets if the number of files grows
      (improves search operations)
    ◦ Split tablets if a file becomes too large
      (improves performance)
   Changelog Servers
    ◦ Store transaction log

                                                               42
Data Format
                 Logical data unit
                  ◦ Table
                 Row
                  ◦ Rowkey created automatically
                    by the system
                  ◦ Sorted in ascending order
                 Column
                  ◦ Column key, timestamp
                  ◦ Sorted in lexical order
                  ◦ Get operation
                     Returns the most recent data
                  ◦ Column-oriented indexing
               Divide a table into a set
                of tablets by rowkey
               Store tablets in the cluster

                                               43
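
The Data Format slide can be illustrated with a toy model of timestamped cells and rowkey-range tablets. This is not Neptune's API: `Cell`, `split_into_tablets`, and `find_tablet` are hypothetical names, and splitting by a fixed row count is an assumption made to keep the sketch short.

```python
import bisect


class Cell:
    """All versions of one column value, keyed by timestamp."""

    def __init__(self):
        self.versions = {}

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def get(self):
        """Return the most recent value, as the Get operation does."""
        return self.versions[max(self.versions)]


def split_into_tablets(rowkeys, max_rows):
    """Sort rowkeys ascending and cut them into contiguous tablets."""
    keys = sorted(rowkeys)
    return [keys[i:i + max_rows] for i in range(0, len(keys), max_rows)]


def find_tablet(tablets, rowkey):
    """Index of the tablet whose rowkey range would contain rowkey."""
    first_keys = [t[0] for t in tablets]
    return max(bisect.bisect_right(first_keys, rowkey) - 1, 0)


cell = Cell()
cell.put(1, "a")
cell.put(5, "b")
cell.put(3, "c")
tablets = split_into_tablets(["r3", "r1", "r2", "r5", "r4"], max_rows=2)
```

Because rows are sorted by rowkey, locating the tablet for a key is a binary search over the first key of each tablet.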
Meta Data




Metadata is stored in the shared memory of Pleiades
                                              44
Real Time Processing
                 Reasonable
                  performance
                  ◦ A few ms response
                    time
                 In-Memory DB
                 Minor compaction
                  ◦ When a memory table
                    is full
                 Major compaction
                  ◦ Combine multiple
                    tables for fast search
                    operation
                 Garbage collection
                                             45
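
Minor and major compaction from the Real Time Processing slide can be sketched with a generic in-memory table that is flushed to sorted on-disk tables. This is a simplified model of the general technique, not Neptune's implementation; `TinyStore` and its `memtable_limit` parameter are hypothetical.

```python
class TinyStore:
    """In-memory table plus a list of sorted, immutable flushed tables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.disk_tables = []      # oldest first; each is a sorted list of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Memory table is full: write it out as one sorted table."""
        if self.memtable:
            self.disk_tables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def major_compaction(self):
        """Combine all flushed tables into one so a search reads one table."""
        merged = {}
        for table in self.disk_tables:     # newer tables overwrite older values
            merged.update(table)
        self.disk_tables = [sorted(merged.items())] if merged else []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.disk_tables):   # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None


store = TinyStore(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(k, v)
tables_before = len(store.disk_tables)
store.major_compaction()
```

Before the major compaction a lookup may have to scan several tables; afterwards there is one, and overwritten values such as the old `a` are gone, which is the garbage collection step.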
MapReduce




            46
Client API & Shell Command




                             47
Failover




* The active master acquires the NEPTUNE_MASTER lock in Pleiades
* If the active master fails, Pleiades releases the NEPTUNE_MASTER lock
  and the slave (standby) master acquires it
                                                      48
Concluding Remarks
   Big Data
    ◦ Volume, Velocity, Variety
   Store/Manage Big Data
    ◦ Hadoop, NoSQL in cluster
   Parallel Programming
    ◦ MapReduce
   Analytics (Mining & Visualization)
    ◦ Mahout, R


                                         49
Future Plan: Infra
   A pilot system for Big Data (2012. 12)
    ◦ 1 Management Server
    ◦ 1 Name Node
      2 CPU * 6 Core, 24GB RAM, 1TB HDD & SSD
    ◦ 5 Data Nodes
      2 CPU * 6 Core, 24GB RAM, 1TB HDD & SSD
    ◦ Rack Mount
    ◦ Gigabit Switch Hub
    ◦ Hadoop & CDH

                                                 50
Future Plan: Research
 Developing machine learning,
  modeling and optimization algorithms
  for mining/visualizing public data in TB
 Re-designing existing data mining
  algorithms using MapReduce
    ◦ Data Mining Algorithms: Clustering,
      Classification, Probabilistic Modeling,
      Association Rule Mining, Graph Analysis,
      etc.
    ◦ Serial algorithms => Parallel
      algorithms
                                               51
References
 G. Shim, MapReduce Algorithms for
  Big Data Analysis, VLDB 2012 Tutorial
 J. Schindler, I/O Characteristics of
  NoSQL Databases, VLDB 2012
  Tutorial
 J. Leskovec, Mining Massive Datasets,
  Available: http://cs246.stanford.edu
 김형준, Neptune: A Large-Scale Distributed
  Data Management System, NHN Tech. Report, 2008
 T. White, Hadoop: The Definitive
  Guide, O’Reilly 2012
                                      52
DEMONSTRATION




                53
Outline
 Hadoop Installation
 Word Counting using MapReduce




                                  54
Software for Hadoop Installation

 VirtualBox 4.1.22: https://www.virtualbox.org/
 CentOS: http://www.centos.org/
 Java: http://www.oracle.com/technetwork/java/index.html
 Hadoop 1.0.4: http://apache.tt.co.kr/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz
                                                                           55
Hadoop Installation

     JDK            Hadoop            Configuration

           tar xvf hadoop-1.0.4-bin.tar.gz
                                                   56
Three Modes for Hadoop Installation

 Standalone   Pseudo-distributed   Fully distributed




                                          57
Configuration
 hadoop-1.0.4/conf
 hadoop-env.sh



   core-site.xml
   hdfs-site.xml
   mapred-site.xml


                      58
Pseudo-Distributed Mode

 core-site.xml


 mapred-site.xml


 hdfs-site.xml




                          59
Completion of Hadoop Installation




                       60
Word Count using MapReduce




               61
Word Count using MapReduce




               62
Word Count using MapReduce




               63
Word Count using MapReduce




               64
Text Data
Input file   Output file




                           65
Patent ID Data
Input file       Output file




                               66

1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 

Recently uploaded (20)

SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 

Big data

  • 1. Introduction to Big Data Byung-Won On, PhD Seoul National University December 18, 2012
  • 2. Outline
      ▪ Big Data
      ▪ MapReduce
          ◦ Word Count
          ◦ k-means Clustering Algorithm
      ▪ NoSQL
          ◦ Neptune
      ▪ Demo
          ◦ Hadoop Installation
          ◦ Word Count using MapReduce
  • 3. Big data in my opinion
  • 4. Main keywords related to Big data
      ▪ Data in TB or PB
      ▪ Volume
      ▪ Velocity
      ▪ Variety
      ▪ Value
      ▪ Complexity
      Source: Gartner 2012.11
  • 5. Big Data Platform
      Source: 안창원, 황승구, "Big Data Technologies and Key Issues" (빅 데이터 기술과 주요 이슈), 정보과학회지 (Communications of KIISE) 30(6), pp. 10-17, 2012
  • 6. Big Data Technology
      ▪ Distributed File Systems
          ◦ GFS, HDFS
      ▪ Databases
          ◦ Oracle, DB2, MySQL (RDBMS)
          ◦ Bigtable, HBase, Cassandra, MongoDB (NoSQL)
      ▪ Parallel Programming Model
          ◦ MapReduce, Hive, Pig
      ▪ Analytics & Visualization
          ◦ Mahout, R, Tableau, Nutch
  • 8. Motivating Example
      ▪ 20 billion web pages * 20 KB per web page
          ◦ About 400 TB
      ▪ A single computer reads 30-35 MB/sec from disk
          ◦ It takes roughly 4 to 5 months just to read the data
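The arithmetic behind this slide can be checked with a short sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 MB = 10^6 bytes) and a single disk read sequentially; the function name `days_to_read` is only illustrative.

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
PAGES = 20e9              # 20 billion web pages
PAGE_SIZE_BYTES = 20e3    # 20 KB per page
total_bytes = PAGES * PAGE_SIZE_BYTES


def days_to_read(num_bytes, mb_per_sec):
    """Days one machine needs to read num_bytes sequentially at mb_per_sec."""
    seconds = num_bytes / (mb_per_sec * 1e6)
    return seconds / 86400


total_tb = total_bytes / 1e12
slow = days_to_read(total_bytes, 30)   # at 30 MB/sec
fast = days_to_read(total_bytes, 35)   # at 35 MB/sec
print(f"{total_tb:.0f} TB; between {fast:.0f} and {slow:.0f} days on one disk")
```

Under these assumptions the read alone takes between about 132 and 154 days, which is the motivation for spreading the data over many machines.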
  • 9. Cluster Architecture
      Source: http://cs246.stanford.edu
  • 10. Challenges
      ▪ How do we distribute computation?
      ▪ How can we make it easy to write parallel programs?
      ▪ How can we handle machine failure?
  • 11. Approach
      ▪ Distributed File Systems
          ◦ Google File System, Hadoop Distributed File System
      ▪ Parallel Programming Framework
          ◦ MapReduce in Hadoop
  • 12. Distributed File System
      ▪ Problem
          ◦ If nodes fail, how do we store data persistently?
      ▪ Solution
          ◦ Hadoop Distributed File System (HDFS), providing a global file namespace
      ▪ Properties of Data for HDFS
          ◦ Huge files ~ xx TB
          ◦ Data is rarely updated in place (i.e., immutable files)
          ◦ Read/append operations are common
  • 13. Distributed File System
      ▪ Name Node
          ◦ Stores metadata
          ◦ Active/standby
      ▪ Data Nodes
          ◦ A file is split into contiguous chunks
          ◦ Each chunk ~ 64 MB
          ◦ Each chunk is replicated ~ 3x
          ◦ Replicas are kept in different racks
      ▪ Client (to access files)
          ◦ Contacts the name node to find data nodes
          ◦ Connects directly to data nodes to access data
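As a rough illustration of the chunking and replication figures above, the following sketch computes how many chunks a file occupies and how much raw storage its replicas need. The constants come from the slide; the function `chunk_layout` is a hypothetical helper, not part of any Hadoop API, and rack placement is not modeled.

```python
import math

CHUNK_SIZE = 64 * 1024**2   # 64 MB per chunk, as on the slide
REPLICATION = 3             # each chunk is stored about three times


def chunk_layout(file_size_bytes):
    """Return (number of chunks, raw bytes stored across all replicas)."""
    chunks = math.ceil(file_size_bytes / CHUNK_SIZE)
    return chunks, file_size_bytes * REPLICATION


chunks, raw_bytes = chunk_layout(1024**3)      # a 1 GB file
print(chunks, "chunks,", raw_bytes // 1024**2, "MB of raw storage")
```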
  • 15. Overview
      ▪ Sequentially read big data
      ▪ Map
          ◦ Extract something you care about
      ▪ Group by key
          ◦ Sort and shuffle
      ▪ Reduce
          ◦ Aggregate, summarize, filter, transform
      ▪ Write the result
  • 16. Map Step
      Source: http://cs246.stanford.edu
  • 17. Reduce Step
      Source: http://cs246.stanford.edu
  • 18. Algorithm
      ▪ Input
          ◦ A set of key/value pairs
      ▪ A programmer specifies two methods
          ◦ Map(k, v) => <k’, v’>*
              ▫ Takes a key/value pair and outputs a set of key/value pairs
              ▫ Ex) the key is the filename and the value is a single line in the file
          ◦ Reduce(k’, <v’>*) => <k’, v’’>*
              ▫ All values v’ with the same key k’ are reduced together and processed in v’ order
  • 19. Word Counting using MapReduce
      Source: http://cs246.stanford.edu
  • 20. Word Counting using MapReduce
      ▪ Map(key, value)
          // key: document name; value: text of the document
          for each word w in value:
              emit(w, 1)
      ▪ Reduce(key, values)
          // key: a word; values: an iterator over counts
          result = 0
          for each count v in values:
              result += v
          emit(key, result)
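The pseudocode above can be tried out in a single process. This is a minimal sketch, not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the framework that performs the map, group-by-key, and reduce phases in memory, and the two input documents are made up for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):          # sorting stands in for sort-and-shuffle
        output.extend(reduce_fn(k2, groups[k2]))
    return output


docs = [("a.txt", "big data big ideas"), ("b.txt", "data beats ideas")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step is done by the framework across machines; only `map_fn` and `reduce_fn` correspond to code the programmer writes.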
  • 21. MapReduce Task
      ▪ Partition the input data
      ▪ Schedule the program execution across nodes
      ▪ Handle machine failures
      ▪ Manage communication between nodes
  • 23. Name Node
      ▪ Task status
          ◦ Idle, in-progress, completed
      ▪ Idle tasks get scheduled as workers become available
      ▪ When a map task finishes, it sends the name node the locations and sizes of its intermediate files, one for each reduce worker
      ▪ The name node pushes this information to the reducers
      ▪ The name node regularly pings workers to detect failures
  • 24. Failover
      ▪ Map worker failure
          ◦ Map tasks completed or in progress at the worker are reset to idle
          ◦ Reduce workers are notified when a task is rescheduled on another worker
      ▪ Reduce worker failure
          ◦ Only in-progress tasks are reset to idle
      ▪ Master failure
          ◦ The MapReduce task is aborted and the client is notified
  • 25. Set-up
      ▪ Map tasks: M
      ▪ Reduce tasks: R
      ▪ M and R are much larger than the number of nodes in the cluster
      ▪ One chunk of data per map task is common
      ▪ Improves dynamic load balancing
      ▪ Speeds up recovery from worker failure
      ▪ Often R is smaller than M
          ◦ Note that output is spread across R files
  • 26. Combiners
      ▪ A map task will produce many pairs (k, v1), (k, v2), … for the same key k
      ▪ E.g., popular words in word count
      ▪ Pre-aggregate values at the mapper
          ◦ Combine(k, list(v1)) -> v2
          ◦ The combiner is usually the same as the reduce function
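A small sketch of what a combiner buys in the word-count case: each map task sums its own counts before the shuffle, so fewer pairs travel over the network while the final result stays the same. The function names and the two input splits are illustrative assumptions; reusing the reduce logic as the combiner works here because summation is associative and commutative.

```python
from collections import Counter


def map_words(text):
    """Plain map output: one (word, 1) pair per occurrence."""
    return [(w, 1) for w in text.split()]


def combine(pairs):
    """Combiner: sum counts per key inside one map task, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_all(pairs):
    """Reduce: sum counts per key across all map tasks."""
    total = Counter()
    for word, count in pairs:
        total[word] += count
    return dict(total)


splits = ["the cat and the hat", "the end of the the story"]
raw = [p for s in splits for p in map_words(s)]
combined = [p for s in splits for p in combine(map_words(s))]
print(len(raw), "pairs without a combiner,", len(combined), "with one")
```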
  • 27. Map Step
  • 31. Mixed Entities in Web
      The search result includes a mixture of web pages with different Tom Mitchells
      Separate different web pages into different groups (called clusters)
      Byung-Won On, Ingyu Lee and Dongwon Lee, Scalable clustering methods for the name disambiguation problem, Knowledge and Information Systems 31(1):129-151 (2012)
  • 32. Clustering
      Web pages of two different persons with the same name spelling are all mixed in the pool
      [Figure: web pages a1, a2, a3 and b1, b2]
  • 33. k-means
      1. Randomly select the cluster centroids
      2. Measure the distance between each centroid and each object
      3. Assign each object to the centroid at the shortest distance
      4. Choose a new centroid for each cluster based on the mean of the cluster
      5. Repeat steps 2 to 4 until the convergence criterion is met
      [Figure: objects w1 to w5 at each step]
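The five steps can be sketched in plain Python as follows. This is an illustrative single-machine version under a few assumptions not stated on the slide: squared Euclidean distance, "centroids stopped changing" as the convergence criterion, a seeded random generator for reproducibility, and an empty cluster keeping its previous centroid.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                    # step 1
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # steps 2 and 3
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[i]  # step 4
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                   # step 5: converged
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```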
  • 34. k-means using MapReduce
      ▪ Do
          ◦ Map
              ▫ Input is a data point; the k centers are broadcast
              ▫ Find the closest of the k centers for the input point
          ◦ Reduce
              ▫ Input is one of the k centers and all data points having this center as their closest center
              ▫ Calculate the new center from these data points
      ▪ Until none of the centers change
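The loop above maps onto one MapReduce job per iteration. The sketch below simulates that in a single process: `kmeans_map` and `kmeans_reduce` play the roles of the map and reduce functions, a dictionary stands in for the shuffle, and the `centers` argument stands in for the broadcast. All names are illustrative, and the choice to leave a center unchanged when no point is assigned to it is an assumption.

```python
from collections import defaultdict


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_map(point, centers):
    """Map: emit (index of the closest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist2(point, centers[i]))
    return idx, point


def kmeans_reduce(idx, points):
    """Reduce: the new center is the mean of the points assigned to it."""
    n = len(points)
    return idx, tuple(sum(c) / n for c in zip(*points))


def kmeans_mapreduce(points, centers, max_rounds=50):
    centers = list(centers)
    for _ in range(max_rounds):
        groups = defaultdict(list)            # stands in for the shuffle
        for p in points:
            idx, value = kmeans_map(p, centers)
            groups[idx].append(value)
        new_centers = list(centers)           # unassigned centers stay put
        for idx, assigned in groups.items():
            new_centers[idx] = kmeans_reduce(idx, assigned)[1]
        if new_centers == centers:            # until no center changes
            return new_centers
        centers = new_centers
    return centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centers = kmeans_mapreduce(points, [(0, 0), (10, 10)])
print(final_centers)
```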
  • 35. Map Step
  • 37. NOSQL
  • 38. Relational DB
      ▪ Manage data in GB or TB
      ▪ Store important data (transaction, personnel, …)
      ▪ Guarantee both consistency and availability
      ▪ Oracle, DB2, MS SQL Server, MySQL
  • 39. Not Only SQL (NoSQL)
      ▪ Manage unstructured data such as text in TB or PB
      ▪ Guarantee partition tolerance
      ▪ Guarantee either consistency or availability
      ▪ Flexible schema
      ▪ No SQL and no join operations
      ▪ Bigtable (Google), Dynamo (Amazon), HBase (Yahoo), Cassandra (Facebook), MongoDB, Neptune (NHN)
          ◦ Bigtable = HBase = Neptune
  • 40. Neptune: For Managing Big Data
      ▪ Analyzing log data from Internet portals or online game services
      ▪ Calculating PageRank or similarities between web pages
      ▪ Search for personalization
      ▪ Social network analysis, recommender systems, blog clustering, etc.
  • 42. Component Nodes
      ▪ Master Node
          ◦ Assigns tablets to TabletServers, considering the number and size of tablets
      ▪ TabletServers
          ◦ Serve insert/delete requests from clients
          ◦ Store a few thousand tablets (100~200 MB per tablet) => a few hundred GB
          ◦ In-memory & disk DB
          ◦ Merge tablets if the number of files grows (improves search operations)
          ◦ Split tablets if a file grows large (improves performance)
      ▪ Changelog Servers
          ◦ Store the transaction log
  • 43. Data Format
      ▪ Logical data unit
          ◦ Table
      ▪ Row
          ◦ Rowkey created automatically by the system
          ◦ Sorted in ascending order
      ▪ Column
          ◦ Column key, timestamp
          ◦ Sorted in lexical order
          ◦ Get operation
              ▫ Returns the most recent data
          ◦ Column-oriented indexing
      ▪ Divide a table into a set of tablets by rowkey
      ▪ Store tablets in the cluster
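To make the data format concrete, here is a toy in-memory model of such a table. It is not Neptune's API: the class `ToyTable` and its methods are hypothetical, rowkeys are supplied by the caller rather than generated by the system, and tablets are cut by row count rather than by file size. It only illustrates timestamped cells, a get that returns the most recent version, and splitting sorted rowkeys into tablets.

```python
import bisect


class ToyTable:
    """Toy table: cells keyed by (rowkey, column key), versioned by timestamp."""

    def __init__(self):
        self.cells = {}   # (rowkey, column) -> sorted list of (timestamp, value)

    def put(self, rowkey, column, timestamp, value):
        versions = self.cells.setdefault((rowkey, column), [])
        bisect.insort(versions, (timestamp, value))   # keep versions in time order

    def get(self, rowkey, column):
        """Return the most recent value, or None if the cell is absent."""
        versions = self.cells.get((rowkey, column))
        return versions[-1][1] if versions else None

    def tablets(self, rows_per_tablet):
        """Split the sorted rowkeys into contiguous ranges (tablets)."""
        rowkeys = sorted({rk for rk, _ in self.cells})
        return [rowkeys[i:i + rows_per_tablet]
                for i in range(0, len(rowkeys), rows_per_tablet)]


table = ToyTable()
table.put("r2", "title", 1, "old")
table.put("r2", "title", 5, "new")
table.put("r2", "title", 3, "mid")
table.put("r1", "title", 1, "x")
table.put("r3", "title", 1, "y")
print(table.get("r2", "title"), table.tablets(2))
```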
  • 44. Meta Data
      Meta data is stored in the shared memory of Pleiades
  • 45. Real Time Processing
      ▪ Reasonable performance
          ◦ A few ms response time
      ▪ In-Memory DB
      ▪ Minor compaction
          ◦ When a memory table is full
      ▪ Major compaction
          ◦ Combine multiple tables for fast search operation
      ▪ Garbage collection
  • 46. MapReduce
  • 47. Client API & Shell Command
  • 48. Failover
      * The active master sets the NEPTUNE_MASTER lock in Pleiades
      * Pleiades releases the NEPTUNE_MASTER lock if the active master fails, and the slave master gets the lock
  • 49. Concluding Remarks
      ▪ Big Data
          ◦ Volume, Velocity, Variety
      ▪ Store/Manage Big Data
          ◦ Hadoop, NoSQL in cluster
      ▪ Parallel Programming
          ◦ MapReduce
      ▪ Analytics (Mining & Visualization)
          ◦ Mahout, R
  • 50. Future Plan: Infra
      ▪ A pilot system for Big Data (2012. 12)
          ◦ 1 management server
          ◦ 1 name node
              ▫ 2 CPUs * 6 cores, 24 GB RAM, 1 TB HDD & SSD
          ◦ 5 data nodes
              ▫ 2 CPUs * 6 cores, 24 GB RAM, 1 TB HDD & SSD
          ◦ Rack mount
          ◦ Gigabit switch hub
          ◦ Hadoop & CDH
  • 51. Future Plan: Research
      ▪ Developing machine learning, modeling, and optimization algorithms for mining/visualizing public data in TB
      ▪ Re-designing existing data mining algorithms using MapReduce
          ◦ Data mining algorithms: clustering, classification, probabilistic modeling, association rule mining, graph analysis, etc.
          ◦ Serial algorithms => parallel algorithms
  • 52. Reference
      ▪ G. Shim, MapReduce Algorithms for Big Data Analysis, VLDB 2012 Tutorial
      ▪ J. Schindler, I/O Characteristics of NoSQL Databases, VLDB 2012 Tutorial
      ▪ J. Leskovec, Mining Massive Datasets, Available: http://cs246.stanford.edu
      ▪ 김형준, Neptune: A Large-Scale Distributed Data Management System (Neptune: 대용량 분산 데이터 관리 시스템), NHN Tech. Report, 2008
      ▪ T. White, Hadoop: The Definitive Guide, O’Reilly 2012
  • 54. Outline
      ▪ Hadoop Installation
      ▪ Word Counting using MapReduce
  • 55. Software for Hadoop Installation
      VirtualBox 4.1.22 https://www.virtualbox.org/
      http://www.centos.org/
      http://www.oracle.com/technetwork/java/index.html
      http://apache.tt.co.kr/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz
  • 56. Hadoop Installation
      JDK, Hadoop, Configuration
      tar xvf hadoop-1.0.4-bin.tar.gz
  • 57. Three Modes for Hadoop Installation
      Standalone, pseudo-distributed, fully distributed
  • 58. Configuration
      ▪ hadoop-1.0.4/conf
      ▪ hadoop-env.sh
      ▪ core-site.xml
      ▪ hdfs-site.xml
      ▪ mapred-site.xml
  • 59. Pseudo-Distributed Mode
      ▪ core-site.xml
      ▪ mapred-site.xml
      ▪ hdfs-site.xml
  • 61. Word Count using MapReduce
  • 62. Word Count using MapReduce
  • 63. Word Count using MapReduce
  • 64. Word Count using MapReduce
  • 65. Text Data
      Input file, output file
  • 66. Patent ID Data
      Input file, output file