MAKING BIG DATA, SMALL
Using distributed systems for processing, analysing and managing
huge data sets


    Marcin Jedyk
    Software Professional’s Network, Cheshire Datasystems Ltd
WARM-UP QUESTIONS
 How many of you have heard about Big Data before?
 How many about NoSQL?
 Hadoop?
AGENDA
 Intro – motivation, goal and ‘not about…’
 What is Big Data?
 NoSQL and systems classification
 Hadoop & HDFS
 MapReduce & live demo
 HBase
AGENDA
 Pig
 Building Hadoop cluster
 Conclusions
 Q&A
MOTIVATION
 Data is everywhere – why not analyse it?
 With Hadoop and NoSQL systems, building distributed systems is easier than before
 Relying on software & cheap hardware rather than expensive hardware works better!
MOTIVATION
GOAL
 To explain basic ideas behind Big Data
 To present different approaches towards BD
 To show that Big Data systems are easy to build
 To show you where to start with such systems
WHAT IS IT NOT ABOUT?
 Not a detailed lecture on a single system
 Not about advanced techniques in Big Data
 Not only about technology – but also about its application
WHAT IS BIG DATA?
    Data characterised by 3 Vs:
      Volume
      Variety
      Velocity
    The interesting ones: variety & velocity
WHAT IS BIG DATA
 Data of high velocity: cannot store it? Process it on the fly!
 Data of high variety: doesn't fit into a relational schema? Don't use a schema, use NoSQL!
 Data which is impractical to process on a single server
NO-SQL
 Goes hand in hand with Big Data
 NoSQL – an umbrella term for non-relational databases and data stores
 It's not always possible to replace an RDBMS with NoSQL! (the opposite is also true)
NO-SQL
    NoSQL DBs are built around different principles
      Key-value stores: e.g. Redis, Riak
      Document stores: e.g. MongoDB – each record is a document; each entry has its own metadata (JSON-like, BSON)
      Table stores: e.g. HBase – data persisted in multiple columns (even millions), billions of rows and multiple versions of records
HADOOP
 Existed before the 'Big Data' buzzword emerged
 A simple idea – MapReduce
 A primary purpose – to crunch tera- and petabytes of data
 HDFS as the underlying distributed file system
HADOOP – ARCHITECTURE BY EXAMPLE
 Imagine you need to process 1TB of logs
 What would you need?
 A server!
HADOOP – ARCHITECTURE BY EXAMPLE
 But 1TB is quite a lot of data… we want results quicker!
 OK, what about a distributed environment?
HADOOP – ARCHITECTURE BY EXAMPLE
   So what about that Hadoop stuff?
     Each node can: store data & process it (DataNode
      & TaskTracker)
HADOOP – ARCHITECTURE BY EXAMPLE
   How about allocating jobs to slaves? We need a
    JobTracker!
HADOOP – ARCHITECTURE BY EXAMPLE
 What about HDFS – how are data blocks assembled into files?
 The NameNode does it.
HADOOP – ARCHITECTURE BY EXAMPLE
 NameNode – manages HDFS metadata, doesn't deal with file data directly (see the HDFS client sketch below)
 JobTracker – schedules, allocates and monitors job execution on slaves – TaskTrackers
 TaskTracker – runs MapReduce operations
 DataNode – stores HDFS blocks – default replication level for each block: 3
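
To make the NameNode/DataNode split concrete, here is a minimal sketch of reading a file through the HDFS Java API. The path is illustrative and the cluster address is assumed to come from the usual Hadoop configuration files; opening a file only asks the NameNode for metadata (block locations), while the bytes themselves are streamed from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // NameNode address comes from the configuration

    // open() contacts the NameNode for block locations; reads then stream from DataNodes
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/logs/access.log"))))) {  // illustrative path
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}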
HADOOP – LIMITATIONS
 DataNodes & TaskTrackers are fault tolerant
 NameNode & JobTracker are NOT! (workarounds exist for this problem)
 HDFS deals nicely with large files; it doesn't do well with billions of small files
MAP_REDUCE
 MapReduce – a parallelisation approach
 Two main stages:
      Map – do an actual bit of work, e.g. extract info
      Reduce – summarise, aggregate or filter outputs from the Map operation
    For each job, multiple Map and Reduce operations – each may run on a different node = parallelism
MAP_REDUCE – AN EXAMPLE
 Let's process 1TB of raw logs and extract traffic by host.
 After submitting a job, the JobTracker allocates tasks to slaves – the input is possibly divided into 64MB blocks = 16384 Map operations!
 Map – analyse logs and return results as a set of <key,value> pairs
 Reduce – merge the output of the Map operations (see the Java sketch below)
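
In code, the Map and Reduce stages for this job are small. Below is a minimal sketch using the Hadoop Java API, assuming log lines shaped like the mocked extract shown a little further on (IP, separator, bandwidth); the class names are illustrative, and the job driver (input/output paths, submission) is sketched after the worked example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TrafficByHost {

  // Map: for each log line ("IP – bandwidth") emit <IP, bandwidth>
  public static class TrafficMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Split on whitespace and take the first and last tokens,
      // so the exact separator character does not matter.
      String[] parts = line.toString().trim().split("\\s+");
      if (parts.length >= 2) {
        String ip = parts[0];
        long bytes = Long.parseLong(parts[parts.length - 1]);
        ctx.write(new Text(ip), new LongWritable(bytes));
      }
    }
  }

  // Reduce: sum all bandwidth values emitted for the same IP
  public static class TrafficReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text ip, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) {
        total += v.get();
      }
      ctx.write(ip, new LongWritable(total));
    }
  }
}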
MAP_REDUCE – AN EXAMPLE
   Take a look at a mocked log extract:
[IP – bandwidth]
10.0.0.1 – 1234
10.0.0.1 – 900
10.0.0.2 – 1230
10.0.0.3 – 999
MAP_REDUCE – AN EXAMPLE
 It's important to define the key – in this case, the IP. One Map operation returns:
<10.0.0.1;2134>
<10.0.0.2;1230>
<10.0.0.3;999>
 Now, assume another Map operation returned:
<10.0.0.1;1500>
<10.0.0.3;1000>
<10.0.0.4;500>
MAP_REDUCE – AN EXAMPLE
Now, Reduce will merge those results:
<10.0.0.1;3634>
<10.0.0.2;1230>
<10.0.0.3;1999>
<10.0.0.4;500>
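
For completeness, a driver that wires the mapper and reducer sketched above into a runnable job might look like this; the job name and the input/output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrafficByHostDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "traffic by host");
    job.setJarByClass(TrafficByHostDriver.class);

    job.setMapperClass(TrafficByHost.TrafficMapper.class);
    // The reducer can also be reused as a combiner, since summing is associative
    job.setCombinerClass(TrafficByHost.TrafficReducer.class);
    job.setReducerClass(TrafficByHost.TrafficReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path("/logs"));        // illustrative input dir
    FileOutputFormat.setOutputPath(job, new Path("/traffic"));   // must not exist yet

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}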
MAP_REDUCE
 Selecting a key is important
 It's possible to define a composite key, e.g. IP+date (see the sketch below)
 For more complex tasks, it's possible to chain MapReduce jobs
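
A composite key can be as simple as concatenating the fields into one Text key. Below is a minimal variant of the mapper above (same imports, intended to sit alongside it), assuming each log line also carries a date; the field layout and the '|' separator are purely illustrative.

  // Assumes lines like "10.0.0.1 2012-06-12 1234" (IP, date, bandwidth)
  public static class TrafficByDayMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().trim().split("\\s+");
      if (parts.length >= 3) {
        // Composite key "IP|date": Reduce now sums traffic per host per day
        ctx.write(new Text(parts[0] + "|" + parts[1]),
                  new LongWritable(Long.parseLong(parts[parts.length - 1])));
      }
    }
  }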
HBASE
 Another layer on top of Hadoop/HDFS
 A distributed data store
 Not a replacement for RDBMS!
 Can be used with MapReduce
 Good for unstructured data – no need to worry about the exact schema in advance
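
To give a feel for how HBase is used from code, here is a minimal sketch with the classic HBase Java client API (HTable/Put/Get, as found in the HBase versions contemporary with this talk). The table name "traffic", the column family "d" and the row key are illustrative, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTrafficExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    HTable table = new HTable(conf, "traffic");         // illustrative, pre-created table

    // Write: row key = IP, column family "d", qualifier "bytes"
    Put put = new Put(Bytes.toBytes("10.0.0.1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("bytes"), Bytes.toBytes(3634L));
    table.put(put);

    // Read the value back
    Get get = new Get(Bytes.toBytes("10.0.0.1"));
    Result result = table.get(get);
    long total = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("bytes")));
    System.out.println("10.0.0.1 -> " + total);

    table.close();
  }
}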
PIG – HBASE ENHANCEMENT
 HBase – missing a proper query language
 Pig – makes life easier for HBase users
 Translates queries into MapReduce jobs
 When working with Pig or HBase, forget what you know about SQL – it makes your life easier
BUILDING HADOOP CLUSTER
 Post-production servers are OK
 Don’t take ‘cheap hardware’ too literally
 Good connection between nodes is a must!
 >=1Gbps between nodes
 >=10Gbps between racks
 1 disk per CPU core
 More RAM, more caching!
FINAL CONCLUSIONS
 Hadoop and NoSQL-like DBs/data stores scale very well
 Hadoop is ideal for crunching huge data sets
 Does very well in production environments
 The cluster of slaves is fault tolerant; NameNode and JobTracker are not!
EXTERNAL RESOURCES
 Trending Topic – built on Wikipedia access logs:
  http://goo.gl/BWWO1
 Building web crawler with Hadoop:
  http://goo.gl/xPTlJ
 Analysing adverse drug events:
  http://goo.gl/HFXAx
 Moving average for large data sets:
  http://goo.gl/O4oml
EXTERNAL RESOURCES – USEFUL LINKS
http://www.slideshare.net/fullscreen/jpatanooga/la-hug-dec-2011-recommendation-talk/1
https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation+Guide
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://hstack.org/hbase-performance-testing/
http://www.theregister.co.uk/2012/06/12/hortonworks_data_platform_one/
http://wiki.apache.org/hadoop/MachineScaling
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
http://www.cloudera.com/resource-types/video/
http://hstack.org/why-were-using-hbase-part-2/
QUESTIONS?
