Big Data - JAX2011 (Pavlo Baron)

Pavlo Baron, http://www.pbit.org, @pavlobaron

So, you think you have Big Data?

Know your data, your scenarios, how to scale, the technology, and when to stop

Where does your data actually come from?

Do you have a million well structured records?

Or a couple of Gigabytes of storage?

Does your data get modified every now and then?

Do you look at your data once a month to create a management report?

Or is your data an unstructured chaos?

Do you get flooded by tera-/petabytes of data?

Or do you simply get bombed with data?

Does your data flow on streams at a very high rate from different locations?

Or do you have to read The Matrix?

Do you need to distribute your data over the whole world?

Or does your existence depend on (the quality of) your data?

Is it the storage that you need to focus on?

Or is preparing data your main concern?

Or do you have your customers spread all over the world?

Or do you have complex statistical analysis to do?

Or do you have to filter data as it comes?

Or is it necessary to visualize the data?

To scale for Big Data means to...

Chop it into bite-size, manageable pieces

Separate reading from writing

Separate archive from accessible data

Trash everything that only has to be analyzed in real time

Strive for spatial proximity to processors

Relax the startup procedure for new hardware

Consider hardware fallibility

Strive for spatial proximity to users

Consider network unreliability

Design with eventual actuality/consistency in mind

Design with Byzantine faults in mind

Treat latency as a tuning knob

Treat availability as a tuning knob

Design for a theoretically unlimited amount of data

Design for frequent structure changes

Design for the all-in-one mix

It’s no longer sufficient to just throw it at your Oracle DB and hope it works

You have no chance without science

To manage Big Data means to learn/know: algorithms/ADTs, computing systems, networking, operating systems, database systems, distributed systems

So, you know your data. You know your scenarios. You know the theory

Now pick the right tools for the job

I have thousands of log records per second. I want to store them immediately, but reliably, for later statistics. How would I do that?

Consider DHTs, P2P systems, distributed data stores etc.

In order to write fast, distribute to several nodes with a sloppy, non-durable write quorum
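To make the write quorum idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Node` class, the `sloppy_write` function and the parameter `w` are hypothetical names, the nodes are in-memory objects rather than networked servers, and writes are sent one after another instead of in parallel. "Non-durable" is modeled as acknowledging from memory without any disk flush, and "sloppy" as letting a healthy stand-in node accept the write (with a hint) when a preferred replica is down.

```python
class Node:
    """Hypothetical in-memory replica; a real node would sit behind a network."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.memory = {}  # acknowledged from memory, no disk flush (non-durable)
        self.hints = []   # writes held on behalf of an unreachable node

    def put(self, key, value, hint_for=None):
        if not self.up:
            return False
        self.memory[key] = value
        if hint_for is not None:
            self.hints.append((hint_for, key, value))
        return True


def sloppy_write(preferred, fallbacks, key, value, w):
    """Return True if at least w nodes acknowledged the write."""
    acks = 0
    spare = iter(fallbacks)
    for node in preferred:
        if node.put(key, value):
            acks += 1
            continue
        # Preferred replica is unreachable: hand the write to the next
        # healthy fallback and remember whom it was meant for.
        for stand_in in spare:
            if stand_in.put(key, value, hint_for=node.name):
                acks += 1
                break
    return acks >= w


a, b, c, d = Node("a"), Node("b", up=False), Node("c"), Node("d")
ok = sloppy_write([a, b, c], [d], "log-1", "GET /index", w=3)
no_quorum = sloppy_write([Node("x", up=False)], [], "log-2", "GET /", w=1)
```

Here the write still reaches its quorum although node b is down, because d steps in and keeps a hint so the record can be handed back to b later. A lower `w` trades safety for write latency.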

Build upon a system that implements consistent hashing. Don’t try home-made sharding as a replacement for distribution: you will fail when adding new nodes
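Why home-made sharding hurts when nodes are added can be shown with a small sketch. The `HashRing` class below is a hypothetical, simplified consistent hashing ring (MD5 as ring hash, 64 virtual nodes per server, no replication), not the implementation of any particular product. It compares how many keys change their owner when a fourth node joins, against naive `hash % number_of_nodes` sharding.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a point on the ring (a 128-bit integer via MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes  # virtual nodes per server, smooths the distribution
        self._points = []     # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"log-{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
naive_moved = [k for k in keys if ring_hash(k) % 3 != ring_hash(k) % 4]
print(len(moved), "keys moved on the ring,", len(naive_moved), "with modulo sharding")
```

On the ring, only keys that the new node takes over change their owner. With modulo sharding, going from three to four nodes reassigns about three quarters of all keys, which means moving most of the stored data.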

Try Riak. It’s derived from the Amazon Dynamo design. It implements consistent hashing, a gossip architecture, hinted handoff, vector clocks, Merkle trees, sloppy quorum etc.
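Of the concepts listed, vector clocks are perhaps the least intuitive. The following is a generic sketch of the idea, assuming clocks are plain dictionaries from node name to counter; the function names are hypothetical and this is not Riak's actual data format or API. It shows how two versions of a value can be recognized as "one descends from the other" or as concurrent, conflicting writes.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent writes, the application has to reconcile


v1 = increment({}, "node-a")       # first write, coordinated by node-a
v2 = increment(v1, "node-b")       # follow-up write via node-b
v3 = increment(v1, "node-c")       # concurrent write via node-c, based on v1

print(compare(v2, v1))  # a is newer
print(compare(v2, v3))  # conflict
```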

I need to analyze these records in real-time for patterns and send alerts when any of them match. How would I do that?

Consider CEP – complex event processing with an EPL – event processing language

Choose a sliding time window just big enough to recognize causality and fire events on thresholds, deviations, anomalies etc.
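A sliding time window with a threshold can be sketched in a few lines of plain Python. This is not EPL and not how a CEP engine is implemented; the class name `SlidingWindowAlert` and its parameters are hypothetical, and the sketch assumes that event timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowAlert:
    """Fire when at least `threshold` events fall into the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()  # timestamps currently inside the window

    def on_event(self, timestamp):
        """Record one event; return True if the threshold is reached."""
        self._times.append(timestamp)
        # Evict everything that has slid out of the window.
        while self._times and self._times[0] <= timestamp - self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold


detector = SlidingWindowAlert(window_seconds=10, threshold=3)
alerts = [detector.on_event(t) for t in [0, 4, 8, 30, 31, 32, 50]]
print(alerts)  # [False, False, True, False, False, True, False]
```

The window size is the knob: too small and related events never meet inside it, too large and the detector holds more state than needed and reacts to stale events.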

Try Esper. It implements CEP/EPL

I collect my data at different locations all over the world, but want to do statistical analysis in my headquarters or at one other location. How would I do that?

Aggregate your data and push it, e.g. once a day, out to the cloud. That’s a sort of replication, if you like

Choose a cloud-based data store which can store big objects. The store should provide consistency characteristics similar to those of your local data store

Try AWS (S3) or Rackspace (OpenStack/Swift) or a private cloud. They are either directly Dynamo-based or implement similar concepts

That’s a lot of data and distribution. I need to quickly push it from a location into the cloud while data keeps coming in. How would I do that?

Use MapReduce to distribute the aggregation job to a group of nodes in order to quickly get the overall aggregation and cloud storage done

Map, sort, combine and reduce to whatever representation you need
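The phases can be illustrated with a single-process sketch in plain Python. It does not use Hadoop or any distributed runtime; the record format `region,candidate` and all function names are assumptions made up for this example. Each input split is mapped and combined locally, then the combined pairs are sorted by key and reduced to totals.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def map_phase(record):
    """Turn one raw record (hypothetical format 'region,candidate') into pairs."""
    _region, candidate = record.split(",")
    yield candidate, 1


def combine(pairs):
    """Pre-aggregate one split's pairs locally to cut traffic before the sort."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_phase(key, values):
    return key, sum(values)


def run_job(splits):
    combined = []
    for split in splits:  # on a cluster, each split would run on its own node
        pairs = [pair for record in split for pair in map_phase(record)]
        combined.extend(combine(pairs))
    combined.sort(key=itemgetter(0))  # bring equal keys next to each other
    return dict(
        reduce_phase(key, [value for _, value in group])
        for key, group in groupby(combined, key=itemgetter(0))
    )


splits = [
    ["us,alice", "us,bob", "us,alice"],
    ["eur,bob", "eur,bob", "afr,alice"],
]
totals = run_job(splits)
print(totals)  # {'alice': 3, 'bob': 3}
```

The combine step is optional but worth it when the reduce function is associative, as a sum is: each split ships one pair per key instead of one pair per record.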

Keep MapReduce splitting, jobs and intermediate storage separate from the local store. That keeps them independent, so jobs can read snapshots of the local store while new data is still being written

Try Hadoop. It implements MapReduce with its own file system (HDFS), distribution etc. It is highly extensible

I need to do some statistical analysis and visualize my data. How would I do that?

Choose a general purpose platform for statistical computing and graphics

Try R. It allows statistical analysis of whatever type of data, and plotting the results graphically. It’s highly extensible

And these are only some of the possible Q&As. There are more areas, such as NoSQL, content preparation for CDN, data mining etc., which we didn’t consider

The experiment – live demo
Source code will be available on http://github.com/pavlobaron
A detailed description will be available on http://archi-jab-ture.blogspot.com

[Diagram: Situation. Data centers (US, EUR, AFR); votes; reports]

[Diagram: SoapUI simulation, HTTP server, Esper (alert), Riak, object storage, Hadoop, R]

We store votes as they come, with a sloppy write quorum.
We store on several nodes in a regional cluster.
We match patterns on a stream, not on saved data.
We push aggregated day data from regions to the cloud using distributed MapReduce.
We use scalable, distributed components with HA options.
Etc.

Does it scale?

Most images originate from istockphoto.com, except for a few taken from Wikipedia and product pages
