SlideShare ist ein Scribd-Unternehmen logo
1 von 81
Big
     Big
    Big
   Big
 Big
data developer
Pavlo Baron
pavlo.baron@codecentric.de
              @pavlobaron
Disclaimer:

I will not flame-war,
rant, bash or do
anything similar around
concrete tools.

I'm tired of doing
that
Hey, dude.

I'm a big data
developer. We have
enormous data.

I continuously
read articles on
highscalability.com
Big data
is easy.

It's like normal data,
but, well,
even bigger.
Aha. So this
is not big, right?
That's big, huh?
Peta bytes of data every hour
on different continents, with
complex relations and with
the need to analyze them
in almost real time for anomalies
and to visualize them for
your management
You can easily call this big data.
Everything below this you can
call Mickey Mouse data
Good news is: you can
intentionally grow – collect
more data. And you should!
np, dude!

Data is data.

Same principles
Really?

Let's consider storage, ok?
np, dude!

Storage is just a
database
Really?

Storage capacity of one
single box, no matter how big
it is, is limited
np, dude!

You're talking
about SQL, right?

SQL is
boring grandpa
stuff and
doesn't scale
Oh.

When you expect big data, you
need to scale very far and thus
build on distribution and
combine theoretically
unlimited amount of machines to
one single distributed storage
There is no way around.

Except you invent the BlackHoleDB
np, dude!

NoSQL scales and is
cool.

They achieve this
through sharding,
ya know?

Sharding hides
this distribution
stuff
Hem.

Building upon distribution is
much harder than anything you've
seen or done before
Except you fed a crowd with
7 breads and walked upon the
water
np, dude!

From three of
CAP, I'll just
pick two.

As easy as this
The only thing that is absolutely
certain about distributed systems is
that parts of them will fail and you
will have no idea where and what
the hell is going on
So your P must be a given in a
distributed system. And you want to
play with C vs. A, not just take black
or white
np, dude!

Sharding works
seamlessly. I
don't need to take
care of anything
Seriously?

For example, one of the hardest
challenges with big data
is to distribute/shard parts over
several machines still having fast
traversals and reads, thus
keeping related data together.

Valid for graph and any other data
store, also NoSQL, kind of
Another hard challenge with sharding
is to avoid naive hashing.

Naive hashing would make you depend
on the number of nodes and would not
allow you to easily add or remove
nodes to/from the system
And still, the trade-off between data
locality, consistency, availability,
read/write/search speed, latency etc.
is hard
np, dude!

NoSQL would
write asynchronously
and do map/reduce
to find data
Of course.

You will love eventual
consistency, especially when you need
a transaction around a complex
money transfer
np, dude!

I don't have money
to transfer. But I
need to store lots
of data.

I can throw any
amount of it at
NoSQL, and it
will just work
Really?

So you'd just throw
something into your database
and hope it works?
What if you throw and
miss the target?
Data locality, redundancy,
consistent hashing and eventual
consistency combined with use
case driven storage design
are key principles in succeeding
with a huge distributed data storage.

That's big data development
How about data provisioning?
np, dude!

It's database
being the bottle
neck, not my
web servers
When you have thousands or millions
parallel requests per second, begging
for data, the first mile will (also) quickly
become the bottle neck.

Requests will get queued and discarded
as soon as your server doesn't bring
data fast enough through the pipe
np, dude!

I'll get me some
bad-ass sexy
hardware
I bet you will.

But under high load, your hardware
will more or less quickly start to crack
You'll burn your hard disks, boards and
cards. And wires. And you'll heat up
to a maximum
It's not about sexy hardware, but
about being able to quickly replace it.

Ideally while the system keeps running
But anyway.

To keep the first mile scalable and
fast, would lead to some expensive
network infrastructure.

You need to get the maximum out
of your servers in order to reduce
their number
np, dude!

I will use an event
driven C10K
problem solving
awesome web
server. Or I'll write
one on my own
Maybe.

But when your users are coming from
all over the world, it won't help you
much since the network latency from
them to your server will kill them
You would have to go for a CDN one
day, statically pre-computing content.

You would use their infrastructure
and reduce the number of hits on
your own servers to a minimum
np, dude!

I'll push my whole
platform out to the
cloud. It's even
more flexible and
scales like hell
Well.

You cannot reliably predict on which
physical machine and actually how
close to the data you program will run.

Whenever virtual machines or
storage fragments get moved, your
world stops
You can easily force data locality and
shorter stop-the-world-phases
by paying higher bills
Data locality, geographic spatiality,
dedicated virtualization and content
pre-computability combined with use
case driven cloudification
are key principles in succeeding
with provisioning of huge data amounts.

That's big data development
Let's talk about processing, ok?
np, dude!

All easily done
by a map/reduce
tool
Almost agreed.

map/reduce has two basic phases:
even “map” and “reduce”
The slowest of those two
is definitely “split”.

Moving data from one huge pile to
another before map/reduce is
damn expensive
np, dude!

I'll write my data
straight to the
storage of my
map/reduce tool.

It will then tear
It can.

But what if you need to search during
the map phase – full-text, meta?
np, dude!

I'll use a cool
indexing search
engine or library.

It can find my data
in a snap
Would it?

A very hard challenge is to partition
the index and to couple its related
parts to the corresponding data.

With data locality of course, having
index pieces on the related machines
Data and index locality and direct filling
of data pots as data flies by combined
with use case driven technology
usage are key principles in succeeding
with processing of huge data amounts.

That's big data development
So, how about analytics, dude?
np, dude!

It's classic use case
for map/reduce.

I can do this
afterwards and on
the fly
Are you sure?

So, you expect one tool to do both,
real-time and post fact analytics?
What did you smoke last night?
You don't want to believe
in map/reduce in (near) real-time,
don't you?
np, dude!

I'll get me some
rocket fast
hardware
I'm sure you will. But:

You cannot predict and fix the
map/reduce time.

You cannot ensure
the completeness of data.

You cannot
guarantee causality knowledge
Distribution has got your soul
If you need to predict better,
to be able to know about data/event
causality, to be fast you need to CEP
data streams as data flies by.

There is no (simple, fast) way around
But the most important thing is:

None of the BI tools you know will
adequately support your NoSQL
data store, so you're all alone
in the world of proprietary
immature tool combinations.

The world of pain.
np, dude!

My map/reduce
tool can even hide
math from me, so
I can concentrate
on doing stuff
There is no point in fearing
math/statistics. You just need it
Separation of immediate and
post fact analytics and CEP of
data streams as data flies by combined
with use case driven technology
usage and statistical knowledge
are key principles in succeeding
with analytics of huge data amounts.

That's big data development
Oh, we forgot visualization
np, dude!

I just have no
idea about it
Me neither.

I just know that you can't visualize
huge data amounts using classic
spreadsheets. There are better ways,
tools, ideas to do this – find them

That's big data development
hem, dude...

You're a smart-ass.

Is it that you want
to say?
Almost.

In one of my humble moments I would
suggest you to do the following:
Stop thinking you gain adequately deep
knowledge through reading half-baked
blog posts. Get yourself some of those:
Statistics, Visualization
                                Distribution
                                 Network
                           Different languages
                               Tools, chains
Know and use full stack




                                Data stores
                            Different platforms
                                    OS
                                  Storage
                                 Machine
                                Algorithms
                                   Math
Know your point of pain.

You must be Twitter, Facebook or
Google to have them all same time.

If you're none of them, you can have
one or two. Or even none.

Go for them with the right chain tool
First and the most important tool in the
chain is your brain
Thank you
Most images originate from
          istockphoto.com

      except few ones taken
from Wikipedia or Flickr (CC)
         and product pages
 or generated through public
           online generators

Weitere ähnliche Inhalte

Andere mochten auch

Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Pavlo Baron
 
Programa (extens) PSC Premià de Mar
Programa (extens) PSC Premià de MarPrograma (extens) PSC Premià de Mar
Programa (extens) PSC Premià de Martxuscruz
 
Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)txuscruz
 
October 2011 website powerpoint
October 2011 website powerpointOctober 2011 website powerpoint
October 2011 website powerpointharfordcountymd
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)Pavlo Baron
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Pavlo Baron
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPPavlo Baron
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)Pavlo Baron
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)Pavlo Baron
 
Future Things/Robotic Products
Future Things/Robotic ProductsFuture Things/Robotic Products
Future Things/Robotic ProductsSusie Herbstritt
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)Pavlo Baron
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Pavlo Baron
 

Andere mochten auch (14)

Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)Big Data - JAX2011 (Pavlo Baron)
Big Data - JAX2011 (Pavlo Baron)
 
Programa (extens) PSC Premià de Mar
Programa (extens) PSC Premià de MarPrograma (extens) PSC Premià de Mar
Programa (extens) PSC Premià de Mar
 
Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)
 
October 2011 website powerpoint
October 2011 website powerpointOctober 2011 website powerpoint
October 2011 website powerpoint
 
Asma
AsmaAsma
Asma
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTP
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)
 
Future Things/Robotic Products
Future Things/Robotic ProductsFuture Things/Robotic Products
Future Things/Robotic Products
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)
 
Kokkola
KokkolaKokkola
Kokkola
 

Ähnlich wie The Big Data Developer (@pavlobaron)

Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10benoitg
 
DataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesDataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesMax De Marzi
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0George Ang
 
No Sql Bigdata Drone
No Sql Bigdata DroneNo Sql Bigdata Drone
No Sql Bigdata DroneDony Riyanto
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
2951085 dzone-2016guidetobigdata
2951085 dzone-2016guidetobigdata2951085 dzone-2016guidetobigdata
2951085 dzone-2016guidetobigdatabalu kvm
 
Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?Anna Kuhn
 
Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Pavlo Baron
 
How to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't SuckHow to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't SuckDiana Tkachenko
 
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakiStreaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakijavier ramirez
 
Devry Acct 571 Complete Course
Devry Acct 571 Complete CourseDevry Acct 571 Complete Course
Devry Acct 571 Complete CourseStephanie Roberts
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
Zen and the Art of ILS Migration--KUDOSCon 2011
Zen and the Art of ILS Migration--KUDOSCon 2011Zen and the Art of ILS Migration--KUDOSCon 2011
Zen and the Art of ILS Migration--KUDOSCon 2011D Ruth Bavousett
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 

Ähnlich wie The Big Data Developer (@pavlobaron) (20)

Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 
DataDay 2023 Presentation - Notes
DataDay 2023 Presentation - NotesDataDay 2023 Presentation - Notes
DataDay 2023 Presentation - Notes
 
Big Data
Big DataBig Data
Big Data
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Art Of Distributed P0
Art Of Distributed P0Art Of Distributed P0
Art Of Distributed P0
 
No Sql Bigdata Drone
No Sql Bigdata DroneNo Sql Bigdata Drone
No Sql Bigdata Drone
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
2951085 dzone-2016guidetobigdata
2951085 dzone-2016guidetobigdata2951085 dzone-2016guidetobigdata
2951085 dzone-2016guidetobigdata
 
Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?Big Data vs. Small Data...what's the difference?
Big Data vs. Small Data...what's the difference?
 
Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)
 
How to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't SuckHow to Build Tools for Data Scientists That Don't Suck
How to Build Tools for Data Scientists That Don't Suck
 
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowakiStreaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Devry Acct 571 Complete Course
Devry Acct 571 Complete CourseDevry Acct 571 Complete Course
Devry Acct 571 Complete Course
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 
Zen and the Art of ILS Migration--KUDOSCon 2011
Zen and the Art of ILS Migration--KUDOSCon 2011Zen and the Art of ILS Migration--KUDOSCon 2011
Zen and the Art of ILS Migration--KUDOSCon 2011
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 

Mehr von Pavlo Baron

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve itPavlo Baron
 
Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Pavlo Baron
 
Qcon2015 living database
Qcon2015 living databaseQcon2015 living database
Qcon2015 living databasePavlo Baron
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Pavlo Baron
 
The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)Pavlo Baron
 
data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)Pavlo Baron
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Pavlo Baron
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)Pavlo Baron
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Pavlo Baron
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Pavlo Baron
 
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Pavlo Baron
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)Pavlo Baron
 
NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)Pavlo Baron
 
The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)Pavlo Baron
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Pavlo Baron
 

Mehr von Pavlo Baron (15)

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it
 
Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)
 
Qcon2015 living database
Qcon2015 living databaseQcon2015 living database
Qcon2015 living database
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)
 
The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)
 
data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)
 
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)
 
NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)
 
The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)
 

Kürzlich hochgeladen

UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 

Kürzlich hochgeladen (20)

UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 

The Big Data Developer (@pavlobaron)

  • 1. Big Big Big Big Big data developer
  • 3. Disclaimer: I will not flame-war, rant, bash or do anything similar around concrete tools. I'm tired of doing that
  • 4. Hey, dude. I'm a big data developer. We have enormous data. I continuously read articles on highscalability.com
  • 5. Big data is easy. It's like normal data, but, well, even bigger.
  • 6. Aha. So this is not big, right?
  • 8. Peta bytes of data every hour on different continents, with complex relations and with the need to analyze them in almost real time for anomalies and to visualize them for your management
  • 9. You can easily call this big data. Everything below this you can call Mickey Mouse data
  • 10. Good news is: you can intentionally grow – collect more data. And you should!
  • 11. np, dude! Data is data. Same principles
  • 13. np, dude! Storage is just a database
  • 14. Really? Storage capacity of one single box, no matter how big it is, is limited
  • 15. np, dude! You're talking about SQL, right? SQL is boring grandpa stuff and doesn't scale
  • 16. Oh. When you expect big data, you need to scale very far and thus build on distribution and combine theoretically unlimited amount of machines to one single distributed storage
  • 17. There is no way around. Except you invent the BlackHoleDB
  • 18. np, dude! NoSQL scales and is cool. They achieve this through sharding, ya know? Sharding hides this distribution stuff
  • 19. Hem. Building upon distribution is much harder than anything you've seen or done before
  • 20. Except you fed a crowd with 7 breads and walked upon the water
  • 21. np, dude! From three of CAP, I'll just pick two. As easy as this
  • 22. The only thing that is absolutely certain about distributed systems is that parts of them will fail and you will have no idea where and what the hell is going on
  • 23. So your P must be a given in a distributed system. And you want to play with C vs. A, not just take black or white
  • 24. np, dude! Sharding works seamlessly. I don't need to take care of anything
  • 25. Seriously? For example, one of the hardest challenges with big data is to distribute/shard parts over several machines still having fast traversals and reads, thus keeping related data together. Valid for graph and any other data store, also NoSQL, kind of
  • 26. Another hard challenge with sharding is to avoid naive hashing. Naive hashing would make you depend on the number of nodes and would not allow you to easily add or remove nodes to/from the system
  • 27. And still, the trade-off between data locality, consistency, availability, read/write/search speed, latency etc. is hard
  • 28. np, dude! NoSQL would write asynchronously and do map/reduce to find data
  • 29. Of course. You will love eventual consistency, especially when you need a transaction around a complex money transfer
  • 30. np, dude! I don't have money to transfer. But I need to store lots of data. I can throw any amount of it at NoSQL, and it will just work
  • 31. Really? So you'd just throw something into your database and hope it works?
  • 32. What if you throw and miss the target?
  • 33. Data locality, redundancy, consistent hashing and eventual consistency combined with use case driven storage design are key principles in succeeding with a huge distributed data storage. That's big data development
  • 34. How about data provisioning?
  • 35. np, dude! It's database being the bottle neck, not my web servers
  • 36. When you have thousands or millions parallel requests per second, begging for data, the first mile will (also) quickly become the bottle neck. Requests will get queued and discarded as soon as your server doesn't bring data fast enough through the pipe
  • 37. np, dude! I'll get me some bad-ass sexy hardware
  • 38. I bet you will. But under high load, your hardware will more or less quickly start to crack
  • 39. You'll burn your hard disks, boards and cards. And wires. And you'll heat up to a maximum
  • 40. It's not about sexy hardware, but about being able to quickly replace it. Ideally while the system keeps running
  • 41. But anyway. To keep the first mile scalable and fast, would lead to some expensive network infrastructure. You need to get the maximum out of your servers in order to reduce their number
  • 42. np, dude! I will use an event driven C10K problem solving awesome web server. Or I'll write one on my own
  • 43. Maybe. But when your users are coming from all over the world, it won't help you much since the network latency from them to your server will kill them
  • 44. You would have to go for a CDN one day, statically pre-computing content. You would use their infrastructure and reduce the number of hits on your own servers to a minimum
  • 45. np, dude! I'll push my whole platform out to the cloud. It's even more flexible and scales like hell
  • 46. Well. You cannot reliably predict on which physical machine and actually how close to the data you program will run. Whenever virtual machines or storage fragments get moved, your world stops
  • 47. You can easily force data locality and shorter stop-the-world-phases by paying higher bills
  • 48. Data locality, geographic spatiality, dedicated virtualization and content pre-computability combined with use case driven cloudification are key principles in succeeding with provisioning of huge data amounts. That's big data development
  • 49. Let's talk about processing, ok?
  • 50. np, dude! All easily done by a map/reduce tool
  • 51. Almost agreed. map/reduce has two basic phases: even “map” and “reduce”
  • 52. The slowest of those two is definitely “split”. Moving data from one huge pile to another before map/reduce is damn expensive
  • 53. np, dude! I'll write my data straight to the storage of my map/reduce tool. It will then tear
  • 54. It can. But what if you need to search during the map phase – full-text, meta?
  • 55. np, dude! I'll use a cool indexing search engine or library. It can find my data in a snap
  • 56. Would it? A very hard challenge is to partition the index and to couple its related parts to the corresponding data. With data locality of course, having index pieces on the related machines
  • 57. Data and index locality and direct filling of data pots as data flies by combined with use case driven technology usage are key principles in succeeding with processing of huge data amounts. That's big data development
  • 58. So, how about analytics, dude?
  • 59. np, dude! It's classic use case for map/reduce. I can do this afterwards and on the fly
  • 60. Are you sure? So, you expect one tool to do both, real-time and post fact analytics?
  • 61. What did you smoke last night?
  • 62. You don't want to believe in map/reduce in (near) real-time, don't you?
  • 63. np, dude! I'll get me some rocket fast hardware
  • 64. I'm sure you will. But: You cannot predict and fix the map/reduce time. You cannot ensure the completeness of data. You cannot guarantee causality knowledge
  • 65. Distribution has got your soul
  • 66. If you need to predict better, to be able to know about data/event causality, to be fast you need to CEP data streams as data flies by. There is no (simple, fast) way around
  • 67. But the most important thing is: None of the BI tools you know will adequately support your NoSQL data store, so you're all alone in the world of proprietary immature tool combinations. The world of pain.
  • 68. np, dude! My map/reduce tool can even hide math from me, so I can concentrate on doing stuff
  • 69. There is no point in fearing math/statistics. You just need it
  • 70. Separation of immediate and post fact analytics and CEP of data streams as data flies by combined with use case driven technology usage and statistical knowledge are key principles in succeeding with analytics of huge data amounts. That's big data development
  • 71. Oh, we forgot visualization
  • 72. np, dude! I just have no idea about it
  • 73. Me neither. I just know that you can't visualize huge data amounts using classic spreadsheets. There are better ways, tools, ideas to do this – find them That's big data development
  • 74. hem, dude... You're a smart-ass. Is it that you want to say?
  • 75. Almost. In one of my humble moments I would suggest you to do the following:
  • 76. Stop thinking you gain adequately deep knowledge through reading half-baked blog posts. Get yourself some of those:
  • 77. Statistics, Visualization Distribution Network Different languages Tools, chains Know and use full stack Data stores Different platforms OS Storage Machine Algorithms Math
  • 78. Know your point of pain. You must be Twitter, Facebook or Google to have them all same time. If you're none of them, you can have one or two. Or even none. Go for them with the right chain tool
  • 79. First and the most important tool in the chain is your brain
  • 81. Most images originate from istockphoto.com except few ones taken from Wikipedia or Flickr (CC) and product pages or generated through public online generators