SlideShare ist ein Scribd-Unternehmen logo
1 von 95
Big Data Pavlo Baron
Pavlo Baron http://www.pbit.org [email_address] @pavlobaron
So, you  think  you’re having Big Data?
Don’t  think
Know
Know your data
Know your data your scenarios
Know your data your scenarios how to scale
Know your data your scenarios how to scale the technology
Know your data your scenarios how to scale the technology when to stop
Know your data your scenarios how to scale the technology when to stop
Where  does your data actually come  from ?
Do you have a million well structured records?
Or a couple of  Gigabytes  of storage?
Does your data get modified every now and then ?
Do you look at your data  once a month  to create a management report?
Or is your data an  unstructured  chaos?
Do you get flooded by  tera-/petabytes  of data?
Or do you simply get bombed  with data?
Does your data flow on  streams  at a very  high rate  from different locations?
Or do you have to read The Matrix ?
Do you need to distribute your data over the  whole  world
Or does your  existence depend on (the quality of) your data?
Know your data your scenarios how to scale the technology when to stop
Is it the  storage  that you need to focus on?
Or are you more preparing  data?
Or do you have your customers  spread all  over the world ?
Or do you have complex  statistical analysis  to do?
Or do you have to  filter  data as it comes?
Or is it necessary to  visualize  the data?
Know your data your scenarios how to scale the technology when to stop
To  scale  for Big Data means to...
Chop  in smaller pieces
Chop in  bite-size , manageable pieces
Separate  reading from writing
Minimize  hard relations
Separate archive  from accessible data
Trash  everything that has only to be  analyzed in  real-time
Parallelize  and distribute
Strive after spatial proximity to processors
Utilize  commodity hardware
Relax  new hardware startup procedure
Consider  hardware fallibility
Strive after spatial proximity  to users
Consider  network unreliability
Design with  eventual actuality/consistency  in mind
Design with  Byzantine faults in mind
Consider  latency  an adjustment screw
Consider  availability  an adjustment screw
Be prepared for disaster
Utilize the fog/clouds
Design  for theoretically unlimited amount of data
Design  for frequent structure changes
Design  for the all-in-one mix
Know your data your scenarios how to scale the technology when to stop
It’s  not sufficient  anymore just to  throw  it  at  your Oracle  DB and to  hope  it works
You have no chance without  science
To  manage  Big Data means to learn/know
To  manage  Big Data means to learn/know algorithms/ADTs
To  manage  Big Data means to learn/know algorithms/ADTs computing systems
To  manage  Big Data means to learn/know algorithms/ADTs computing systems networking
To  manage  Big Data means to learn/know algorithms/ADTs computing systems networking operating systems
To  manage  Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems
To  manage  Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems distributed systems
So, you know your  data . You know your  scenarios . You know the  theory
Now pick the right tools  for the job
I have  thousands of log  records per second . I want to  store  them immediately , but reliably  for later statistics. How would I do that?
Consider  DHTs ,  P2P  systems, distributed data stores  etc.
In order to  write fast , distribute to several nodes with  sloppy non-durable  write quorum
Build upon a system implementing  consistent hashing . Don’t try home-made sharding as distribution replacement - you will fail adding new nodes
Try  Riak . It’s derived from Amazon Dynamo. It implements consistent hashing, gossip architecture, hinted handoff, vector clocks, merkle trees, sloppy quorum etc.
I need to  analyze these records in  real-time for  patterns  and send alerts  when any of them match. How would I do that?
Consider  CEP  – complex event processing with an EPL – event processing language
Choose a sliding time window just big enough to  recognize causality  and  fire events  on thresholds, deviations, anomalies etc.
Try  Esper . It implements CEP/EPL
I collect my data at different locations  all  over the world , but want to do  statistical analysis  in my headquarters or  at one  other  location . How would I do that?
Aggregate  your data and  push  it e.g. once a day out  to the cloud . That’s a sort of replication if you like
Choose a  cloud based  data store which can store  big objects . The store should provide similar consistency characteristics as  your  local data store
Try  AWS (S3)  or Rackspace   (OpenStack/Swift) or a  private cloud . They are either directly Dynamo based or implement similar concepts
That’s a lot of data and distribution. I need to quickly push  it from a location into the cloud while  data keeps coming in . How would I do that?
Use  MapReduce  to  distribute the aggregation job  to a  group of nodes  in order to  quickly   get  the overall aggregation and cloud storage  done
Map ,  sort ,  combine  and  reduce to whatever representation you need
Separate  MapReduce splitting, jobs and intermediate storage  from  the  local store to keep them independent and thus to read local store snapshots while still writing the new data
Try  Hadoop . It implements MapReduce with an own file system (HDFS), distribution etc. It is highly extensible
I need to do some  statistical analysis  and  visualize my data. How would I do that?
Choose a  general purpose platform  for statistical computing and graphics
Try  R . It allows statistical analysis of whatever type of data and its graphical plotting. It’s highly extensible
And these are only some of the possible Q&As. There are more areas such as  NoSQL , content preparation for  CDN , data mining  etc. which we didn’t consider
Know your data your scenarios how to scale the technology when to stop
The experiment – live demo Source code will be available on http://github.com/pavlobaron Detail description will be available on http://archi-jab-ture.blogspot.com
Situation Data Center (US) Data Center (EUR) Data Center (AFR) Votes Votes Reports
SoapUI Simulation HTTP server Esper Alert Riak Object Storage Hadoop R
We store votes as they come with sloppy write quorum. We store on several nodes in a regional cluster. We match patterns on a stream, not on saved data. We push aggregated day data from regions to the cloud using distributed MapReduce. We use scalable, distributed components with HA options. Etc. Does it  scale ?
Thank you
Most images originate from istockphoto.com except few ones taken from Wikipedia and product pages

Weitere ähnliche Inhalte

Was ist angesagt?

High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionSri Ambati
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2Oodsc
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudDatabricks
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016MLconf
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016MLconf
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningSri Ambati
 

Was ist angesagt? (20)

High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
New Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 EditionNew Developments in H2O: April 2017 Edition
New Developments in H2O: April 2017 Edition
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
 
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the CloudLeveraging NLP and Deep Learning for Document Recommendations in the Cloud
Leveraging NLP and Deep Learning for Document Recommendations in the Cloud
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Introduction to GPUs for Machine Learning
Introduction to GPUs for Machine LearningIntroduction to GPUs for Machine Learning
Introduction to GPUs for Machine Learning
 

Andere mochten auch

From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)Pavlo Baron
 
October 2011 website powerpoint
October 2011 website powerpointOctober 2011 website powerpoint
October 2011 website powerpointharfordcountymd
 
Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Pavlo Baron
 
Assistech: An AAC Device for Autistic Children
Assistech: An AAC Device for Autistic ChildrenAssistech: An AAC Device for Autistic Children
Assistech: An AAC Device for Autistic ChildrenSusie Herbstritt
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)Pavlo Baron
 
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Pavlo Baron
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Pavlo Baron
 
Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)txuscruz
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)Pavlo Baron
 
Q1_networks
Q1_networksQ1_networks
Q1_networksginiskid
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPPavlo Baron
 
Programa (extens) PSC Premià de Mar
Programa (extens) PSC Premià de MarPrograma (extens) PSC Premià de Mar
Programa (extens) PSC Premià de Martxuscruz
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)Pavlo Baron
 
Future Things/Robotic Products
Future Things/Robotic ProductsFuture Things/Robotic Products
Future Things/Robotic ProductsSusie Herbstritt
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)Pavlo Baron
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Pavlo Baron
 

Andere mochten auch (19)

f6k & l10n
f6k & l10nf6k & l10n
f6k & l10n
 
From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)From Hand To Mouth (@pavlobaron)
From Hand To Mouth (@pavlobaron)
 
Asma
AsmaAsma
Asma
 
October 2011 website powerpoint
October 2011 website powerpointOctober 2011 website powerpoint
October 2011 website powerpoint
 
Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)Near realtime analytics - technology choice (@pavlobaron)
Near realtime analytics - technology choice (@pavlobaron)
 
Assistech: An AAC Device for Autistic Children
Assistech: An AAC Device for Autistic ChildrenAssistech: An AAC Device for Autistic Children
Assistech: An AAC Device for Autistic Children
 
a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)a Tech guy’s take on Big Data business cases (@pavlobaron)
a Tech guy’s take on Big Data business cases (@pavlobaron)
 
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron)
 
Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)Set this Big Data technology zoo in order (@pavlobaron)
Set this Big Data technology zoo in order (@pavlobaron)
 
Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)Programa del PSC Premià de Mar (versió extensa)
Programa del PSC Premià de Mar (versió extensa)
 
The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)The Big Data Developer (@pavlobaron)
The Big Data Developer (@pavlobaron)
 
Q1_networks
Q1_networksQ1_networks
Q1_networks
 
JUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTPJUGS June'11 - Erlang/OTP
JUGS June'11 - Erlang/OTP
 
Programa (extens) PSC Premià de Mar
Programa (extens) PSC Premià de MarPrograma (extens) PSC Premià de Mar
Programa (extens) PSC Premià de Mar
 
20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)20 reasons why we don't need architects (@pavlobaron)
20 reasons why we don't need architects (@pavlobaron)
 
Future Things/Robotic Products
Future Things/Robotic ProductsFuture Things/Robotic Products
Future Things/Robotic Products
 
BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)BigData & CDN - OOP2011 (Pavlo Baron)
BigData & CDN - OOP2011 (Pavlo Baron)
 
Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)Let It Crash (@pavlobaron)
Let It Crash (@pavlobaron)
 
Kokkola
KokkolaKokkola
Kokkola
 

Ähnlich wie Big Data - JAX2011 (Pavlo Baron)

Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Pavlo Baron
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data scienceSusan Ibach
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopArchana Gopinath
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersData Science Leuven
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeSysfore Technologies
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopStefano Paluello
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseStavros Papadopoulos
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkMukesh Singh
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 

Ähnlich wie Big Data - JAX2011 (Pavlo Baron) (20)

Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)Harry Potter and Enormous Data (Pavlo Baron)
Harry Potter and Enormous Data (Pavlo Baron)
 
So your boss says you need to learn data science
So your boss says you need to learn data scienceSo your boss says you need to learn data science
So your boss says you need to learn data science
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big Data
Big DataBig Data
Big Data
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Lighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris PeetersLighthouse - an open-source library to build data lakes - Kris Peeters
Lighthouse - an open-source library to build data lakes - Kris Peeters
 
Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
Final deck
Final deckFinal deck
Final deck
 
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal DatabaseDebunking "Purpose-Built Data Systems:": Enter the Universal Database
Debunking "Purpose-Built Data Systems:": Enter the Universal Database
 
Ledingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lkLedingkart Meetup #4: Data pipeline @ lk
Ledingkart Meetup #4: Data pipeline @ lk
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 

Mehr von Pavlo Baron

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve itPavlo Baron
 
Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Pavlo Baron
 
Qcon2015 living database
Qcon2015 living databaseQcon2015 living database
Qcon2015 living databasePavlo Baron
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Pavlo Baron
 
The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)Pavlo Baron
 
data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)Pavlo Baron
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Pavlo Baron
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)Pavlo Baron
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Pavlo Baron
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Pavlo Baron
 
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Pavlo Baron
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)Pavlo Baron
 
NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)Pavlo Baron
 
The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)Pavlo Baron
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Pavlo Baron
 

Mehr von Pavlo Baron (15)

@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it@pavlobaron Why monitoring sucks and how to improve it
@pavlobaron Why monitoring sucks and how to improve it
 
Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)Why we do tech the way we do tech now (@pavlobaron)
Why we do tech the way we do tech now (@pavlobaron)
 
Qcon2015 living database
Qcon2015 living databaseQcon2015 living database
Qcon2015 living database
 
Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)Becoming reactive without overreacting (@pavlobaron)
Becoming reactive without overreacting (@pavlobaron)
 
The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)The hidden costs of the parallel world (@pavlobaron)
The hidden costs of the parallel world (@pavlobaron)
 
data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)data, ..., profit (@pavlobaron)
data, ..., profit (@pavlobaron)
 
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
Data on its way to history, interrupted by analytics and silicon (@pavlobaron)
 
(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)(Functional) reactive programming (@pavlobaron)
(Functional) reactive programming (@pavlobaron)
 
Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)Diving into Erlang is a one-way ticket (@pavlobaron)
Diving into Erlang is a one-way ticket (@pavlobaron)
 
Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)Dynamo concepts in depth (@pavlobaron)
Dynamo concepts in depth (@pavlobaron)
 
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron)
 
What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)What can be done with Java, but should better be done with Erlang (@pavlobaron)
What can be done with Java, but should better be done with Erlang (@pavlobaron)
 
NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)NoSQL - how it works (@pavlobaron)
NoSQL - how it works (@pavlobaron)
 
The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)The Agile Alibi (Pavlo Baron)
The Agile Alibi (Pavlo Baron)
 
Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)Big Data & NoSQL - EFS'11 (Pavlo Baron)
Big Data & NoSQL - EFS'11 (Pavlo Baron)
 

Kürzlich hochgeladen

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Kürzlich hochgeladen (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Big Data - JAX2011 (Pavlo Baron)

  • 2. Pavlo Baron http://www.pbit.org [email_address] @pavlobaron
  • 3. So, you think you’re having Big Data?
  • 7. Know your data your scenarios
  • 8. Know your data your scenarios how to scale
  • 9. Know your data your scenarios how to scale the technology
  • 10. Know your data your scenarios how to scale the technology when to stop
  • 11. Know your data your scenarios how to scale the technology when to stop
  • 12. Where does your data actually come from ?
  • 13. Do you have a million well structured records?
  • 14. Or a couple of Gigabytes of storage?
  • 15. Does your data get modified every now and then ?
  • 16. Do you look at your data once a month to create a management report?
  • 17. Or is your data an unstructured chaos?
  • 18. Do you get flooded by tera-/petabytes of data?
  • 19. Or do you simply get bombed with data?
  • 20. Does your data flow on streams at a very high rate from different locations?
  • 21. Or do you have to read The Matrix ?
  • 22. Do you need to distribute your data over the whole world
  • 23. Or does your existence depend on (the quality of) your data?
  • 24. Know your data your scenarios how to scale the technology when to stop
  • 25. Is it the storage that you need to focus on?
  • 26. Or are you more preparing data?
  • 27. Or do you have your customers spread all over the world ?
  • 28. Or do you have complex statistical analysis to do?
  • 29. Or do you have to filter data as it comes?
  • 30. Or is it necessary to visualize the data?
  • 31. Know your data your scenarios how to scale the technology when to stop
  • 32. To scale for Big Data means to...
  • 33. Chop in smaller pieces
  • 34. Chop in bite-size , manageable pieces
  • 35. Separate reading from writing
  • 36. Minimize hard relations
  • 37. Separate archive from accessible data
  • 38. Trash everything that has only to be analyzed in real-time
  • 39. Parallelize and distribute
  • 40. Strive after spatial proximity to processors
  • 41. Utilize commodity hardware
  • 42. Relax new hardware startup procedure
  • 43. Consider hardware fallibility
  • 44. Strive after spatial proximity to users
  • 45. Consider network unreliability
  • 46. Design with eventual actuality/consistency in mind
  • 47. Design with Byzantine faults in mind
  • 48. Consider latency an adjustment screw
  • 49. Consider availability an adjustment screw
  • 50. Be prepared for disaster
  • 52. Design for theoretically unlimited amount of data
  • 53. Design for frequent structure changes
  • 54. Design for the all-in-one mix
  • 55. Know your data your scenarios how to scale the technology when to stop
  • 56. It’s not sufficient anymore just to throw it at your Oracle DB and to hope it works
  • 57. You have no chance without science
  • 58. To manage Big Data means to learn/know
  • 59. To manage Big Data means to learn/know algorithms/ADTs
  • 60. To manage Big Data means to learn/know algorithms/ADTs computing systems
  • 61. To manage Big Data means to learn/know algorithms/ADTs computing systems networking
  • 62. To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems
  • 63. To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems
  • 64. To manage Big Data means to learn/know algorithms/ADTs computing systems networking operating systems database systems distributed systems
  • 65. So, you know your data . You know your scenarios . You know the theory
  • 66. Now pick the right tools for the job
  • 67. I have thousands of log records per second . I want to store them immediately , but reliably for later statistics. How would I do that?
  • 68. Consider DHTs , P2P systems, distributed data stores etc.
  • 69. In order to write fast , distribute to several nodes with sloppy non-durable write quorum
  • 70. Build upon a system implementing consistent hashing . Don’t try home-made sharding as distribution replacement - you will fail adding new nodes
  • 71. Try Riak . It’s derived from Amazon Dynamo. It implements consistent hashing, gossip architecture, hinted handoff, vector clocks, merkle trees, sloppy quorum etc.
  • 72. I need to analyze these records in real-time for patterns and send alerts when any of them match. How would I do that?
  • 73. Consider CEP – complex event processing with an EPL – event processing language
  • 74. Choose a sliding time window just big enough to recognize causality and fire events on thresholds, deviations, anomalies etc.
  • 75. Try Esper . It implements CEP/EPL
  • 76. I collect my data at different locations all over the world , but want to do statistical analysis in my headquarters or at one other location . How would I do that?
  • 77. Aggregate your data and push it e.g. once a day out to the cloud . That’s a sort of replication if you like
  • 78. Choose a cloud based data store which can store big objects . The store should provide similar consistency characteristics as your local data store
  • 79. Try AWS (S3) or Rackspace (OpenStack/Swift) or a private cloud . They are either directly Dynamo based or implement similar concepts
  • 80. That’s a lot of data and distribution. I need to quickly push it from a location into the cloud while data keeps coming in . How would I do that?
  • 81. Use MapReduce to distribute the aggregation job to a group of nodes in order to quickly get the overall aggregation and cloud storage done
  • 82. Map , sort , combine and reduce to whatever representation you need
  • 83. Separate MapReduce splitting, jobs and intermediate storage from the local store to keep them independent and thus to read local store snapshots while still writing the new data
  • 84. Try Hadoop . It implements MapReduce with an own file system (HDFS), distribution etc. It is highly extensible
  • 85. I need to do some statistical analysis and visualize my data. How would I do that?
  • 86. Choose a general purpose platform for statistical computing and graphics
  • 87. Try R . It allows statistical analysis of whatever type of data and its graphical plotting. It’s highly extensible
  • 88. And these are only some of the possible Q&As. There are more areas such as NoSQL , content preparation for CDN , data mining etc. which we didn’t consider
  • 89. Know your data your scenarios how to scale the technology when to stop
  • 90. The experiment – live demo Source code will be available on http://github.com/pavlobaron Detail description will be available on http://archi-jab-ture.blogspot.com
  • 91. Situation Data Center (US) Data Center (EUR) Data Center (AFR) Votes Votes Reports
  • 92. SoapUI Simulation HTTP server Esper Alert Riak Object Storage Hadoop R
  • 93. We store votes as they come with sloppy write quorum. We store on several nodes in a regional cluster. We match patterns on a stream, not on saved data. We push aggregated day data from regions to the cloud using distributed MapReduce. We use scalable, distributed components with HA options. Etc. Does it scale ?
  • 95. Most images originate from istockphoto.com except few ones taken from Wikipedia and product pages