The Big Data Developer (@pavlobaron)

Slides of the talk I gave at BigDataCon 2012 in Mainz, Germany, and at SEACON 2012 in Hamburg

Published in: Technology

  1. Big Big Big Big Big data developer
  2. Pavlo Baron pavlo.baron@codecentric.de @pavlobaron
  3. Disclaimer: I will not flame-war, rant, bash or do anything similar around concrete tools. I'm tired of doing that
  4. Hey, dude. I'm a big data developer. We have enormous data. I continuously read articles on highscalability.com
  5. Big data is easy. It's like normal data, but, well, even bigger.
  6. Aha. So this is not big, right?
  7. That's big, huh?
  8. Petabytes of data every hour on different continents, with complex relations and with the need to analyze them in almost real time for anomalies and to visualize them for your management
  9. You can easily call this big data. Everything below this you can call Mickey Mouse data
  10. Good news is: you can intentionally grow – collect more data. And you should!
  11. np, dude! Data is data. Same principles
  12. Really? Let's consider storage, ok?
  13. np, dude! Storage is just a database
  14. Really? Storage capacity of one single box, no matter how big it is, is limited
  15. np, dude! You're talking about SQL, right? SQL is boring grandpa stuff and doesn't scale
  16. Oh. When you expect big data, you need to scale very far and thus build on distribution and combine a theoretically unlimited number of machines into one single distributed storage
  17. There is no way around it. Unless you invent the BlackHoleDB
  18. np, dude! NoSQL scales and is cool. They achieve this through sharding, ya know? Sharding hides this distribution stuff
  19. Hem. Building upon distribution is much harder than anything you've seen or done before
  20. Unless you fed a crowd with 7 loaves of bread and walked upon the water
  21. np, dude! From the three of CAP, I'll just pick two. As easy as this
  22. The only thing that is absolutely certain about distributed systems is that parts of them will fail and you will have no idea where and what the hell is going on
  23. So your P must be a given in a distributed system. And you want to play with C vs. A, not just take black or white
  24. np, dude! Sharding works seamlessly. I don't need to take care of anything
  25. Seriously? For example, one of the hardest challenges with big data is to distribute/shard parts over several machines while still having fast traversals and reads, thus keeping related data together. Valid for graph and any other data store, also NoSQL, kind of
  26. Another hard challenge with sharding is to avoid naive hashing. Naive hashing would make you depend on the number of nodes and would not allow you to easily add or remove nodes to/from the system
  27. And still, the trade-off between data locality, consistency, availability, read/write/search speed, latency etc. is hard
  28. np, dude! NoSQL would write asynchronously and do map/reduce to find data
  29. Of course. You will love eventual consistency, especially when you need a transaction around a complex money transfer
  30. np, dude! I don't have money to transfer. But I need to store lots of data. I can throw any amount of it at NoSQL, and it will just work
  31. Really? So you'd just throw something into your database and hope it works?
  32. What if you throw and miss the target?
  33. Data locality, redundancy, consistent hashing and eventual consistency combined with use case driven storage design are key principles in succeeding with a huge distributed data storage. That's big data development
  34. How about data provisioning?
  35. np, dude! It's the database being the bottleneck, not my web servers
  36. When you have thousands or millions of parallel requests per second, begging for data, the first mile will (also) quickly become the bottleneck. Requests will get queued and discarded as soon as your server doesn't bring data fast enough through the pipe
  37. np, dude! I'll get me some bad-ass sexy hardware
  38. I bet you will. But under high load, your hardware will more or less quickly start to crack
  39. You'll burn your hard disks, boards and cards. And wires. And you'll heat up to a maximum
  40. It's not about sexy hardware, but about being able to quickly replace it. Ideally while the system keeps running
  41. But anyway. Keeping the first mile scalable and fast would lead to some expensive network infrastructure. You need to get the maximum out of your servers in order to reduce their number
  42. np, dude! I will use an event driven C10K problem solving awesome web server. Or I'll write one on my own
  43. Maybe. But when your users are coming from all over the world, it won't help you much since the network latency from them to your server will kill them
  44. You would have to go for a CDN one day, statically pre-computing content. You would use their infrastructure and reduce the number of hits on your own servers to a minimum
  45. np, dude! I'll push my whole platform out to the cloud. It's even more flexible and scales like hell
  46. Well. You cannot reliably predict on which physical machine and actually how close to the data your program will run. Whenever virtual machines or storage fragments get moved, your world stops
  47. You can easily force data locality and shorter stop-the-world phases by paying higher bills
  48. Data locality, geographic spatiality, dedicated virtualization and content pre-computability combined with use case driven cloudification are key principles in succeeding with provisioning of huge data amounts. That's big data development
  49. Let's talk about processing, ok?
  50. np, dude! All easily done by a map/reduce tool
  51. Almost agreed. map/reduce has two basic phases: namely “map” and “reduce”
  52. The slowest of those two is definitely “split”. Moving data from one huge pile to another before map/reduce is damn expensive
  53. np, dude! I'll write my data straight to the storage of my map/reduce tool. It will then tear
  54. It can. But what if you need to search during the map phase – full-text, meta?
  55. np, dude! I'll use a cool indexing search engine or library. It can find my data in a snap
  56. Would it? A very hard challenge is to partition the index and to couple its related parts to the corresponding data. With data locality of course, having index pieces on the related machines
  57. Data and index locality and direct filling of data pots as data flies by combined with use case driven technology usage are key principles in succeeding with processing of huge data amounts. That's big data development
  58. So, how about analytics, dude?
  59. np, dude! It's a classic use case for map/reduce. I can do this afterwards and on the fly
  60. Are you sure? So, you expect one tool to do both, real-time and post fact analytics?
  61. What did you smoke last night?
  62. You don't want to believe in map/reduce in (near) real-time, do you?
  63. np, dude! I'll get me some rocket fast hardware
  64. I'm sure you will. But: You cannot predict and fix the map/reduce time. You cannot ensure the completeness of data. You cannot guarantee causality knowledge
  65. Distribution has got your soul
  66. If you need to predict better, to be able to know about data/event causality, to be fast, you need to CEP data streams as data flies by. There is no (simple, fast) way around
  67. But the most important thing is: None of the BI tools you know will adequately support your NoSQL data store, so you're all alone in the world of proprietary immature tool combinations. The world of pain.
  68. np, dude! My map/reduce tool can even hide math from me, so I can concentrate on doing stuff
  69. There is no point in fearing math/statistics. You just need it
  70. Separation of immediate and post fact analytics and CEP of data streams as data flies by combined with use case driven technology usage and statistical knowledge are key principles in succeeding with analytics of huge data amounts. That's big data development
  71. Oh, we forgot visualization
  72. np, dude! I just have no idea about it
  73. Me neither. I just know that you can't visualize huge data amounts using classic spreadsheets. There are better ways, tools, ideas to do this – find them. That's big data development
  74. hem, dude... You're a smart-ass. Is that what you want to say?
  75. Almost. In one of my humble moments I would suggest you do the following:
  76. Stop thinking you gain adequately deep knowledge through reading half-baked blog posts. Get yourself some of those:
  77. Statistics, Visualization, Distribution, Network, Different languages, Tools, chains, Know and use full stack, Data stores, Different platforms, OS, Storage, Machine, Algorithms, Math
  78. Know your point of pain. You must be Twitter, Facebook or Google to have them all at the same time. If you're none of them, you can have one or two. Or even none. Go for them with the right tool chain
  79. The first and most important tool in the chain is your brain
  80. Thank you
  81. Most images originate from istockphoto.com, except a few taken from Wikipedia or Flickr (CC) and product pages, or generated through public online generators
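
Slides 26 and 33 contrast naive hashing with consistent hashing. The talk itself shows no code, so what follows is only a minimal sketch of that idea under stated assumptions: the names `stable_hash`, `naive_node`, `HashRing` and `moved_fraction`, the node names, the use of MD5 and the 100 virtual nodes per machine are all illustrative choices, not anything taken from the slides or from a specific data store. It counts how many keys change owner when a fifth node joins a four-node cluster.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def naive_node(key, nodes):
    """Naive sharding: the owner depends directly on the number of nodes."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Consistent hashing: nodes and keys share one circular hash space."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points
        # (hypothetical default) to even out the share of keys per node.
        self._ring = sorted(
            (stable_hash("%s#%d" % (node, i)), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._points, stable_hash(key))
        return self._ring[idx % len(self._ring)][1]


def moved_fraction(before, after, keys):
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = ["key-%d" % i for i in range(10000)]
four = ["node-a", "node-b", "node-c", "node-d"]
five = four + ["node-e"]

naive_moved = moved_fraction(
    lambda k: naive_node(k, four), lambda k: naive_node(k, five), keys
)
ring4, ring5 = HashRing(four), HashRing(five)
ring_moved = moved_fraction(ring4.node_for, ring5.node_for, keys)

print("naive hashing, keys moved:      %.1f%%" % (100 * naive_moved))
print("consistent hashing, keys moved: %.1f%%" % (100 * ring_moved))
```

With a uniform hash, modulo placement is expected to relocate about four out of five keys when going from four to five nodes, while the ring is expected to hand over only about one fifth of the keys, all of them to the new node. Real stores layer replication and rebalancing on top of this, which the sketch leaves out.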