Slides: How to Avoid the 10 Big Data Analytics Blunders — Best Practices for Success in 2021

As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choice. But before you can make those decisions, it’s important to understand what not to do when planning for your organization’s big data initiatives.

Michael Stonebraker shares the top 10 big data blunders that he has witnessed in the last decade or so. As a pioneer of database research and technology for more than 40 years, Michael understands the mistakes enterprises often make and knows how to correct and avoid them. By learning about the major blunders, you’ll know how best to future-proof your big data management and digital transformation needs. Common blunders range from not planning to move everything to the cloud, to believing that a data warehouse will solve all your problems, to succumbing to the “innovator’s dilemma.” To illustrate the blunders, he shares a variety of corrective tips, strategies, and real-world examples.

  1. How to Avoid the 10 Big Data Blunders - Best Practices for Success in 2021
  2. Speakers
      ● Dr. Michael Stonebraker, Co-Founder, Tamr
      ● Anthony Deighton, Chief Product Officer, Tamr
  3. Blunder #1: Not Planning to Move Most EVERYTHING to the Cloud
  4. It may take a decade, but it is the right thing to do
      ● DeWitt vignette
      ● Hamilton vignette
      ● Elasticity!!!
      ● Data will move easier than applications -- decision support first
  5. YABUT...
      Security
      ● Cloud security is likely better than yours
      ● Misconfiguration, rogue employees
      Cost
      ● Likely that you are cheating
      Geographic Restrictions
      ● Cloud guys respect this
      Legal Restrictions
      ● Hopefully a short-term problem
      Other Restrictions
      ● Your CEO doesn’t approve (see Blunder #11, to come)
  6. YABUT...
      Where does the app run?
      ● Decision support: move the app
      ● Other stuff:
        ○ Start with local deployment; move to remote data (SLOWLY!!!)
        ○ Migrate to cloud-native as you have resources, starting with the most costly ones
        ○ This may be a lot of work and may take a decade or more
        ○ Issue is legacy code/hardware
  7. Blunder #2: Not Planning for AI/ML to be Disruptive
  8. Blunder #2: Not Planning for AI/ML to be Disruptive
      ML (whether deep or conventional) is getting much better
      ● Will displace workers with easy-to-explain jobs
      ● Think autonomous vehicles, automatic checkout, drone delivery, actuary calculations
      Likely to be disruptive
      ● You can be a disruptor or get disrupted - your choice
      ● Think Uber/Lyft or taxis
  9. So what to do?
      Pay up to get some AI/ML experts
      ● They are in short supply and very expensive
      ● Don’t contract this out (see Blunder #8)
      Get going on the coming arms race
      ● You will be a winner or a loser in a winner-take-all sweepstakes
  10. Blunder #3: Not Solving your REAL Data Science Problem
  11. Blunder #3: Not Solving your REAL Data Science Problem
      Typical data scientist spends 90+% of his/her time on data discovery, data integration and data cleaning
      ● iRobot vignette
      ● Merck vignette
      Nobody quotes less than 80%!!!
      ● Without clean data, ML is worthless!!!
        ○ More accurately, without “clean enough” data, ML is worthless
      Obvious directive: get a strategy in place to do this
      ● Start by giving the Chief Data Officer (CDO) read access to ALL enterprise data!
  12. Blunder #4: Belief that Traditional Data Integration Techniques Will Solve Issue #3
  13. Blunder #4: Belief that Traditional Data Integration Techniques Will Solve Issue #3
      ● Extract, Transform, and Load (ETL) (available from a variety of vendors)
      ● Master Data Management (MDM) (also available from the usual suspects)
  14. ETL
      What’s attempted:
      ● Decide what data sources to integrate (top down)
      ● Build a global data model (up front)
      ● For each data source
        ○ Send a programmer to interview the data set owner
        ○ He then builds an extractor and data cleaning routines (in a proprietary scripting language)
        ○ And loads data into the global schema
      Why it doesn’t work:
      ● I have never seen this technique work for more than 20 data sources
        ○ Too human intensive
      ● Building a global schema up front is way too difficult at scale
        ○ Remember enterprise-wide data models from 15-20 years ago...
      ● Most enterprises I know have way more than 20 data sources
        ○ Merck has 4000+/- Oracle databases
        ○ A data lake
        ○ Countless files
        ○ And data from the web is also important
  15. MDM
      ● Once you have run ETL, you need “match/merge”
      ● MDM suggests building “golden records” by
        ○ Implementing match rules (e.g. two entities are the same if they have the same address)
        ○ Implementing merge rules (e.g. take the most recent value and ignore older ones)
      Doesn’t Scale!
      ● GE classification problem: 20M spend transactions to be classified into a pre-built hierarchy
      ● 500 rules classified only 10% of the spend transactions
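The two rule types named on the MDM slide can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the field names (`address`, `updated`, `source_ids`) and the helper `normalize_address` are hypothetical and do not come from the talk or from any MDM product. The match rule is "same address", the merge rule is "most recent value wins".

```python
from datetime import date

# Hypothetical customer records from two sources; all field names are assumptions.
records = [
    {"id": "crm-1", "name": "Acme Corp", "address": "1 Main St, Boston",
     "phone": "555-0100", "updated": date(2019, 5, 1)},
    {"id": "erp-7", "name": "ACME Corporation", "address": "1 main st,  boston",
     "phone": "555-0199", "updated": date(2020, 11, 3)},
    {"id": "crm-2", "name": "Globex", "address": "9 Elm Rd, Austin",
     "phone": "555-0142", "updated": date(2020, 1, 15)},
]


def normalize_address(address):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(address.lower().split())


def match(records):
    """Match rule: two entities are the same if they have the same address."""
    groups = {}
    for rec in records:
        groups.setdefault(normalize_address(rec["address"]), []).append(rec)
    return list(groups.values())


def merge(group):
    """Merge rule: for every field, take the most recent value and ignore older ones."""
    golden = {}
    for rec in sorted(group, key=lambda r: r["updated"]):
        golden.update(rec)  # later records overwrite earlier ones
    golden["source_ids"] = [r["id"] for r in group]
    return golden


golden_records = [merge(group) for group in match(records)]
for golden in golden_records:
    print(golden["name"], golden["phone"], golden["source_ids"])
```

Each new data quirk (a suite number, an abbreviation, a moved office) needs another hand-written rule, which is one way to read the slide's point that 500 rules covered only 10% of the GE transactions.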
  16. So what to do?
      At scale, you need a solution that leverages ML and statistics
      ● OK to use rules to generate training data
      ● That’s what Tamr did on the GE problem
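A minimal sketch of "rules generate training data, a model does the rest" might look like the following. Everything here is an assumption made for illustration: the transactions, the two keyword rules and the token-overlap scorer are hypothetical, and the scorer is a deliberately crude stand-in for a real ML model. It is not a description of how Tamr solved the GE problem.

```python
from collections import Counter, defaultdict

# Hypothetical spend transactions (free-text descriptions); not real GE data.
transactions = [
    "dell laptop 15 inch",
    "lenovo laptop docking station",
    "office chair ergonomic",
    "standing desk and chair",
    "lenovo docking station",
    "ergonomic desk",
]

# Hand-written rules: keyword -> category. They cover only part of the data.
rules = {"laptop": "IT hardware", "chair": "Furniture"}


def apply_rules(text):
    """Return the category of the first rule whose keyword appears, else None."""
    tokens = text.split()
    for keyword, category in rules.items():
        if keyword in tokens:
            return category
    return None


# Step 1: use the rules to label whatever they can reach (the training data).
labeled = [(t, apply_rules(t)) for t in transactions if apply_rules(t) is not None]
unlabeled = [t for t in transactions if apply_rules(t) is None]

# Step 2: train a deliberately simple model: per-category token counts.
token_counts = defaultdict(Counter)
for text, category in labeled:
    token_counts[category].update(text.split())


def predict(text):
    """Pick the category whose training tokens overlap most with the text."""
    scores = {
        category: sum(counts[token] for token in text.split())
        for category, counts in token_counts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Step 3: the model classifies transactions that no rule matched.
predictions = {t: predict(t) for t in unlabeled}
for text, category in predictions.items():
    print(f"{text!r} -> {category}")
```

The design point is that the rules only need to be good enough to label a sample; the learned model then generalizes from words the rules never mention, such as "docking" or "desk" in this toy data.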
  17. Blunder #5: Belief that Data Warehouses will Solve all your Problems
  18. Blunder #5: Belief that Data Warehouses will Solve all your Problems
      Data warehouses are good at customer-facing structured data FROM A FEW DATA SOURCES
      ● But not text, images, video, …
      ● Use the technology for what it is good for
        ○ Do not perform unnatural acts!
        ○ And get rid of the “high price spread”, if you bought into it
        ○ And remember that your warehouse will move to the cloud (see Blunder #1)
  19. Blunder #6: Belief that Hadoop/Spark will Solve all your Problems
  20. Blunder #6: Belief that Hadoop/Spark will Solve all your Problems
      ● Hadoop/Spark is not very good at anything
        ○ E.g. Spark/SQL is not competitive (but getting better)
        ○ E.g. Spark/Streaming is not competitive (last time I looked)
      ● Use “best of breed” not “lowest common denominator” -- at least for your “secret sauce”
        ○ This is a universal blunder -- desire to use only one vendor
        ○ Hadoop/Spark is not very good at anything
      ● And…
        ○ Spark/Hadoop is useless on Blunders #3 and #4 (i.e. data integration)
  21. So what to do with your Hadoop/Spark cluster?
      ● Repurpose it for a Data Lake
      ● Repurpose it for Data Integration
      ● Throw it Away
        ○ Hardware lifetime is 3 years (maybe)
        ○ Remember Blunder #1
  22. Blunder #7: Belief that Data Lakes will Solve all your Problems
  23. Blunder #7: Belief that Data Lakes will Solve all your Problems
      Conventional Wisdom
      ● Just load all your data into a “data lake” and you will be able to correlate all data sets
      Important Fact (Tattoo this on your Brain):
      ● Independently constructed data sets are never “plug compatible”
  24. Why?
      ● Schemas don’t match
        ○ You call it salary; I call it wages
      ● Units don’t match
        ○ You use Euros; I use $$$
      ● Semantics don’t match
        ○ My salaries are gross before taxes; yours are net after taxes with a lunch allowance
      ● Time granularity doesn’t match
        ○ You have annual data; I have monthly data
      ● Data is dirty
        ○ 99 means null (sometimes)
        ○ Null means “data missing” or “data not allowed” or...
      ● Duplicates must be removed
        ○ And there are no keys
        ○ I am Mike Stonebraker in one data set; M.R. Stonebreaker in a second one
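Several of the mismatches on the "Why?" slide can be made concrete with a small sketch. All of it is assumed for illustration: the two data sets, the field names, the fixed exchange rate `EUR_TO_USD`, the choice to treat `99` as null, and the 0.6 similarity threshold are hypothetical. The gross-versus-net semantic mismatch is left out on purpose, because no code can repair it without information that is not in the data.

```python
from difflib import SequenceMatcher

# Two hypothetical, independently built data sets. Field names, the exchange
# rate and the handling of the 99 sentinel are illustrative assumptions.
EUR_TO_USD = 1.25  # assumed fixed rate, for illustration only

source_a = [
    {"name": "Mike Stonebraker", "salary": 120000.0, "currency": "USD", "period": "annual"},
]
source_b = [
    {"name": "M.R. Stonebreaker", "wages": 8000.0, "currency": "EUR", "period": "monthly"},
    {"name": "Jane Doe", "wages": 99, "currency": "EUR", "period": "monthly"},
]


def harmonize(record):
    """Map one record onto a shared shape: annual salary in USD, None when missing."""
    amount = record.get("salary", record.get("wages"))  # schema: salary vs. wages
    if amount == 99:                                    # dirty data: 99 means null here
        amount = None
    if amount is not None:
        if record["period"] == "monthly":               # time granularity
            amount *= 12
        if record["currency"] == "EUR":                 # units
            amount *= EUR_TO_USD
    return {"name": record["name"], "annual_salary_usd": amount}


def same_person(name_a, name_b, threshold=0.6):
    """Fuzzy name comparison, since the data sets share no keys."""
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


harmonized = [harmonize(r) for r in source_a + source_b]
for row in harmonized:
    print(row)
print(same_person("Mike Stonebraker", "M.R. Stonebreaker"))
```

Every branch in `harmonize` encodes knowledge about one specific source, and the threshold in `same_person` is a guess that would need tuning against real duplicates. That per-source effort is what makes curation hard once the number of data sets grows.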
  25. The Net Result
      ● Your analytics will be garbage
        ○ “GIGO”
      ● Your ML models will fail
        ○ I.e. produce garbage
        ○ Again “GIGO”
  26. So what to do?
      ● You don’t have a data lake; you have a data swamp
      ● Need a data curation system
        ○ Which will solve the aforementioned problems
        ○ And this will not be trivial!!
      ● Traditional technology likely to fail (see Blunder #4)
      ● This is an 800 pound gorilla
        ○ Make sure you put your best people on it!!!!
        ○ Chances are your in-house solution is crap
        ○ Use modern technology (from startups) not your “home brew”
      ● If you want the best technology, you have to deal with startups!!!!
  27. Blunder #8: Outsourcing your new stuff to Palantir, IBM, Mu Sigma
  28. Blunder #8: Outsourcing your new stuff to Palantir, IBM, Mu Sigma
      ● Typical enterprise spends 95% of its IT resources keeping current (legacy) code running
        ○ I.e. maintenance
        ○ Most are dug in pretty deep
        ○ Often have the best people “keeping the lights on”
      ● “Shiny new stuff” gets outsourced
        ○ Often because there is no appropriate talent internally
  29. This is a Catch-22
      ● Your maintenance is boring!
        ○ So creative people quit
        ○ So there is no good talent to work on the new stuff
        ○ And you can’t hire great talent (it takes great people to hire great people)
      ● Your new stuff is your “secret sauce” over the next decade or so…
        ○ Please don’t outsource it. This is long-term suicide
        ○ Instead outsource the diddly-crap (e-mail et al.)
        ○ Software is your secret sauce -- invest in your own people
  30. So what to do?
      1. Start by solving Blunder #2 (Not planning for AI/ML to change most everything)
      2. Outsource the boring maintenance
      3. Cancel the Palantir contract
  31. Blunder #9: Succumbing to the “Innovator’s Dilemma”
  32. Blunder #9: Succumbing to the “Innovator’s Dilemma”
      ● Must-read book by Clayton Christensen
      ● Steam shovel example
        ○ Cable steam shovels - big payload
        ○ Hydraulics - much safer, but low payload; used for “small jobs”
        ○ Payloads increased and hydraulics won
        ○ Cable guys went out of business
  33. Net-Net
      ● Have to be willing to give up your current business model
      ● And reinvent yourself
      ● Possibly losing some current customers in the process
        ○ Otherwise, you go out of business in the long run
        ○ Taxi licenses in Cambridge have gone from $700k to $10k
  34. Blunder #10: Not Paying Up for a Few “Rocket Scientists”
  35. Blunder #10: Not Paying Up for a Few “Rocket Scientists”
      ● They will be your guiding light to avoiding these blunders
      ● They will be “off scale”
        ○ Your HR folks won’t like what you have to pay
      ● Chances are they will be weird
        ○ E.g. no shoes, no socks, no tie, feet on the table, ...
      ● Please don’t drive them away!
        ○ As Citibank did to one of my Berkeley students a while ago
  36. Blunder #11 (Bonus): Working for a Company That is not Trying to do Something about the “Sins of the Past”
  37. Blunder #11 (Bonus): Working for a Company That is not Trying to do Something about the “Sins of the Past”
      If you work for a company that is succumbing to (even one of) these blunders, then:
      1. You should be fixing it
         a. Be part of the solution, not part of the problem
      2. Or looking for a new employer
         a. Tamr is hiring!
  38. Questions?
  39. Thank You!
      To learn more about Tamr, visit tamr.com.
      You’ll receive the 10 Big Data Analytics Blunders Infographic via email.
