Big Data 101
    Bouvet BigOne, 2013-03-14
    Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga

What is big data?

      "Big Data is any thing which is crash Excel."
      "Small Data is when is fit in RAM. Big Data is when is crash
      because is not fit in RAM."
      https://twitter.com/devops_borat

      Or, in other words, Big Data is data in volumes too great to
      process by traditional methods.

Data accumulation

    • Today, data is accumulating at tremendous
      rates
       –   click streams from web visitors
       –   supermarket transactions
       –   sensor readings
       –   video camera footage
       –   GPS trails
       –   social media interactions
       –   ...
    • It really is becoming a challenge to store
      and process it all in a meaningful way

From WWW to VVV

    • Volume
      – data volumes are becoming unmanageable
    • Variety
      – data complexity is growing
      – more types of data captured than previously
    • Velocity
      – some data is arriving so rapidly that it must either
        be processed instantly, or lost
      – this is a whole subfield called “stream processing”
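
A minimal sketch of the stream-processing idea, assuming readings arrive one at a time and cannot be stored: each value updates a running summary and is then discarded. The `RunningStats` class and the sample readings are illustrative assumptions, not something from the talk.

```python
class RunningStats:
    """Running count, mean and maximum, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.maximum = None

    def update(self, value):
        # Incremental mean: earlier values do not need to be kept around.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        if self.maximum is None or value > self.maximum:
            self.maximum = value


def summarize(stream):
    """Consume an iterable of numbers once and return the final stats."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


# Hypothetical sensor readings; a generator stands in for a live feed.
readings = (r for r in [21.0, 22.0, 23.0, 26.0])
stats = summarize(readings)
print(stats.count, stats.mean, stats.maximum)
```

Memory use stays constant however long the stream runs, which is the property that matters when data cannot be kept.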




The promise of Big Data

• Data contains information of great
  business value
• If you can extract those insights you can
  make far better decisions
• ...but is data really that valuable?

     “quadrupling the average cow's milk production since your
     parents were born”



     "When Freddie [as he is known]
     had no daughter records our
     equations predicted from his DNA
     that he would be the best bull,"
     USDA research geneticist Paul
     VanRaden emailed me with a
     detectable hint of pride. "Now he is
     the best progeny tested bull (as
     predicted)."




Ok, ok, but ... does it apply to our
     customers?
     • Norwegian Food Safety Authority
        – accumulates data on all farm animals
        – birth, death, movements, medication, samples, ...
     • Hafslund
        – time series from hydroelectric dams, power prices,
          meters of individual customers, ...
     • Social Security Administration
        – data on individual cases, actions taken, outcomes...
     • Statoil
        – massive amounts of data from oil exploration,
          operations, logistics, engineering, ...
     • Retailers
        – see Target example above
        – also, connection between what people buy, weather
          forecast, logistics, ...

How to extract insight from data?




        [Figure: Monthly Retail Sales in New South Wales (NSW) Retail Department Stores]

Estimating real estate prices

     • Take parameters
        –   x1    square meters
        –   x2    number of rooms
        –   x3    number of floors
        –   x4    energy cost per year
        –   x5    meters to nearest subway station
        –   x6    years since built
        –   x7    years since last refurbished
        –   ...
     • a x1 + b x2 + c x3 + ... = price
        – strip out the x-es and you have a vector
        – collect N samples of real flats with prices = matrix
        – welcome to the world of linear algebra
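
The estimation above can be sketched as a least-squares fit. This is an illustration under stated assumptions: the flats and prices are made up, only two parameters are used, and the helper names `fit_linear` and `predict` are hypothetical. The coefficients a, b, ... from the slide become the list `weights`.

```python
def fit_linear(samples, prices):
    """Least-squares fit of price ≈ w[0]*x1 + w[1]*x2 + ... (no intercept).

    Solves the normal equations (X^T X) w = X^T y by Gaussian elimination.
    """
    n = len(samples[0])
    # Build A = X^T X and b = X^T y.
    a = [[sum(row[i] * row[j] for row in samples) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(samples, prices)) for i in range(n)]

    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - known) / a[i][i]
    return w


def predict(w, x):
    """Apply the fitted coefficients to one parameter vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Made-up flats: (square meters, number of rooms), generated from
# price = 50000 * m2 + 100000 * rooms so the fit can be checked.
flats = [(50, 2), (70, 3), (90, 3), (120, 5)]
prices = [50000 * m2 + 100000 * rooms for m2, rooms in flats]

weights = fit_linear(flats, prices)
print(weights)
print(predict(weights, (80, 3)))
```

Real price data is noisy, so the fit would only approximate the prices; the synthetic data here is exact so that the recovered coefficients can be compared with the ones used to generate it.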

Types of algorithms

     •   Clustering
     •   Association learning
     •   Parameter estimation
     •   Recommendation engines
     •   Support Vector Machines
     •   Similarity matching
     •   Neural networks
     •   Bayesian networks
     •   Genetic algorithms


Basically, it’s all maths...

     •   Linear algebra
     •   Calculus
     •   Probability theory
     •   Graph theory
     •   ...

     "Only 10% in devops are know how of work with Big Data. Only 1%
     are realize they are need 2 Big Data for fault tolerance"
     https://twitter.com/devops_borat

Big data skills gap

     • Hardly anyone knows this stuff
     • It’s a big field, with lots and lots of theory
     • And it’s all maths, so it’s tricky to learn




     http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
     http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap

Two orthogonal aspects

     • Analytics / machine learning
       – learning insights from data
     • Big data
       – handling massive data volumes
     • Can be combined, or used separately




How to process Big Data?

     • If relational databases are not enough,
       what is?

     "Mining of Big Data is problem solve in 2013 with zgrep"
     https://twitter.com/devops_borat

MapReduce

     • A framework for writing massively parallel
       code
     • Simple, straightforward model
     • Based on “map” and “reduce” functions
       from functional programming (LISP)
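
A single-process sketch of the model, assuming the classic word-count example. The function names (`map_reduce`, `word_mapper`, `count_reducer`) are hypothetical; a real framework would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # shuffle: group values by key
    return {key: reducer(key, values)           # reduce phase
            for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum all the 1s emitted for a given word."""
    return sum(counts)


lines = ["big data is big", "small data is small"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts)
```

Because each call to the mapper sees only one record and each call to the reducer sees only one key, the work can be split across machines without the functions knowing about it.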




Things you can do in MapReduce

     • Google’s PageRank algorithm
       – easily expressible in MapReduce
       – one of the first applications of MapReduce
     • SQL
       – relational algebra has straightforward translation
         to the MapReduce model
     • Linear algebra
       – matrix operations are easily MapReducible
       – (PageRank is just a bunch of matrix operations)
     • Recommendation engines
       – also MapReducible (e.g. the SON algorithm for frequent itemsets)
       – ...
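
To illustrate the linear algebra point, here is a sketch of matrix-vector multiplication written in map/reduce style and then applied repeatedly, which is a simplified PageRank. It assumes a made-up three-page web and leaves out the damping factor used in practice; `matrix_vector_multiply` is a hypothetical helper, not a library function.

```python
from collections import defaultdict


def matrix_vector_multiply(entries, vector):
    """Compute M·v where entries is a list of (row, col, value) triples."""
    # Map: each matrix entry emits (row, value * v[col]).
    partial = defaultdict(list)
    for row, col, value in entries:
        partial[row].append(value * vector[col])
    # Reduce: sum the partial products for each row.
    return [sum(partial[row]) for row in range(len(vector))]


# Column-stochastic link matrix for a made-up three-page web:
# page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0.
links = [
    (1, 0, 0.5), (2, 0, 0.5),
    (2, 1, 1.0),
    (0, 2, 1.0),
]

# Start from a uniform rank and apply the matrix repeatedly.
rank = [1 / 3, 1 / 3, 1 / 3]
for _ in range(50):
    rank = matrix_vector_multiply(links, rank)
print([round(r, 3) for r in rank])
```

Only the non-zero entries are stored, so each map step touches one link at a time; that is what lets the computation spread over a link graph too large for one machine.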

NoSQL and Big Data

     • Not really that relevant
     • Traditional databases handle big data sets,
       too
     • NoSQL databases have poor analytics
     • MapReduce often works from text files
        – can obviously work from SQL and NoSQL, too
     • NoSQL is more for high throughput
        – basically, AP from the CAP theorem, instead of CP
     • In practice, really Big Data is likely to be a
       mix
        – text files, NoSQL, and SQL

The 4th V: Veracity

     “The greatest enemy of knowledge is not
     ignorance, it is the illusion of knowledge.”
                        Daniel Boorstin, in The Discoverers (1983)



     "95% of time, when is clean Big Data is get Little Data"
     https://twitter.com/devops_borat

Data quality

     • A huge problem in practice
       – any manually entered data is suspect
       – most data sets are in practice deeply problematic
     • Even automatically gathered data can be a
       problem
       – systematic problems with sensors
       – errors causing data loss
       – incorrect metadata about the sensor
     • Never, never, never trust the data without
       checking it!
       – garbage in, garbage out, etc
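
A minimal sketch of what checking the data can look like for sensor readings, assuming a plausible value range is known in advance. The readings, the range, and the `check_readings` helper are all hypothetical.

```python
import math


def check_readings(readings, low, high):
    """Split sensor readings into plausible values and flagged problems."""
    clean, problems = [], []
    for index, value in enumerate(readings):
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append((index, value, "missing"))
        elif not low <= value <= high:
            problems.append((index, value, "out of range"))
        else:
            clean.append(value)
    return clean, problems


# Hypothetical temperature readings in °C; -999.0 stands in for a
# sentinel "no data" value that would otherwise distort any average.
readings = [12.5, 13.1, None, -999.0, 12.9, 85.0]
clean, problems = check_readings(readings, low=-40.0, high=60.0)
print(len(clean), "usable of", len(readings))
for index, value, reason in problems:
    print(index, value, reason)
```

Flagged values are reported rather than silently dropped, so systematic sensor problems show up as patterns in the problem list.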

Conclusion

     • Vast potential
        – in both big data and machine learning
     • Very difficult to realize that potential
        – requires mathematics, which nobody knows
     • We need to wake up!




Where to learn more

     • University of Oslo
       – has courses on linear algebra, probability, graph
         theory, ...
     • Stanford University
       – https://www.coursera.org/course/ml
     • Mining Massive Datasets
       – http://infolab.stanford.edu/~ullman/mmds.html




