SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Introduction to Big Data
Sri Kanajan
Big Data
• When data is too VVV (volume, variety, velocity) to manage with traditional
RDBMS, then you enter BIG DATA!
• Data Storage and Manipulation, at Scale
– MapReduce, Hadoop, relationship to databases (Framework)
– Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type)
– Entity resolution, record linkage, data cleaning (data integration)
• Analytics (Machine Learning)
– Basic statistical modeling, experiment design, overfitting
– Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression
– Unsupervised learning: k-means, multi-dimensional scaling
– Graph Analytics: PageRank, community detection, recursive queries, iterative processing
– Text Analytics: latent semantic analysis
– Collaborative Filtering: slope-one
• Communicating Results
– Visualization, data products, visual data analytics
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop ,MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
Big Data Everywhere!
• Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions
– Social Network
Unknown Hidden Relationships within this Data !!!
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hydron Collider (LHC) generates 15 PB a
year
640K ought to be
enough for anybody.
Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Unstructured Text Data
– Log data, Comments, User generated text
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF)
• Real time Data
– You can only scan the data once and need to do
analytics quickly
What does Big Data Give You?
• Without Big Data
– Many data warehouses that were separate and on non distributed
architectures
– Had to modify data structures and unique programming to merge databases
together
– Scaling database size is a continual problem
– Any large scale analytics took days and weeks and large coordination effort
within IT to get database accesses
– Data analysis is a large effort and lots of data tend to remain unanalyzed or
even worse not stored
• With Big Data
– Hadoop provides a single view of all databases that can be distributed
– Database size is a non issue
– Ability to perform advanced statistical analysis on very large datasets very
quickly
– Data analysis is the competitive edge for many companies since barriers of
entry are continually dropping through the development of platforms
Examples
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices, meters of individual
customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration, operations, logistics,
engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather forecast, logistics, ...
Big Data
Power of Distribution
45 Minutes! 4.5 Minutes!
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop ,MapReduce – Storage, Processing
– Machine Learning – Analytics
– Visualization
Hadoop
• A framework that allows for distributed
processing of large data sets across clusters of
commodity computers using a simple
programming model (I.e. MapReduce)
– Distributed data processing
– Works with structured and unstructured data
– Open source
– Master-slave architecture
– Fault tolerant using commodity hardware
MapReduce
• Programming model on top of Hadoop
• Basic concept is to provide a programming model that
immediately supports parallel processing (SQL on the
other hand does not natively encourage parallel
processing)
• Pig is a framework and programming language to
develop MapReduce
• Note – MapReduce is great for extremely large data
sets with simple relations. SQL is great for medium size
data sets but with complex relationships
– I.e. you have to decide the right technology depending on
your problem space
A Simple Example
• Counting words in a large set of documents
map(string value)
//key: document name
//value: document contents
for each word w in value
EmitIntermediate(w, “1”);
reduce(string key, iterator values)
//key: word
//values: list of counts
int results = 0;
for each v in values
result += ParseInt(v);
Emit(AsString(result));
MapReduce
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
Machine Learning
• Essentially ways to analyze data to extract
valuable information with or without training
data
– Prediction
• predicting a variable from data
– Classification
• assigning records to predefined groups
– Clustering
• splitting records into groups based on similarity
– Association learning
• seeing what often appears together with what
– And many others….
Now you have an optimization
metric by which you can automate
the exploration of all possible
hypotheses !
Problems with this approach??
Two kinds of learning
21
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct
answer
• Unsupervised
– no training data
– throw data into the algorithm, hope it
makes some kind of sense out of the data
Example: Collaborative Filtering
• Goal: predict what movies/books/… a person may be interested in,
on the basis of
– Past preferences of the person
– Other people with similar past preferences
– The preferences of such people for a new movie/book/…
• One approach based on repeated clustering
– Cluster people on the basis of preferences for movies
– Then cluster movies on the basis of being liked by the same clusters of
people
– Again cluster people based on their preferences for (the newly created
clusters of) movies
– Repeat above till equilibrium
• Above problem is an instance of collaborative filtering, where users
collaborate in the task of filtering information to find information of
interest
22
Outline
• What is Big Data?
• Why is this important now?
• Key Concepts
– Hadoop, MapReduce – Storage architecture
– Machine Learning – Analytics
– Visualization
Is this an effective visual
representation?
Better Mapping? Why?
Diagrams Showing O-Ring Damage
that was Used to Decide to Launch
Challenger in 1987
Representation of the Same Data
Strategies to Increase the Information
Encoded by Spatial Position
• Composition
– Orthogonal placement of axes
– Creates a 2D metric space
Strategies to Increase the Information
Encoded by Spatial Position
• Alignment
Folding
• Continuation of the Axes
Recursion
Overloading
Conclusion
• Big Data is a huge field that combines
expertise from different domains in order to
find interesting information from data
• Extracting interesting information from data is
the next competitive edge for many
companies as information becomes available,
instantly anywhere

Weitere ähnliche Inhalte

Was ist angesagt?

What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?CodePolitan
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopArchana Gopinath
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Benjamin Taylor
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Hadoop - An Introduction
Hadoop - An IntroductionHadoop - An Introduction
Hadoop - An IntroductionShankar R
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?Seval Çapraz
 
Future of Data - Big Data
Future of Data - Big DataFuture of Data - Big Data
Future of Data - Big DataShankar R
 
Online retail a look at data consulting approach
Online retail   a look at data consulting approachOnline retail   a look at data consulting approach
Online retail a look at data consulting approachShesha R
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]ssuser23e4f31
 

Was ist angesagt? (20)

What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Hadoop - An Introduction
Hadoop - An IntroductionHadoop - An Introduction
Hadoop - An Introduction
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?What is Datamining? Which algorithms can be used for Datamining?
What is Datamining? Which algorithms can be used for Datamining?
 
Future of Data - Big Data
Future of Data - Big DataFuture of Data - Big Data
Future of Data - Big Data
 
Bigdata
BigdataBigdata
Bigdata
 
Online retail a look at data consulting approach
Online retail   a look at data consulting approachOnline retail   a look at data consulting approach
Online retail a look at data consulting approach
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Data analytics
Data analyticsData analytics
Data analytics
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Data Science
Data ScienceData Science
Data Science
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
 

Ähnlich wie Big data Intro - Presentation to OCHackerz Meetup Group

Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
big data processing.pptx
big data processing.pptxbig data processing.pptx
big data processing.pptxssuser96aab9
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataMelissa Hornbostel
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusersBob Hardaway
 
Modul_1_Introduction_to_Big_Data.pptx
Modul_1_Introduction_to_Big_Data.pptxModul_1_Introduction_to_Big_Data.pptx
Modul_1_Introduction_to_Big_Data.pptxNouhaElhaji1
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolutionmark madsen
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQLPhilippe Julio
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyInfiniteGraph
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & HadoopBlackvard
 

Ähnlich wie Big data Intro - Presentation to OCHackerz Meetup Group (20)

Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
big data processing.pptx
big data processing.pptxbig data processing.pptx
big data processing.pptx
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
Big data4businessusers
Big data4businessusersBig data4businessusers
Big data4businessusers
 
Modul_1_Introduction_to_Big_Data.pptx
Modul_1_Introduction_to_Big_Data.pptxModul_1_Introduction_to_Big_Data.pptx
Modul_1_Introduction_to_Big_Data.pptx
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
TOUG Big Data Challenge and Impact
TOUG Big Data Challenge and ImpactTOUG Big Data Challenge and Impact
TOUG Big Data Challenge and Impact
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph TechnologyOracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology
 
Introduction To Big Data & Hadoop
Introduction To Big Data & HadoopIntroduction To Big Data & Hadoop
Introduction To Big Data & Hadoop
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 

Kürzlich hochgeladen

UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfNirmal Dwivedi
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structuredhanjurrannsibayan2
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 

Kürzlich hochgeladen (20)

UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 

Big data Intro - Presentation to OCHackerz Meetup Group

  • 1. Introduction to Big Data Sri Kanajan
  • 2. Big Data • When data is too VVV (volume, variety, velocity) to manage with traditional RDBMS, then you enter BIG DATA! • Data Storage and Manipulation, at Scale – MapReduce, Hadoop, relationship to databases (Framework) – Key-value stores and NoSQL; tradeoffs of SQL and NoSQL (Database type) – Entity resolution, record linkage, data cleaning (data integration) • Analytics (Machine Learning) – Basic statistical modeling, experiment design, overfitting – Supervised learning: overview, simple nearest neighbor, decision trees/forests, regression – Unsupervised learning: k-means, multi-dimensional scaling – Graph Analytics: PageRank, community detection, recursive queries, iterative processing – Text Analytics: latent semantic analysis – Collaborative Filtering: slope-one • Communicating Results – Visualization, data products, visual data analytics
  • 3. Outline • What is Big Data? • Why is this important now? • Key Concepts – Hadoop ,MapReduce – Storage, Processing – Machine Learning – Analytics – Visualization
  • 4. Big Data Everywhere! • Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/ grocery stores – Bank/Credit Card transactions – Social Network Unknown Hidden Relationships within this Data !!!
  • 5.
  • 6. How much data? • Google processes 20 PB a day (2008) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • CERN’s Large Hydron Collider (LHC) generates 15 PB a year 640K ought to be enough for anybody.
  • 7. Type of Data • Relational Data (Tables/Transaction/Legacy Data) • Unstructured Text Data – Log data, Comments, User generated text • Semi-structured Data (XML) • Graph Data – Social Network, Semantic Web (RDF) • Real time Data – You can only scan the data once and need to do analytics quickly
  • 8. What does Big Data Give You? • Without Big Data – Many data warehouses that were separate and on non distributed architectures – Had to modify data structures and unique programming to merge databases together – Scaling database size is a continual problem – Any large scale analytics took days and weeks and large coordination effort within IT to get database accesses – Data analysis is a large effort and lots of data tend to remain unanalyzed or even worse not stored • With Big Data – Hadoop provides a single view of all databases that can be distributed – Database size is a non issue – Ability to perform advanced statistical analysis on very large datasets very quickly – Data analysis is the competitive edge for many companies since barriers of entry are continually dropping through the development of platforms
  • 9. Examples • Norwegian Food Safety Authority – accumulates data on all farm animals – birth, death, movements, medication, samples, ... • Hafslund – time series from hydroelectric dams, power prices, meters of individual customers, ... • Social Security Administration – data on individual cases, actions taken, outcomes... • Statoil – massive amounts of data from oil exploration, operations, logistics, engineering, ... • Retailers – see Target example above – also, connection between what people buy, weather forecast, logistics, ...
  • 11. Power of Distribution 45 Minutes! 4.5 Minutes!
  • 12. Outline • What is Big Data? • Why is this important now? • Key Concepts – Hadoop ,MapReduce – Storage, Processing – Machine Learning – Analytics – Visualization
  • 13. Hadoop • A framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model (I.e. MapReduce) – Distributed data processing – Works with structured and unstructured data – Open source – Master-slave architecture – Fault tolerant using commodity hardware
  • 14. MapReduce • Programming model on top of Hadoop • Basic concept is to provide a programming model that immediately supports parallel processing (SQL on the other hand does not natively encourage parallel processing) • Pig is a framework and programming language to develop MapReduce • Note – MapReduce is great for extremely large data sets with simple relations. SQL is great for medium size data sets but with complex relationships – I.e. you have to decide the right technology depending on your problem space
  • 15. A Simple Example • Counting words in a large set of documents map(string value) //key: document name //value: document contents for each word w in value EmitIntermediate(w, “1”); reduce(string key, iterator values) //key: word //values: list of counts int results = 0; for each v in values result += ParseInt(v); Emit(AsString(result));
  • 17. Outline • What is Big Data? • Why is this important now? • Key Concepts – Hadoop, MapReduce – Storage architecture – Machine Learning – Analytics – Visualization
  • 18. Machine Learning • Essentially ways to analyze data to extract valuable information with or without training data – Prediction • predicting a variable from data – Classification • assigning records to predefined groups – Clustering • splitting records into groups based on similarity – Association learning • seeing what often appears together with what – And many others….
  • 19.
  • 20. Now you have an optimization metric by which you can automate the exploration of all possible hypotheses ! Problems with this approach??
  • 21. Two kinds of learning 21 • Supervised – we have training data with correct answers – use training data to prepare the algorithm – then apply it to data without a correct answer • Unsupervised – no training data – throw data into the algorithm, hope it makes some kind of sense out of the data
  • 22. Example: Collaborative Filtering • Goal: predict what movies/books/… a person may be interested in, on the basis of – Past preferences of the person – Other people with similar past preferences – The preferences of such people for a new movie/book/… • One approach based on repeated clustering – Cluster people on the basis of preferences for movies – Then cluster movies on the basis of being liked by the same clusters of people – Again cluster people based on their preferences for (the newly created clusters of) movies – Repeat above till equilibrium • Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest 22
  • 23. Outline • What is Big Data? • Why is this important now? • Key Concepts – Hadoop, MapReduce – Storage architecture – Machine Learning – Analytics – Visualization
  • 24. Is this an effective visual representation?
  • 26. Diagrams Showing O-Ring Damage that was Used to Decide to Launch Challenger in 1987
  • 28. Strategies to Increase the Information Encoded by Spatial Position • Composition – Orthogonal placement of axes – Creates a 2D metric space
  • 29. Strategies to Increase the Information Encoded by Spatial Position • Alignment
  • 33. Conclusion • Big Data is a huge field that combines expertise from different domains in order to find interesting information from data • Extracting interesting information from data is the next competitive edge for many companies as information becomes available, instantly anywhere