Introduction to Data
Science and Analytics
Summer School 2015
Srinath Perera
VP Research
WSO2 Inc.
What is Data Science?
Extraction of
knowledge from large
volumes of data that are
structured or
unstructured.
It is a continuation of the
fields of data mining and
predictive analytics.
Data Science Pipeline
Example: Road.lk Traffic Feed
1. Data as tweets
2. Extract time,
location, and traffic
level using NLP
3. Explore data
4. Model based on
time and whether it is a
holiday
5. Predict traffic given
a time and location.
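The five steps above can be sketched in a few lines of Python. This is only an illustration: a regular expression stands in for the NLP step, the sample tweets are made up, the holiday flag is left out, and the "model" is simply the average traffic level per hour and location. The names parse_tweet and build_model are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Assumed mapping from words to a numeric traffic level.
LEVELS = {"light": 1, "moderate": 2, "heavy": 3}
PATTERN = re.compile(
    r"(?P<level>heavy|moderate|light) traffic (?:at|near|in) (?P<place>[\w ]+?)(?:[.,!]|$)",
    re.IGNORECASE,
)


def parse_tweet(text, posted_at):
    """Step 2: return (hour, location, level), or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    level = LEVELS[match.group("level").lower()]
    return posted_at.hour, match.group("place").strip(), level


def build_model(records):
    """Step 4: average traffic level for each (hour, location) pair."""
    grouped = defaultdict(list)
    for hour, place, level in records:
        grouped[(hour, place)].append(level)
    return {key: mean(levels) for key, levels in grouped.items()}


# Step 1: made-up tweets with their timestamps.
tweets = [
    ("Heavy traffic near Borella, avoid if possible", datetime(2015, 7, 1, 8, 15)),
    ("moderate traffic at Borella", datetime(2015, 7, 2, 8, 40)),
    ("Good morning Colombo!", datetime(2015, 7, 2, 9, 0)),
]
records = [r for r in (parse_tweet(t, ts) for t, ts in tweets) if r is not None]
model = build_model(records)

# Step 5: predict the traffic level for a given hour and location.
prediction = model.get((8, "Borella"))
```

A lookup table like this cannot predict for an hour or place it has never seen, which is one reason a real pipeline would fit a proper model instead.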
Data Cleanup
Real data is messy and often needs to be cleaned up before
it is useful.
o Bad formats - ignore or treat like missing data
o Missing data - extrapolate or remove the data line
o Useless variables - remove
o Wrong data - e.g. aaa, bbb, joe; some might be
deliberate lies, or 99 may be a code for N/A
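A minimal sketch of these cleanup rules, assuming each record is a dict of strings with a hypothetical "age" field. The set of missing-value codes, the 0-120 plausibility range, and the function names are assumptions made for the example.

```python
# Assumption: empty strings, "N/A" and 99 all mean "not available".
MISSING_CODES = {"", "N/A", "99"}


def clean_age(raw):
    """Return an int age, or None when the value is missing, malformed, or implausible."""
    value = raw.strip()
    if value in MISSING_CODES:
        return None
    try:
        age = int(value)
    except ValueError:  # bad format such as "aaa" or "joe"
        return None
    return age if 0 <= age <= 120 else None


def clean_rows(rows, keep=("name", "age")):
    """Drop lines without a usable age and keep only the useful variables."""
    cleaned = []
    for row in rows:
        age = clean_age(row.get("age", ""))
        if age is None:
            continue  # remove the data line
        record = {key: row[key] for key in keep if key in row}
        record["age"] = age
        cleaned.append(record)
    return cleaned


rows = [
    {"name": "a", "age": "34", "session": "x"},
    {"name": "b", "age": "99"},
    {"name": "c", "age": "joe"},
    {"name": "d", "age": " 41 "},
]
cleaned = clean_rows(rows)
```

Here the lines are removed; filling in a missing value instead would be the other option named on the slide.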
Data Cleanup (Contd.)
o Transform variables (date formats, string to int)
o Create derived variables
  o Derive country from IP
  o Derive age from ID card number
o Normalize strings
  o e.g. stem or use phonetic sounds
  o Different spellings and nicknames (William -> Bill)
o Feature value rescaling (e.g. most ML algorithms
need values rescaled to the 0-1 range)
o Enrich (e.g. look up and add age from profile)
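Three of these transformations are easy to show with the standard library: parsing mixed date formats, normalizing nicknames, and min-max rescaling to the 0-1 range. The accepted date formats and the nickname table are assumptions for the example, not a complete list.

```python
from datetime import date, datetime

# Illustrative nickname table only; a real one would be much larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def normalize_name(name):
    """Lower-case, trim, and map known nicknames to one canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def rescale(values):
    """Min-max rescaling: the smallest value maps to 0.0, the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]
```

For example, rescale([10, 20, 30]) gives [0.0, 0.5, 1.0], and both "2015-07-01" and "01/07/2015" parse to the same date.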
Data Exploration
Understand the data, and get a feel for what is expected
(models => densities, constraints) and what is
unexpected / residual (errors, outliers)
o Think about what this data is about: domain, background,
how it is collected, what each field means, and its range
of values.
o head, tail, count, all descriptives (mean, max, median,
percentiles, ...) - five-number summary plus mean: Min, 1st Qu.,
Median, Mean, 3rd Qu., Max.
o Run a bunch of count/group-by statements to gauge
whether the data is corrupt.
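The descriptive statistics and the group-by counts can be sketched with Python's standard library. The helper names are hypothetical, and the quartiles use the "inclusive" method of statistics.quantiles; other tools may interpolate slightly differently.

```python
from collections import Counter
from statistics import mean, quantiles


def summary(values):
    """Five-number summary plus the mean."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": mean(data),
        "q3": q3,
        "max": data[-1],
    }


def count_by(rows, field):
    """Group-by count: how many rows carry each value of `field`."""
    return Counter(row[field] for row in rows)


stats = summary(range(1, 10))
cities = count_by(
    [{"city": "Colombo"}, {"city": "Kandy"}, {"city": "Colombo"}], "city"
)
```

A value that appears far more or far less often than expected in count_by is often the first hint that a field is corrupt.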
Data Exploration (Contd.)
o Plot - take a random sample and explore (scatter plot)
o e.g. draw a scatter plot or Trellis plot
o Find dependencies between fields
o Calculate correlation
o Dimensionality reduction
o Cluster and visualize the clusters
o Look at the frequency distribution of each field and try to
find a known distribution if possible.
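As one way to "calculate correlation" between two fields, here is a small Pearson correlation function written out by hand so the formula is visible. The name pearson is hypothetical; the result is +1 for a perfect increasing linear relationship and -1 for a perfect decreasing one.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return covariance / (spread_x * spread_y)
```

Pearson correlation only captures linear dependence, so a scatter plot is still worth drawing even when the coefficient is close to zero.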
Feature Engineering
o Feature engineering is the art of finding the features that lead to
the simplest decision algorithm. (Good features allow a
simple model to beat a complex model.)
o The best features may be a subset, a combination, or a
transformed version of the original features.
How to do Feature Engineering?
o Picked manually by domain experts, with trial and error.
o Search the possible combinations by training and
combining subsets (e.g. Random Forest)
o Use statistical concepts like correlation and
information criteria
o Reduce the features to a low dimension space using
techniques like PCA.
o Automatic feature learning through deep learning
o ...
Analysis
o The goal of analysis is to extract knowledge
o This knowledge usually comes in one of two forms
  o KPIs (Key Performance Indicators)
    ■ Describe the key measurements of what is being
    tracked (e.g. revenue per year, profit margin,
    revenue per sq ft in retail, revenue per employee)
  o Models to describe or predict the data
    ■ e.g. machine learning models or statistical models
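The KPIs named above are simple ratios. A short sketch, with a hypothetical function name and invented figures, makes the arithmetic explicit:

```python
def kpis(revenue, cost, floor_area_sqft, employees):
    """Compute a few retail KPIs from yearly totals."""
    return {
        "profit_margin": (revenue - cost) / revenue,
        "revenue_per_sqft": revenue / floor_area_sqft,
        "revenue_per_employee": revenue / employees,
    }


# Invented yearly figures for a single store.
store = kpis(revenue=1_000_000, cost=800_000, floor_area_sqft=5_000, employees=20)
```

With these figures the profit margin is 20%, revenue per sq ft is 200, and revenue per employee is 50,000.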
Four analysis types by time to decision
o Hindsight (what happened?)
  o Done using batch analytics like MapReduce
o Oversight (what is happening?)
  o Done using realtime analytics technologies like CEP
o Insight (why are things happening?)
  o Done with data mining and unsupervised learning
  algorithms like clustering
o Foresight (what will happen?)
  o Done by building models using machine learning or
  other techniques
Data Analytics Tools Landscape
Batch Analytics: SparkSQL
Realtime Analytics: Complex Event
Processing
Interactive Analytics
o Define indexes on collected
data (streams)
o Issue dynamic queries and get
results right away
o Shows multiple events from the
same activity together using
custom-defined activity IDs
o Useful for data exploration
o Powered by Apache Lucene,
with support for index
sharding
Predictive Analytics
o Build models and use them
with WSO2 CEP, BAM and
ESB using the WSO2 Machine
Learner product (2015 Q3)
o Build models using R, export
them as PMML, and use them
within WSO2 CEP
WSO2 Machine Learner
o Sample, explore, and
understand data
through visualizations
o A wizard to configure,
train machine learning
models, and select the
best model
o Find and use those
models with WSO2
CEP, BAM and ESB
o Powered by Apache
Spark MLLib
Building Decision Models
A model describes how a system behaves when its inputs
change. There are many ways to build models;
see https://icrunchdatanews.com/what-are-predictive-models/
o Regression models and ML models
o Time series models
o Statistical models
o Physical models - based on physical
phenomena. They include 6-DoF flight
models, space flight models, and weather
models.
o Mathematical models
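As a small instance of the first kind of model in this list, here is a one-variable linear regression fitted by ordinary least squares. It is a sketch with hypothetical names (fit_line, predict); real work would normally use a statistics or ML library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Describe how the output changes when the input changes."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
forecast = predict(model, 4)
```

The sample points lie exactly on y = 2x + 1, so the fit recovers slope 2 and intercept 1 and forecasts 9 for x = 4.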
Verification
o All is good, now you have a
model. You must verify that it
is correct before using it in the
real world.
o Predictions can be verified by
waiting for events to occur
o Relationships like causality (e.g.
having free shipping leads a
customer to buy more) must be
verified with A/B testing
o Let’s look at a few pitfalls
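One common way to read the result of an A/B test such as the free-shipping example is a two-proportion z-test. The sketch below is an assumption about how such a check could look, not the method used in the talk, and the counts are invented: group A saw no free shipping, group B did.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion z-test; returns (z score, p-value)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented counts: 100 of 1000 buyers without free shipping, 150 of 1000 with it.
z, p_value = ab_test(100, 1000, 150, 1000)
```

A small p-value says the difference between the two groups is unlikely to be chance; because customers were assigned to groups at random, it can then be attributed to free shipping.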
Pitfalls: Experiment vs Observation
o If you follow the scientific method,
you would do experiments, and
they have control sets (A/B tests).
o Big data does not have a control set;
it is rather observation (we
observe the world as it happens).
o So what we can conclude is limited.
o Correlation does not imply
causality!
o "Send a book home" example [1]
o All big buyers have free
shipping
Causality: What can we do?
o Option 1: We can act on
correlation if we can verify the
guess or if correctness is not
critical (start an investigation,
check for a disease, marketing)
o Option 2: We verify correlations
using A/B testing or propensity
analysis
Pitfalls: Think about the Missing Data
o World War II: returned aircraft and data on where they
were hit
o How would you add armour?
Abraham Wald
Source: http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, picture from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
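The aircraft story can be made concrete with a toy dataset. Every number below is invented for illustration: each record is (section hit, whether the plane returned). Counting hits on returned planes alone points at the wings, while the loss rate, which needs the missing planes as well, points at the engine.

```python
from collections import Counter

# Invented sorties: (section hit, returned?)
sorties = (
    [("wings", True)] * 6
    + [("fuselage", True)] * 5
    + [("engine", True)] * 1
    + [("engine", False)] * 8
    + [("wings", False)] * 1
)


def observed_hits(records):
    """What the analyst sees: hits counted only on planes that came back."""
    return Counter(section for section, returned in records if returned)


def loss_rate(records):
    """Share of planes hit in each section that did not return."""
    hits = Counter(section for section, _ in records)
    lost = Counter(section for section, returned in records if not returned)
    return {section: lost[section] / hits[section] for section in hits}


seen = observed_hits(sorties)
most_hit_on_survivors = seen.most_common(1)[0][0]
rates = loss_rate(sorties)
most_dangerous = max(rates, key=rates.get)
```

Here most_hit_on_survivors is "wings" but most_dangerous is "engine": the planes hit in the engine are mostly missing from the observed data, which is why the armour belongs there.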
Communicate: Dashboards
o Dashboards give an “overall idea” at a glance (e.g. a car
dashboard)
o Support for personalization: you can build your own dashboard.
o Also the entry point for drill-down
o How to build?
  o WSO2 DAS supports a gadget generation wizard
  o Or you can write your own gadgets using D3 and JavaScript.
Communicate: Alerts
o Detecting conditions can
be done via CEP queries.
The key is the “last mile”:
  o Email
  o SMS
  o Push notifications to a
  UI
  o Pager
  o Trigger a physical alarm
o How?
  o Select the email sender “Output Adaptor” from CEP, or
  send from CEP to ESB, which has a lot of
  connectors
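The idea of detecting a condition over a stream and then handing off to a "last mile" channel can be sketched in plain Python. This is not CEP query syntax; it is a stand-in where a sliding window plays the role of the query and the notify callback stands for email, SMS, or any other channel. The class name and thresholds are made up.

```python
from collections import deque


class ThresholdAlert:
    """Fire `notify` when the average of the last `window` readings exceeds `limit`."""

    def __init__(self, window, limit, notify):
        self.readings = deque(maxlen=window)  # sliding window of recent values
        self.limit = limit
        self.notify = notify  # last-mile callback, e.g. an email sender

    def on_event(self, value):
        """Feed one reading; return True if an alert was sent."""
        self.readings.append(value)
        if len(self.readings) < self.readings.maxlen:
            return False  # not enough data yet
        average = sum(self.readings) / len(self.readings)
        if average > self.limit:
            self.notify(f"average {average:.1f} exceeded {self.limit}")
            return True
        return False


sent = []  # a list's append method stands in for a real notification channel
alert = ThresholdAlert(window=3, limit=80, notify=sent.append)
fired = [alert.on_event(v) for v in [70, 75, 90, 95, 60]]
```

Note that the alert fires on every event while the condition holds; a production setup would usually suppress repeats so that people are not flooded.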
Communicate: APIs
o With mobile apps, most data
is exposed and shared as
APIs (REST/JSON) to end
users.
o Need to expose analytics
results as APIs
o Following are some challenges
  o Security and permissions
  o API discovery, billing,
  throttling, quotas & SLA
o How?
  o Write data to a database from CEP event tables
  o Build services via WSO2 Data Service
  o Expose them as APIs via API Manager
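To give a feel for the throttling and quota challenge, here is a minimal fixed-window quota check. It is a sketch of the general idea only, not how any particular API manager implements it; the class name is hypothetical, and the clock is injectable so the behaviour can be tested without waiting.

```python
import time


class QuotaThrottle:
    """Allow at most `limit` calls per client in each window of `window_s` seconds."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counters = {}  # client id -> (window start time, calls used)

    def allow(self, client_id):
        """Return True and count the call, or False if the quota is used up."""
        now = self.clock()
        start, used = self.counters.get(client_id, (now, 0))
        if now - start >= self.window_s:
            start, used = now, 0  # a new window begins
        if used >= self.limit:
            self.counters[client_id] = (start, used)
            return False
        self.counters[client_id] = (start, used + 1)
        return True


# Usage with a fake clock: two calls per minute for each client.
fake_now = [0.0]
throttle = QuotaThrottle(limit=2, window_s=60, clock=lambda: fake_now[0])
first_three = [throttle.allow("app-1") for _ in range(3)]
fake_now[0] = 61.0
after_reset = throttle.allow("app-1")
```

Fixed windows are easy to reason about but allow bursts at window boundaries; token-bucket schemes are a common alternative.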
Communicate: Realtime Soccer Analytics
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
Data Science Pipeline
Conclusion
o Data science is extracting
knowledge by analyzing
data
o We discussed the pipeline and
the tools you can use to do
that
o The rest of the summer school
will look at different
aspects in detail.
o All tools discussed are
available free under the
Apache License.

 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

Introduction to Data Science and Analytics

  • 1. Introduction to Data Science and Analytics. Summer School 2015. Srinath Perera, VP Research, WSO2 Inc.
  • 2. What is Data Science? Extraction of knowledge from large volumes of data that are structured or unstructured. It is a continuation of the fields of data mining and predictive analytics.
  • 4. Example (Road.lk) traffic feed
      1. Data arrives as tweets
      2. Extract time, location, and traffic level using NLP
      3. Explore the data
      4. Build a model based on the time and whether it is a holiday
      5. Predict traffic given a time and location
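Step 2 of the example above can be sketched as follows. This is only an illustration: it swaps a regular expression in for real NLP, and the tweet wording, the `TWEET_PATTERN` name, and the `extract_traffic` helper are all assumptions, not the Road.lk implementation.

```python
import re

# Hypothetical tweet shape: "<level> traffic at|near <location> at <HH:MM>".
# Real tweets are far messier; a regex is a stand-in for proper NLP here.
TWEET_PATTERN = re.compile(
    r"(?P<level>heavy|moderate|light) traffic (?:at|near) "
    r"(?P<location>[A-Za-z ]+?) at (?P<time>\d{1,2}:\d{2})",
    re.IGNORECASE,
)

def extract_traffic(tweet):
    """Return level, location, and time from a tweet, or None if nothing matches."""
    match = TWEET_PATTERN.search(tweet)
    if match is None:
        return None
    return {
        "level": match.group("level").lower(),
        "location": match.group("location").strip(),
        "time": match.group("time"),
    }

record = extract_traffic("Heavy traffic near Galle Road at 8:30 this morning")
print(record)
```

Tweets that do not match return `None`, so they can be handled like missing data in the cleanup step.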
  • 5. Data Cleanup: Real data is messy and often needs to be cleaned up before it is useful.
      o Bad formats - ignore, or treat as missing data
      o Missing data - extrapolate, or remove the data line
      o Useless variables - remove
      o Wrong data - e.g. aaa, bbb, joe; some values might be deliberate lies, or 99 may be a code for N/A
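A minimal sketch of these cleanup rules, assuming a hypothetical list of survey rows in which the age arrives as a string and 99 is used as an N/A code. The field names and the `parse_age` helper are invented for illustration.

```python
# Hypothetical raw rows; the field names and the 99 sentinel are assumptions.
raw_rows = [
    {"name": "alice", "age": "34"},
    {"name": "bob", "age": "aaa"},   # bad format
    {"name": "carol", "age": "99"},  # 99 assumed to be a code for N/A
    {"name": "dave", "age": ""},     # missing value
    {"name": "erin", "age": "27"},
]

NA_CODES = {99}

def parse_age(text):
    """Return the age as an int, or None for bad, missing, or N/A-coded values."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        return None  # bad formats are treated like missing data
    return None if value in NA_CODES else value

# Convert every row, then drop the rows whose age could not be recovered.
parsed = [{**row, "age": parse_age(row["age"])} for row in raw_rows]
clean_rows = [row for row in parsed if row["age"] is not None]
print(clean_rows)
```

Dropping rows is the simplest policy; extrapolating a value for the `None` entries is the alternative the slide mentions.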
  • 6. Data Cleanup (Contd.)
      o Transform variables (date formats, String to int)
      o Create derived variables
          o Derive country from IP
          o Derive age from ID card number
      o Normalize strings
          o e.g. stem or use phonetic sounds
          o Different spellings and nicknames (William -> Bill)
      o Feature value rescaling (e.g. many ML algorithms need values to be rescaled to the 0-1 range)
      o Enrich (e.g. look up and add age from profile)
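Two of the transformations above, sketched under assumptions: dates are assumed to arrive in one of two hypothetical formats, and rescaling is done with plain min-max scaling. `DATE_FORMATS`, `parse_date`, and `min_max_rescale` are illustrative names.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats

def parse_date(text):
    """Try each known format; return a date, or None if none of them fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def min_max_rescale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000]
scaled = min_max_rescale(incomes)
print(parse_date("01/07/2015"), scaled)
```

Min-max scaling is sensitive to outliers, so it is usually applied after the wrong-data cleanup described earlier.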
  • 7. Data Exploration: Understand, and get a feel for, what is expected (models => densities, constraints) and what is unexpected/residual (errors, outliers).
      o Think about what the data is about: domain, background, how it was collected, what each field means, and its range of values.
      o head, tail, count, and all descriptives (mean, max, median, percentiles ...) - five-number summary plus mean: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
      o Run a bunch of count/group-by statements to gauge whether the data looks corrupt.
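The descriptive statistics and group-by counts above can be sketched with the Python standard library. The sample values and the `five_number_summary` helper are made up for illustration; the quartiles use the "inclusive" method of `statistics.quantiles`, and other tools may interpolate differently.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Min, quartiles, and max of a numeric sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Hypothetical sample of travel times in minutes.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
summary = five_number_summary(sample)
print(summary, "mean:", statistics.mean(sample))

# A group-by count: unexpected categories often reveal corrupt rows.
levels = ["heavy", "light", "heavy", "hevy", "light", "heavy"]
counts = Counter(levels)
print(counts.most_common())
```

In the count output, the single "hevy" entry stands out as a likely typo rather than a real category.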
  • 8. Data Exploration (Contd.)
      o Plot - take a random sample and explore (scatter plot)
          o e.g. draw a scatter plot or Trellis plot
      o Find dependencies between fields
          o Calculate correlation
      o Dimensionality reduction
      o Cluster and visualize the clusters
      o Look at the frequency distribution of each field and try to find a known distribution if possible.
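As a sketch of the "calculate correlation" step, the Pearson correlation coefficient between two fields can be computed directly from its definition. The `pearson` helper and the sample numbers are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical fields: hour of day and number of traffic reports.
hours = [1, 2, 3, 4, 5]
reports = [2, 4, 6, 8, 10]
r = pearson(hours, reports)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear dependency; a value near 0 only rules out a linear one, so plotting the fields is still worthwhile.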
  • 10. Feature Engineering
      o Feature engineering is the art of finding the features that lead to the simplest decision algorithm. (Good features allow a simple model to beat a complex model.)
      o The best features may be a subset, a combination, or a transformed version of the original features.
  • 11. How to do Feature Engineering?
      o Manual selection by domain experts, plus trial and error
      o Search the possible combinations by training on and combining subsets (e.g. Random Forest)
      o Use statistical concepts like correlation and information criteria
      o Reduce the features to a low-dimensional space using techniques like PCA
      o Automatic feature learning through Deep Learning
      o ...
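One way to score candidate features statistically is information gain, sketched below. Note the substitution: information gain is an entropy-based measure, not an information criterion such as AIC or BIC, and it is used here only because it fits in a few lines. The feature names and observations are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical observations: does a feature help predict heavy traffic?
is_holiday = [True, True, False, False, False, False]
day_parity = ["odd", "even", "odd", "even", "odd", "even"]
heavy_traffic = [False, False, True, True, True, True]

gain_holiday = information_gain(is_holiday, heavy_traffic)
gain_parity = information_gain(day_parity, heavy_traffic)
print(round(gain_holiday, 3), round(gain_parity, 3))
```

In this toy data the holiday flag separates the labels perfectly while the parity of the day carries no information, so a ranking by gain would keep the first feature and drop the second.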
  • 12. Analysis
      o The goal of analysis is to extract knowledge
      o This knowledge usually comes in one of two forms
          o KPIs (Key Performance Indicators)
              ■ Describe key measurements for what is being measured (e.g. revenue per year, profit margin, revenue per sq ft in retail, revenue per employee)
          o Models to describe or predict the data
              ■ e.g. Machine Learning models or statistical models
  • 13. 4 Analysis types by time to decision
      o Hindsight (what happened?)
          o Done using batch analytics like MapReduce
      o Oversight (what is happening?)
          o Done using realtime analytics technologies like CEP
      o Insight (why are things happening?)
          o Done with data mining and unsupervised learning algorithms like clustering
      o Foresight (what will happen?)
          o Done by building models using machine learning or other techniques
  • 14. Data Analytics Tools Landscape
  • 17. Realtime Analytics: Complex Event Processing
  • 18. Interactive Analytics
      o Define indexes on collected data (streams)
      o Issue dynamic queries and get results right away
      o Shows multiple events from the same activity together using custom-defined activity IDs
      o Useful for data exploration
      o Powered by Apache Lucene, with support for index sharding
  • 19. Predictive Analytics
      o Build models and use them with WSO2 CEP, BAM and ESB using the WSO2 Machine Learner product (2015 Q3)
      o Build models using R, export them as PMML, and use them within WSO2 CEP
  • 20. WSO2 Machine Learner
      o Sample, explore, and understand data through visualizations
      o A wizard to configure and train machine learning models, and select the best model
      o Find and use those models with WSO2 CEP, BAM and ESB
      o Powered by Apache Spark MLlib
  • 21. Building Decision Models: A model describes how a system behaves when its inputs change. There are many ways to build models; see https://icrunchdatanews.com/what-are-predictive-models/
      o Regression models and ML models
      o Time series models
      o Statistical models
      o Physical models - based on physical phenomena. They include 6-DoF flight models, space flight models, and weather models.
      o Mathematical models
  • 22. Verification
      o All is good; now you have a model. You must verify that it is correct before using it in the real world.
      o Predictions can be verified by waiting for the events to occur
      o Relationships like causality (e.g. free shipping leads a customer to buy more) must be verified with A/B testing
      o Let’s look at a few of the pitfalls
  • 23. Pitfalls: Experiment vs Observation
      o If you follow the scientific method, you do experiments, and they have control sets (A/B tests).
      o Big data does not have a control set; it consists of observations (we observe the world as it happens).
      o So what we can conclude is limited.
      o Correlation does not imply causality!!
          o "Send a book home" example [1]
          o All big buyers have free shipping
  • 24. Causality: What can we do?
      o Option 1: We can act on a correlation if we can verify the guess or if correctness is not critical (start an investigation, check for a disease, marketing)
      o Option 2: We verify correlations using A/B testing or propensity analysis
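For Option 2, a common way to read an A/B test on conversion counts is a two-proportion z-test. The sketch below is one possible analysis, not something the slides prescribe; the visitor and purchase counts are invented, and `two_proportion_z_test` is an illustrative helper.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: group A pays for shipping, group B gets it free.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Because visitors are assigned to the groups at random, a small p-value here supports a causal reading in a way that the purely observational correlation "all big buyers have free shipping" cannot.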
  • 25. Pitfalls: Think about the Missing Data
      o WW II: returned aircraft and data on where they were hit
      o How would you add armour? (Abraham Wald)
      Source: http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, picture from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
  • 26. Communicate: Dashboards
      o A dashboard gives an “overall idea” at a glance (e.g. a car dashboard)
      o Support for personalization: you can build your own dashboard
      o Also the entry point for drill-down
      o How to build?
          o WSO2 DAS supports a gadget generation wizard
          o Or you can write your own gadgets using D3 and JavaScript
  • 27. Communicate: Alerts
      o Detecting conditions can be done via CEP queries. The key is the “last mile”:
          o Email
          o SMS
          o Push notifications to a UI
          o Pager
          o Trigger a physical alarm
      o How?
          o Select the Email sender “Output Adaptor” from CEP, or send from CEP to ESB; the ESB has a lot of connectors
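The detect-then-notify split can be illustrated in plain Python. This is a stand-in for the idea only, not CEP query syntax or a WSO2 API: a sliding-window average plays the role of the condition, and the `notify` callback plays the role of the "last mile" channel. All names and readings are hypothetical.

```python
from collections import deque

def detect_alerts(readings, threshold, window, notify):
    """Call notify() whenever the average of the last `window` values exceeds threshold."""
    recent = deque(maxlen=window)  # keeps only the newest `window` values
    for timestamp, value in readings:
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                notify(f"average {average:.1f} over last {window} readings at t={timestamp}")

# The delivery channel is pluggable; here alerts are just collected in a list.
sent = []
readings = [(1, 10), (2, 12), (3, 50), (4, 60), (5, 11)]
detect_alerts(readings, threshold=30, window=2, notify=sent.append)
print(sent)
```

Swapping `sent.append` for a function that sends an email or SMS changes the delivery channel without touching the detection logic.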
  • 28. Communicate: APIs
      o With mobile apps, most data is exposed and shared with end users as APIs (REST/JSON).
      o Analytics results need to be exposed as APIs
      o Following are some challenges
          o Security and permissions
          o API discovery, billing, throttling, quotas & SLA
      o How?
          o Write data to a database from CEP event tables
          o Build services via WSO2 Data Service
          o Expose them as APIs via API Manager
  • 29. Communicate: Realtime Soccer Analytics. Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
  • 31. Conclusion
      o Data Science is extracting knowledge by analyzing data
      o Discussed the pipeline and the tools you can use to do that
      o The rest of the summer school will look at different aspects in detail
      o All tools discussed are available free under the Apache Licence