SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
Iris data
analysis
example
Author: Do Thi Duyen
Overview: data analysis process
Iris setosa
Iris virginica
Iris versicolor
Iris flower data set
• Also called Fisher’s Iris data set or Anderson’s Iris data set
• Collected by Edgar Anderson and Gaspé Peninsula
• To quantify the morphologic variation of Iris flowers of three
related species
• >iris
Draw a hypothesis that you can test!
• Null hypothesis
• Alternative hypothesis
• P-value < 0.05
Get data!
Some ways to read data in R:
• read.table, read.csv, read.xls, data.frame,..
• edit,…
• …
=> Hint: Never modify your raw data file; always work on a
copy!
Exploration of data
Some basic function in R to examine iris
data:
>?iris
>names(iris)
>iris
>str(iris)
>iris$new_class_specis<-as.character(iris$Species)
>iris$new_class_specis<-NULL
>iris$Species <- gsub("%","",iris$Species))
>iris<-na.omit(iris)
Summarize and plot your data!
>summary(iris)
>plot(iris)
Scatter plot
Scatter plot
plot(iris, col=iris$Species)
legend(7,4.3,unique(iris$Species),col=1:length(iris$Species),p
ch=1)
Lattice library
Ggplot2 library
Box plot
Box plot
>par(mfrow=c(1,2))
> plot(iris$Petal.Length)
>boxplot(iris$Petal.Length~ iris$Species)
> par(mfrow=c(2,2)) # to draw four figs in one window
> for(i in 1:4) boxplot(iris[,i] ~ Species, data=iris,
main=names(iris)[i])
Histogram
Histogram vs Bar chart
Outlier
Outlier
Histogram
> par(mfrow=c(1,1))
>hist(iris$Petal.Length[1:50])
Subsetting:
>iris$Sepal.Length[1:50]
>iris$Sepal.Length[-(1:50)]
Select by name:
>iris$Sepal.Length[iris$Species == "setosa"]
Change the order of data frame:
>iris.ordered<-iris[order(iris$Sepal.Length),]
Build a statistical model!
Data mining:
• Predict:
• Classification
• Regression
• Deviation detection
• Descript:
• Clustering
• Association Rule Discovery
Analysis
Clustering
Clustering
• Principle: based on measure of distances
• Algorithms:
• Hierarchical clustering: bottom up, top down
• Centroid-based clustering: k-mean, PAM, CLARA, CLARANS,..
• Distribution-based clustering: STING, WAVECluster, CLIQUE,..
• Density-based clustering: DBSCANS, OPTICS, DENCLUE,..
• Model-based cluatering: statistical model + Neural network
• …
K-Mean
Classification
Clasification algorithms
• Linear classifiers: Fisher's linear discriminant analysis, Naive
Bayes classifier,..
• Support vector machines: Least squares support vector
• Quadratic classifiers
• Kernel estimation: k-nearest neighbor
• Boosting (meta-algorithm)
• Decision trees: Random forests
• …
Fisher's linear discriminant analysis (LDA)
Demo in R
Regression analysis
Y ≈ f(X, β)
Regression analysis
• Methods: Linear regression, Logistic regression, Poisson
regression
• Regression analysis is widely used for prediction and
forecasting
Predict model
Linear regression
Analysis simple linear regression in R
Demo in R
Estimate
• Clustering: hard, using: user examine, similarity measure,
classification algorithms, entropy, F-measure, pure,…
• Clasification: holdout, k-fold cross validation,…
• Regression: statistical hypothesis test
Report
References
• http://en.wikipedia.org/wiki/Iris_flower_data_set
• http://ykhoa.net/r/R/Chuong%2010.%20%20Phan%20tich%
20hoi%20qui%20tuyen%20tinh.pdf
• http://www.statsoft.com/Textbook/Elementary-Statistics-
Concepts
• http://bis.net.vn/forums/p/366/628.aspx
Q&A
Thank for listening

Weitere ähnliche Inhalte

Was ist angesagt?

Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Handling Missing Values for Machine Learning.pptx
Handling Missing Values for Machine Learning.pptxHandling Missing Values for Machine Learning.pptx
Handling Missing Values for Machine Learning.pptxShamimBhuiyan8
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms Hakky St
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionChristian Perone
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysisBangalore
 
Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to AutoencodersYan Xu
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluationeShikshak
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Data science in health care
Data science in health careData science in health care
Data science in health careChetan Khanzode
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine LearningUpekha Vandebona
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selectionchenhm
 
K means clustering
K means clusteringK means clustering
K means clusteringKuppusamy P
 

Was ist angesagt? (20)

Feature selection
Feature selectionFeature selection
Feature selection
 
Chapter8
Chapter8Chapter8
Chapter8
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Handling Missing Values for Machine Learning.pptx
Handling Missing Values for Machine Learning.pptxHandling Missing Values for Machine Learning.pptx
Handling Missing Values for Machine Learning.pptx
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introduction
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
 
Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to Autoencoders
 
Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Modelling and evaluation
Modelling and evaluationModelling and evaluation
Modelling and evaluation
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Data science in health care
Data science in health careData science in health care
Data science in health care
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
 
K means clustering
K means clusteringK means clustering
K means clustering
 

Ähnlich wie Iris data analysis example in R

Data science training in Hyderabad
Data science training in HyderabadData science training in Hyderabad
Data science training in HyderabadRajitha D
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in RSujaAldrin
 
managing big data
managing big datamanaging big data
managing big dataSuveeksha
 
Outlier Analysis.pdf
Outlier Analysis.pdfOutlier Analysis.pdf
Outlier Analysis.pdfH K Yoon
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analyticsDinakar nk
 
about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.MohammadMoreb
 
DataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxDataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxPrincePatel272012
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2extraganesh
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1meenas06
 
Chemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionChemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionValery Tkachenko
 
Analyzing Extended and Scientific Metadata for Scalable Index Designs
Analyzing Extended and Scientific Metadata for Scalable Index DesignsAnalyzing Extended and Scientific Metadata for Scalable Index Designs
Analyzing Extended and Scientific Metadata for Scalable Index DesignsAleatha Parker-Wood
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptxNaveen Kumar
 

Ähnlich wie Iris data analysis example in R (20)

Data science training in Hyderabad
Data science training in HyderabadData science training in Hyderabad
Data science training in Hyderabad
 
Datascience Training in Hyderabad
Datascience Training in HyderabadDatascience Training in Hyderabad
Datascience Training in Hyderabad
 
Machine Learning in R
Machine Learning in RMachine Learning in R
Machine Learning in R
 
managing big data
managing big datamanaging big data
managing big data
 
Outlier Analysis.pdf
Outlier Analysis.pdfOutlier Analysis.pdf
Outlier Analysis.pdf
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
Clustering.pptx
Clustering.pptxClustering.pptx
Clustering.pptx
 
Predictive analytics
Predictive analyticsPredictive analytics
Predictive analytics
 
about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
DataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxDataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptx
 
Machine Learning with R
Machine Learning with RMachine Learning with R
Machine Learning with R
 
Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2
 
Data preprocessing ppt1
Data preprocessing ppt1Data preprocessing ppt1
Data preprocessing ppt1
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Chemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collectionChemical workflows supporting automated research data collection
Chemical workflows supporting automated research data collection
 
Analyzing Extended and Scientific Metadata for Scalable Index Designs
Analyzing Extended and Scientific Metadata for Scalable Index DesignsAnalyzing Extended and Scientific Metadata for Scalable Index Designs
Analyzing Extended and Scientific Metadata for Scalable Index Designs
 
Data Mining
Data MiningData Mining
Data Mining
 
NoSQL - Cassandra & MongoDB.pptx
NoSQL -  Cassandra & MongoDB.pptxNoSQL -  Cassandra & MongoDB.pptx
NoSQL - Cassandra & MongoDB.pptx
 
KNN
KNN KNN
KNN
 

Kürzlich hochgeladen

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Kürzlich hochgeladen (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Iris data analysis example in R