Handling Missing Attributes using Matrix Factorization
Övünç Bozcan
Software Research Lab
Dept. of Computer Engineering
Boğaziçi University
Istanbul, Turkey
ovuncbozcan@gmail.com
Ayşe Başar Bener
Data Science Lab
Mechanical and Industrial Engineering
Ryerson University
Toronto, Canada
ayse.bener@ryerson.ca
Outline	
•  Introduction
•  Related Work
•  Matrix Factorization
•  Experiment
•  Results
•  Conclusion
Introduction	
Ø  Software defect prediction models reveal defect
prone parts of the software to guide managers in
allocating testing resources efficiently
Ø  Popular studies
Ø  Estimate number of defects remaining in software systems
Ø  Discover defect associations
Ø  Classify defect-proneness of software components into two classes,
defect-prone and not defect-prone
Ø  Metrics
Ø  Static code
Ø  History
Ø  Social
Introduction	
Ø  Numerous defect prediction research in the last 40
years
Ø  Statistical techniques with machine learning algorithms are adopted
Ø  Nagappan et al., Ostrand et al., Zimmermann et al., Fenton et al.,
Khoshgoftaar et al.
Ø  Benchmarking studies
Ø  Lessmann et al. and Menzies et al.
Ø  Systematic literature surveys
Ø  Hall et al.
Ø  Industrial case studies
Ø  Tosun et al.
Introduction	
Ø  Major challenges in building defect prediction models:
Ø  High dimensionality of software defect data
Ø  The number of available software metrics is too large for a classifier to work
Ø  Skewed, imbalanced data sets
Ø  Proportion of one of the classes is quite larger than the proportion of the
other class.
Ø  Performance limitations
Ø  Limited information content
Ø  Performance ceiling effect
Ø  Incomplete datasets
Ø  Features of the train set may differ from the features of test set
Ø  Some of the test set attributes may be missing
Ø  There may be extra attributes in test sets
Ø  Building model with several datasets.
Ø  Different datasets may have different attribute sets.
Introduction	
Ø  Missing value pattern may be in
different forms:
Ø  Data may be missing at individual points
Ø  Some attribute values may be considered as
outliers. Data may be missing in chunks
Ø  You may want to build your model with several
datasets and the attributes of these datasets
may differ.
Ø  When these datasets are concatenated, there
will probably be missing chunks.
Ø  Solution might be:
Ø  To use the largest common attribute set OR
Ø  To introduce imputation to the missing
attributes
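The two options can be illustrated with a minimal sketch. The datasets, attribute names and helper functions below are hypothetical and are not taken from the study. The sketch only shows how concatenating datasets with different attribute sets either shrinks the attribute set or produces missing chunks.

```python
def common_attributes(datasets):
    """Attributes present in every dataset (the largest common attribute set)."""
    names = [set(rows[0]) for rows in datasets]
    return set.intersection(*names)


def concatenate(datasets, attributes):
    """Stack the rows of all datasets, keeping `attributes`; absent values become None."""
    return [{a: row.get(a) for a in sorted(attributes)}
            for rows in datasets for row in rows]


# Hypothetical toy data: project A has a churn metric, project B a social metric.
project_a = [{"loc": 120, "churn": 7}, {"loc": 45, "churn": 2}]
project_b = [{"loc": 300, "commenters": 4}]

common = common_attributes([project_a, project_b])
union = set().union(*(row.keys() for rows in (project_a, project_b) for row in rows))

small = concatenate([project_a, project_b], common)  # complete, but only "loc" survives
wide = concatenate([project_a, project_b], union)    # all attributes, with missing chunks
```

In `wide`, every row of project A lacks `commenters` and every row of project B lacks `churn`, which is the chunk-wise pattern described above.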
Proposed Solution
Matrix Factorization is a solution to the data scarcity
problem in recommendation systems
Related Work
•  Recommendation systems
o  Netflix Prize competition
•  Koren, Bell, and Volinsky
o  Collective Matrix Factorization
•  Singh et al. and Lippert et al.
Matrix Factorization
•  Netflix competition
o  Matrix Factorization models are superior to classical nearest-neighbor techniques, as they allow the incorporation of additional information and offer scalable predictive accuracy (Bell et al.)
•  Matrix factorization decomposes a large matrix into two smaller matrices called factors.
•  The factors are multiplied to approximate the original matrix.
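A minimal sketch of this idea, assuming a regularized squared-error objective minimized by stochastic gradient descent over the observed cells only. The function names, the hyperparameters (`rank`, `lr`, `reg`, `steps`) and the toy matrix are illustrative assumptions, not the implementation used in the study.

```python
import random


def factorize(matrix, rank=2, steps=3000, lr=0.01, reg=0.02, seed=0):
    """Factor `matrix` (rows of numbers, None = missing) into P (n x rank) and Q (m x rank)."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    P = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0, 1) for _ in range(rank)] for _ in range(m)]
    observed = [(i, j) for i in range(n) for j in range(m) if matrix[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = matrix[i][j] - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the squared error plus an L2 penalty.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def reconstruct(P, Q):
    """Multiply the factors back into a full n x m matrix."""
    return [[sum(p * q for p, q in zip(row, col)) for col in Q] for row in P]


# Hypothetical 5 x 4 matrix with missing cells (None).
R = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
P, Q = factorize(R)
approx = reconstruct(P, Q)  # every cell now has a value, including the missing ones
```

Because only observed cells contribute to the updates, the missing cells never need to be filled in beforehand; their estimates come from the product of the learned factors.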
Matrix Factorization
•  Nonnegative MF Algorithms (Berry et al.)
o  Multiplicative update algorithms
o  Gradient descent algorithms
•  Easiest to implement and to scale
o  Alternating least squares algorithms
•  Multi Relational Matrix Factorization by Lippert et al.
o  Low-norm Matrix Factorization based on gradient descent algorithm
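As an illustration of the first family, the sketch below applies the widely used multiplicative update rules for the squared-error objective to a fully observed nonnegative matrix. It is a generic textbook-style sketch, not the algorithm used in the study; the helper names, the `eps` guard against division by zero and the toy matrix are assumptions.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5


def nmf_multiplicative(V, rank, iterations=200, eps=1e-9, seed=0):
    """Approximate nonnegative V (n x m) by W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical nonnegative toy matrix.
V = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [1, 1, 1]]
W, H = nmf_multiplicative(V, rank=2)
error = frobenius_error(V, W, H)
```

The updates only multiply nonnegative quantities, so the factors stay nonnegative without any explicit projection step.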
Experiment
Datasets       Static Code Metrics   Churn Metrics   Social Metrics   Instances   Defective %
Android        106                   15              25               12981       6.4
Linux Kernel   106                   15              25               14801       5.5
Perl           106                   15              25               125         61.6
VLC            106                   15              25               936         39.2
Datasets
•  Android
o  Open source operating system designed for mobile devices
•  Linux Kernel
o  Open source operating system
•  Perl
o  Stable, cross-platform, open source interpreted language
•  VLC
o  Open source multimedia player
Experiment	
Performance Measures
o  Pd
o  Pf
o  Balance
Learning Algorithms
o  Naive Bayes
o  Matrix Factorization
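The slides name the three measures without defining them. The sketch below assumes the definitions commonly used in the defect prediction literature: pd is the probability of detection (recall), pf is the probability of false alarm, and balance is one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1). The function name and the example labels are hypothetical.

```python
import math


def pd_pf_balance(actual, predicted):
    """Return (pd, pf, balance) for binary labels where 1 = defect-prone."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0  # share of defective modules found
    pf = fp / (fp + tn) if fp + tn else 0.0  # share of clean modules flagged
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance


# Hypothetical example: 10 defective and 10 clean modules.
actual = [1] * 10 + [0] * 10
predicted = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9
pd, pf, balance = pd_pf_balance(actual, predicted)
```

Under these definitions a perfect classifier has pd = 1, pf = 0 and balance = 1, so higher balance values are better.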
Experiment	
Experiment 1
•  The performance of the Naive Bayes algorithm is explored
•  10 times 10-fold cross validation is run while gradually removing attributes from the datasets
•  Attributes are removed according to their correlation with the class attribute
•  Pearson correlation is used
•  4 (datasets) x 10 (removal steps) x 10 (repetitions) x 10 (folds) = 4000 Naive Bayes prediction models are built
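The attribute removal procedure can be sketched as follows. The slides do not state the removal order, so the sketch assumes that the attributes least correlated with the class are dropped first; the function names and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_by_class_correlation(rows, labels):
    """Attribute indices ordered from strongest to weakest |correlation| with the class."""
    columns = list(zip(*rows))
    scores = [abs(pearson(col, labels)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


def removal_steps(ranking, steps):
    """Yield the attributes kept at each step, dropping the weakest first (assumption)."""
    total = len(ranking)
    for step in range(steps):
        keep = total - (total * step) // steps
        yield ranking[:keep]


# Hypothetical data: 4 instances, 3 attributes, binary class label.
rows = [[0, 5, 3], [1, 4, 1], [0, 6, 2], [1, 5, 0]]
labels = [0, 1, 0, 1]
ranking = rank_by_class_correlation(rows, labels)
subsets = list(removal_steps(ranking, steps=3))
```

Each subset in `subsets` would then be evaluated with repeated cross validation, which is where the model count on the slide comes from.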
Experiment 2
•  The performances of Naive Bayes with imputation and Matrix Factorization are compared
•  Attributes are chosen according to their correlation with the class attribute
o  Pearson correlation is used
•  The imputation or removal procedure is applied to the chosen attributes in increasing proportions
•  4 (datasets) x 10 (attribute selection steps) x 10 (imputation steps) x 10 (folds) = 4000 Naive Bayes and Matrix Factorization models are built
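A minimal sketch of the imputation baseline, assuming that missingness is introduced by blanking the chosen attributes in a randomly selected proportion of instances and that gaps are filled with the column mean of the observed values. The slides do not describe how instances are selected, so that part, the function names and the toy data are assumptions.

```python
import random


def mask_attributes(rows, attributes, proportion, seed=0):
    """Set the chosen attribute indices to None in a random share of the rows."""
    rng = random.Random(seed)
    count = round(proportion * len(rows))
    hit = set(rng.sample(range(len(rows)), count))
    return [[None if (i in hit and j in attributes) else v for j, v in enumerate(row)]
            for i, row in enumerate(rows)]


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Hypothetical data: attribute 1 is blanked in half of the instances, then imputed.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
masked = mask_attributes(rows, attributes={1}, proportion=0.5)
imputed = impute_mean(masked)
```

A classifier such as Naive Bayes would be trained on `imputed`, whereas a factorization model can be trained on `masked` directly, since it only uses the observed cells.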
Results (Exp. 1)
[Figure: Balance values of Naive Bayes with respect to feature reduction percentage. Panels: Android, Kernel, Perl, VLC]
Results (Exp. 2)
[Figure: Balance values of MF with respect to the missing Churn and Social Attribute data and NB with imputation on Churn and Social Attributes. Panels: Android, Kernel, Perl, VLC]
Threats to Validity
•  Internal validity
o  Naive Bayes, mean-value imputation and Matrix Factorization are widely used in previous studies.
o  The performance measures used for evaluation have also been adopted by several researchers in the past.
o  Many studies discuss static code, history and social metrics.
o  The datasets are extracted from open source project repositories and have also been used in previous studies.
•  External validity
o  Four different datasets extracted from open source project repositories are used.
o  Nevertheless, our results are limited to the analyzed data and context.
Conclusion	
•  Collective matrix factorization from recommender systems is applied to the missing data problem in defect prediction
•  Two experiments were conducted
o  The performance of NB with feature reduction
o  The performance of NB with mean-value imputation vs. the performance of MF with missing data
•  NB performance decreases as the number of features is reduced.
•  Matrix Factorization performs better on datasets with missing data than the benchmark model with imputation
•  Future Work
o  Support the findings using more complex imputation techniques
o  Different missing data scenarios may be adopted
Thank You

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Handling Missing Attributes using Matrix Factorization 

  • 1. Handling Missing Attributes using Matrix Factorization
      Övünç Bozcan, Software Research Lab, Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey, ovuncbozcan@gmail.com
      Ayşe Başar Bener, Data Science Lab, Mechanical and Industrial Engineering, Ryerson University, Toronto, Canada, ayse.bener@ryerson.ca
  • 2. Outline
      • Introduction
      • Related Work
      • Matrix Factorization
      • Experiment
      • Results
      • Conclusion
  • 3. Introduction
      • Software defect prediction models reveal defect-prone parts of the software to guide managers in allocating testing resources efficiently
      • Popular studies
          o Estimate the number of defects remaining in software systems
          o Discover defect associations
          o Classify defect-proneness of software components into two classes, defect-prone and not defect-prone
      • Metrics
          o Static code
          o History
          o Social
  • 4. Introduction
      • Numerous defect prediction studies have been published over the last 40 years
          o Statistical techniques with machine learning algorithms are adopted
              • Nagappan et al., Ostrand et al., Zimmermann et al., Fenton et al., Khoshgoftaar et al.
          o Benchmarking studies
              • Lessmann et al. and Menzies et al.
          o Systematic literature surveys
              • Hall et al.
          o Industrial case studies
              • Tosun et al.
  • 5. Introduction
      • Major challenges in building defect prediction models:
          o High dimensionality of software defect data
              • The number of available software metrics is too large for a classifier to work
          o Skewed, imbalanced data sets
              • The proportion of one class is much larger than that of the other class
          o Performance limitations
              • Limited information content
              • Performance ceiling effect
          o Incomplete datasets
              • Features of the training set may differ from the features of the test set
              • Some of the test set attributes may be missing
              • There may be extra attributes in test sets
              • When building a model with several datasets, different datasets may have different attribute sets
  • 6. Introduction
      • Missing value patterns may take different forms:
          o Data may be missing at individual points
              • Some attribute values may be considered outliers
          o Data may be missing in chunks
              • You may want to build your model with several datasets whose attributes differ
              • When these datasets are concatenated, there will probably be missing chunks
      • Possible solutions:
          o Use the largest common attribute set, OR
          o Impute the missing attributes
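The two options on this slide can be illustrated with a small sketch. It assumes each dataset is a list of dicts keyed by attribute name; the attribute names (`loc`, `churn`, `commits`) and all values are made up for the example and do not come from the study.

```python
# Hypothetical toy datasets; attribute names and values are illustrative only.
dataset_a = [
    {"loc": 120.0, "churn": 14.0},
    {"loc": 45.0, "churn": 3.0},
]
dataset_b = [
    {"loc": 300.0, "commits": 7.0},
]


def common_attributes(datasets):
    """Option 1: keep only the attributes present in every dataset."""
    attribute_sets = [set(rows[0]) for rows in datasets]
    return sorted(set.intersection(*attribute_sets))


def concatenate_with_gaps(datasets):
    """Starting point for option 2: union of attributes, None marks a gap."""
    all_attributes = sorted(set().union(*(set(rows[0]) for rows in datasets)))
    merged = []
    for rows in datasets:
        for row in rows:
            # row.get returns None for attributes this dataset never recorded
            merged.append([row.get(name) for name in all_attributes])
    return all_attributes, merged


shared = common_attributes([dataset_a, dataset_b])
columns, merged = concatenate_with_gaps([dataset_a, dataset_b])
print(shared)
print(columns)
for row in merged:
    print(row)
```

The common-attribute route discards `churn` and `commits` entirely, while concatenation keeps them at the price of whole blocks of `None` values that must then be imputed or handled by the learner.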
  • 7. Proposed Solution
      • Matrix factorization is a solution to the data scarcity problem in recommendation systems
  • 8. Related Work
      • Recommendation systems
          o Netflix Prize competition
              • Koren, Bell, and Volinsky
          o Collective Matrix Factorization
              • Singh et al. and Lippert et al.
  • 9. Matrix Factorization
      • Netflix competition
          o Matrix factorization models are superior to classical nearest-neighbor techniques, as they allow additional information to be incorporated and offer scalable predictive accuracy (Bell et al.)
      • Matrix factorization decomposes a large matrix into two smaller matrices called factors.
      • Multiplying the factors approximately reconstructs the original matrix, including estimates for entries that were missing.
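The idea of factorizing a matrix with gaps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses plain stochastic gradient descent over the observed cells only, and the function names, hyperparameters (`rank`, `steps`, `lr`, `reg`) and the toy matrix are all assumptions made for the example.

```python
import random


def factorize(matrix, rank=2, steps=5000, lr=0.01, reg=0.02, seed=0):
    """Approximate `matrix` (rows of numbers, None = missing) as P * Q^T.

    Only observed cells contribute to the squared-error loss, so the
    product of the factors also yields estimates for the missing cells.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.random() for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_cols)]
    observed = [(i, j) for i in range(n_rows) for j in range(n_cols)
                if matrix[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = matrix[i][j] - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the squared error with L2 regularization
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def reconstruct(P, Q):
    """Multiply the factors back into a full matrix."""
    return [[sum(p * q for p, q in zip(row_p, row_q)) for row_q in Q]
            for row_p in P]


# Toy matrix with missing cells; the values are illustrative only.
observed_matrix = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
P, Q = factorize(observed_matrix)
approx = reconstruct(P, Q)
for row in approx:
    print([round(v, 2) for v in row])
```

The reconstructed matrix has no gaps: cells that were `None` in the input are filled with values implied by the learned factors, which is what makes the technique usable on datasets with missing attributes.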
  • 10. Matrix Factorization
      • Nonnegative MF algorithms (Berry et al.)
          o Multiplicative update algorithms
          o Gradient descent algorithms
              • Easiest to implement and to scale
          o Alternating least squares algorithms
      • Multi-Relational Matrix Factorization by Lippert et al.
          o Low-norm matrix factorization based on a gradient descent algorithm
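For comparison with the gradient descent family, the multiplicative update family listed above can be sketched as follows. This is an illustration of the standard multiplicative update rules for the squared (Frobenius) error on a fully observed nonnegative matrix; it is not the variant used in the study, and the helper names, the small `eps` guard and the toy matrix are assumptions for the example.

```python
import random


def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Sum of squared differences between V and W * H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh))


def nmf_multiplicative(V, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        numer, denom = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, numer, denom)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        numer, denom = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, numer, denom)]
    return W, H


# Toy nonnegative matrix; the values are illustrative only.
V = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 0.0, 1.0],
    [3.0, 2.0, 5.0],
]
W, H = nmf_multiplicative(V, rank=2)
print(frobenius_error(V, W, H))
```

Because every update multiplies by a ratio of nonnegative quantities, the factors stay nonnegative without any explicit projection step, which is the main appeal of this family.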
  • 11. Experiment
      Datasets

      Dataset        Static Code Metrics   Churn Metrics   Social Metrics   Instances   Defective %
      Android        106                   15              25               12981       6.4
      Linux Kernel   106                   15              25               14801       5.5
      Perl           106                   15              25               125         61.6
      VLC            106                   15              25               936         39.2

      • Android
          o Open source operating system designed for mobile devices
      • Linux Kernel
          o Open source operating system
      • Perl
          o Stable, cross-platform, open source interpreted language
      • VLC
          o Open source multimedia player
  • 12. Experiment
      • Performance measures
          o Pd (probability of detection)
          o Pf (probability of false alarm)
          o Balance
      • Learning algorithms
          o Naive Bayes
          o Matrix Factorization
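The slide names the measures without defining them. The sketch below assumes the definitions commonly used in the defect prediction literature: pd = TP / (TP + FN), pf = FP / (FP + TN), and balance as one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1). The function names and the example labels are hypothetical.

```python
import math


def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary labels (1 = defective, 0 = clean)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn


def pd_pf_balance(actual, predicted):
    """Return (pd, pf, balance); assumes both classes occur in `actual`."""
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    pd = tp / (tp + fn)  # share of defective modules that were detected
    pf = fp / (fp + tn)  # share of clean modules that were flagged
    balance = 1 - math.sqrt((0 - pf) ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return pd, pf, balance


# Made-up labels for ten modules, purely for illustration.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
pd, pf, balance = pd_pf_balance(actual, predicted)
print(pd, pf, balance)
```

Balance rewards a high detection rate and a low false alarm rate together, so it is less misleading than accuracy on skewed datasets such as Android (6.4% defective) or the Linux Kernel (5.5% defective).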
  • 13. Experiment
      Experiment 1
      • The performance of the Naive Bayes algorithm is explored
      • 10-fold cross-validation is run 10 times while attributes are gradually removed from the datasets
      • Attributes are removed according to their correlation with the class attribute
          o Pearson correlation is used
      • 4 (datasets) x 10 (removal steps) x 10 (repetitions) x 10 (folds) = 4000 Naive Bayes prediction models are built
      Experiment 2
      • The performances of Naive Bayes with imputation and Matrix Factorization are compared
      • Attributes are chosen according to their correlation with the class attribute
          o Pearson correlation is used
      • Imputation or removal is applied to the chosen attributes in increasing proportions
      • 4 (datasets) x 10 (attribute selection steps) x 10 (imputation steps) x 10 (folds) = 4000 Naive Bayes and Matrix Factorization models are built
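Two building blocks of these experiments, ranking attributes by Pearson correlation with the class and mean-value imputation, can be sketched as below. The slides do not say whether the most or the least correlated attributes are removed first, so the sketch only produces the ranking; the function names and the toy data are assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(rows, labels):
    """Attribute indices sorted by |r| with the class, strongest first."""
    columns = list(zip(*rows))
    scores = [abs(pearson(col, labels)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


def mean_impute(rows):
    """Replace None cells with the mean of the observed values in the column."""
    columns = list(zip(*rows))
    means = []
    for col in columns:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


# Toy data: four modules, three attributes, binary defect labels.
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 5.0, 0.1],
    [3.0, 4.0, 0.4],
    [4.0, 6.0, 0.2],
]
labels = [0, 0, 1, 1]
ranking = rank_attributes(rows, labels)

with_gaps = [[1.0, None], [3.0, 4.0], [None, 8.0]]
imputed = mean_impute(with_gaps)
print(ranking)
print(imputed)
```

Walking the ranking from either end in ten equal steps would give the removal or selection schedule described on the slide; which end to start from is a choice the slides leave open.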
  • 14. Results (Exp. 1)
      [Figure: Balance values of Naive Bayes with respect to feature reduction percentage. Panels: Android, Kernel, Perl, VLC]
  • 15. Results (Exp. 2)
      [Figure: Balance values of MF with respect to missing Churn and Social attribute data, and of NB with imputation on Churn and Social attributes. Panels: Android, Kernel, Perl, VLC]
  • 16. Threats to Validity
      • Internal validity
          o Naive Bayes, mean-value imputation and Matrix Factorization have been widely used in previous studies.
          o The performance measures used for evaluation have also been adopted by several researchers in the past.
          o Studies discussing static code, history and social metrics are abundant.
          o The datasets are extracted from open source project repositories and have also been used in previous studies.
      • External validity
          o Four different datasets extracted from open source project repositories are used.
          o Nevertheless, our results are limited to the analyzed data and context.
  • 17. Conclusion
      • Collective matrix factorization from recommender systems is applied to the missing data problem in defect prediction
      • Two experiments were conducted
          o The performance of NB with feature reduction
          o The performance of NB with mean-value imputation vs. the performance of MF with missing data
      • NB performance decreases as the number of features is reduced.
      • Matrix Factorization performs better on datasets with missing data than the benchmark model with imputation
      • Future work
          o Support the findings using more complex imputation techniques
          o Different missing data scenarios may be adopted