Curse of Dimensionality

                            By,
                 Nikhil Sharma
What is it?
   In applied mathematics, the curse of
    dimensionality (a term coined by Richard E.
    Bellman), also known as the Hughes
    effect (named after Gordon F. Hughes), refers to
    the problem caused by the exponential increase
    in volume associated with adding extra
    dimensions to a mathematical space.
Problems
•   High-dimensional data is difficult to work with
    because:
      ◦ Adding more features can increase the noise, and
        hence the error
      ◦ There usually aren’t enough observations to get
        good estimates
•   This causes:
      ◦ An increase in running time
      ◦ Overfitting
      ◦ An increase in the number of samples required
An Example
Representation of 10% sample probability space
    [Figure: (i) 2-D, (ii) 3-D]

The number of points would need to increase exponentially
              to maintain a given accuracy.
 10^n samples would be required for an n-dimensional problem.
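
A rough way to see the geometry behind this example, assuming the data are spread uniformly over a unit hypercube (an assumption made only for this sketch, not stated on the slide): to enclose 10% of the volume, a sub-cube needs edge length 0.1^(1/d), which creeps toward the full range of each axis as d grows. The function name is illustrative.

```python
def edge_for_volume_fraction(fraction: float, d: int) -> float:
    """Edge length of an axis-aligned sub-cube that holds `fraction`
    of the volume of a unit hypercube in d dimensions."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if d < 1:
        raise ValueError("d must be at least 1")
    # Volume of a cube with edge e is e**d, so e = fraction**(1/d).
    return fraction ** (1.0 / d)


if __name__ == "__main__":
    # Assumption: data are uniform on the unit hypercube [0, 1]^d.
    for d in (1, 2, 3, 10):
        edge = edge_for_volume_fraction(0.10, d)
        print(f"d={d:2d}: a 10% region spans {edge:.3f} of each axis")
```

In 2-D the 10% region already covers about 0.32 of each axis, and in 3-D about 0.46, so a "local" neighborhood stops being local as dimensions are added.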
Curse of Dimensionality:
               Complexity
   Complexity (running time) increases with
    dimension d
   A lot of methods have at least O(n d^2)
    complexity, where n is the number of samples
   So as d becomes large, O(n d^2) complexity may
    be too costly
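
One common source of an n d^2 term, offered here as an illustration rather than as the slide's own example, is accumulating a d x d scatter (unnormalized covariance) matrix from n samples. The sketch below counts the multiply-add steps explicitly; `scatter_matrix` is a hypothetical helper name.

```python
def scatter_matrix(samples):
    """Return (S, ops) where S is the d x d matrix sum_i x_i x_i^T
    and ops is the number of multiply-add steps performed."""
    d = len(samples[0])
    S = [[0.0] * d for _ in range(d)]
    ops = 0
    for x in samples:            # n samples
        for i in range(d):       # d rows
            for j in range(d):   # d columns
                S[i][j] += x[i] * x[j]
                ops += 1
    return S, ops


if __name__ == "__main__":
    data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n = 3, d = 2
    S, ops = scatter_matrix(data)
    print(S, ops)  # ops equals n * d**2
```

Doubling d quadruples the work for the same n, which is why methods built on such matrices become expensive in high dimensions.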
Curse of Dimensionality:
               Overfitting
   Paradox: if n < d^2, we are better off assuming
    that features are uncorrelated, even if we know
    this assumption is wrong
   We are then likely to avoid overfitting because we fit a
    model with fewer parameters
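
The slide does not name the model being fit. Assuming, for illustration only, a Gaussian density per class, the saving from treating features as uncorrelated can be counted directly: a full covariance matrix has d(d+1)/2 free entries, a diagonal one has d. The function name is hypothetical.

```python
def gaussian_param_count(d: int, diagonal: bool = False) -> int:
    """Free parameters of a d-dimensional Gaussian (assumed model).

    diagonal=True corresponds to assuming uncorrelated features.
    """
    mean_params = d
    # A symmetric d x d covariance has d*(d+1)/2 distinct entries;
    # a diagonal covariance keeps only the d variances.
    cov_params = d if diagonal else d * (d + 1) // 2
    return mean_params + cov_params


if __name__ == "__main__":
    for d in (3, 20, 400):
        full = gaussian_param_count(d)
        diag = gaussian_param_count(d, diagonal=True)
        print(f"d={d}: full={full}, uncorrelated={diag}")
```

With few samples, the d^2-order parameter count of the full model cannot be estimated reliably, while the d-order count of the uncorrelated model often can.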
Curse of Dimensionality: Number
               of Samples
 Suppose we want to use the nearest neighbor
  approach with k = 1 (1NN)
 Suppose we start with only one feature
 This feature is not discriminative, i.e. it does not
  separate the classes well
 We decide to use 2 features. For the 1NN method to
  work well, we need a lot of samples, i.e. samples have to
  be dense
 To maintain the same density as in 1D (9 samples per
  unit length), how many samples do we need?
Curse of Dimensionality: Number
           of Samples
   We need 9^2 = 81 samples to maintain the same
    density as in 1D
Curse of Dimensionality: Number
           of Samples
   Of course, when we go from 1 feature to 2, no
    one gives us more samples; we still have 9
   This is way too sparse for 1NN to work well
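
The sparsity can be checked numerically. The sketch below is a small simulation under assumed conditions (points drawn uniformly from the unit cube, fixed random seed); it estimates how far a sample's nearest neighbor is when the same 9 samples are spread over 1, 2, or 3 features. `mean_nn_distance` is an illustrative name.

```python
import math
import random


def mean_nn_distance(n: int, d: int, trials: int = 200, seed: int = 0) -> float:
    """Average Euclidean distance from a point to its nearest neighbor,
    for n points drawn uniformly from the unit cube [0, 1]^d."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        pts = [[rng.random() for _ in range(d)] for _ in range(n)]
        for i, p in enumerate(pts):
            total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i)
    return total / (trials * n)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(f"d={d}: mean nearest-neighbor distance = {mean_nn_distance(9, d):.3f}")
```

The nearest neighbor drifts further away with every added feature, so the label it carries becomes a weaker guide to the query point.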
Curse of Dimensionality: Number
               of Samples
   Things go from bad to worse if we decide to use 3
    features:
   If 9 samples were dense enough in 1D, in 3D we need 9^3 = 729
    samples!
Curse of Dimensionality: Number
           of Samples
   In general, if n samples are dense enough in 1D,
    then in d dimensions we need n^d samples!
   And n^d grows really, really fast as a function of d
   Common pitfall:
       ◦ If we can’t solve a problem with a few features, adding
         more features seems like a good idea
       ◦ However, the number of samples usually stays the same
       ◦ The method with more features is likely to perform worse,
         not better as expected
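
The arithmetic on these slides can be written out as a short sketch. Both function names are illustrative; the second one turns the pitfall around and asks how dense a fixed sample budget is per axis once it is spread over d features.

```python
def samples_needed(n_1d: int, d: int) -> int:
    """Samples required in d dimensions to match a density of
    n_1d samples per unit length along every axis."""
    return n_1d ** d


def per_axis_density(total_samples: int, d: int) -> float:
    """Equivalent samples per unit length when a fixed number of
    samples is spread over a d-dimensional unit cube."""
    return total_samples ** (1.0 / d)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(f"d={d}: need {samples_needed(9, d)} samples; "
              f"9 samples give {per_axis_density(9, d):.2f} per axis")
```

With 9 samples, the per-axis density falls from 9 in 1D to 3 in 2D and to roughly 2 in 3D.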
Curse of Dimensionality: Number
           of Samples
   For a fixed number of samples, as we add
    features, the graph of classification error:
    [Figure: classification error versus number of features]
   Thus for each fixed sample size n, there is an
    optimal number of features to use
The Curse of Dimensionality
 We should try to avoid creating a lot of features
 Often there is no choice: the problem starts with many features
 Example: Face Detection
     ◦ One sample point is a k by m array of pixels
     ◦ Feature extraction is not trivial; usually every
       pixel is taken as a feature
     ◦ Typical dimension is 20 by 20 = 400
     ◦ Suppose 10 samples are dense enough for 1 dimension.
       Then we need "only" 10^400 samples
The Curse of Dimensionality
   Face Detection: the dimension of one sample point is km
   The fact that we set up the problem with km dimensions
    (features) does not mean it is really a km-dimensional
    problem
   Most likely we are not setting the problem up with the right
    features
   If we used better features, we would likely need far fewer than
    km dimensions
   The space of all k by m images has km dimensions
   The space of all k by m faces must be much smaller, since faces
    form a tiny fraction of all possible images
Dimensionality Reduction
   We want to reduce the number of
    dimensions because:
      Efficiency.
      Measurement costs.
      Storage costs.
      Computation costs.
      Classification performance.
      Ease of interpretation/modeling.
Principal Components
   The idea is to project onto the subspace which accounts for
    most of the variance.

   This is accomplished by projecting onto the eigenvectors of
    the covariance matrix associated with the largest
    eigenvalues.

   This is generally not the projection best suited for
    classification.

   It can be a useful method for providing a first-cut reduction
    in dimension from a high-dimensional space.
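
The projection described on this slide can be sketched in plain Python. This is a minimal illustration, not a production implementation: it finds only the single leading eigenvector, and it uses power iteration as a stand-in for a full eigendecomposition (which a numerical library would normally provide). All function names are illustrative.

```python
import math


def covariance(samples):
    """Return (mean, cov) for a list of d-dimensional samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    centered = [[x[k] - mean[k] for k in range(d)] for x in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector by power iteration (swapped-in technique).
    Assumes the start vector is not orthogonal to that eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


def project(samples, mean, direction):
    """1-D coordinates of the centered samples along `direction`."""
    return [sum((x[k] - mean[k]) * direction[k] for k in range(len(mean)))
            for x in samples]


if __name__ == "__main__":
    data = [[0, 0], [1, 2], [2, 4], [3, 6]]  # made-up points on the line y = 2x
    mean, cov = covariance(data)
    direction = top_eigenvector(cov)
    print(direction, project(data, mean, direction))
```

For these made-up points the recovered direction is proportional to (1, 2), the line the data lie on, and each 2-D sample is replaced by a single coordinate along it.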
Feature Combination as a method
    to reduce Dimensionality
   High-dimensional data is challenging to work with and often redundant
   It is natural to try to reduce dimensionality
   Reduce dimensionality by feature combination: combine
    old features x to create new features y
    [Figure: mapping from old features x to new features y]
   For example:
    [Figure: example of feature combination]
   Ideally, the new vector y should retain from x all
    information important for classification
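
The slide's own example was an image that did not survive extraction. As a stand-in, here is a hypothetical linear combination y = W x, where the matrix W is made up for illustration: it maps four old features to two new ones by summing them in pairs.

```python
def combine_features(x, W):
    """Linear feature combination y = W x.

    W has one row per new feature; each row holds the weights
    applied to the old features in x.
    """
    if any(len(row) != len(x) for row in W):
        raise ValueError("each row of W must match the length of x")
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical weights: y1 = x1 + x2, y2 = x3 + x4.
W_EXAMPLE = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

if __name__ == "__main__":
    x = [1, 2, 3, 4]                       # 4 old features
    print(combine_features(x, W_EXAMPLE))  # 2 new features
```

Whether such a W keeps the information that matters for classification depends entirely on how it is chosen; PCA is one way of choosing it.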
Principal Component Analysis
                  (PCA)
   Main idea: seek the most accurate data representation in a
    lower-dimensional space
   Example in 2-D
        ◦ Project data onto a 1-D subspace (a line) which minimizes the
          projection error
   Notice that the good line to use for projection lies in
    the direction of largest variance
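
The link between smallest projection error and largest variance can be checked by brute force. The sketch below uses made-up 2-D points and scans candidate lines through the data mean in 1-degree steps; the step size, the data, and the function names are all assumptions of this example.

```python
import math


def centered(points):
    """Shift 2-D points so that their mean is at the origin."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return [(x - mx, y - my) for x, y in points]


def projection_error(points, degrees):
    """Sum of squared perpendicular distances from the centered
    points to the line through the origin at the given angle."""
    t = math.radians(degrees)
    return sum((-x * math.sin(t) + y * math.cos(t)) ** 2
               for x, y in centered(points))


def variance_along(points, degrees):
    """Variance of the centered points projected onto that line."""
    t = math.radians(degrees)
    pts = centered(points)
    return sum((x * math.cos(t) + y * math.sin(t)) ** 2
               for x, y in pts) / len(pts)


# Made-up data lying roughly along a line of slope 0.5.
POINTS = [(0, 0.0), (2, 1.1), (4, 1.9), (6, 3.2), (8, 3.9)]
ANGLES = range(180)  # candidate line directions, in degrees

best_by_error = min(ANGLES, key=lambda a: projection_error(POINTS, a))
best_by_variance = max(ANGLES, key=lambda a: variance_along(POINTS, a))

if __name__ == "__main__":
    print(best_by_error, best_by_variance)
```

Both searches pick the same angle. For each centered point, the squared projection and the squared perpendicular distance add up to its squared length, so minimizing one total is the same as maximizing the other.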
PCA: Approximation of Elliptical
         Cloud in 3D
Thank You!
   The End.

Curse of dimensionality

  • 1. Curse of Dimensionality By, Nikhil Sharma
  • 2. What is it?  In applied mathematics, curse of dimensionality (a term coined by Richard E. Bellman), also known as the Hughes effect (named after Gordon F. Hughes),refers to the problem caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space.
  • 3. Problems • High Dimensional Data is difficult to work with since: Adding more features can increase the noise, and hence the error There usually aren’t enough observations to get good estimates • This causes: Increase in running time Overfitting Number of Samples Required
  • 4. An Example Representation of 10% sample probability space (i) 2-D (ii)3-D The Number of Points Would Need to Increase Exponentially to Maintain a Given Accuracy. 10n samples would be required for a n-dimension problem.
  • 5. Curse of Dimensionality: Complexity  Complexity (running time) increases with dimension d  A lot of methods have at least O(nd 2 ) complexity, where n is the number of samples  So as d becomes large, O(nd2) complexity may be too costly
  • 6. Curse of Dimensionality: Overfitting  Paradox: If n < d2 we are better off assuming that features are uncorrelated, even if we know this assumption is wrong  We are likely to avoid overfitting because we fit a model with less parameters:
  • 7. Curse of Dimensionality: Number of Samples  Suppose we want to use the nearest neighbor approach with k = 1 (1NN)  This feature is not discriminative, i.e. it does not  separate the classes well  Suppose we start with only one feature  We decide to use 2 features. For the 1NN method to work well, need a lot of samples, i.e. samples have to be dense  To maintain the same density as in 1D (9 samples per unit length), how many samples do we need?
  • 8. Curse of Dimensionality: Number of Samples  We need 92 samples to maintain the same density as in 1D
  • 9. Curse of Dimensionality: Number of Samples  Of course, when we go from 1 feature to 2, no one gives us more samples, we still have 9  This is way too sparse for 1NN to work well
  • 10. Curse of Dimensionality: Number of Samples  Things go from bad to worse if we decide to use 3 features:  If 9 was dense enough in 1D, in 3D we need 93=729 samples!
  • 11. Curse of Dimensionality: Number of Samples. In general, if n samples are dense enough in 1D, then in d dimensions we need n^d samples, and n^d grows very fast as a function of d. Common pitfall: if we can’t solve a problem with a few features, adding more features seems like a good idea. However, the number of samples usually stays the same, so the method with more features is likely to perform worse, not better as expected.
  • 12. Curse of Dimensionality: Number of Samples. For a fixed number of samples, consider the graph of classification error as we add features (graph of classification error against number of features). Thus, for each fixed sample size n, there is an optimal number of features to use.
  • 13. The Curse of Dimensionality. We should try to avoid creating a lot of features. Often there is no choice: the problem starts with many features. Example: face detection. One sample point is a k by m array of pixels. Feature extraction is not trivial, so usually every pixel is taken as a feature. A typical dimension is 20 by 20 = 400. Suppose 10 samples are dense enough for 1 dimension. Then we need "only" 10^400 samples.
  • 14. The Curse of Dimensionality. In face detection, the dimension of one sample point is km. The fact that we set up the problem with km dimensions (features) does not mean it is really a km-dimensional problem. Most likely we are not setting the problem up with the right features. If we used better features, we would likely need far fewer than km dimensions. The space of all k by m images has km dimensions. The space of all k by m faces must be much smaller, since faces form a tiny fraction of all possible images.
  • 15. Dimensionality Reduction. We want to reduce the number of dimensions because of: efficiency; measurement costs; storage costs; computation costs; classification performance; ease of interpretation/modeling.
  • 16. Principal Components. The idea is to project onto the subspace which accounts for most of the variance. This is accomplished by projecting onto the eigenvectors of the covariance matrix associated with the largest eigenvalues. This is generally not the projection best suited for classification. It can be a useful method for providing a first-cut reduction in dimension from a high-dimensional space.
  • 17. Feature Combination as a Method to Reduce Dimensionality. High dimensionality is challenging and often redundant, so it is natural to try to reduce it. Reduce dimensionality by feature combination: combine old features x to create new features y. Ideally, the new vector y should retain from x all information important for classification.
  • 18. Principal Component Analysis (PCA). Main idea: seek the most accurate data representation in a lower-dimensional space. Example in 2-D: project the data onto a 1-D subspace (a line) that minimizes the projection error. Notice that the best line to use for projection lies in the direction of largest variance.
  • 19. PCA: Approximation of Elliptical Cloud in 3D
  • 20. Thank You! The End.
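
The projection described in the Principal Components and PCA slides above can be sketched for the 2-D case in a few lines of Python. This is only an illustration under stated assumptions: the function names `principal_direction_2d` and `project_to_1d` and the sample points are hypothetical, not from the slides, and the code uses the closed-form angle of the leading eigenvector of a symmetric 2x2 covariance matrix instead of a general eigen-solver.

```python
import math


def principal_direction_2d(points):
    """Return (mean, unit vector) for the direction of largest variance.

    Hypothetical helper for illustration; works for 2-D points only.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

    # For a symmetric 2x2 matrix, the eigenvector belonging to the largest
    # eigenvalue makes this angle with the x-axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def project_to_1d(points):
    """Project centered 2-D points onto the direction of largest variance."""
    (mean_x, mean_y), (ux, uy) = principal_direction_2d(points)
    return [(x - mean_x) * ux + (y - mean_y) * uy for x, y in points]


# Made-up points scattered roughly along the line y = x.
sample_points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
print(principal_direction_2d(sample_points)[1])
print(project_to_1d(sample_points))
```

Each 2-D point is replaced by a single coordinate along the chosen line. As the slides note, this direction preserves the most variance but is not necessarily the one that best separates classes.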

Editor's notes

  1. Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too hard to find. You walk along the line and it takes two minutes. Now let's say you have a square 100 yards across and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days. Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh. Searching through the space gets a *lot* harder as you add dimensions. You might not realize this intuitively when it's just stated in mathematical formulas, since they all have the same "width". That's the curse of dimensionality. It gets to have a name because it is unintuitive, useful, and yet simple.
  2. Let the complete probability space for one variable be represented by the unit interval (0, 1), and imagine drawing ten samples along that interval. Each sample would then have to represent 10% of the probability space (on average). Now consider a second variable defined on another, orthogonal, (0, 1) interval, also represented by ten samples. We have produced 10 (x1, x2) points on a plane defined by the orthogonal x1, x2 lines, representing a new probability space. But the new space has 10 × 10 = 100 area units, so each of the ten points now represents only 1% of the probability space. It would require 100 points for each point to represent the same 10% of the probability space that was represented by 10 points in only one dimension. This is illustrated below, where the 10 points are not drawn at random, to illustrate their diminished coverage. The number of points would need to increase exponentially to maintain a given accuracy. Now, although there are 100 (x1, x2) points, neither axis has more accuracy than was provided by the previous 10, because there are 10 values of x1 required for the 10 different values of x2 necessary to represent the new probability plane. Because in practice the samples are taken at random, it appears as though there are more samples, but this isn't the case in terms of probability space coverage, which is related to the density of points, not the number of observations per axis. Next consider adding another dimension, creating a probability space represented by a cube with ten units on a side and 1000 probability units within. It now requires stacking 10 of the previous (x1, x2) planes to construct the new (x1, x2, x3) cube, and of course 10 times the number of samples to maintain the former level of accuracy, or probability coverage. It follows that 10^n samples would be required for an n-dimensional problem.
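
The counting argument in note 2 can be checked with a short Python sketch. It assumes an idealized even grid with the same number of samples along every axis. The function names `samples_needed`, `share_per_sample`, and `covered_fraction` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def samples_needed(samples_per_axis, dims):
    """Samples required to keep the 1-D density in `dims` dimensions."""
    return samples_per_axis ** dims


def share_per_sample(samples_per_axis, dims):
    """Fraction of the space that a single sample represents (one grid cell)."""
    return Fraction(1, samples_per_axis ** dims)


def covered_fraction(samples_available, samples_per_axis, dims):
    """Fraction of the grid cells covered when the sample count is fixed."""
    needed = samples_needed(samples_per_axis, dims)
    return min(Fraction(1), Fraction(samples_available, needed))


# Ten samples per axis, as in the note; only ten samples actually available.
for dims in (1, 2, 3):
    print(
        dims,
        samples_needed(10, dims),
        float(share_per_sample(10, dims)),
        float(covered_fraction(10, 10, dims)),
    )
```

With ten samples per axis, the required counts are 10, 100, and 1000 for one, two, and three dimensions, while a fixed budget of ten samples covers 100%, 10%, and 1% of the space respectively.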