Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY
MBA
SEMESTER: 3
SPECIALIZATION
BUSINESS ANALYTICS (BA 2)
SUBJECT
DATA MINING
MODULE NO : 2
DATA REDUCTION
- Jayanti R Pande
DGICM College, Nagpur
Q1. What is Data Reduction?
DATA REDUCTION
• Data reduction refers to the process of reducing the volume of data while preserving its integrity and meaningfulness.
• It involves techniques aimed at minimizing the storage space required to store data or reducing the computational
resources needed to process it, without significantly compromising its informational content.
• Data reduction techniques are crucial in handling large datasets efficiently, improving analysis speed, minimizing storage
requirements, and facilitating easier processing and analysis of data without losing essential information.
• The primary goal of data reduction is to simplify complex datasets by eliminating redundant, irrelevant, or noisy
information, thereby improving the efficiency of data storage, processing, and analysis without significantly compromising
the accuracy or integrity of the data.
DEFINITION OF DATA REDUCTION
Data reduction is a process used in data analysis to decrease the volume or size of a dataset while retaining its essential
information and maintaining its quality.
Q2. What are dimensions of large data sets?
DIMENSIONS OF LARGE DATA SETS
1. Volume: This refers to the sheer size of the dataset, usually measured in terms of the amount of data it contains. Large
datasets often contain terabytes, petabytes, or even exabytes of information.
2. Velocity: It represents the speed at which data is generated, collected, processed, and analyzed. In the context of big data,
the velocity dimension emphasizes the rapid rate at which data streams in, requiring real-time or near-real-time
processing.
3. Variety: It signifies the diversity of data types and sources within a dataset. Large datasets often comprise structured, semi-
structured, and unstructured data from various sources such as text, images, videos, sensor data, social media feeds, etc.
4. Veracity: It pertains to the quality and reliability of the data. Large datasets can contain noisy, inconsistent, or incomplete
data, making it crucial to assess and ensure data accuracy and reliability.
5. Value: This dimension represents the potential insights, knowledge, or actionable information that can be derived from
analyzing the dataset. It's essential to extract meaningful value from large datasets to justify the resources invested in
collecting, storing, and analyzing them.
6. Variability: It refers to the inconsistency or fluctuation in data flow and structure over time. Large datasets might have
dynamic characteristics, where the data distribution, patterns, or formats change over different periods.
Q3. What is Relief Algorithm?
The RELIEF ALGORITHM is a machine learning technique used for feature selection or attribute weighting in supervised
learning tasks, especially in classification problems. It was initially introduced for handling high-dimensional datasets and is
particularly useful when dealing with noisy or irrelevant features.
The Relief algorithm operates by estimating the relevance of features by analyzing their contribution to the classification task.
It works as follows:
1 Initialization: The algorithm starts by initializing the weights for each feature to zero.
2 Iterative Process: For each instance in the dataset:
• Randomly select an instance from the dataset.
• Identify its k nearest neighbors, where k is a user-defined parameter.
• Separate these neighbors into those belonging to the same class (nearest hits) and those from different classes (nearest misses).
• Update the weights of features based on the differences between the selected instance and its nearest neighbors.
3 Weight Update: The algorithm adjusts the feature weights as follows:
• For continuous features: The weights are updated by considering the differences between the feature values of the selected
instance and its nearest neighbors.
• For categorical features: The algorithm calculates the weights based on the occurrences of different attribute values among
neighbors.
4 Final Feature Ranking: After iterating through the dataset, the Relief algorithm ranks the features based on their weights.
Features with higher weights are considered more relevant or discriminatory for the classification task.
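To make the procedure concrete, below is a minimal NumPy sketch of the basic Relief loop described above, restricted to a binary-class problem with a single nearest hit and nearest miss (k = 1); the function name `relief`, the Manhattan distance, and the fixed iteration count are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def relief(X, y, n_iterations=100, random_state=0):
    """Basic Relief feature weighting (binary classes, k = 1).

    X : (n_samples, n_features) array of continuous features scaled to [0, 1].
    y : (n_samples,) array of class labels.
    Returns one weight per feature; a higher weight means a more relevant feature.
    """
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)                 # Step 1: initialise weights to zero

    for _ in range(n_iterations):                  # Step 2: iterate over random instances
        i = rng.integers(n_samples)
        xi, yi = X[i], y[i]
        dist = np.abs(X - xi).sum(axis=1)          # Manhattan distance to every instance
        dist[i] = np.inf                           # exclude the selected instance itself

        same = np.where(y == yi)[0]
        diff = np.where(y != yi)[0]
        near_hit = same[np.argmin(dist[same])]     # nearest neighbour of the same class
        near_miss = diff[np.argmin(dist[diff])]    # nearest neighbour of a different class

        # Step 3: reward features that differ on the miss, penalise those that differ on the hit
        weights += np.abs(xi - X[near_miss]) - np.abs(xi - X[near_hit])

    return weights / n_iterations                  # Step 4: rank features by these weights

# Toy usage: only feature 0 carries the class signal, so it should receive the largest weight.
rng = np.random.default_rng(1)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)
print(relief(X, y))
```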
Q4. Explain about Feature Reduction.
FEATURE REDUCTION, also known as feature selection or dimensionality reduction, involves techniques to decrease the number of
features or variables in a dataset while retaining its essential information. The primary goal is to simplify the dataset, making it more
manageable and efficient for analysis without losing critical information.
Here are some common techniques for feature reduction:
Filter Methods: These methods select features based on statistical properties like correlation, variance, or information gain without
involving a machine learning model.
Wrapper Methods: Wrapper methods use specific machine learning algorithms to evaluate subsets of features, selecting the best set that
yields the highest model performance.
Embedded Methods: These methods integrate feature selection within the process of model training. Algorithms like LASSO regression or
tree-based models perform feature selection as part of their learning process.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE)
transform high-dimensional data into a lower-dimensional space while preserving essential information.
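As a hedged illustration of these families, the short scikit-learn sketch below applies one filter method, one wrapper method, and one dimensionality-reduction step to a bundled toy dataset; the dataset, the k = 10 selection size, the 5-component PCA, and the logistic-regression wrapper are assumptions made only for demonstration.

```python
# Requires scikit-learn; the parameter values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # 569 samples, 30 features

# Filter method: rank features by mutual information with the target, keep the top 10.
X_filter = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a simple classifier.
X_wrapper = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit_transform(X, y)

# Dimensionality reduction: project onto the first 5 principal components.
X_pca = PCA(n_components=5).fit_transform(X)

print(X.shape, X_filter.shape, X_wrapper.shape, X_pca.shape)
```

Embedded methods would follow the same pattern, for example reading the non-zero coefficients of a fitted LASSO model to decide which features to keep.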
Q5. What is value reduction?
VALUE REDUCTION often refers to a process aimed at refining or improving the quality of a dataset by eliminating irrelevant, redundant,
or noisy data. This process involves various techniques such as:
Feature Selection: Identifying and retaining the most relevant and informative features from a dataset, while disregarding less important
or redundant ones. This step helps in reducing dimensionality and focusing on essential information.
Data Cleansing: Removing errors, inconsistencies, or outliers from the dataset to enhance data accuracy and reliability.
Dimensionality Reduction: Applying methods like Principal Component Analysis (PCA) or other dimensionality reduction techniques to
reduce the number of variables or features while preserving the most critical information.
Sampling: Utilizing sampling methods to extract representative subsets of data for analysis, reducing the dataset's volume without losing
its significant characteristics.
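For the sampling technique in particular, a small pandas sketch such as the one below can extract a representative subset; the column names, the 10% sampling fraction, and the stratification by a hypothetical `segment` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dataset; the columns and values are for illustration only.
df = pd.DataFrame({
    "amount": range(1000),
    "segment": ["A", "B", "C", "D"] * 250,
})

# Simple random sample: keep roughly 10% of the rows.
simple = df.sample(frac=0.10, random_state=42)

# Stratified sample: keep 10% of each segment so the group mix is preserved.
stratified = (
    df.groupby("segment", group_keys=False)
      .apply(lambda g: g.sample(frac=0.10, random_state=42))
)

print(len(df), len(simple), len(stratified))       # 1000, 100, 100
```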
Q6. What are entropy measures for ranking features?
ENTROPY-BASED METHODS are commonly used in feature selection to rank features according to their relevance in
a dataset. Entropy measures the uncertainty or randomness in a dataset, and by analyzing how features reduce this
uncertainty, we can determine their importance for classification or prediction tasks.
Two common entropy-based metrics used for feature ranking are:
1.Information Gain (IG): Information Gain measures the reduction in entropy achieved by splitting a dataset based
on a particular feature. It's commonly used in decision tree algorithms for feature selection. Features with higher
information gain are considered more informative for classification.
2.Mutual Information (MI): Mutual Information calculates the amount of information one feature provides about
another. When applied to feature ranking, it quantifies how much knowing the value of one feature reduces
uncertainty about another feature. High mutual information between a feature and the target variable signifies its
importance in prediction.
The process typically involves computing these metrics for each feature and ranking them based on their
Information Gain or Mutual Information scores. Features with higher scores are considered more relevant or
informative for the given task.
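A short NumPy sketch of how Information Gain can be computed for a single categorical feature is given below; the helper names and the toy example are assumptions for illustration (for Mutual Information, scikit-learn's `mutual_info_classif` provides a ready-made ranking).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on each value of `feature`."""
    values, counts = np.unique(feature, return_counts=True)
    weighted = sum(
        (c / len(labels)) * entropy(labels[feature == v])
        for v, c in zip(values, counts)
    )
    return entropy(labels) - weighted

# Toy example: a feature that perfectly separates the classes has maximal gain.
outlook = np.array(["sunny", "sunny", "rain", "rain"])
play    = np.array(["no", "no", "yes", "yes"])
print(information_gain(outlook, play))   # 1.0 bit
```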
Q7. Write a note on PCA.
PCA stands for Principal Component Analysis. It's a statistical method used for dimensionality reduction in data analysis and
machine learning. PCA aims to transform high-dimensional data into a lower-dimensional space while preserving as much
variance (or information) as possible.
Here's an overview of how PCA works:
• Data Representation: PCA starts with a dataset consisting of variables (features) possibly correlated with each other.
• Covariance Matrix: PCA computes the covariance matrix of the dataset. This matrix shows how variables change together. It's
essential for understanding the relationships between different variables.
• Eigenvalue Decomposition: PCA finds the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the
directions of maximum variance in the data, while eigenvalues indicate the magnitude of variance along those directions.
• Principal Components: PCA sorts the eigenvectors based on their corresponding eigenvalues in descending order. These
sorted eigenvectors become the principal components. The first principal component captures the most variance in the data,
followed by the second, third, and so on.
• Dimensionality Reduction: To reduce the dimensionality of the data, PCA selects a subset of principal components that
capture the most significant variance. By choosing fewer principal components, the dataset is transformed into a lower-
dimensional space while retaining as much relevant information as possible.
PCA is useful for various purposes:
Dimensionality Reduction: Reducing the number of features while maintaining most of the information.
Data Visualization: Visualizing high-dimensional data in two or three dimensions for better understanding.
Noise Reduction: Removing noise or redundant information from the dataset.
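The NumPy sketch below walks through the same steps (centering, covariance matrix, eigen-decomposition, sorting, projection) on random data; the 5-feature dataset and the choice of keeping two components are illustrative assumptions, and in practice a library routine such as scikit-learn's `PCA(n_components=2)` would typically be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 samples, 5 features

X_centered = X - X.mean(axis=0)               # centre each feature
cov = np.cov(X_centered, rowvar=False)        # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
components = eigvecs[:, order[:2]]            # keep the top 2 principal components
X_reduced = X_centered @ components           # project into the 2-D space

explained = eigvals[order[:2]] / eigvals.sum()
print(X_reduced.shape, explained)             # (200, 2) and the variance ratios
```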
Q8. What is Feature Discretisation? Explain about Chi Merge Technique.
FEATURE DISCRETIZATION is the process of converting continuous variables or features into discrete or categorical ones. This
transformation is often useful for certain machine learning algorithms that perform better with categorical data or when
dealing with datasets that contain continuous values and require discretization for analysis purposes.
Chi-Merge is one of the techniques used for feature discretization. It's a method based on the statistical significance of adjacent
intervals to merge them into larger intervals if their statistical properties, such as the distribution of classes or values within
them, are similar enough.
The steps involved in the Chi-Merge technique are:
Initialization: Begin with a continuous variable that needs discretization.
Initial Interval Creation: Initially, each unique value of the continuous variable forms a separate interval.
Chi-Squared Test: Compute the Chi-Squared statistic between adjacent intervals. The Chi-Squared test measures the
independence of variables and helps determine if the adjacent intervals should be merged.
Merge Process: Merge adjacent intervals if the Chi-Squared statistic meets a specified threshold or if their statistical properties
are similar enough. This merging process continues until all adjacent intervals satisfy certain criteria (e.g., the Chi-Squared
statistic does not exceed a predetermined threshold).
Final Discretization: Once the merging process is complete, the resulting intervals represent the discretized or binned version of
the original continuous variable.
Chi-Merge helps in reducing the number of intervals or bins while maintaining meaningful distinctions in the data distribution. It
ensures that adjacent intervals are merged based on their statistical similarities, thus creating fewer but more representative
categories for the discretized variable.
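A simplified Python sketch of the merge loop is shown below; for brevity it stops when a target number of intervals remains rather than testing against a chi-squared significance threshold, and the function name, the toy data, and the `max_intervals` parameter are assumptions for illustration.

```python
import numpy as np

def chi_merge(values, labels, max_intervals=6):
    """Simplified ChiMerge: repeatedly merge the adjacent pair of intervals with the
    smallest chi-squared statistic until only `max_intervals` intervals remain."""
    classes = np.unique(labels)
    # Initial intervals: one per unique value, each holding its per-class counts.
    intervals = []
    for v in np.unique(values):
        counts = np.array([np.sum((values == v) & (labels == c)) for c in classes],
                          dtype=float)
        intervals.append([v, counts])

    while len(intervals) > max_intervals:
        chis = []
        for (_, a), (_, b) in zip(intervals[:-1], intervals[1:]):
            obs = np.vstack([a, b])                              # 2 x n_classes table
            expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0) / obs.sum()
            expected[expected == 0] = 1e-9                       # avoid division by zero
            chis.append(((obs - expected) ** 2 / expected).sum())
        i = int(np.argmin(chis))                                 # most similar neighbours
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]  # merge their class counts
        del intervals[i + 1]

    return [iv[0] for iv in intervals]                           # lower bound of each bin

# Toy usage: discretise a continuous score against a binary label.
rng = np.random.default_rng(1)
scores = rng.uniform(0, 100, size=200).round(1)
labels = (scores > 55).astype(int)
print(chi_merge(scores, labels, max_intervals=4))
```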
Q9. Discuss the major comparison parameters used in data reduction techniques
When employing data reduction techniques in preparation for data mining, various comparison parameters play a crucial role in
assessing the effectiveness and trade-offs involved in the process. The major comparison parameters include:
1 Computing Time: Simplifying the dataset through data reduction ideally leads to reduced computing time during data mining.
By decreasing the volume or dimensions of the data, computational processes such as model training, analysis, and querying
become more efficient. However, it's essential to strike a balance between spending time on pre-processing tasks (such as
dimensionality reduction) and the overall improvement gained in computational efficiency.
2 Predictive/Descriptive Accuracy: The accuracy of data mining models is paramount. By using only relevant and significant
features derived from data reduction, the expectation is that data mining algorithms can learn more swiftly and produce more
accurate models. Removing irrelevant or redundant features mitigates the risk of misleading the learning process, leading to
improved predictive or descriptive accuracy of the models generated.
3 Representation of Data-Mining Models: Simplifying the representation of the data-mining model through data reduction often
results in models that are easier to interpret and understand. Simpler models derived from reduced data dimensions enhance
interpretability. While there might be a slight trade-off in accuracy, achieving a balance between accuracy and simplicity in
representation is essential. Dimensionality reduction aids in striking this balance by simplifying the model representation without
sacrificing accuracy significantly.
Striving to achieve reduced computational time, improved accuracy, and a simplified representation simultaneously through
dimensionality reduction is an ideal scenario. However, in practical applications, there might be trade-offs among these
parameters. Balancing these factors becomes crucial, as extreme reduction efforts might impact accuracy, and overly complex
models might hinder interpretability.
Q10. Discuss in brief the recommended characteristics of data reduction algorithms.
The recommended characteristics of data reduction algorithms play a crucial role in designing effective techniques that enable
efficient and accurate reduction of data. Here are the key characteristics:
Measurable Quality: The algorithms should produce quantifiable results regarding the quality of the approximated data set after
reduction. This allows for precise measurement and evaluation of the effectiveness of the reduction process.
Recognizable Quality: The quality of the approximated results should be easily determinable before any data mining procedure is
applied. This facilitates the assessment of the effectiveness of reduction in real-time, allowing adjustments or optimizations as
needed.
Monotonicity: As the reduction algorithms are often iterative in nature, the quality of results should consistently improve or, at
the very least, remain the same with each iteration. The algorithm's output should be a non-decreasing function of both time
and input data quality.
Consistency: The quality of the results achieved should demonstrate a correlation with computation time and the quality of the
input data. This characteristic ensures that the outcomes are reliable and predictable based on these factors.
Diminishing Returns: The algorithm should exhibit diminishing returns, where the initial iterations lead to more significant
improvements in the solution. As the process continues, the rate of improvement gradually diminishes until reaching a point of
diminishing marginal returns.
Interruptibility: The algorithm should allow for interruption at any stage, providing intermediate results or partial solutions. This
feature is crucial as it enables users to halt the algorithm at any point without losing all progress made, providing some useful
insights or reduced data sets.
Pre-emptability: The algorithm should be designed to support suspension and resumption with minimal overhead. This capability
allows for the temporary suspension of the reduction process and its subsequent resumption without significant resource or
time penalties.
Q11. State different types of learning
The two primary types of learning methods in machine learning are supervised learning and unsupervised learning. Both types
of learning play essential roles in machine learning and data analysis. Supervised learning is suitable for tasks where labeled data
is available, enabling precise prediction or estimation based on known relationships. Unsupervised learning, on the other hand,
explores data structures, revealing insights or patterns without the need for labeled examples.
Supervised Learning
Definition: Supervised learning involves learning from labeled data, where the input data is paired with corresponding output labels or target values.
Tasks: Common tasks include classification and regression.
Teacher-Student Analogy: It operates with a "teacher" who provides labeled examples, allowing the algorithm to learn the relationship between input features and output labels.
Characteristics: The algorithm uses this labeled data to learn patterns, associations, or mappings between input and output. The goal is to predict or estimate the output for new, unseen input data accurately.
Examples: Predicting house prices based on features (regression) or classifying emails as spam or not spam (classification).
Unsupervised Learning
Definition: Unsupervised learning involves learning from unlabeled data, where the algorithm explores the inherent structure or patterns in the data without explicit output labels.
Tasks: Clustering, dimensionality reduction, and anomaly detection are common tasks.
No Teacher/Labels: There is no teacher or labeled output; the algorithm explores the data's structure, finding similarities, differences, or groupings within the data.
Characteristics: It discovers hidden patterns, structures, or relationships in the data without specific guidance or labeled examples.
Examples: Grouping similar documents together (clustering), reducing dimensions while retaining key information (dimensionality reduction), or detecting unusual patterns (anomaly detection) without labeled anomalies.
Q12. From the data, how do we determine what kind of learning task is defined for our application?
When data has been pre-processed, and the learning task for the application is defined, it leads to the consideration of various
data mining methodologies aligned with the problem's characteristics and the available dataset. These methodologies and their
associated computer-based tools include:
Statistical Methods: Utilized for inference and modeling relationships between variables. Techniques involve Bayesian inference,
logistic regression, ANOVA analysis, and log-linear models.
Cluster Analysis: Used for grouping similar data points to uncover patterns or similarities. Techniques include divisive algorithms,
agglomerative algorithms, partitional clustering, and incremental clustering.
Decision Trees and Rules: Developed mainly in artificial intelligence for inductive learning. Techniques encompass the CLS
method, ID3 algorithm, C4.5 algorithm, and pruning algorithms.
Association Rules: Discover associations among items in datasets. Methods include market basket analysis, Apriori algorithm, and
WWW path-traversal patterns.
Artificial Neural Networks: Mimic human brain behavior to learn patterns. Examples are multilayer perceptrons with
backpropagation, Kohonen networks, or convolutional neural networks.
Genetic Algorithms: Useful for solving hard optimization problems and often integrated into data mining algorithms.
Fuzzy Inference Systems: Based on fuzzy sets and logic for modeling uncertainty. Includes fuzzy modeling and decision-making
steps.
N-Dimensional Visualization Methods: Useful for exploring data patterns. Techniques include geometric, icon-based, pixel-
oriented, and hierarchical visualization.
Each methodology offers distinct approaches to data analysis, and the choice depends on the problem's nature, available data,
and the desired outcomes of the data mining task. These techniques empower analysts to extract meaningful insights and
patterns from data, aiding in informed decision-making.
Q13. Explain SVM Process in detail.
The SUPPORT VECTOR MACHINE (SVM) algorithm is a powerful supervised learning method primarily used for classification
tasks, although it can also handle regression problems. SVM aims to create a decision boundary that effectively separates
different classes in the input space.
Here's a detailed illustration of the SVM procedure:
1. SVM Classification and Regression: SVMs were initially developed for classification tasks and later extended to regression
problems as Support Vector Regression (SVR) in addition to the original Support Vector Classification (SVC).
2. Supervised Learning from Labeled Data: SVM operates as a supervised learning algorithm, relying on labeled training data
that includes input attributes and corresponding class labels for classification or continuous values for regression tasks.
3. Decision Planes and Class Separation: SVM's fundamental concept revolves around defining decision planes that act as
boundaries between different classes or categories within the input space.
4. Visualizing Data for Classification: Plotting data points aids in visualizing the classification task. For instance, in a binary
classification scenario with continuous attributes, representing data points on a graph helps understand the separation
between classes.
5. Optimal Separating Hyperplane: SVM aims to determine an optimal hyperplane that effectively separates classes while
maximizing the margin, which is the distance between the hyperplane and the closest data points from each class.
6. Maximizing Margin and Generalization: By maximizing the margin between the hyperplane and support vectors (closest data
points from different classes), SVM ensures robustness in classifying new, unseen instances, facilitating generalization.
7. Linear SVM for Linearly Separable Data: In cases where data is linearly separable, the optimal separating hyperplane takes the
form of a linear decision boundary. Linear SVM classifiers focus on identifying this hyperplane to effectively separate classes.
8. Objective of SVM: SVM's primary goal is to construct a model that generalizes well, enabling accurate predictions or
classifications by optimizing the margin between classes in the input space.
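As a brief, hedged illustration of this procedure, the scikit-learn sketch below trains a linear SVM on synthetic data; the generated dataset, the `C=1.0` setting, and the scaling pipeline are illustrative choices rather than part of the SVM definition.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem with four continuous attributes.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling matters because the margin is distance-based; the linear kernel looks
# for the optimal separating hyperplane described above.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("Support vectors per class:", model.named_steps["svc"].n_support_)
```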
Q14. Differentiate between Supervised and Unsupervised Learning.
Supervised Learning
• Trained using labeled data.
• Takes direct feedback to check predictions.
• Predicts the output based on input-output pairs.
• Input data is provided with corresponding outputs.
• Goal: Predict the output for new data.
• Requires supervision to train the model.
• Categorized into Classification and Regression problems.
• Used when both inputs and their corresponding outputs are known.
• Tends to produce accurate results.
• Training the model requires prior knowledge of the output for each data point.
• Examples: Linear Regression, Logistic Regression, SVM, Decision Trees.
Unsupervised Learning
• Trained using unlabeled data.
• Does not take any feedback.
• Finds hidden patterns in data.
• Only input data is provided.
• Goal: Find hidden patterns and insights.
• Trains without supervision.
• Classified into Clustering and Association problems.
• Used when only input data is available.
• May offer less accuracy compared to supervised learning.
• Learns patterns similarly to how a child learns.
• Examples: Clustering (e.g., K-Means), Apriori algorithm.
Q15. Explain kNN algorithm in all aspects of Machine Learning
The K-Nearest Neighbors (K-NN) algorithm is a straightforward machine learning method based on supervised learning. K-NN's
fundamental methodology involves computing similarities and assigning new data points to categories based on the majority
vote of their K nearest neighbors, making it a versatile and relatively simple algorithm for classification and regression tasks.
Here's an outline covering various aspects of the K-NN algorithm:
• K-NN falls under the category of Supervised Learning techniques.
• It operates on the assumption of similarity between new data and available data, classifying the new data into the most
similar category among available categories.
• K-NN stores all available data and classifies new data points based on their similarity to the stored dataset. This allows easy
categorization of new data as it appears.
• K-NN can be used for both Regression and Classification tasks, although it's more commonly used for Classification problems.
• It's a non-parametric algorithm, implying it doesn't make underlying data assumptions and instead learns directly from the
dataset.
• Often referred to as a lazy learner algorithm, K-NN doesn't immediately learn from the training set but stores the dataset
and performs classification at the time of prediction.
• During training, K-NN stores the dataset, and when new data arrives, it classifies it into the most similar category based on
the stored dataset. For instance, if there are two categories, A and B, and a new data point x1 is introduced, the K-
NN algorithm can help determine which category x1 belongs to based on similarity.
K-NN Working Principle
Step-1: Choose the number K of neighbors to consider.
Step-2: Compute the Euclidean distance from the new data point to each existing data point.
Step-3: Select the K nearest neighbors based on the calculated distances.
Step-4: Among these neighbors, count the data points in each category.
Step-5: Assign the new data point to the category with the maximum number of neighbors.
Step-6: The K-NN model is ready for classification.
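These steps map directly onto a few lines of scikit-learn, sketched below on the bundled Iris dataset; the choice of K = 5, the Euclidean metric, and the train/test split are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: choose K; Steps 2-5 (distances, nearest neighbours, majority vote) run inside predict().
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)                        # "lazy" learning: the data is simply stored

x1 = X_test[:1]                                  # a new, unseen data point
print("Predicted class for x1:", knn.predict(x1)[0])
print("Test accuracy:", knn.score(X_test, y_test))
```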
Copyright © 2023 Jayanti Rajdevendra Pande.
All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose
without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and
other applicable laws.
For any further queries contact on email: jayantipande17@gmail.com
Weitere ähnliche Inhalte

Ähnlich wie Data Mining Module 2 Business Analytics.

A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
IJERA Editor
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
Editor IJCATR
 

Ähnlich wie Data Mining Module 2 Business Analytics. (20)

IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
Data mining techniques a survey paper
Data mining techniques a survey paperData mining techniques a survey paper
Data mining techniques a survey paper
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining Algorithms
 
Module-4_Part-II.pptx
Module-4_Part-II.pptxModule-4_Part-II.pptx
Module-4_Part-II.pptx
 
Network Based Intrusion Detection System using Filter Based Feature Selection...
Network Based Intrusion Detection System using Filter Based Feature Selection...Network Based Intrusion Detection System using Filter Based Feature Selection...
Network Based Intrusion Detection System using Filter Based Feature Selection...
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data MiningIRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET- A Detailed Study on Classification Techniques for Data Mining
 
Data mining techniques
Data mining techniquesData mining techniques
Data mining techniques
 
M43016571
M43016571M43016571
M43016571
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
 
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET -  	  An User Friendly Interface for Data Preprocessing and Visualizati...IRJET -  	  An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
 
M5.pptx
M5.pptxM5.pptx
M5.pptx
 
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILESAN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
 
An efficient feature selection in
An efficient feature selection inAn efficient feature selection in
An efficient feature selection in
 

Mehr von Jayanti Pande

Mehr von Jayanti Pande (20)

Resume "My Content" Feature| Resume Tips.pdf
Resume "My Content" Feature| Resume Tips.pdfResume "My Content" Feature| Resume Tips.pdf
Resume "My Content" Feature| Resume Tips.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Web & Social Media Analytics Module 5.pdf
Web & Social Media Analytics Module 5.pdfWeb & Social Media Analytics Module 5.pdf
Web & Social Media Analytics Module 5.pdf
 
Web & Social Media Analytics Module 4.pdf
Web & Social Media Analytics Module 4.pdfWeb & Social Media Analytics Module 4.pdf
Web & Social Media Analytics Module 4.pdf
 
Web & Social Media Analytics Module 3.pdf
Web & Social Media Analytics Module 3.pdfWeb & Social Media Analytics Module 3.pdf
Web & Social Media Analytics Module 3.pdf
 
Web & Social Media Analytics Module 2.pdf
Web & Social Media Analytics Module 2.pdfWeb & Social Media Analytics Module 2.pdf
Web & Social Media Analytics Module 2.pdf
 
Web & Social Media Analytics Module 1.pdf
Web & Social Media Analytics Module 1.pdfWeb & Social Media Analytics Module 1.pdf
Web & Social Media Analytics Module 1.pdf
 
Basics of Research| Also Valuable for MBA Research Project Viva.pdf
Basics of Research| Also Valuable for MBA Research Project Viva.pdfBasics of Research| Also Valuable for MBA Research Project Viva.pdf
Basics of Research| Also Valuable for MBA Research Project Viva.pdf
 
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 5.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 5.pdfPERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 5.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 5.pdf
 
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2 ] Module 4.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2 ] Module 4.pdfPERFORMANCE MEASUREMENT SYSTEM [HR Paper 2 ] Module 4.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2 ] Module 4.pdf
 
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 3.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 3.pdfPERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 3.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 3.pdf
 
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 2.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 2.pdfPERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 2.pdf
PERFORMANCE MEASUREMENT SYSTEM [HR Paper 2] Module 2.pdf
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf
 
MBA Project Report ppt By Jayanti Pande.pdf
MBA Project Report ppt By Jayanti Pande.pdfMBA Project Report ppt By Jayanti Pande.pdf
MBA Project Report ppt By Jayanti Pande.pdf
 
MBA Project Report | By Jayanti Pande.pdf
MBA Project Report |  By Jayanti Pande.pdfMBA Project Report |  By Jayanti Pande.pdf
MBA Project Report | By Jayanti Pande.pdf
 
HR Paper 2 Module 1 INTRODUCTION TO PERFORMANCE MEASUREMENT .pdf
HR Paper 2 Module 1 INTRODUCTION TO PERFORMANCE MEASUREMENT .pdfHR Paper 2 Module 1 INTRODUCTION TO PERFORMANCE MEASUREMENT .pdf
HR Paper 2 Module 1 INTRODUCTION TO PERFORMANCE MEASUREMENT .pdf
 
Data Mining Module 5 Business Analytics.pdf
Data Mining Module 5 Business Analytics.pdfData Mining Module 5 Business Analytics.pdf
Data Mining Module 5 Business Analytics.pdf
 
Data Mining Module 4 Business Analytics.pdf
Data Mining Module 4 Business Analytics.pdfData Mining Module 4 Business Analytics.pdf
Data Mining Module 4 Business Analytics.pdf
 
Data Mining Module 3 Business Analtics..pdf
Data Mining Module 3 Business Analtics..pdfData Mining Module 3 Business Analtics..pdf
Data Mining Module 3 Business Analtics..pdf
 
Data Mining Module 1 Business Analytics.
Data Mining Module 1 Business Analytics.Data Mining Module 1 Business Analytics.
Data Mining Module 1 Business Analytics.
 

Kürzlich hochgeladen

Poster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdfPoster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdf
Alexander Litvinenko
 
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
Krashi Coaching
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
中 央社
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
中 央社
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 

Kürzlich hochgeladen (20)

Poster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdfPoster_density_driven_with_fracture_MLMC.pdf
Poster_density_driven_with_fracture_MLMC.pdf
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptx
 
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
 
Word Stress rules esl .pptx
Word Stress rules esl               .pptxWord Stress rules esl               .pptx
Word Stress rules esl .pptx
 
Mattingly "AI and Prompt Design: LLMs with Text Classification and Open Source"
Mattingly "AI and Prompt Design: LLMs with Text Classification and Open Source"Mattingly "AI and Prompt Design: LLMs with Text Classification and Open Source"
Mattingly "AI and Prompt Design: LLMs with Text Classification and Open Source"
 
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
MSc Ag Genetics & Plant Breeding: Insights from Previous Year JNKVV Entrance ...
 
diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
 
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinhĐề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
 
Benefits and Challenges of OER by Shweta Babel.pptx
Benefits and Challenges of OER by Shweta Babel.pptxBenefits and Challenges of OER by Shweta Babel.pptx
Benefits and Challenges of OER by Shweta Babel.pptx
 
UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024
 
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading RoomImplanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
Implanted Devices - VP Shunts: EMGuidewire's Radiology Reading Room
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community PartnershipsSpring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
 
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
 Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
Dementia (Alzheimer & vasular dementia).
Dementia (Alzheimer & vasular dementia).Dementia (Alzheimer & vasular dementia).
Dementia (Alzheimer & vasular dementia).
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 

Data Mining Module 2 Business Analytics.

  • 1. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY MBA SEMESTER: 3 SPECIALIZATION BUSINESS ANALYTICS (BA 2) SUBJECT DATA MINING MODULE NO : 2 DATA REDUCTION - Jayanti R Pande DGICM College, Nagpur
  • 2. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q1. What is Data Reduction? DATA REDUCTION • Data reduction refers to the process of reducing the volume of data while preserving its integrity and meaningfulness. • It involves techniques aimed at minimizing the storage space required to store data or reducing the computational resources needed to process it, without significantly compromising its informational content. • Data reduction techniques are crucial in handling large datasets efficiently, improving analysis speed, minimizing storage requirements, and facilitating easier processing and analysis of data without losing essential information. • The primary goal of data reduction is to simplify complex datasets by eliminating redundant, irrelevant, or noisy information, thereby improving the efficiency of data storage, processing, and analysis without significantly compromising the accuracy or integrity of the data. DEFINITION OF DATA REDUCTION Data reduction is a process used in data analysis to decrease the volume or size of a dataset while retaining its essential information and maintaining its quality.
  • 3. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q2. What are dimensions of large data sets? DIMENSIONS OF LARGE DATA SETS 1. Volume: This refers to the sheer size of the dataset, usually measured in terms of the amount of data it contains. Large datasets often contain terabytes, petabytes, or even exabytes of information. 2. Velocity: It represents the speed at which data is generated, collected, processed, and analyzed. In the context of big data, the velocity dimension emphasizes the rapid rate at which data streams in, requiring real-time or near-real-time processing. 3. Variety: It signifies the diversity of data types and sources within a dataset. Large datasets often comprise structured, semi- structured, and unstructured data from various sources such as text, images, videos, sensor data, social media feeds, etc. 4. Veracity: It pertains to the quality and reliability of the data. Large datasets can contain noisy, inconsistent, or incomplete data, making it crucial to assess and ensure data accuracy and reliability. 5. Value: This dimension represents the potential insights, knowledge, or actionable information that can be derived from analyzing the dataset. It's essential to extract meaningful value from large datasets to justify the resources invested in collecting, storing, and analyzing them. 6. Variability: It refers to the inconsistency or fluctuation in data flow and structure over time. Large datasets might have dynamic characteristics, where the data distribution, patterns, or formats change over different periods.
  • 4. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q3. What is Relief Algorithm? The RELIEF ALGORITHM is a machine learning technique used for feature selection or attribute weighting in supervised learning tasks, especially in classification problems. It was initially introduced for handling high-dimensional datasets and is particularly useful when dealing with noisy or irrelevant features. The Relief algorithm operates by estimating the relevance of features by analyzing their contribution to the classification task. It works as follows: 1 Initialization: The algorithm starts by initializing the weights for each feature to zero. 2 Iterative Process: For each instance in the dataset: • Randomly select an instance from the dataset. • Identify its k nearest neighbors, where k is a user-defined parameter. • Separate the instances belonging to the same class (nearest neighbors) and those from different classes. • Update the weights of features based on the differences between the selected instance and its nearest neighbors. 3 Weight Update: The algorithm adjusts the feature weights as follows: • For continuous features: The weights are updated by considering the differences between the feature values of the selected instance and its nearest neighbors. • For categorical features: The algorithm calculates the weights based on the occurrences of different attribute values among neighbors. 4 Final Feature Ranking: After iterating through the dataset, the Relief algorithm ranks the features based on their weights. Features with higher weights are considered more relevant or discriminatory for the classification task.
  • 5. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q4. Explain about Feature Reduction. FEATURE REDUCTION, also known as feature selection or dimensionality reduction, involves techniques to decrease the number of features or variables in a dataset while retaining its essential information. The primary goal is to simplify the dataset, making it more manageable and efficient for analysis without losing critical information. Here are some common techniques for feature reduction: Filter Methods: These methods select features based on statistical properties like correlation, variance, or information gain without involving a machine learning model. Wrapper Methods: Wrapper methods use specific machine learning algorithms to evaluate subsets of features, selecting the best set that yields the highest model performance. Embedded Methods: These methods integrate feature selection within the process of model training. Algorithms like LASSO regression or tree-based models perform feature selection as part of their learning process. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE) transform high-dimensional data into a lower-dimensional space while preserving essential information. Q5. What is value reduction? VALUE REDUCTION often refers to a process aimed at refining or improving the quality of a dataset by eliminating irrelevant, redundant, or noisy data. This process involves various techniques such as: Feature Selection: Identifying and retaining the most relevant and informative features from a dataset, while disregarding less important or redundant ones. This step helps in reducing dimensionality and focusing on essential information. Data Cleansing: Removing errors, inconsistencies, or outliers from the dataset to enhance data accuracy and reliability. Dimensionality Reduction: Applying methods like Principal Component Analysis (PCA) or other dimensionality reduction techniques to reduce the number of variables or features while preserving the most critical information. Sampling: Utilizing sampling methods to extract representative subsets of data for analysis, reducing the dataset's volume without losing its significant characteristics.
  • 6. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q6. What are entropy measures for ranking features? ENTROPY-BASED METHODS are commonly used in feature selection to rank features according to their relevance in a dataset. Entropy measures the uncertainty or randomness in a dataset, and by analyzing how features reduce this uncertainty, we can determine their importance for classification or prediction tasks. Two common entropy-based metrics used for feature ranking are: 1.Information Gain (IG): Information Gain measures the reduction in entropy achieved by splitting a dataset based on a particular feature. It's commonly used in decision tree algorithms for feature selection. Features with higher information gain are cAonsidered more informative for classification. 2.Mutual Information (MI): Mutual Information calculates the amount of information one feature provides about another. When applied to feature ranking, it quantifies how much knowing the value of one feature reduces uncertainty about another feature. High mutual information between a feature and the target variable signifies its importance in prediction. The process typically involves computing these metrics for each feature and ranking them based on their Information Gain or Mutual Information scores. Features with higher scores are considered more relevant or informative for the given task.
  • 7. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q7. Write a note on PCA. PCA stands for Principal Component Analysis. It's a statistical method used for dimensionality reduction in data analysis and machine learning. PCA aims to transform high-dimensional data into a lower-dimensional space while preserving as much variance (or information) as possible. Here's an overview of how PCA works: • Data Representation: PCA starts with a dataset consisting of variables (features) possibly correlated with each other. • Covariance Matrix: PCA computes the covariance matrix of the dataset. This matrix shows how variables change together. It's essential for understanding the relationships between different variables. • Eigenvalue Decomposition: PCA finds the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of maximum variance in the data, while eigenvalues indicate the magnitude of variance along those directions. • Principal Components: PCA sorts the eigenvectors based on their corresponding eigenvalues in descending order. These sorted eigenvectors become the principal components. The first principal component captures the most variance in the data, followed by the second, third, and so on. • Dimensionality Reduction: To reduce the dimensionality of the data, PCA selects a subset of principal components that capture the most significant variance. By choosing fewer principal components, the dataset is transformed into a lower- dimensional space while retaining as much relevant information as possible. PCA is useful for various purposes: Dimensionality Reduction: Reducing the number of features while maintaining most of the information. Data Visualization: Visualizing high-dimensional data in two or three dimensions for better understanding. Noise Reduction: Removing noise or redundant information from the dataset.
  • 8. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q8. What is Feature Discretisation? Explain about Chi Merge Technique. FEATURE DISCRETIZATION is the process of converting continuous variables or features into discrete or categorical ones. This transformation is often useful for certain machine learning algorithms that perform better with categorical data or when dealing with datasets that contain continuous values and require discretization for analysis purposes. Chi-Merge is one of the techniques used for feature discretization. It's a method based on the statistical significance of adjacent intervals to merge them into larger intervals if their statistical properties, such as the distribution of classes or values within them, are similar enough. The steps involved in the Chi-Merge technique are: Initialization: Begin with a continuous variable that needs discretization. Initial Interval Creation: Initially, each unique value of the continuous variable forms a separate interval. Chi-Squared Test: Compute the Chi-Squared statistic between adjacent intervals. The Chi-Squared test measures the independence of variables and helps determine if the adjacent intervals should be merged. Merge Process: Merge adjacent intervals if the Chi-Squared statistic meets a specified threshold or if their statistical properties are similar enough. This merging process continues until all adjacent intervals satisfy certain criteria (e.g., the Chi-Squared statistic does not exceed a predetermined threshold). Final Discretization: Once the merging process is complete, the resulting intervals represent the discretized or binned version of the original continuous variable. Chi-Merge helps in reducing the number of intervals or bins while maintaining meaningful distinctions in the data distribution. It ensures that adjacent intervals are merged based on their statistical similarities, thus creating fewer but more representative categories for the discretized variable.
  • 9. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q9. Discuss the major comparison parameters used in data reduction techniques When employing data reduction techniques in preparation for data mining, various comparison parameters play a crucial role in assessing the effectiveness and trade-offs involved in the process. The major comparison parameters include: 1 Computing Time: Simplifying the dataset through data reduction ideally leads to reduced computing time during data mining. By decreasing the volume or dimensions of the data, computational processes such as model training, analysis, and querying become more efficient. However, it's essential to strike a balance between spending time on pre-processing tasks (such as dimensionality reduction) and the overall improvement gained in computational efficiency. 2 Predictive/Descriptive Accuracy: The accuracy of data mining models is paramount. By using only relevant and significant features derived from data reduction, the expectation is that data mining algorithms can learn more swiftly and produce more accurate models. Removing irrelevant or redundant features mitigates the risk of misleading the learning process, leading to improved predictive or descriptive accuracy of the models generated. 3 Representation of Data-Mining Models: Simplifying the representation of the data-mining model through data reduction often results in models that are easier to interpret and understand. Simpler models derived from reduced data dimensions enhance interpretability. While there might be a slight trade-off in accuracy, achieving a balance between accuracy and simplicity in representation is essential. Dimensionality reduction aids in striking this balance by simplifying the model representation without sacrificing accuracy significantly. Striving to achieve reduced computational time, improved accuracy, and a simplified representation simultaneously through dimensionality reduction is an ideal scenario. However, in practical applications, there might be trade-offs among these parameters. Balancing these factors becomes crucial, as extreme reduction efforts might impact accuracy, and overly complex models might hinder interpretability.
  • 10. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved. Q10. Discuss in brief the recommended characteristics of data reduction algorithms. The recommended characteristics of data reduction algorithms play a crucial role in designing effective techniques that enable efficient and accurate reduction of data. Here are the key characteristics: Measurable Quality: The algorithms should produce quantifiable results regarding the quality of the approximated data set after reduction. This allows for precise measurement and evaluation of the effectiveness of the reduction process. Recognizable Quality: The quality of the approximated results should be easily determinable before any data mining procedure is applied. This facilitates the assessment of the effectiveness of reduction in real-time, allowing adjustments or optimizations as needed. Monotonicity: As the reduction algorithms are often iterative in nature, the quality of results should consistently improve or, at the very least, remain the same with each iteration. The algorithm's output should be a non-decreasing function of both time and input data quality. Consistency: The quality of the results achieved should demonstrate a correlation with computation time and the quality of the input data. This characteristic ensures that the outcomes are reliable and predictable based on these factors. Diminishing Returns: The algorithm should exhibit diminishing returns, where the initial iterations lead to more significant improvements in the solution. As the process continues, the rate of improvement gradually diminishes until reaching a point of diminishing marginal returns. Interruptability: The algorithm should allow for interruption at any stage, providing intermediate results or partial solutions. This feature is crucial as it enables users to halt the algorithm at any point without losing all progress made, providing some useful insights or reduced data sets. Pre-emptability: The algorithm should be designed to support suspension and resumption with minimal overhead. This capability allows for the temporary suspension of the reduction process and its subsequent resumption without significant resource or time penalties.
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
Q11. State different types of learning.
The two primary types of learning methods in machine learning are supervised learning and unsupervised learning. Both play essential roles in machine learning and data analysis. Supervised learning suits tasks where labeled data is available, enabling precise prediction or estimation based on known relationships. Unsupervised learning, on the other hand, explores data structures, revealing insights or patterns without the need for labeled examples.
SUPERVISED LEARNING
• Definition: Supervised learning involves learning from labeled data, where each input is paired with a corresponding output label or target value.
• Tasks: Common tasks include classification and regression.
• Teacher-Student Analogy: It operates with a "teacher" who provides labeled examples, allowing the algorithm to learn the relationship between input features and output labels.
• Characteristics: The algorithm uses the labeled data to learn patterns, associations, or mappings between input and output. The goal is to accurately predict or estimate the output for new, unseen input data.
• Examples: Predicting house prices based on features (regression) or classifying emails as spam or not spam (classification).
UNSUPERVISED LEARNING
• Definition: Unsupervised learning involves learning from unlabeled data, where the algorithm explores the inherent structure or patterns in the data without explicit output labels.
• Tasks: Clustering, dimensionality reduction, and anomaly detection are common tasks.
• No Teacher/Labels: There is no teacher or labeled output; the algorithm explores the data's structure, finding similarities, differences, or groupings within it.
• Characteristics: It discovers hidden patterns, structures, or relationships in the data without specific guidance or labeled examples.
• Examples: Grouping similar documents together (clustering), reducing dimensions while retaining key information (dimensionality reduction), or detecting unusual patterns without labeled anomalies (anomaly detection).
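A minimal sketch contrasting the two learning styles, assuming scikit-learn is installed; the Iris dataset and the specific estimators (logistic regression for the supervised case, k-means for the unsupervised case) are illustrative choices, not prescribed by the text above.

```python
# Minimal sketch: the same data handled with supervised and unsupervised learning
# (illustrative; assumes scikit-learn is installed).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are given, and the model learns an input-to-output mapping.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy on the training data:", clf.score(X, y))

# Unsupervised: only X is used; the algorithm looks for structure (here, 3 groups).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels found without using y:", km.labels_[:10])
```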
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
Q12. How do we determine from the data what kind of learning task is defined for our application?
Once the data has been pre-processed and the learning task for the application is defined, various data mining methodologies can be considered, aligned with the problem's characteristics and the available dataset. These methodologies and their associated computer-based tools include:
• Statistical Methods: Used for inference and for modeling relationships between variables. Techniques include Bayesian inference, logistic regression, ANOVA analysis, and log-linear models.
• Cluster Analysis: Used for grouping similar data points to uncover patterns or similarities. Techniques include divisive algorithms, agglomerative algorithms, partitional clustering, and incremental clustering.
• Decision Trees and Rules: Developed mainly in artificial intelligence for inductive learning. Techniques encompass the CLS method, the ID3 algorithm, the C4.5 algorithm, and pruning algorithms (see the sketch after this list).
• Association Rules: Discover associations among items in datasets. Methods include market basket analysis, the Apriori algorithm, and WWW path-traversal patterns.
• Artificial Neural Networks: Mimic the behavior of the human brain to learn patterns. Examples are multilayer perceptrons with backpropagation, Kohonen networks, and convolutional neural networks.
• Genetic Algorithms: Useful for solving hard optimization problems and often integrated into data mining algorithms.
• Fuzzy Inference Systems: Based on fuzzy sets and fuzzy logic for modeling uncertainty; they include fuzzy modeling and fuzzy decision-making steps.
• N-Dimensional Visualization Methods: Useful for exploring data patterns. Techniques include geometric, icon-based, pixel-oriented, and hierarchical visualization.
Each methodology offers a distinct approach to data analysis, and the choice depends on the nature of the problem, the available data, and the desired outcomes of the data mining task. These techniques enable analysts to extract meaningful insights and patterns from data, aiding informed decision-making.
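As a small illustration of one methodology from the list, the following sketch induces a decision tree, assuming scikit-learn is available. Note that scikit-learn's tree implementation is CART-based, which is similar in spirit to, but not identical with, the ID3/C4.5 algorithms named above; the Iris dataset and the depth limit are illustrative assumptions.

```python
# Minimal sketch of one listed methodology: decision-tree induction
# (illustrative; scikit-learn's tree is CART-based, close in spirit to ID3/C4.5).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Print the induced rules as human-readable if/else conditions.
print(export_text(tree, feature_names=list(iris.feature_names)))
```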
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
Q13. Explain the SVM process in detail.
The SUPPORT VECTOR MACHINE (SVM) algorithm is a powerful supervised learning method used primarily for classification tasks, although it can also handle regression problems. SVM aims to create a decision boundary that effectively separates the different classes in the input space. The SVM procedure can be described as follows:
1. SVM Classification and Regression: SVMs were initially developed for classification tasks and later extended to regression problems as Support Vector Regression (SVR), in addition to the original Support Vector Classification (SVC).
2. Supervised Learning from Labeled Data: SVM operates as a supervised learning algorithm, relying on labeled training data that includes input attributes and corresponding class labels (for classification) or continuous target values (for regression).
3. Decision Planes and Class Separation: SVM's fundamental concept revolves around defining decision planes that act as boundaries between the different classes or categories within the input space.
4. Visualizing Data for Classification: Plotting the data points aids in visualizing the classification task. For instance, in a binary classification scenario with continuous attributes, representing the data points on a graph helps show the separation between classes.
5. Optimal Separating Hyperplane: SVM determines an optimal hyperplane that separates the classes while maximizing the margin, i.e. the distance between the hyperplane and the closest data points from each class.
6. Maximizing Margin and Generalization: By maximizing the margin between the hyperplane and the support vectors (the closest data points from the different classes), SVM remains robust when classifying new, unseen instances, which aids generalization.
7. Linear SVM for Linearly Separable Data: When the data is linearly separable, the optimal separating hyperplane is a linear decision boundary. Linear SVM classifiers focus on identifying this hyperplane to separate the classes effectively (a code sketch of this case is given below).
8. Objective of SVM: SVM's primary goal is to construct a model that generalizes well, enabling accurate predictions or classifications by optimizing the margin between classes in the input space.
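The following is a minimal sketch of steps 5-7, assuming scikit-learn is available; the toy 2-D data points are invented here purely for illustration and are not taken from the original material.

```python
# Minimal sketch: fitting a linear SVM and inspecting the separating hyperplane
# (illustrative; assumes scikit-learn is installed; toy data invented here).
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in a 2-D input space.
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w . x + b = 0 is the optimal separating hyperplane; the margin width is 2 / ||w||.
w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane normal w:", w, " intercept b:", b)
print("Margin width:", 2 / np.linalg.norm(w))
print("Support vectors (closest points to the boundary):")
print(clf.support_vectors_)
```

For data that is not linearly separable, a non-linear kernel (for example kernel="rbf") can be used in place of the linear kernel shown in this sketch.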
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
Q14. Differentiate between Supervised and Unsupervised Learning.
Supervised Learning | Unsupervised Learning
• Trained using labeled data. | Trained using unlabeled data.
• Takes direct feedback to check predictions. | Does not take any feedback.
• Predicts the output based on input-output pairs. | Finds hidden patterns in data.
• Input data is provided together with the corresponding outputs. | Only input data is provided.
• Goal: predict the output for new data. | Goal: find hidden patterns and insights.
• Requires supervision to train the model. | Trains without supervision.
• Categorized as classification and regression problems. | Categorized as clustering and association problems.
• Used when both the inputs and the corresponding outputs are known. | Used when only input data is available.
• Tends to produce accurate results. | May offer less accuracy compared to supervised learning.
• Training the model requires prior knowledge of the label for each data point. | Learns patterns on its own, much as a child learns.
• Examples: Linear Regression, Logistic Regression, SVM, Decision Trees, K-NN. | Examples: K-Means clustering, hierarchical clustering, the Apriori algorithm.
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
Q15. Explain the kNN algorithm in all aspects of Machine Learning.
The K-Nearest Neighbors (K-NN) algorithm is a straightforward machine learning method based on supervised learning. Its fundamental methodology is to compute similarities and assign a new data point to a category based on the majority vote of its K nearest neighbors, making it a versatile and relatively simple algorithm for classification and regression tasks. The main aspects of the K-NN algorithm are:
• K-NN falls under the category of supervised learning techniques.
• It operates on the assumption of similarity between the new data and the available data, classifying the new data into the most similar of the available categories.
• K-NN stores all available data and classifies new data points based on their similarity to the stored dataset, so new data can be categorized easily as it appears.
• K-NN can be used for both regression and classification tasks, although it is more commonly used for classification problems.
• It is a non-parametric algorithm: it makes no assumptions about the underlying data and instead learns directly from the dataset.
• It is often referred to as a lazy learner algorithm because it does not learn from the training set immediately; it stores the dataset and performs classification only at prediction time.
• During training, K-NN simply stores the dataset; when new data arrives, it classifies it into the most similar category based on the stored data.
Example: If there are two categories, A and B, and a new data point x1 is introduced, the K-NN algorithm can determine which category x1 belongs to based on similarity.
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
K-NN WORKING PRINCIPLE
Step-1: Choose the number K of neighbors to consider.
Step-2: Compute the Euclidean distance from the new data point to the stored data points.
Step-3: Select the K nearest neighbors based on the calculated distances.
Step-4: Among these K neighbors, count the data points in each category.
Step-5: Assign the new data point to the category with the maximum number of neighbors.
Step-6: The K-NN model is ready for classification (a from-scratch sketch of these steps follows below).
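The following is a minimal from-scratch sketch of the six steps above; the toy data points, the labels A and B, and the choice K = 3 are illustrative assumptions, not taken from the original material.

```python
# Minimal from-scratch sketch of the K-NN steps above (toy data invented here).
import math
from collections import Counter

def knn_classify(new_point, data, labels, k=3):
    # Step-1 and Step-2: choose K and compute the Euclidean distance
    # from the new point to every stored data point.
    distances = [
        (math.dist(new_point, point), label)
        for point, label in zip(data, labels)
    ]
    # Step-3: select the K nearest neighbors.
    nearest = sorted(distances)[:k]
    # Step-4 and Step-5: count the categories among them and pick the majority class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two categories, A and B, and a new point x1, as in the example above.
data = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
x1 = (2, 2)
print("x1 is assigned to category:", knn_classify(x1, data, labels, k=3))
```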
Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and other applicable laws.
For any further queries contact on email: jayantipande17@gmail.com