Data Mining Module 2 Business Analytics.
- 1. Copyright © 2023 Jayanti Rajdevendra Pande. All rights reserved.
RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY
MBA
SEMESTER: 3
SPECIALIZATION
BUSINESS ANALYTICS (BA 2)
SUBJECT
DATA MINING
MODULE NO : 2
DATA REDUCTION
- Jayanti R Pande
DGICM College, Nagpur
Q1. What is Data Reduction?
DATA REDUCTION
• Data reduction refers to the process of reducing the volume of data while preserving its integrity and meaningfulness.
• It involves techniques aimed at minimizing the storage space required to store data or reducing the computational
resources needed to process it, without significantly compromising its informational content.
• Data reduction techniques are crucial in handling large datasets efficiently, improving analysis speed, minimizing storage
requirements, and facilitating easier processing and analysis of data without losing essential information.
• The primary goal of data reduction is to simplify complex datasets by eliminating redundant, irrelevant, or noisy
information, thereby improving the efficiency of data storage, processing, and analysis without significantly compromising
the accuracy or integrity of the data.
DEFINITION OF DATA REDUCTION
Data reduction is a process used in data analysis to decrease the volume or size of a dataset while retaining its essential
information and maintaining its quality.
Q2. What are dimensions of large data sets?
DIMENSIONS OF LARGE DATA SETS
1. Volume: This refers to the sheer size of the dataset, usually measured in terms of the amount of data it contains. Large
datasets often contain terabytes, petabytes, or even exabytes of information.
2. Velocity: It represents the speed at which data is generated, collected, processed, and analyzed. In the context of big data,
the velocity dimension emphasizes the rapid rate at which data streams in, requiring real-time or near-real-time
processing.
3. Variety: It signifies the diversity of data types and sources within a dataset. Large datasets often comprise structured, semi-
structured, and unstructured data from various sources such as text, images, videos, sensor data, social media feeds, etc.
4. Veracity: It pertains to the quality and reliability of the data. Large datasets can contain noisy, inconsistent, or incomplete
data, making it crucial to assess and ensure data accuracy and reliability.
5. Value: This dimension represents the potential insights, knowledge, or actionable information that can be derived from
analyzing the dataset. It's essential to extract meaningful value from large datasets to justify the resources invested in
collecting, storing, and analyzing them.
6. Variability: It refers to the inconsistency or fluctuation in data flow and structure over time. Large datasets might have
dynamic characteristics, where the data distribution, patterns, or formats change over different periods.
Q3. What is the Relief Algorithm?
The RELIEF ALGORITHM is a machine learning technique used for feature selection or attribute weighting in supervised
learning tasks, especially in classification problems. It was initially introduced for handling high-dimensional datasets and is
particularly useful when dealing with noisy or irrelevant features.
The Relief algorithm operates by estimating the relevance of features by analyzing their contribution to the classification task.
It works as follows:
1 Initialization: The algorithm starts by initializing the weights for each feature to zero.
2 Iterative Process: Repeat for a user-defined number of iterations:
• Randomly select an instance from the dataset.
• Find its nearest neighbor from the same class (the nearest hit) and its nearest neighbor from a different class (the nearest
miss). The ReliefF extension uses k nearest hits and k nearest misses, where k is a user-defined parameter.
• Update the weight of each feature based on the differences between the selected instance and these neighbors.
3 Weight Update: For each feature, the weight decreases when the selected instance differs from its nearest hit and increases
when it differs from its nearest miss:
• For continuous features: The difference is the absolute difference between the two feature values, normalized by the
feature's range.
• For categorical features: The difference is 0 if the two values are equal and 1 otherwise.
4 Final Feature Ranking: After iterating through the dataset, the Relief algorithm ranks the features based on their weights.
Features with higher weights are considered more relevant or discriminatory for the classification task.
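The steps above can be sketched in a few lines of Python. This is a minimal illustration of the basic two-class Relief update with one nearest hit and one nearest miss, assuming numeric features only and at least two instances per class. The function name `relief` and its parameters are hypothetical, not taken from any library.

```python
import random

def relief(X, y, n_iterations=None, seed=0):
    """Return one weight per feature (basic Relief, numeric features only)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    n_iterations = n_iterations or n
    # Feature ranges, used to scale differences into [0, 1]; 1.0 avoids dividing by zero.
    ranges = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d                              # Step 1: start at zero
    for _ in range(n_iterations):                    # Step 2: sample instances
        i = rng.randrange(n)
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        near_hit = min(hits, key=lambda k: distance(X[i], X[k]))
        near_miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(d):                           # Step 3: update weights
            weights[j] += (diff(j, X[i], X[near_miss])
                           - diff(j, X[i], X[near_hit])) / n_iterations
    return weights                                   # Step 4: rank by weight
```

Sorting the feature indices by the returned weights, highest first, gives the final ranking. A feature that separates the classes ends up with a positive weight, while a feature that varies mostly within a class drifts toward zero or below.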
Q4. Explain Feature Reduction.
FEATURE REDUCTION, often used interchangeably with feature selection or dimensionality reduction, involves techniques to decrease the
number of features or variables in a dataset while retaining its essential information. Strictly, feature selection keeps a subset of the
original features, whereas dimensionality reduction may also construct new ones. The primary goal is to simplify the dataset, making it
more manageable and efficient for analysis without losing critical information.
Here are some common techniques for feature reduction:
Filter Methods: These methods select features based on statistical properties like correlation, variance, or information gain without
involving a machine learning model.
Wrapper Methods: Wrapper methods use specific machine learning algorithms to evaluate subsets of features, selecting the best set that
yields the highest model performance.
Embedded Methods: These methods integrate feature selection within the process of model training. Algorithms like LASSO regression or
tree-based models perform feature selection as part of their learning process.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE)
transform high-dimensional data into a lower-dimensional space while preserving essential information.
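Of the techniques above, a filter method is the simplest to sketch. The example below keeps only the features whose variance exceeds a threshold, on the assumption that a near-constant feature carries little information. The function name `variance_filter`, the feature names, and the threshold value are illustrative choices, not part of the source material.

```python
from statistics import pvariance

def variance_filter(rows, names, threshold=0.01):
    """Return the names of features whose population variance exceeds threshold."""
    columns = list(zip(*rows))          # turn rows into one tuple per feature
    return [name for name, col in zip(names, columns) if pvariance(col) > threshold]

# Hypothetical data: the middle column never changes, so it is filtered out.
rows = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0]]
kept = variance_filter(rows, ["age", "constant", "flag"])
```

No model is trained here, which is what distinguishes a filter method from wrapper and embedded methods.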
Q5. What is value reduction?
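VALUE REDUCTION, in its narrower data-mining sense, means reducing the number of distinct values a feature takes, for example by
rounding, binning, or discretizing a continuous feature. The term is also used more loosely for refining or improving the quality of a
dataset by eliminating irrelevant, redundant, or noisy data. In that broader sense it involves techniques such as: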
Feature Selection: Identifying and retaining the most relevant and informative features from a dataset, while disregarding less important
or redundant ones. This step helps in reducing dimensionality and focusing on essential information.
Data Cleansing: Removing errors, inconsistencies, or outliers from the dataset to enhance data accuracy and reliability.
Dimensionality Reduction: Applying methods like Principal Component Analysis (PCA) or other dimensionality reduction techniques to
reduce the number of variables or features while preserving the most critical information.
Sampling: Utilizing sampling methods to extract representative subsets of data for analysis, reducing the dataset's volume without losing
its significant characteristics.
Q6. What are entropy measures for ranking features?
ENTROPY-BASED METHODS are commonly used in feature selection to rank features according to their relevance in
a dataset. Entropy measures the uncertainty or randomness in a dataset, and by analyzing how features reduce this
uncertainty, we can determine their importance for classification or prediction tasks.
Two common entropy-based metrics used for feature ranking are:
1. Information Gain (IG): Information Gain measures the reduction in entropy of the class label achieved by splitting a
dataset based on a particular feature. It's commonly used in decision tree algorithms for feature selection. Features
with higher information gain are considered more informative for classification.
2. Mutual Information (MI): Mutual Information measures how much information one variable provides about
another. When applied to feature ranking, it quantifies how much knowing the value of a feature reduces
uncertainty about the target variable. High mutual information between a feature and the target variable signifies
its importance in prediction.
The process typically involves computing these metrics for each feature and ranking them based on their
Information Gain or Mutual Information scores. Features with higher scores are considered more relevant or
informative for the given task.
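As a rough illustration of this ranking process, the sketch below computes entropy and Information Gain for categorical features and sorts the features by score. It assumes discrete feature values and discrete class labels. The function names and the toy table are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(table, labels):
    """table maps feature name -> list of values; returns names, best first."""
    gains = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Toy data: 'outlook' determines the label exactly, 'windy' tells us nothing.
labels = ["yes", "yes", "no", "no"]
table = {"windy": ["t", "f", "t", "f"], "outlook": ["sun", "sun", "rain", "rain"]}
ranking = rank_features(table, labels)
```

For a discrete feature and a discrete target, the quantity computed by `information_gain` is the same as the mutual information between the feature and the target, so one function serves for both metrics in this setting.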
Q7. Write a note on PCA.
PCA stands for Principal Component Analysis. It's a statistical method used for dimensionality reduction in data analysis and
machine learning. PCA aims to transform high-dimensional data into a lower-dimensional space while preserving as much
variance (or information) as possible.
Here's an overview of how PCA works:
• Data Representation: PCA starts with a dataset consisting of variables (features) possibly correlated with each other. The
variables are usually mean-centered, and often standardized, before the next step.
• Covariance Matrix: PCA computes the covariance matrix of the dataset. This matrix shows how variables change together. It's
essential for understanding the relationships between different variables.
• Eigenvalue Decomposition: PCA finds the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the
directions of maximum variance in the data, while eigenvalues indicate the magnitude of variance along those directions.
• Principal Components: PCA sorts the eigenvectors based on their corresponding eigenvalues in descending order. These
sorted eigenvectors become the principal components. The first principal component captures the most variance in the data,
followed by the second, third, and so on.
• Dimensionality Reduction: To reduce the dimensionality of the data, PCA selects a subset of principal components that
capture the most significant variance. By choosing fewer principal components, the dataset is transformed into a lower-
dimensional space while retaining as much relevant information as possible.
PCA is useful for various purposes:
Dimensionality Reduction: Reducing the number of features while maintaining most of the information.
Data Visualization: Visualizing high-dimensional data in two or three dimensions for better understanding.
Noise Reduction: Removing noise or redundant information from the dataset.
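The PCA steps described above can be sketched without any external library. The example below centers the data, builds the covariance matrix, and then finds only the first principal component. Note the substitution: instead of a full eigenvalue decomposition it uses power iteration, a simple approximation that converges to the eigenvector with the largest eigenvalue (it can fail if the start vector happens to be orthogonal to that eigenvector). All function names are hypothetical.

```python
from math import sqrt

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: approximates the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [float(j + 1) for j in range(d)]      # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    return eigenvalue, v

def project(centered, component):
    """Coordinate of each row along the chosen principal component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]
```

Dividing the returned eigenvalue by the sum of the diagonal of the covariance matrix gives the share of total variance captured by the first component, which is the usual basis for deciding how many components to keep.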
Q8. What is Feature Discretization? Explain the Chi-Merge Technique.
FEATURE DISCRETIZATION is the process of converting continuous variables or features into discrete or categorical ones. This
transformation is often useful for certain machine learning algorithms that perform better with categorical data or when
dealing with datasets that contain continuous values and require discretization for analysis purposes.
Chi-Merge is one of the techniques used for feature discretization. It's a method based on the statistical significance of adjacent
intervals to merge them into larger intervals if their statistical properties, such as the distribution of classes or values within
them, are similar enough.
The steps involved in the Chi-Merge technique are:
Initialization: Begin with a continuous variable that needs discretization and sort its values.
Initial Interval Creation: Initially, each unique value of the continuous variable forms a separate interval.
Chi-Squared Test: Compute the Chi-Squared statistic for every pair of adjacent intervals. The test measures whether the class
distribution is independent of the interval; a low value means the two intervals have similar class distributions and are
candidates for merging.
Merge Process: Merge the pair of adjacent intervals with the lowest Chi-Squared statistic, provided it is below a specified
threshold. Repeat until every pair of adjacent intervals has a Chi-Squared statistic above the threshold (or a desired number
of intervals is reached).
Final Discretization: Once the merging process is complete, the resulting intervals represent the discretized or binned version of
the original continuous variable.
Chi-Merge helps in reducing the number of intervals or bins while maintaining meaningful distinctions in the data distribution. It
ensures that adjacent intervals are merged based on their statistical similarities, thus creating fewer but more representative
categories for the discretized variable.
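A minimal sketch of the Chi-Merge procedure follows. It assumes a single numeric feature with class labels, repeatedly merges the adjacent pair of intervals with the lowest Chi-Squared statistic, and stops when that lowest statistic exceeds the threshold. The function names are hypothetical, and the threshold in the usage lines (2.706, the Chi-Squared critical value for 1 degree of freedom at the 0.10 significance level) is only an example choice.

```python
def chi2(counts_a, counts_b):
    """Chi-squared statistic for the class counts of two adjacent intervals."""
    classes = set(counts_a) | set(counts_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    stat = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for c in classes:
            col_total = counts_a.get(c, 0) + counts_b.get(c, 0)
            expected = row_total * col_total / total
            if expected > 0:
                stat += (row.get(c, 0) - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals that remain after merging."""
    intervals = []                       # each entry: [lower_bound, {class: count}]
    for v, label in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            counts = intervals[-1][1]    # same value: reuse its interval
        else:
            counts = {}
            intervals.append([v, counts])
        counts[label] = counts.get(label, 0) + 1
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:         # every adjacent pair differs enough: stop
            break
        for c, n in intervals[i + 1][1].items():
            intervals[i][1][c] = intervals[i][1].get(c, 0) + n
        del intervals[i + 1]
    return [lower for lower, _ in intervals]

# Two well-separated classes collapse into two bins starting at 1 and 10.
bins = chimerge([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"], 2.706)
```

A higher threshold merges more aggressively and yields fewer bins; a lower one keeps more of the original distinctions.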
Q9. Discuss the major comparison parameters used in data reduction techniques.
When employing data reduction techniques in preparation for data mining, various comparison parameters play a crucial role in
assessing the effectiveness and trade-offs involved in the process. The major comparison parameters include:
1 Computing Time: Simplifying the dataset through data reduction ideally leads to reduced computing time during data mining.
By decreasing the volume or dimensions of the data, computational processes such as model training, analysis, and querying
become more efficient. However, it's essential to strike a balance between spending time on pre-processing tasks (such as
dimensionality reduction) and the overall improvement gained in computational efficiency.
2 Predictive/Descriptive Accuracy: The accuracy of data mining models is paramount. By using only relevant and significant
features derived from data reduction, the expectation is that data mining algorithms can learn more swiftly and produce more
accurate models. Removing irrelevant or redundant features mitigates the risk of misleading the learning process, leading to
improved predictive or descriptive accuracy of the models generated.
3 Representation of Data-Mining Models: Simplifying the representation of the data-mining model through data reduction often
results in models that are easier to interpret and understand. Simpler models derived from reduced data dimensions enhance
interpretability. While there might be a slight trade-off in accuracy, achieving a balance between accuracy and simplicity in
representation is essential. Dimensionality reduction aids in striking this balance by simplifying the model representation without
sacrificing accuracy significantly.
Striving to achieve reduced computational time, improved accuracy, and a simplified representation simultaneously through
dimensionality reduction is an ideal scenario. However, in practical applications, there might be trade-offs among these
parameters. Balancing these factors becomes crucial, as extreme reduction efforts might impact accuracy, and overly complex
models might hinder interpretability.
Q10. Discuss in brief the recommended characteristics of data reduction algorithms.
The recommended characteristics of data reduction algorithms play a crucial role in designing effective techniques that enable
efficient and accurate reduction of data. Here are the key characteristics:
Measurable Quality: The algorithms should produce quantifiable results regarding the quality of the approximated data set after
reduction. This allows for precise measurement and evaluation of the effectiveness of the reduction process.
Recognizable Quality: The quality of the approximated results should be easily determinable before any data mining procedure is
applied. This facilitates the assessment of the effectiveness of reduction in real-time, allowing adjustments or optimizations as
needed.
Monotonicity: As the reduction algorithms are often iterative in nature, the quality of results should consistently improve or, at
the very least, remain the same with each iteration. The algorithm's output should be a non-decreasing function of both time
and input data quality.
Consistency: The quality of the results achieved should demonstrate a correlation with computation time and the quality of the
input data. This characteristic ensures that the outcomes are reliable and predictable based on these factors.
Diminishing Returns: The algorithm should exhibit diminishing returns: the initial iterations lead to the most significant
improvements in the solution, and the rate of improvement gradually diminishes as the process continues.
Interruptibility: The algorithm should allow for interruption at any stage, providing intermediate results or partial solutions. This
feature is crucial as it enables users to halt the algorithm at any point without losing all progress made, providing some useful
insights or reduced data sets.
Pre-emptability: The algorithm should be designed to support suspension and resumption with minimal overhead. This capability
allows for the temporary suspension of the reduction process and its subsequent resumption without significant resource or
time penalties.
Q11. State the different types of learning.
The two primary types of learning methods in machine learning are supervised learning and unsupervised learning. Both types
of learning play essential roles in machine learning and data analysis. Supervised learning is suitable for tasks where labeled data
is available, enabling precise prediction or estimation based on known relationships. Unsupervised learning, on the other hand,
explores data structures, revealing insights or patterns without the need for labeled examples.
Supervised Learning
Definition: Supervised learning involves learning
from labeled data, where the input data is paired
with corresponding output labels or target values.
Tasks: Common tasks include classification and
regression.
Teacher-Student Analogy: It operates with a
"teacher" who provides labeled examples, allowing
the algorithm to learn the relationship between
input features and output labels.
Characteristics: The algorithm uses this labeled data
to learn patterns, associations, or mappings
between input and output. The goal is to predict or
estimate the output for new, unseen input data
accurately.
Examples: Predicting house prices based on features
(regression) or classifying emails as spam or not
spam (classification).
Unsupervised Learning
Definition: Unsupervised learning involves learning from
unlabeled data, where the algorithm explores the inherent
structure or patterns in the data without explicit output
labels.
Tasks: Clustering, dimensionality reduction, and anomaly
detection are common tasks.
No Teacher/Labels: There is no teacher or labeled output;
the algorithm explores the data's structure, finding
similarities, differences, or groupings within the data.
Characteristics: It discovers hidden patterns, structures, or
relationships in the data without specific guidance or
labeled examples.
Examples: Grouping similar documents together
(clustering), reducing dimensions while retaining key
information (dimensionality reduction), or detecting
unusual patterns (anomaly detection) without labeled
anomalies.
Q12. From the data, how do we identify what kind of learning task is defined for our application?
Once the data has been pre-processed and the learning task for the application is defined, the next step is to consider the data
mining methodologies that fit the problem's characteristics and the available dataset. These methodologies and their
associated computer-based tools include:
Statistical Methods: Utilized for inference and modeling relationships between variables. Techniques include Bayesian inference,
logistic regression, ANOVA, and log-linear models.
Cluster Analysis: Used for grouping similar data points to uncover patterns or similarities. Techniques include divisive algorithms,
agglomerative algorithms, partitional clustering, and incremental clustering.
Decision Trees and Rules: Developed mainly in artificial intelligence for inductive learning. Techniques encompass the CLS
method, ID3 algorithm, C4.5 algorithm, and pruning algorithms.
Association Rules: Discover associations among items in datasets. Methods include market basket analysis, Apriori algorithm, and
WWW path-traversal patterns.
Artificial Neural Networks: Mimic human brain behavior to learn patterns. Examples are multilayer perceptrons with
backpropagation, Kohonen networks, or convolutional neural networks.
Genetic Algorithms: Useful for solving hard optimization problems and often integrated into data mining algorithms.
Fuzzy Inference Systems: Based on fuzzy sets and logic for modeling uncertainty. Includes fuzzy modeling and decision-making
steps.
N-Dimensional Visualization Methods: Useful for exploring data patterns. Techniques include geometric, icon-based, pixel-
oriented, and hierarchical visualization.
Each methodology offers distinct approaches to data analysis, and the choice depends on the problem's nature, available data,
and the desired outcomes of the data mining task. These techniques empower analysts to extract meaningful insights and
patterns from data, aiding in informed decision-making.
Q13. Explain SVM Process in detail.
The SUPPORT VECTOR MACHINE (SVM) algorithm is a powerful supervised learning method primarily used for classification
tasks, although it can also handle regression problems. SVM aims to create a decision boundary that effectively separates
different classes in the input space.
Here's a detailed illustration of the SVM procedure:
1. SVM Classification and Regression: SVMs were initially developed for classification tasks and later extended to regression
problems as Support Vector Regression (SVR) in addition to the original Support Vector Classification (SVC).
2. Supervised Learning from Labeled Data: SVM operates as a supervised learning algorithm, relying on labeled training data
that includes input attributes and corresponding class labels for classification or continuous values for regression tasks.
3. Decision Planes and Class Separation: SVM's fundamental concept revolves around defining decision planes that act as
boundaries between different classes or categories within the input space.
4. Visualizing Data for Classification: Plotting data points aids in visualizing the classification task. For instance, in a binary
classification scenario with continuous attributes, representing data points on a graph helps understand the separation
between classes.
5. Optimal Separating Hyperplane: SVM aims to determine an optimal hyperplane that effectively separates classes while
maximizing the margin, which is the distance between the hyperplane and the closest data points from each class.
6. Maximizing Margin and Generalization: By maximizing the margin between the hyperplane and support vectors (closest data
points from different classes), SVM ensures robustness in classifying new, unseen instances, facilitating generalization.
7. Linear SVM for Linearly Separable Data: In cases where data is linearly separable, the optimal separating hyperplane takes the
form of a linear decision boundary. Linear SVM classifiers focus on identifying this hyperplane to effectively separate classes.
8. Objective of SVM: SVM's primary goal is to construct a model that generalizes well, enabling accurate predictions or
classifications by optimizing the margin between classes in the input space.
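To make the margin idea concrete, here is a small sketch of a linear SVM for two classes. It is not the quadratic-programming formulation normally used to find the exact maximum-margin hyperplane; as a simpler stand-in, it minimizes the regularized hinge loss with per-sample subgradient steps, which approximates a soft-margin linear SVM. The function names, the hyperparameters `lam`, `lr`, and `epochs`, and the toy data are all assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + hinge loss by per-sample subgradient steps.

    Labels in y must be +1 or -1.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: move the plane.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: only the regulariser shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def margin_width(w):
    """Distance between the two margin planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sum(wj * wj for wj in w) ** 0.5

# Hypothetical, linearly separable toy data.
X = [[-2, -1], [-1, -2], [-1, -1], [1, 1], [2, 1], [1, 2]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
```

The training points for which `margin` stays close to 1 play the role of the support vectors: they are the only ones that keep triggering updates once the other points are classified with room to spare.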
Q14. Differentiate between Supervised and Unsupervised Learning.
Supervised Learning | Unsupervised Learning
• Trained using labeled data. | • Trained using unlabeled data.
• Takes direct feedback to check predictions. | • Does not take any feedback.
• Predicts the output based on input-output pairs. | • Finds hidden patterns in data.
• Input data provided with corresponding outputs. | • Only input data is provided.
• Goal: Predict output for new data. | • Goal: Find hidden patterns and insights.
• Requires supervision to train the model. | • Trains without supervision.
• Categorized as Classification and Regression problems. | • Categorized as Clustering and Association problems.
• Used when input and corresponding outputs are known. | • Used when only input data is available.
• Tends to produce accurate results. | • May offer less accuracy compared to supervised learning.
• Training requires a known output label for each data point. | • Learns patterns from the data without labels, much as a child learns from experience.
• Examples: Linear Regression, Logistic Regression, SVM, Decision Trees. | • Examples: Clustering (e.g. K-means), Apriori algorithm.
Q15. Explain the K-NN algorithm in all its aspects.
The K-Nearest Neighbors (K-NN) algorithm is a straightforward machine learning method based on supervised learning. K-NN's
fundamental methodology involves computing similarities and assigning new data points to categories based on the majority
vote of their K nearest neighbors, making it a versatile and relatively simple algorithm for classification and regression tasks.
Here's an outline covering various aspects of the K-NN algorithm:
• K-NN falls under the category of Supervised Learning techniques.
• It operates on the assumption of similarity between new data and available data, classifying the new data into the most
similar category among available categories.
• K-NN stores all available data and classifies new data points based on their similarity to the stored dataset. This allows easy
categorization of new data as it appears.
• K-NN can be used for both Regression and Classification tasks, although it's more commonly used for Classification problems.
• It's a non-parametric algorithm: it makes no assumption about the form of the underlying data distribution and instead
works directly from the stored dataset.
• Often referred to as a lazy learner algorithm, K-NN doesn't immediately learn from the training set but stores the dataset
and performs classification at the time of prediction.
• During training, K-NN stores the dataset, and when new data arrives, it classifies it into the most similar category based on
the stored dataset. For instance, if there are two categories, A and B, and a new data point x1 is introduced, the K-NN
algorithm can help determine which category x1 belongs to based on similarity.
K-NN Working Principle
Step-1: Choose the number K of neighbors to consider.
Step-2: Compute the Euclidean distance from the new data point to every point in the stored dataset.
Step-3: Select the K nearest neighbors based on the calculated distance.
Step-4: Among these neighbors, count the data points in each category.
Step-5: Assign the new data point to the category with the maximum number of neighbors.
Step-6: The new data point is now classified; repeat Steps 2 to 5 for each further data point.
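The working principle above fits in a short function. This is a minimal sketch for classification only, assuming numeric features, Euclidean distance, and a simple majority vote; the function name `knn_classify` and the sample points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_classify(train_points, train_labels, query, k=3):
    """Label the query point by majority vote among its k nearest stored points."""
    # Distance from the query to every stored point.
    distances = [(dist(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among them; the most frequent one wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two hypothetical categories, A and B, and a new point x1.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
x1 = (1.5, 1.5)
category = knn_classify(points, labels, x1, k=3)
```

Because nothing is learned in advance, all the work happens inside `knn_classify` at prediction time, which is why K-NN is called a lazy learner. Choosing an odd K for two classes avoids tied votes.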
Copyright © 2023 Jayanti Rajdevendra Pande.
All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose
without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and
other applicable laws.
For any further queries contact on email: jayantipande17@gmail.com