INTRODUCTION
 Extraction or ‘mining’ of knowledge from large amounts of data
 Also known as knowledge mining from data / knowledge extraction / data or
pattern analysis / data archaeology / data dredging
 Most popular term – Knowledge Discovery from Data (KDD)
 Data is available in huge amounts -> imminent need to turn it into useful information
 Applications – market analysis, fraud detection, customer retention, production
control, science exploration
 Data cleaning (remove noise and inconsistent data)
 Data integration (combine multiple data sources)
 Data selection (data relevant to the analysis task is retrieved from the database)
 Data transformation (data is transformed or consolidated into forms appropriate for mining,
for example by summary or aggregation operations)
 Data mining (extraction of data patterns)
 Pattern evaluation (identifying interesting patterns representing knowledge using
interestingness measures)
 Knowledge presentation (visualization and presentation of mined knowledge)
 Database, Data Warehouse, WWW, Information Repositories – one or a set of
databases, data warehouses or other information repositories. Data cleaning and
data integration are performed on this data.
 Database / Data Warehouse server – responsible for fetching the relevant data based
on the user’s request
 Knowledge base – the domain knowledge that guides the search. Includes
concept hierarchies used to organize attributes, and user beliefs
 Data mining engine – consists of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis.
 Pattern evaluation module – employs interestingness measures and interacts
with the data mining modules to focus the search towards interesting patterns
 User interface – the user specifies a data mining query or task, provides information
to help focus the search, and performs exploratory data mining based on intermediate
data mining results.
 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and Advanced Applications
 Object-Relational Database
 Temporal Database/Sequence Database and Time-Series Database
 Spatial Databases and Spatiotemporal Databases
 Text Databases and Multimedia Databases
 Heterogeneous Databases and Legacy Databases
 Data Streams
 World Wide Web
 No coupling
 DM system does not utilize any function of a DB/DW system.
 Fetches data from a source and stores the results in a different file
 Drawbacks
 Without a DB system, the DM system spends considerable time searching, collecting and transforming data.
 The DM system cannot reuse the tested, scalable algorithms and data structures that DB/DW systems implement
 The DM system needs another tool to extract data
 Loose coupling
 DM system uses some facilities of the DB/DW system: fetching data from it, performing data
mining, and storing the results in a file or in a designated place in the database
 Advantages
 Fetches data from the database using query processing and indexing
 Gains the flexibility and efficiency that the DB system provides.
 Disadvantage – mining does not exploit the data structures and query optimization methods of the DB system
 Semi-tight coupling
 DM system is linked to the DB system, and the DB system provides efficient implementations of a few
essential data mining primitives
 Primitives include sorting, indexing, aggregation, histogram analysis, and pre-computation of
statistical measures such as sum, count, min, max, standard deviation
 Enhances performance of the DM system since some frequently used intermediate results are pre-computed
 Tight coupling
 DM system is smoothly integrated into the DB system.
 Data mining queries and functions are optimized based on mining query analysis,
data structures, indexing schemes and query processing methods.
 Why preprocess the data? Real-world data is often
 Incomplete (lacking attribute values)
 Noisy (containing errors or outliers)
 Inconsistent (containing discrepancies, e.g. in the department codes used to categorize items)
 Redundant (repeating the same data)
 Descriptive data summarization helps in studying the general characteristics of
the data and identifying the presence of noise or outliers, which is useful for
successful data cleaning and data integration.
 Measures of central tendency – mean, weighted arithmetic mean, median, mode
 Measures of data dispersion – quartiles, interquartile range, variance
 A distributive measure is a measure that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each subset
and then merging the results to arrive at the measure’s value for the
original data set. Examples: sum, count, min, max.
 An algebraic measure is a measure that can be computed by applying an algebraic
function to one or more distributive measures. Example: mean (sum / count).
 A holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset. Example: median.
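The three kinds of measure can be contrasted on a small list. This is a minimal sketch: the data values and the helper name `partition` are made up for illustration and do not come from the slides.

```python
from statistics import median

def partition(values, size):
    """Split a list into consecutive chunks of at most `size` items (hypothetical helper)."""
    return [values[i:i + size] for i in range(0, len(values), size)]

data = [4, 8, 15, 16, 23, 42, 7, 1]   # made-up values
chunks = partition(data, 3)           # [[4, 8, 15], [16, 23, 42], [7, 1]]

# Distributive: per-chunk sums and counts merge into the global sum and count.
total = sum(sum(c) for c in chunks)
count = sum(len(c) for c in chunks)

# Algebraic: the mean is an algebraic function (a ratio) of two distributive measures.
mean = total / count

# Holistic: merging per-chunk medians does not, in general, give the global median.
median_of_medians = median(median(c) for c in chunks)
true_median = median(data)

print(total, count, mean)               # 116 8 14.5
print(median_of_medians, true_median)   # the two values differ for this data
```

The mean comes out right from partial results alone, while the median needs the whole data set at once.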
 The degree to which numerical data tend to spread is called the dispersion or
variance of the data.
 The most common measures of dispersion are the range, five-number summary,
interquartile range and standard deviation.
 Popular graphs for displaying data summaries and dispersion include
histograms, quantile plots, q-q plots, scatter plots and loess curves.
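The dispersion measures listed above can be sketched with Python's standard `statistics` module. The sample values are made up, the "inclusive" quantile method is only one of several quartile conventions, and the 1.5 x IQR rule used to flag suspicious values is a common rule of thumb rather than something stated on the slide.

```python
from statistics import pstdev, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the 'inclusive' quantile method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]   # made-up values

low, q1, med, q3, high = five_number_summary(salaries)
value_range = high - low      # range
iqr = q3 - q1                 # interquartile range
spread = pstdev(salaries)     # population standard deviation

# Rule of thumb (an assumption here): values more than 1.5 * IQR beyond the quartiles are suspect.
suspects = [v for v in salaries if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(low, q1, med, q3, high)
print(value_range, iqr, spread, suspects)
```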
 Data cleaning attempts to fill in missing values, smooth out noise, identify outliers and
correct inconsistencies
 Missing values
 Ignore the tuple
 Fill in the missing value manually
 Use a global constant to fill in the missing value
 Use the attribute mean to fill in the missing value
 Use the attribute mean for samples belonging to the same class as the given tuple
 Use the most probable value to fill in the missing value
 The most probable value may be determined with regression, decision-tree induction or Bayesian inference
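Two of the strategies above, filling with the attribute mean and with the class-wise attribute mean, might look like this. It is a sketch only: the records, the attribute names `income` and `risk`, and both function names are hypothetical, and missing values are assumed to be represented as `None`.

```python
from collections import defaultdict
from statistics import mean

def fill_with_attribute_mean(rows, attr):
    """Replace missing (None) values of `attr` with the mean of the known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of `attr` with the mean over tuples of the same class.

    Assumes every class has at least one known value; otherwise a KeyError is raised.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {c: mean(v) for c, v in by_class.items()}
    return [dict(r, **{attr: means[r[label]] if r[attr] is None else r[attr]}) for r in rows]

customers = [   # made-up records
    {"income": 40, "risk": "safe"},
    {"income": 60, "risk": "safe"},
    {"income": None, "risk": "safe"},
    {"income": 20, "risk": "risky"},
    {"income": None, "risk": "risky"},
]

print(fill_with_attribute_mean(customers, "income"))
print(fill_with_class_mean(customers, "income", "risk"))
```

The class-wise variant fills the two gaps with different values because the known incomes of the two classes differ.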
 Noisy data
 Binning
 Smooths a sorted data value by consulting its neighboring values
 Performs local smoothing
 Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
 Smoothing by bin medians – each value in a bin is replaced by the bin median
 Smoothing by bin boundaries – the min and max values of a bin are its boundaries, and each value in the
bin is replaced by the closest boundary
 Regression
 Smooths the data by fitting it to a function
 Linear regression finds the best line to fit two attributes, so that one can be used to predict the other
 Multiple regression involves more than two attributes
 Clustering
 Outliers are detected through clustering, where similar values are organized into clusters
 Values falling outside the set of clusters are considered outliers
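The binning methods above can be sketched as follows, assuming equal-frequency bins over sorted values. The price list is illustrative, the function names are hypothetical, and sending ties to the lower boundary is an arbitrary choice made for this example.

```python
def equal_frequency_bins(values, size):
    """Sort the values and cut them into bins of `size` items each."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max (ties go to the min)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]   # bins are sorted, so these are the min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative values
bins = equal_frequency_bins(prices, 3)

print(smooth_by_means(bins))        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```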
 Data integration
 The entity identification problem is the matching of equivalent real-world entities from multiple
data sources
 Correlation analysis measures how strongly one attribute implies the other
 Data transformation
 Smoothing – binning, regression, clustering
 Aggregation
 Generalization – low-level data is replaced by higher-level concepts through the use of
concept hierarchies
 Normalization – data is scaled to fall within a small specified range
 Min-Max method
 Z-score normalization
 Decimal scaling
 Attribute construction
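The three normalization methods named above can be sketched in a few lines. The function names and sample values are assumptions for illustration. Min-max and z-score as written here assume the attribute is not constant, since they would otherwise divide by zero, and the z-score uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer with max(|v|) / 10**j < 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]   # made-up values

print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
```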
 Data reduction is applied to obtain a reduced representation of the data set
 Data cube aggregation
 Attribute subset selection reduces the data size by removing irrelevant or redundant
attributes.
 Dimensionality reduction applies data encoding or transformations to obtain
compressed data. Lossy dimensionality reduction methods – wavelet transforms, principal
components analysis
 Numerosity reduction
 Parametric methods use a model to estimate the data, e.g. log-linear models
 Nonparametric methods include histograms, clustering and sampling for storing the reduced
representation
 Discretization and concept hierarchy generation reduce the number of values for a given attribute
by dividing the range of the attribute into intervals.
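Discretization by dividing an attribute's range into intervals could look like the sketch below. It assumes equal-width intervals, which is only one of several possible schemes. The function name and the ages are made up, and the attribute is assumed not to be constant.

```python
def discretize(values, k):
    """Map each value to the index (0 .. k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value is placed in the last interval rather than in a (k+1)-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]   # made-up values
labels = discretize(ages, 3)   # intervals [13, 32), [32, 51), [51, 70]

print(labels)
```

Seventeen distinct ages are reduced to three interval labels, which could then be named with higher-level concepts such as youth, middle-aged and senior.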
DM tasks are divided into two categories: descriptive and predictive
Descriptive mining tasks characterize the general properties of the data
Predictive mining tasks perform inference on the current data in order to make predictions
 Data characterization is a summarization of the general characteristics or
features of a target class of data.
 Data corresponding to the user-specified class is typically collected by a database query
 Example: to study the characteristics of software products whose sales increased by 10%,
data related to such products is collected
 The data cube OLAP roll-up operation is used for data summarization
 Output is presented in forms such as pie charts and histograms
 Data discrimination is a comparison of the general features of the target class data
objects with the general features of objects from one or a set of contrasting classes.
 Example: comparison of products whose sales increased by 10% with products
whose sales decreased by 30%
 Classification is the process of finding a model that describes or distinguishes data
classes or concepts, so that the model can be used to predict the
class of objects whose class label is unknown.
 Example: classifying a loan as ‘safe’ or ‘risky’
 Example: given a customer profile, predicting whether the customer will buy a new computer
 Decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support vector machines
 Classification by association rule analysis
 Prediction models continuous-valued functions. Numeric prediction is the task of
predicting continuous values for a given input.
 Regression analysis is a statistical methodology that is often used for numeric
prediction
 Linear (straight-line) regression involves a response variable, y, and a single
predictor variable, x. It models y as a linear function of x: y = b + wx
 Multiple linear regression extends straight-line regression to model more than
one predictor variable
 Nonlinear regression models relationships that a straight line cannot capture, for example by adding polynomial terms
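The straight-line model y = b + wx can be fitted as sketched below. The slide does not say how b and w are estimated; this example assumes the method of least squares, and the data and the function name `fit_line` are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates of b (intercept) and w (slope) in y = b + w*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    w = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - w * x_bar
    return b, w

years = [1, 2, 3, 4, 5]          # predictor x (made-up)
salary = [30, 35, 40, 45, 50]    # response y, lying exactly on y = 25 + 5x

b, w = fit_line(years, salary)
predicted = b + w * 6            # numeric prediction for an unseen x

print(b, w, predicted)           # 25.0 5.0 55.0
```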
 The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
 A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters.
 Class labels are not present in training data because they are not known to begin
with. Clustering is used to generate such labels
 Applications: taxonomy (organization of observations into hierarchy of classes that
group similar events together)
 Partitioning method
 A partitioning method creates k partitions of a database of n objects or data tuples, where k <= n
 Requirements
 Each group must contain at least one object
 Each object must belong to exactly one group
 Objects in the same cluster are close or related to each other, whereas objects of different
clusters are far apart or very different
 k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster
 k-medoids algorithm, where each cluster is represented by one of the objects located near
the center of the cluster.
 Works well for small to medium-sized databases
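A bare-bones version of the k-means idea above is sketched here. To keep the run deterministic it seeds the centers with the first k points, which is an assumption of this example (practical implementations usually pick seeds at random or with a smarter heuristic). The points and the function name `k_means` are made up.

```python
def k_means(points, k, iterations=100):
    """Alternate between assigning points to the nearest center and recomputing the means."""
    centers = [tuple(p) for p in points[:k]]   # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the center with the smallest squared Euclidean distance.
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Each cluster is represented by the mean of its objects; a cluster that
        # ends up empty keeps its old center (it is not repaired here).
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # made-up 2-D objects
centers, clusters = k_means(points, 2)

print(centers)
print(clusters)
```

Every object lands in exactly one cluster, and for this data the two groups around (1, 1) and (8, 8) are recovered.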
 Hierarchical method
 Creates a hierarchical decomposition of the given set of data objects.
 Classified according to how the hierarchical decomposition is formed
 Agglomerative/Bottom-up approach starts with each object as a separate group and merges the objects or groups that are close to one another, until
all the groups are merged into one or a termination condition holds
 Divisive/Top-down approach starts with all of the objects in the same cluster and splits it into
smaller clusters until eventually each object is in its own cluster or a termination condition holds
 Density-based method
 Can discover clusters of arbitrary shape
 Can be used to filter out noise (outliers)
 Grid-based method
 Quantizes the object space into a finite number of cells that form a grid structure.
 Fast processing time
 Model-based clustering
 Hypothesizes a model for each cluster and finds the best fit of the data to the given
model
 Locates clusters by constructing a density function that reflects the spatial distribution of
the data
 Can automatically determine the number of clusters based on standard statistics
 Example: self-organizing maps
 Clustering high-dimensional data
 Examines objects that have a large number of features (dimensions)
 Subspace clustering methods search for clusters in subspaces of the data
 Frequent pattern-based clustering extracts distinct frequent patterns among subsets of
dimensions that occur frequently
 Constraint-based clustering
 Performs clustering by incorporating user-specified constraints
 A constraint expresses a user’s expectations or desired results
 Example: spatial clustering in the presence of obstacles, and clustering under
user-specified constraints
 Outliers are data objects that do not comply with the general behavior or model of the data
 They are discarded as noise by most data mining applications. However, in applications like
fraud detection, they are worth noting. Example: detecting fraudulent usage of credit cards by
spotting purchases of extremely large amounts on a given day
 Outliers may be detected using statistical tests that assume a probability model for the data, or using
distance measures, where objects that are a substantial distance from any
cluster are considered outliers.
 Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
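The distance-based view of outliers mentioned above can be illustrated with a deliberately simple rule: an object is flagged when fewer than `min_neighbors` other objects lie within `radius` of it. Both parameters, their values, the function name and the purchase amounts are assumptions of this sketch.

```python
def distance_outliers(values, radius, min_neighbors):
    """Flag values that have fewer than `min_neighbors` other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(1 for j, u in enumerate(values)
                        if j != i and abs(u - v) <= radius)
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

daily_purchases = [25, 30, 28, 32, 27, 31, 950, 29]   # made-up card purchases

print(distance_outliers(daily_purchases, radius=10, min_neighbors=2))   # [950]
```

The single very large purchase is the kind of value a fraud-detection application would want to inspect rather than discard.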
 Stream data is massive, temporally ordered, fast changing and potentially infinite.
 Stream data flows in and out of a computer system continuously and with varying
update rates.
 Examples – real-time surveillance systems, communication networks, internet
traffic, on-line transactions in financial markets or the retail industry, electric power
grids, industrial production processes and other dynamic environments.
 It may be impossible to store an entire data stream. Moreover, stream data tends to be at a rather
low level of abstraction.
 Mining time-series data
 A time-series database consists of sequences of values or events obtained over repeated measurements
of time.
 Time-series databases are popular in stock-market analysis, economic and sales
forecasting, budgetary analysis, utility studies, yield studies, workload projections and the
observation of natural phenomena
 Mining sequence patterns
 A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. Sequential pattern mining is the discovery of
frequently occurring ordered events or subsequences as patterns.
 Applications include customer shopping sequences, web clickstreams, biological sequences and
sequences of events in science and engineering.
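As a toy illustration of sequential pattern mining, the sketch below counts ordered pairs of events and keeps those that occur in at least `min_support` sequences. It covers only patterns of length two and is not one of the established sequential pattern mining algorithms. The clickstreams, the function name and the threshold are made up.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Return ordered event pairs (a before b) contained in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # combinations() keeps the original order, so (a, b) means a occurred before b.
        # set() makes sure a pair is counted at most once per sequence.
        support.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in support.items() if n >= min_support}

clicks = [   # made-up web clickstreams
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "home"],
]

patterns = frequent_ordered_pairs(clicks, min_support=2)
print(patterns)
```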
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

Data Mining Introduction and Techniques

  • 2.
      - Data mining is the extraction, or "mining", of knowledge from large amounts of data.
      - Also known as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology and data dredging.
      - The most popular synonym is Knowledge Discovery from Data (KDD).
      - Data is available in huge amounts, so there is an imminent need to turn it into useful information.
      - Applications: market analysis, fraud detection, customer retention, production control, science exploration.
  • 3.
      1. Data cleaning (remove noise and inconsistent data)
      2. Data integration (combine multiple data sources)
      3. Data selection (retrieve the data relevant to the analysis from the database)
      4. Data transformation (transform or consolidate the data into forms appropriate for mining, e.g. by summary or aggregation operations)
      5. Data mining (extract data patterns)
      6. Pattern evaluation (identify the interesting patterns that represent knowledge, using interestingness measures)
      7. Knowledge presentation (visualize and present the mined knowledge)
  • 4. 4
  • 5.
      - Database, data warehouse, WWW or other information repository: one or a set of databases, data warehouses or other information repositories. Data cleaning and data integration may be performed on the data.
      - Database / data warehouse server: responsible for fetching the relevant data based on the user's request.
      - Knowledge base: the domain knowledge that guides the search. It includes concept hierarchies used to organize attributes, and user beliefs.
      - Data mining engine: consists of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis and outlier analysis.
      - Pattern evaluation module: employs interestingness measures and interacts with the data mining modules to focus the search on interesting patterns.
      - User interface: lets the user specify a data mining query or task, provide information to help focus the search, and perform exploratory data mining based on intermediate results.
  • 6.
      - Relational databases
      - Data warehouses
      - Transactional databases
      - Advanced data and information systems and advanced applications
          - Object-relational databases
          - Temporal databases, sequence databases and time-series databases
          - Spatial databases and spatiotemporal databases
          - Text databases and multimedia databases
          - Heterogeneous databases and legacy databases
          - Data streams
          - World Wide Web
  • 7.
      - No coupling
          - The DM system does not use any function of the DB/DW system.
          - It fetches data from its source and stores the results in a different file.
          - Drawbacks
              - Without a DB system, the DM system spends considerable time searching for, collecting and transforming data.
              - The DM system cannot draw on the tested, scalable algorithms and data structures that DB/DW systems already implement.
              - The DM system needs another tool to extract data.
      - Loose coupling
          - The DM system uses some facilities of the DB system: it fetches data from it, performs the mining, and stores the results in a file or in a designated place in the database.
          - Advantages
              - Data is fetched from the database using query processing and indexing.
              - The DM system inherits the flexibility and efficiency that the DB system provides.
          - Disadvantage: mining does not exploit the data structures or query optimization methods of the DB system.
  • 8.
      - Semi-tight coupling
          - The DM system is linked to the DB system, and the DB system provides efficient implementations of a few essential data mining primitives.
          - These include sorting, indexing, aggregation, histogram analysis and pre-computation of statistical measures such as sum, count, min, max and standard deviation.
          - This enhances the performance of the DM system, since some frequently used intermediate results are pre-computed.
      - Tight coupling
          - The DM system is smoothly integrated into the DB system.
          - Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes and query processing methods.
  • 9.
      - Why preprocess the data? Because it may be:
          - Incomplete (lacking attribute values)
          - Noisy (containing errors or outliers)
          - Inconsistent (containing discrepancies, e.g. in the department codes used to categorize items)
          - Redundant (repeating the same data)
      - Descriptive data summarization helps in studying the general characteristics of the data and in identifying noise or outliers, which is useful for successful data cleaning and data integration.
      - Measures of central tendency: mean, weighted arithmetic mean, median, mode
      - Measures of data dispersion: quartiles, interquartile range, variance
  • 10.
      - A distributive measure can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original data set.
      - An algebraic measure can be computed by applying an algebraic function to one or more distributive measures.
      - A holistic measure must be computed on the entire data set as a whole. It cannot be computed by partitioning the data into subsets and merging the values obtained for the measure in each subset.
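The three kinds of measure can be illustrated with a minimal Python sketch. The partitions and function names below are hypothetical and chosen only for illustration: sum and count are treated as distributive, the mean is derived from them algebraically, and the median serves as the holistic example.

```python
from statistics import median

def partial_sum_count(chunk):
    """Distributive measures: sum and count can be computed per partition."""
    return sum(chunk), len(chunk)

def merged_mean(partitions):
    """Algebraic measure: mean = total sum / total count of the partitions."""
    partials = [partial_sum_count(p) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical data set, split into three partitions.
partitions = [[1, 2, 3], [4, 5], [6, 100]]
all_values = [v for p in partitions for v in p]

mean_from_parts = merged_mean(partitions)

# Holistic measure: the median needs the whole data set. Merging the
# per-partition medians does not, in general, give the overall median.
true_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)

print(mean_from_parts, true_median, median_of_medians)
```

Here the mean rebuilt from per-partition sums and counts matches the mean of the full data set, while the median of the partition medians differs from the true median.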
  • 11.
      - The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
      - The most common measures of dispersion are the range, the five-number summary, the interquartile range and the standard deviation.
      - Popular graphs for displaying data summaries and dispersion include histograms, quantile plots, q-q plots, scatter plots and loess curves.
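The dispersion measures listed above could be computed along the following lines. This is a sketch using Python's standard `statistics` module; the sample values are hypothetical, the "inclusive" quartile method is one of several conventions, and the 1.5 × IQR fence is a common rule of thumb that the slides do not mention.

```python
from statistics import pstdev, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical sample data, chosen only for illustration.
prices = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

minimum, q1, med, q3, maximum = five_number_summary(prices)
value_range = maximum - minimum
iqr = q3 - q1                 # interquartile range
std_dev = pstdev(prices)      # population standard deviation

# Assumed rule of thumb (not from the slides): values more than
# 1.5 * IQR beyond the quartiles are treated as suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
suspected_outliers = [v for v in prices if v < lower_fence or v > upper_fence]

print(value_range, iqr, std_dev, suspected_outliers)
```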
  • 12. 12
  • 13.
      - Data cleaning attempts to fill in missing values, smooth out noise, identify outliers and correct inconsistencies.
      - Missing values
          - Ignore the tuple
          - Fill in the missing value manually
          - Use a global constant to fill in the missing value
          - Use the attribute mean to fill in the missing value
          - Use the attribute mean of all samples belonging to the same class as the given tuple
          - Use the most probable value to fill in the missing value, determined by regression, decision-tree induction or inference based on a Bayesian formalism
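Two of the strategies above, filling with the attribute mean and filling with the mean of the tuple's own class, might look like this in Python. The `income` and `risk` attributes and the row values are hypothetical, and missing values are assumed to be represented as `None`.

```python
from collections import defaultdict
from statistics import mean

def fill_with_attribute_mean(rows, attr):
    """Replace missing values of attr with the mean of the known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean over rows of the same class.

    Raises KeyError if a class has no known value for attr at all.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[label]].append(r[attr])
    class_means = {c: mean(vs) for c, vs in known.items()}
    return [
        dict(r, **{attr: class_means[r[label]] if r[attr] is None else r[attr]})
        for r in rows
    ]

# Hypothetical customer tuples; None marks a missing income.
rows = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 120, "risk": "high"},
]

by_attribute = fill_with_attribute_mean(rows, "income")
by_class = fill_with_class_mean(rows, "income", "risk")
print([r["income"] for r in by_attribute])
print([r["income"] for r in by_class])
```

The class-based fill keeps the two risk groups apart, whereas the overall mean pulls both missing values towards the middle of the data.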
  • 14.
      - Noisy data
          - Binning
              - Smooths a sorted data value by consulting its neighborhood, i.e. the values around it
              - Performs local smoothing
              - Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
              - Smoothing by bin medians: each value in a bin is replaced by the bin median
              - Smoothing by bin boundaries: the minimum and maximum values of a bin are its boundaries, and each value is replaced by the closest boundary
          - Regression
              - Fits the data to a function
              - Linear regression finds the best line to fit two attributes
              - Multiple regression involves more than two variables
          - Clustering
              - Outliers are detected by clustering, where similar values are organized into clusters
              - Values that fall outside the set of clusters may be considered outliers
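The three binning variants can be sketched as follows. This assumes equal-frequency bins over sorted values; the price list is illustrative, and sending ties to the lower boundary is an arbitrary choice made here.

```python
from statistics import median

def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary (an arbitrary choice in this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative sorted price data, three values per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means, by_medians, by_boundaries, sep="\n")
```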
  • 15.
      - Data integration
          - The entity identification problem is the matching of equivalent real-world entities from multiple data sources.
          - Correlation analysis measures how strongly one attribute implies the other.
      - Data transformation
          - Smoothing: binning, regression, clustering
          - Aggregation
          - Generalization: low-level data is replaced by higher-level concepts through the use of concept hierarchies
          - Normalization: data is scaled to fall within a small specified range
              - Min-max normalization
              - Z-score normalization
              - Decimal scaling
          - Attribute construction
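The three normalization methods named above can be written compactly. This is a minimal sketch: the input values are hypothetical, the z-score uses the population standard deviation, and the functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map the values onto the range [new_min, new_max]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre the values on the mean and scale by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical attribute values.
incomes = [200, 300, 400, 600, 1000]

scaled = min_max(incomes)
standardized = z_score(incomes)
decimal = decimal_scaling(incomes)
print(scaled, standardized, decimal, sep="\n")
```

Min-max normalization preserves the relative spacing of the values, z-scores express each value in standard deviations from the mean, and decimal scaling only moves the decimal point.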
  • 16.
      - Data reduction is applied to obtain a reduced representation of the data set.
      - Data cube aggregation
      - Attribute subset selection reduces the data size by removing irrelevant or redundant attributes.
      - Dimensionality reduction applies data encoding or transformations to obtain a compressed representation of the data. Lossy methods include wavelet transforms and principal component analysis.
      - Numerosity reduction
          - Parametric methods use a model to estimate the data, e.g. log-linear models.
          - Nonparametric methods store reduced representations of the data using histograms, clustering and sampling.
      - Discretization and concept hierarchy generation reduce the number of values of a given attribute by dividing the range of the attribute into intervals.
  • 17.
      - Data mining tasks fall into two categories: descriptive and predictive.
      - Descriptive mining tasks characterize the general properties of the data.
      - Predictive mining tasks perform inference on the current data in order to make predictions.
  • 18.
      - Data characterization is a summarization of the general characteristics or features of a target class of data.
          - The data corresponding to the user-specified class are typically collected by a database query.
          - Example: to study the characteristics of software products whose sales increased by 10%, the data related to those products is collected.
          - The data cube OLAP roll-up operation is used for data summarization.
          - The output is presented in forms such as pie charts and histograms.
      - Data discrimination is a comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes.
          - Example: comparing products whose sales increased by 10% with products whose sales decreased by 30%.
  • 19.
      - Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
      - Examples
          - Classifying a loan as "safe" or "risky"
          - Given a customer profile, predicting whether the customer will buy a new computer
      - Methods
          - Decision tree induction
          - Bayesian classification
          - Rule-based classification
          - Classification by backpropagation
          - Support vector machines
          - Classification by association rule analysis
  • 20.
      - Prediction models continuous-valued functions. Numeric prediction is the task of predicting continuous values for a given input.
      - Regression analysis is a statistical methodology that is often used for numeric prediction.
      - Linear (straight-line) regression involves a response variable, y, and a single predictor variable, x. It models y as a linear function of x: y = b + wx.
      - Multiple linear regression extends straight-line regression to involve more than one predictor variable.
      - Nonlinear regression, such as polynomial regression, models the relationship by adding polynomial terms to the basic linear model.
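The coefficients b and w of the straight-line model y = b + wx can be estimated by the method of least squares. The sketch below uses the standard closed-form solution; the function name and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b, w) for the model y = b + w * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # w = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
    w = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    b = y_mean - w * x_mean
    return b, w

# Hypothetical training data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

b, w = fit_line(xs, ys)
prediction = b + w * 6  # numeric prediction for an unseen x
print(b, w, prediction)
```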
  • 21.
      - The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
      - A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
      - Class labels are not present in the training data because they are not known to begin with. Clustering can be used to generate such labels.
      - Applications: taxonomy formation (organizing observations into a hierarchy of classes that group similar events together)
  • 22.
      - Partitioning method
          - A partitioning method creates k partitions of a database of n objects or data tuples.
          - Requirements
              - Each group must contain at least one object.
              - Each object must belong to exactly one group.
          - Objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.
          - In the k-means algorithm, each cluster is represented by the mean value of the objects in the cluster.
          - In the k-medoids algorithm, each cluster is represented by one of the objects located near the center of the cluster.
          - These methods work well for small to medium-sized databases.
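The k-means idea, with each cluster represented by the mean of its objects, can be sketched in plain Python. This is a simplified illustration rather than a production implementation: the initial centroids are naively taken to be the first k points, and the sample points are hypothetical.

```python
import math

def k_means(points, k, max_iterations=100):
    """Partition `points` (tuples of numbers) into k clusters.

    Returns the final centroids and the clusters of the last assignment step.
    """
    # Naive initialisation (an assumption of this sketch): the first k points.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each object joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D objects forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Each object ends up in exactly one group and, for this data, no group is empty, which matches the two requirements listed above. In practice the result depends on the initial centroids, so several random restarts are commonly used.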
  • 23.
      - Hierarchical method
          - Creates a hierarchical decomposition of the given set of data objects.
          - Methods are classified by how the hierarchical decomposition is formed:
              - The agglomerative (bottom-up) approach merges the objects or groups that are close to one another until all the groups are merged into one.
              - The divisive (top-down) approach starts with all of the objects in the same cluster and splits it into smaller clusters until eventually each object is in its own cluster.
      - Density-based method
          - Can discover clusters of arbitrary shape
          - Can be used to filter out noise
      - Grid-based method
          - Quantizes the object space into a finite number of cells that form a grid structure
          - Offers fast processing
      - Model-based clustering
          - Hypothesizes a model for each cluster and finds the best fit of the data to the given model
          - Locates clusters by constructing a density function that reflects the spatial distribution of the data
          - Automatically determines the number of clusters based on standard statistics
          - Example: self-organizing maps
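The agglomerative (bottom-up) approach can be illustrated with a small sketch. The assumptions here are one-dimensional points, single-link distance (the distance between the closest members of two clusters) and a stopping condition based on a target number of clusters; none of these choices comes from the slides.

```python
def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance between cluster i and cluster j.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical one-dimensional objects.
values = [1, 2, 9, 10, 25]

three_groups = agglomerative(values, target_clusters=3)
two_groups = agglomerative(values, target_clusters=2)
everything = agglomerative(values)
print(three_groups, two_groups, everything, sep="\n")
```

Recording the merges in order would give the full hierarchy (a dendrogram). A divisive method would run the other way, starting from one cluster and splitting it.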
  • 24.
      - Clustering high-dimensional data
          - Examines objects that have a large number of features
          - Subspace clustering methods search for clusters in subspaces of the data
          - Frequent pattern-based clustering extracts distinct frequent patterns among subsets of dimensions
      - Constraint-based clustering
          - Performs clustering by incorporating user-specified constraints
          - A constraint expresses a user's expectations or desired results
          - Examples: spatial clustering in the presence of obstacles, and clustering under user-specified constraints
  • 25.
      - Outliers are data objects that do not comply with the general behavior or model of the data.
      - Most data mining applications discard them. In applications such as fraud detection, however, they are worth noting. Example: detecting fraudulent credit card usage by spotting purchases of extremely large amounts on a given day.
      - Outliers may be detected using statistical tests that assume a probability model for the data, or using distance measures, where objects that are a substantial distance from any cluster are considered outliers.
      - Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
  • 26.
      - Stream data is massive, temporally ordered, fast changing and potentially infinite.
      - Stream data flows in and out of a computer system continuously and with varying update rates.
      - Examples: real-time surveillance systems, communication networks, Internet traffic, online transactions in financial markets or the retail industry, electric power grids, industrial production processes and other dynamic environments.
      - It is impossible to store an entire data stream. Moreover, stream data tends to be at a rather low level of abstraction.
  • 27.
      - Mining time-series data
          - A time-series database consists of sequences of values obtained over repeated measurements of time.
          - Time-series databases are popular in stock market analysis, economic and sales forecasting, budgetary analysis, utility studies, yield studies, workload projections and the observation of natural phenomena.
      - Mining sequence patterns
          - A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.
          - Applications include customer shopping sequences, web clickstreams, biological sequences and sequences of events in science and engineering.

Editor's notes

  1. Steps 1-4 are different forms of data preprocessing
  2. From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP).