4. Overview
• What is ID3?
• Decision Trees.
• Simple example of Decision Trees.
• ID3 Algorithm.
• Problem.
• Solution to the discussed problem.
• Conclusion.
5. What is ID3?
• ID3 stands for Iterative Dichotomiser 3.
• It is a mathematical algorithm for building decision trees from a
dataset.
• Invented by J. Ross Quinlan in 1979.
• Uses Information Theory, introduced by Shannon in 1948.
• The algorithm attempts to create the smallest possible decision tree from the
top down, with no backtracking.
• ID3 is the precursor to the C4.5 algorithm.
• It is typically used in machine learning and natural language
processing domains.
6. Decision trees
• The tree consists of decision nodes and leaf nodes.
• A decision node has two or more branches, each representing a possible value
of the attribute tested at that node.
• A leaf node represents a classification: a homogeneous result that requires
no further testing.
• Decision trees are produced by algorithms that identify various ways of
splitting a data set into branch-like segments.
• These segments form an inverted decision tree that originates with a root
node at the top of the tree.
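The two node types described above can be sketched in Python (a toy illustration with made-up attribute names, not any particular library's API): a decision node tests one attribute and holds a branch per value, while a leaf node holds a class label.

```python
class Leaf:
    """Leaf node: a homogeneous result needing no further tests."""
    def __init__(self, label):
        self.label = label

class DecisionNode:
    """Decision node: tests one attribute, one branch per value."""
    def __init__(self, attribute, branches):
        self.attribute = attribute  # attribute tested at this node
        self.branches = branches    # dict: attribute value -> subtree

def classify(node, example):
    """Walk from the root down to a leaf, following attribute values."""
    while isinstance(node, DecisionNode):
        node = node.branches[example[node.attribute]]
    return node.label

# Tiny hand-built tree: classify as '+' only when the outlook is sunny.
tree = DecisionNode("outlook", {
    "sunny": Leaf("+"),
    "rain": Leaf("-"),
})
print(classify(tree, {"outlook": "sunny"}))  # +
```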
8. ID3 Algorithm
• The first step is to create a root node for the tree.
• If all the examples are positive, return the single-node tree Root,
with label '+'.
• If all the examples are negative, return the single-node tree Root,
with label '-'.
• If the set of predicting attributes is empty, return the single-node
tree Root, with label equal to the most common value of the target attribute.
• Else
A = the attribute that best classifies the examples.
Set the decision attribute for Root to A.
For each possible value, vi, of A,
Add a new tree branch below Root, corresponding to the test A = vi.
9. ID3 Algorithm
Let Examples(vi) be the subset of examples that have the value vi
for A.
If Examples(vi) is empty,
Then below this new branch add a leaf node with label equal to the most
common target value in the examples.
Else below this new branch add the subtree ID3(Examples(vi),
Target_Attribute, Attributes - {A}).
• End
• Return Root.
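The recursion above, with entropy-based information gain as the "best classifies" criterion, can be sketched in Python (a minimal illustration, not Quinlan's original implementation; the weather data at the end is made up):

```python
import math
from collections import Counter

def entropy(examples, target):
    """Shannon entropy of the target attribute over the examples."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attr, target):
    """Entropy reduction achieved by splitting on attr."""
    total = len(examples)
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    """Recursive ID3 as outlined in the slides.

    Returns a label, or a nested dict {attribute: {value: subtree}}.
    """
    labels = [e[target] for e in examples]
    # All examples share one label: return a single-node tree.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the most common target value.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute that best classifies the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target,
                            [a for a in attributes if a != best])
    return tree

data = [
    {"outlook": "sunny", "windy": "no",  "play": "+"},
    {"outlook": "sunny", "windy": "yes", "play": "-"},
    {"outlook": "rain",  "windy": "no",  "play": "+"},
    {"outlook": "rain",  "windy": "yes", "play": "-"},
]
print(id3(data, "play", ["outlook", "windy"]))
# {'windy': {'no': '+', 'yes': '-'}}
```

On this toy data, splitting on "windy" yields pure subsets (gain 1.0) while "outlook" yields none (gain 0.0), so the tree tests "windy" at the root and stops.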
10. Conclusion
• ID3 attempts to build the shortest decision tree from a set of training
data, but the shortest tree is not always the best classifier.
• It requires the training data to contain completely consistent patterns, with
no uncertainty or noise.
12. Overview
• What is WEKA?
• WEKA GUI Chooser.
• Data Mining with WEKA.
• Problem.
• Solution for the discussed problem.
• Conclusion
13. What is WEKA?
• WEKA is an acronym for Waikato Environment for Knowledge Analysis.
• It is a popular suite of machine learning software written in Java.
• It was developed at the University of Waikato, New Zealand.
• WEKA is portable: since it is fully implemented in the Java programming
language, it runs on almost any modern computing platform.
• WEKA is free software available under the GNU General Public License.
• WEKA's applications:
Explorer.
Knowledge Flow.
Experimenter.
Simple CLI.
15. Data Mining With WEKA
• Input: raw data.
• Data mining by WEKA: pre-processing, classification, regression, clustering,
association rules, visualization.
• Output: results.
16. Explorer
• Explorer is WEKA's main user interface.
• The Explorer interface features several panels providing access to the main
components of the workbench:
Preprocess.
Classify.
Associate.
Cluster.
Select Attributes.
Visualize.
• Preprocess Panel: used to transform the data, and to delete instances and
attributes according to specific criteria.
• Classify Panel: enables users to apply classification and regression
algorithms to the dataset and to estimate the accuracy of the resulting
predictive model.
17. • Associate Panel: provides access to association rule learners that
attempt to identify all important interrelationships between attributes in the
data.
• Cluster Panel: gives access to the clustering techniques in WEKA.
• Select Attributes Panel: provides algorithms for identifying the most
predictive attributes in a dataset.
• Visualize Panel: shows a scatter plot matrix, where individual
scatter plots can be selected, enlarged, and analyzed further using
various selection operators.
19. Experimenter
• The Experimenter allows systematic comparison of the predictive performance
of WEKA's machine learning algorithms on a collection of datasets.
• It also allows users to set up large-scale experiments, start them
running, leave them, and later analyze the performance statistics that have
been collected.
• It automates the experimental process.
• The statistics can be stored in ARFF format.
• It allows users to distribute the computing load across multiple machines
using Java RMI.
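ARFF (Attribute-Relation File Format), mentioned above, is WEKA's plain-text dataset format: a relation name, typed attribute declarations, then the data rows. A minimal sketch of producing such a file in Python (the relation, attributes, and rows are made up for illustration):

```python
# Build a minimal ARFF document as a string.
rows = [("sunny", "yes"), ("rain", "no")]
lines = [
    "@relation weather",
    "@attribute outlook {sunny, rain}",
    "@attribute play {yes, no}",
    "@data",
] + [",".join(r) for r in rows]
arff = "\n".join(lines)
print(arff)
```

The resulting string can be written to a `.arff` file and loaded directly into the Explorer or Experimenter.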
21. Knowledge Flow
• The Knowledge Flow provides an alternative to the Explorer as a graphical
front end to WEKA's core algorithms.
• It presents a data-flow-inspired interface to WEKA.
• The user can select WEKA components from a tool bar, place them on a
layout canvas, and connect them together to form a knowledge flow for
processing and analyzing data.
• Unlike the Explorer, the Knowledge Flow can handle data either
incrementally or in batches.
24. Conclusion
• In sum, the overall goal of WEKA is to build a state-of-the-art facility for
developing machine learning (ML) techniques and to allow people to apply
them to real-world data mining problems.
• Detailed documentation about the different functions provided by WEKA can
be found on the WEKA website.
26. Overview
• What is Web mining?
• Challenges related to web mining.
• Web mining applications.
• Problems with Web search.
• Improvised search – adding structure to the web.
• Conclusion.
27. What is Web Mining?
• Web mining is the use of data mining techniques to automatically discover
and extract information from web documents and services.
• It discovers useful information from the World Wide Web and its usage
patterns.
• Web mining can be divided into three different types:
Web usage mining.
Web content mining.
Web structure mining.
28. Challenges related to Web Mining
• The web is not just a huge collection of documents; it also carries:
Hyperlink information.
Access and usage information.
• The web is very dynamic; new pages are constantly being generated.
• The main challenge is to develop new web mining algorithms, and to adapt
traditional data mining algorithms to exploit hyperlinks and
access patterns.
29. Web Mining Applications
• E-Commerce (Infrastructure)
Generate User profiles.
Internet Advertising.
Fraud.
Similar Image Retrieval.
• Information retrieval (search) on web
Automatic generation of topic hierarchies.
Web Knowledge bases.
Extraction of schema for XML documents.
• Network Management
Performance Management.
Fault Management.
30. User Profiling
• Important for improving customization:
Provides users with pages and advertisements of interest.
Example profiles: on-line trader, on-line shopper.
• Generate user profiles based on their access patterns:
Cluster users based on frequently accessed URLs.
Use a classifier to generate a profile for each cluster.
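The cluster-then-profile idea above can be sketched in Python (a toy greedy clusterer on made-up access logs, not a production algorithm): represent each user by the set of URLs they access, group users whose sets overlap strongly, then label each cluster with its most frequently accessed URLs.

```python
from collections import Counter

def jaccard(a, b):
    """Set-overlap similarity between two URL sets."""
    return len(a & b) / len(a | b)

def cluster_users(access_logs, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose
    representative (first member) is similar enough, else start a new one."""
    clusters = []  # each cluster is a list of (user, url_set) pairs
    for user, urls in access_logs.items():
        for cluster in clusters:
            if jaccard(urls, cluster[0][1]) >= threshold:
                cluster.append((user, urls))
                break
        else:
            clusters.append([(user, urls)])
    return clusters

def profile(cluster, top=2):
    """The most frequently accessed URLs in a cluster form its profile."""
    counts = Counter(u for _, urls in cluster for u in urls)
    return [url for url, _ in counts.most_common(top)]

logs = {
    "alice": {"/quotes", "/trade", "/portfolio"},
    "bob":   {"/quotes", "/trade", "/news"},
    "carol": {"/cart", "/checkout"},
}
clusters = cluster_users(logs)
# alice and bob share /quotes and /trade -> one "on-line trader" cluster;
# carol's shopping URLs form a separate "on-line shopper" cluster.
```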
31. Internet Advertising
• Scheme 1:
Manually associate a set of ads with each user profile.
For each user, display an ad from the set based on the profile.
• Scheme 2:
Automate the association between ads and users.
Use ad-click information to cluster users.
For each cluster, find the ads that occur most frequently in the cluster; these
become the ads for the set of users in the cluster.
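The last step of Scheme 2 can be sketched in Python (cluster ids and ad names are made up for illustration): given ad clicks already grouped by user cluster, pick the most frequently clicked ads as each cluster's ad set.

```python
from collections import Counter, defaultdict

clicks = [  # (cluster_id, ad_clicked) pairs from the click log
    (0, "broker-ad"), (0, "broker-ad"), (0, "news-ad"),
    (1, "shoe-ad"), (1, "shoe-ad"), (1, "book-ad"),
]

# Tally clicks per ad within each cluster.
ads_per_cluster = defaultdict(Counter)
for cluster_id, ad in clicks:
    ads_per_cluster[cluster_id][ad] += 1

# The most frequent ads become the ad set for each cluster.
ad_sets = {c: [ad for ad, _ in counts.most_common(1)]
           for c, counts in ads_per_cluster.items()}
print(ad_sets)  # {0: ['broker-ad'], 1: ['shoe-ad']}
```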
32. Fraud
• With the growing popularity of e-commerce, systems to detect and prevent
fraud on the web have become important.
• Maintain a signature for each user based on their buying patterns on the web.
• If the buying pattern changes significantly, signal possible fraud.
• HNC software uses domain knowledge and neural networks for credit card
fraud detection.
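A minimal sketch of the signature idea in Python (the threshold, purchase amounts, and z-score test are illustrative assumptions, not how HNC's product works): summarize each user's purchase history by its mean and standard deviation, and flag a new purchase that deviates strongly.

```python
import statistics

def is_suspicious(history, amount, z_threshold=3.0):
    """Flag the new amount if it lies far outside the user's usual range."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        # Perfectly uniform history: any different amount is a change.
        return amount != mean
    return abs(amount - mean) / std > z_threshold

history = [20.0, 25.0, 22.0, 27.0, 24.0]   # user's past purchase amounts
print(is_suspicious(history, 23.0))   # False: within the usual pattern
print(is_suspicious(history, 900.0))  # True: large deviation, signal fraud
```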
33. Image Retrieval System
• Given:
A set of images
• Find:
All images similar to a given image.
All pairs of similar images.
• A few applications of image retrieval systems are:
Medical diagnosis.
Weather prediction.
Web search engines for images.
E-commerce.
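The "find all images similar to a given image" task can be sketched in Python, assuming each image has already been reduced to a numeric feature vector such as a color histogram (the filenames and vectors below are made up): rank candidates by cosine similarity to the query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Precomputed feature vectors, one per image (toy values).
features = {
    "sunset1.jpg": [0.9, 0.4, 0.1],
    "sunset2.jpg": [0.8, 0.5, 0.2],
    "xray.jpg":    [0.1, 0.1, 0.9],
}

def similar_to(query, min_sim=0.95):
    """All images whose similarity to the query meets the threshold."""
    q = features[query]
    return [name for name, v in features.items()
            if name != query and cosine(q, v) >= min_sim]

print(similar_to("sunset1.jpg"))  # ['sunset2.jpg']
```

"All pairs of similar images" follows by running the same comparison over every pair of vectors.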
34. Problems with Web Search
• Today's search engines are plagued by many problems, a few of which are
mentioned below:
The "abundance" problem.
"Limited coverage" of the web
(the largest crawlers cover less than 18% of all web pages).
"Limited query" interface based on keyword-oriented search.
"Limited customization" to individual users.
The web is "highly dynamic".
36. Conclusion
• Web mining systems need to be implemented to:
Understand visitors' profiles.
Identify a company's strengths and weaknesses.
Measure the effectiveness of online marketing efforts.
• Web mining supports ongoing, continuous improvement for e-businesses.