Web Mining
The World Wide Web (Web) is a popular and interactive
medium for disseminating information.
The Web is huge, diverse, and dynamic, which raises
scalability, multimedia data, and temporal issues
respectively.
Users face the following problems when interacting with the Web:
a) Finding relevant information
People either browse or use a search service when they
want to find specific information on the Web. A user of a
search service usually enters a simple keyword query, and
the response is a list of pages ranked by their similarity
to the query.
Search tools have the following problems:
1. Low precision
Many of the search results are irrelevant, which makes the
relevant information hard to find.
2. Low recall
Search tools cannot index all the information available on
the Web, so relevant information that is unindexed is hard
to find.
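To make the two measures concrete, the following minimal sketch computes precision and recall for a single query. The page identifiers and the `precision_recall` helper are hypothetical and only illustrate the standard definitions: precision is the share of retrieved pages that are relevant, and recall is the share of relevant pages that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements for one query.
retrieved = ["p1", "p2", "p3", "p4", "p5"]   # pages the search tool returned
relevant = ["p2", "p5", "p8", "p9"]          # pages that are actually relevant

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.40 recall=0.50
```

Here three of the five returned pages are irrelevant (low precision), and two relevant pages, p8 and p9, were never returned, for example because they were not indexed (low recall).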
b) Creating new knowledge out of the information
available on the Web
This is a data-triggered process: it presumes that we
already have a collection of Web data and want to extract
potentially useful knowledge out of it (data mining oriented).
c) Personalization of the information
This problem is often associated with the type and
presentation of information. People differ in the content and
presentation they prefer when interacting with the Web.
d) Learning about consumers or individual users
This problem deals specifically with problem c) above: knowing
what customers do and want. It includes sub-problems such as
mass-customizing information for the intended consumers or even
personalizing it for individual users, as well as problems related
to effective Web site design and management and to marketing.
Web mining techniques could be used to solve the
information overload problem.
Techniques and work from other research areas are also relevant:
Database (DB)
Information Retrieval (IR)
Natural Language Processing (NLP)
Web Document Community
Web mining is the use of data mining techniques to
automatically discover and extract information from Web
documents and services.
Web mining can be decomposed into the following subtasks:
1. Resource finding: retrieving the intended Web documents.
2. Information selection and pre-processing: automatically
selecting and pre-processing specific information from the
retrieved Web resources.
3. Generalization: automatically discovering general patterns
at individual Web sites as well as across multiple sites.
4. Analysis: validating and/or interpreting the mined
patterns.
Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from Web data. It implicitly covers the
standard process of knowledge discovery in databases (KDD).
We could simply view Web mining as an extension of KDD
that is applied to Web data.
From the KDD point of view, the terms information and
knowledge are interchangeable.
There is a close relationship between data mining, machine
learning, and advanced analysis.
Knowledge discovery in databases (KDD) is the process of
discovering useful knowledge from a collection of data. This
widely used data mining technique is a process that includes
data preparation and selection, data cleansing, incorporating
prior knowledge on data sets and interpreting accurate
solutions from the observed results.
Web Mining and Information Retrieval
Information Retrieval (IR) is the automatic retrieval of all relevant
documents while retrieving as few non-relevant documents as possible.
IR has the primary goals of indexing text and searching for useful
documents in a collection.
Research in IR includes modeling, document classification and
categorization, user interfaces, data visualization, filtering, and more.
One task that can be considered an instance of Web mining
is Web document classification or categorization, which could be
used for indexing.
In this view, Web mining is part of the (Web) IR process. Note that
not all indexing tasks use data mining techniques.
Web Mining and Information Extraction
Information Extraction (IE) has the goal of transforming a
collection of documents, usually with the help of an IR
system, into information that is more readily digested and
analyzed.
IE aims to extract relevant facts from the documents, while IR
aims to select relevant documents.
While IE is interested in the structure or representation of a
document, IR views the text in a document just as a bag of
unordered words.
In general, IE works at a finer level of granularity than IR does
on the document.
Building IE systems manually is neither feasible nor scalable for
a medium as dynamic and diverse as Web content.
Due to the nature of the Web, most IE systems focus on
extracting from specific Web sites. Others use machine learning or
data mining techniques to learn the extraction patterns or
rules for Web documents semi-automatically or automatically.
Web mining is part of the (Web) IE process. Other views
regarding the relationship between (Web) IE and Web mining
also exist.
The results of the IE process could be in the form of a
structured database or could be a compression or summary
of the original text or documents.
In the former case, IE can be viewed as a pre-processing
stage in the Web mining process: the step after the IR
process and before the data mining techniques are applied.
IE can also be used to improve the indexing process, which is
part of the IR process.
Web mining can also be used to improve Web IE (Web mining as
part of IE).
There are basically two types of IE:
IE from unstructured texts (classical or traditional IE tasks)
IE from semi-structured data (structural IE tasks)
Classical or traditional IE tasks
● IE from unstructured natural language texts typically applies
linguistic pre-processing, ranging from rather basic to slightly
deeper, before data mining.
● It has its roots in the NLP community and has been studied for
quite a long time.
● The Message Understanding Conferences (MUCs) and TIPSTER are
competitive environments that seek to improve IE and IR technologies.
● It usually relies on linguistic pre-processing such as syntactic
analysis, semantic analysis, and discourse analysis.
● It could be called a core language technology.
Structural IE tasks
● With the increasing popularity of the Web, there is a need for
structural IE systems that extract information from
semi-structured documents.
● Structural IE research differs from classical IE in that it
usually utilizes meta-information such as HTML tags, simple
syntactic cues, and delimiters that are available inside the
semi-structured data.
● Structural IE approaches that do not use linguistic
constraints are termed wrapper induction.
● Some structural IE systems are built manually using a
knowledge engineering approach.
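As an illustration of the manual, knowledge-engineering end of this spectrum, the sketch below is a hand-written wrapper that relies only on HTML tags as delimiters, with no linguistic analysis. The page fragment, the `RowWrapper` class, and the `extract_rows` helper are hypothetical; a wrapper induction system would learn extraction rules of this kind from examples instead of having them coded by hand.

```python
from html.parser import HTMLParser


class RowWrapper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # extracted records, one tuple per table row
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:       # skip rows without <td> cells (e.g. headers)
                self.rows.append(tuple(self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def extract_rows(html):
    """Return the table rows of an HTML fragment as a list of tuples."""
    parser = RowWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical semi-structured page fragment.
page = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Data Mining</td><td>42.00</td></tr>
  <tr><td>Web Search</td><td>35.50</td></tr>
</table>
"""
print(extract_rows(page))
# prints: [('Data Mining', '42.00'), ('Web Search', '35.50')]
```

The wrapper is tied to one page layout: if the site changes its markup, the rules must be rewritten, which is why hand-built wrappers scale poorly across many sites.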
● More structural IE systems for the Web are built
(semi-)automatically using machine learning techniques or other
algorithms, as building the system manually is no longer
appropriate.
● These systems are usually built by using machine learning
or data mining techniques, which learn extraction rules from
annotated corpora.
Web Mining and Machine Learning Applied on the Web
Web mining is not the same as learning from the Web or machine
learning techniques applied on the Web.
Some applications of machine learning on the Web are not
instances of Web mining.
An example is a machine learning technique used to spider the
Web efficiently for a specific topic, which emphasizes planning
the best path to traverse next.
There are also methods used for Web mining other than
machine learning methods.
Examples are some proprietary algorithms used for mining
hubs and authorities, DataGuides, and Web schema discovery.
Machine learning techniques support and help Web mining as they
could be applied to the processes in Web mining.
Applying machine learning techniques could improve the text
classification process compared to the traditional IR techniques.
Web mining intersects with the application of machine learning on
the Web.
Web Mining Categories
Web mining can be categorized into
● Web content mining
● Web structure mining
● Web usage mining
Web content mining
Web content mining describes the discovery of useful
information from Web contents, data, and documents.
However, what counts as Web content could
encompass a very broad range of data.
● Previously the Internet consisted of different types of services
such as FTP, Gopher, and Usenet. Now most of those data are
ported to or accessible from the Web.
● The amount of government information is growing.
● Digital libraries exist that are also accessible on the Web.
● Many companies are transforming their businesses and
services electronically.
● Many applications and systems are being migrated to the Web.
● Many types of applications are emerging in the Web
environment.
Some of the Web content data are hidden data, which cannot
be indexed.
These data are either generated dynamically as a result of
queries and reside in a DBMS, or are private.
Web content consists of several types of data:
● Textual, image, audio, video, metadata, as well as hyperlinks
● Multiple types of data together (multimedia data mining)
Web content data consist of unstructured data such as
free text, semi-structured data such as HTML documents,
and more structured data such as data in tables or
database-generated HTML pages.
However, much of the Web content data is unstructured text
data.
Applying data mining techniques to unstructured text is
termed knowledge discovery in text (KDT), text data mining,
or text mining.
Web content mining can be viewed from two different points of view: IR and DB.
● From the IR view, the goal of Web content mining is
mainly to assist or improve information finding and
to filter information for users, usually based on
either inferred or solicited user profiles.
● From the DB view, the goal of Web content mining is
mainly to model the data on the Web and integrate
them so that queries more sophisticated than
keyword-based search can be performed.
Web structure mining
Web structure mining tries to discover the model underlying
the link structures of the Web.
The model is based on the topology of the hyperlinks, with or
without the description of the links.
This model can be used to categorize Web pages and is
useful for generating information such as the similarity and
relationship between Web sites.
Web structure mining could be used to discover authority
sites for a subject (authorities) and overview sites that
point to many authorities for that subject (hubs).
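The slides do not name an algorithm for finding hubs and authorities. One well-known approach is Kleinberg's HITS, sketched below under that assumption: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, iterated with normalization. The toy link graph and the `hits` function name are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalise both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two overview pages pointing at three sites.
links = {
    "portal": ["siteA", "siteB", "siteC"],
    "blog": ["siteA", "siteB"],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("authority scores:", {p: round(s, 3) for p, s in sorted(auth.items())})
```

In this graph "portal" comes out as the strongest hub because it points to the most authorities, while "siteA" and "siteB" outrank "siteC" as authorities because both overview pages link to them.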
Web usage mining
Web usage mining tries to make sense of the data generated by
Web surfers' sessions or behaviors.
Web usage data include data from Web server
access logs, proxy server logs, browser logs, user profiles,
registration data, user sessions or transactions, cookies, user
queries, bookmark data, mouse clicks and scrolls, and any
other data that result from interaction.
Web content and structure mining utilize the real or primary
data on the Web.
Web usage mining mines the secondary data derived from
users' interactions with the Web.
Web Mining and the Agent Paradigm
Web mining is often viewed from or implemented within an
agent paradigm.
Web mining has a close relationship with software agents or
intelligent agents.
There are three sub-categories of software agents:
● User Interface
● Distributed
● Mobile
User interface and distributed agents are relevant to web
mining tasks.
User Interface Agents
User interface agents try to maximize the productivity of the
current user's interaction with the system by adapting their behavior.
User interface agents that can be classified into the Web
mining agent category are
● Information retrieval agent
● Information filtering agent
● Personal assistant agent
Distributed Agents
Distributed agent technology is concerned with problem
solving by a group of agents. The relevant agents in this
category are distributed agents for knowledge discovery or
data mining.
There are two frequently used approaches for developing
intelligent agents that help users find and retrieve relevant
information on the Web:
● Content-based
● Collaborative
Content-based approach
The system searches for items that match the user's
preferences based on an analysis of the items' content.
Collaborative approach
The system tries to find users with similar interests in
order to make recommendations.
We could categorize the content-based methods as Web content
mining and the collaborative approaches as Web usage
mining.
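The difference between the two approaches can be sketched in a few lines. The example below is a deliberately simple illustration, not a production recommender: it assumes Jaccard overlap as the similarity measure, and the documents, user names, and the `content_based` and `collaborative` helpers are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def content_based(profile_terms, items):
    """Rank items by how well their keywords match the user's preferred terms."""
    return sorted(items, key=lambda name: jaccard(profile_terms, items[name]),
                  reverse=True)


def collaborative(user, visits):
    """Recommend pages seen by the most similar other user but not by `user`."""
    others = [u for u in visits if u != user]
    nearest = max(others, key=lambda u: jaccard(visits[user], visits[u]))
    return sorted(set(visits[nearest]) - set(visits[user]))


# Content-based: uses the content of the items (Web content mining).
items = {
    "doc1": ["web", "mining", "logs"],
    "doc2": ["cooking", "recipes"],
    "doc3": ["web", "search"],
}
print(content_based(["web", "mining"], items))
# prints: ['doc1', 'doc3', 'doc2']

# Collaborative: uses only who visited what (Web usage mining).
visits = {
    "ann": ["/home", "/papers", "/kdd"],
    "bob": ["/home", "/papers", "/hits"],
    "cat": ["/shop"],
}
print(collaborative("ann", visits))
# prints: ['/hits']
```

Note that the collaborative function never inspects page content; it works purely from usage data, which is why the slides place it under Web usage mining.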
A related view classifies user interface agents
by underlying technology into:
Content-based filters (content mining)
Reputation filters (structure and content mining)
Collaborative or social-based filters (usage mining)
Event-based filters (usage mining)
Hybrid filters (combination of categories)
WEB CONTENT DATA STRUCTURE
Web content consists of several types of data:
text, image, audio, video, hyperlinks.
Unstructured – free text
Semi-structured – HTML
More structured – data in tables or database-generated HTML pages
Note: much of the Web content data is unstructured text data.
WEB CONTENT MINING
Information Retrieval View for Unstructured Documents
● A bag of words is used to represent unstructured documents
  Takes each single word as a feature
  Ignores the sequence in which words occur
● Features could be
  Boolean: a word either occurs or does not occur in a document
  Frequency based: the frequency of the word in a document
● Variations of the feature selection include removing case,
  punctuation, infrequent words, and stop words.
● Features can be reduced using feature selection techniques such as
  information gain, mutual information, and cross entropy, or by
  stemming, which reduces words to their morphological roots.
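The bag-of-words representation described above can be sketched as follows. This is a minimal illustration using only the standard library: the stop-word list is a tiny hypothetical one, the two documents are made up, and stemming and statistical feature selection are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, Boolean vectors, frequency vectors) for the documents."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    freq = []
    for d in docs:
        counts = Counter(tokenize(d))
        freq.append([counts[w] for w in vocab])   # frequency-based features
    boolean = [[int(c > 0) for c in row] for row in freq]  # Boolean features
    return vocab, boolean, freq


docs = ["The Web is huge and the Web is dynamic.", "Mining the Web."]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)    # prints: ['dynamic', 'huge', 'mining', 'web']
print(boolean)  # prints: [[1, 1, 0, 1], [0, 0, 1, 1]]
print(freq)     # prints: [[1, 1, 0, 2], [0, 0, 1, 1]]
```

Word order is lost: "the Web is huge" and "huge is the Web" would produce identical vectors, which is exactly the simplification the bag-of-words model makes.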
Other feature representations
● Word position in document
● Using phrases
● Using hypernyms
The use of text compression techniques is rather new for text
classification tasks.
The applications range from text classification or categorization, event
detection and tracking, finding extraction patterns or rules, to finding
some interesting patterns in the text documents.
Event detection and tracking problems are subtopics of a
broader initiative called Topic Detection and Tracking (TDT).
● TDT is an initiative to investigate the state of the art in finding and
following new events in a stream of broadcast news stories.
● Text mining has been used to describe different applications such
as text categorization, text clustering, empirical computational
linguistic tasks, exploratory data analysis, finding patterns in text
databases, finding sequential patterns in text, and association
discovery.
In terms of knowledge discovery in text (KDT) or text mining, text
documents can be structured by means of information
extraction, text categorization, or NLP techniques applied as a
pre-processing step before performing any kind of KDT.
WEB CONTENT MINING
Database View
The database techniques on the Web are related to the problems
of managing and querying the information on the Web.
There are three classes of tasks:
Modeling and querying the Web
Information extraction and integration
Web site construction and restructuring
The DB view tries to infer the structure of a Web site or to transform a
Web site into a database, with the aim of:
Better information management
Better querying on the Web
The DB view of Web content mining mainly tries to model the data on the
Web and integrate them so that queries more sophisticated than
keyword-based search can be performed.
This can be achieved by:
Finding the schema of Web documents
Building a Web warehouse
Building a Web knowledge base
Building a virtual database
The DB view mainly uses the Object Exchange Model (OEM):
● OEM represents semi-structured data as a labeled graph
● The data are viewed as a graph, with objects as the vertices and
labels on the edges
● Each object is identified by an object identifier (oid)
● Each value is either atomic (integer, string, gif, html, ...) or complex
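The OEM description above can be made concrete with a small sketch. The encoding is an assumption chosen for brevity: a dictionary maps each oid to its value, where an atomic value is a plain Python value and a complex value is a list of (label, child oid) edges. The oids, labels, data, and the `follow` helper are all hypothetical.

```python
def follow(db, oid, path):
    """Return the values of all objects reached from `oid` along a label path."""
    current = [oid]
    for label in path:
        reached = []
        for o in current:
            value = db[o]
            if isinstance(value, list):  # complex object: labelled outgoing edges
                reached.extend(child for edge_label, child in value
                               if edge_label == label)
        current = reached
    return [db[o] for o in current]


# Hypothetical OEM database: oid -> atomic value or list of (label, oid) edges.
db = {
    "&1": [("paper", "&2"), ("paper", "&3")],   # root object
    "&2": [("title", "&4"), ("year", "&5")],
    "&3": [("title", "&6")],                    # irregular: this paper has no year
    "&4": "Web Mining",
    "&5": 2000,
    "&6": "Semistructured Data",
}

print(follow(db, "&1", ["paper", "title"]))
# prints: ['Web Mining', 'Semistructured Data']
print(follow(db, "&1", ["paper", "year"]))
# prints: [2000]
```

The second paper simply lacks a "year" edge. No schema has to be declared or violated, which is what makes the model suitable for semi-structured Web data.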
The process typically starts with manual selection of Web sites for Web
content mining instead of searching the whole Internet for the specific
resources.
Main applications:
Finding frequent substructures in semi-structured data.
Creating a multi-layered database (MLDB) in which each layer is
obtained by generalization of the lower layers, and using a
special-purpose query language for Web mining to extract knowledge
from the MLDB of Web documents.
A DataGuide is a kind of structural summary of semi-structured data.
Web Structure Mining
Interested in the structure of the hyperlinks within the
Web
Inspired by the study of social networks and citation
analysis
● Can discover specific types of pages (such as hubs, authorities, etc.)
based on the incoming and outgoing links.
Applications:
Discovering micro-communities in the Web
Measuring the “completeness” of a Web site
Web Usage Mining
Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
● Web client data
● Proxy server data
● Web server data
Web usage data could also be represented with graphs.
Two common approaches
● Map the usage data of the Web server into relational tables before
applying an adapted data mining technique.
● Use the log data directly by utilizing special pre-processing
techniques.
Typical problems:
● Distinguishing among unique users, server sessions, episodes, etc. in the
presence of caching and proxy servers
● Usage mining often uses some background or domain knowledge,
e.g. navigation templates, site topology, Web content,
concept hierarchies, and syntactic constraints.
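The session-identification problem can be illustrated with a common heuristic, sketched below under stated assumptions: the log is in Common Log Format, the client host stands in for the user, and a gap of more than 30 minutes between requests starts a new session. The log lines, the timeout value, and the `parse` and `sessionize` helpers are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches Common Log Format lines for GET and POST requests.
LOG_LINE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" (\d{3}) \S+'
)


def parse(line):
    """Return (host, timestamp, url) for a log line, or None if it does not match."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    host, stamp, url, _status = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, url


def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requests into sessions: same host, gaps no longer than `timeout`."""
    sessions = []        # list of (host, [urls]) in order of session start
    open_session = {}    # host -> (index into sessions, time of last request)
    records = sorted(filter(None, map(parse, lines)), key=lambda r: r[1])
    for host, when, url in records:
        idx, last = open_session.get(host, (None, None))
        if idx is None or when - last > timeout:
            sessions.append((host, []))   # start a new session for this host
            idx = len(sessions) - 1
        sessions[idx][1].append(url)
        open_session[host] = (idx, when)
    return sessions


# Hypothetical Web server access log.
log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:58:10 -0700] "GET /papers.html HTTP/1.0" 200 1500',
    '10.0.0.2 - - [10/Oct/2000:14:00:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:15:10:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
for host, urls in sessionize(log):
    print(host, urls)
```

The host-based rule is exactly where caching and proxy servers cause trouble: many users behind one proxy share a host address, and pages served from a cache never reach the server log at all.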
Applications:
Two main categories:
● Learning a user profile (personalized)
Web users would be interested in techniques that learn their
needs and preferences automatically
● Learning user navigation patterns (impersonalized)
Information providers would be interested in techniques that
improve the effectiveness of their Web site

Weitere ähnliche Inhalte

Was ist angesagt?

Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)Amir Fahmideh
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slidesmahavir_a
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation FinalEr. Jagrat Gupta
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1Mahmoud Alfarra
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 

Was ist angesagt? (20)

web mining
web miningweb mining
web mining
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
 
Web Content Mining
Web Content MiningWeb Content Mining
Web Content Mining
 
Web mining slides
Web mining slidesWeb mining slides
Web mining slides
 
Web Mining
Web Mining Web Mining
Web Mining
 
Web mining (1)
Web mining (1)Web mining (1)
Web mining (1)
 
Web mining
Web miningWeb mining
Web mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
Web mining
Web miningWeb mining
Web mining
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Semantic web
Semantic webSemantic web
Semantic web
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
 
Introduction to-data-mining chapter 1
Introduction to-data-mining  chapter 1Introduction to-data-mining  chapter 1
Introduction to-data-mining chapter 1
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 

Andere mochten auch

Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Conceptsraulmisir
 
Data Visualization Resource Guide (September 2014)
Data Visualization Resource Guide (September 2014)Data Visualization Resource Guide (September 2014)
Data Visualization Resource Guide (September 2014)Amanda Makulec
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data VisualizationCenterline Digital
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization TechniquesAllAnalytics
 
Fundamental Ways We Use Data Visualizations
Fundamental Ways We Use Data VisualizationsFundamental Ways We Use Data Visualizations
Fundamental Ways We Use Data VisualizationsInitial State
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSINGKing Julian
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 

Andere mochten auch (15)

WEB MINING.
WEB MINING.WEB MINING.
WEB MINING.
 
Web mining
Web miningWeb mining
Web mining
 
Webmining ppt
Webmining pptWebmining ppt
Webmining ppt
 
Web Usage Pattern
Web Usage PatternWeb Usage Pattern
Web Usage Pattern
 
4.5 webminig
4.5 webminig 4.5 webminig
4.5 webminig
 
Data Warehousing Datamining Concepts
Data Warehousing Datamining ConceptsData Warehousing Datamining Concepts
Data Warehousing Datamining Concepts
 
Data Visualization Resource Guide (September 2014)
Data Visualization Resource Guide (September 2014)Data Visualization Resource Guide (September 2014)
Data Visualization Resource Guide (September 2014)
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data Visualization
 
Data Visualization Techniques
Data Visualization TechniquesData Visualization Techniques
Data Visualization Techniques
 
Data visualization
Data visualizationData visualization
Data visualization
 
Fundamental Ways We Use Data Visualizations
Fundamental Ways We Use Data VisualizationsFundamental Ways We Use Data Visualizations
Fundamental Ways We Use Data Visualizations
 
DATA WAREHOUSING
DATA WAREHOUSINGDATA WAREHOUSING
DATA WAREHOUSING
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 

Ähnlich wie Web mining

RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 
Web Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A SurveyWeb Mining Research Issues and Future Directions – A Survey
Web Mining Research Issues and Future Directions – A SurveyIOSR Journals
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataMelinda Watson
 
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSSTRATEGY AND IMPLEMENTATION OF WEB MINING TOOLS
STRATEGY AND IMPLEMENTATION OF WEB MINING TOOLSAM Publications
 
A Study Web Data Mining Challenges And Application For Information Extraction
A Study  Web Data Mining Challenges And Application For Information ExtractionA Study  Web Data Mining Challenges And Application For Information Extraction
A Study Web Data Mining Challenges And Application For Information ExtractionScott Bou
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Conceptijceronline
 
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...IAEME Publication
 
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...
Advance Frameworks for Hidden Web Retrieval Using Innovative Vision-Based Pag...IOSR Journals
 

Ähnlich wie Web mining (20)

RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 
Web mining

  • 2. Web Mining The World Wide Web (Web) is a popular and interactive medium for disseminating information today. The Web is huge, diverse, and dynamic, and thus raises issues of scalability, multimedia data, and temporal change, respectively.
  • 3. Web Mining Users face the following problems when interacting with the Web: a) Finding relevant information. People either browse or use a search service when they want to find specific information on the Web. When a user uses a search service, he or she usually inputs a simple keyword query, and the query response is a list of pages ranked by their similarity to the query.
  • 4. Web Mining Users face the following problems when interacting with the Web: a) Finding relevant information. Search tools have the following problems. 1. Low precision, which is due to the irrelevance of many of the search results. This makes it difficult to find the relevant information. 2. Low recall, which is due to the inability to index all the information available on the Web. This makes it difficult to find relevant information that is unindexed.
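The two measures on this slide can be made concrete with a small calculation. The following is a minimal sketch, assuming the set of truly relevant pages is known for a query (which in practice it rarely is); the function name and the page ids are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: page ids returned by the search tool
    relevant:  page ids that are actually relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned pages that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant pages that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 2 of them relevant,
# while 5 relevant pages exist in total.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "x", "y", "z"])
print(p, r)
```

Low precision shows up as many irrelevant pages among the results; low recall shows up as relevant pages (here `x`, `y`, `z`) that the tool never returns, for example because they were never indexed.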
  • 5. Web Mining Users face the following problems when interacting with the Web: b) Creating new knowledge out of the information available on the Web. This problem is a data-triggered process that presumes that we already have a collection of Web data and want to extract potentially useful knowledge out of it (data-mining oriented).
  • 6. Web Mining Users face the following problems when interacting with the Web: c) Personalization of the information. This problem is often associated with the type and presentation of information. People differ in the content and presentation they prefer while interacting with the Web.
  • 7. Web Mining Users face the following problems when interacting with the Web: d) Learning about consumers or individual users. This problem deals specifically with the problem above (c): knowing what customers do and want. It contains sub-problems such as mass-customizing the information for the intended consumers or even personalizing it for an individual user, problems related to effective Web site design and management, problems related to marketing, and so on.
  • 8. Web Mining Web mining techniques could be used to solve the information overload problem. Techniques and work from other research areas could also be used: database (DB), information retrieval (IR), natural language processing (NLP), and the Web document community.
  • 9. Web Mining Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. Web mining can be decomposed into these subtasks: 1. Resource finding: the task of retrieving the intended Web documents. 2. Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources. 3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites. 4. Analysis: validation and/or interpretation of the mined patterns.
  • 10. Web Mining Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from Web data. It implicitly covers the standard process of knowledge discovery in databases (KDD). We could simply view Web mining as an extension of KDD that is applied to Web data. From the KDD point of view, the terms information and knowledge are interchangeable. There is a close relationship between data mining, machine learning, and advanced analysis.
  • 11. Web Mining Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets, and interpreting accurate solutions from the observed results.
  • 12. Web Mining Web Mining and Information Retrieval Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as possible. The primary goals of IR are indexing text and searching for useful documents in a collection. Research in IR includes modeling, document classification and categorization, user interfaces, data visualization, filtering, and more. The task that can be considered an instance of Web mining is Web document classification or categorization, which could be used for indexing. In this view, Web mining is part of the (Web) IR process. Note that not all indexing tasks use data mining techniques.
  • 13. Web Mining Web Mining and Information Extraction Information extraction (IE) has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed. IE aims to extract relevant facts from the documents, while IR aims to select relevant documents. IE is interested in the structure or representation of a document, whereas IR views the text in a document just as a bag of unordered words. In general, IE works at a finer level of granularity than IR does on the document.
  • 14. Web Mining Web Mining and Information Extraction Building IE systems manually is neither feasible nor scalable for a medium as dynamic and diverse as Web content. Due to the nature of the Web, most IE systems focus on extracting from specific Web sites. Others use machine learning or data mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically. In this view, Web mining is part of the (Web) IE process. Other views of the relationship between (Web) IE and Web mining also exist.
  • 15. Web Mining Web Mining and Information Extraction The results of the IE process could be in the form of a structured database, or could be a compression or summary of the original text or documents. In the former case, one could view IE as a kind of pre-processing stage in the Web mining process: the step after the IR process and before the data mining techniques are applied. IE can also be used to improve the indexing process, which is part of the IR process. Conversely, Web mining is used to improve Web IE (Web mining is part of IE).
  • 16. Web Mining Web Mining and Information Extraction There are basically two types of IE: IE from unstructured texts (classical or traditional IE tasks) and IE from semi-structured data (structural IE tasks). Classical or traditional IE tasks ● IE from unstructured natural language texts typically uses basic to slightly deeper linguistic pre-processing before data mining. ● With roots in the NLP community, it has been studied for quite a long time. ● The Message Understanding Conferences (MUCs) and TIPSTER are competitive environments that seek to improve IE and IR technologies.
  • 17. Web Mining Web Mining and Information Extraction Classical or traditional IE tasks ● Usually rely on linguistic pre-processing such as syntactic analysis, semantic analysis, and discourse analysis. ● Could be called a core language technology.
  • 18. Web Mining Web Mining and Information Extraction Structural IE tasks ● With the increasing popularity of the Web, there is a need for structural IE systems that extract information from semi-structured documents. ● Structural IE research differs from classical IE in that it usually utilizes meta information available inside the semi-structured data, such as HTML tags, simple syntactics, and delimiters. ● Structural IE approaches that do not use linguistic constraints are termed wrapper induction. ● Some structural IE systems are built manually by a knowledge engineering approach.
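To illustrate how a structural IE system can rely on delimiters rather than linguistic analysis, here is a minimal sketch of a hand-written delimiter-based wrapper. The HTML fragment, the delimiter strings, and the function name are assumptions made for illustration; a wrapper induction system would learn such delimiters from labeled example pages instead of having them hard-coded.

```python
def extract_between(page, left, right):
    """Return every substring of `page` enclosed by `left` and `right`."""
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Hypothetical semi-structured page: country names in <b>, codes in <i>.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"

countries = extract_between(page, "<b>", "</b>")
codes = extract_between(page, "<i>", "</i>")
records = list(zip(countries, codes))
print(records)
```

The wrapper uses only the HTML tags as meta information, which is why it breaks as soon as the site changes its markup and why learning the delimiters automatically is attractive.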
  • 19. Web Mining Web Mining and Information Extraction Structural IE tasks ● More structural IE systems for the Web are built (semi-)automatically using machine learning techniques or other algorithms, as building the system manually is no longer appropriate. ● These systems are usually built using machine learning or data mining techniques, which learn extraction rules from annotated corpora.
  • 20. Web Mining Web Mining and Machine Learning Applied on the Web Web mining is not the same as learning from the Web or applying machine learning techniques on the Web. Some applications of machine learning on the Web are not instances of Web mining. An example is a machine learning technique used to spider the Web efficiently for a specific topic, which emphasizes planning the best path to traverse next. Conversely, methods other than machine learning are also used for Web mining. Examples are some proprietary algorithms used for mining hubs and authorities, DataGuides, and Web schema discovery.
  • 21. Web Mining Web Mining and Machine Learning Applied on the Web Machine learning techniques support Web mining, as they can be applied to the processes in Web mining. For example, applying machine learning techniques could improve the text classification process compared with traditional IR techniques. Web mining therefore intersects with the application of machine learning on the Web.
  • 22. Web Mining Web Mining Categories Web mining can be categorized into ● Web content mining ● Web structure mining ● Web usage mining
  • 23. Web Mining Web Mining Categories Web content mining Web content mining describes the discovery of useful information from Web contents, data, and documents. However, what the Web contents consist of could encompass a very broad range of data. ● Previously the Internet consisted of different types of services such as FTP, Gopher, and Usenet. Now most of those data have been ported to or are accessible from the Web. ● The amount of government information is growing. ● Digital libraries exist that are also accessible on the Web. ● Many companies are transforming their businesses and services electronically. ● Many applications and systems are being migrated to the Web. ● Many types of applications are emerging in the Web environment.
  • 24. Web Mining Web Mining Categories Web content mining Some of the Web content data are hidden data, which cannot be indexed. These data are either generated dynamically as a result of queries and reside in a DBMS, or are private. Web content consists of several types of data. ● Textual, image, audio, video, and metadata, as well as hyperlinks ● Multiple types of data (multimedia data mining)
  • 25. Web Mining Web Mining Categories Web content mining The Web content data consist of unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as data in tables and database-generated HTML pages. However, much of the Web content data is unstructured text. Applying data mining techniques to unstructured text is termed knowledge discovery in text (KDT), text data mining, or text mining.
  • 26. Web Mining Web Mining Categories Web content mining Web content mining can be seen from two different points of view: IR and DB. ● From the IR view, the goal of Web content mining is mainly to assist or improve information finding and to filter information for users, usually based on either inferred or solicited user profiles. ● From the DB view, the goal of Web content mining is mainly to model the data on the Web and integrate them so that more sophisticated queries than keyword-based search can be performed.
  • 27. Web Mining Web Mining Categories Web structure mining Web structure mining tries to discover the model underlying the link structures of the Web. The model is based on the topology of the hyperlinks, with or without the description of the links. This model can be used to categorize Web pages and is useful for generating information such as the similarity and relationship between Web sites. Web structure mining could be used to discover authority sites for a subject (authorities) and overview sites for a subject that point to many authorities (hubs).
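The mutual reinforcement between hubs and authorities can be sketched as an iterative computation over the link graph. The slides do not name a specific algorithm; the code below is a minimal sketch in the style of the well-known HITS iteration, and the tiny link graph, page names, and iteration count are assumptions chosen for illustration.

```python
def hubs_and_authorities(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score sums the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score sums the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two overview pages point to two content sites.
links = {
    "portal": ["siteA", "siteB"],
    "directory": ["siteA", "siteB"],
    "siteA": [],
    "siteB": ["siteA"],
}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))  # page with the highest authority score
```

In this toy graph the overview pages end up with the highest hub scores and the most linked-to site with the highest authority score, which matches the slide's description of hubs as pages that point to many authorities.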
  • 28. Web Mining Web Mining Categories Web usage mining Web usage mining tries to make sense of the data generated by Web surfers' sessions or behaviors. Web usage data includes data from Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data that results from interaction.
  • 29. Web Mining Web Mining Categories Web content mining and Web structure mining utilize the real or primary data on the Web. Web usage mining mines the secondary data derived from users' interactions with the Web.
  • 32. Web Mining Web Mining and the Agent Paradigm Web mining is often viewed from or implemented within an agent paradigm, and it has a close relationship with software agents or intelligent agents. There are three sub-categories of software agents: ● User interface ● Distributed ● Mobile. User interface and distributed agents are relevant to Web mining tasks.
  • 33. Web Mining Web Mining and the Agent Paradigm User Interface Agents User interface agents try to maximize the productivity of the current user's interaction with the system by adapting their behavior. User interface agents that can be classified into the Web mining agent category are: ● Information retrieval agents ● Information filtering agents ● Personal assistant agents
  • 34. Web Mining Web Mining and the Agent Paradigm Distributed Agents Distributed agent technology is concerned with problem solving by a group of agents. The relevant agents in this category are distributed agents for knowledge discovery or data mining. There are two frequently used approaches for developing intelligent agents that help users find and retrieve relevant information on the Web: ● Content-based ● Collaborative
  • 35. Web Mining Web Mining and the Agent Paradigm Distributed Agents Content-based approach: the system searches for items that match the user's preferences, based on an analysis of the items' content. Collaborative approach: the system tries to find users with similar interests and bases its recommendations on them. We could categorize the content-based methods as Web content mining and the collaborative approaches as Web usage mining.
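The collaborative approach can be illustrated with a nearest-neighbour recommendation. This is a minimal sketch, assuming each user's interests are available as a dictionary of item ratings and using cosine similarity as the similarity measure; the slides do not prescribe a measure, and the user names, items, and ratings are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, ratings):
    """Find the most similar other user and suggest their unseen items."""
    others = [name for name in ratings if name != target]
    nearest = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    seen = set(ratings[target])
    return nearest, sorted(item for item in ratings[nearest] if item not in seen)


# Hypothetical usage data: how strongly each user engaged with each topic.
ratings = {
    "ann": {"news": 5, "sports": 1, "tech": 4},
    "bob": {"news": 4, "tech": 5, "science": 5},
    "cat": {"sports": 5, "music": 4},
}
neighbour, suggestions = recommend("ann", ratings)
print(neighbour, suggestions)
```

Note that the code never inspects what the items contain, only who interacted with them, which is why the slide groups collaborative approaches under Web usage mining. A content-based system would instead compare item content with the user's profile.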
  • 36. Web Mining Web Mining and the Agent Paradigm A related view classifies user interface agents by their underlying technology into: content-based filters (content mining), reputation filters (structure and content mining), collaborative or social-based filters (usage mining), event-based filters (usage mining), and hybrid filters (a combination of categories).
  • 37. Web Mining WEB CONTENT DATA STRUCTURE Web content consists of several types of data: text, image, audio, video, and hyperlinks. Unstructured: free text. Semi-structured: HTML. More structured: data in tables or database-generated HTML pages. Note: much of the Web content data is unstructured text.
  • 38. Web Mining WEB CONTENT MINING Information Retrieval View for Unstructured Documents ● A bag of words represents an unstructured document: it takes each single word as a feature and ignores the sequence in which words occur. ● Features could be Boolean (a word either occurs or does not occur in a document) or frequency based (the frequency of the word in a document). ● Variations of the feature selection include removing case, punctuation, infrequent words, and stop words. ● Features can be reduced using feature selection techniques such as information gain, mutual information, and cross entropy, or by stemming, which reduces words to their morphological roots.
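The bag-of-words representation described on this slide can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-picked stop-word list and two made-up documents; it builds both the Boolean and the frequency-based feature vectors after removing case, punctuation, and stop words.

```python
import re
from collections import Counter

# Illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, Boolean vectors, frequency vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, boolean, freq


docs = ["The Web is huge and the Web is dynamic.", "Mining the Web"]
vocab, boolean, freq = bag_of_words(docs)
print(vocab)
print(boolean)
print(freq)
```

Word order is discarded entirely: the first document is reduced to the counts of `dynamic`, `huge`, and `web`, so any permutation of its words yields the same vector.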
  • 39. Web Mining WEB CONTENT MINING Information Retrieval View for Unstructured Documents Other feature representations: ● Word position in the document ● Phrases ● Hypernyms. The use of text compression techniques is rather new for text classification tasks. The applications range from text classification or categorization, event detection and tracking, and finding extraction patterns or rules, to finding interesting patterns in text documents.
  • 40. Web Mining WEB CONTENT MINING Information Retrieval View for Unstructured Documents Event detection and tracking problems are sub-topics of a broader initiative called Topic Detection and Tracking (TDT). ● TDT is an initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. ● Text mining has been used to describe different applications such as text categorization, text clustering, empirical computational linguistic tasks, exploratory data analysis, finding patterns in text databases, finding sequential patterns in text, and association discovery.
  • 41. Web Mining WEB CONTENT MINING Information Retrieval View for Unstructured Documents In terms of knowledge discovery in text (KDT) or text mining, text documents are structured by means of information extraction, text categorization, or NLP techniques as a pre-processing step before any kind of KDT is performed. ● Currently the term text mining is used to describe different applications such as text categorization, text clustering, empirical computational linguistic tasks, exploratory data analysis, finding patterns in text databases, finding sequential patterns in text, and association discovery.
  • 42. Web Mining WEB CONTENT MINING Database View Database techniques on the Web are related to the problems of managing and querying the information on the Web. There are three classes of tasks: modeling and querying the Web; information extraction and integration; and Web site construction and restructuring. The DB view tries to infer the structure of a Web site or to transform a Web site into a database, which allows better information management and better querying on the Web.
  • 43. Web Mining WEB CONTENT MINING Database View The DB view of Web content mining mainly tries to model the data on the Web and integrate them so that more sophisticated queries than keyword-based search can be performed. This can be achieved by: finding the schema of Web documents, building a Web warehouse, building a Web knowledge base, or building a virtual database.
  • 44. Web Mining WEB CONTENT MINING Database View The DB view mainly uses the Object Exchange Model (OEM). ● OEM represents semi-structured data by a labeled graph. ● The data in OEM is viewed as a graph, with objects as the vertices and labels on the edges. ● Each object is identified by an object identifier (oid). ● A value is either atomic (integer, string, gif, html, ...) or complex. The process typically starts with manual selection of Web sites for Web content mining instead of searching the whole Internet for specific resources. Main applications: ● Finding frequent substructures in semi-structured data. ● Creating a multi-layered database (MLDB), in which each layer is obtained by generalization on lower layers, and using a special-purpose query language for Web mining to extract knowledge from the MLDB of Web documents.
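The labeled-graph view of OEM can be made concrete with a small data structure. This is a minimal sketch rather than a reference implementation of OEM: the class name, the `add` and `follow` helpers, the integer oid scheme, and the sample page data are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

_next_oid = count(1)  # hypothetical oid scheme: consecutive integers


@dataclass
class OEMObject:
    # Atomic objects carry a value; complex objects carry labeled edges.
    value: object = None
    children: list = field(default_factory=list)  # (label, OEMObject) pairs
    oid: int = field(default_factory=lambda: next(_next_oid))

    def add(self, label, child):
        """Attach `child` under an edge named `label` and return it."""
        self.children.append((label, child))
        return child

    def follow(self, *labels):
        """Return all objects reachable from here along the label path."""
        current = [self]
        for label in labels:
            current = [c for obj in current for (lbl, c) in obj.children if lbl == label]
        return current


# Hypothetical Web page described as an OEM graph.
root = OEMObject()
page = root.add("page", OEMObject())
page.add("title", OEMObject("Web Mining"))
page.add("link", OEMObject("http://example.org/a"))
page.add("link", OEMObject("http://example.org/b"))

links = [obj.value for obj in root.follow("page", "link")]
print(links)
```

Because no schema is declared up front, a second `page` object could carry entirely different labels, which is what makes the data semi-structured and why queries are expressed as label paths.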
  • 45. Web Mining WEB CONTENT MINING Database View A DataGuide is a kind of structural summary of semi-structured data.
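One way to picture a structural summary is as the set of distinct label paths that occur in the data, each listed once. The code below is a simplified sketch of that idea for tree-shaped data held in nested Python dicts and lists; it is not the full DataGuide construction for arbitrary graphs, and the sample site data is hypothetical.

```python
def label_paths(node, prefix=()):
    """Yield every label path that occurs in a nested dict/list structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(node, list):
        # A list holds several objects reached through the same label.
        for child in node:
            yield from label_paths(child, prefix)


# Hypothetical semi-structured data: pages do not share a fixed schema.
site = {
    "page": [
        {"title": "Home", "link": ["a", "b"]},
        {"title": "About", "author": "x"},
    ]
}

# Keeping each distinct path once gives a compact summary of the structure.
guide = sorted(set(label_paths(site)))
for path in guide:
    print(".".join(path))
```

The summary tells a user which paths are worth querying (for example `page.author` exists, `page.date` does not) without scanning the data itself.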
  • 46. Web Mining Web Structure Mining Interested in the structure of the hyperlinks within the Web. Inspired by the study of social networks and citation analysis. ● Can discover specific types of pages (such as hubs and authorities) based on the incoming and outgoing links. Applications: discovering micro-communities in the Web, and measuring the “completeness” of a Web site.
  • 47. Web Mining Web Usage Mining Tries to predict user behavior from interaction with the Web. Wide range of data (logs): ● Web client data ● Proxy server data ● Web server data. The Web usage data could also be represented with graphs. Two common approaches: ● Map the usage data of the Web server into relational tables before an adapted data mining technique is applied. ● Use the log data directly by utilizing special pre-processing techniques.
  • 48. Web Mining Web Usage Mining Typical problems: ● Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers. ● Usage mining often uses some background or domain knowledge, e.g. navigation templates, site topology, Web content, concept hierarchies, and syntactic constraints.
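A common pre-processing step for the session problem is to split each visitor's requests into sessions using an inactivity timeout. The following is a minimal sketch under stated assumptions: visitors are identified by a single key such as an IP address (which, as the slide warns, is unreliable behind proxies and caches), the 30-minute timeout is a conventional heuristic rather than something the slides specify, and the log entries are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) requests into per-user sessions."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            # A long gap between consecutive requests starts a new session.
            if hit[0] - prev[0] > timeout:
                sessions.append((user, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((user, [url for _, url in current]))
    return sessions


def at(hour, minute):
    """Helper for the hypothetical log below (arbitrary fixed date)."""
    return datetime(2024, 1, 1, hour, minute)


log = [
    ("10.0.0.1", at(9, 0), "/"),
    ("10.0.0.1", at(9, 5), "/products"),
    ("10.0.0.1", at(10, 30), "/"),
    ("10.0.0.2", at(9, 1), "/about"),
]
sessions = sessionize(log)
print(sessions)
```

Requests served from a browser or proxy cache never reach the server log, so sessions reconstructed this way can miss page views; background knowledge such as the site topology is often used to fill in those gaps.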
  • 49. Web Mining Web Usage Mining Applications fall into two main categories: ● Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically. ● Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web sites.