Web Mining

Ain Shams University
College of Science
Dept. of Mathematics/Computer Science
Web Mining
Prepared By:
Ziyad Hazim Abid Al Jabbar

Content
Data Mining and Web Mining Definitions
Introduction and Motivations
Web Mining Categories
- Web Content Mining
- Web Usage Mining
- Web Structure Mining
Web Data Representation (Matrix Expression)
- Document-Keyword Co-occurrence Matrix
- Adjacent Matrix
- Usage Matrix
Similarity Functions
- Pearson Correlation Coefficient Function
- Cosine-Based Similarity
References

Data Mining
Web Mining
Data mining: the process (techniques and
algorithms) of extracting information or
knowledge from a data set for the
purpose of decision making.
Web mining: the process of
applying data mining techniques to
discover patterns in Web data.

Introduction and
motivations

(2011)
We simply get more memory
and keep it all

● Information technology developed rapidly at the beginning of the
twenty-first century, especially in the field of web technologies.
● The expanding use of the internet, especially after the
official birth of e-government (2001), led to the rapid
development of e-management, e-learning, e-commerce,
and e-health.

● Website content came to cover all of citizens'
daily activities, such as:
- Community services
- E-commerce
- E-education (e-learning)
- Scientific research
- Strategic planning decisions for companies and institutions
● The spread of social networks (Facebook, February 2004;
Twitter, July 2006) led to a huge increase in the use of
websites.

● Ubiquitous electronics record our decisions, our choices
in the supermarket, our financial habits, our comings
and goings. Every swipe is a record in a database.
● The World Wide Web (WWW) overwhelms us with
information; meanwhile, every choice we make is
recorded. And all of these are just personal choices;
they have countless counterparts in the world of
commerce and industry.

A Single View to the Customer
[Figure: the customer and the data sources that describe them: social media, gaming, entertainment, TV, animation, banking, finance, our known history, purchases, e-learning.]

The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: a few companies generate data; all others consume data
New Model: all of us generate data, and all of us consume data

● With more than two billion pages created
by millions of Web page authors, the World Wide Web
is an extremely rich knowledge base and a vast
resource of multiple types of information
in varied formats.
● We can all testify to the growing gap between the
generation of web data and our understanding of it. As
the volume of web data increases inexorably, the
proportion of it that people understand decreases
alarmingly.

● Experts following the activities of U.S. President Barack Obama
reported that on December 17th, 2013, he held a meeting with leaders
of information technology companies (Apple, Microsoft, Google, Yahoo,
Facebook, Twitter, LinkedIn, Salesforce, Netflix, Etsy, Dropbox, Zynga,
Sherpa Global, Comcast) in order to discuss the issue of data mining.
● He returned to the matter in his speech on January 17th, 2014,
calling for reforms to the data mining system.
● This confirms how much importance is being given to data mining
around the world.

Web Mining categories
- Web Content Mining (WCM)
- Web Usage Mining (WUM)
- Web Structure Mining (WSM)

1- Web Content Mining:
The process of extracting useful information from the
content of web documents, which may consist of:
- Text.
- Images.
- Audio & Video.
- Structured records.
- Lists.
- Tables.
Web Content Mining involves techniques for:
- Summarization.
- Classification.
- Clustering.

2- Web Usage Mining:
The process of identifying browsing patterns by analyzing
users' navigation behavior, looking at questions such as:
- How people are using a site.
- Which pages are accessed most frequently.
- Frequency of visits per document.
- Most recent visit per document.
- How frequently each hyperlink is clicked.
- Who is visiting which document, and from which location.
- Most recent use of each hyperlink.
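
Two of the questions in this list can be answered with simple counting. The following is a minimal sketch only: the `log` records, user names, and page paths are hypothetical, and real server logs would first need parsing and session identification.

```python
from collections import Counter

# Hypothetical, simplified log records: (user, page) pairs in time order.
log = [
    ("user_1", "/home"),
    ("user_2", "/home"),
    ("user_1", "/products"),
    ("user_3", "/home"),
    ("user_2", "/contact"),
]

# How many times each page was accessed.
page_hits = Counter(page for _, page in log)

# The most recent visitor of each page (later records overwrite earlier ones).
last_visitor = {page: user for user, page in log}

# The single most frequently accessed page and its hit count.
most_frequent_page, hits = page_hits.most_common(1)[0]
```

With these records, `most_frequent_page` is `"/home"` with 3 hits, and `last_visitor["/home"]` is `"user_3"`.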

3- Web Structure Mining (Link Mining):
The process of extracting patterns from hyperlinks on the web.
It generates a structural summary of websites and web pages by
analyzing their links.

Web Data Representation
Matrix expression
● The basic units for web mining are the Web page set and the user
session collection.
● Page set: the collection of all pages within a site.
● User session: the sequence of Web pages clicked by a single
user during a specific period.
Matrix expression:
Widely used to model co-occurrence activity
such as Web data.

1- Document-Keyword Co-occurrence Matrix:
● In web content mining, the relationships between a set of
documents (pages) and a set of keywords can be represented
by a Document-Keyword Co-occurrence Matrix.
● The rows of the matrix represent the documents.
● The columns of the matrix correspond to the keywords.
[Figure: matrix with documents (pages) as rows and keywords as columns.]

1- Document-Keyword Co-occurrence Matrix: (Continued)
● If a keyword appears in a document, the corresponding matrix
element value is 1, otherwise 0.
● The element value could also be a precise weight rather than only
1 or 0, reflecting the degree of co-occurrence between the
document and the keyword.
● For example, the element value could represent the frequency of a
specific keyword in a specific document.
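
Both variants of the matrix can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed method: the toy `documents` and `keywords` are hypothetical, tokens are split on whitespace, and the weight is taken to be the keyword's relative frequency in the document.

```python
from collections import Counter

# Hypothetical toy corpus: each document (page) is a short string.
documents = [
    "web mining applies data mining to web data",
    "usage mining analyzes user sessions",
    "structure mining analyzes hyperlinks",
]
keywords = ["web", "mining", "usage", "hyperlinks"]


def keyword_counts(doc):
    """Count how often each whitespace-separated token occurs in a document."""
    return Counter(doc.lower().split())


def binary_matrix(docs, kws):
    """Rows are documents, columns are keywords; 1 if the keyword appears, else 0."""
    return [[1 if keyword_counts(d)[k] > 0 else 0 for k in kws] for d in docs]


def frequency_matrix(docs, kws):
    """Same layout, but each element is the keyword's relative frequency."""
    rows = []
    for d in docs:
        counts = keyword_counts(d)
        total = sum(counts.values())
        rows.append([counts[k] / total for k in kws])
    return rows


binary = binary_matrix(documents, keywords)
weighted = frequency_matrix(documents, keywords)
```

For the first document, the binary row is `[1, 1, 0, 0]` and the weighted row is `[0.25, 0.25, 0.0, 0.0]`, since "web" and "mining" each account for 2 of its 8 tokens.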

2- Adjacent Matrix:
● The relationships between pages via their hyperlinks, which represent
the linkage information of a Web site, can be represented by an
Adjacent Matrix (also called an adjacency matrix).
● The element a_ij of the matrix indicates whether a hyperlink
connects two pages.
● If there is a hyperlink from page i to page j (i ≠ j), then the value of the
element a_ij is 1, otherwise 0.
[Figure: matrix with pages as both rows and columns; each element marks a hyperlink between two pages.]

2- Adjacent Matrix: (Continued)
● The linking relationship is directional.
● If a hyperlink is directed from page i to page j, the link is an out-link
for i and an in-link for j; a link from j to i is the reverse.
● The ith row of the adjacent matrix, which is a page vector, represents
the out-link relationships from page i to other pages.
● The jth column of the matrix represents the in-link relationships
linked to page j from other pages.
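
The row and column readings described above can be sketched as follows. The `pages` and `links` below are hypothetical, and the helper names `out_links` and `in_links` are introduced only for this illustration.

```python
# Hypothetical site with four pages and a handful of hyperlinks (source, target).
pages = ["home", "products", "about", "contact"]
links = [
    ("home", "products"),
    ("home", "about"),
    ("products", "home"),
    ("about", "contact"),
]

index = {page: i for i, page in enumerate(pages)}
n = len(pages)

# adjacency[i][j] = 1 if there is a hyperlink from page i to page j (i != j), else 0.
adjacency = [[0] * n for _ in range(n)]
for source, target in links:
    i, j = index[source], index[target]
    if i != j:  # self-links are ignored, matching the i != j condition
        adjacency[i][j] = 1


def out_links(page):
    """Pages that `page` links to: the non-zero entries of its row."""
    row = adjacency[index[page]]
    return [pages[j] for j in range(n) if row[j] == 1]


def in_links(page):
    """Pages that link to `page`: the non-zero entries of its column."""
    j = index[page]
    return [pages[i] for i in range(n) if adjacency[i][j] == 1]
```

Here `out_links("home")` returns `["products", "about"]`, while `in_links("home")` returns `["products"]`, which shows that the matrix is not symmetric because links are directional.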

3- Usage Matrix:
● In Web usage mining, a user session can be modeled as a page
vector, i.e. a user session is a collection of pages visited by the user in
the period along with their significance weights (reflecting the varying
degree of visits to different web pages).
● The total collection of user sessions can then be expressed as a usage
matrix.
[Figure: usage matrix with users as rows and Web pages as columns.]

3- Usage Matrix: (Continued)
● The ith row is the set of pages visited by user i during a period
of time.
● The jth column of the matrix shows which users have
clicked page j, as recorded in the server log file.
● The element value of the matrix, a_ij, reflects the access interest
exhibited by user i on page j, which can be used to derive the
underlying access patterns of users.
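
A usage matrix of this kind might be assembled as in the sketch below. The `sessions` data is hypothetical, and the weight is assumed here to be the number of visits; other weights, such as time spent on a page, would fit the same layout.

```python
# Hypothetical sessions: for each user, the pages visited and the number of visits.
sessions = {
    "user_1": {"home": 3, "products": 1},
    "user_2": {"home": 1, "contact": 2},
    "user_3": {"products": 4},
}
pages = ["home", "products", "contact"]
users = sorted(sessions)

# usage[i][j] is the weight of user i on page j (0 if the page was not visited).
usage = [[sessions[u].get(p, 0) for p in pages] for u in users]


def visitors(page):
    """Users with a non-zero entry in the page's column."""
    j = pages.index(page)
    return [users[i] for i in range(len(users)) if usage[i][j] > 0]
```

With this data, `usage` is `[[3, 1, 0], [1, 0, 2], [0, 4, 0]]`, and `visitors("home")` returns `["user_1", "user_2"]`.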

Similarity Functions:
● The two well-known and widely used similarity functions in
information retrieval and recommender systems are:
- Pearson correlation coefficient.
- Cosine similarity.

1- Pearson Correlation Coefficient Function:
● The Pearson correlation coefficient is calculated from the deviations
of users’ ratings on various items from their mean ratings on
the rated items.
● The attribute weight is expressed by a feature vector of
numeric ratings on various items, e.g. the rating can be from
1 to 5, where 1 stands for the least liked item and 5 for the
most preferred one.
● Given two users i and j, and their rating vectors Ri and Rj, the
Pearson correlation coefficient is then defined by:
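
A minimal sketch of the standard Pearson formula follows, written out in the code comment. It rests on assumptions that the text above does not fix: only items rated by both users contribute, each mean is taken over those shared items, and a similarity of 0.0 is returned when the correlation is undefined. The function name and the sample ratings are hypothetical.

```python
from math import sqrt

# Standard form, summing over the shared items k:
#   sim(i, j) = sum_k (R_ik - mean_i) * (R_jk - mean_j)
#               / (sqrt(sum_k (R_ik - mean_i)^2) * sqrt(sum_k (R_jk - mean_j)^2))


def pearson_similarity(ratings_i, ratings_j):
    """Pearson correlation between two users' ratings (dicts of item -> rating)."""
    common = sorted(set(ratings_i) & set(ratings_j))
    if len(common) < 2:
        return 0.0  # assumption: not enough overlap to measure correlation
    mean_i = sum(ratings_i[k] for k in common) / len(common)
    mean_j = sum(ratings_j[k] for k in common) / len(common)
    dev_i = [ratings_i[k] - mean_i for k in common]
    dev_j = [ratings_j[k] - mean_j for k in common]
    numerator = sum(a * b for a, b in zip(dev_i, dev_j))
    denominator = sqrt(sum(a * a for a in dev_i)) * sqrt(sum(b * b for b in dev_j))
    if denominator == 0:
        return 0.0  # assumption: a user with identical ratings has no correlation
    return numerator / denominator


# Hypothetical ratings on a 1-5 scale.
alice = {"item_a": 5, "item_b": 3, "item_c": 1}
bob = {"item_a": 4, "item_b": 3, "item_c": 2}
carol = {"item_a": 1, "item_b": 3, "item_c": 5}
```

In this example alice and bob rank the items in the same order, so their similarity is close to 1, while alice and carol rank them in opposite order, so theirs is close to -1.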

2- Cosine-Based Similarity:
● Since any vector can be considered a line in a
multi-dimensional space, it is intuitive to define the
similarity (or distance) between two vectors as the
cosine of the angle between the two “lines”.
● The cosine coefficient is the ratio of the dot product
of the two vectors to the product of their
vector norms. Given two vectors A and B, the cosine
similarity is then defined as:
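
The ratio described above, cos(A, B) = (A · B) / (|A| |B|), can be sketched directly. The function name is hypothetical, and returning 0.0 for a zero vector is a convention chosen for this illustration, since the ratio is undefined in that case.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Perpendicular vectors such as `[1, 0]` and `[0, 1]` give 0.0, and vectors pointing in the same direction, such as `[3, 4]` and `[3, 4]`, give 1.0. Rows of a usage matrix or a document-keyword matrix can be passed in as the vectors.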

References:
● Witten, Ian H., Eibe Frank, and Mark A. Hall. "Data Mining:
Practical Machine Learning Tools and Techniques."
(2011).
● Xu, Guandong, Yanchun Zhang, and Lin Li. "Web
Mining and Social Networking: Techniques and
Applications." (2011).