MINING CLIENT SIDE PARADATA FOR
      ADAPTIVE WEBPAGES
                       By
           Rami Shawkat Hatem Al-Salman


                     Advisor
Dr. Natheer Khasawneh


                    Co-Advisor
              Dr. Ahmad Al-Hammouri
Page  1
Contents


• Introduction.
• Server logs data.
• Clients data.
• Framework for collecting and mining client side data.
• Three case studies.
• Results and Discussions.
• Conclusions.
• Future Work.




Page  2
Introduction


• In recent years, a large number of websites have been published.

• Current web applications aim to interact with users through rich and dynamic content.

• In recent years, JavaScript has developed to interact not only with the client side but also with the server side; thus, Asynchronous JavaScript and XML (AJAX) was introduced.

• Web personalization is applied by many websites.




Page  3
Web personalization


• Web personalization aims to adapt a website to each user's specific environment, needs, and domain.

• Many websites use a recommender system to support web personalization.

• Webpages are personalized based on client preferences (e.g., interests, country, gender, etc.).




Page  4
AMAZON & Web personalization


• Amazon uses a recommender system that relies on the collaborative filtering technique to produce personal recommendations.

• Personal (client) recommendations are generated by computing the similarity between the client's preferences and those of other clients.

• The collaborative filtering technique consists of three steps:
  • Record the preferences of a group of clients.
  • Choose the group of clients whose preferences are similar to the target client, using a similarity metric.
  • Recommend options (i.e., products) to the target client.



Page  5
AMAZON as a real example




[Screenshot annotations: recommendations based on browsing history; recommendations based on the preferences of people with a similar profile]


Page  6
AMAZON as a real example




[Screenshot annotation: recommendations based on the most recently viewed items]
Page  7
Server logs data


• A server log is a log file that contains vectors of data recorded by the web server.

• Analyzing server logs can help in understanding clients' behavior (e.g., the most and least visited pages).

Example server log entry:
  IP-Address:  178.77.146.157
  date:        [03/Jan/2011:15:20:06 -0800]
  request:     "GET/default.ASPX HTTP/1.0"
  status:      200
  bytes:       8788
  referrer:    http://www.just.edu.jo
  agent:       "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"




Page  8
Apache server access.log




Page  9
Clients data


• Clients data is data recorded from the client's navigation over the elements of the visited webpage.

• Clients data can record the interactions between clients and the elements of the visited webpage, for example the name, the value, and the time spent on a specific webpage element.

Example clients data entry:
  Element name:   DIV1
  Element value:  Yes
  Spent time:     156.77 seconds
  IP-Address:     178.77.146.157
  date:           [03/Jan/2011:15:20:06 -0800]
  request:        "GET/default.ASPX HTTP/1.0"
  status:         200
  bytes:          8788
  referrer:       http://www.just.edu.jo
  agent:          "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"




Page  10
Clients data example




Page  11
Problem statement


• Most previous studies have worked on server logs data.

• These studies used Web Usage Mining (WUM) techniques to extract knowledge from that data.

• Some tools and systems have been proposed for tracking clients data.

• The previous studies related to clients data have not shown the usefulness of clients data.

• Unfortunately, until now there has been no complete framework that can record and mine clients log data.
Page  12
Motivations


• Some entries can be extracted from the client's mouse movements over the visited webpage.

• Extracting useful knowledge from clients data helps in understanding clients' behaviors and attitudes in a better way.

• It also allows supporting clients with appropriate recommendations.

• Understanding clients' behaviors and needs improves the advertisement of products on the WWW.




Page  13
Contributions


• Until now there has been no complete framework that can record and mine clients data.
• Thus, the main contribution of this thesis is to build a complete framework that can record clients' events and apply WUM techniques to this data.
  • We mainly show the usefulness of the clients data.
• We customize the clients data and then apply WUM techniques to it.
• We build three different web applications and integrate our framework with them.
• We build a recommendation engine that is able to discover the clients' patterns.
• We extract useful information from the clients data.
  • We generate a clients data model based on clients data statistics.
Page  14
Framework for collecting and mining client side data


• We propose a framework to record and mine client-side data.
• Our framework consists of five phases:
  1. Session identification.
  2. Events identification and catching.
  3. Events storing.
  4. Merging and exporting events.
  5. Web mining.

Page  15
Framework for collecting and mining client side data




Page  16
Session identification


• Once a client requests a webpage, a session id is assigned to him.

• The session id is the number of milliseconds since midnight, Jan 1, 1970; in this way the session id assigned to each client is unique (a small sketch follows below).

• The generated session id is used to identify all recorded events that belong to the same user.

• The client's session is finished by a target button or link.




Page  17
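The id scheme can be sketched in a few lines. A minimal Python illustration follows; note that the thesis computes this value in client-side JavaScript (Date.getTime()), and the record layout and function names here are hypothetical:

```python
import time

def new_session_id() -> int:
    """Return a session id as milliseconds since midnight Jan 1, 1970 (UTC).

    Illustrative only: the thesis generates this value in the browser with
    JavaScript's Date.getTime(); this is a server-side equivalent.
    """
    return int(time.time() * 1000)

def tag_event(session_id: int, name: str, value: str) -> dict:
    """Attach the session id to a recorded event so that all events of one
    client visit can later be grouped together (hypothetical record layout)."""
    return {"session_id": session_id, "name": name, "value": value,
            "date": time.strftime("%d/%b/%Y:%H:%M:%S")}

sid = new_session_id()
print(tag_event(sid, "DIV1", "Yes"))
```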
Events identification and recording


• We identify web elements and their associated events.

• The clients data, together with the session id, is transferred via an XmlHttpRequest AJAX call.

• With AJAX, transferring the data is a lightweight operation (clients do not notice while the data is transferred to the server).

• Seven values are recorded: name, value, item time, session id, date, total mouse clicks, and Personalized.

• Personalized represents the web element that finishes the session.
Page  18
Cont, Events identification and recording


• Our events are classified into two categories:
  • Clickstream-based.
  • Time-based.

• In the clickstream-based category, the name and value of the clicked element are transferred.

• In the time-based category, the name, the value, and the time spent on the web element are transferred.




Page  19
Snapshot of clickstream-based data (Events storing)




Page  20
Snapshot of time-based data (Events storing)




Page  21
Merging and Exporting data


• The records are grouped per client session (session id).
• Our merging algorithm works as follows (see the sketch below):
  1. Load the list of session ids.
  2. For each session id:
     i.  If the data is clickstream-based, accumulate the sequence of clicks.
     ii. If the data is time-based, accumulate the time spent on each element.

• The merged data is exported to another database table.
• The output of this phase is the input of the web mining phase.



Page  22
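A minimal Python sketch of this merging step, assuming each raw record is a dictionary with a session id, a mode flag, an element name, and, for time-based records, the spent time (the field names are illustrative, not the thesis schema):

```python
from collections import defaultdict

def merge_events(records):
    """Group raw client-side records by session id and merge them.

    Clickstream-based records are accumulated into an ordered sequence of
    clicked element names; time-based records are summed per element.
    The record layout (dict keys) is a hypothetical illustration.
    """
    clickstreams = defaultdict(list)                        # session_id -> [element, ...]
    spent_time = defaultdict(lambda: defaultdict(float))    # session_id -> element -> seconds

    for rec in records:
        sid = rec["session_id"]
        if rec["mode"] == "clickstream":
            clickstreams[sid].append(rec["name"])
        elif rec["mode"] == "time":
            spent_time[sid][rec["name"]] += rec["spent_time"]

    return clickstreams, spent_time

records = [
    {"session_id": 1, "mode": "clickstream", "name": "bold"},
    {"session_id": 1, "mode": "clickstream", "name": "italic"},
    {"session_id": 2, "mode": "time", "name": "DIV1", "spent_time": 12.5},
    {"session_id": 2, "mode": "time", "name": "DIV1", "spent_time": 4.0},
]
clicks, times = merge_events(records)
print(dict(clicks))                                 # {1: ['bold', 'italic']}
print({s: dict(t) for s, t in times.items()})       # {2: {'DIV1': 16.5}}
```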
Snapshot of merging data in clickstream-based




Page  23
Snapshot of merging data in time-based




Page  24
Web Mining


• As in every data mining task, the process of Web Usage Mining consists of three steps:
  • Data preprocessing.
  • Pattern discovery and web mining.
  • Information and pattern analysis.




Page  25
Data preprocessing


• Preprocessing (data cleaning) aims to remove irrelevant data and keep the consistent data.

• The preprocessing is performed based on thresholds (a sketch follows below).

• We mainly use two thresholds:
  – The total session time.
  – The total number of visited elements.




Page  26
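A minimal sketch of this threshold-based cleaning step, assuming each merged session carries its total time and the list of elements it visited (field names and the default threshold values are illustrative only):

```python
def preprocess(sessions, min_elements=10, min_total_time=200.0):
    """Keep only the sessions that pass both thresholds.

    `sessions` maps a session id to a dict with 'total_time' (seconds) and
    'elements' (list of visited element names); the layout is hypothetical.
    """
    kept, removed = {}, {}
    for sid, s in sessions.items():
        if len(s["elements"]) >= min_elements and s["total_time"] >= min_total_time:
            kept[sid] = s
        else:
            removed[sid] = s
    return kept, removed

sessions = {
    1: {"total_time": 350.0, "elements": ["bold", "italic"] * 6},   # passes both thresholds
    2: {"total_time": 90.0,  "elements": ["bold"]},                 # pruned
}
kept, removed = preprocess(sessions)
print(sorted(kept), sorted(removed))   # [1] [2]
```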
Pattern discovery and web mining




Page  27
Information and Pattern analysis


• Most of the time, analyzing the generated patterns and information allows us to understand clients' behavior more deeply.

• The output of this step can be formulated in many forms.

• One of the most important forms is a generated model, which is usually extracted from the statistics (e.g., frequencies).




Page  28
Three case studies


• To validate the proposed framework, we integrated it with three different web applications.
• The three web applications are:
  1. A web-based editor control (TinyMCE).
  2. An e-commerce web application.
  3. An e-survey web application.
• The three web applications are hosted online.




Page  29
TinyMCE


• TinyMCE is a platform-independent, web-based JavaScript HTML editor control.
• We modified the TinyMCE source code to integrate the proposed framework with it.
• The events of TinyMCE belong to the general (clickstream-based) data mode.
• We applied data mining to cluster the clients' sequences and discover their patterns.
• Finally, we classify the clustered output.




Page  30
Snapshot of TinyMCE




Page  31
Data Collection


• As a source of data, 60 students from JUST in the CPE 411 and CPE 311 classes were asked to use our system.

• We asked the students to use TinyMCE to write an advertisement about JUST that encourages students from European Union (EU) countries to study at JUST.
• The click events are recorded.

• The events are merged in the general data mode.

• The merged data is the input for the data preprocessing step.


Page  32
Snapshot of merged data




Page  33
Data Preprocessing


• The collected data was preprocessed by removing invalid sequences.

• The invalid sequences were determined based on two thresholds:
  1. The number of clicked controls.
  2. The total session time spent in the sequence.
• Heuristically, we used 10 clicks as the first threshold and 200 seconds as the second threshold.

• The data preprocessing step reduces the total number of sequences to 36 (24 sequences are removed).




Page  34
Clustering


• We separated the students' sequences into clusters of similar clickstream sequences.
• We applied the K-means clustering technique with heuristic numbers of clusters equal to two, three, and four.
• We used edit distance as the distance measure to calculate the similarity (or dissimilarity) between any two objects close to the mean point (a sketch follows below).
• The main goal of clustering is to label the students' sequences.

[Figure: the points represent the students' sequences]

Page  35
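A minimal sketch of the edit (Levenshtein) distance between two clickstream sequences, which is the dissimilarity value the clustering step relies on; this is the standard textbook algorithm, not the thesis code:

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two clickstream sequences.

    Counts the minimum number of insertions, deletions, and substitutions of
    clicked elements needed to turn one sequence into the other; K-means (or a
    medoid-based variant) can use this value as the dissimilarity between two
    students' sequences.
    """
    m, n = len(seq_a), len(seq_b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(edit_distance(["bold", "italic", "undo"], ["bold", "undo"]))  # 1
```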
Pattern discovery


• The clustered sequences are used as input to the pattern discovery algorithm.
• We applied the Generalized Sequential Pattern (GSP) algorithm to extract the patterns from each cluster.
• GSP not only discovers the pattern sequences but also preserves the order of these patterns.
• The output of GSP is the top ten patterns of each cluster.
• These patterns will be assigned later in the classification step.




Page  36
Classification


• The output of the clustering step was used as input to the classification models.

• The total session time, the number of controls, and the clickstream sequence are used as the three features of our classification models.

• The classification models are trained on these features and data.

• We use two classifiers: Naive Bayes and Support Vector Machines.

• After the training phase, our classifiers were able to classify new clients into one of the two, three, or four classes.
Page  37
E-commerce system


• In the second case study, an e-commerce web application was built from scratch.
• We integrated our framework with it.
• Our e-commerce system offers two categories of products: cameras and mobiles.
• The main goal of this web application is to show that the classification of similar clients can be done easily and directly.
• Each product has seven features.




Page  38
Snapshot of the E-commerce system for mobiles




Page  39
Snapshot of the E-commerce system for cameras




Page  40
Data Collection


• As sources of data we depend on three groups:
  • Students from JUST University.
  • Students from Heinrich-Heine University of Duesseldorf (Germany).
  • Social network websites (Facebook, Myspace, etc.).
• We record the events.
• The events are merged in the time-based mode.
• In the time-based mode, the times spent over any cell within a specific user session are aggregated.
• Based on our database statistics, 58 clients bought cameras and 54 clients bought mobiles.




Page  41
Snapshot of merged data in time-based mode




Page  42
Data Preprocessing


• The total session time and the number of visited features are used as the two thresholds.
• Based on our experiments, we set the total session time threshold to 20 and the number of visited features to 7.
• Based on these thresholds:
  – For the cameras data, 40 client transactions are pruned and 18 remain.
  – For the mobiles data, 35 client transactions are pruned and 20 remain.




Page  43
Classification


• In the time-based data mode, the classification models can be applied directly to the preprocessed data.
• Each client transaction is labeled by the buy-product button (e.g., a client who bought camera #1).
• The aggregated times spent over the 28 features (4 products × 7 features) are used as the main features.
• Our classification models are trained on the preprocessed time-based data (see the sketch below).
• We use three classifiers: Naive Bayes, Support Vector Machines, and Decision Tree (C4.5 algorithm).




Page  44
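A minimal scikit-learn sketch of this step, assuming the preprocessed time-based data is already a numeric matrix of 28 aggregated-time features per client with the bought product as the label. The data below is a random placeholder (not the thesis dataset), and scikit-learn's DecisionTreeClassifier is a CART-style stand-in for C4.5:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART; stands in for C4.5 here

rng = np.random.default_rng(0)
X = rng.random((38, 28))           # 38 clients x 28 aggregated-time features (placeholder)
y = rng.integers(0, 4, size=38)    # label: which of the 4 products was bought (placeholder)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```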
E-survey


• In the third case study, an e-survey web application was built from scratch.
• We integrated our framework with it.
• The e-survey is a simple web application that allows students to assess lecturers through both multiple-choice and essay questions.
• The main goal of the e-survey is to understand students' attitudes and behavior.
• The e-survey webpage consists of twelve questions (eleven multiple-choice questions and one essay question).
• Each multiple-choice question has four options (cannot do it at all, weak, good, and very good).




Page  45
Snapshot of E-Survey




Page  46
Data Collection


• As sources of data we depend on three groups:
  • Students from the Yarmook accounting class.
  • Students from the Jadara computer skills class.
  • Students from the Philadelphia design class.
• We record the events.
• The events are merged in the time-based mode.
• In the time-based mode, the times spent over any question within a specific user session are aggregated.
• Based on our database statistics, 101 students assessed their lecturers:
  – 37 students from Yarmook University, 38 students from Philadelphia University, and 26 students from Jadara University.



Page  47
Data Preprocessing


• The total session time and the number of visited questions are used as the two thresholds.
• Based on our experiments, we set the total session time threshold to 25 and the number of visited questions to 12.
• Based on these thresholds, 11 student transactions are discarded from the student database.
  – The remaining transactions are 90.




Page  48
Snapshot of preprocessed data




Page  49
Classification


• The aggregated times spent over the 12 questions are used as the main 12 features.
• In the e-survey, the recorded transactions are not labeled directly.
• Labeling is done through a flag question.
• Our classification models are trained on the preprocessed time-based data.
• We use three classifiers: Naive Bayes, Support Vector Machines, and Decision Tree (C4.5 algorithm).




Page  50
The student's data model (exponential)

[Figure: histogram of question frequencies ("Questions-Freq"); x-axis: time in seconds (1 to 58), y-axis: number of questions (0 to 450); the distribution decays approximately exponentially.]


Page  51
Evaluation


• For evaluation purposes, we use three well-known measures from the information retrieval field: 1. Precision, 2. Recall, 3. F-measure.

• The False Positive (FP) and False Negative (FN) measures are used to evaluate the errors of the classification models.
• For testing purposes, the classifiers are tested in two modes (a sketch follows below):
  – The training dataset method.
  – The 5-fold cross-validation method.
• The training dataset method uses the dataset for both training and testing.
• The 5-fold cross-validation method divides the dataset into subsets; one of them is used for testing and the remaining subsets for training.


Page  52
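A minimal scikit-learn sketch of the two test modes and the three measures. The data is a random placeholder, only Naive Bayes is shown for brevity, and the macro-averaged precision, recall, and F-measure used here may differ from the averaging used in the thesis:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.random((90, 12))           # 90 students x 12 per-question time features (placeholder)
y = rng.integers(0, 2, size=90)    # class label from the flag question (placeholder)

# Training-dataset mode: train and test on the same data.
clf = GaussianNB()
clf.fit(X, y)
pred_train = clf.predict(X)

# 5-fold cross-validation mode: each sample is predicted by a model
# trained on the other four folds.
pred_cv = cross_val_predict(GaussianNB(), X, y, cv=5)

for label, pred in [("training set", pred_train), ("5-fold CV", pred_cv)]:
    print(label,
          "P=%.2f" % precision_score(y, pred, average="macro"),
          "R=%.2f" % recall_score(y, pred, average="macro"),
          "F=%.2f" % f1_score(y, pred, average="macro"))
```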
5 folds cross-validation method



[Figure: the five folds; in each round, four folds (green) are used as training subsets and one fold (red) as the testing subset.]




Page  53
Results-TinyMCE



[Chart: Precision, Recall, and F-Measure values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.]

 Page  54
Results-TinyMCE


[Chart: False Positive and False Negative values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.]

Page  55
Results E-Survey



[Chart: Precision, Recall, and F-Measure values for DT, Naive Bayes, and SVM, using the training dataset and using 5-fold cross-validation.]



Page  56
Results E-Survey



[Chart: False Negative and False Positive values for DT, Naive Bayes, and SVM, using the training dataset and using 5-fold cross-validation.]



 Page  57
Conclusions


• Clients data is very useful.
• Clients data is flexible to mine.
• Clients data can take multiple forms.
• Clustering should be used to label unlabeled client transactions.
• Classification is very practical on clients data.
• Our complete framework will help to improve clients' experiences.
• Our classification models show the ability to classify with a high accuracy rate.




Page  58
Future Work


• We look forward to dealing with more clients data, such as x,y coordinates.

• We plan to develop new clustering and classification techniques that can deal efficiently with clients data.

• We will extract more knowledge from clients data.




Page  59
Thank You

Page  60
Results for E-commerce cameras

[Charts: Precision, Recall, and F-Measure values (top) and FN and FP values (bottom) for DT, Naive Bayes, and SVM on the cameras data.]


Page  61
Snapshot of the generated decision tree model for the cameras category




Page  62
Results for E-commerce mobiles

[Charts: Precision, Recall, and F-Measure values (top) and FN and FP values (bottom) for DT, Naive Bayes, and SVM on the mobiles data.]



Page  63
Snapshot of the generated decision tree model for the mobiles category




Page  64
Web applications links


 http://web-engineering.orgfree.com/
 http://easyshoping.orgfree.com/
 http://questions.orgfree.com/




Page  65
Machine learning Algorithms


• Naïve Bayes is a probabilistic model based on Bayes' theorem.




\[ \Pr(C \mid F) \;=\; \frac{\Pr(F \mid C)\,\Pr(C)}{\Pr(F)} \]
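As a small worked instance with hypothetical numbers (not taken from the thesis data): if feature value F is observed in 60% of the sessions of class C, class C covers 30% of all sessions, and F appears in 40% of all sessions, then

\[ \Pr(C \mid F) \;=\; \frac{0.6 \times 0.3}{0.4} \;=\; 0.45 . \]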




Page  66
Machine learning Algorithms


• C4.5 is a supervised machine learning algorithm that was developed from the earlier ID3 algorithm.
• C4.5 generates decision trees from a set of training data based on the concept of information entropy.




Page  67
Machine learning Algorithms


• SVM is a supervised machine learning algorithm. The main idea is to find a separating line called a hyperplane.

• The hyperplane separates the n-dimensional data completely into its two (or more) classes.




Page  68

Weitere ähnliche Inhalte

Was ist angesagt?

Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementMongoDB
 
CouchDB : More Couch
CouchDB : More CouchCouchDB : More Couch
CouchDB : More Couchdelagoya
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
Content Mangement Systems and MongoDB
Content Mangement Systems and MongoDBContent Mangement Systems and MongoDB
Content Mangement Systems and MongoDBMitch Pirtle
 
Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementMongoDB
 
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerIBM Cloud Data Services
 
Tuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesTuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesKeshav Murthy
 
User Data Management with MongoDB
User Data Management with MongoDB User Data Management with MongoDB
User Data Management with MongoDB MongoDB
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbDimgba Kalu
 
Approaches to mobile site development
Approaches to mobile site developmentApproaches to mobile site development
Approaches to mobile site developmentErik Mitchell
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategyIBM Sverige
 
Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8goodfriday
 
CData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaCData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaJerod Johnson
 
Open analytics | Cameron Sim
Open analytics | Cameron SimOpen analytics | Cameron Sim
Open analytics | Cameron SimOpen Analytics
 
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Mahmoud Hamed Mahmoud
 

Was ist angesagt? (19)

Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content Management
 
Spsl v unit - final
Spsl v unit - finalSpsl v unit - final
Spsl v unit - final
 
CouchDB : More Couch
CouchDB : More CouchCouchDB : More Couch
CouchDB : More Couch
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
Content Mangement Systems and MongoDB
Content Mangement Systems and MongoDBContent Mangement Systems and MongoDB
Content Mangement Systems and MongoDB
 
Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content Management
 
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data Layer
 
Tuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesTuning for Performance: indexes & Queries
Tuning for Performance: indexes & Queries
 
User Data Management with MongoDB
User Data Management with MongoDB User Data Management with MongoDB
User Data Management with MongoDB
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdb
 
Approaches to mobile site development
Approaches to mobile site developmentApproaches to mobile site development
Approaches to mobile site development
 
Caching in asp.net
Caching in asp.netCaching in asp.net
Caching in asp.net
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategy
 
Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8
 
CData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaCData Data Today: A Developer's Dilemma
CData Data Today: A Developer's Dilemma
 
Asp.net
Asp.netAsp.net
Asp.net
 
Open analytics | Cameron Sim
Open analytics | Cameron SimOpen analytics | Cameron Sim
Open analytics | Cameron Sim
 
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
 

Andere mochten auch

المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Hallgrímur pétursson
Hallgrímur pétursson Hallgrímur pétursson
Hallgrímur pétursson odinnthor
 
Austur evropa
Austur evropaAustur evropa
Austur evropaodinnthor
 
Infromatika
InfromatikaInfromatika
Infromatikaainhoa05
 
Ordenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaOrdenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaainhoa05
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Austur evropa
Austur evropaAustur evropa
Austur evropaodinnthor
 
الشفافيات التعليمية
الشفافيات التعليميةالشفافيات التعليمية
الشفافيات التعليميةtech5101
 
PIC microcontroller
PIC microcontroller PIC microcontroller
PIC microcontroller Rami Alsalman
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Ordenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuOrdenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuainhoa05
 

Andere mochten auch (17)

المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Hallgrímur pétursson
Hallgrímur pétursson Hallgrímur pétursson
Hallgrímur pétursson
 
Austur evropa
Austur evropaAustur evropa
Austur evropa
 
Broadband strategy
Broadband strategyBroadband strategy
Broadband strategy
 
Infromatika
InfromatikaInfromatika
Infromatika
 
Ordenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaOrdenagailuaren desmuntaketa
Ordenagailuaren desmuntaketa
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Austur evropa
Austur evropaAustur evropa
Austur evropa
 
Grid computing
Grid computingGrid computing
Grid computing
 
CSR
CSRCSR
CSR
 
Hekla
HeklaHekla
Hekla
 
Nasdse
NasdseNasdse
Nasdse
 
الشفافيات التعليمية
الشفافيات التعليميةالشفافيات التعليمية
الشفافيات التعليمية
 
PIC microcontroller
PIC microcontroller PIC microcontroller
PIC microcontroller
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Ordenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuOrdenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsu
 
Airplanes
AirplanesAirplanes
Airplanes
 

Ähnlich wie Web Mining

Clickstream Analysis
Clickstream AnalysisClickstream Analysis
Clickstream Analysisintuitiv.de
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining Editor IJMTER
 
Web Database
Web DatabaseWeb Database
Web Databaseidroos7
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioINFOGAIN PUBLICATION
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningIJMER
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...IJSRD
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Editor IJCATR
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage dataijfcstjournal
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage MiningDaminda Herath
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage miningDaminda Herath
 
Detective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeDetective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeAmazon Web Services
 

Ähnlich wie Web Mining (20)

C017231726
C017231726C017231726
C017231726
 
Clickstream Analysis
Clickstream AnalysisClickstream Analysis
Clickstream Analysis
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining
 
Dos1
Dos1Dos1
Dos1
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
Research Paper
Research PaperResearch Paper
Research Paper
 
Web Database
Web DatabaseWeb Database
Web Database
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
 
E017413647
E017413647E017413647
E017413647
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
 
L017418893
L017418893L017418893
L017418893
 
Web servers
Web serversWeb servers
Web servers
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...
 
Nadee2018
Nadee2018Nadee2018
Nadee2018
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage data
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage mining
 
Detective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeDetective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record Change
 

Kürzlich hochgeladen

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Web Mining

  • 1. MINING CLIENT SIDE PARADATA FOR ADAPTIVE WEBPAGES By Rami Shawkat Hatem Al-Salman Advisor Dr.Natheer Khasawneh Co-Advisor Dr. Ahmad Al-Hammouri Page  1
  • 2. Contents  Introduction.  Server logs data.  Clients data.  Framework for collecting and mining client side data.  Three case studies.  Results and Discussions.  Conclusions.  Future Work. Page  2
  • 3. Introduction  In the recent years a large number of websites is published.  Current web applications aim to interact with users through rich and dynamic contents.  In the recent years JavaScript has developed to be more interactive not only with a client side but also with the server side, Thus, Asynchronous JavaScript and XML (AJAX) is introduced.  Web personalization is applied by several websites. Page  3
  • 4. Web personalization  Web personalization concerns to support the user’s specific environment related to their needs and domain.  Many websites use recommender system for supporting a web personalization.  Webpage's are personalized based on clients preferences (i.e., interests, country, gender etc…). Page  4
  • 5. AMAZON & Web personalization  AMAZON uses recommender system relay on collaborative filtering technique for producing personal recommendations.  Personal (client) recommendations are generated by computing similarity between client preference and others.  Collaborative filtering technique consists of three steps:  Record the preferences of a group of clients.  Choose group of clients whose preferences are similar to the target client using a similarity metric .  Recommend options (i.e., products) to the target client . Page  5
  • 6. AMAZON as a real example Recommendations based Recommendations based on preferences of people on browsing history with similar profile Page  6
  • 7. AMAZON as a real example Recommendations based on most recent viewed items Page  7
  • 8. Server logs data  server log is a log file that contains Entry name Server Log Info vectors of data which are recorded by web server. IP-Address 178.77.146.157 date [03/Jan/2011:15:20:06 -0800]  The analysis for server logs can help to understanding client’s behavior (i.e., request "GET/default.ASPX HTTP/1.0" the most and least traffic). status 200 bytes 8788 referrer http://www.just.edu.jo agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)" Page  8
  • 10. Clients data  Clients data is a data which is recorded Entry name Client Info based on the client navigation to the Element name DIV1 visited Webpage elements.  Clients data could record the Element value Yes interactions between clients and the Spent time 156.77 seconds elements in the visited Webpage. IP-Address 178.77.146.157  For example: record the name, value and spent time for specific date [03/Jan/2011:15:20:06 -0800] Webpage element. request "GET/default.ASPX HTTP/1.0" status 200 bytes 8788 referrer http://www.just.edu.jo agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)" Page  10
  • 12. Problem statement  Most previous studies are investigated by working on server logs data.  The previous studies used Web Usage Mining (WUM) techniques for extracting the knowledge from this data.  Some tools and systems are proposed for tracking clients data.  The previous studies which related to clients data have not shown the usefulness of clients data.  Unfortunately , until now there is no complete framework which could record and mine in the clients logs data. Page  12
  • 13. Motivations  Some entries can be extracted from the client’s mouse movements over the visited Webpage.  Extracting useful knowledge from clients data, will help to understanding clients’ behaviors and attitudes in better way.  Support clients with appropriate recommendations.  The understanding of clients behaviors and needs, will improve the advertisements for products in WWW. Page  13
  • 14. Contributions  Until now there is no complete framework which could record and mine in the clients data.  Thus, the main contribution of this thesis is to building a complete framework that can recode client’s events and apply the WUM techniques on this data .  We mainly show the usefulness of the client’s data. • We customize the client’s data and then we apply WUM techniques on it. • We build three different web applications and then we integrate our framework with their. • We build a recommendation engine which is able to discovering the client’s patterns . • We extract the useful information from the client’s data.  We generate client’s data model based on client’s data statistics. Page  14
  • 15. Framework for collecting and mining client side data  We propose a framework to record and mine client’s side data.  Our framework consists of five phases respectively:  Session identification  Events identification and catching.  Events storing.  Merging and exporting events.  Web mining. Page  15
  • 16. Framework for collecting and mining client side data Page  16
  • 17. Session identification  Once a client requests a webpage, the session id is assigned for him.  The session id presents the number of milliseconds since midnight Jan 1, 1970, by this way the assigned session id for each client is a unique.  The generated session id is used to identify all recorded events which belong to the same user.  The session for the client can be finished by a target button or link. Page  17
  • 18. Events identification and recording  We identify web elements and associated events.  The clients data is transferred associated with session id via XmlHttpRequest AJAX call.  Based on AJAX, the transferring data is a lightweight operation (Clients never feel while data is transferred to server ).  Seven values are recorded: name, value, Item time, session id, Date, Total mouse's clicks and Personalized.  Personalized, represents the web element that finishes the session. Page  18
  • 19. Cont, Events identification and recording  Our events are classified into two categories:  Clickstream-based.  Time based.  In the clickstream-based category, the name and value of clicked element will be transferred.  In the time-based category, the name, the value and the spent time of web element will be transferred. Page  19
  • 20. Snapshot of clickstream-based data (Events storing) Page  20
  • 21. Snapshot of time-based data (Events storing) Page  21
  • 22. Merging and Exporting data  The records are grouped per client session (session id).  Our merging algorithm works as follow: 1. Load a list of session id’s 2. For each session id: i. If the data is clickstream-based then accumulate the sequence of clicks. ii. If the data is time-based then accumulate the spent time over each element.  The merged data is exported to another Database table.  The output this phase will be the input for the web mining phase. Page  22
  • 23. Snapshot of merging data in clickstream-based Page  23
  • 24. Snapshot of merging data in time-based Page  24
  • 25. Web Mining  As in every data mining task, the process of Web Usage Mining consists of three steps: • Data preprocessing. • Pattern discovery and web mining. • Information and Pattern analysis. Page  25
  • 26. Data preprocessing  Preprocessing or data cleaning process is aiming to remove irrelevant data and keeps the consistent data.  The preprocessing is fulfilled based on thresholds.  We mainly use two thresholds: – The total session time. – The total number of visited elements. Page  26
  • 27. Pattern discovery and web mining Page  27
  • 28. Information and Pattern analysis  Most of times, the analysis of the generated patterns and information allows us to understand clients behavior deeply.  The output of this step can be formulated in many forms.  One of the most important forms is a generated model which is usually extracted from the statistics (i.e., frequencies.). Page  28
  • 29. Three case studies  To validate the proposed framework we have integrated the framework with three different web applications.  The three web applications are: 1. Web based editor controls (TinyMCE). 2. E-commerece web application. 3. E-survey web application.  The three web applications are hosted online. Page  29
  • 30. TinyMCE  TinyMCE is a platform independent web based Javascript HTML editor control.  We modified TinyMCE source code to integrate the proposed framework with it.  The events of TinyMCE belong to general data (or clickstream-based data).  We applied data mining to cluster and discover the client’s sequence patterns.  Finally we classify the clustered output. Page  30
  • 32. Data Collection  As a source of data 60 students from JUST in CPE 411 and CPE 311 classes are asked to use our system.  We asked the students to write an advertisement using TinyMCE about JUST to encourage students from Europe Union (EU) countries to study in JUST.  The click events are recorded.  The events are merged in a general data mode.  The merged data will be the input for the data preprocessing step. Page  32
  • 33. Snapshot of merged data Page  33
• 34. Data Preprocessing  The collected data was preprocessed by removing invalid sequences.  The invalid sequences were determined based on two thresholds: 1. The number of clicked controls. 2. The total session time spent on the sequence.  Heuristically, we used 10 clicks as the first threshold and 200 seconds as the second threshold.  The data preprocessing step reduces the total number of sequences to 36 (24 sequences are removed). Page  34
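A minimal sketch of this pruning step with the two thresholds quoted above, assuming each session carries its click sequence and total time in seconds, and assuming sessions below either threshold are the ones treated as invalid (the slides do not state the direction of the cut):

```python
MIN_CLICKS = 10         # first threshold from the slides
MIN_SESSION_TIME = 200  # second threshold, in seconds

def keep_session(session):
    # a session is kept only if it passes both thresholds
    return (len(session["clicks"]) >= MIN_CLICKS
            and session["total_time"] >= MIN_SESSION_TIME)

def preprocess(sessions):
    # prune invalid sequences; in the TinyMCE study this left 36 of 60 sessions
    return [s for s in sessions if keep_session(s)]
```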
• 35. Clustering  We separated the students' sequences into clusters with similar clickstream sequences.  We applied the K-means clustering technique with heuristic numbers of clusters equal to two, three, and four.  We used edit distance as the distance measure to calculate the similarity or dissimilarity between any object and the cluster centers.  The main goal of clustering is to label the students' sequences. (In the figure, the points represent the students' sequences.) Page  35
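A minimal sketch of the distance measure and the assignment step, with one assumption: since click sequences have no arithmetic mean, representative sequences (medoid-style centers) stand in here for the K-means centroids.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two click sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def assign_clusters(sequences, centers):
    """Label every student sequence with the index of its closest center."""
    return [min(range(len(centers)),
                key=lambda k: edit_distance(seq, centers[k]))
            for seq in sequences]
```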
• 36. Pattern discovery  The clustered sequences are used as input to the pattern discovery algorithm.  We applied Generalized Sequential Patterns (GSP) to extract the patterns from each cluster.  GSP not only discovers the frequent pattern sequences but also preserves the order of their items.  The output of GSP is the top ten patterns per cluster.  These patterns are used later in the classification step. Page  36
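The following is only the support-counting core that GSP relies on, not the full candidate-generation algorithm: a pattern is counted for a session if its items occur in the session's click sequence in the same order.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sessions):
    # fraction of the cluster's sessions that contain the pattern in order
    return sum(is_subsequence(pattern, s) for s in sessions) / len(sessions)
```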
• 37. Classification  The output data of the clustering step is used as input to the classification models.  The total session time, the number of controls and the clickstream sequence are used as the three features of our classification models.  The classification models are trained on these features and data.  We use two classifiers, Naive Bayes and Support Vector Machines.  After the training phase, our classifiers are able to classify new clients into one of the two, three, or four classes. Page  37
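A minimal sketch of this training step, assuming scikit-learn and assuming the clickstream sequence is encoded simply by its length (the slides do not say how the sequence itself is turned into a numeric feature):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def to_features(session):
    # total session time, number of controls, and a simple sequence encoding
    return [session["total_time"], session["n_controls"], len(session["clicks"])]

def train_models(sessions, cluster_labels):
    X = [to_features(s) for s in sessions]
    nb = GaussianNB().fit(X, cluster_labels)
    svm = SVC(kernel="rbf").fit(X, cluster_labels)
    return nb, svm  # either model can then classify a new client's session
```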
• 38. E-commerce system  In the second case study, an E-commerce web application is built from scratch.  We integrate our framework with it.  Our E-commerce system offers two categories of products, cameras and mobiles.  The main goal of this web application is to show that similar clients can be classified easily and directly.  Each product has seven features. Page  38
• 39. Snapshot of the E-commerce system for mobiles Page  39
• 40. Snapshot of the E-commerce system for cameras Page  40
• 41. Data Collection  As a source of data, we depend on three sources: • Students from JUST University. • Students from Heinrich-Heine University of Duesseldorf (Germany). • Social network websites (Facebook, Myspace, etc.).  We record the events.  The events are merged in the time-based mode.  In the time-based mode, the times spent over any cell within a specific user session are aggregated.  Based on our database statistics, 58 clients bought cameras and 54 clients bought mobiles. Page  41
  • 42. Snapshot of merged data in time-based mode Page  42
• 43. Data Preprocessing  The total session time and the number of visited features are used as two thresholds.  Based on our experiments, we set the total session time threshold to 20 and the number of visited features to 7.  Based on these thresholds: – For the cameras data, 40 client transactions are pruned, and the remaining client transactions are 18. – For the mobiles data, 35 client transactions are pruned, and the remaining client transactions are 20. Page  43
• 44. Classification  In the time-based data mode, the classification models can be applied directly to the preprocessed data.  Each client transaction is labeled by the buy-product button (e.g., a client who bought camera #1).  The aggregated times spent over the 28 features (4 products × 7 features) are used as the main features.  Our classification models are trained on the preprocessed time-based data.  We use three classifiers: Naive Bayes, Support Vector Machines and Decision Tree (C4.5 algorithm). Page  44
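A minimal sketch of how the 28-dimensional time vector (4 products × 7 features) could be built from the merged time-based data; the product ids and feature names below are hypothetical placeholders, not the ones used in the study:

```python
PRODUCTS = ["camera_1", "camera_2", "mobile_1", "mobile_2"]                     # placeholder ids
FEATURES = ["price", "zoom", "weight", "memory", "screen", "battery", "brand"]  # placeholder names

def to_vector(aggregated_times):
    # aggregated_times maps "<product>/<feature>" to the seconds spent on that
    # cell; cells that were never visited contribute 0
    return [aggregated_times.get(f"{p}/{f}", 0.0)
            for p in PRODUCTS for f in FEATURES]  # 4 x 7 = 28 values
```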
• 45. E-survey  In the third case study, an E-survey web application is built from scratch.  We integrate our framework with it.  E-survey is a simple web application which allows students to assess lecturers through both multiple-choice and essay questions.  The main goal of E-survey is to understand students' attitude and behavior.  The E-survey Webpage consists of twelve questions (eleven multiple-choice questions and one essay question).  Each multiple-choice question consists of four options (cannot do it at all, weak, good and very good). Page  45
• 47. Data Collection  As a source of data, we depend on three sources: • Students from the Yarmook Accounting class. • Students from the Jadara Computer Skills class. • Students from the Philadelphia Design class.  We record the events.  The events are merged in the time-based mode.  In the time-based mode, the times spent over any question within a specific user session are aggregated.  Based on our database statistics, 101 students assessed their lecturers. – 37 students from Yarmook University, 38 students from Philadelphia University and 26 students from Jadara University. Page  47
• 48. Data Preprocessing  The total session time and the number of visited questions are used as two thresholds.  Based on our experiments, we set the total session time threshold to 25 and the number of visited questions to 12.  Based on these thresholds, 11 student transactions are discarded from the student database. – The remaining transactions are 90. Page  48
  • 49. Snapshot of preprocessed data Page  49
• 50. Classification  The aggregated times spent over the 12 questions are used as the 12 main features.  In E-survey, the recorded transactions are not labeled directly.  Labeling is done by a flag question.  Our classification models are trained on the preprocessed time-based data.  We use three classifiers: Naive Bayes, Support Vector Machines and Decision Tree (C4.5 algorithm). Page  50
• 51. The student's data model (exponential): histogram of question frequency (number of questions) versus time in seconds. Page  51
• 52. Evaluation  For evaluation purposes, we use three well-known measures that are widely used in information retrieval: 1. Precision, 2. Recall, 3. F-measure.  The False Positive (FP) and False Negative (FN) measures are used to evaluate the errors of the classification models.  For testing purposes, the classifiers are tested in two modes: – The training dataset method. – The 5-folds cross-validation method.  The training dataset method uses the same dataset for both training and testing.  The 5-folds cross-validation method divides the dataset into subsets; one subset is used for testing and the remaining subsets for training. Page  52
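The three measures follow their standard information-retrieval definitions (TP, FP, FN denote true positives, false positives and false negatives):

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F\text{-measure}   = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                          {\mathrm{Precision} + \mathrm{Recall}}
\]
```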
• 53. 5-folds cross-validation method (green = training subsets, red = testing subset). Page  53
• 54. Results-TinyMCE: The Precision, Recall and F-Measure values for NB and DT with 2, 3 and 4 clusters, using 5-folds cross-validation. Page  54
• 55. Results-TinyMCE: The False Positive and False Negative values for NB and DT with 2, 3 and 4 clusters, using 5-folds cross-validation. Page  55
• 56. Results E-Survey: Precision, Recall and F-Measure for DT, Naïve Bayes and SVM, using the training dataset and using 5-folds cross-validation. Page  56
• 57. Results E-Survey: False Negative and False Positive values for DT, Naïve Bayes and SVM, using the training dataset and using 5-folds cross-validation. Page  57
• 58. Conclusions  Clients' data is very useful.  Clients' data is flexible enough to be mined.  Clients' data can take multiple forms.  Clustering should be used to label unlabeled client transactions.  Classification is very practical on clients' data.  Our complete framework will help to improve the client experience.  Our classification models show the ability to classify with a high accuracy rate. Page  58
• 59. Future Work  We are looking forward to dealing with more client data, such as x, y coordinates.  We are looking to develop new clustering and classification techniques that can deal efficiently with clients' data.  We will extract more knowledge from clients' data. Page  59
• 61. Results for E-commerce cameras: Precision, Recall and F-Measure, and the FN and FP values, for DT, Naïve Bayes and SVM. Page  61
• 62. Snapshot of the generated tree from the decision tree model for the cameras category Page  62
• 63. Results for E-commerce mobiles: Precision, Recall and F-Measure, and the FN and FP values, for DT, Naïve Bayes and SVM. Page  63
• 64. Snapshot of the generated tree from the decision tree model for the mobiles category Page  64
  • 65. Web applications links  http://web-engineering.orgfree.com/  http://easyshoping.orgfree.com/  http://questions.orgfree.com/ Page  65
• 66. Machine learning Algorithms  Naïve Bayes is a probabilistic model based on Bayes' theorem: $\Pr(C \mid F) = \dfrac{\Pr(F \mid C)\,\Pr(C)}{\Pr(F)}$ Page  66
• 67. Machine learning Algorithms  C4.5 is a supervised machine learning algorithm that was originally developed from the ID3 algorithm.  C4.5 generates decision trees from a set of training data based on the concept of information entropy. Page  67
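The standard entropy and information-gain definitions behind this splitting criterion (C4.5 itself normalizes the gain into a gain ratio, a detail omitted here):

```latex
\[
H(S) = -\sum_{c} p_c \log_2 p_c, \qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]
```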
• 68. Machine learning Algorithms  SVM is a supervised machine learning algorithm.  The main idea is to find a separator, called a hyperplane, that divides the n-dimensional data completely into its two (or more) classes. Page  68
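In the standard (hard-margin, linear) formulation, the hyperplane, the decision rule, and the margin being maximized are:

```latex
\[
\mathbf{w} \cdot \mathbf{x} + b = 0, \qquad
f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad
\max_{\mathbf{w},\, b} \; \frac{2}{\lVert \mathbf{w} \rVert}
\ \text{ subject to } \ y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1
\]
```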