MINING CLIENT SIDE PARADATA FOR
      ADAPTIVE WEBPAGES
                       By
           Rami Shawkat Hatem Al-Salman


                     Advisor
Dr. Natheer Khasawneh


                    Co-Advisor
              Dr. Ahmad Al-Hammouri
Page  1
Contents


• Introduction.
• Server logs data.
• Clients data.
• Framework for collecting and mining client side data.
• Three case studies.
• Results and Discussions.
• Conclusions.
• Future Work.




Page  2
Introduction


• In recent years, a large number of websites have been published.

• Current web applications aim to interact with users through rich and dynamic content.

• In recent years, JavaScript has developed to interact not only with the client side but also with the server side; thus, Asynchronous JavaScript and XML (AJAX) was introduced.

• Web personalization is applied by many websites.




Page  3
Web personalization


• Web personalization aims to adapt a website to each user's specific environment, needs, and domain.

• Many websites use a recommender system to support web personalization.

• Webpages are personalized based on client preferences (e.g., interests, country, gender, etc.).




Page  4
AMAZON & Web personalization


• Amazon uses a recommender system that relies on the collaborative filtering technique to produce personal recommendations.

• Personal (client) recommendations are generated by computing the similarity between the client's preferences and those of other clients.

• The collaborative filtering technique consists of three steps:
  • Record the preferences of a group of clients.
  • Choose the group of clients whose preferences are similar to the target client, using a similarity metric.
  • Recommend options (i.e., products) to the target client.



Page  5
AMAZON as a real example




[Screenshot annotations: recommendations based on browsing history; recommendations based on the preferences of people with a similar profile]


Page  6
AMAZON as a real example




[Screenshot annotation: recommendations based on the most recently viewed items]
Page  7
Server logs data


• A server log is a log file that contains vectors of data recorded by the web server.

• Analyzing server logs can help in understanding clients' behavior (e.g., the most and least visited pages).

Example server log entry:
  IP-Address:  178.77.146.157
  date:        [03/Jan/2011:15:20:06 -0800]
  request:     "GET/default.ASPX HTTP/1.0"
  status:      200
  bytes:       8788
  referrer:    http://www.just.edu.jo
  agent:       "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"




Page  8
Apache server access.log




Page  9
Clients data


• Clients data is data recorded from the client's navigation over the elements of the visited webpage.

• Clients data can record the interactions between clients and the elements of the visited webpage, for example the name, the value, and the time spent on a specific webpage element.

Example clients data entry:
  Element name:   DIV1
  Element value:  Yes
  Spent time:     156.77 seconds
  IP-Address:     178.77.146.157
  date:           [03/Jan/2011:15:20:06 -0800]
  request:        "GET/default.ASPX HTTP/1.0"
  status:         200
  bytes:          8788
  referrer:       http://www.just.edu.jo
  agent:          "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)"




Page  10
Clients data example




Page  11
Problem statement


• Most previous studies have worked on server logs data.

• These studies used Web Usage Mining (WUM) techniques to extract knowledge from that data.

• Some tools and systems have been proposed for tracking clients data.

• The previous studies related to clients data have not shown the usefulness of clients data.

• Unfortunately, until now there has been no complete framework that can record and mine clients log data.
Page  12
Motivations


• Some entries can be extracted from the client's mouse movements over the visited webpage.

• Extracting useful knowledge from clients data helps in understanding clients' behaviors and attitudes in a better way.

• It also allows supporting clients with appropriate recommendations.

• Understanding clients' behaviors and needs improves the advertisement of products on the WWW.




Page  13
Contributions


• Until now there has been no complete framework that can record and mine clients data.
• Thus, the main contribution of this thesis is to build a complete framework that can record clients' events and apply WUM techniques to this data.
  • We mainly show the usefulness of the clients data.
• We customize the clients data and then apply WUM techniques to it.
• We build three different web applications and integrate our framework with them.
• We build a recommendation engine that is able to discover the clients' patterns.
• We extract useful information from the clients data.
  • We generate a clients data model based on clients data statistics.
Page  14
Framework for collecting and mining client side data


• We propose a framework to record and mine client-side data.
• Our framework consists of five phases:
  1. Session identification.
  2. Events identification and catching.
  3. Events storing.
  4. Merging and exporting events.
  5. Web mining.

Page  15
Framework for collecting and mining client side data




Page  16
Session identification


• Once a client requests a webpage, a session id is assigned to him.

• The session id is the number of milliseconds since midnight, Jan 1, 1970; in this way the session id assigned to each client is unique (a small sketch follows below).

• The generated session id is used to identify all recorded events that belong to the same user.

• The client's session is finished by a target button or link.




Page  17
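The id scheme can be sketched in a few lines. A minimal Python illustration follows; note that the thesis computes this value in client-side JavaScript (Date.getTime()), and the record layout and function names here are hypothetical:

```python
import time

def new_session_id() -> int:
    """Return a session id as milliseconds since midnight Jan 1, 1970 (UTC).

    Illustrative only: the thesis generates this value in the browser with
    JavaScript's Date.getTime(); this is a server-side equivalent.
    """
    return int(time.time() * 1000)

def tag_event(session_id: int, name: str, value: str) -> dict:
    """Attach the session id to a recorded event so that all events of one
    client visit can later be grouped together (hypothetical record layout)."""
    return {"session_id": session_id, "name": name, "value": value,
            "date": time.strftime("%d/%b/%Y:%H:%M:%S")}

sid = new_session_id()
print(tag_event(sid, "DIV1", "Yes"))
```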
Events identification and recording


• We identify web elements and their associated events.

• The clients data, together with the session id, is transferred via an XmlHttpRequest AJAX call.

• With AJAX, transferring the data is a lightweight operation (clients do not notice while the data is transferred to the server).

• Seven values are recorded: name, value, item time, session id, date, total mouse clicks, and Personalized.

• Personalized represents the web element that finishes the session.
Page  18
Cont, Events identification and recording


• Our events are classified into two categories:
  • Clickstream-based.
  • Time-based.

• In the clickstream-based category, the name and value of the clicked element are transferred.

• In the time-based category, the name, the value, and the time spent on the web element are transferred.




Page  19
Snapshot of clickstream-based data (Events storing)




Page  20
Snapshot of time-based data (Events storing)




Page  21
Merging and Exporting data


• The records are grouped per client session (session id).
• Our merging algorithm works as follows (see the sketch below):
  1. Load the list of session ids.
  2. For each session id:
     i.  If the data is clickstream-based, accumulate the sequence of clicks.
     ii. If the data is time-based, accumulate the time spent on each element.

• The merged data is exported to another database table.
• The output of this phase is the input of the web mining phase.



Page  22
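A minimal Python sketch of this merging step, assuming each raw record is a dictionary with a session id, a mode flag, an element name, and, for time-based records, the spent time (the field names are illustrative, not the thesis schema):

```python
from collections import defaultdict

def merge_events(records):
    """Group raw client-side records by session id and merge them.

    Clickstream-based records are accumulated into an ordered sequence of
    clicked element names; time-based records are summed per element.
    The record layout (dict keys) is a hypothetical illustration.
    """
    clickstreams = defaultdict(list)                        # session_id -> [element, ...]
    spent_time = defaultdict(lambda: defaultdict(float))    # session_id -> element -> seconds

    for rec in records:
        sid = rec["session_id"]
        if rec["mode"] == "clickstream":
            clickstreams[sid].append(rec["name"])
        elif rec["mode"] == "time":
            spent_time[sid][rec["name"]] += rec["spent_time"]

    return clickstreams, spent_time

records = [
    {"session_id": 1, "mode": "clickstream", "name": "bold"},
    {"session_id": 1, "mode": "clickstream", "name": "italic"},
    {"session_id": 2, "mode": "time", "name": "DIV1", "spent_time": 12.5},
    {"session_id": 2, "mode": "time", "name": "DIV1", "spent_time": 4.0},
]
clicks, times = merge_events(records)
print(dict(clicks))                                 # {1: ['bold', 'italic']}
print({s: dict(t) for s, t in times.items()})       # {2: {'DIV1': 16.5}}
```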
Snapshot of merging data in clickstream-based




Page  23
Snapshot of merging data in time-based




Page  24
Web Mining


• As in every data mining task, the process of Web Usage Mining consists of three steps:
  • Data preprocessing.
  • Pattern discovery and web mining.
  • Information and pattern analysis.




Page  25
Data preprocessing


• Preprocessing (data cleaning) aims to remove irrelevant data and keep the consistent data.

• The preprocessing is performed based on thresholds (a sketch follows below).

• We mainly use two thresholds:
  – The total session time.
  – The total number of visited elements.




Page  26
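A minimal sketch of this threshold-based cleaning step, assuming each merged session carries its total time and the list of elements it visited (field names and the default threshold values are illustrative only):

```python
def preprocess(sessions, min_elements=10, min_total_time=200.0):
    """Keep only the sessions that pass both thresholds.

    `sessions` maps a session id to a dict with 'total_time' (seconds) and
    'elements' (list of visited element names); the layout is hypothetical.
    """
    kept, removed = {}, {}
    for sid, s in sessions.items():
        if len(s["elements"]) >= min_elements and s["total_time"] >= min_total_time:
            kept[sid] = s
        else:
            removed[sid] = s
    return kept, removed

sessions = {
    1: {"total_time": 350.0, "elements": ["bold", "italic"] * 6},   # passes both thresholds
    2: {"total_time": 90.0,  "elements": ["bold"]},                 # pruned
}
kept, removed = preprocess(sessions)
print(sorted(kept), sorted(removed))   # [1] [2]
```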
Pattern discovery and web mining




Page  27
Information and Pattern analysis


• Most of the time, analyzing the generated patterns and information allows us to understand clients' behavior more deeply.

• The output of this step can be formulated in many forms.

• One of the most important forms is a generated model, which is usually extracted from the statistics (e.g., frequencies).




Page  28
Three case studies


• To validate the proposed framework, we integrated it with three different web applications.
• The three web applications are:
  1. A web-based editor control (TinyMCE).
  2. An e-commerce web application.
  3. An e-survey web application.
• The three web applications are hosted online.




Page  29
TinyMCE


• TinyMCE is a platform-independent, web-based JavaScript HTML editor control.
• We modified the TinyMCE source code to integrate the proposed framework with it.
• The events of TinyMCE belong to the general (clickstream-based) data mode.
• We applied data mining to cluster the clients' sequences and discover their patterns.
• Finally, we classify the clustered output.




Page  30
Snapshot of TinyMCE




Page  31
Data Collection


• As a source of data, 60 students from JUST in the CPE 411 and CPE 311 classes were asked to use our system.

• We asked the students to use TinyMCE to write an advertisement about JUST that encourages students from European Union (EU) countries to study at JUST.
• The click events are recorded.

• The events are merged in the general data mode.

• The merged data is the input for the data preprocessing step.


Page  32
Snapshot of merged data




Page  33
Data Preprocessing


• The collected data was preprocessed by removing invalid sequences.

• The invalid sequences were determined based on two thresholds:
  1. The number of clicked controls.
  2. The total session time spent in the sequence.
• Heuristically, we used 10 clicks as the first threshold and 200 seconds as the second threshold.

• The data preprocessing step reduces the total number of sequences to 36 (24 sequences are removed).




Page  34
Clustering


• We separated the students' sequences into clusters of similar clickstream sequences.
• We applied the K-means clustering technique with heuristic numbers of clusters equal to two, three, and four.
• We used edit distance as the distance measure to calculate the similarity (or dissimilarity) between any two objects close to the mean point (a sketch follows below).
• The main goal of clustering is to label the students' sequences.

[Figure: the points represent the students' sequences]

Page  35
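A minimal sketch of the edit (Levenshtein) distance between two clickstream sequences, which is the dissimilarity value the clustering step relies on; this is the standard textbook algorithm, not the thesis code:

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two clickstream sequences.

    Counts the minimum number of insertions, deletions, and substitutions of
    clicked elements needed to turn one sequence into the other; K-means (or a
    medoid-based variant) can use this value as the dissimilarity between two
    students' sequences.
    """
    m, n = len(seq_a), len(seq_b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

print(edit_distance(["bold", "italic", "undo"], ["bold", "undo"]))  # 1
```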
Pattern discovery


• The clustered sequences are used as input to the pattern discovery algorithm.
• We applied the Generalized Sequential Pattern (GSP) algorithm to extract the patterns from each cluster.
• GSP not only discovers the pattern sequences but also preserves the order of these patterns.
• The output of GSP is the top ten patterns of each cluster.
• These patterns will be assigned later in the classification step.




Page  36
Classification


• The output of the clustering step was used as input to the classification models.

• The total session time, the number of controls, and the clickstream sequence are used as the three features of our classification models.

• The classification models are trained on these features and data.

• We use two classifiers: Naive Bayes and Support Vector Machines.

• After the training phase, our classifiers were able to classify new clients into one of the two, three, or four classes.
Page  37
E-commerce system


• In the second case study, an e-commerce web application was built from scratch.
• We integrated our framework with it.
• Our e-commerce system offers two categories of products: cameras and mobiles.
• The main goal of this web application is to show that the classification of similar clients can be done easily and directly.
• Each product has seven features.




Page  38
Snapshot of the E-commerce system for mobiles




Page  39
Snapshot of the E-commerce system for cameras




Page  40
Data Collection


• As sources of data we depend on three groups:
  • Students from JUST University.
  • Students from Heinrich-Heine University of Duesseldorf (Germany).
  • Social network websites (Facebook, Myspace, etc.).
• We record the events.
• The events are merged in the time-based mode.
• In the time-based mode, the times spent over any cell within a specific user session are aggregated.
• Based on our database statistics, 58 clients bought cameras and 54 clients bought mobiles.




Page  41
Snapshot of merged data in time-based mode




Page  42
Data Preprocessing


• The total session time and the number of visited features are used as the two thresholds.
• Based on our experiments, we set the total session time threshold to 20 and the number of visited features to 7.
• Based on these thresholds:
  – For the cameras data, 40 client transactions are pruned and 18 remain.
  – For the mobiles data, 35 client transactions are pruned and 20 remain.




Page  43
Classification


• In the time-based data mode, the classification models can be applied directly to the preprocessed data.
• Each client transaction is labeled by the buy-product button (e.g., a client who bought camera #1).
• The aggregated times spent over the 28 features (4 products × 7 features) are used as the main features.
• Our classification models are trained on the preprocessed time-based data (see the sketch below).
• We use three classifiers: Naive Bayes, Support Vector Machines, and Decision Tree (C4.5 algorithm).




Page  44
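A minimal scikit-learn sketch of this step, assuming the preprocessed time-based data is already a numeric matrix of 28 aggregated-time features per client with the bought product as the label. The data below is a random placeholder (not the thesis dataset), and scikit-learn's DecisionTreeClassifier is a CART-style stand-in for C4.5:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART; stands in for C4.5 here

rng = np.random.default_rng(0)
X = rng.random((38, 28))           # 38 clients x 28 aggregated-time features (placeholder)
y = rng.integers(0, 4, size=38)    # label: which of the 4 products was bought (placeholder)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```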
E-survey


• In the third case study, an e-survey web application was built from scratch.
• We integrated our framework with it.
• The e-survey is a simple web application that allows students to assess lecturers through both multiple-choice and essay questions.
• The main goal of the e-survey is to understand students' attitudes and behavior.
• The e-survey webpage consists of twelve questions (eleven multiple-choice questions and one essay question).
• Each multiple-choice question has four options (cannot do it at all, weak, good, and very good).




Page  45
Snapshot of E-Survey




Page  46
Data Collection


• As sources of data we depend on three groups:
  • Students from the Yarmook accounting class.
  • Students from the Jadara computer skills class.
  • Students from the Philadelphia design class.
• We record the events.
• The events are merged in the time-based mode.
• In the time-based mode, the times spent over any question within a specific user session are aggregated.
• Based on our database statistics, 101 students assessed their lecturers:
  – 37 students from Yarmook University, 38 students from Philadelphia University, and 26 students from Jadara University.



Page  47
Data Preprocessing


• The total session time and the number of visited questions are used as the two thresholds.
• Based on our experiments, we set the total session time threshold to 25 and the number of visited questions to 12.
• Based on these thresholds, 11 student transactions are discarded from the student database.
  – The remaining transactions are 90.




Page  48
Snapshot of preprocessed data




Page  49
Classification


• The aggregated times spent over the 12 questions are used as the main 12 features.
• In the e-survey, the recorded transactions are not labeled directly.
• Labeling is done through a flag question.
• Our classification models are trained on the preprocessed time-based data.
• We use three classifiers: Naive Bayes, Support Vector Machines, and Decision Tree (C4.5 algorithm).




Page  50
The student's data model (exponential)

[Figure: histogram of question frequencies ("Questions-Freq"); x-axis: time in seconds (1 to 58), y-axis: number of questions (0 to 450); the distribution decays approximately exponentially.]


Page  51
Evaluation


• For evaluation purposes, we use three well-known measures from the information retrieval field: 1. Precision, 2. Recall, 3. F-measure.

• The False Positive (FP) and False Negative (FN) measures are used to evaluate the errors of the classification models.
• For testing purposes, the classifiers are tested in two modes (a sketch follows below):
  – The training dataset method.
  – The 5-fold cross-validation method.
• The training dataset method uses the dataset for both training and testing.
• The 5-fold cross-validation method divides the dataset into subsets; one of them is used for testing and the remaining subsets for training.


Page  52
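A minimal scikit-learn sketch of the two test modes and the three measures. The data is a random placeholder, only Naive Bayes is shown for brevity, and the macro-averaged precision, recall, and F-measure used here may differ from the averaging used in the thesis:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = rng.random((90, 12))           # 90 students x 12 per-question time features (placeholder)
y = rng.integers(0, 2, size=90)    # class label from the flag question (placeholder)

# Training-dataset mode: train and test on the same data.
clf = GaussianNB()
clf.fit(X, y)
pred_train = clf.predict(X)

# 5-fold cross-validation mode: each sample is predicted by a model
# trained on the other four folds.
pred_cv = cross_val_predict(GaussianNB(), X, y, cv=5)

for label, pred in [("training set", pred_train), ("5-fold CV", pred_cv)]:
    print(label,
          "P=%.2f" % precision_score(y, pred, average="macro"),
          "R=%.2f" % recall_score(y, pred, average="macro"),
          "F=%.2f" % f1_score(y, pred, average="macro"))
```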
5 folds cross-validation method



[Figure: the five folds; in each round, four folds (green) are used as training subsets and one fold (red) as the testing subset.]




Page  53
Results-TinyMCE



[Chart: Precision, Recall, and F-Measure values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.]

 Page  54
Results-TinyMCE


[Chart: False Positive and False Negative values for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.]

Page  55
Results E-Survey



[Chart: Precision, Recall, and F-Measure values for DT, Naive Bayes, and SVM, using the training dataset and using 5-fold cross-validation.]



Page  56
Results E-Survey



[Chart: False Negative and False Positive values for DT, Naive Bayes, and SVM, using the training dataset and using 5-fold cross-validation.]



 Page  57
Conclusions


• Clients data is very useful.
• Clients data is flexible to mine.
• Clients data can take multiple forms.
• Clustering should be used to label unlabeled client transactions.
• Classification is very practical on clients data.
• Our complete framework will help to improve clients' experiences.
• Our classification models show the ability to classify with a high accuracy rate.




Page  58
Future Work


• We look forward to dealing with more clients data, such as x,y coordinates.

• We plan to develop new clustering and classification techniques that can deal efficiently with clients data.

• We will extract more knowledge from clients data.




Page  59
Thank You

Page  60
Results for E-commerce cameras

[Charts: Precision, Recall, and F-Measure values (top) and FN and FP values (bottom) for DT, Naive Bayes, and SVM on the cameras data.]


Page  61
Snapshot of the generated decision tree model for the cameras category




Page  62
Results for E-commerce mobiles

[Charts: Precision, Recall, and F-Measure values (top) and FN and FP values (bottom) for DT, Naive Bayes, and SVM on the mobiles data.]



Page  63
Snapshot of the generated decision tree model for the mobiles category




Page  64
Web applications links


 http://web-engineering.orgfree.com/
 http://easyshoping.orgfree.com/
 http://questions.orgfree.com/




Page  65
Machine learning Algorithms


• Naïve Bayes is a probabilistic model based on Bayes' theorem.




\[ \Pr(C \mid F) \;=\; \frac{\Pr(F \mid C)\,\Pr(C)}{\Pr(F)} \]
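As a small worked instance with hypothetical numbers (not taken from the thesis data): if feature value F is observed in 60% of the sessions of class C, class C covers 30% of all sessions, and F appears in 40% of all sessions, then

\[ \Pr(C \mid F) \;=\; \frac{0.6 \times 0.3}{0.4} \;=\; 0.45 . \]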




Page  66
Machine learning Algorithms


• C4.5 is a supervised machine learning algorithm that was developed from the earlier ID3 algorithm.
• C4.5 generates decision trees from a set of training data based on the concept of information entropy.




Page  67
Machine learning Algorithms


• SVM is a supervised machine learning algorithm. The main idea is to find a separating line called a hyperplane.

• The hyperplane separates the n-dimensional data completely into its two (or more) classes.




Page  68

Weitere ähnliche Inhalte

Was ist angesagt?

Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementMongoDB
 
CouchDB : More Couch
CouchDB : More CouchCouchDB : More Couch
CouchDB : More Couchdelagoya
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
Content Mangement Systems and MongoDB
Content Mangement Systems and MongoDBContent Mangement Systems and MongoDB
Content Mangement Systems and MongoDBMitch Pirtle
 
Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementMongoDB
 
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerIBM Cloud Data Services
 
Tuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesTuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesKeshav Murthy
 
User Data Management with MongoDB
User Data Management with MongoDB User Data Management with MongoDB
User Data Management with MongoDB MongoDB
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbDimgba Kalu
 
Approaches to mobile site development
Approaches to mobile site developmentApproaches to mobile site development
Approaches to mobile site developmentErik Mitchell
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategyIBM Sverige
 
Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8goodfriday
 
CData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaCData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaJerod Johnson
 
Open analytics | Cameron Sim
Open analytics | Cameron SimOpen analytics | Cameron Sim
Open analytics | Cameron SimOpen Analytics
 
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Mahmoud Hamed Mahmoud
 

Was ist angesagt? (19)

Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content Management
 
Spsl v unit - final
Spsl v unit - finalSpsl v unit - final
Spsl v unit - final
 
CouchDB : More Couch
CouchDB : More CouchCouchDB : More Couch
CouchDB : More Couch
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
Content Mangement Systems and MongoDB
Content Mangement Systems and MongoDBContent Mangement Systems and MongoDB
Content Mangement Systems and MongoDB
 
Webinar: MongoDB for Content Management
Webinar: MongoDB for Content ManagementWebinar: MongoDB for Content Management
Webinar: MongoDB for Content Management
 
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
MongoDB Europe 2016 - Who’s Helping Themselves To Your Data? Demystifying Mon...
 
Webinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data LayerWebinar: The Anatomy of the Cloudant Data Layer
Webinar: The Anatomy of the Cloudant Data Layer
 
Tuning for Performance: indexes & Queries
Tuning for Performance: indexes & QueriesTuning for Performance: indexes & Queries
Tuning for Performance: indexes & Queries
 
User Data Management with MongoDB
User Data Management with MongoDB User Data Management with MongoDB
User Data Management with MongoDB
 
Data persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdbData persistence using pouchdb and couchdb
Data persistence using pouchdb and couchdb
 
Approaches to mobile site development
Approaches to mobile site developmentApproaches to mobile site development
Approaches to mobile site development
 
Caching in asp.net
Caching in asp.netCaching in asp.net
Caching in asp.net
 
Key note big data analytics ecosystem strategy
Key note   big data analytics ecosystem strategyKey note   big data analytics ecosystem strategy
Key note big data analytics ecosystem strategy
 
Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8Integrating Your Site With Internet Explorer 8
Integrating Your Site With Internet Explorer 8
 
CData Data Today: A Developer's Dilemma
CData Data Today: A Developer's DilemmaCData Data Today: A Developer's Dilemma
CData Data Today: A Developer's Dilemma
 
Asp.net
Asp.netAsp.net
Asp.net
 
Open analytics | Cameron Sim
Open analytics | Cameron SimOpen analytics | Cameron Sim
Open analytics | Cameron Sim
 
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
Exam 70-489 Developing Microsoft SharePoint Server 2013 Advanced Solutions Le...
 

Andere mochten auch

المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Hallgrímur pétursson
Hallgrímur pétursson Hallgrímur pétursson
Hallgrímur pétursson odinnthor
 
Austur evropa
Austur evropaAustur evropa
Austur evropaodinnthor
 
Infromatika
InfromatikaInfromatika
Infromatikaainhoa05
 
Ordenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaOrdenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaainhoa05
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Austur evropa
Austur evropaAustur evropa
Austur evropaodinnthor
 
الشفافيات التعليمية
الشفافيات التعليميةالشفافيات التعليمية
الشفافيات التعليميةtech5101
 
PIC microcontroller
PIC microcontroller PIC microcontroller
PIC microcontroller Rami Alsalman
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)tech5101
 
Ordenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuOrdenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuainhoa05
 

Andere mochten auch (17)

المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Hallgrímur pétursson
Hallgrímur pétursson Hallgrímur pétursson
Hallgrímur pétursson
 
Austur evropa
Austur evropaAustur evropa
Austur evropa
 
Broadband strategy
Broadband strategyBroadband strategy
Broadband strategy
 
Infromatika
InfromatikaInfromatika
Infromatika
 
Ordenagailuaren desmuntaketa
Ordenagailuaren desmuntaketaOrdenagailuaren desmuntaketa
Ordenagailuaren desmuntaketa
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Austur evropa
Austur evropaAustur evropa
Austur evropa
 
Grid computing
Grid computingGrid computing
Grid computing
 
CSR
CSRCSR
CSR
 
Hekla
HeklaHekla
Hekla
 
Nasdse
NasdseNasdse
Nasdse
 
الشفافيات التعليمية
الشفافيات التعليميةالشفافيات التعليمية
الشفافيات التعليمية
 
PIC microcontroller
PIC microcontroller PIC microcontroller
PIC microcontroller
 
المدونات (Web log)
المدونات (Web log)المدونات (Web log)
المدونات (Web log)
 
Ordenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsuOrdenagailuko 5 zati garrantsitsu
Ordenagailuko 5 zati garrantsitsu
 
Airplanes
AirplanesAirplanes
Airplanes
 

Ähnlich wie Web Mining

Clickstream Analysis
Clickstream AnalysisClickstream Analysis
Clickstream Analysisintuitiv.de
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining Editor IJMTER
 
Web Database
Web DatabaseWeb Database
Web Databaseidroos7
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioINFOGAIN PUBLICATION
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningIJMER
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...IJSRD
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Editor IJCATR
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage dataijfcstjournal
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage MiningDaminda Herath
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage miningDaminda Herath
 
Detective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeDetective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeAmazon Web Services
 

Ähnlich wie Web Mining (20)

C017231726
C017231726C017231726
C017231726
 
Clickstream Analysis
Clickstream AnalysisClickstream Analysis
Clickstream Analysis
 
A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining A Comparative Study of Recommendation System Using Web Usage Mining
A Comparative Study of Recommendation System Using Web Usage Mining
 
Dos1
Dos1Dos1
Dos1
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
Research Paper
Research PaperResearch Paper
Research Paper
 
Web Database
Web DatabaseWeb Database
Web Database
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
 
E017413647
E017413647E017413647
E017413647
 
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web MiningA Novel Method for Data Cleaning and User- Session Identification for Web Mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
 
L017418893
L017418893L017418893
L017418893
 
Web servers
Web serversWeb servers
Web servers
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
 
Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...Identifying the Number of Visitors to improve Website Usability from Educatio...
Identifying the Number of Visitors to improve Website Usability from Educatio...
 
Nadee2018
Nadee2018Nadee2018
Nadee2018
 
Web personalization using clustering of web usage data
Web personalization using clustering of web usage dataWeb personalization using clustering of web usage data
Web personalization using clustering of web usage data
 
Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
 
Personal web usage mining
Personal web usage miningPersonal web usage mining
Personal web usage mining
 
Detective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record ChangeDetective Controls: Gain Visibility and Record Change
Detective Controls: Gain Visibility and Record Change
 

Kürzlich hochgeladen

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Kürzlich hochgeladen (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Web Mining

  • 1. MINING CLIENT SIDE PARADATA FOR ADAPTIVE WEBPAGES By Rami Shawkat Hatem Al-Salman Advisor Dr.Natheer Khasawneh Co-Advisor Dr. Ahmad Al-Hammouri Page  1
  • 2. Contents  Introduction.  Server logs data.  Clients data.  Framework for collecting and mining client side data.  Three case studies.  Results and Discussions.  Conclusions.  Future Work. Page  2
  • 3. Introduction  In the recent years a large number of websites is published.  Current web applications aim to interact with users through rich and dynamic contents.  In the recent years JavaScript has developed to be more interactive not only with a client side but also with the server side, Thus, Asynchronous JavaScript and XML (AJAX) is introduced.  Web personalization is applied by several websites. Page  3
  • 4. Web personalization  Web personalization concerns to support the user’s specific environment related to their needs and domain.  Many websites use recommender system for supporting a web personalization.  Webpage's are personalized based on clients preferences (i.e., interests, country, gender etc…). Page  4
  • 5. AMAZON & Web personalization  AMAZON uses recommender system relay on collaborative filtering technique for producing personal recommendations.  Personal (client) recommendations are generated by computing similarity between client preference and others.  Collaborative filtering technique consists of three steps:  Record the preferences of a group of clients.  Choose group of clients whose preferences are similar to the target client using a similarity metric .  Recommend options (i.e., products) to the target client . Page  5
  • 6. AMAZON as a real example Recommendations based Recommendations based on preferences of people on browsing history with similar profile Page  6
  • 7. AMAZON as a real example Recommendations based on most recent viewed items Page  7
  • 8. Server logs data  server log is a log file that contains Entry name Server Log Info vectors of data which are recorded by web server. IP-Address 178.77.146.157 date [03/Jan/2011:15:20:06 -0800]  The analysis for server logs can help to understanding client’s behavior (i.e., request "GET/default.ASPX HTTP/1.0" the most and least traffic). status 200 bytes 8788 referrer http://www.just.edu.jo agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)" Page  8
  • 10. Clients data  Clients data is a data which is recorded Entry name Client Info based on the client navigation to the Element name DIV1 visited Webpage elements.  Clients data could record the Element value Yes interactions between clients and the Spent time 156.77 seconds elements in the visited Webpage. IP-Address 178.77.146.157  For example: record the name, value and spent time for specific date [03/Jan/2011:15:20:06 -0800] Webpage element. request "GET/default.ASPX HTTP/1.0" status 200 bytes 8788 referrer http://www.just.edu.jo agent "Mozilla/3.0WebTV/1.2 (compatible; MSIE 2.0)" Page  10
  • 12. Problem statement  Most previous studies are investigated by working on server logs data.  The previous studies used Web Usage Mining (WUM) techniques for extracting the knowledge from this data.  Some tools and systems are proposed for tracking clients data.  The previous studies which related to clients data have not shown the usefulness of clients data.  Unfortunately , until now there is no complete framework which could record and mine in the clients logs data. Page  12
  • 13. Motivations  Some entries can be extracted from the client’s mouse movements over the visited Webpage.  Extracting useful knowledge from clients data, will help to understanding clients’ behaviors and attitudes in better way.  Support clients with appropriate recommendations.  The understanding of clients behaviors and needs, will improve the advertisements for products in WWW. Page  13
  • 14. Contributions  Until now there is no complete framework which could record and mine in the clients data.  Thus, the main contribution of this thesis is to building a complete framework that can recode client’s events and apply the WUM techniques on this data .  We mainly show the usefulness of the client’s data. • We customize the client’s data and then we apply WUM techniques on it. • We build three different web applications and then we integrate our framework with their. • We build a recommendation engine which is able to discovering the client’s patterns . • We extract the useful information from the client’s data.  We generate client’s data model based on client’s data statistics. Page  14
  • 15. Framework for collecting and mining client side data  We propose a framework to record and mine client’s side data.  Our framework consists of five phases respectively:  Session identification  Events identification and catching.  Events storing.  Merging and exporting events.  Web mining. Page  15
  • 16. Framework for collecting and mining client side data Page  16
  • 17. Session identification  Once a client requests a webpage, the session id is assigned for him.  The session id presents the number of milliseconds since midnight Jan 1, 1970, by this way the assigned session id for each client is a unique.  The generated session id is used to identify all recorded events which belong to the same user.  The session for the client can be finished by a target button or link. Page  17
  • 18. Events identification and recording  We identify web elements and associated events.  The clients data is transferred associated with session id via XmlHttpRequest AJAX call.  Based on AJAX, the transferring data is a lightweight operation (Clients never feel while data is transferred to server ).  Seven values are recorded: name, value, Item time, session id, Date, Total mouse's clicks and Personalized.  Personalized, represents the web element that finishes the session. Page  18
  • 19. Cont, Events identification and recording  Our events are classified into two categories:  Clickstream-based.  Time based.  In the clickstream-based category, the name and value of clicked element will be transferred.  In the time-based category, the name, the value and the spent time of web element will be transferred. Page  19
  • 20. Snapshot of clickstream-based data (Events storing) Page  20
  • 21. Snapshot of time-based data (Events storing) Page  21
  • 22. Merging and Exporting data  The records are grouped per client session (session id).  Our merging algorithm works as follow: 1. Load a list of session id’s 2. For each session id: i. If the data is clickstream-based then accumulate the sequence of clicks. ii. If the data is time-based then accumulate the spent time over each element.  The merged data is exported to another Database table.  The output this phase will be the input for the web mining phase. Page  22
  • 23. Snapshot of merging data in clickstream-based Page  23
  • 24. Snapshot of merging data in time-based Page  24
  • 25. Web Mining  As in every data mining task, the process of Web Usage Mining consists of three steps: • Data preprocessing. • Pattern discovery and web mining. • Information and Pattern analysis. Page  25
  • 26. Data preprocessing  Preprocessing or data cleaning process is aiming to remove irrelevant data and keeps the consistent data.  The preprocessing is fulfilled based on thresholds.  We mainly use two thresholds: – The total session time. – The total number of visited elements. Page  26
  • 27. Pattern discovery and web mining Page  27
  • 28. Information and Pattern analysis  Most of times, the analysis of the generated patterns and information allows us to understand clients behavior deeply.  The output of this step can be formulated in many forms.  One of the most important forms is a generated model which is usually extracted from the statistics (i.e., frequencies.). Page  28
  • 29. Three case studies  To validate the proposed framework we have integrated the framework with three different web applications.  The three web applications are: 1. Web based editor controls (TinyMCE). 2. E-commerece web application. 3. E-survey web application.  The three web applications are hosted online. Page  29
  • 30. TinyMCE  TinyMCE is a platform independent web based Javascript HTML editor control.  We modified TinyMCE source code to integrate the proposed framework with it.  The events of TinyMCE belong to general data (or clickstream-based data).  We applied data mining to cluster and discover the client’s sequence patterns.  Finally we classify the clustered output. Page  30
  • 32. Data Collection  As a source of data 60 students from JUST in CPE 411 and CPE 311 classes are asked to use our system.  We asked the students to write an advertisement using TinyMCE about JUST to encourage students from Europe Union (EU) countries to study in JUST.  The click events are recorded.  The events are merged in a general data mode.  The merged data will be the input for the data preprocessing step. Page  32
  • 33. Snapshot of merged data Page  33
• 34. Data Preprocessing  The collected data was preprocessed by removing invalid sequences.  The invalid sequences were determined based on two thresholds: 1. The number of clicked controls. 2. The total session time spent on the sequence.  Heuristically, we used 10 clicks as the first threshold and 200 seconds as the second threshold.  The data preprocessing step reduces the total number of sequences to 36 (24 sequences are removed). Page  34
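A minimal sketch of this pruning step with the two thresholds quoted above, assuming each session carries its click sequence and total time in seconds, and assuming sessions below either threshold are the ones treated as invalid (the slides do not state the direction of the cut):

```python
MIN_CLICKS = 10         # first threshold from the slides
MIN_SESSION_TIME = 200  # second threshold, in seconds

def keep_session(session):
    # a session is kept only if it passes both thresholds
    return (len(session["clicks"]) >= MIN_CLICKS
            and session["total_time"] >= MIN_SESSION_TIME)

def preprocess(sessions):
    # prune invalid sequences; in the TinyMCE study this left 36 of 60 sessions
    return [s for s in sessions if keep_session(s)]
```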
• 35. Clustering  We separated the students' sequences into clusters with similar clickstream sequences.  We applied the K-means clustering technique with heuristic numbers of clusters equal to two, three, and four.  We used edit distance as the distance measure to calculate the similarity or dissimilarity between any object and the cluster centers.  The main goal of clustering is to label the students' sequences. (In the figure, the points represent the students' sequences.) Page  35
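A minimal sketch of the distance measure and the assignment step, with one assumption: since click sequences have no arithmetic mean, representative sequences (medoid-style centers) stand in here for the K-means centroids.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two click sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def assign_clusters(sequences, centers):
    """Label every student sequence with the index of its closest center."""
    return [min(range(len(centers)),
                key=lambda k: edit_distance(seq, centers[k]))
            for seq in sequences]
```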
• 36. Pattern discovery  The clustered sequences are used as input to the pattern discovery algorithm.  We applied Generalized Sequential Patterns (GSP) to extract the patterns from each cluster.  GSP not only discovers the frequent pattern sequences but also preserves the order of their items.  The output of GSP is the top ten patterns per cluster.  These patterns are used later in the classification step. Page  36
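The following is only the support-counting core that GSP relies on, not the full candidate-generation algorithm: a pattern is counted for a session if its items occur in the session's click sequence in the same order.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sessions):
    # fraction of the cluster's sessions that contain the pattern in order
    return sum(is_subsequence(pattern, s) for s in sessions) / len(sessions)
```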
• 37. Classification  The output data of the clustering step is used as input to the classification models.  The total session time, the number of controls and the clickstream sequence are used as the three features of our classification models.  The classification models are trained on these features and data.  We use two classifiers, Naive Bayes and Support Vector Machines.  After the training phase, our classifiers are able to classify new clients into one of the two, three, or four classes. Page  37
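A minimal sketch of this training step, assuming scikit-learn and assuming the clickstream sequence is encoded simply by its length (the slides do not say how the sequence itself is turned into a numeric feature):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def to_features(session):
    # total session time, number of controls, and a simple sequence encoding
    return [session["total_time"], session["n_controls"], len(session["clicks"])]

def train_models(sessions, cluster_labels):
    X = [to_features(s) for s in sessions]
    nb = GaussianNB().fit(X, cluster_labels)
    svm = SVC(kernel="rbf").fit(X, cluster_labels)
    return nb, svm  # either model can then classify a new client's session
```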
• 38. E-commerce system  In the second case study, an E-commerce web application is built from scratch.  We integrate our framework with it.  Our E-commerce system offers two categories of products, cameras and mobiles.  The main goal of this web application is to show that similar clients can be classified easily and directly.  Each product has seven features. Page  38
• 39. Snapshot of the E-commerce system for mobiles Page  39
• 40. Snapshot of the E-commerce system for cameras Page  40
• 41. Data Collection  As a source of data, we depend on three sources: • Students from JUST University. • Students from Heinrich-Heine University of Duesseldorf (Germany). • Social network websites (Facebook, Myspace, etc.).  We record the events.  The events are merged in the time-based mode.  In the time-based mode, the times spent over any cell within a specific user session are aggregated.  Based on our database statistics, 58 clients bought cameras and 54 clients bought mobiles. Page  41
  • 42. Snapshot of merged data in time-based mode Page  42
• 43. Data Preprocessing  The total session time and the number of visited features are used as two thresholds.  Based on our experiments, we set the total session time threshold to 20 and the number of visited features to 7.  Based on these thresholds: – For the cameras data, 40 client transactions are pruned, and the remaining client transactions are 18. – For the mobiles data, 35 client transactions are pruned, and the remaining client transactions are 20. Page  43
• 44. Classification  In the time-based data mode, the classification models can be applied directly to the preprocessed data.  Each client transaction is labeled by the buy-product button (e.g., a client who bought camera #1).  The aggregated times spent over the 28 features (4 products × 7 features) are used as the main features.  Our classification models are trained on the preprocessed time-based data.  We use three classifiers: Naive Bayes, Support Vector Machines and Decision Tree (C4.5 algorithm). Page  44
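A minimal sketch of how the 28-dimensional time vector (4 products × 7 features) could be built from the merged time-based data; the product ids and feature names below are hypothetical placeholders, not the ones used in the study:

```python
PRODUCTS = ["camera_1", "camera_2", "mobile_1", "mobile_2"]                     # placeholder ids
FEATURES = ["price", "zoom", "weight", "memory", "screen", "battery", "brand"]  # placeholder names

def to_vector(aggregated_times):
    # aggregated_times maps "<product>/<feature>" to the seconds spent on that
    # cell; cells that were never visited contribute 0
    return [aggregated_times.get(f"{p}/{f}", 0.0)
            for p in PRODUCTS for f in FEATURES]  # 4 x 7 = 28 values
```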
• 45. E-survey  In the third case study, an E-survey web application is built from scratch.  We integrate our framework with it.  E-survey is a simple web application which allows students to assess lecturers through both multiple-choice and essay questions.  The main goal of E-survey is to understand students' attitude and behavior.  The E-survey Webpage consists of twelve questions (eleven multiple-choice questions and one essay question).  Each multiple-choice question consists of four options (cannot do it at all, weak, good and very good). Page  45
• 47. Data Collection  As a source of data, we depend on three sources: • Students from the Yarmook Accounting class. • Students from the Jadara Computer Skills class. • Students from the Philadelphia Design class.  We record the events.  The events are merged in the time-based mode.  In the time-based mode, the times spent over any question within a specific user session are aggregated.  Based on our database statistics, 101 students assessed their lecturers. – 37 students from Yarmook University, 38 students from Philadelphia University and 26 students from Jadara University. Page  47
• 48. Data Preprocessing  The total session time and the number of visited questions are used as two thresholds.  Based on our experiments, we set the total session time threshold to 25 and the number of visited questions to 12.  Based on these thresholds, 11 student transactions are discarded from the student database. – The remaining transactions are 90. Page  48
  • 49. Snapshot of preprocessed data Page  49
• 50. Classification  The aggregated times spent over the 12 questions are used as the 12 main features.  In E-survey, the recorded transactions are not labeled directly.  Labeling is done by a flag question.  Our classification models are trained on the preprocessed time-based data.  We use three classifiers: Naive Bayes, Support Vector Machines and Decision Tree (C4.5 algorithm). Page  50
• 51. The student's data model (exponential): histogram of question frequency (number of questions) versus time in seconds. Page  51
• 52. Evaluation  For evaluation purposes, we use three well-known measures that are widely used in information retrieval: 1. Precision, 2. Recall, 3. F-measure.  The False Positive (FP) and False Negative (FN) measures are used to evaluate the errors of the classification models.  For testing purposes, the classifiers are tested in two modes: – The training dataset method. – The 5-folds cross-validation method.  The training dataset method uses the same dataset for both training and testing.  The 5-folds cross-validation method divides the dataset into subsets; one subset is used for testing and the remaining subsets for training. Page  52
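The three measures follow their standard information-retrieval definitions (TP, FP, FN denote true positives, false positives and false negatives):

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F\text{-measure}   = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                          {\mathrm{Precision} + \mathrm{Recall}}
\]
```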
• 53. 5-folds cross-validation method (green = training subsets, red = testing subset). Page  53
• 54. Results-TinyMCE: The Precision, Recall and F-Measure values for NB and DT with 2, 3 and 4 clusters, using 5-folds cross-validation. Page  54
• 55. Results-TinyMCE: The False Positive and False Negative values for NB and DT with 2, 3 and 4 clusters, using 5-folds cross-validation. Page  55
• 56. Results E-Survey: Precision, Recall and F-Measure for DT, Naïve Bayes and SVM, using the training dataset and using 5-folds cross-validation. Page  56
• 57. Results E-Survey: False Negative and False Positive values for DT, Naïve Bayes and SVM, using the training dataset and using 5-folds cross-validation. Page  57
• 58. Conclusions  Clients' data is very useful.  Clients' data is flexible enough to be mined.  Clients' data can take multiple forms.  Clustering should be used to label unlabeled client transactions.  Classification is very practical on clients' data.  Our complete framework will help to improve the client experience.  Our classification models show the ability to classify with a high accuracy rate. Page  58
• 59. Future Work  We are looking forward to dealing with more client data, such as x, y coordinates.  We are looking to develop new clustering and classification techniques that can deal efficiently with clients' data.  We will extract more knowledge from clients' data. Page  59
• 61. Results for E-commerce cameras: Precision, Recall and F-Measure, and the FN and FP values, for DT, Naïve Bayes and SVM. Page  61
• 62. Snapshot of the generated tree from the decision tree model for the cameras category Page  62
• 63. Results for E-commerce mobiles: Precision, Recall and F-Measure, and the FN and FP values, for DT, Naïve Bayes and SVM. Page  63
• 64. Snapshot of the generated tree from the decision tree model for the mobiles category Page  64
  • 65. Web applications links  http://web-engineering.orgfree.com/  http://easyshoping.orgfree.com/  http://questions.orgfree.com/ Page  65
• 66. Machine learning Algorithms  Naïve Bayes is a probabilistic model based on Bayes' theorem: $\Pr(C \mid F) = \dfrac{\Pr(F \mid C)\,\Pr(C)}{\Pr(F)}$ Page  66
• 67. Machine learning Algorithms  C4.5 is a supervised machine learning algorithm that was originally developed from the ID3 algorithm.  C4.5 generates decision trees from a set of training data based on the concept of information entropy. Page  67
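The standard entropy and information-gain definitions behind this splitting criterion (C4.5 itself normalizes the gain into a gain ratio, a detail omitted here):

```latex
\[
H(S) = -\sum_{c} p_c \log_2 p_c, \qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]
```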
• 68. Machine learning Algorithms  SVM is a supervised machine learning algorithm.  The main idea is to find a separator, called a hyperplane, that divides the n-dimensional data completely into its two (or more) classes. Page  68
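In the standard (hard-margin, linear) formulation, the hyperplane, the decision rule, and the margin being maximized are:

```latex
\[
\mathbf{w} \cdot \mathbf{x} + b = 0, \qquad
f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b), \qquad
\max_{\mathbf{w},\, b} \; \frac{2}{\lVert \mathbf{w} \rVert}
\ \text{ subject to } \ y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1
\]
```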