Preprocessing on Web Log Data for Web Usage Mining
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
Outline:
• Introduction
• Web Log Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
Introduction
• The Web has grown into a dominant platform for retrieving information and discovering
knowledge from web data.
• Web usage mining (also called web usage analysis, web log mining or click stream analysis):
• The process of extracting useful knowledge from web server logs, database logs, user queries,
client-side cookies and user profiles in order to analyze web users' behavior.
• Applies data mining techniques to log data to extract user behavior, which is then used in
various applications.
Logs
TYPES OF WEB SERVER LOG FILES
• Access logs:
• Store information about which files are requested from the web server.
• Referrer logs:
• Store the URLs of pages on other sites that link to the server's pages.
• If a user reaches one of the server's pages by clicking a link on another site, the URL of that site appears in
this log.
• Agent logs:
• Record information about the web clients that send requests to the web server, including the browser type and
the platform, which determine what a user is able to access on a web site.
• Error logs:
• Store information about errors and failed requests of the web server.
Types of Web log file formats
• Common Log Format (CLF)
• W3C extended log file format
• Microsoft IIS (Internet Information Services) log file format
• NCSA Common log file format
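As a rough illustration of what an access log entry carries, the sketch below parses one Common Log Format line. The regular expression, the `parse_clf_line` helper and the sample line are assumptions made for illustration; they are not part of the original slides.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field normally holds "METHOD URL PROTOCOL".
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else None
    entry["url"] = parts[1] if len(parts) > 1 else None
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical sample line (documentation IP range, invented user name).
sample = '192.0.2.10 - alice [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
print(entry["host"], entry["method"], entry["url"], entry["status"])
```

The W3C extended and IIS formats carry similar fields in a different layout, so a parser for them would need its own pattern.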
Sources of Log Data For Web Usage Mining
Server side:
• All click streams are recorded in the web server log.
• Contains basic information, e.g. the name and IP address of the
remote host and the date and time of the request.
• The web server stores data about the requests performed
by the client.
Client side:
• The client itself sends information about the user's behavior
to a repository.
• Done either with an ad-hoc browsing application or
through a client-side application running in standard browsers.
Proxy side:
• Proxy-level collection is an intermediary between server-level
and client-level collection.
• Proxy servers collect data on groups of users accessing
large groups of web servers.
We consider only web server log data.
Phases of Web Usage Mining:
Data Preprocessing (our focus):
• Transforms the raw click stream data into a set of user
profiles.
• One of the most complex phases of the web usage mining
process.
Pattern Discovery:
• Extracts information from the preprocessed data.
• Data mining, statistics, machine learning and pattern
recognition are applied to web usage data to discover users'
access patterns.
Pattern Analysis:
• Extracts the interesting patterns from the output of pattern
discovery by eliminating irrelevant patterns.
• Involves:
• Validation: removing irrelevant patterns.
• Interpretation: using visualization techniques to
present mathematical results to humans.
Steps of Data Preprocessing
• Data cleaning
• User identification
• Session identification
• Path completion
Data Cleaning:
• Irrelevant or redundant log records are removed.
• Requests for accessorial resources embedded in HTML files, robot requests and error requests are cleaned out.
• Very little research focuses purely on web log cleaning.
Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia (image, video and audio) files:
• Categorized as useless in web log preprocessing.
• Web log file size can be reduced to less than 50% of its original size by eliminating image requests.
• Web robot requests:
• Dramatically affect a web site's traffic statistics.
• Not important from the mining perspective and hence must be removed.
• HTTP status codes:
• Log entries with an unsuccessful HTTP status code are usually eliminated during web log cleaning.
• The widely accepted definition of an unsuccessful HTTP status code is a code under 200 or over 299.
• For intrusion detection:
• [3] removes all log entries with status code 200.
• [2] argued that log entries with 200-series status codes should be retained, as they may include
web attacks such as SQL injection and XSS that were executed successfully.
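The usage-mining side of these cleaning rules can be sketched as a simple filter. This is a minimal sketch under stated assumptions: each log entry is a dict with hypothetical `url`, `status` and `agent` keys, and the extension list and robot markers are illustrative rather than taken from any of the cited papers.

```python
import os
from urllib.parse import urlparse

# Illustrative lists (assumptions); a real deployment would tune these.
MULTIMEDIA_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".mp3", ".mp4", ".avi"}
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_successful(status):
    """Successful means a 200-series code (not under 200, not over 299)."""
    return 200 <= status <= 299


def is_multimedia(url):
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in MULTIMEDIA_EXTENSIONS


def is_robot(entry):
    """Crude robot check: a robots.txt request or a robot-like user agent."""
    agent = entry.get("agent", "").lower()
    asked_robots_txt = urlparse(entry["url"]).path.endswith("/robots.txt")
    return asked_robots_txt or any(marker in agent for marker in ROBOT_MARKERS)


def clean_for_usage_mining(entries):
    """Keep only successful, non-multimedia, non-robot requests."""
    return [
        e for e in entries
        if is_successful(e["status"]) and not is_multimedia(e["url"]) and not is_robot(e)
    ]


# Hypothetical sample entries.
sample_entries = [
    {"url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/img/logo.gif?v=2", "status": 200, "agent": "Mozilla/5.0"},
    {"url": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
    {"url": "/index.html", "status": 200, "agent": "Googlebot/2.1"},
    {"url": "/robots.txt", "status": 200, "agent": "curl/8.0"},
]
cleaned = clean_for_usage_mining(sample_entries)
print(len(cleaned), "of", len(sample_entries), "entries kept")
```

For intrusion detection the filter would differ, since error responses and some successful requests may be exactly the entries worth keeping.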
Attributes Involved in Web Log Cleaning
• HTTP methods:
• Few studies have included the HTTP method as an attribute in web log cleaning.
• In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are
removed, as they are not significant for web usage mining.
• For intrusion detection:
• One proposal:
• HTTP requests with the POST method should be kept.
• Another proposal:
• Keep log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other files:
• Log entries requesting accessorial resources (e.g. CSS files) embedded in HTML files should be removed.
Algorithm Design of Newest Methods:
• This method is used for data cleaning and intrusion detection on log files.
• A total of six cleaning conditions are applied:
First:
Logs with HTTP status code 200 are removed (the probability that web logs meeting this criterion contain malicious web
attacks is almost zero).
Second:
• Web logs with multimedia file extensions are removed if
• the HTTP method is not POST and
• the HTTP status code is not in the 400 or 500 series.
• Web logs with 400-series and 500-series status codes are kept, as they may indicate a malicious attempt.
• Users who trigger many web logs with HTTP error status codes are treated as suspects.
• Typically, to launch a web defacement attack, an attacker uses the HTTP POST method to replace part or all of the web
interface components.
Third:
• Requests from legitimate web robots such as Googlebot are removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs containing web robot requests from whitelisted IP addresses are removed.
Fourth:
Web logs with a legitimate file extension (.css, .pdf, .txt and .doc) are removed if:
the HTTP status code is not in the 400 or 500 series and the HTTP method is not POST.
Fifth:
Web logs with the HTTP HEAD method (used by web monitoring systems) from a legitimate IP address are removed.
A large number of HTTP HEAD requests may indicate malicious web robot activity.
Sixth:
Web logs with the HTTP POST method are removed if the posted file is legitimate.
For instance, a web log with a .svc file extension in the uri-stem and the HTTP POST method is legitimate.
A .svc file is a special content file that represents a Windows Communication Foundation (WCF) service hosted in IIS.
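The six conditions can be read as an ordered rule list. The sketch below encodes them literally as they are stated on the slides; it is not the authors' implementation. The entry keys (`ip`, `method`, `uri_stem`, `status`), the multimedia extension subset and both IP sets are hypothetical.

```python
import os

MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp3", ".mp4"}  # illustrative subset (assumption)
OTHER_LEGIT_EXT = {".css", ".pdf", ".txt", ".doc"}          # named in the fourth condition
LEGIT_POST_EXT = {".svc"}                                    # named in the sixth condition
ROBOT_WHITELIST_IPS = {"192.0.2.50"}                         # hypothetical robot whitelist
MONITORING_IPS = {"192.0.2.60"}                              # hypothetical monitoring hosts


def extension(uri_stem):
    return os.path.splitext(uri_stem)[1].lower()


def is_error(status):
    """True for 400-series and 500-series status codes."""
    return 400 <= status <= 599


def should_remove(entry):
    """Return True if any of the six cleaning conditions removes the entry."""
    ext = extension(entry["uri_stem"])
    method = entry["method"].upper()
    status = entry["status"]
    ip = entry["ip"]

    if status == 200:                                                        # first
        return True
    if ext in MULTIMEDIA_EXT and method != "POST" and not is_error(status):  # second
        return True
    if ip in ROBOT_WHITELIST_IPS:                                            # third
        return True
    if ext in OTHER_LEGIT_EXT and method != "POST" and not is_error(status): # fourth
        return True
    if method == "HEAD" and ip in MONITORING_IPS:                            # fifth
        return True
    if method == "POST" and ext in LEGIT_POST_EXT:                           # sixth
        return True
    return False


def clean(entries):
    return [e for e in entries if not should_remove(e)]
```

Entries that survive every rule, such as a GET for an image that returned 404 or a POST to an unexpected file, are the ones left for intrusion analysis.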
Implementation
Web log format:
Internet Information Services (IIS) Log Format
Attacks were simulated using three web vulnerability assessment tools:
Acunetix (run on Microsoft Windows)
Nikto and w3af (run on BackTrack GNOME)
An e-commerce site's web server was configured to send web logs to the log collector
server via User Datagram Protocol (UDP).
[Figure: Architectural diagram for attack simulation and web log collection]
Comparison of existing frameworks:
[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. (2011)
[2]: Patil, P., Patil, U. (2012)
[3]: Ong, Y.C., Ismail, Z. (2014)
[1] and [2] considered only three file extensions (.jpg, .gif and .css).
[3] defined a total of sixteen multimedia file extensions plus four other file extensions.
Comparison Factor                 Algorithm [1]                 [2]    [3]
Multimedia Files                  Yes                           Yes    Yes
Web Robots Request                No                            No     Yes
HTTP Status Code                  200, 400 series, 500 series   200    200, 400 series, 500 series
HTTP Method                       No                            GET    GET, POST, HEAD
Other Files                       Yes                           Yes    Yes
Number of Rules and Conditions    2                             1      6
Evaluation of existing frameworks:
Evaluating cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.
Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.
Evaluating intrusion detection readiness:
False negative rate = total number of malicious requests removed / total number of malicious requests
The lower the false negative rate, the better the intrusion detection readiness.
Measuring Factor               Algorithm [1]   [2]        [3]
File Size Reduced (bytes)      6945603         32423581   18957149
Number of Entries Removed      52916           215616     153372
Percentage of Reduction (%)    13.94           56.81      40.41
False Negative Rate            0.00144         0.15789    0.00531
Algorithm 3 has the second highest percentage of reduction and second lowest false negative rate compared
to the other algorithms.
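The two evaluation measures are simple ratios. The sketch below computes them for invented counts; the function names and the numbers in the usage lines are illustrative assumptions and do not come from the experiments above.

```python
def percentage_of_reduction(entries_removed, total_entries):
    """Share of log entries removed by cleaning, as a percentage."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return 100.0 * entries_removed / total_entries


def false_negative_rate(malicious_removed, total_malicious):
    """Share of malicious requests that cleaning wrongly removed."""
    if total_malicious <= 0:
        raise ValueError("total_malicious must be positive")
    return malicious_removed / total_malicious


# Hypothetical counts: 400 of 1000 entries removed, 3 of 600 malicious requests lost.
print(percentage_of_reduction(400, 1000))
print(false_negative_rate(3, 600))
```

The two measures pull in opposite directions: removing more entries raises the reduction percentage but risks discarding malicious requests, which raises the false negative rate.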
User Identification:
• Identify each distinct user.
• User identification is one way of introducing state into the stateless web.
• A very complex task because of proxy servers and caches.
1 User identification by IP address:
"Each different IP address represents a different user."
Problems:
• Several users can share the same IP address or computer (e.g. at a college or internet café).
• One user can have several IP addresses, since a user who accesses the Web from different machines has a different IP address on each.
2 User identification using user registration data:
If users log in, it is easy to identify them.
The authenticated username is recorded in the web log files.
Problems:
• Login facilities are not available on every website, so this approach is not suitable for general web browsing.
• Many users do not register.
3 User identification using cookies:
Cookies are strings exchanged in HTTP headers.
Using cookies, we can extract details of users and of the resources they access.
Problems:
• Users can block the use of cookies.
• Users can delete cookies.
Two heuristics have been proposed to help identify unique users:
• P. Pirolli, J. Pitkow and R. Rao, and K.R. Suneetha and R. Krihnamoorthi (2009), proposed:
"Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then
each different agent type for an IP address represents a different user."
User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
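The agent-based heuristic amounts to grouping requests by the pair (IP address, user agent). The following is a minimal sketch; the entry keys (`ip`, `agent`, `page`) and the sample log are hypothetical.

```python
from collections import defaultdict


def identify_users(entries):
    """Treat each distinct (IP address, user agent) pair as one user.

    Returns a dict mapping the pair to the pages requested, in log order.
    """
    users = defaultdict(list)
    for entry in entries:
        key = (entry["ip"], entry["agent"])
        users[key].append(entry["page"])
    return dict(users)


# Hypothetical log: one IP address, two different browser/OS combinations.
log = [
    {"ip": "192.0.2.10", "agent": "Browser-X on OS-1", "page": "A"},
    {"ip": "192.0.2.10", "agent": "Browser-Y on OS-2", "page": "A"},
    {"ip": "192.0.2.10", "agent": "Browser-X on OS-1", "page": "B"},
    {"ip": "192.0.2.10", "agent": "Browser-Y on OS-2", "page": "C"},
]
users = identify_users(log)
for (ip, agent), pages in users.items():
    print(ip, agent, " -> ".join(pages))
```

Two people using identical browsers behind the same IP address would still be merged into one user, which is the gap the topology-based heuristic tries to close.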
Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
Use the access log in conjunction with the referrer log and site topology to construct browsing paths for each user:
"If a requested page is not directly reachable by a hyperlink from any of the pages already visited by the user, assume
that there is another user with the same IP address."
Following the referrer field along user 1's path through the web site:
Unexpectedly, no referrer is shown for the I.html request.
There is no direct link between K.html and I.html:
It appears highly unlikely that the user who traversed A → B → E → K then proceeded to I.
It is more likely that the request for I.html came from a third user, who accessed the page directly, probably by
entering the URL into a browser with the same version and operating system:
User 1: A → B → E → K → E → L, User 2: A → C → G → M → H → N, User 3: I → O
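One way to sketch the topology heuristic: walk through the requests of one (IP, agent) group and assign each page to the first candidate user who has already visited it or visited a page linking to it; otherwise start a new user. The `links` map below is a hypothetical site topology chosen to be consistent with the example, and the first-match assignment rule is an assumption, not something the cited papers prescribe.

```python
def split_by_topology(pages, links):
    """Split one request stream into users using site topology.

    pages: requested pages in time order for one (IP, agent) pair.
    links: dict mapping a page to the set of pages it links to.
    """
    users = []  # each user is the list of pages assigned so far
    for page in pages:
        for path in users:
            reachable = page in path or any(page in links.get(p, ()) for p in path)
            if reachable:
                path.append(page)
                break
        else:
            # No existing user could have reached this page: assume a new user.
            users.append([page])
    return users


# Hypothetical topology consistent with the example paths.
links = {"A": {"B", "C"}, "B": {"E"}, "E": {"K", "L"}, "I": {"O"}}
stream = ["A", "B", "E", "K", "I", "O", "E", "L"]
print(split_by_topology(stream, links))
# [['A', 'B', 'E', 'K', 'E', 'L'], ['I', 'O']]
```

With this topology the stream originally attributed to user 1 splits into the paths shown for user 1 and user 3.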
• P. Yeng and Y. Zheng (2010) focused solely on user identification through inspired rules:
• Four constraints are used to identify users: IP address, agent information,
site topology and time information.
• Efficiency is low, but accuracy increases significantly.
• Renáta Iváncsy and Sándor Juhász analyze different user identification methods in
"Analysis of Web User Identification Methods":
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.
Session Identification (Sessionization, Session Reconstruction):
Session definition:
• The group of activities performed by a user from the moment they enter the website to the moment they
leave it.
• A set of user clicks, usually referred to as a click stream, across web servers.
• The sequence of web pages a user browses in a single visit.
Session identification:
• Grouping the different activities of a single user.
• The process of segmenting each user's access log into individual access sessions.
Session identification goals:
• Group each user's page accesses into individual access sessions.
• Identify how much time each user has spent on the website.
• Each heuristic h scans the user activity logs into which the web server log is partitioned.
Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods
Session Identification:
• Time-oriented heuristic methods:
A set of pages visited by a specific user is considered a single user session if the pages are requested within a
time interval no larger than a specified threshold.
First heuristic:
Total session duration may not exceed a threshold θ.
t₀: the timestamp of the first request in a constructed session S.
"The request with a timestamp t is assigned to S, iff t − t₀ ≤ θ" (Liu, 2007).
θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa,
2003).
Second heuristic (page-stay-time-based method):
Total time spent on a page may not exceed a threshold δ.
t₁: the timestamp of the most recent request assigned to the constructed session S.
The next request, with timestamp t₂, is assigned to S iff t₂ − t₁ ≤ δ (Liu, 2007).
A conservative page-stay threshold of δ = 10 min has been proposed to capture the time for loading
and studying the contents of a page (Spiliopoulou et al., 2003).
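Both time-oriented heuristics can be sketched with one function that takes θ and δ as optional thresholds. The function name, the `(timestamp, page)` input shape and the timestamps below are assumptions for illustration; allowing both thresholds at once is a convenience of this sketch, not something the cited sources require.

```python
from datetime import datetime, timedelta


def sessionize_by_time(requests, theta=None, delta=None):
    """Split one user's requests into sessions using time thresholds.

    requests: list of (timestamp, page) tuples sorted by timestamp.
    theta: maximum total session duration (t - t0 <= theta), or None to ignore.
    delta: maximum page-stay time (t2 - t1 <= delta), or None to ignore.
    """
    sessions, current = [], []
    for timestamp, page in requests:
        if current:
            too_long = theta is not None and timestamp - current[0][0] > theta
            too_idle = delta is not None and timestamp - current[-1][0] > delta
            if too_long or too_idle:
                sessions.append(current)
                current = []
        current.append((timestamp, page))
    if current:
        sessions.append(current)
    return [[page for _, page in session] for session in sessions]


# Hypothetical timestamps for one user.
clicks = [
    (datetime(2024, 1, 1, 10, 0), "A"),
    (datetime(2024, 1, 1, 10, 2), "B"),
    (datetime(2024, 1, 1, 10, 5), "E"),
    (datetime(2024, 1, 1, 10, 8), "K"),
    (datetime(2024, 1, 1, 10, 45), "E"),
    (datetime(2024, 1, 1, 10, 47), "L"),
]
print(sessionize_by_time(clicks, theta=timedelta(minutes=30)))
print(sessionize_by_time(clicks, delta=timedelta(minutes=10)))
```

For these timestamps both thresholds cut the stream at the same place, before the second request for E; on real logs the two heuristics often disagree.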
Session Identification:
• Navigation-oriented heuristic methods :
These methods assume that web users reach pages by following hyperlinks rather than by typing URLs.
Topology-based heuristic:
"If a web page is not connected with previously visited page in a session, then it is considered
as a different session."
Referrer-based heuristic (Cooley et al., 1999), based on referrer information:
• The referrer of a requested page P should be a page already in the session (a previously
visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, it is likely to be the first page of a new session.
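A minimal sketch of the referrer-based heuristic follows. It assumes each request is a hypothetical `(page, referrer)` pair for a single user, and it searches the most recent sessions first when looking for one that contains the referrer; that search order is a design choice of this sketch.

```python
def sessionize_by_referrer(requests):
    """Assign each page to a session that already contains its referrer.

    requests: list of (page, referrer) tuples for one user in time order;
    referrer is None or '' when the log shows no referrer.
    """
    sessions = []
    for page, referrer in requests:
        target = None
        if referrer:
            # Prefer the most recently opened session containing the referrer.
            for session in reversed(sessions):
                if referrer in session:
                    target = session
                    break
        if target is None:
            # Empty or unknown referrer: treat as the first page of a new session.
            sessions.append([page])
        else:
            target.append(page)
    return sessions


# Hypothetical request stream.
requests = [("A", None), ("B", "A"), ("E", "B"), ("I", None), ("O", "I"), ("L", "E")]
print(sessionize_by_referrer(requests))
# [['A', 'B', 'E', 'L'], ['I', 'O']]
```

A topology-based variant would replace the referrer lookup with a check against the site's link graph, which still works when the referrer field is empty.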
Session Identification:
• Spiliopoulou et al. evaluate different heuristics in "A Framework for the Evaluation of Session
Reconstruction Heuristics in Web Usage Analysis":
• Time-based methods are not reliable, because users may engage in other activities after
opening a web page.
• Referrer-based heuristics are more restrictive than topology-based heuristics, because there
are cases where a page request has an empty referrer.
• Different methods are used by different applications.
• Experiments showed that no heuristic is best in all cases.
• Even for a simple application, two variations in the method of assessing reconstruction quality led
to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
• A combined technique based on both heuristics for session identification.
• It uses web topology and page stay time.
Session Identification (time-oriented heuristic example):
For user 1, there is more than a 30-minute delay between the request for page K.html and
the second request for page E.html, so user 1's requests are split into two sessions:
Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L
Path Completion :
• A critical phase of preprocessing.
• The number of URLs recorded in the log may be smaller than the number actually visited:
• Some important page requests are not recorded in the server log because of proxy servers, local caching
and use of the browser's Back button.
• Definition:
• “The process of reconstructing the user’s navigation path, by appending missed page requests (page
requests that are not recorded in server log) in order to analyze the data in a proper way within the identified
sessions”.
• Used to obtain the complete user access path.
Path Completion :
Methods similar to those used for user identification can be used for path
completion.
Heuristic methods based on the referrer log and site topology are employed.
Cooley, R., Mobasher, B., & Srivastava, J. (1999):
Missing pages are added as follows:
Each page request is checked to see whether it is directly linked from the last page:
If there is no link from the last page, check the recent history.
If a page that links to the request is found in the recent history, it is assumed that the "Back" button
was used to step back through cached pages until that page was reached.
Path Completion :
Considering session 2:
Session 2 (user 2): A → C → G → M → H → N
There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the
"Back" button on the browser twice, returning from M to G and then to C, which links to H.
The path completion process leads us to insert "→ G → C" into the session path for session 2:
Session 2 (user 2): A → C → G → M → G → C → H → N
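The back-button reconstruction in this example can be sketched as follows. The `links` map is a hypothetical site topology consistent with the example, and walking the raw history backwards is a simplification of how a browser's back stack behaves.

```python
def complete_path(session, links):
    """Insert presumed Back-button steps where consecutive pages are not linked.

    session: pages of one session in request order.
    links: dict mapping a page to the set of pages it links to.
    """
    completed = []
    for page in session:
        if completed and page not in links.get(completed[-1], ()):
            # Walk back through the history until a page that links to `page` is found.
            backtrack = []
            for previous in reversed(completed[:-1]):
                backtrack.append(previous)
                if page in links.get(previous, ()):
                    completed.extend(backtrack)
                    break
            # If no linking page is found, nothing is inserted.
        completed.append(page)
    return completed


# Hypothetical topology consistent with session 2.
links = {"A": {"C"}, "C": {"G", "H"}, "G": {"M"}, "H": {"N"}}
session_2 = ["A", "C", "G", "M", "H", "N"]
print(" → ".join(complete_path(session_2, links)))
# A → C → G → M → G → C → H → N
```

When no page in the history links to the request, the sketch leaves the path unchanged, which matches the case of a user typing a URL directly.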
Main References:
1. Ong, Y. C., & Ismail, Z. (2014). Enhanced Web Log Cleaning Algorithm for Web Intrusion Detection. In Recent
Advances in Information and Communication Technology (pp. 315-324). Springer International Publishing.
2. Salama, S. E., Marie, M. I., El-Fangary, L. M., & Helmy, Y. K. (2011). Web Server Logs Preprocessing for Web
Intrusion Detection. Computer and Information Science, 4, 123-133.
3. Patil, P., & Patil, U. (2012). Preprocessing of web server log file for web mining. World Journal of Science and
Technology, 2, 14-18.
4. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns.
Journal of Knowledge and Information Systems, 1.
5. Das, R., & Turkoglu, I. (2009). Creating meaningful data from web logs for improving the impressiveness of a
website by using path analysis method. Expert Systems with Applications, 36(3), 6635-6644.
6. Yeng, P., & Zheng, Y. (2010). Inspired Rule-Based User Identification. LNCS 6440, pp. 618-624.
7. Suneetha, K. R., & Krihnamoorthi, R. (2009). Identifying User Behavior by Analyzing Web Server Access Log File.
IJCSNS.
8. …
QUESTION??...

Weitere ähnliche Inhalte

Was ist angesagt?

Lecture 1 (distributed systems)
Lecture 1 (distributed systems)Lecture 1 (distributed systems)
Lecture 1 (distributed systems)Fazli Amin
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Sri Prasanna
 
Global state recording in Distributed Systems
Global state recording in Distributed SystemsGlobal state recording in Distributed Systems
Global state recording in Distributed SystemsArsnet
 
Foult Tolerence In Distributed System
Foult Tolerence In Distributed SystemFoult Tolerence In Distributed System
Foult Tolerence In Distributed SystemRajan Kumar
 
Distributed File Systems
Distributed File Systems Distributed File Systems
Distributed File Systems Maurvi04
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
google file system
google file systemgoogle file system
google file systemdiptipan
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMSkoolkampus
 
Database ,14 Parallel DBMS
Database ,14 Parallel DBMSDatabase ,14 Parallel DBMS
Database ,14 Parallel DBMSAli Usman
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocolsZongYing Lyu
 

Was ist angesagt? (20)

Unit 1
Unit 1Unit 1
Unit 1
 
Web mining
Web miningWeb mining
Web mining
 
Lecture 1 (distributed systems)
Lecture 1 (distributed systems)Lecture 1 (distributed systems)
Lecture 1 (distributed systems)
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Global state recording in Distributed Systems
Global state recording in Distributed SystemsGlobal state recording in Distributed Systems
Global state recording in Distributed Systems
 
Foult Tolerence In Distributed System
Foult Tolerence In Distributed SystemFoult Tolerence In Distributed System
Foult Tolerence In Distributed System
 
Distributed File Systems
Distributed File Systems Distributed File Systems
Distributed File Systems
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Distributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithmsDistributed Mutual exclusion algorithms
Distributed Mutual exclusion algorithms
 
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS ArchitectureDistributed DBMS - Unit 3 - Distributed DBMS Architecture
Distributed DBMS - Unit 3 - Distributed DBMS Architecture
 
google file system
google file systemgoogle file system
google file system
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
20. Parallel Databases in DBMS
20. Parallel Databases in DBMS20. Parallel Databases in DBMS
20. Parallel Databases in DBMS
 
File system structure
File system structureFile system structure
File system structure
 
Distributed System ppt
Distributed System pptDistributed System ppt
Distributed System ppt
 
Database ,14 Parallel DBMS
Database ,14 Parallel DBMSDatabase ,14 Parallel DBMS
Database ,14 Parallel DBMS
 
Consistency protocols
Consistency protocolsConsistency protocols
Consistency protocols
 
7 Deadlocks
7 Deadlocks7 Deadlocks
7 Deadlocks
 

Andere mochten auch

Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage MiningDaminda Herath
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logsijsrd.com
 
Applying web mining application for user behavior understanding
Applying web mining application for user behavior understandingApplying web mining application for user behavior understanding
Applying web mining application for user behavior understandingZakaria Zubi
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation FinalEr. Jagrat Gupta
 
imortance of w3c validation
imortance  of w3c validationimortance  of w3c validation
imortance of w3c validationdjjgirish
 
Learning to Classify Users in Online Interaction Networks
Learning to Classify Users in Online Interaction NetworksLearning to Classify Users in Online Interaction Networks
Learning to Classify Users in Online Interaction NetworksSymeon Papadopoulos
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...idescitation
 
Dotnet titles 2016 17
Dotnet titles 2016 17Dotnet titles 2016 17
Dotnet titles 2016 17praba123456
 
Web Navigation Presentation
Web Navigation PresentationWeb Navigation Presentation
Web Navigation Presentationglvsav37
 
The FOCUS K3D Project
The FOCUS K3D ProjectThe FOCUS K3D Project
The FOCUS K3D ProjectFOCUS K3D
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithmPradip Kumar
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail MarketingJonathan Sedar
 

Andere mochten auch (20)

Personal Web Usage Mining
Personal Web Usage MiningPersonal Web Usage Mining
Personal Web Usage Mining
 
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web LogsWeb Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
 
5463 26 web mining
5463 26 web mining5463 26 web mining
5463 26 web mining
 
Applying web mining application for user behavior understanding
Applying web mining application for user behavior understandingApplying web mining application for user behavior understanding
Applying web mining application for user behavior understanding
 
Web Mining Presentation Final
Web Mining Presentation FinalWeb Mining Presentation Final
Web Mining Presentation Final
 
imortance of w3c validation
imortance  of w3c validationimortance  of w3c validation
imortance of w3c validation
 
Learning to Classify Users in Online Interaction Networks
Learning to Classify Users in Online Interaction NetworksLearning to Classify Users in Online Interaction Networks
Learning to Classify Users in Online Interaction Networks
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
 
Dotnet titles 2016 17
Dotnet titles 2016 17Dotnet titles 2016 17
Dotnet titles 2016 17
 
Webmining ppt
Webmining pptWebmining ppt
Webmining ppt
 
A survey on web usage mining techniques
A survey on web usage mining techniquesA survey on web usage mining techniques
A survey on web usage mining techniques
 
Data mining
Data miningData mining
Data mining
 
Web Navigation Presentation
Web Navigation PresentationWeb Navigation Presentation
Web Navigation Presentation
 
The FOCUS K3D Project
The FOCUS K3D ProjectThe FOCUS K3D Project
The FOCUS K3D Project
 
Data mining
Data miningData mining
Data mining
 
Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Web mining
Web miningWeb mining
Web mining
 
Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithm
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
 

Ähnlich wie Preprocessing of Web Log Data for Web Usage Mining

Avtar's ppt
Avtar's pptAvtar's ppt
Avtar's pptmak57
 
Web Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage miningWeb Data mining-A Research area in Web usage mining
Web Data mining-A Research area in Web usage miningIOSR Journals
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
Preprocessing of Web Log Data for Web Usage Mining

  • 1. Preprocessing on Web Log Data for Web Usage Mining. Shahid Rajaee Teacher Training University, Faculty of Computer Engineering. Presented by: Amir Masoud Sefidian
  • 2. Outline:
      • Introduction
      • Web Log Files
      • Phases of Web Usage Mining
      • Steps of Data Preprocessing
      • Data Cleaning
      • User Identification
      • Session Identification
      • Path Completion
      • Main References
  • 4. Introduction
      • The Web has grown into a dominant platform for retrieving information and discovering knowledge from web data.
      • Web usage mining (also called web usage analysis, web log mining, or click-stream analysis) is the process of extracting useful knowledge from web server logs, database logs, user queries, client-side cookies, and user profiles in order to analyze the behavior of web users.
      • It applies data mining techniques to log data to extract user behavior, which is then used in various applications.
  • 6. Logs
      Types of web server log files:
      • Access logs: store information about which files are requested from the web server.
      • Referrer logs: store the URLs of pages on other sites that link to the server's pages. If a user reaches one of the server's pages by clicking a link on another site, the URL of that site appears in this log.
      • Agent logs: record information about the web clients that send requests to the web server, including the browser type and the platform, which determine what a user is able to access on a web site.
      • Error logs: store information about errors and failed requests on the web server.
      Types of web log file formats:
      • Common Log Format (CLF)
      • W3C extended log file format
      • Microsoft IIS (Internet Information Services) log file format
      • NCSA Common log file format
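To make the access-log idea concrete, here is a minimal Python sketch that parses one line in the Common Log Format (`host ident authuser [date] "request" status bytes`). The regular expression, the function name `parse_clf_line`, and the sample line are illustrative assumptions, not part of the presentation; real logs often carry extra fields (referrer, user agent) that this pattern ignores.

```python
import re
from datetime import datetime

# Assumed pattern for a plain Common Log Format line.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Oct/2000:13:55:36 -0700.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no body was returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
```

Parsing each line into named fields first keeps the later preprocessing steps (cleaning, user identification, sessionization) independent of the concrete log format.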
  • 7. Sources of Log Data for Web Usage Mining
      Server side:
      • All click streams are recorded in the web server log.
      • The log contains basic information, e.g. the name and IP address of the remote host and the date and time of the request.
      • The web server stores data about the requests performed by the client.
      Client side:
      • The client itself sends information about the user's behavior to a repository.
      • This is done either with an ad-hoc browsing application or through a client-side application running in standard browsers.
      Proxy side:
      • Proxy-level collection is an intermediary between server-level and client-level collection.
      • Proxy servers collect data on groups of users accessing large groups of web servers.
      We consider only the case of web server log data.
  • 9. Phases of Web Usage Mining
      Data preprocessing (our focus):
      • Transforms the raw click-stream data into a set of user profiles.
      • One of the most complex phases of the web usage mining process.
      Pattern discovery:
      • Extracts information from the preprocessed data.
      • Data mining, statistics, machine learning, and pattern recognition are applied to web usage data to discover user access patterns.
      Pattern analysis:
      • Extracts the interesting patterns from the output of pattern discovery by eliminating irrelevant patterns.
      • Involves:
          • Validation: removing irrelevant patterns.
          • Interpretation: using visualization techniques to present mathematical results to humans.
  • 11. Steps of Data Preprocessing
  • 13. Data Cleaning
      • Irrelevant or redundant log records are removed.
      • Cleaning targets accessory resources embedded in HTML files, robot requests, and error requests.
      • Almost no research focuses purely on web log cleaning.
      Attributes involved in web log cleaning and intrusion detection:
      • Multimedia files (images, videos, and audio):
          • Categorized as useless files in web log preprocessing.
          • Eliminating image requests can reduce a web log file to less than 50% of its original size.
      • Web robot requests:
          • Dramatically affect a web site's traffic statistics.
          • Not important from the mining perspective, and hence must be removed.
      • HTTP status codes:
          • Log entries with an unsuccessful HTTP status code are usually eliminated during web log cleaning.
          • The widely accepted definition of an unsuccessful HTTP status code is a code under 200 or over 299.
          • For intrusion detection:
              • [3] removes all log entries with status code 200.
              • [2] argues that log entries with 200-series status codes should be retained, as they may include web attacks such as SQL injection and XSS that executed successfully.
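The cleaning attributes above, applied for web usage mining rather than intrusion detection, can be sketched as a single filter in Python. The entry fields (`path`, `status`, `agent`), the extension set, and the robot markers are illustrative assumptions; a real deployment would use its own lists.

```python
import os
from urllib.parse import urlparse

# Hypothetical lists, chosen only for illustration.
MULTIMEDIA_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".mp3", ".mp4", ".avi"}
ROBOT_MARKERS = ("bot", "crawler", "spider")

def is_relevant(entry):
    """Keep an entry only if it is a successful, human page request."""
    path = urlparse(entry["path"]).path
    extension = os.path.splitext(path)[1].lower()
    # Multimedia requests are treated as useless for usage mining.
    if extension in MULTIMEDIA_EXTENSIONS:
        return False
    # Robots are detected here by user-agent substring or a robots.txt request.
    agent = entry.get("agent", "").lower()
    if path == "/robots.txt" or any(marker in agent for marker in ROBOT_MARKERS):
        return False
    # Unsuccessful status codes are those under 200 or over 299.
    return 200 <= entry["status"] <= 299

log = [
    {"path": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"path": "/logo.gif", "status": 200, "agent": "Mozilla/5.0"},
    {"path": "/index.html", "status": 200, "agent": "Googlebot/2.1"},
    {"path": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
]
cleaned = [entry for entry in log if is_relevant(entry)]
```

Only the first entry survives in this toy log. For intrusion detection the status-code rule would have to change, since error responses and successful attacks are exactly the entries of interest.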
  • 14. Attributes Involved in Web Log Cleaning
      • HTTP methods:
          • Few studies have included the HTTP method as an attribute in web log cleaning.
          • In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are removed, as these are not significant in web usage mining.
          • For intrusion detection, one proposal keeps HTTP requests with the POST method; another keeps log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
      • Other files:
          • Log entries for requests to accessory resources (e.g. CSS files) embedded in HTML files should be removed.
  • 15. Algorithm Design of the Newest Method
      • This method is used for data cleaning and intrusion detection from log files.
      • A total of six cleaning conditions are applied.
      First:
      • Log entries with HTTP status code 200 are removed (the probability that web logs meeting this criterion contain malicious web attacks is almost zero).
      Second:
      • Log entries with multimedia file extensions are removed if the HTTP request is not a POST and the HTTP status code is not in the 400 or 500 series.
      • Log entries with 400-series and 500-series status codes should be kept, as they may be malicious attempts.
      • Users who trigger many log entries with HTTP error status codes are subject to suspicion.
      • In the common case, an attacker launching a web defacement attack uses the HTTP POST method to replace part or all of the web interface components.
      Third:
      • Legitimate web robot requests, such as those from Googlebot, are removed.
      • Specific IP addresses are included in a web robot IP whitelist.
      • Log entries for web robot requests from whitelisted IP addresses are removed.
  • 16. Algorithm Design of the Newest Method (continued)
      Fourth:
      • Log entries with a legitimate file extension (.css, .pdf, .txt, and .doc) are removed if the HTTP status code is not in the 400 or 500 series and the HTTP method is not POST.
      Fifth:
      • Log entries with the HTTP HEAD method (used in web monitoring systems) and a legitimate IP address are removed.
      • A large number of HTTP HEAD requests may indicate malicious web robot activity.
      Sixth:
      • Log entries with the HTTP POST method are removed if the posted files are legitimate.
      • For instance, a log entry with a .svc file extension in the uri-stem and the HTTP POST method is legitimate. An .svc file is a special content file that represents a Windows Communication Foundation (WCF) service hosted in IIS.
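The second, third, and fourth conditions share one guard: an entry is dropped by file extension only when it is neither a POST nor an error response. The Python sketch below illustrates just those three conditions under stated assumptions; the extension subsets, the whitelist address, the field names (`uri_stem`, `method`, `status`, `ip`), and the function names are hypothetical and are not taken from the original algorithm.

```python
import os

# Illustrative subsets; the algorithm described above uses longer lists.
MULTIMEDIA_EXTENSIONS = {".jpg", ".gif", ".png", ".mp4"}
OTHER_EXTENSIONS = {".css", ".pdf", ".txt", ".doc"}
ROBOT_IP_WHITELIST = {"192.0.2.10"}  # hypothetical whitelisted robot address

def is_suspicious(entry):
    """POST requests and 400/500-series responses are kept for inspection."""
    return entry["method"] == "POST" or 400 <= entry["status"] <= 599

def removed_by_conditions_2_to_4(entry):
    """Return True if the second, third, or fourth condition removes the entry."""
    extension = os.path.splitext(entry["uri_stem"])[1].lower()
    # Third condition: requests from whitelisted robot IPs are removed.
    if entry["ip"] in ROBOT_IP_WHITELIST:
        return True
    # Second and fourth conditions: extension-based removal with the shared guard.
    if extension in MULTIMEDIA_EXTENSIONS or extension in OTHER_EXTENSIONS:
        return not is_suspicious(entry)
    return False

image_get = {"uri_stem": "/img/logo.gif", "method": "GET", "status": 304, "ip": "198.51.100.7"}
image_post = {"uri_stem": "/img/logo.gif", "method": "POST", "status": 201, "ip": "198.51.100.7"}
css_error = {"uri_stem": "/site.css", "method": "GET", "status": 404, "ip": "198.51.100.7"}
```

Here `image_get` is removed, while `image_post` and `css_error` are kept because they match the defacement and error patterns that the guard is meant to preserve.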
  • 17. Implementation
      • Web log format: Internet Information Services (IIS) log format.
      • Attacks were simulated using three web vulnerability assessment tools: Acunetix (run on Microsoft Windows), and Nikto and w3af (run on BackTrack GNOME).
      • An e-commerce site web server is configured to send web logs to the log collector server via the User Datagram Protocol (UDP).
      [Figure: Architectural diagram for attack simulation and web log collection]
  • 18. Comparison of Existing Frameworks
      Algorithms compared:
      [1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
      [2]: Patil, P., Patil, U. 2012
      [3]: Yew Chuan Ong and Zuraini Ismail 2014
      [1] and [2] considered only three file extensions (.jpg, .gif, and .css). Algorithm [3] defines a total of sixteen multimedia file extensions plus four other file extensions.
      Comparison factor              | [1]                         | [2] | [3]
      Multimedia files               | Yes                         | Yes | Yes
      Web robot requests             | No                          | No  | Yes
      HTTP status code               | 200, 400 series, 500 series | 200 | 200, 400 series, 500 series
      HTTP method                    | No                          | GET | GET, POST, HEAD
      Other files                    | Yes                         | Yes | Yes
      Number of rules and conditions | 2                           | 1   | 6
  • 19. Evaluation of Existing Frameworks
      Cleaning capability is evaluated by:
      • The size of the web log file in bytes.
      • The number of web log entries, based on the total number of lines in the web log file.
      Percentage of reduction = (total number of web log entries removed / total number of web log entries) × 100%
      The higher the percentage of reduction, the better the cleaning capability.
      Intrusion detection readiness is evaluated by:
      False negative rate = total number of malicious requests removed / total number of malicious requests
      The lower the false negative rate, the better the intrusion detection readiness.
      Measuring factor            | [1]     | [2]      | [3]
      File size reduced (bytes)   | 6945603 | 32423581 | 18957149
      Number of entries removed   | 52916   | 215616   | 153372
      Percentage of reduction (%) | 13.94   | 56.81    | 40.41
      False negative rate         | 0.00144 | 0.15789  | 0.00531
      Algorithm [3] has the second-highest percentage of reduction and the second-lowest false negative rate of the three algorithms.
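The two evaluation measures are simple ratios, sketched below in Python. The function names and the input counts are made up for illustration; they are not the figures from the comparison, whose total entry count is not stated here.

```python
def percentage_of_reduction(entries_removed, total_entries):
    """Share of log entries removed by cleaning, in percent."""
    return 100.0 * entries_removed / total_entries

def false_negative_rate(malicious_removed, total_malicious):
    """Share of malicious requests that cleaning wrongly discarded."""
    return malicious_removed / total_malicious

# Hypothetical counts, for illustration only.
reduction = percentage_of_reduction(entries_removed=400, total_entries=1000)
fnr = false_negative_rate(malicious_removed=2, total_malicious=400)
```

The two measures pull in opposite directions: aggressive cleaning raises the percentage of reduction but risks discarding malicious requests, which raises the false negative rate.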
  • 21. User Identification
      • The goal is to identify each distinct user.
      • User identification is one way of introducing state into the stateless Web.
      • It is a very complex task because of proxy servers and caches.
      1. User identification by IP address: "Each different IP address represents a different user."
      Problems:
      • Several users can share the same IP address or computer (e.g. at a college or an internet café).
      • One user can have several IP addresses, since a user who accesses the Web from different machines has a different IP address on each.
      2. User identification using user registration data: if users log in with their information, it is easy to identify them. The authenticated username is also recorded in the web log files.
      Problems:
      • This facility is not available on every web site, so it is not appropriate for general web browsing.
      • Many users do not register their information.
      3. User identification using cookies: cookies are sent in HTTP headers as strings. Using cookies, we can extract details about users and the resources they access.
      Problems:
      • Users can block the use of cookies.
      • Users can delete cookies.
  • 22. Two heuristics have been proposed to help identify unique users.
      • P. Pirolli, J. Pitkow, and R. Rao, and K.R. Suneetha and Dr. R. Krihnamoorthi (2009), proposed: "Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then each different agent type for an IP address represents a different user."
      Example:
      User 1: A → B → E → K → I → O → E → L
      User 2: A → C → G → M → H → N
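The agent-based heuristic amounts to grouping requests by the pair (IP address, user agent). The Python sketch below assumes hypothetical entry fields `ip`, `agent`, and `page`, and an invented interleaved log that reproduces the two example paths.

```python
from collections import defaultdict

def identify_users(entries):
    """Group page requests by (IP address, user agent) in log order."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry["page"])
    return dict(users)

# Hypothetical log: one IP address, two different agents.
ip = "203.0.113.5"
requests = [
    ("A", "Firefox/Windows"), ("B", "Firefox/Windows"), ("A", "Chrome/Linux"),
    ("E", "Firefox/Windows"), ("C", "Chrome/Linux"), ("K", "Firefox/Windows"),
    ("G", "Chrome/Linux"), ("I", "Firefox/Windows"), ("M", "Chrome/Linux"),
    ("O", "Firefox/Windows"), ("H", "Chrome/Linux"), ("E", "Firefox/Windows"),
    ("N", "Chrome/Linux"), ("L", "Firefox/Windows"),
]
entries = [{"ip": ip, "agent": agent, "page": page} for page, agent in requests]
users = identify_users(entries)
```

With this input the single IP address is split into two users whose paths are A, B, E, K, I, O, E, L and A, C, G, M, H, N.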
  • 23. Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
      • Use the access log in conjunction with the referrer log and the site topology to construct browsing paths for each user: "If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that there is another user with the same IP address."
      • Following the referrer field along user 1's path through the web site, there is unexpectedly no referrer for the I.html request, and there is no direct link between K.html and I.html.
      • It is therefore highly unlikely that the user who traversed A → B → E → K then proceeded to I. More likely, the request for I.html came from a third user who accessed the page directly, probably by entering the URL into a browser with the same version and operating system.
      User 1: A → B → E → K → E → L
      User 2: A → C → G → M → H → N
      User 3: I → O
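A minimal Python sketch of the topology-based heuristic follows. It assumes the requests already share one IP address and agent, and it uses an invented link map (`links`) for the example site; assigning a request to the most recent matching user is also an assumption of this sketch.

```python
def split_users_by_topology(pages, links):
    """Split one request stream into users using site topology.

    A page joins the most recent user who has visited a page linking to it;
    if no such user exists, the request is attributed to a new user.
    """
    users = []
    for page in pages:
        owner = None
        for visited in reversed(users):
            if any(page in links.get(previous, set()) for previous in visited):
                owner = visited
                break
        if owner is None:
            owner = []
            users.append(owner)
        owner.append(page)
    return users

# Hypothetical topology for the example site.
links = {"A": {"B", "C"}, "B": {"E"}, "E": {"K", "L"}, "I": {"O"}}
stream = ["A", "B", "E", "K", "I", "O", "E", "L"]
users = split_users_by_topology(stream, links)
```

On this stream the sketch separates A, B, E, K, E, L from I, O, matching the reasoning that I.html is not reachable from any page the first user visited.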
  • 24. P. Yeng and Y. Zheng (2010) focus solely on user identification through inspired rules:
      • Four constraints are used to identify users: IP address, agent information, site topology, and time information.
      • Efficiency is low, but accuracy increases significantly.
      Renáta Iváncsy and Sándor Juhász analyze different user identification methods in "Analysis of Web User Identification Methods":
      • Heuristics are not error-proof.
      • Different heuristics must be selected depending on the situation and application.
  • 26. Session Identification (Sessionization, Session Reconstruction)
      Session definitions:
      • The group of activities performed by a user from the moment they enter the web site to the moment they leave it.
      • A set of user clicks across web servers, usually referred to as a click stream.
      • The sequence of web pages a user browses in a single access.
      Session identification:
      • Grouping the different activities of a single user.
      • The process of segmenting the access log of each user into individual access sessions.
      Goals of session identification:
      • Group the page accesses of each user into individual access sessions.
      • Identify how much time each user has spent on the web site.
      • Each heuristic h scans the user activity logs into which the web server log has been partitioned.
      Two general approaches:
      • Time-oriented heuristic methods
      • Navigation-oriented heuristic methods
  • 27. Session Identification: time-oriented heuristic methods
      • A set of pages visited by a specific user is considered a single user session if the pages are requested within a time interval no larger than a specified period.
      First heuristic:
      • The total session duration may not exceed a threshold θ.
      • Let t0 be the timestamp of the first request in a constructed session S. "The request with a timestamp t is assigned to S, iff t − t0 ≤ θ" (Liu, 2007).
      • θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003).
      Second heuristic (page-stay-time-based method):
      • The total time spent on a page may not exceed a threshold δ.
      • Let t1 be the timestamp of the most recent request assigned to the constructed session S. The next request, with timestamp t2, is assigned to S iff t2 − t1 ≤ δ (Liu, 2007).
      • A conservative page-stay threshold of δ = 10 min has been proposed to capture the time for loading and studying the contents of a page (Spiliopoulou et al., 2003).
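Both time-oriented heuristics can be sketched in one Python function that takes the session-duration threshold (θ) and the page-stay threshold (δ) as optional parameters. Applying the two together, and the name `sessionize_by_time`, are assumptions of this sketch; passing `None` disables the corresponding heuristic.

```python
from datetime import datetime, timedelta

def sessionize_by_time(timestamps, theta=timedelta(minutes=30), delta=timedelta(minutes=10)):
    """Split one user's request timestamps into sessions.

    theta bounds t - t0 (time since the first request of the session);
    delta bounds t2 - t1 (time since the previous request).
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions:
            current = sessions[-1]
            within_duration = theta is None or t - current[0] <= theta
            within_page_stay = delta is None or t - current[-1] <= delta
            if within_duration and within_page_stay:
                current.append(t)
                continue
        sessions.append([t])
    return sessions

start = datetime(2024, 1, 1, 9, 0)
clicks = [start + timedelta(minutes=m) for m in (0, 5, 12, 40)]
sessions = sessionize_by_time(clicks)
```

In this example the first three requests form one session and the request at minute 40 opens a second one, because it exceeds both thresholds.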
  • 28. Session Identification: navigation-oriented heuristic methods
      • These methods assume that web users reach pages by following hyperlinks rather than by typing URLs.
      Topology-based heuristic:
      • "If a web page is not connected with a previously visited page in a session, then it is considered as a different session."
      Referrer-based heuristic (Cooley et al., 1999), based on the referrer information:
      • The referrer of a requested page P should be a page already in the session (a previously visited page); otherwise P is assigned to a different session.
      • If the page has an empty referrer, it is likely to be the first page of a new session.
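The referrer-based heuristic can be sketched in Python as follows. Each request is a hypothetical `(page, referrer)` pair; searching earlier sessions for the referrer, so that interleaved sessions can continue, is an assumption of this sketch rather than part of the stated rule.

```python
def sessionize_by_referrer(requests):
    """Assign each (page, referrer) request to a session.

    A request joins the most recent session that already contains its
    referrer; an empty or unknown referrer starts a new session.
    """
    sessions = []
    for page, referrer in requests:
        target = None
        if referrer:
            for session in reversed(sessions):
                if referrer in session:
                    target = session
                    break
        if target is None:
            target = []
            sessions.append(target)
        target.append(page)
    return sessions

# Hypothetical requests of one user; "" marks an empty referrer.
requests = [("A", ""), ("B", "A"), ("E", "B"), ("I", ""), ("O", "I"), ("K", "E")]
sessions = sessionize_by_referrer(requests)
```

The request for I has an empty referrer and starts a new session, while K, whose referrer E belongs to the first session, is still attached to that session.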
  • 29. Session Identification: evaluation of heuristics
      Spiliopoulou evaluates different heuristics in "A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis":
      • Time-based methods are not reliable, because users may become involved in other activities after opening a web page.
      • Referrer-based heuristics are more restrictive than topology-based heuristics, because there are cases where a page request has an empty referrer.
      • Different methods are used by different applications.
      • Experiments showed that no heuristic is best in all cases.
      • Even for a simple application, two variations in the method of assessing reconstruction quality led to significantly different precision scores among the heuristics.
      G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya, and Prakash K. Aithal (2016) proposed:
      • A combined technique for session identification based on both kinds of heuristics.
      • It uses web topology and page-stay time.
  • 30. Session Identification (time-oriented heuristic example)
      For user 1, there is more than a 30-minute delay between the request for page K.html and the second request for page E.html, so user 1's requests are split into two sessions:
      Session 1 (user 1): A → B → E → K
      Session 2 (user 2): A → C → G → M → H → N
      Session 3 (user 3): I → O
      Session 4 (user 1): E → L
  • 32. Path Completion
      • A critical phase of preprocessing.
      • The number of URLs recorded in the log may be smaller than the number actually visited: some important page requests are not recorded in the server log because of proxy servers, use of the browser's Back button, and local caching.
      • Definition: "The process of reconstructing the user's navigation path, by appending missed page requests (page requests that are not recorded in server log) in order to analyze the data in a proper way within the identified sessions".
      • Used to obtain the complete user access path.
  • 33. Path Completion (methods)
      • Methods similar to those used for user identification can be used for path completion.
      • Heuristic methods based on the referrer log and the site topology are employed.
      Cooley, R., Mobasher, B., & Srivastava, J. (1999) add missing pages as follows:
      • Each page request is checked to see whether it is directly linked to the last page.
      • If there is no link from the last page, the recent history is checked.
      • If a linking page is found in the recent history, it is assumed that the user pressed the "Back" button, calling up cached pages, until that page was reached.
  • 34. Path Completion (example)
      Consider session 2:
      Session 2 (user 2): A → C → G → M → H → N
      There is no direct link between page M.html and page H.html, so the user is presumed to have hit the "Back" button on the browser twice. Path completion therefore inserts "→ G → C" into the path for session 2:
      Session 2 (user 2): A → C → G → M → G → C → H → N
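The backtracking rule behind this example can be sketched in Python. The link map below is a hypothetical topology chosen to be consistent with session 2, and `complete_path` is an illustrative name; when no earlier page links to the request, the sketch leaves the path unchanged.

```python
def complete_path(session, links):
    """Insert presumed Back-button steps into a recorded session path."""
    completed = []
    trail = []  # pages the user can still reach with the Back button
    for page in session:
        if trail and page not in links.get(trail[-1], set()):
            candidate = list(trail)
            backtracked = []
            # Step back until a page that links to the requested page is found.
            while len(candidate) > 1 and page not in links.get(candidate[-1], set()):
                candidate.pop()
                backtracked.append(candidate[-1])
            if page in links.get(candidate[-1], set()):
                completed.extend(backtracked)
                trail = candidate
        completed.append(page)
        trail.append(page)
    return completed

# Hypothetical topology consistent with the session 2 example.
links = {"A": {"B", "C"}, "C": {"G", "H"}, "G": {"M"}, "H": {"N"}}
session_2 = ["A", "C", "G", "M", "H", "N"]
completed = complete_path(session_2, links)
```

For session 2 this inserts G and C between M and H, giving A, C, G, M, G, C, H, N.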
  • 37. Main References:
      1. Ong, Y. C., & Ismail, Z. (2014). Enhanced Web Log Cleaning Algorithm for Web Intrusion Detection. In Recent Advances in Information and Communication Technology (pp. 315-324). Springer International Publishing.
      2. Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. (2011). Web Server Logs Preprocessing for Web Intrusion Detection. Computer and Information Science 4, 123–133.
      3. Patil, P., Patil, U. (2012). Preprocessing of web server log file for web mining. World Journal of Science and Technology 2, 14–18.
      4. Cooley, R., Mobasher, B., Srivastava, J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems 1.
      5. Das, R., Turkoglu, I. (2009). Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method. Expert Systems with Applications 36(3), 6635–6644.
      6. P. Yeng, Y. Zheng. (2010). Inspired Rule-Based User Identification, LNCS 6440, pp. 618-624.
      7. K.R. Suneetha, Dr. R. Krihnamoorthi. (2009). Identifying User Behavior by Analyzing Web Server Access Log File, IJCSNS, 2009.
      8. …