AJAX Crawl:
Making AJAX Applications Searchable
Cristian Duda
and many others
ICDE09
Outline
• Introduction
• Modeling AJAX
• AJAX Crawling
• Architecture of a Search Engine
• Experimental Results & Conclusions
What is AJAX?
• Asynchronous JavaScript and XML
• AJAX applications
  - Google Mail, Yahoo! Mail, Google Maps.
  - The URL does not change as the application state changes.
Why Do Search Engines Ignore AJAX Content?
• No caching/pre-crawl
  - Events cannot be cached.
• Duplicate states
  - Several events can lead to the same state.
• Very granular events
  - They lead to a large set of very similar states.
• Infinite event invocation
Now
• Introduction
• Modeling AJAX
  - Event Model
  - AJAX Page Model
  - AJAX Web Sites Model
• AJAX Crawling
• ….
Event Model
• When JavaScript is used, the application reacts to user events: click, doubleClick, mouseover, etc.
• Event structure in JavaScript: [figure omitted]
AJAX Page Model
• An AJAX application =
  - a simple page identified by a URL +
  - a series of states, events and transitions
• Page model: a view of all states in a page (e.g., all comment pages).
• In particular it is an automaton, a Transition Graph, which contains all application entities (states, events, transitions).
Model of An AJAX Web Page
Nodes: application states. An application state is represented as a DOM tree.
Edges: transitions between states. A transition is triggered by an event activated on the source element and applied to one or more target elements, whose properties change through an action.
Figure: Annotation for the Transition Graph of an AJAX Application
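The page model described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and field names (Event, Transition, PageModel), the example URL, and the element ids are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    source: str   # element on which the event is activated (hypothetical id)
    trigger: str  # e.g. "click" or "mouseover"


@dataclass(frozen=True)
class Transition:
    from_state: str
    to_state: str
    event: Event
    targets: tuple  # elements whose properties change
    action: str     # which property changes on the targets


@dataclass
class PageModel:
    url: str
    states: dict = field(default_factory=dict)       # state id -> serialized DOM
    transitions: list = field(default_factory=list)  # edges of the transition graph

    def add_state(self, state_id, dom):
        self.states[state_id] = dom

    def add_transition(self, transition):
        # Both endpoints must already be known states.
        if transition.from_state not in self.states or transition.to_state not in self.states:
            raise KeyError("unknown state in transition")
        self.transitions.append(transition)

    def successors(self, state_id):
        return [t.to_state for t in self.transitions if t.from_state == state_id]


# Two states of one page: comment page 1 and comment page 2.
model = PageModel(url="http://example.com/video")
model.add_state("s0", "<div id='comments'>page 1</div>")
model.add_state("s1", "<div id='comments'>page 2</div>")
model.add_transition(
    Transition("s0", "s1", Event("next_link", "click"), ("comments",), "innerHTML")
)
```

Here the nodes are DOM snapshots and each edge records the event, its source element, the target elements, and the action, mirroring the annotation described on the slide.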
AJAX Web Sites Model
Crawling Algorithm
• Build the model of the AJAX Web Site.
• Focus on how to build the AJAX Page Model (e.g., for YouTube, indexing all comment pages of a video).
A Basic AJAX Crawling Algorithm
First step: read the initial DOM of the document at a given URI (line 2).
Next step: run the onLoad event of the body tag in the HTML document (line 3). This step is AJAX-specific; all JavaScript-enabled browsers invoke this function first.
Crawling starts after this initial state has been constructed (line 5). The algorithm performs a breadth-first crawl, i.e., it triggers all events in the page and invokes the corresponding JavaScript functions.
Whenever the DOM changes, a new state is created (line 11) and the corresponding transition is added to the application model (line 16).
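The steps above can be sketched as a breadth-first exploration of page states. This is a minimal simulation, not the authors' listing: the callbacks on_load, events_of and fire stand in for a real browser engine, and the toy three-page comment widget is an assumption made for illustration.

```python
from collections import deque
import hashlib


def dom_id(dom):
    """Short identifier for a state, derived from its serialized DOM."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()[:8]


def crawl_page(initial_dom, on_load, events_of, fire):
    """Breadth-first exploration of the states of one AJAX page."""
    start = on_load(initial_dom)          # run onLoad to build the initial state
    states = {dom_id(start): start}
    transitions = []
    queue = deque([start])
    while queue:
        dom = queue.popleft()
        for event in events_of(dom):      # trigger every event in this state
            new_dom = fire(dom, event)
            if new_dom == dom:
                continue                  # DOM unchanged: no transition
            src, dst = dom_id(dom), dom_id(new_dom)
            if dst not in states:         # unseen DOM: create a new state
                states[dst] = new_dom
                queue.append(new_dom)
            transitions.append((src, event, dst))
    return states, transitions


# Toy page (hypothetical): three comment pages with "prev"/"next" events.
PAGES = ["comments page 1", "comments page 2", "comments page 3"]


def toy_on_load(dom):
    return PAGES[0]


def toy_events(dom):
    return ["prev", "next"]


def toy_fire(dom, event):
    i = PAGES.index(dom)
    j = min(i + 1, len(PAGES) - 1) if event == "next" else max(i - 1, 0)
    return PAGES[j]


states, transitions = crawl_page("", toy_on_load, toy_events, toy_fire)
```

Comparing whole DOM strings is the simplest possible change detector; a real crawler would need a more tolerant comparison.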
Problem of the Basic Algorithm
• The network time needed to fetch pages.
  - In AJAX crawling, multiple individual events per page lead to fetching content over the network.
• Traditional way: pre-caching the Web and crawling locally. Whether two pages are identical can be checked using the URL alone.
A Heuristic Crawling Policy for AJAX Applications
• We observe that an AJAX page has:
  - a stable structure,
  - a menu, present in all states,
  - and a dynamic part.
• The heuristic identifies identical states without fetching their content.
JavaScript Invocation Graph
• The heuristic we use is based on the runtime analysis of the JavaScript invocation graph.
Figure: Events and Functionalities in the JavaScript Invocation Graph
Nodes: JavaScript functions.
The functionality of an AJAX page is expressed through events.
Figure: Functions in the JavaScript Invocation Graph on a YouTube Page
Functions in the JavaScript code can be invoked either directly by event triggers (event invocations) or indirectly by other functions (local invocations).
The dependencies in the code are listed below: [listing omitted]
Hot Node: a function that fetches content from the server.
Hot Call: a call to a Hot Node.
A single function fetches content from the server, namely getURLXMLResponseAndFillDiv.
In AJAX, the same function can be invoked from different comment pages to fetch the same content from the server. In this approach we detect this situation and avoid invoking the same function twice.
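A minimal sketch of an invocation graph with one Hot Node may help here. Only getURLXMLResponseAndFillDiv is named in the slides; every other function name, the event labels, and the graph shape are hypothetical.

```python
# Local invocations: caller -> functions it calls (hypothetical graph).
CALLS = {
    "showNextComments": ["getURLXMLResponseAndFillDiv"],
    "showPrevComments": ["getURLXMLResponseAndFillDiv"],
    "toggleMenu": ["setStyle"],
    "getURLXMLResponseAndFillDiv": [],
    "setStyle": [],
}

# Event invocations: event -> function it triggers directly (hypothetical).
EVENTS = {
    "click:next": "showNextComments",
    "click:prev": "showPrevComments",
    "mouseover:menu": "toggleMenu",
}

# Hot Nodes: functions that fetch content from the server.
HOT_NODES = {"getURLXMLResponseAndFillDiv"}


def reaches_hot_node(func, calls, hot_nodes, seen=None):
    """True if func is a Hot Node or (transitively) calls one."""
    seen = set() if seen is None else seen
    if func in hot_nodes:
        return True
    if func in seen:
        return False  # already explored; also guards against call cycles
    seen.add(func)
    return any(reaches_hot_node(c, calls, hot_nodes, seen) for c in calls.get(func, []))


# Events whose handling results in a Hot Call, i.e. in network traffic.
hot_events = sorted(e for e, f in EVENTS.items() if reaches_hot_node(f, CALLS, HOT_NODES))
```

In this toy graph the two click events end in a Hot Call, while the mouseover event only changes local style and never touches the network.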
How to solve it?
• We solve the problem of caching in AJAX applications and of detecting duplicate states by identifying and reusing the results of server calls.
• Just as in traditional I/O analysis in databases, we aim to minimize the number of the most expensive operations, i.e., the Hot Calls: invocations which generate AJAX calls to the server.
Optimized Crawling Algorithm
• Step 1: Identifying Hot Nodes.
  - The crawler tags the Hot Nodes, i.e., the functions that directly contain AJAX calls (line 34).
• Step 2: Building the Hot Node Cache.
  - The crawler builds a table containing all Hot Node invocations, the actual parameters used in each call, and the results returned by the server (lines 34-53). This step uses the current runtime stack trace.
• Step 3: Intercepting Hot Node Calls.
  - The crawler adopts the following policy:
    1. Intercept all invocations of Hot Nodes (functions) and their actual parameters (line 34).
    2. Look up each function call in the Hot Node Cache (lines 37-39).
    3. If a match is found (same Hot Node with the same parameters), do not invoke the AJAX call; reuse the existing content instead (line 41).
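The interception policy amounts to memoizing Hot Node calls on the pair (function, actual parameters). The following is a minimal sketch of that idea only: the class HotNodeCache and the stand-in function fetch_comments are hypothetical, and the stack-trace analysis from Step 2 is not modeled.

```python
class HotNodeCache:
    """Table of Hot Node invocations: (function name, parameters) -> server result."""

    def __init__(self):
        self.table = {}
        self.server_calls = 0  # how many calls actually reached the server

    def call(self, func, *params):
        key = (func.__name__, params)
        if key in self.table:
            # Same Hot Node, same parameters: reuse the cached content.
            return self.table[key]
        self.server_calls += 1
        result = func(*params)
        self.table[key] = result
        return result


def fetch_comments(video_id, page):
    # Stand-in for a Hot Node; a real one would issue an AJAX request.
    return f"comments of {video_id}, page {page}"


cache = HotNodeCache()
first = cache.call(fetch_comments, "v1", 2)
again = cache.call(fetch_comments, "v1", 2)  # served from the cache
other = cache.call(fetch_comments, "v1", 3)  # different parameters: new server call
```

Because the cached result is known before the call is made, the crawler can also recognize a duplicate state without fetching its content.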
Simplifying Assumptions
• Snapshot Isolation
  - The application does not change during crawling.
• No Forms
  - We do not deal with AJAX parts that require the user to enter data in forms, such as Google Suggest.
• No Update Events
  - We avoid triggering update events, such as Delete buttons.
• No Image-based Retrieval
The Components
• AJAX Crawler
• Indexing
• Query Processing
• Result Aggregation
Indexing
• Starts from the model of the AJAX site and builds the physical inverted file.
• As opposed to the traditional approach, a result is a URI and a state.
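As an illustration, an inverted file whose postings are (URI, state) pairs could look like the sketch below. The function build_index, the whitespace tokenization, the example URLs and the state ids are all assumptions, not the authors' implementation.

```python
from collections import defaultdict


def build_index(models):
    """models: {uri: {state_id: text}} -> {term: sorted list of (uri, state_id)}."""
    index = defaultdict(set)
    for uri, states in models.items():
        for state_id, text in states.items():
            for term in text.lower().split():
                index[term].add((uri, state_id))
    # Sorted posting lists make later merges straightforward.
    return {term: sorted(postings) for term, postings in index.items()}


# Hypothetical crawled models: the text visible in each state of each page.
models = {
    "http://example.com/v1": {"s0": "Morcheeba live", "s1": "great singer"},
    "http://example.com/v2": {"s0": "Morcheeba singer interview"},
}
index = build_index(models)
```

A keyword lookup such as index["singer"] then returns the states that contain the term, not just the pages.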
Processing Simple Keyword Queries
• Each query returns the URI and the state(s) which contain the keywords.
• Results are ranked by score.
Processing Conjunctions
• Query: Morcheeba singer
• Conjunctions are computed as a merge between the individual posting lists of the corresponding keywords.
• Entries are compatible if their URLs are compatible and, in addition, their states are identical.
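The merge of two posting lists can be sketched as below. This assumes that "compatible" means the same URL and the same state, and that posting lists are sorted lists of (URI, state) pairs; the example postings are made up.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping entries with the same (uri, state)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical posting lists for the query "Morcheeba singer".
morcheeba = [("http://example.com/v1", "s0"), ("http://example.com/v2", "s0")]
singer = [("http://example.com/v1", "s1"), ("http://example.com/v2", "s0")]
hits = intersect(morcheeba, singer)
```

The first page contains both keywords, but in different states, so under this reading it does not match the conjunction.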
Parallelization
Crawling AJAX faces the difficulty that dynamic Web content cannot really be cached, so network connections must continuously be created.
A precrawler is used to build the traditional, link-based Web site structure. The total list of URLs of AJAX Web pages is then partitioned and supplied to a set of parallel crawlers.
Each crawler applies the crawling algorithm and builds the AJAX Model for each crawled page.
Multiple indexes are then built from the disjoint sets of AJAX Models.
Query processing is then performed by query shipping: the results are computed from each index and then merged, and the final list is returned to the client.
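The pipeline above (partition the URLs, crawl in parallel, build one index per partition, ship the query to every index and merge) can be sketched with a thread pool. Everything here is a stand-in: FAKE_CONTENT replaces real crawling, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for crawled content, keyed by URL (hypothetical data).
FAKE_CONTENT = {"u1": "ajax crawl", "u2": "ajax search", "u3": "web search", "u4": "ajax"}


def partition(urls, n):
    """Split the URL list into n disjoint partitions."""
    return [urls[i::n] for i in range(n)]


def crawl_partition(urls):
    """Stand-in crawler: builds one small index {term: [url, ...]} per partition."""
    index = {}
    for url in urls:
        for term in FAKE_CONTENT[url].split():
            index.setdefault(term, []).append(url)
    return index


def query(indexes, term):
    """Query shipping: ask every index, then merge the partial results."""
    results = []
    for idx in indexes:
        results.extend(idx.get(term, []))
    return sorted(results)


urls = sorted(FAKE_CONTENT)
with ThreadPoolExecutor(max_workers=2) as pool:
    indexes = list(pool.map(crawl_partition, partition(urls, 2)))
```

Threads fit this workload because, as the experiments note, crawling is dominated by network time and not by computation.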
Experimental Setup
• YouTube Datasets
• Algorithms:
  - Traditional Crawling
  - AJAX Non-Cached
  - AJAX Cached
  - AJAX Parallel
YouTube Statistics
Crawling Time
• Network time is predominant.
  - This underlines the importance of applying the Hot Node optimization.
Total Crawling Time & Network Time
Crawling Time
• The Hot Node heuristic is effective.
  - The Hot Node heuristic improves crawling time by a factor of 1.29 compared with the Non-Cached approach.
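To relate the 1.29 factor to a percentage, one can assume that the factor is the ratio of non-cached to cached crawling time; the slides do not state this definition explicitly, and the symbols below are introduced only for this note.

```latex
\frac{T_{\text{non-cached}}}{T_{\text{cached}}} = 1.29
\quad\Longrightarrow\quad
T_{\text{cached}} = \frac{T_{\text{non-cached}}}{1.29} \approx 0.775\, T_{\text{non-cached}}
```

Under that reading, the cached crawl takes roughly 22.5% less time than the non-cached one.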
Number of AJAX Events Resulting in Network Requests
Crawling Time
• Parallelization is effective.
  - The running time decreases by almost 25% compared with the AJAX Non-Parallel version.
Query Processing Time
• YouTube queries.
• Query processing times on YouTube.
Recall
• For each query, we compared the number of videos returned by the traditional approach alone with the total number of videos returned by the AJAX Crawl approach, which also takes comment pages into account.
Discussions
• Combine with existing search engines.
• Focus on a specific user's interaction with the server.
• Support more AJAX applications, such as forms.
• Irrelevant events: this paper focuses on the most important events (click, doubleClick, mouseover).
Questions? Comments?
