Search Team
Engineering Achievements
Agenda

   Challenges
   Why a Platform?
   Information Extraction
       Need, Impact
       Research / Evaluations
       Approach / Implementation
   Information Retrieval
       Need, Impact
       Research / Evaluations
       Approach / Implementation
Challenges
   Job Alerts
        Over 13 Million searches, 3 times a week
        Complex Matching: Multiple Filters, Boosts, Sorts
   Resdex
        130K active users daily
        470K searches daily
        Over 220 million resumes and growing.
   Job Search
         High QPS (112), 760K searches a day
        Near Real-time Indexing
                            Jobs Refreshed 92 times daily
   Product Demands
        > 99.99% uptime, Stability, Scalability → User Experience
        Varied Functional Requirements (Complexity)
                            NIRM, FN Suggestors, etc.
        Turnaround Time
                            Over 17 applications and growing
                            About a week to deploy / configure a new one
Why a Platform?
   Technical Challenges
        Code / Bug Duplication, Reusability
        Agility
                          Product Requirements Drive Platform-Wide Features
                          SOA, Integration, Business Logic Separation
                          Comprehensive Documentation
        Scalability
        Development and QA Time/Cost Reduction
   Product Challenges
        Turnaround Time
        Business Logic Implementation = Configuration
   Miscellaneous
        Maintenance Cost Reduction
        Resource Optimization/Integration (...Cloud)
        Standardized Reporting / Health Monitoring
Information Extraction
   Data/Information Acquisition
   Structure Raw Information
       Training based Models for Class Inference
                 Functional Area Detection
       Rule based Extraction
               Nested Funnels/Filter Layers
               Regular Expressions
   Feedback Loop
       Wisdom of Crowd/Collective Intelligence
                   SAP/SimCV: Capture User Response for Recommendations
       Continuous Quality Improvement
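
A minimal sketch of what a training-based model for class inference could look like, using a naive Bayes classifier with add-one smoothing to guess a functional area from free text. The training examples, labels and function names are hypothetical and are not taken from the platform described in these slides.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"words": word_counts, "labels": label_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)
        for token in text.lower().split():
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data; real functional areas and corpora would differ.
TRAINING = [
    ("java developer spring hibernate", "IT Software"),
    ("python backend engineer search lucene", "IT Software"),
    ("sales executive field targets", "Sales"),
    ("business development sales manager", "Sales"),
]
MODEL = train(TRAINING)
```

Calling `classify(MODEL, "senior java engineer")` picks the label whose training vocabulary best explains the tokens. A feedback loop could, in principle, append user-confirmed examples to the training data and retrain.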
IE: Use Cases/Impact

   Resume Parsing
                Resman (unreg apply flow)
                         Accuracy ~75%
                         Dropouts reduced ~44%
   Job Acquisition/Aggregation
                Naukri India:
                       JobMail: 23 Sites, 8.8K Jobs
                       JobPosting: 16 Competition Sites, 472K Jobs


                Naukri Gulf: JobAlerts, JobSearch
                         21 Sites, 6.5K Jobs
   Taxonomy Acquisition (Entity Extraction)
                FN Institute Names
                Contact Information
IE: Research / Evaluation

   Nutch
   Selenium
   Celerity
   UIMA*
   Heritrix
   HTTP Unit
   HTML Unit*
   Open NLP*
   Net::LWP*
IE: Architecture
Crawler Framework

   Crawler Propagation Capabilities
       JavaScript/Ajax/Event Support
                  Follow JavaScript Links
       URL Detection (Final URL Presentation)
                  URLs obtained via JavaScript Execution
                  Recursively Redirected URLs
                  Handle DOM Events
                         button, link, check-box, click, mouseOver, doubleClick.
                  URLs obtained after form-submission
Crawler Framework (contd.)

   Browser Emulation
               Built over Headless Browser
               Human-Like Propagation Strategies
               Handles Cookies
               Handles POST/GET methods
   Compliance (Obeys Robots.txt)
   Configurable Stateful/Delta Crawling
   Nested Multi-threaded Execution
                Pause/Resume/Restart Capabilities at Site/Seed URL levels
   Controllable Depth
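
As an illustration of robots.txt compliance combined with controllable depth, the sketch below runs a breadth-first crawl over an in-memory link graph and consults Python's standard `urllib.robotparser` before visiting each URL. The link graph, the `example.test` host and the `example-crawler` agent name are assumptions made for the example; the actual crawler is built over a headless browser and is not shown here.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def build_robots(lines):
    """Parse robots.txt rules supplied as a list of lines (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def crawl(seed, links, robots, max_depth, agent="example-crawler"):
    """Breadth-first crawl that skips disallowed URLs and stops at max_depth."""
    seen = {seed}
    visited = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # obey robots.txt
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site: an in-memory link graph standing in for fetched pages.
ROBOTS = build_robots(["User-agent: *", "Disallow: /private/"])
LINKS = {
    "http://example.test/": [
        "http://example.test/jobs",
        "http://example.test/private/admin",
    ],
    "http://example.test/jobs": ["http://example.test/jobs/1"],
    "http://example.test/jobs/1": ["http://example.test/jobs/1/apply"],
}
```

With `max_depth=2` the crawl visits the seed, `/jobs` and `/jobs/1`, skips `/private/admin`, and never reaches `/jobs/1/apply`.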
Crawler Framework (contd.)

   Real-time Crawler Statistics
       State Information
       MISes
   Crawl-Payload persistence strategies
       Multiple, Combinable Persistence Modules
       Multiple Output Format Support
                   Flat File, XML
                   JDBC-connectable data stores
                   Search Engine Index Formats (e.g. Lucene, Sphinx)
                   Archive Formats (bz, gz, rar, zip, ...)
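
One possible reading of "multiple, combinable persistence modules" is a set of small sinks that share a `write` method and can be stacked. The sketch below writes the same crawl record as a flat line, as XML and as a gzip-compressed copy; all class names, the record fields and the `<page>` element are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def to_flat(record):
    """Format a record as tab-separated key=value pairs."""
    return "\t".join(f"{key}={value}" for key, value in sorted(record.items()))


def to_xml(record):
    """Format a record as a single <page> element."""
    root = ET.Element("page")
    for key, value in sorted(record.items()):
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


class StreamSink:
    """Write formatted records to any text stream (file, StringIO, ...)."""

    def __init__(self, stream, formatter):
        self.stream, self.formatter = stream, formatter

    def write(self, record):
        self.stream.write(self.formatter(record) + "\n")


class GzipSink:
    """Write formatted records into a gzip stream over a binary buffer."""

    def __init__(self, buffer, formatter):
        self.gz = gzip.GzipFile(fileobj=buffer, mode="wb")
        self.formatter = formatter

    def write(self, record):
        self.gz.write((self.formatter(record) + "\n").encode("utf-8"))

    def close(self):
        self.gz.close()


class CombinedSink:
    """Fan one record out to several sinks."""

    def __init__(self, *sinks):
        self.sinks = sinks

    def write(self, record):
        for sink in self.sinks:
            sink.write(record)
```

A crawl could then be configured with `CombinedSink(StreamSink(f, to_flat), GzipSink(buf, to_xml))` or any other combination, without the crawler knowing which formats are in use.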
Extraction Strategies

   Analysis Plugins
       Entity Extraction
                   Composable Funnelling Filters
                          Sections, Subsections, ..., Entity
                   Regex-based Subpart matching
                   Corpus, NLP-based matching
                          UIMA, OpenNLP

       Machine-Learning Approaches
                   Classification / Tagging (Bayesian, SVM)
                   Clustering
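
The composable funnelling idea (narrow to a section, then match an entity inside it) can be sketched as a chain of small functions. The section headings, the sample resume and both regular expressions below are illustrative assumptions, not the rules used in production.

```python
import re

# Deliberately simple patterns; real-world email and phone rules are stricter.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d -]{8,}\d")

HEADINGS = {"summary", "contact", "education"}


def section(name, headings=HEADINGS):
    """Return a filter that keeps only the lines under the named heading."""
    def narrow(text):
        keep, inside = [], False
        for line in text.splitlines():
            title = line.strip().rstrip(":").lower()
            if title in headings:
                inside = title == name
                continue
            if inside:
                keep.append(line)
        return "\n".join(keep)
    return narrow


def extract(pattern):
    """Return a final stage that lists every match of the pattern."""
    return lambda text: pattern.findall(text)


def funnel(*stages):
    """Compose stages so each one receives the previous stage's output."""
    def run(text):
        for stage in stages:
            text = stage(text)
        return text
    return run


# Hypothetical resume text.
RESUME = """Summary:
Search engineer, see the contact section for details.
Contact:
Email: a.kumar@example.com
Phone: +91 98765 43210
Education:
B.Tech, 2008
"""

find_emails = funnel(section("contact"), extract(EMAIL))
find_phones = funnel(section("contact"), extract(PHONE))
```

Restricting the regex to the contact section is what keeps the year in the education section from ever being considered as part of a phone number.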
Information Retrieval

   Custom/Controllable Relevance/Matching
   Scalability of Search
       Large Volumes
       High Churn
       QPS
   Extraction/Acquisition Pipeline Pluggability
   Results Post Processing
IR: Research / Evaluations

   MySQL Full-Text
   FastESP
   Solr
   Sphinx
   Lucene *
   OpenNLP *
   LingPipe
   Mozilla Rhino (JavaScript) *
IR: Use Cases/Impact

   NSE on Resdex India
       Relevance
IR: Use Cases/Impact (contd.)
       Error Count the week Before: 91, week After: 1
       Availability (Before: 97.71% - 99.44%, After: 99.99%)
       Performance
                       Slow Queries ( 10 secs): < 0.2%
                       Average Search Time: 0.55 secs
                       QA Quote
                        “There is an overall decrease in the page download time for
                          Resdex Search Results page. In case the cache is cleared the
                          page download time has decreased by 34% to 35%, while the
                          page download time has drastically decreased, more than 73%,
                          when checked without clearing cache.”
   NSE on Resdex FirstNaukri
       PM Quote
        “Hardly any bugs considering the complexity of project. Search results are also
        coming @ speed of thought.”
IR: Use Cases/Impact (contd.)

   Improved Concurrency
IR: Architecture
IR: Platform Features

   High Availability, Stability, Performance
        Caching
                       Adaptive Caching of Hit Attributes
                       Caching of Expression Evaluations
                       Pre-configurable Caching Query Filters
        Distributed Search
                       Search over Sharded Indexes
                       Auto Failover
                       Auto Healing
        Search/Sort/Group Millions of results
                       Complex expressions.
        Miscellaneous
                       Status Reports, Performance Analytics
                       Suggestive Garbage Collection
                       Preload Indexes into RAM
                       Ease of Deployment
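
A rough sketch of search over sharded indexes with failover: each shard has an ordered list of replicas, the first replica that answers is used, and the per-shard top hits are merged into a global top-k. Shards are modelled as plain dictionaries and replicas as callables; these names and the data are assumptions for illustration only.

```python
import heapq
from itertools import islice


def search_shard(index, term, k):
    """Return up to k (score, doc_id) hits from one shard, best first."""
    hits = [(score, doc_id) for doc_id, (text, score) in index.items() if term in text]
    return heapq.nlargest(k, hits)


def search_with_failover(replicas, term, k):
    """Try each replica in turn; fall through to the next on failure."""
    for replica in replicas:
        try:
            return search_shard(replica(), term, k)
        except ConnectionError:
            continue
    return []  # every replica of this shard is down


def distributed_search(shards, term, k):
    """Merge the best-first hit lists of all shards and keep the top k."""
    per_shard = [search_with_failover(replicas, term, k) for replicas in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in islice(merged, k)]


def unavailable():
    """Hypothetical replica that is currently down."""
    raise ConnectionError("replica unavailable")


# Hypothetical shard contents: doc_id -> (text, precomputed score).
SHARD_A = {"r1": ("java search engineer", 0.9), "r2": ("php developer", 0.4)}
SHARD_B = {"r3": ("search analyst", 0.7), "r4": ("lucene search developer", 0.95)}
SHARDS = [[unavailable, lambda: SHARD_A], [lambda: SHARD_B]]
```

Because each shard returns its hits already sorted, `heapq.merge` can combine them lazily and only the first `k` merged hits are materialised.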
IR: Platform Features (contd.)

   Text Transformations
        Tokenization/Transformation/Tagging
       Controlled, Combinable Stemming
                      Plural, Tenses, Noun-Forms, etc. [Relevance]
                      Inversion of Stem-roots
                               Highlighting/Did You Mean/Query Expansion
       Phonetic Token Mapping/Augmentation
       Custom Word Mapping/Synonyms (iMatch)
       Linguistic Tagging
                      PoS, Entity Extraction
                      Match/Boost on Tags
       Sentence Detection
       Apply different analytics to different fields
       Context Sensitive Spelling Correction
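
To illustrate how inverting stem roots can feed highlighting or query expansion, the sketch below maps each stem back to the surface forms seen at indexing time. The suffix-stripping `stem` function is a crude stand-in for a real stemmer, and the token list is hypothetical.

```python
from collections import defaultdict


def stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverse(tokens):
    """Map each stem to the set of surface forms that produced it."""
    inverse = defaultdict(set)
    for token in tokens:
        inverse[stem(token)].add(token)
    return inverse


def expand(query, inverse):
    """For each query token, list every indexed form sharing its stem."""
    expanded = []
    for token in query.lower().split():
        expanded.append(sorted(inverse.get(stem(token), {token})))
    return expanded


# Hypothetical tokens observed while indexing.
INVERSE = build_inverse(
    ["manager", "managers", "test", "tests", "tested", "testing"]
)
```

The same inverse map can drive highlighting: a query for "tests" can mark "testing" in a result snippet because both share the stem "test".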
IR: Platform Features (contd.)

   Indexing
       Dynamic Rule Based Sharding, Distributed Search
       Multiple Data Source Type Support
       (Near-)Real Time Indexing, Search
       Generic Auxiliary Index Format
                  Fast Update/Retrieval
                  Realtime Per-User Filtering/Sorting
   Matching/Filtering
        Lucene Query Functionality
                  Phrase, Proximity, Fuzzy, Wildcard
                  FirstNaukri Suggestor Implementation
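
Rule-based sharding can be pictured as an ordered list of predicates, where the first matching rule decides the target shard. The rules, field names and shard names below are invented for the example and say nothing about how the platform actually partitions its indexes.

```python
def make_router(rules, default):
    """Build a routing function from (predicate, shard_name) pairs; first match wins."""
    def route(doc):
        for predicate, shard in rules:
            if predicate(doc):
                return shard
        return default
    return route


# Hypothetical rules: recently active resumes first, then a regional split.
RULES = [
    (lambda doc: doc.get("days_since_update", 9999) <= 30, "fresh"),
    (lambda doc: doc.get("country") == "AE", "gulf"),
]
route = make_router(RULES, default="archive")
```

Since the rules are data rather than code paths, they could be reloaded from configuration to change the sharding scheme without touching the indexer.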
IR: Platform Features (contd.)

   Result Grouping/Clustering
   Expressions
       Embedded JavaScript Support
       Aggregate Functions (superset of SQL)
                   Sort/Group/Filter during indexing, search
   Sorting
       Dynamic/Stateful Sorting, e.g. for Ad Rotation
       Quota-Based Resorting
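
One way to read quota-based resorting is: given hits already sorted by relevance, allow at most a fixed number per group (for example per employer) within the first few positions, and push the rest down without otherwise changing their order. The function below is a sketch under that assumption; the job ids and employer names are made up.

```python
def quota_resort(hits, key, quota, window):
    """Keep at most `quota` hits per group in the first `window` slots.

    `hits` must already be ordered best first; relative order is preserved
    both among promoted hits and among deferred hits.
    """
    counts = {}
    head, deferred = [], []
    for hit in hits:
        group = key(hit)
        if len(head) < window and counts.get(group, 0) < quota:
            counts[group] = counts.get(group, 0) + 1
            head.append(hit)
        else:
            deferred.append(hit)
    return head + deferred


# Hypothetical best-first hits as (job_id, employer) pairs.
HITS = [
    ("j1", "acme"),
    ("j2", "acme"),
    ("j3", "acme"),
    ("j4", "globex"),
    ("j5", "initech"),
]
```

With `quota=1` and `window=3`, the first page shows one job from each of three employers, and the remaining jobs from the dominant employer follow in their original order.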
IR: Platform Features (contd.)

   Scoring
       Fully Controlled, Customizable Relevance Scores
       More controllable/testable than Solr/Default Lucene Scoring
       Named Query Parts usable in Expressions
       Custom Scorer Variables
                 Vector Space, Query Boost, LCS, Numwords
   Configurability, API
       SQL-like client wrapper
                 Engine-App interactions look like SQL
       XML configurability
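
As a sketch of how custom scorer variables might combine in an expression, the code below assumes that "LCS" means the longest common subsequence of query and field tokens and that "Numwords" is the number of query words, then scores a field as `boost * lcs / numwords`. Both the interpretation of the variables and the formula are assumptions, not the platform's documented scoring.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score(query, field, boost=1.0):
    """Hypothetical expression: boost * lcs / numwords."""
    query_tokens = query.lower().split()
    field_tokens = field.lower().split()
    numwords = len(query_tokens)
    if numwords == 0:
        return 0.0
    lcs = lcs_length(query_tokens, field_tokens)
    return boost * lcs / numwords
```

A field that contains the query words in the same order scores higher than one that contains them shuffled, which is the kind of behaviour that is easy to unit-test when each scorer variable is exposed by name.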
Road Ahead


If you don't know where you are going,
   any road will get you there.

                        - The Cheshire Cat,
                        Alice in Wonderland.
Road Ahead

   Cloud

   Semantic Relevance (Search is Dead!)
       Contextual


   Information Extraction
       NLP
       Ontology Extraction
Thanks!

Naukri Search Team achievements, 2009-2010
