Improving search using the pipeline in FAST Search for SharePoint

Miles Kehoe
Author of: Professional Microsoft Search
Miles.kehoe@ideaeng.com
www.enterprisesearchblog.com
@miles_kehoe
mileskehoe

ideaeng.com          SurfRay.com
Agenda
• Introductions

• When FS4SP makes sense

• What is the FS4SP indexing pipeline?

• Why is it important to you?

• How do you use it?

• Wrap Up
About Me
• Founder of New Idea Engineering Inc.

• Working with enterprise search since 1989

• Co-Author Professional Microsoft Search/Wrox

• Author of several blogs:

   -   Enterprisesearchblog.com

   -   SearchComponentsOnline.com

• Search nerd
When to use FS4SP

Large datasets

 •   SP Search indexes up to 100M documents

 •   FS4SP is virtually unlimited (650M documents in tests)

 •   Rows and Columns concept

Need to fine-tune index & search

 •   Pipeline

 •   Need custom relevance profiles

 •   Need to fine-tune queries for relevance
What is the FS4SP indexing pipeline?
Standard sequence of ‘stages’ from crawl to index
  •   Format conversion & language detection
  •   Lemmatization / Stemming
  •   Entity extraction
  •   Map crawled properties to managed properties
Unique to FAST: the ability to insert custom processing
  •   ‘Must’ be just before mapper
  •   C# supported; but any code using STDIN/STDOUT ok
  •   Time critical!
A great way to fix up messy data!
Pipeline Architecture
[Figure: Index flow. Data Sources → Crawler → Content Processor → Indexer → Query Processor ← User Queries. FS4SP pipeline stages: Format Conversion → Language Detection → Lemmatization → Entity Extraction → … → Custom Extensibility → Mapper.]
Why is the pipeline important to you?
Sometimes content IS messy:
 • URLs with abbreviations
 • Additional metadata is in external sources
 • Geo-tag documents

Diagnose problems in the indexing process:
 • Identify bad or missing metadata
Examples where the pipeline can save you
Cryptic URLs
     •   With URLs like www.myco.com/mkt/prodmgmt/products.aspx
     •   I can add specific metadata to the document:
           ‘marketing’ (because of ‘mkt’) & ‘product management’ (because of ‘prodmgmt’)
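
The cryptic-URL case can be sketched in a few lines. This is a minimal illustration in Python, not the speaker's code: the abbreviation table and the function name `tags_from_url` are hypothetical, and a real table would come from your own site structure.

```python
from urllib.parse import urlparse

# Hypothetical abbreviation table (assumption); build yours from your site layout.
ABBREVIATIONS = {
    "mkt": "marketing",
    "prodmgmt": "product management",
}

def tags_from_url(url: str) -> list:
    """Return a tag for every path segment that is a known abbreviation."""
    # Without a scheme, urlparse treats the host as part of the path, so add one.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.lower().split("/")
    return [ABBREVIATIONS[s] for s in segments if s in ABBREVIATIONS]

print(tags_from_url("www.myco.com/mkt/prodmgmt/products.aspx"))
# ['marketing', 'product management']
```

Keeping the mapping in a plain table means new abbreviations can be added without touching the stage logic.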

Adding valuable metadata:
•   When I find a user name in a document I can look up and return phone number and email
•   When I find a city name I can geo-tag with latitude and longitude
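
Geo-tagging by city name might look like the sketch below. The gazetteer is a tiny hypothetical stand-in with approximate coordinates; a real stage would presumably query a proper geo database or service, and would need to handle multi-word and ambiguous city names, which this sketch does not.

```python
import re

# Hypothetical two-entry gazetteer with approximate coordinates (assumption).
GAZETTEER = {
    "oslo": (59.91, 10.75),
    "seattle": (47.61, -122.33),
}

def geo_tags(text: str) -> dict:
    """Map each known single-word city mentioned in the text to (latitude, longitude)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {city: coords for city, coords in GAZETTEER.items() if city in words}

print(geo_tags("The Oslo office opens in May."))
# {'oslo': (59.91, 10.75)}
```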

Debugging the indexing process
•   When things are not as they seem I can diagnose problems in the indexing process
How do you use the pipeline?
Pipeline configuration files in FASTSearch\etc
    • PipelineConfig.xml
    • PipelineExtensibility.xml
For each Document Processor node:
    • Create an entry for a new ‘processor’
    • Add your new processor name to the <pipelines> node
    • Restart the ‘FAST processor server’ from CMD: psctrl reset
    • Submit a single known test document
    • Check your results
Config Files
Adding a Processor Stage
On each FAST document processor node:
• Edit %FASTSEARCH%\etc\pipelineconfig.xml
    <processor name="Spy1" type="general" hidden="0">
        <load module="processors.Spy" class="Spy"/>
        <config>
            <param name="SpyDumpFile" value="var/log/spy.txt" type="str"/>
            <param name="FileStringCutOffLen" value="32768" type="int"/>
        </config>
        <inputs>
        </inputs>
    </processor>
• In the ‘Document Conversion’ section, add the new pipeline stage to run (in the Office 14 pipeline)
    <processor name="Spy1"/>
• Reset (each) document processor node:
    psctrl reset
FS4SP Pipeline Extensibility
How do you create a custom stage?
Edit file %FASTSEARCH%\etc\pipelineconfig.xml as above
Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml

<PipelineExtensibility>
    <Run command="YourCode.EXE %(input)s %(output)s">
        <Input>
            <CrawledProperty propertyName="author" propertySet="GUID" varType="31"/>
        </Input>
        <Output>
            <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
            <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
        </Output>
    </Run>
</PipelineExtensibility>
Restart content servers from the command-line prompt:
    psctrl reset
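
For illustration, here is a rough sketch of what the program behind `YourCode.EXE` might do, written in Python for brevity (the deck notes that only .NET is officially supported). It assumes the two command-line arguments are paths to an input and an output XML file, and that both files hold `<CrawledProperty>` elements under a `<Document>` root. That file layout, the `PHONE_BOOK` table, and the sample phone number are assumptions made for this sketch; check the Microsoft documentation for the actual contract.

```python
import sys
import xml.etree.ElementTree as ET

# Hypothetical lookup table standing in for a directory or database query.
PHONE_BOOK = {"Jane Doe": "555-0100"}

def process(input_path: str, output_path: str) -> None:
    """Read the 'author' crawled property and emit a 'phone' property if known."""
    doc = ET.parse(input_path).getroot()
    out = ET.Element("Document")  # assumed root element name
    for prop in doc.iter("CrawledProperty"):
        if prop.get("propertyName") != "author":
            continue
        phone = PHONE_BOOK.get((prop.text or "").strip())
        if phone:
            new = ET.SubElement(out, "CrawledProperty", {
                "propertySet": prop.get("propertySet", ""),
                "varType": "31",
                "propertyName": "phone",
            })
            new.text = phone
    ET.ElementTree(out).write(output_path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # Run as a stage only when both file paths are supplied.
    if len(sys.argv) == 3:
        process(sys.argv[1], sys.argv[2])
```

The stage only touches the two paths it is handed, which keeps it within the sandbox's file I/O restrictions.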
Pipeline is performance-critical
Pipeline runs in a ‘sandbox’ environment
 •   NOT the same type of ‘sandbox’ as in O365
 •   File I/O only allowed in C:\Users\<fast service user>\AppData\LocalLow
 •   Maximum of 10 seconds to live
 •   Permissions restricted regardless of FAST Service user permissions
 •   Each Document Processor (DP) is an individual instance
 •   Only one item passes through a DP at a time
 •   If each document takes 1 second, then 10 DPs can process at best 10 docs/sec
 •   Consider 1 sec for each of 100K docs: ~3 hours even across 10 DPs (~28 hours on a single DP)!
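
The arithmetic behind these figures is easy to check. The helper below is a back-of-the-envelope sketch (the function name is made up here) that assumes perfect load balancing and no overhead beyond the custom stage itself. It suggests the roughly 3-hour figure corresponds to 10 DPs working in parallel.

```python
def batch_hours(docs: int, seconds_per_doc: float, processors: int) -> float:
    """Best-case wall-clock hours when each DP handles one item at a time."""
    return docs * seconds_per_doc / processors / 3600

print(round(batch_hours(100_000, 1.0, 10), 1))  # 2.8
print(round(batch_hours(100_000, 1.0, 1), 1))   # 27.8
```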
Pipeline Hints
MS only supports:
 • Single custom stage (in PipelineConfig.xml)
 • .NET languages (C#, etc)
But:
 • A custom stage can appear in multiple places in PipelineConfig.xml, even w/ different parameters
 • Theoretically any executable that handles STDIN/STDOUT will do
 • VC#/VC++/VBScript/CMD files seem to work
 • Web service calls are supported
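
Because a stage has at most 10 seconds to live, any web service call is best given a timeout well inside that limit. The sketch below shows one way to do that in Python. The 2-second safety margin and the `Budget` and `fetch` names are assumptions for illustration, not part of FS4SP.

```python
import time
import urllib.request
from typing import Optional

SANDBOX_LIMIT = 10.0  # seconds, the limit quoted in the deck
SAFETY_MARGIN = 2.0   # hypothetical headroom for startup and writing output

class Budget:
    """Tracks how much of the stage's time allowance is left."""
    def __init__(self, limit: float = SANDBOX_LIMIT - SAFETY_MARGIN):
        self.deadline = time.monotonic() + limit

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

def fetch(url: str, budget: Budget) -> Optional[bytes]:
    """Call the service, returning None rather than overrunning the budget."""
    left = budget.remaining()
    if left <= 0:
        return None
    try:
        # The timeout applies to each blocking socket operation,
        # so this is a rough cap rather than a hard total limit.
        with urllib.request.urlopen(url, timeout=left) as resp:
            return resp.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
```

Returning `None` on failure lets the stage write its output without the extra metadata instead of being killed mid-document.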
Using web services in Sandbox
[Figure: diagram with elements labeled Web Service, Stage (×4), XML (×2), and XML Config.]
Ontolica FAST Management
Ontolica FAST Management provides clear, easy-to-use configuration directly from
within the SharePoint admin GUI. Easy management consoles replace XML configuration
files, manual file deployments, and tricky PowerShell configuration.

Key Features:

•    Backup, Manage, & Deploy Configurations
•    Manage FAST Relevance Profiles
•    Upload & Manage Pipeline Extensions
•    Create & Manage JDBC Connections
•    FAST Webcrawler Configuration
•    Manage FAST Server Processes from Central Admin
Additional Resources
• This slide deck live at http://slidesha.re/sCGAaP

• SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/

• Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/

• Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/

• ESW Blog - http://www.enterprisesearchwiki.com/wp/

• TechNet/MSDN/Microsoft

• And of course: SurfRay.com (Robert Piddocke & Josh Noble)
Q/A & Contact Details
 Miles Kehoe
 Author of: Professional Microsoft Search
 Miles.kehoe@ideaeng.com
 www.enterprisesearchblog.com
 @miles_kehoe
 mileskehoe


 Robert Piddocke
 Author: Pro SharePoint 2010 Search
 rcp@surfray.com
 @rpiddocke
 R Piddocke
                                ideaeng.com   SurfRay.com

Editor's notes

  1. By default two pipelines defined – Attachments and Office14
  2. http://fs4sp.blogspot.com/2011/05/manipulating-crawled-properties-in-fast.html