FAST Search Server for SharePoint 2010 provides more customization options than Google Search Appliance for implementing metadata-driven enterprise search scenarios in a SharePoint environment. Key advantages of FAST include its ability to index content from various sources, support early authorization for security trimming, customize ranking algorithms and sorting, and leverage metadata through managed properties for improved search and refinement. Overall, FAST is more extensible than GSA and better suited for utilizing metadata to ensure 100% of relevant search results.
Concept Searching ConceptClassifier For SharePoint
Search Engine Face-Off: Keyword vs Metadata Search Costs and Effectiveness
1. Search Engine Face-Off
Keyword Search versus Metadata Search
Don Miller, VP of Business Development, (408) 828-3400, donm@conceptsearching.com
Val Orekhov, Chief Architect, (240) 450-2166 x103, val@portalsolutions.net
2. Agenda
• Introductions
• Concept Searching:
  • What is Metadata
  • Keyword vs. Metadata Search
  • Keyword vs. Metadata Costs
  • Google vs. SharePoint vs. FAST
• Portal Solutions:
  • Enterprise Search - Google vs. FAST in SharePoint 2010
  • Indexing Options
  • Approach to Security Trimming
  • Ranking Algorithms & Sorting Options
  • Metadata & Search Refinements
• Concept Searching - How Do I Apply Metadata:
  • Microsoft's approach to applying metadata
  • How to automate the Microsoft approach with conceptClassifier for SharePoint 2010
• Demo
3. Concept Searching, Inc.
• Company founded in 2002
• Product launched in 2003
• Focus on management of structured and unstructured information
• Technology
  • Automatic concept identification, content tagging, auto-classification, taxonomy management
  • Only statistical vendor that can extract conceptual metadata
• 2009 and 2010 "100 Companies that Matter in KM" (KM World Magazine)
• KMWorld "Trend Setting Product" of 2009 and 2010
• Locations: US, UK, & South Africa
• Client base: Fortune 500/1000 organizations
• Managed Partner under Microsoft global ISV Program - "go to partner" for Microsoft for auto-classification and taxonomy management
• Microsoft Enterprise Search ISV, FAST Partner
• Product Suite: conceptSearch, conceptTaxonomyManager, conceptClassifier, conceptClassifier for SharePoint, contentTypeUpdater for SharePoint
4. What is metadata
• Metadata is a means to apply structure to unstructured or structured content or information. Metadata describes what the document is about.
• Metadata makes it easier to find information.
• There are usually multiple metadata terms per item or document.
• Metadata can also be used for rights management, governance, retention code policies, sensitive information removal and, of course, improved findability.
5. What Is Keyword vs. Metadata Costing You?
Pre Migration
Problem:
• 60% of stored documents are obsolete
• 50% of documents are duplicates
• Requires resources to identify what should/not be migrated
Solution:
• Eliminate duplicate documents
• Identify privacy data exposures
• Identify and declare records that were not previously identified
• Identify high value content
• Migrating required content to a structure based on the taxonomies
Benefit:
• Reduces migration costs
• Ensures compliance and protection of content assets
Search
Problem:
• "It's not about better search"
• Less than 50% of content is correctly indexed, meta tagged or efficiently searchable
• 85% of relevant documents are never retrieved in search
Solution:
• Eliminate manual tagging & replace with automatic identification of multi-word concepts
• Provide guided navigation via the taxonomy structure (i.e. concepts)
• Go beyond dynamic clustering with conceptual clustering
Benefit:
• Taxonomy navigation is 36% - 48% faster
• Savings of 2.5 hours per user per day
Records Management
Problem:
• 67% of data loss in Records Management is due to end user error
• It costs an organization $180 per document to recreate it when it is not tagged correctly and cannot be found
Solution:
• Eliminate inconsistent end user tagging
• Automatically declare documents of record based on vocabulary and retention codes
• Automatically change the Content Type and route to the Records Management repository
Benefit:
• Savings of $4.00 - $7.04 per record by eliminating manual tagging
• Ensures compliance and reduces potential litigation exposures
Data Privacy Protection
Problem:
• Average cost per exposed record is $197 and ranges from $90 - $305 per record
• 70% of breaches are due to a mistake or malicious intent by an organization's own staff
Solution:
• Identify any type of organizationally defined privacy data
• Combines pattern matching with associated vocabulary
• Automatic Content Type updating enabling workflows and rights management
Benefit:
• Average cost runs from $225K to $35M
6. USAF Human Performance Clearinghouse
GOAL: Leverage Existing USAF, AFDW, and AFMS License Agreements to Enable IM, RM, & Privacy & Security Compliance
Requirements
• DoDD 8320 (Data Sharing in a Net-Centric DoD)
• DoDD 5015 (Records Management)
• USAF Privacy Act Program & HIPAA
• Freedom of Information Act (FOIA)
[Diagram labels: Migration; Records Management; Search; eDiscovery & FOIA; Data Privacy]
Tel: 703.246.9360 | Fax: 240.465.1182
Distribution Statement A: Approved for public release; distribution is unlimited.
311 ABG/PA No. 09-488, 16 Oct 2009
7. What Type of Search or Information Architecture Do You Need?
Keyword Search = ~66%+ of results (Recall)
• Simple
• No administration
• Good enough
Metadata Search = 100% of results (Recall)
• Guided Navigation
• Records Management
• Sensitive Information Removal
• Collaboration
• Improved Precision and Recall
• Evolution of Keyword Search
Recall (information retrieval): a statistical measure (contrasted with precision), the fraction of all relevant material that is returned by a search query.
Precision (information retrieval): the percentage of documents returned that are relevant.
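The two definitions above can be made concrete with a small calculation. This is a minimal sketch for illustration only: the document IDs, the example hit lists, and the helper names `recall` and `precision` are assumptions, not part of any product discussed in the deck.

```python
def recall(returned: set[str], relevant: set[str]) -> float:
    """Fraction of all relevant documents that the query returned."""
    if not relevant:
        return 0.0
    return len(returned & relevant) / len(relevant)


def precision(returned: set[str], relevant: set[str]) -> float:
    """Fraction of returned documents that are relevant."""
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)


# Hypothetical corpus: six documents are truly relevant to the topic.
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6"}

# A hypothetical keyword query finds four of them plus two unrelated hits.
keyword_hits = {"d1", "d2", "d3", "d4", "x1", "x2"}

keyword_recall = recall(keyword_hits, relevant_docs)        # 4 of 6 relevant found
keyword_precision = precision(keyword_hits, relevant_docs)  # 4 of 6 hits relevant
```

In this made-up case the query misses two relevant documents, which is the kind of gap the deck attributes to keyword-only search.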
8. Metadata Search vs. Keyword and Guided Navigation
[Chart: results (also known as "Recall") plotted against cost (time, money and complexity)]
• Entity Extraction: 20-33% of results
• Keyword Search: 33%
• Key + Synonym Search: 66%
• Metadata Search: 100% of results
Example terms shown on the chart: "Proposal", "Proposals", "Contract", "Software License", "SLA", "Licensee", "Addendum", "License Agreement", "License", "Documents of Record"
Entity extraction without complex rules is ineffective. It is just keyword match, which is what keyword search is, which is 33% effective.
9. Similar Features Against Total Number of Documents Returned
Feature | Google | SharePoint | FAST
Index | 500 M+ | 100 M | 500 M+
Keyword - 33% of results | Yes | Yes - Good as Google or FAST | Yes
Synonyms - Up to 50-66%+ of results for topic | Yes | Yes | Yes
Ranking Algorithm + Best Bets: does not improve number of results, only how presented | Somewhat Tunable | Somewhat Tunable | Very Tunable
10. What Is Missing To Get to 100% of Relevant Results in Every Search?
Metadata | Google | SharePoint | FAST
Auto Classification | No - Missing 33-50% of results on any particular topic | No - Missing 33-50% of results on any particular topic | Entity extraction, which is the same as keyword search 33% results. Provides some refinement capabilities.
Taxonomy Management | No | Yes, but not used for auto classification this release. | Same as SharePoint
11. Miscellaneous Items to Review
Item | Google | SharePoint | FAST
SharePoint Refiners and Navigators with counts | Hard | Yes - Easy to use for standard search. No counts on results. | Medium - Initial release, does not leverage Term Store yet. XML - PowerShell based
Customization | Limited | Limited | Extendable
12. Summary
• Google - Best for no administration, install and walk away. However, the keyword approach usually misses 33%-50% of results on any given topic because of missing metadata. Not easy to integrate refiners or navigators into the SharePoint UI.
• SharePoint Search - Cost effective, comes free with SharePoint. Also very easy to install. Search algorithm is as good as FAST or Google. Limited extensibility. Easy integration for refiners and navigators (no counts). However, the keyword approach still misses 50% of results on any topic.
• FAST - Extremely customizable, but requires training or professional services to customize. Most likely Microsoft's long-term platform for search. Very scalable and can provide refiner counts. However, the keyword approach still misses 33-50% of results from any given search because of metadata inconsistency.
• However, they are all missing a true metadata strategy, which is the only way to ensure 100% of results (Recall).
14. Google Search Appliance 6.8
vs.
FAST Search Server for SharePoint 2010
For metadata-driven search scenarios in a SharePoint environment
Val Orekhov, Chief Architect
Portal Solutions
Email: val@portalsolutions.net
Phone: (240) 450-2166 x 103
www.portalsolutions.net
15. Agenda
• Enterprise Search Technologies
• Google Search Appliance 6.8 and FAST for SharePoint
• Content Indexing Options
• Approach to Security Trimming
• Ranking Algorithms and Searching Options
• Index Schema Management, Metadata & Search Refinements
• Conclusions
• Q&A
16. Enterprise Search Technologies
• Heterogeneous content sources:
  • HTML, Documents and LOB records
  • Located on Portals, File Systems and in Databases
• Required Security Trimming:
  • Integrate with Identity Providers (AD, LDAP, SQL)
  • Implement authorization decision logic
• Able to take advantage of metadata stored with documents and LOBs
17. Introducing the Contenders
Google Search Appliance (GSA)
• Search Appliance, Google.com in a box
• Hardware & Software Solution
• Pre-packaged functionality ready to work
• "Black box" approach to search results
FAST Search Server for SharePoint 2010
• Spin-off of the earlier FAST ESP
• Software-only solution
• Allows customization of many aspects of the engine functionality, down to relevancy tuning algorithms
• Platform rather than a product
21. Security Trimming
• Answers the "Who Am I" and "What Results Can I See" questions
• Required with most Enterprise Search scenarios
• Approaches include Late & Early Authorization/Binding
Authorization Approach | Access Rights (ACLs) | Pros | Cons
Late | Checked at run time against system of record | Up-to-date presentation | Slow on larger sections of result sets
Early | Information stored in the index at item level | Fast; facilitates metadata clustering | Duplicates info; potential for outdated results
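The trade-off between the two approaches can be sketched in a few lines. This is an illustrative model only, not how GSA or FAST implement trimming: the `IndexedItem` class, the `early_trim`/`late_trim` helpers, and the sample ACL data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedItem:
    doc_id: str
    # Early binding: a copy of the ACL is stored with the item at crawl time.
    acl: frozenset[str] = field(default_factory=frozenset)


def early_trim(results, user_principals):
    """Filter using ACLs stored in the index (fast, but may be stale)."""
    return [r.doc_id for r in results if r.acl & user_principals]


def late_trim(results, user, can_read):
    """Filter by asking the system of record per result (current, but slower)."""
    return [r.doc_id for r in results if can_read(user, r.doc_id)]


# Hypothetical source-system permissions; "hr-plan" was restricted after the crawl.
source_acls = {"budget": {"finance"}, "hr-plan": {"hr"}}
user_groups = {"alice": {"finance"}}


def can_read(user, doc_id):
    """Stand-in for a run-time authorization call to the source system."""
    return bool(source_acls[doc_id] & user_groups[user])


index = [
    IndexedItem("budget", frozenset({"finance"})),
    IndexedItem("hr-plan", frozenset({"finance", "hr"})),  # stale index copy
]

early = early_trim(index, user_groups["alice"])  # uses the stale ACL copy
late = late_trim(index, "alice", can_read)       # reflects current permissions
```

Here the early-bound result still includes "hr-plan" because the index copy of its ACL is out of date, while the late-bound check excludes it at the cost of one authorization call per result.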
22. Security Trimming Options Support
Approach | GSA | FAST | SharePoint 2010
Late Authorization | "Default" option in many scenarios; via Kerberos, SAML Bridge or Connector | ? | Custom
Early Authorization | Rel. 6.0 - High level Policy ACLs configured by admins or through a remote API *; Rel. 6.8 - Item-level ACLs ** | Item-level ACLs for Windows and SharePoint security principals supported natively; allows setup of multiple user property stores and mapping of user principals | Native support for item-level ACLs with Windows and SharePoint security principals
* Best applied to enterprises with a manageable number of high level policies, or able to invest in custom ACL sync tools
** SharePoint Connector Rel. 2.6.4 sends SharePoint Site Groups with the feed but the Groups are not expanded properly by GSA
25. Result Set Ranking
• Fidelity of keyword matches (All engines)
  • Proximity
  • Frequency
  • Completeness
• Hyper Text Matching (GSA only)
  • Analyzes keyword location on a rendered page and related pages
• Hub and Spoke Algorithm (All engines)
  • Driven by linkages between web pages
  • Pages receiving or providing most links have higher rankings
  • GSA - PageRank; FAST - Document authority
• Static rank biasing, document importance
  • Document, Site, Metadata-based promotion / demotion (All engines)
  • User-tagged documents receive higher importance (FAST, SharePoint search)
• Adaptive ranking
  • User clicks in search results (FAST, SharePoint search)
• Custom Ranking
  • Build custom ranking models w/ FAST
26. Result Set Sorting
• GSA
  • Date/Time only (Document Modification Date, or a date extracted from Title, Metadata or Body of a document)
• FAST
  • Any property marked as Sortable
  • Supported data types: String, Number, Date/Time
27. Comparing FS4SP and GSA
• Indexing Options
• Approach to Security Trimming
• Ranking Algorithms & Sorting Options
• Index Schema Management, Metadata & Search Refinements
28. Index Schema Management
• GSA (All-inclusive)
  • All discovered metadata (Crawled Properties) are stored in the index by default
  • Metadata from MS Office documents stored in the index results. (GSA Feature Request ID# 1371024)
  • All string-type metadata is associated with the full-text index (FTI) by default; matches on metadata are controlled at query time (allintext:, allintitle: keyword filters)
  • Metadata in results limited to 1,500 chars per field (Rel. 6.8; prev. releases - 320 chars)
• FAST (Opt-in)
  • Crawled properties have to be associated with Managed Properties (MPs) to be stored in the index
  • MPs represent a level of abstraction from Content Sources
  • MPs can be configured to be used as:
    • Stored in the index (Queryable)
    • Associated with FTI (Searchable)
    • Sortable
    • Refiner-enabled
29. Search Refinement with Metadata
Approach | Completeness | Pros | Cons
Run-time clustering / Shallow refiners | Smaller sample of much larger set; top 50-100 query results. | Smaller index size | Degraded performance w/ larger samples; no cluster counts
Index-based clustering / Deep refiners | Entire result set stored in the index. | Fast; allows for precise cluster counts | Increases index size
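The completeness difference between the two approaches can be shown with a toy result set. This sketch models only which documents each approach looks at; it does not model index storage or either engine's implementation. The function names, the `doc_type` property, and the data are hypothetical.

```python
from collections import Counter


def shallow_refiners(results, prop, sample_size=50):
    """Distinct values seen in the top-N results only; no reliable counts."""
    return sorted({doc[prop] for doc in results[:sample_size]})


def deep_refiners(results, prop):
    """Exact counts computed over the entire result set."""
    return Counter(doc[prop] for doc in results)


# Hypothetical ranked result set: 60 contracts, then 40 proposals, then 5 SLAs.
results = (
    [{"doc_type": "Contract"}] * 60
    + [{"doc_type": "Proposal"}] * 40
    + [{"doc_type": "SLA"}] * 5
)

shallow = shallow_refiners(results, "doc_type", sample_size=50)
deep = deep_refiners(results, "doc_type")
```

With a sample of 50, the shallow approach only ever sees "Contract", so "Proposal" and "SLA" would not be offered as refinements, whereas the deep approach reports all three values with exact counts.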
30. Search Refinement with Metadata
Approach | GSA | FAST | SharePoint 2010
Run-time clustering / Shallow refiners | The only option prior to Rel. 6.8 (Custom) | OTB | OTB
Index-based clustering / Deep refiners | "Preview" status in Rel. 6.8 (OTB) | OTB for MPs marked as Refinable; Inverted Index and Metadata Property Store combined into a high performance OLAP cube | Not available
31. Conclusions*
FAST
• SharePoint intranet as a hub + document libraries, LOBs
• Search results served from the SharePoint portal
• Active Directory-tied systems w/ content security policies applied broadly
• Fine level of control over index schema and document processing
• Custom search results ranking / relevancy models
• High complexity metadata-based search scenarios
• Full & Mini Search-driven applications
GSA
• Heterogeneous content sources dominated by web pages
• Search UI served by GSA
• Predominantly keyword-driven search experience
• Custom run-time search refiners for protected content; OTB "Dynamic Navigation" for LOB / public data
• Result biasing via URL patterns, metadata values
• Medium complexity metadata-based search scenarios
* Usage scenarios best aligned with OTB functionality, minimum possible customizations.
34. In Summary: Enterprise Search Comparison for SharePoint vs. Google vs. FAST
Why Enterprise Search needs Metadata and Taxonomy Management
• Recall - Ensures you bring back 100% of Results
• Enhances Precision - Fastest way to filter to the right results so that you are looking at the documents that matter the most
• Boosts the relevancy of documents
• Drives Records Management, Sensitive Information Removal, Retention Code Policies
MUST HAVES:
• Heterogeneous content sources:
  • HTML, Documents and LOB records
  • Located on Portals, File Systems and in Databases
• Required Security Trimming:
  • Integrate with Identity Providers (AD, LDAP, SQL)
  • Implement authorization decision logic
• Able to take advantage of metadata stored in documents and LOBs
36. Microsoft's approach to solving the metadata problem for Records Management, Governance Policies, Sensitive Information Removal and Findability:
Content Types, The Term Store and Enterprise Managed Metadata Services
37. What is a content type
• A Content Type is a means to apply structure to unstructured or structured content within SharePoint. Content Types inherit from their parent content types.
• This is usually a combination of a term or terms from a single or multiple term sets.
• Terms are metadata, and metadata is information about information.
• Terms can also include governance and retention code policies, and can also be for the sole purpose of improved findability.
• However, it is best to align Content Types with business goals and business use cases.
38. Introducing EMM, The Term Store and Term Store Management Definitions
[Diagram labels: SharePoint 2010 Farm; SharePoint 2010 Enterprise Managed Metadata Service; conceptClassifier for SharePoint 2010; Term Store; Term Store Management; Subscription Service; Auto Classification; Content Type Hub; Content Type Updating; Site Collection; Records Library]
39. The Managed Metadata Service
Managed Metadata Service
• Manages Enterprise Content Types via the Content Type Hub
• Manages Term Store
• Term Sets (taxonomies) and terms can be shared across multiple SharePoint site collections
• Multiple managed metadata services can be created
• Enables search filtering
Enterprise Managed Metadata Service
• 30,000 Terms per Term Set (1 Taxonomy)
• 1,000 Term Sets
• Tested to 1,000,000 Preferred Terms
Two types of terms:
• Managed terms - pre-defined by an enterprise administrator and may be hierarchical. Surfaced in the "managed metadata" column type
• Managed keywords - non-hierarchical words or phrases that have been added to SharePoint 2010 items by users (folksonomy)
40. conceptClassifier for SharePoint is the only native Term Store Management tool for 2010
[Screenshot labels: Term Set > Parent Term > Child Term > Grand Child Term]
Build term sets/taxonomies here in SharePoint 2010 EMM. Plan for 30,000 values.
A content type can contain one or many taxonomies based on specific business user requirements. The values can be shown as columns or can be hidden from users for administrative or governance purposes only.
41. Traditional manual approach is subjective, cumbersome and overwhelming
End user must select values from multiple term sets. Up to 30,000 values per term set and 1,000 term sets per term store. Manual approach is impractical.
42. conceptClassifier for
SharePoint 2010
An automated solution for applying metadata and
providing term store management to enhance
SharePoint 2010 capabilities for Records
Management, Governance Policies, Rights
Management, Sensitive Information Removal
and Findability.
43. A Manual Metadata Approach Will Fail 95%+ Of The Time
Issue | Organizational Impact
Inconsistent | Less than 50% of content is correctly indexed, meta-tagged or efficiently searchable, rendering it unusable to the organization (IDC)
Subjective | Highly trained Information Specialists will agree on meta tags between 33% - 50% of the time. (C. Cleverdon)
Cumbersome - Expensive | Average cost of manually tagging one item runs from $4 - $7 per document and does not factor in the accuracy of the meta tags nor the repercussions from mis-tagged content (Hoovers)
Malicious Compliance | End users select first value in list (Perspectives on Metadata, Sarah Courier)
No perceived value for end user | What's in it for me? End user creates document, does not see value for organization nor risks associated with litigation and non-conformance to policies.
What have you seen | Metadata will continue to be a problem due to inconsistent human behavior
The answer to consistent metadata is an automated approach that can extract the meaning from content, eliminating manual metadata generation yet still providing the ability to manage knowledge assets in alignment with the unique corporate knowledge infrastructure.
44. conceptClassifier for SharePoint 2010 provides an automated metadata approach for an immediate ROI and to drive business value
• Create enterprise automated metadata framework/model
  • Average return on investment minimum of 38% and runs as high as 600% (IDC)
• Apply consistent meaningful metadata to enterprise content
  • Incorrect meta tags cost an organization $2,500 per user per year, in addition to potential costs for non-compliance (IDC)
• Guide users to relevant content with taxonomy navigation
  • Savings of $8,965 per year per user based on an $80K salary (Chen & Dumais)
  • 100% "Recall" of content, 35% faster access to content "Precision"
• Use automatic conceptual metadata generation to improve Records Management
  • Eliminate inconsistent end user tagging at $4-$7 per record (Hoovers)
  • Improve compliance processes, eliminate potential privacy exposures
[Cycle diagram: 1. Model and Validate; 2. Automate Tagging; 3. Findability; 4. Business Processes; 5. Records Management and PII; 6. Life Cycle Management]
45. conceptClassifier provides a native integration into Term Store
Native integration into Term Store | No Service Pack Updates, no custom code. conceptClassifier is a native integration.
No custom property types | Every item is synchronized with term store and is a part of managed metadata service. All search features work natively as they should. No custom search property values which require custom code updates and additional custom search controls. conceptClassifier is a native integration.
Why do we work with native term store natively | Because it is the natural place that you should store metadata if you are driving economies of scale by leveraging Microsoft stack. That is Microsoft's road map for metadata management.
Easy Upgrade | If you want to go back to a pure manual application, there is no code rewrite. conceptClassifier is a native integration. You just unplug and you are back to native.
46. Automated Multi Word Term Suggestions for Term Store
• Concept Searching's unique statistical concept identification underpins all technologies.
• Multi-word suggestion is explicitly more valuable than single-term suggestion algorithms.
Concept Searching provides Automatic Concept Term Extraction
[Diagram: the concept "Triple Heart Bypass" versus its single words: Triple (Baseball, Three), Heart (Organ, Center), Bypass (Highway, Avoid)]
• conceptClassifier will generate conceptual metadata by extracting multi-word terms, identifying "triple heart bypass" as a concept as opposed to single keywords.
• Metadata can be used by any search engine index or any application/process that uses metadata.
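The difference between single keywords and a multi-word concept can be illustrated with a toy matcher. This is not Concept Searching's statistical concept identification; it swaps in plain exact-phrase matching against a hand-written vocabulary purely to show why the phrase is less ambiguous than its words. The vocabulary, sample sentences, and function names are hypothetical.

```python
import re

# Hypothetical controlled vocabulary of multi-word concepts.
CONCEPTS = ["triple heart bypass", "heart attack"]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_match(text: str, keywords: set[str]) -> bool:
    """True if any single keyword appears anywhere in the text."""
    return bool(set(tokenize(text)) & keywords)


def concept_tags(text: str, concepts: list[str] = CONCEPTS) -> list[str]:
    """Return concepts whose words appear as a contiguous phrase."""
    tokens = tokenize(text)
    tags = []
    for concept in concepts:
        words = concept.split()
        n = len(words)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            tags.append(concept)
    return tags


medical = "The patient recovered well after a triple heart bypass."
traffic = "Take the highway bypass to avoid the heart of the city."

single_words = {"triple", "heart", "bypass"}
```

Both sentences match on the single keywords ("heart" and "bypass" appear in the traffic sentence too), but only the medical sentence is tagged with the concept "triple heart bypass".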
47. conceptClassifier for SharePoint 2010 drives immediate value for end users for
Search, Records Management and Sensitive Information Removal
conceptClassifier for SharePoint 2010
• Automatically applies Metadata
• Automatically applies Content Types
• Automatically applies Retention Code Policies
• Automatically applies Windows Rights Management Policies
• Automatic Term Boosting for FAST
• Pulls hierarchy directly from Term Store, therefore updates are immediate and accurate for guided taxonomy navigation in FAST
48. Enterprise Taxonomy Management and Auto-classification
• Multi User Distributed Branch and Term Support for Enterprise
• Native Term Store Integration for SharePoint 2010
• Accelerate building out taxonomies by 75% with automatic Term/Clue Suggestion
• Enables information architects to build, model and validate
• Automatic Term Boosting for FAST/Search Platforms
• Pragmatic Ontology Features for subject matter experts (You don't need to be a librarian)
  • Broad to Narrow
  • Preferred Term
  • Non preferred terms
  • Poly hierarchies - Not supported in Term Store
  • Relations - Not supported in Term Store
49. conceptClassifier for FAST Search
• Improves search outcomes by placing conceptual metadata in the FAST Search index to increase relevancy of search results
• Enables import of FAST Entities into the conceptClassifier taxonomy manager to fine-tune them with metadata generated from your own content and nomenclature
• Runs natively as a FAST Pipeline Stage, eliminating integration and customization issues
• Eliminates vocabulary normalization issues across global boundaries through controlled vocabularies
• Improves faceted search results as facets are based on concepts aligned with the taxonomy
• Provides taxonomy browse capabilities based on the nodes within the corporate taxonomy(s)
• Provides accurate metadata filters such as numeric range searching and wildcard alphanumeric matching
• Removes documents from search results that are confidential/sensitive through automatic Content Type updating and routing to secure server
• Automatically tags content with both vocabulary and retention codes and respects SharePoint security that could prevent access to the document once it has been declared a record
51. Traditional manual approach is subjective, cumbersome and ineffective
End user must select values from multiple term sets. Up to 30,000 values per term set and 1,000 term sets per term store. Manual approach is impractical.
52. An automated approach ensures accurate Records Management, Sensitive
Information Removal and improved Search/Findability
Metadata is automatically applied to content by ConceptClassifier via
TaxonomyManager. contentTypeUpdater can take it a step further and can modify
content type to redirect document/object to a different content type or migrate it to
another site collection or document library. In this example the documents are being
changed from document content type to PII or Records Center Content Type.
53. Term Store Management is provided by Taxonomy Manager and conceptClassifier
TaxonomyManager is an intuitive and elegant tool to manage how and when term sets are applied within SharePoint 2010 and what new terms to add to the term store.
Deep capabilities to build out rules and classification approaches, including: standard term, phonetics, metadata, class ID, language, case sensitive, regular expression and boosting.
54. An automated approach ensures accurate Records Management, Sensitive
Information Removal and improved Search/Findability
The documents with 10 in front of them have had their content types updated.
In this example the documents are being changed from document content type
to PII or Records Center Content Type. They could have also been moved to
a different folder if that was the desired outcome.
55. conceptClassifier for FAST and SharePoint 2010 Search
conceptClassifier for 2010 Product Suite provides intuitive guided navigation for FAST
Multi-value select within a term set is the single fastest approach you can provide for end users to get access to the correct content. It is just like picking values when you are on Best Buy or Amazon, but with your personalized corporate term set vocabulary.
56. Demo - How to automate the process of applying metadata in a SharePoint 2010 native term store environment to improve Findability and Records Management
58. Thank You
Don Miller, VP of Business Development, (408) 828-3400, donm@conceptsearching.com
Val Orekhov, Chief Architect, (240) 450-2166 x103, val@portalsolutions.net