ALM Search Presentation for the VSS Arch Council

ALM Search
Sunita Shrivastava
6/20/2014
Microsoft Confidential 1

ALM Search
 Start with Code Search but eventually support search for other artefacts
 Agenda
 Discuss the current architecture and concerns
 Share the investigations
 Share the learning
 Get feedback on open design issues

Indexing Engine Choices
 BING and Elastic Search
 Our requirements: Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
 Indexing: support for continuous indexing, performance, scale-out feasibility, near-real-time freshness, different schemas
 Our Evaluation shared at :
https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
 ES Observations so far in context of Code Search
 Schema-less
 Multiple artefacts can be stored in the same index
 Can deal with change in data schema of the artefact
 Main Value Add of ES over Lucene
 Is Aggregation! If you need aggregation of search across different indexes, aim to use the aggregator of ES!
 This means that it is likely that sharing a large ES cluster is the right thing to do for search aggregation across VSO artefacts
 Highly Extensible
 Code Element Search
 Move from Nested Documents to Custom Analyzer
 Highlighting
 ES allows the REST APIs to be extended/added
 We chose a custom query extension mechanism
 Azure Search Service, though layered on ES, hid these mechanisms making it intractable for code element search
 Feeding ES
 For Large Scale Indexing, being able to feed it fast enough is important, so the scalability of the pipeline is important.
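The aggregation point above (one search spanning the indexes of several artefact types on a shared cluster) can be illustrated with a minimal sketch that only builds the request, without contacting a cluster. The index names `code` and `workitems` and the field name `content` are assumptions for illustration; whether `_index` can be used as an aggregation field depends on the ES version and mapping.

```python
import json

def build_multi_index_search(indices, text, page=0, page_size=20):
    """Return (path, body) for one search request spanning several indices."""
    # ES accepts a comma-separated list of index names in the URL path.
    path = "/" + ",".join(indices) + "/_search"
    body = {
        "from": page * page_size,   # pagination offset
        "size": page_size,
        "query": {"match": {"content": text}},  # "content" is a hypothetical field
        # Bucket hits by source index to get per-artefact counts (facets).
        "aggs": {"by_artefact": {"terms": {"field": "_index"}}},
    }
    return path, body

path, body = build_multi_index_search(["code", "workitems"], "IndexWriter")
print(path)
print(json.dumps(body, indent=2))
```

The body would be sent to the returned path on the cluster's REST endpoint; the response would then carry both the hits and one bucket per index.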

High level Architecture
[Diagram. Datacenter A: VSO Service; Search Service with a Front End and a Backend; REST API; Web UX; Query Pipeline; Crawl/Parse/Feed Pipeline; Index Servers; Mapper Data. Datacenter B: the same components, without a Web UX label.]

Planned Service Architecture
 XSS Scripting and circular dependency problems during build force the Search Client
 This is going to be more and more common as more standalone services come into existence
 Thanks to Patrick/Phecda for this picture

Deployment for MsEng
 Thanks to Sharad Agarwal for this Slide!
 Elastic Search cluster (Indexer)
 3 (Master + Query) Nodes (A2)
 3 Data Nodes (A5)
 Probably 1 Marvel node (A2) – Need data from AppInsight’s team
 Search Service (CPF + Query)
 3 Job Agent Nodes (A2)
 3 App Tier Nodes (A2)
 Config DB (SQL Azure)
 1 Azure Storage account
 Portal UX in TFS
 The Search Service, ES and Marvel clusters are all within a VNET
 Search Service talks to the ES query/ingestion nodes through an ILB
 This helped take care of DNS issues

Logging, Diagnostics and Monitoring
 Logging
 All our code will be instrumented, including the code inside ES. Developers can get these logs.
 Diagnostics
 Each team provides diagnostics data: higher-level data that gives insight into the usage and activity of its component.
 Query Pipeline Telemetry
 Total Number of Queries
 Successful Queries
 Failed Queries
 Slow Queries
 Portal Telemetry
 Total Number of Queries
 Queries that don’t result in a click on the facet or result page in the top 20 results
 Queries that result in a click beyond 20 results
 Search Usage per account
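The query pipeline counters listed above could be kept by something like the following sketch. The class name, the `record` method and the slow-query threshold are assumptions for illustration; the slides do not define what counts as a slow query.

```python
from dataclasses import dataclass

@dataclass
class QueryTelemetry:
    """Counters for total, successful, failed and slow queries."""
    slow_threshold_ms: float = 500.0  # assumed cut-off, not from the slides
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, elapsed_ms, ok):
        """Record one query with its latency and outcome."""
        self.total += 1
        if ok:
            self.successful += 1
        else:
            self.failed += 1
        # A query can be both slow and successful, so this is counted separately.
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1

telemetry = QueryTelemetry(slow_threshold_ms=100)
telemetry.record(20, ok=True)
telemetry.record(250, ok=True)
telemetry.record(30, ok=False)
print(telemetry)
```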

Diagnostics, Monitoring (cont)
 Indexing Telemetry
 Storage Used for Temporary Data(Blobs)
 Storage Used for Entity State Data(Tables?)
 Storage Used for Meta Data
 Storage used for Provisioning Data
 Amount of Data (Mbytes) indexed in the last one hour
 Number of commits handled in the last one hour
 Number of pending tasks
 Number of pending pipelines
 Number of pending commits
 Cold Start Summary
 Monitoring
 Use Marvel

Query Pipeline
 Quick, Low Overhead
 Query Builder uses the Arriba Parser, NEST is used to talk to Elastic Search
 Mapper(Not Required): The scope of search defines a unique ES Index Alias which refers to the appropriate indices
 Security Trimming (Cause of Concern) : For TFVC at file level, For WorkItems at Area Level
[Diagram: Query Pipeline. Components: REST Endpoint; Repo Access/Auth Mgmt; Query Builder (format checking, query parsing); Security Trimmer (only for TFVC); Aggregator; Mapper; FileHash Mapper (for dedup); Highlighter AddIn; Indexer (Elastic Search).]
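A minimal sketch of the mapper and security-trimming steps of the query pipeline, under stated assumptions: the alias naming scheme, the shape of a hit (a dict with a `path` key) and the `can_read` callback are all hypothetical and are not taken from the actual service.

```python
def alias_for_scope(account, project=None):
    """Hypothetical naming scheme: one ES index alias per search scope."""
    name = f"{account}.{project}" if project else account
    return name.lower()  # ES index and alias names are lowercase

def trim_results(hits, can_read):
    """Drop hits the caller may not read (file-level trimming, as for TFVC)."""
    return [hit for hit in hits if can_read(hit["path"])]

hits = [
    {"path": "$/Proj/src/a.cs"},
    {"path": "$/Proj/secret/b.cs"},
]
visible = trim_results(hits, can_read=lambda path: "/secret/" not in path)
print(alias_for_scope("MsEng", "VSO"), [hit["path"] for hit in visible])
```

Trimming after the query returns means a page can come back with fewer hits than requested, which may be one reason trimming is flagged as a concern.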

Query Pipeline Component Diagram
 Thanks to Bittu and Neeraj for this diagram!
[Diagram: Query Pipeline components. Labels: Search UI; REST API for Query Interaction; Search String & Filters; Query Builder; Query String Parser; Search Query; Backend; Query Executor; ES Client; Custom Query; Custom Analyzer; Custom Highlighter; ES Cluster; ES Search Results; Search Response; Query Monitor; OI; Repo Access Management/Authentication; TFS GIT Repo.]

Security
 Three Options
 Use Remote Security Name Spaces for caching artefact permissions
 GIT +
 WIT -
 Index level permissions
 Mostly Open Model

Indexing Pipeline
 Currently built on the VSSF Framework
 Backend REST APIs: seminal objects are Tasks, Pipelines, Entities
 E.g. of a Task creation: a request to perform an indexing pipeline related operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
 Tasks create pipelines; a task is completed when all the pipelines it spawned are finished
[Diagram: Indexing Pipeline. Labels: BE REST Endpoint; Meta Data Analysis; Cold Start; Index Prep; Index Provisioning; Crawl; Parse; Feed; Indexer (Elastic Search); Ready Index for Query; Mapper Update; Crawl Parse Feed; Cold Start Cleanup; Dedup Detection (opt).]
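The relationship between Tasks, Pipelines and Entities described above can be sketched as a small data model. The class and field names are assumptions for illustration, as is the choice that a task with no pipelines yet is not considered completed.

```python
from dataclasses import dataclass, field
from enum import Enum

class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"

@dataclass
class Pipeline:
    entity: str                      # the Entity this pipeline works on
    state: State = State.PENDING

@dataclass
class Task:
    operation: str                   # e.g. "reindex", "start", "stop"
    entity: str
    pipelines: list = field(default_factory=list)

    def spawn(self):
        """Create one pipeline for this task's entity."""
        pipeline = Pipeline(self.entity)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self):
        # Completed only when every spawned pipeline has finished.
        return bool(self.pipelines) and all(
            p.state is State.FINISHED for p in self.pipelines
        )

task = Task("reindex", "repo:VSO")
first, second = task.spawn(), task.spawn()
first.state = State.FINISHED
print(task.completed)   # second pipeline is still pending
second.state = State.FINISHED
print(task.completed)
```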

Indexing Pipeline Component Design
[Diagram: Indexing Pipeline components. Labels: TFS Commit Sync; TFS Account Sync; Re-indexer; Crawler Abstraction Layer; Crawler Extensions; Parser; Parser Extensions; Feeder; ES Wrapper; CPF Arbitrator; ES Map and Topology Configurator; Index Monitor; OI; Job Scheduler; ES Extensions (Custom Analyzer/Plugins); De-dup; Multi-tenancy; Logger/Telemetry; Repo Content DB Abstraction Layer; Parser DB Abstraction Layer; ES Cluster; Data.]
 Thanks to Tapas for this diagram!

Indexing Pipeline (cont)
 Cold Start
 Crawl Spec :
 For GIT, the ‘default branch’ is enabled for Indexing by default
 Others will need to get whitelisted explicitly
 TFS Repo has many topic and feature branches
 Need Closure on UX experience on this
 For TFVC : TBD
 For Work Items : TBD
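The Git crawl spec above (default branch indexed by default, other branches only when whitelisted) amounts to a simple filter. The function name and the branch names below are hypothetical.

```python
def branches_to_index(all_branches, default_branch, whitelist=()):
    """Return the branches enabled for indexing, in their original order."""
    allowed = {default_branch, *whitelist}
    return [branch for branch in all_branches if branch in allowed]

branches = ["master", "feature/search", "topic/perf"]
print(branches_to_index(branches, "master"))
print(branches_to_index(branches, "master", whitelist=["topic/perf"]))
```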

Performance Summary
 For up to 5 million files, 90% of queries completed in under 60 msec
 The Feeder quickly ran into issues on A2 configurations because of low memory
 By storing only term vectors rather than the file content, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations.
 Following in Progress
 Multiple Smaller Indexes on the same node
 Queries during Continuous Indexing
 Indexing Performance with Multiple Replicas
 Multi-Index Search
 Detailed analysis available at
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default
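A statement such as "90% of queries under 60 msec" is a 90th-percentile check. A minimal sketch using the nearest-rank method follows; the latency samples are made up for illustration and are not measurements from the analysis.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct percent of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]

latencies_ms = [12, 18, 25, 31, 40, 44, 52, 58, 59, 140]  # illustrative only
p90 = percentile_nearest_rank(latencies_ms, 90)
print(p90, p90 < 60)
```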

UX
 Requirements :
 Search UI needs to be uncluttered and simple
 Users should not lose the context of what they were doing
 Experience should be largely similar for searching different artefacts
 Sharepoint has a precedent for multi-artefact search
 Search launches a different page
 Seems like a reasonable model to follow

Indexing Pipeline (Cont)
 Crawler Strategy: the current plan is to use LibGit2Sharp
 Following methods were compared
 Crawl File by File with GitHttpClient(current implementation)
 Download Zipped trees using GitHttpClient
 Clone a Repo using Git Command line
 LibGit2Sharp
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
 Implications
 Entire Git repo is brought down to Azure storage(Blob Store)
 To Dedup or not to Dedup
 TFS repo on mseng has ~35 feature branches, ~300 scope branches
 Results (10 million files, 0.4 million unique files, duplication ratio 1:25, as seen in Windows SD depots/branches)
 No Deduplication : 60GB index size, 19 hours indexing time, 9ms avg query time
 Single Document : ~3GB, 50 minutes, 3.7 ms
 Parent Child Mappings : 11 GB, ~1 hour 50 Min, 122 ms
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
 Backend APIs : For diagnostics/dealing with corruption etc. Seminal objects are Tasks, Pipelines, Entities
 E.g. of a Task Creation : Request to perform an indexing pipeline related operation X(e.g. reindex,start,stop) on some ‘Entity’ results in creation of a task
 Tasks create pipelines, a task is completed when all pipelines spawned are finished
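The deduplication results above rest on detecting identical file content across branches: 10 million files over 0.4 million unique files gives the 1:25 ratio. A minimal sketch of content-hash deduplication follows. It assumes that the single-document option means one indexed document per unique content carrying all of its paths, which is an interpretation and not confirmed by the slides; the choice of SHA-1 is also an assumption.

```python
import hashlib
from collections import defaultdict

def dedup_by_content(files):
    """Group (path, content_bytes) pairs by content hash: {hash: [paths]}."""
    by_hash = defaultdict(list)
    for path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        by_hash[digest].append(path)
    return dict(by_hash)

files = [
    ("main/a.cs", b"class A {}"),
    ("feature1/a.cs", b"class A {}"),   # same content on another branch
    ("feature1/b.cs", b"class B {}"),
]
documents = dedup_by_content(files)
print(len(files), "files ->", len(documents), "documents")
```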

Indexing Pipeline Scaleout
 We want to host indexes for different artefacts on the same ES cluster
 This will enable search aggregation through ES
 This opens up several interesting scenarios in future
 Scale-out and isolation for different pipelines based on Job Infra is not possible
 To leverage efforts across teams
 Implies that the Crawl/Parse/Feed pipeline should be generalized
 Potentially we might want to think of extensibilities at the query pipeline as well

ALM Search Deployment Topology
[Diagram: deployment topology. Labels: Load Balancer; ALM Search Service (AT, Job); Internal Load Balancer; Private Network; ES cluster with Shared Master Nodes, TR Query/Indexing Nodes, TR Data Nodes, Search Query/Indexing Nodes and Search Data Nodes.]

Cross Account Search and Public Repositories
 There is a desire to include all public repositories in Search, either by default or as an option to the user
 How will VSO support the notion of a public repository?
 Will there be public accounts ?

On Premise and Cloud Search Federation
 SharePoint and Office 365 set a precedent, supporting 3 models based on OAuth
 http://msdn.microsoft.com/en-us/library/dn155905.aspx
 Three models for federation
 Outbound : Searching on the portal for the on-premise service endpoint, will return results from cloud as well
 Inbound : Searching on the portal for the cloud service, returns results from the on-premise TFS indexes as well
 Both ways : Search is symmetric
 Look out for more details in this space next time!
[Diagram: search federation. Labels: VS IDE; VSO Web UX; Code Search Service with an Aggregator; Indexers a, b and c, each paired with a Repository; Cloud.]

Futures
 Semantic Search
 OSS Search Requirements
 Extensions to Code Search for Test Cases

Appendix

Perf Testing on 1 Node for up to 4 M Files

Indexing Rate Analysis(Thanks to Perf Crew)
[Chart: Indexing Rate. Axis labels: DocsIndexed/sec (0 to 350) and Files Indexed. Configurations: A7-1N1S, A7-1N3S, A7-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A5-1N1S, A5-1N3S, A5-1N5S. File counts: 10K, 100K, 500K, 1M, 2M, 3M, 4M.]
 Setup
• A5: 6 GB allocated to JVM heap.
• A6: 12 GB allocated to JVM heap.
• A7: 20 GB allocated to JVM heap.
• Feeder: A4 machine feeding asynchronously.
 Observation
• On A5 the indexing rate remains the same regardless of the number of shards.
• On A6 and A7, using more than 1 shard improved the indexing rate. Behavior with 3 and 5 shards was the same.
• The indexing rate remained linear beyond 500K files for the rest of the indexing period.
• On A5 the maximum indexing rate is 160 docs/sec and the minimum is 107 docs/sec.
• On A6 the maximum indexing rate is 200 docs/sec and the minimum is 125 docs/sec.
• On A7 the maximum indexing rate is 302 docs/sec and the minimum is 120 docs/sec.
 Conclusion
• The indexing rate remained linear once 500K docs were indexed.
• For onboarding a new repo we can clearly predict/estimate the maximum time needed to index the repo.
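The estimate mentioned in the conclusion is a division of the file count by the lowest observed indexing rate. A minimal sketch, using the A5 minimum of 107 docs/sec from the observations; the function name is hypothetical, and treating one file as one indexed document is an assumption.

```python
def max_index_hours(file_count, min_docs_per_sec):
    """Upper bound on indexing time, assuming the rate never drops below the minimum."""
    seconds = file_count / min_docs_per_sec
    return seconds / 3600

# 4 million files on A5 at the minimum observed rate of 107 docs/sec
print(round(max_index_hours(4_000_000, 107), 1), "hours")
```

With these inputs the bound is about 10.4 hours; at the A5 maximum of 160 docs/sec the same repo would take about 6.9 hours.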
