2. ALM Search
Start with Code Search but eventually support search for other artefacts
Agenda
Discuss the current architecture and concerns
Share the investigations
Share the learning
Get feedback on open design issues
Microsoft Confidential 2
3. Indexing Engine Choices
BING and Elastic Search
Our requirements : Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
Indexing : Support for Continuous Indexing, Performance, Scale-out Feasibility, Real-time Freshness, Different Schemas
Our Evaluation shared at :
https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
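Most of the query requirements listed above map onto standard Elasticsearch query DSL constructs. The following is a minimal sketch, not the team's actual query: the helper `build_search_body` and the field names `content`, `file_name` and `extension` are assumptions made for illustration. It only builds the request body and does not contact a cluster. Code Element Search is left out because it depends on the custom analyzer discussed later.

```python
import json


def build_search_body(phrase, excluded_term, name_pattern, page=0, page_size=20):
    """Build an Elasticsearch search request body as a plain dict.

    The field names (content, file_name, extension) are hypothetical.
    """
    return {
        "query": {
            "bool": {
                # AND: every clause under "must" has to match.
                "must": [
                    {"match_phrase": {"content": phrase}},      # phrase search
                    {"wildcard": {"file_name": name_pattern}},  # wildcard search
                ],
                # NOT: documents matching these clauses are excluded.
                "must_not": [{"term": {"content": excluded_term}}],
                # OR clauses would go under "should".
            }
        },
        # Faceting: bucket the hits by file extension.
        "aggs": {"by_extension": {"terms": {"field": "extension"}}},
        # Highlighting: ask for highlighted fragments of the content field.
        "highlight": {"fields": {"content": {}}},
        # Pagination: zero-based page index.
        "from": page * page_size,
        "size": page_size,
    }


if __name__ == "__main__":
    body = build_search_body("open file", "obsolete", "*.cs", page=2)
    print(json.dumps(body, indent=2))
```

Keeping the body as a plain dict makes it easy to inspect or log before it is handed to whichever client library is in use.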
ES Observations so far in context of Code Search
Schema-less
Multiple artefacts can be stored in the same index
Can deal with change in data schema of the artefact
Main Value Add of ES over Lucene is Aggregation: if you need to aggregate search across different indexes, aim to use the aggregator of ES!
This means that it is likely that sharing a large ES cluster is the right thing to do for search aggregation across VSO artefacts
Highly Extensible
Code Element Search
Move from Nested Documents to Custom Analyzer
Highlighting
ES allows the REST APIs to be extended/added
We chose a custom query extension mechanism
Azure Search Service, though layered on ES, hid these mechanisms making it intractable for code element search
Feeding ES
For Large Scale Indexing, being able to feed it fast enough is important, so the scalability of the pipeline is important.
4. High level Architecture
[Diagram: Datacenter A and Datacenter B, each showing Search Service, Search Service Front End, Search Service Backend, REST API, Index Servers, VSO Service, Query Pipeline, Crawl/Parse/Feed Pipeline, and Mapper Data; Web UX appears under Datacenter A.]
5. Planned Service Architecture
XSS Scripting and circular dependency problems during build force the Search Client
This is going to be more and more common as more standalone services come into existence
Thanks to Patrick/Phecda for this picture
6. Deployment for MsEng
Thanks to Sharad Agarwal for this Slide!
Elastic Search cluster (Indexer)
3 (Master + Query) Nodes (A2)
3 Data Nodes (A5)
Probably 1 Marvel node (A2) – Need data from AppInsight’s team
Search Service (CPF + Query)
3 Job Agent Nodes (A2)
3 App Tier Nodes (A2)
Config DB (SQL Azure)
1 Azure Storage account
Portal UX in TFS
The Search Service, ES, and Marvel clusters are all within a VNET
Search Service talks to the ES query/ingestion nodes through an ILB
This helped take care of DNS issues
7. Logging, Diagnostics and Monitoring
Logging
All our code will be instrumented, including the code inside ES. Developers can get these logs.
Diagnostics
Each team provides diagnostics data: higher-level data that gives insight into the usage and activities happening in the context of the component.
Query Pipeline Telemetry
Total Number of Queries
Successful Queries
Failed Queries
Slow Queries
Portal Telemetry
Total Number of Queries
Queries that don’t result in a click on the facet or result page in the top 20 results
Queries that result in a click beyond 20 results
Search Usage per account
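The four Query Pipeline counters listed above could be kept with something as small as the sketch below. The class name `QueryTelemetry` and the 60 msec slow-query threshold are assumptions made for illustration; the slides do not define what counts as a slow query.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """Hypothetical in-memory counters for the query pipeline."""

    slow_threshold_ms: float = 60.0  # assumed cut-off for a "slow" query
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, elapsed_ms: float, ok: bool) -> None:
        """Record one finished query and update every counter it affects."""
        self.total += 1
        if ok:
            self.successful += 1
        else:
            self.failed += 1
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1


if __name__ == "__main__":
    telemetry = QueryTelemetry()
    telemetry.record(12.0, ok=True)
    telemetry.record(95.0, ok=True)
    telemetry.record(30.0, ok=False)
    print(telemetry)
```

A query can be both successful and slow, so the slow counter is tracked independently of the success and failure counters.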
8. Diagnostics, Monitoring (cont)
Indexing Telemetry
Storage Used for Temporary Data (Blobs)
Storage Used for Entity State Data (Tables?)
Storage Used for Meta Data
Storage Used for Provisioning Data
Amount of Data (Mbytes) indexed in the last one hour
Number of commits handled in the last one hour
Number of pending tasks
Number of pending pipelines
Number of pending commits
Cold Start Summary
Monitoring
Use Marvel
9. Query Pipeline
Quick, Low Overhead
Query Builder uses the Arriba Parser, NEST is used to talk to Elastic Search
Mapper (Not Required) : The scope of search defines a unique ES Index Alias which refers to the appropriate indices
Security Trimming (Cause of Concern) : For TFVC at file level, for Work Items at Area level
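If the scope of a search determines a unique ES index alias, the mapping can be a pure naming function rather than a lookup table, which is one reading of why the Mapper is marked as not required. The sketch below assumes a hypothetical naming scheme (`code_<account>_<project>_<repository>`) and a hypothetical helper `alias_for_scope`; the slides do not specify the real convention.

```python
import re


def alias_for_scope(account, project=None, repository=None):
    """Derive a hypothetical alias name from the search scope.

    Narrower scopes append more segments; a missing segment ends the name.
    """
    parts = []
    for part in (account, project, repository):
        if part is None:
            break
        # Lowercase and replace anything outside [a-z0-9] with a dash.
        parts.append(re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-"))
    return "code_" + "_".join(parts)


if __name__ == "__main__":
    print(alias_for_scope("MsEng"))
    print(alias_for_scope("MsEng", "VSOnline", "Tfs"))
```

With a scheme like this, the alias for an account-wide search and the alias for a single repository are both computable from the request alone.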
[Diagram: Query Pipeline components: REST Endpoint, Repo Access/Auth Mgmt, Query Builder (Format Checking, Query Parsing), Security Trimmer (only for TFVC), Aggregator, Mapper, FileHash Mapper (for Dedup), Highlighter AddIn, and the Elastic Search Indexer.]
10. Query Pipeline Component Diagram
Thanks to Bittu and Neeraj for this diagram!
[Diagram: Search UI, REST API for Query Interaction, Query Builder, Query String Parser, Search String & Filters, Search Query, Backend, Query Executor, ES Client, Custom Query, Custom Analyzer, Custom Highlighter, Query Monitor, OI, Repo Access Management/Authentication, TFS GIT Repo, ES Cluster, ES Search Results, Search Response.]
11. Security
Three Options
Use Remote Security Name Spaces for caching artefact permissions
GIT +
WIT -
Index level permissions
Mostly Open Model
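For the first option, permissions would be checked after the query returns and cached so that repeated paths are not re-checked. The sketch below only illustrates that trimming step: `trim_results`, the `path` key on each hit, and the `can_read` callback are all hypothetical, and the per-call dict stands in for whatever cache the remote security namespaces would provide.

```python
def trim_results(hits, user, can_read):
    """Return only the hits the user may read.

    can_read(user, path) is a hypothetical permission check; its answers
    are cached per path for the duration of this call.
    """
    cache = {}
    visible = []
    for hit in hits:
        path = hit["path"]
        if path not in cache:
            cache[path] = can_read(user, path)
        if cache[path]:
            visible.append(hit)
    return visible


if __name__ == "__main__":
    hits = [{"path": "$/a.cs"}, {"path": "$/secret/b.cs"}, {"path": "$/a.cs"}]
    print(trim_results(hits, "alice", lambda user, path: "secret" not in path))
```

Trimming after the query means a page can come back with fewer hits than requested, which is one reason file-level trimming is flagged as a concern.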
12. Indexing Pipeline
Currently built on the VSSF Framework
Backend REST APIs : Seminal objects are Tasks, Pipelines, Entities
E.g. of a Task Creation : A request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
Tasks create pipelines; a task is completed when all the pipelines it spawned are finished
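The relationship between Tasks, Pipelines and Entities described above can be sketched as a small data model. The class and field names here are assumptions, not the actual VSSF types, and treating a task with no pipelines as not yet completed is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class PipelineState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Pipeline:
    name: str
    state: PipelineState = PipelineState.PENDING


@dataclass
class Task:
    operation: str  # e.g. "reindex", "start", "stop"
    entity: str     # the Entity the operation applies to
    pipelines: List[Pipeline] = field(default_factory=list)

    def spawn(self, name: str) -> Pipeline:
        """Create a pipeline owned by this task."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self) -> bool:
        """True once every spawned pipeline has finished."""
        return bool(self.pipelines) and all(
            p.state is PipelineState.FINISHED for p in self.pipelines
        )


if __name__ == "__main__":
    task = Task("reindex", "repo-1")
    crawl = task.spawn("crawl")
    crawl.state = PipelineState.FINISHED
    print(task.completed)
```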
[Diagram: Indexing Pipeline stages and components shown: BE REST Endpoint; Cold Start (Meta Data Analysis, Index Prep, Index Provisioning, Crawl, Parse, Feed); Update (Crawl, Parse, Feed); Cold Start Cleanup; Dedup Detection (opt); Mapper; Elastic Search Indexer; Ready Index for Query.]
13. Indexing Pipeline Component Design
[Diagram: TFS Commit Sync, TFS Account Sync, Re-indexer; Crawler Abstraction Layer and Crawler Extensions; Parser and Parser Extensions; Feeder; ES Wrapper; CPF Arbitrator; ES Map and Topology Configurator; Index Monitor; OI; Job Scheduler; ES Extensions (Custom Analyzer/Plugins); De-dup; Multi-tenancy; Logger/Telemetry; Repo Content DB Abstraction Layer; Parser DB Abstraction Layer; ES Cluster; Data.]
Thanks to Tapas for this diagram!
14. Indexing Pipeline (cont)
Cold Start
Crawl Spec :
For GIT, the ‘default branch’ is enabled for Indexing by default
Others will need to get whitelisted explicitly
TFS Repo has many topic and feature branches
Need Closure on UX experience on this
For TFVC : TBD
For Work Items : TBD
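The Git part of the crawl spec above amounts to a simple selection rule: the default branch is always indexed and any other branch must be whitelisted. A minimal sketch, where the function name `branches_to_index` and the branch names in the example are hypothetical:

```python
def branches_to_index(all_branches, default_branch, whitelist=()):
    """Return the branches enabled for indexing, in their original order.

    The default branch is always enabled; any other branch must be
    explicitly whitelisted.
    """
    allowed = {default_branch, *whitelist}
    return [branch for branch in all_branches if branch in allowed]


if __name__ == "__main__":
    branches = ["master", "features/search", "topic/fix-1", "releases/m70"]
    print(branches_to_index(branches, "master", whitelist=["releases/m70"]))
```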
15. Performance Summary
For up to 5 Million Files, 90% of queries remained under 60 msec
Feeder ran into issues quickly on A2 configurations because of low memory
By storing only term vectors and not the file content, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations.
Following in Progress
Multiple Smaller Indexes on the same node
Queries during Continuous Indexing
Indexing Performance with Multiple Replicas
Multi-Index Search
Detailed analysis available at
https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default
16. UX
Requirements :
Search UI needs to be uncluttered and simple
Users should not lose the context of what they were doing
Experience should be largely similar for searching different artefacts
SharePoint has a precedent for multi-artefact search
Search launches a different page
Seems like a reasonable model to follow
17. Indexing Pipeline (Cont)
Crawler Strategy : Current plan is to use LibGit2Sharp
The following methods were compared
Crawl File by File with GitHttpClient (current implementation)
Download Zipped trees using GitHttpClient
Clone a Repo using Git Command line
LibGit2Sharp
https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
Implications
Entire Git repo is brought down to Azure storage(Blob Store)
To Dedup or not to Dedup
TFS repo on mseng has ~35 feature branches, ~300 scope branches
Results (10 Million Files, 0.4 M Unique Files, Duplication ratio 1:25 (what is seen in Windows SD depots/branches))
No Deduplication : 60GB index size, 19 hours indexing time, 9ms avg query time
Single Document : ~3GB, 50 minutes, 3.7 ms
Parent Child Mappings : 11 GB, ~1 hour 50 Min, 122 ms
https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
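The slides do not spell out how the "Single Document" variant is built. One plausible reading, sketched here purely as an assumption, is a single indexed document per unique file content, keyed by a content hash and carrying every branch/path where that content occurs. The function name `dedup_documents` and the document shape are hypothetical.

```python
import hashlib


def dedup_documents(files):
    """Group files by content hash.

    files: iterable of (branch, path, content_bytes) tuples.
    Returns {sha1_hex: {"content": bytes, "locations": [...]}} so that
    identical content is stored once, however many branches contain it.
    """
    docs = {}
    for branch, path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        doc = docs.setdefault(digest, {"content": content, "locations": []})
        doc["locations"].append({"branch": branch, "path": path})
    return docs


if __name__ == "__main__":
    files = [
        ("main", "src/a.cs", b"class A {}"),
        ("feature/x", "src/a.cs", b"class A {}"),
        ("feature/x", "src/b.cs", b"class B {}"),
    ]
    for digest, doc in dedup_documents(files).items():
        print(digest[:8], len(doc["locations"]))
```

At the reported duplication ratio of 1:25, grouping like this would reduce 10 million crawled files to roughly 0.4 million indexed documents.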
Backend APIs : For diagnostics/dealing with corruption etc. Seminal objects are Tasks, Pipelines, Entities
E.g. of a Task Creation : Request to perform an indexing pipeline related operation X(e.g. reindex,start,stop) on some ‘Entity’ results in creation of a task
Tasks create pipelines, a task is completed when all pipelines spawned are finished
18. Indexing Pipeline Scaleout
We want to host indexes for different artefacts on the same ES cluster
This will enable search aggregation through ES
This opens up several interesting scenarios in future
Scale-out and Isolation for different pipelines based on Job Infra is not possible
To leverage efforts across teams
Implies that the Crawl/Parse/Feed pipeline should be generalized
Potentially we might want to think of extensibilities at the query pipeline as well
19. ALM Search Deployment Topology
[Diagram: AT, Job, Load Balancer, ALM Search Service (AT, Job), Private Network, Internal Load Balancer, and ES with TR Data Nodes, Search Data Nodes, TR Query/Indexing Nodes, Search Query/Indexing Nodes, and Shared Master Nodes.]
20. Cross Account Search and Public Repositories
There is a desire to include all public repositories in Search either by default
or as an option to the user
How will VSO support the notion of a public repository?
Will there be public accounts ?
21. On Premise and Cloud Search Federation
SharePoint and Office 365 set a precedent, supporting 3 models based on OAuth
http://msdn.microsoft.com/en-us/library/dn155905.aspx
Three models for federation
Outbound : Searching on the portal for the on-premise service endpoint, will return results from cloud as well
Inbound : Searching on the portal for the cloud service, returns results from the on-premise TFS indexes as well
Both ways : Search is symmetric
Look out for more details in this space next time!
[Diagram: VS IDE and VSO Web UX clients; Code Search Service instances, each with an Aggregator; Indexors a, b, and c, each paired with a Repository; Cloud.]
22. Futures
Semantic Search
OSS Search Requirements
Extensions to Code Search for Test Cases
24. Perf Testing on 1 Node for up to 4 M Files
25. Indexing Rate Analysis(Thanks to Perf Crew)
[Chart: Indexing Rate. Y axis: Docs Indexed/sec (0 to 350). X axis: configurations A7-1N1S, A7-1N3S, A7-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A5-1N1S, A5-1N3S, A5-1N5S. Series: Files Indexed at 10K, 100K, 500K, 1M, 2M, 3M, 4M.]
Setup
• A5: 6GB allocated to JVM Heap
• A6: 12GB allocated to JVM Heap.
• A7: 20GB allocated to JVM Heap.
• Feeder: A4 machine feeding asynchronously.
Observation
• On A5, the indexing rate remains the same across shard counts.
• On A6 & A7, using more than 1 shard improved the indexing rate; behavior with 3 and 5 shards was the same.
• Indexing rate remained linear past 500K files for the rest of the indexing period.
• On A5, the maximum indexing rate is 160 Docs/sec and the minimum is 107 Docs/sec.
• On A6, the maximum indexing rate is 200 Docs/sec and the minimum is 125 Docs/sec.
• On A7, the maximum indexing rate is 302 Docs/sec and the minimum is 120 Docs/sec.
Conclusion
• Indexing rate remained linear once 500K docs were indexed.
• For onboarding a new repo we can clearly predict/estimate the maximum time needed to index the repo.
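The estimate in the conclusion is a single division: file count over the slowest observed indexing rate gives an upper bound on indexing time. A minimal sketch using the minimum rates reported above (107, 125 and 120 Docs/sec for A5, A6 and A7); the function name `max_indexing_hours` and the 4M-file repo size are illustrative assumptions.

```python
def max_indexing_hours(file_count, min_docs_per_sec):
    """Upper bound on indexing time, assuming the rate never drops below
    the observed minimum."""
    if min_docs_per_sec <= 0:
        raise ValueError("min_docs_per_sec must be positive")
    return file_count / min_docs_per_sec / 3600.0


if __name__ == "__main__":
    # Minimum rates from the observations above, in Docs/sec.
    min_rates = {"A5": 107, "A6": 125, "A7": 120}
    for config, rate in min_rates.items():
        hours = max_indexing_hours(4_000_000, rate)
        print(f"{config}: at most {hours:.1f} hours for 4M files")
```

For a 4M-file repo on A5 this works out to roughly 10.4 hours. The bound only holds for a single feeder under conditions similar to the test setup.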