ALM Search
Sunita Shrivastava
6/20/2014
Microsoft Confidential 1
ALM Search
 Start with Code Search but eventually support search for other artefacts
 Agenda
 Discuss the current architecture and concerns
 Share the investigations
 Share the learning
 Get feedback on open design issues
Indexing Engine Choices
 BING and Elastic Search
 Our requirements : Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
 Indexing : Support for continuous indexing, performant, scale-out feasibility, real-time updates, different schemas
 Our Evaluation shared at :
https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
 ES Observations so far in context of Code Search
 Schema-less
 Multiple artefacts can be stored in the same index
 Can deal with change in data schema of the artefact
 Main Value Add of ES over Lucene
 Is Aggregation! If you need aggregation of search across different indexes, aim to use the aggregator of ES!
 This means that it is likely that sharing a large ES cluster is the right thing to do for search aggregation across VSO artefacts
 Highly Extensible
 Code Element Search
 Move from Nested Documents to Custom Analyzer
 Highlighting
 ES allows the REST APIs to be extended/added
 We chose a custom query extension mechanism
 Azure Search Service, though layered on ES, hid these mechanisms making it intractable for code element search
 Feeding ES
 For Large Scale Indexing, being able to feed it fast enough is important, so the scalability of the pipeline is important.
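As a rough illustration of the query requirements listed on this slide, the sketch below builds one Elasticsearch request body that combines phrase search, AND/OR/NOT clauses, a wildcard, faceting (via a terms aggregation), highlighting and pagination. The field names `content` and `file_extension` and the helper `build_search_body` are hypothetical; the deck does not show the actual schema or queries.

```python
import json


def build_search_body(phrase, must_not_term, wildcard, page, page_size=20):
    """Build an Elasticsearch query-DSL body (all field names are assumed)."""
    return {
        "query": {
            "bool": {
                # AND: the phrase must occur in the file content.
                "must": [{"match_phrase": {"content": phrase}}],
                # OR: an optional wildcard clause that only boosts matches.
                "should": [{"wildcard": {"content": wildcard}}],
                # NOT: exclude documents containing this term.
                "must_not": [{"term": {"content": must_not_term}}],
            }
        },
        # Faceting: count hits per file extension.
        "aggs": {"by_extension": {"terms": {"field": "file_extension"}}},
        # Highlighting of matched fragments in the content field.
        "highlight": {"fields": {"content": {}}},
        # Pagination: zero-based page number translated to an offset.
        "from": page * page_size,
        "size": page_size,
    }


body = build_search_body("index provisioning", "deprecated", "crawl*", page=2)
print(json.dumps(body, indent=2))
```

The body is plain JSON, so it could be sent through any client; the deck itself uses NEST for this.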
High level Architecture
[Diagram: Datacenter A and Datacenter B, each containing a VSO Service, a Search Service (Front End, Backend, REST API, Query Pipeline, Crawl/Parse/Feed Pipeline), Index Servers and Mapper Data; Datacenter A also shows the Web UX.]
Planned Service Architecture
 XSS Scripting and circular dependency problems during build force the Search Client
 This is going to be more and more common as more standalone services come into existence
 Thanks to Patrick/Phecda for this picture
Deployment for MsEng
 Thanks to Sharad Agarwal for this Slide!
 Elastic Search cluster (Indexer)
 3 (Master + Query) Nodes (A2)
 3 Data Nodes (A5)
 Probably 1 Marvel node (A2) – Need data from AppInsight’s team
 Search Service (CPF + Query)
 3 Job Agent Nodes (A2)
 3 App Tier Nodes (A2)
 Config DB (SQL Azure)
 1 Azure Storage account
 Portal UX in TFS
 The Search Service, ES and Marvel clusters are all within a VNET
 Search Service talks to the ES query/ingestion nodes through an ILB
 This helped take care of DNS issues
Logging, Diagnostics and Monitoring
 Logging
 All our code will be instrumented, including the code inside ES. Developers can get these logs.
 Diagnostics
 Each team provides diagnostics data: higher-level data that gives insight into the usage and activity of its component.
 Query Pipeline Telemetry
 Total Number of Queries
 Successful Queries
 Failed Queries
 Slow Queries
 Portal Telemetry
 Total Number of Queries
 Queries that don’t result in a click on the facet or result page in the top 20 results
 Queries that result in a click beyond 20 results
 Search Usage per account
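The query pipeline counters listed above (total, successful, failed and slow queries) could be collected with something as small as the sketch below. The class name `QueryTelemetry` and the slow-query threshold are assumptions made for illustration; the deck does not define what counts as a slow query.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """In-memory counters for the query pipeline (hypothetical helper)."""

    slow_threshold_ms: float = 1000.0  # assumed value, not from the deck
    total: int = 0
    succeeded: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, elapsed_ms: float, ok: bool) -> None:
        """Record one query outcome and its latency."""
        self.total += 1
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1
        # A query can be both successful and slow, so this is counted separately.
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1


telemetry = QueryTelemetry(slow_threshold_ms=500.0)
for elapsed, ok in [(12.0, True), (730.0, True), (40.0, False)]:
    telemetry.record(elapsed, ok)
print(telemetry)
```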
Diagnostics, Monitoring (cont)
 Indexing Telemetry
 Storage used for temporary data (Blobs)
 Storage used for entity state data (Tables?)
 Storage used for metadata
 Storage used for provisioning data
 Amount of data (MB) indexed in the last hour
 Number of commits handled in the last hour
 Number of pending tasks
 Number of pending pipelines
 Number of pending commits
 Cold Start Summary
 Monitoring
 Use Marvel
Query Pipeline
 Quick, Low Overhead
 Query Builder uses the Arriba parser; NEST is used to talk to Elastic Search
 Mapper (not required): the scope of search defines a unique ES index alias, which refers to the appropriate indices
 Security Trimming (cause of concern): for TFVC at file level, for Work Items at Area level
[Diagram: Query Pipeline. Components: REST Endpoint, Repo Access/Auth Mgmt, Query Builder (Format Checking, Query Parsing), Security Trimmer (only for TFVC), Aggregator, Mapper, FileHash Mapper (for Dedup), Highlighter AddIn, Indexer (Elastic Search).]
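The idea that the scope of a search determines a unique ES index alias can be sketched as a small naming function. Everything here is hypothetical: the `codesearch_` prefix, the account/project/repository scope levels and the function name are illustrative, since the deck does not describe the actual alias scheme.

```python
import re


def alias_for_scope(account, project=None, repository=None):
    """Derive a deterministic alias name from a search scope (assumed scheme)."""
    parts = [p for p in (account, project, repository) if p]
    # Elasticsearch index names must be lowercase; alias names are
    # normalised the same way here purely for consistency.
    safe = [re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts]
    return "codesearch_" + "_".join(safe)


print(alias_for_scope("MsEng"))
print(alias_for_scope("MsEng", "VSOnline", "Tfs Repo"))
```

Because the name is derived only from the scope, the query pipeline needs no lookup table to find the alias, which fits the slide's note that the Mapper is not required.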
Query Pipeline Component Diagram
 Thanks to Bittu and Neeraj for this diagram!
[Diagram: Query Pipeline components. Search UI, REST API for Query Interaction, Query Builder, Search String & Filters, Search Query, Backend, Custom Highlighter, TFS GIT Repo, Query String Parser, ES Client, Query Monitor, OI, Query Executor, ES Cluster, Repo Access Management/Authentication, Search Response, ES Search Results, Custom Query, Custom Analyzer.]
Security
 Three Options
 Use Remote Security Name Spaces for caching artefact permissions
 GIT +
 WIT -
 Index level permissions
 Mostly Open Model
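The first option, caching artefact permissions, amounts to trimming search hits against a cached permission check. The sketch below assumes a per-repository check and uses an in-memory dictionary in place of a remote security namespace; `can_read`, `trim_results` and the ACL data are all hypothetical.

```python
from functools import lru_cache

# Hypothetical ACL data standing in for a remote security namespace.
_ACL = {("alice", "repoA"): True, ("alice", "repoB"): False}


@lru_cache(maxsize=10_000)
def can_read(user: str, repo: str) -> bool:
    """Cached permission check; a real one would call the security service."""
    return _ACL.get((user, repo), False)


def trim_results(user, hits):
    """Drop hits from repositories the user is not allowed to read."""
    return [h for h in hits if can_read(user, h["repo"])]


hits = [{"repo": "repoA", "path": "a.cs"}, {"repo": "repoB", "path": "b.cs"}]
print(trim_results("alice", hits))
```

A cache like this trades freshness for speed: a revoked permission stays visible until the entry is evicted, so a real design would need an invalidation or expiry policy.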
Indexing Pipeline
 Currently built on the VSSF framework
 Backend REST APIs : Seminal objects are Tasks, Pipelines, Entities
 E.g. of a Task creation : a request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
 Tasks create pipelines; a task is completed when all the pipelines it spawned are finished
[Diagram: Indexing Pipeline. Stages: BE REST Endpoint, Meta Data Analysis, Cold Start, Index Prep, Index Provisioning, Crawl, Parse, Feed, Indexer (Elastic Search), Ready Index for Query, Mapper Update, Cold Start Cleanup, Dedup Detection (opt).]
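The relationship between Tasks, Pipelines and Entities described on this slide can be modelled as below. This is a minimal sketch, not the service's actual object model: the class and field names are assumed, and treating a task with no pipelines as not yet completed is a design choice made here.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    FINISHED = "finished"


@dataclass
class Pipeline:
    name: str
    state: State = State.PENDING


@dataclass
class Task:
    operation: str  # e.g. "reindex", "start", "stop"
    entity: str     # hypothetical entity identifier
    pipelines: list = field(default_factory=list)

    def spawn(self, name):
        """Create a pipeline owned by this task."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self):
        # Completed only when every spawned pipeline has finished.
        return bool(self.pipelines) and all(
            p.state is State.FINISHED for p in self.pipelines
        )


task = Task("reindex", "repo:example")
crawl = task.spawn("crawl")
feed = task.spawn("feed")
crawl.state = State.FINISHED
print(task.completed)  # one pipeline is still pending
feed.state = State.FINISHED
print(task.completed)  # all pipelines have finished
```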
Indexing Pipeline Component Design
[Diagram: Indexing Pipeline components. TFS Commit Sync, TFS Account Sync, Re-indexer, Crawler Abstraction Layer, Crawler Extensions, Parser, Parser Extensions, Feeder, ES Wrapper, CPF Arbitrator, ES Map and Topology Configurator, Index Monitor, OI, Job Scheduler, ES Extensions (Custom Analyzer/Plugins), De-dup, Multi-tenancy, Logger/Telemetry, Repo Content DB Abstraction Layer, Parser DB Abstraction Layer, ES Cluster, Data.]
 Thanks to Tapas for this diagram!
Indexing Pipeline (cont)
 Cold Start
 Crawl Spec :
 For GIT, the ‘default branch’ is enabled for Indexing by default
 Others will need to get whitelisted explicitly
 TFS Repo has many topic and feature branches
 Need Closure on UX experience on this
 For TFVC : TBD
 For Work Items : TBD
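The Git crawl spec above (default branch indexed by default, every other branch only when whitelisted) reduces to a simple filter. The function name and the branch names in the example are hypothetical.

```python
def branches_to_index(default_branch, all_branches, whitelist=()):
    """Return the branches to crawl: the default one plus whitelisted ones."""
    allowed = {default_branch, *whitelist}
    # Preserve the repository's branch order and ignore whitelist
    # entries that do not exist in the repository.
    return [b for b in all_branches if b in allowed]


branches = ["master", "feature/search", "topic/perf"]
print(branches_to_index("master", branches))
print(branches_to_index("master", branches, whitelist=["topic/perf"]))
```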
Performance Summary
 For up to 5 million files, 90% of queries remained under 60 msec
 Feeder ran into issues quickly on A2 configurations because of low memory
 By storing only term vectors, and not the file content, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations.
 Following in Progress
 Multiple Smaller Indexes on the same node
 Queries during Continuous Indexing
 Indexing Performance with Multiple Replicas
 Multi-Index Search
 Detailed analysis available at
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default
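The slide notes that the Feeder hit low-memory problems on A2 machines but does not say how that was addressed. One common way to bound a feeder's memory use, shown here only as an assumption, is to send documents in size-limited batches in the Elasticsearch bulk (NDJSON) format. The index name, document fields and `max_bytes` value are hypothetical; older Elasticsearch versions also expect a `_type` in the action line or URL.

```python
import json


def bulk_batches(docs, index, max_bytes=1_000_000):
    """Yield NDJSON bulk payloads that stay within max_bytes where possible."""
    lines, size = [], 0
    for doc_id, source in docs:
        # Each document is an action line followed by its source line.
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        entry = action + "\n" + json.dumps(source) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Flush the current batch before it would exceed the limit.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


docs = [(str(i), {"path": f"src/file{i}.cs", "content": "x" * 100}) for i in range(5)]
batches = list(bulk_batches(docs, "code", max_bytes=400))
print(len(batches))
```

A single document larger than `max_bytes` is still sent, alone in its own batch, so the limit is a target and not a hard guarantee.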
UX
 Requirements :
 Search UI needs to be uncluttered and simple
 Users should not lose the context of what they were doing
 Experience should be largely similar for searching different artefacts
 SharePoint has a precedent for multi-artefact search
 Search launches a different page
 Seems like a reasonable model to follow
Indexing Pipeline (Cont)
 Crawler Strategy : Current plan is to use LibGit2Sharp
 The following methods were compared
 Crawl file by file with GitHttpClient (current implementation)
 Download Zipped trees using GitHttpClient
 Clone a Repo using Git Command line
 LibGit2Sharp
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
 Implications
 Entire Git repo is brought down to Azure storage(Blob Store)
 To Dedup or not to Dedup
 TFS repo on mseng has ~35 feature branches, ~300 scope branches
 Results (10 million files, 0.4 M unique files, duplication ratio 1:25 (as seen in Windows SD depots/branches))
 No Deduplication : 60GB index size, 19 hours indexing time, 9ms avg query time
 Single Document : ~3GB, 50 minutes, 3.7 ms
 Parent Child Mappings : 11 GB, ~1 hour 50 Min, 122 ms
 https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
 Backend APIs : For diagnostics/dealing with corruption etc. Seminal objects are Tasks, Pipelines, Entities
 E.g. of a Task Creation : Request to perform an indexing pipeline related operation X(e.g. reindex,start,stop) on some ‘Entity’ results in creation of a task
 Tasks create pipelines, a task is completed when all pipelines spawned are finished
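The "Single Document" deduplication option on this slide (one indexed document per unique file content, shared by all branches that contain it) can be sketched by grouping files on a content hash. With 10 million files and 0.4 million unique ones, that is the 1:25 ratio quoted above (10 / 0.4 = 25). The use of SHA-1 and the document layout are assumptions; the deck only names a "FileHash Mapper" without giving details.

```python
import hashlib
from collections import defaultdict


def dedup_documents(files):
    """Group (branch, path, content) tuples into one document per unique content."""
    by_hash = defaultdict(lambda: {"content": None, "locations": []})
    for branch, path, content in files:
        # Identical content hashes to the same key regardless of branch or path.
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        doc = by_hash[digest]
        doc["content"] = content
        doc["locations"].append({"branch": branch, "path": path})
    return dict(by_hash)


files = [
    ("main", "src/a.cs", "class A {}"),
    ("feature/x", "src/a.cs", "class A {}"),
    ("main", "src/b.cs", "class B {}"),
]
documents = dedup_documents(files)
print(len(files), "files ->", len(documents), "documents")
```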
Indexing Pipeline Scaleout
 We want to host indexes for different artefacts on the same ES cluster
 This will enable search aggregation through ES
 This opens up several interesting scenarios in future
 Scale-out and isolation for different pipelines based on Job Infra is not possible
 To leverage efforts across teams
 Implies that the Crawl/Parse/Feed pipeline should be generalized
 Potentially we might want to think about extensibility in the query pipeline as well
ALM Search Deployment Topology
[Diagram: deployment topology. Labels: AT and Job nodes, ES, Load Balancer, Private Network, Internal Load Balancer, ALM Search Service, TR Data Nodes, Search Data Nodes, TR Query/Indexing Nodes, Search Query/Indexing Nodes, Shared Master Nodes.]
Cross Account Search and Public Repositories
 There is a desire to include all public repositories in Search, either by default or as an option to the user
 How will VSO support the notion of a public repository?
 Will there be public accounts ?
On Premise and Cloud Search Federation
 SharePoint and Office 365 set a precedent, supporting three models based on OAuth
 http://msdn.microsoft.com/en-us/library/dn155905.aspx
 Three models for federation
 Outbound : Searching on the portal for the on-premise service endpoint, will return results from cloud as well
 Inbound : Searching on the portal for the cloud service, returns results from the on-premise TFS indexes as well
 Both ways : Search is symmetric
 Look out for more details in this space next time!
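The three federation models can be read as a rule for which indexes a query should reach, depending on the portal it was issued from. The sketch below encodes only that rule; the portal identifiers and function name are hypothetical, and authentication (OAuth in the SharePoint precedent) is left out entirely.

```python
from enum import Enum


class Federation(Enum):
    OUTBOUND = "outbound"  # on-premises portal also returns cloud results
    INBOUND = "inbound"    # cloud portal also returns on-premises results
    BOTH = "both"          # symmetric


def sources_for(portal, mode):
    """Return the indexes a query issued from `portal` should hit."""
    other = "cloud" if portal == "on_premises" else "on_premises"
    federated = (
        mode is Federation.BOTH
        or (mode is Federation.OUTBOUND and portal == "on_premises")
        or (mode is Federation.INBOUND and portal == "cloud")
    )
    return [portal, other] if federated else [portal]


for mode in Federation:
    print(mode.value, sources_for("on_premises", mode), sources_for("cloud", mode))
```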
[Diagram: federation. Labels: Code Search Service instances with Aggregators, Indexers (a, b, c) each over a Repository, a Cloud boundary, and VS IDE and VSO Web UX clients.]
Futures
 Semantic Search
 OSS Search Requirements
 Extensions to Code Search for Test Cases
Appendix
Perf Testing on 1 Node for up to 4 M Files
Indexing Rate Analysis (Thanks to Perf Crew)
[Chart: Indexing Rate. Y axis: DocsIndexed/sec (0 to 350). X axis: configurations A7-1N1S, A7-1N3S, A7-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A5-1N1S, A5-1N3S, A5-1N5S. One series per Files Indexed count: 10K, 100K, 500K, 1M, 2M, 3M, 4M.]
 Setup
• A5: 6GB allocated to JVM Heap.
• A6: 12GB allocated to JVM Heap.
• A7: 20GB allocated to JVM Heap.
• Feeder: A4 machine feeding asynchronously.
 Observation
• On A5, the indexing rate remains the same across shard counts.
• On A6 and A7, using more than 1 shard improved the indexing rate; 3 and 5 shards behaved the same.
• The indexing rate remained linear past 500K files for the whole indexing period.
• On A5, the maximum indexing rate is 160 docs/sec and the minimum is 107 docs/sec.
• On A6, the maximum indexing rate is 200 docs/sec and the minimum is 125 docs/sec.
• On A7, the maximum indexing rate is 302 docs/sec and the minimum is 120 docs/sec.
 Conclusion
• Indexing rate remained linear once 500K docs were indexed.
• When onboarding a new repo, we can therefore estimate the maximum time needed to index it.
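The estimate mentioned in the conclusion is a division of the file count by the slowest observed indexing rate. The sketch below uses the minimum rates reported above and assumes that one file is one document and that the rate holds beyond the 4M files that were tested; the function name is hypothetical.

```python
# Minimum observed indexing rates (docs/sec) from the observations above.
MIN_RATE = {"A5": 107, "A6": 125, "A7": 120}


def max_indexing_hours(file_count, min_docs_per_sec):
    """Upper-bound indexing time in hours, assuming one document per file."""
    return file_count / min_docs_per_sec / 3600


for size, rate in MIN_RATE.items():
    hours = max_indexing_hours(4_000_000, rate)
    print(f"{size}: about {hours:.1f} hours for 4M files")
```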

ALM Search Presentation for the VSS Arch Council

  • 2. ALM Search  Start with Code Search but eventually support search for other artefacts  Agenda  Discuss the current architecture and concerns  Share the investigations  Share the learning  Get feedback on open design issues Microsoft Confidential 2
  • 3. Indexing Engine Choices
      • BING and Elastic Search
      • Our requirements: Code Element Search, Phrase Search, AND/OR/NOT Search, Wildcard Search, Faceting, Highlighting, Pagination
      • Indexing: support for continuous indexing, performance, scale-out feasibility, real-timeness, different schemas
      • Our evaluation is shared at: https://microsoft.sharepoint.com/teams/EngSys/Documents/Modernizing%20Our%20Engineering%20System%20AOI/Search/TechEval/ElasticSearch/Tech%20Eval%20Summary.pptx?web=1
      • ES observations so far in the context of Code Search
          • Schema-less
              • Multiple artefacts can be stored in the same index
              • Can deal with change in the data schema of the artefact
          • Main value add of ES over Lucene
              • Aggregation! If you need to aggregate search across different indexes, aim to use the ES aggregator
              • This means that sharing a large ES cluster is likely the right thing to do for search aggregation across VSO artefacts
          • Highly extensible
              • Code Element Search
              • Move from nested documents to a custom analyzer
              • Highlighting
              • ES allows the REST APIs to be extended/added
              • We chose a custom query extension mechanism
              • Azure Search Service, though layered on ES, hid these mechanisms, making it intractable for code element search
          • Feeding ES
              • For large-scale indexing, being able to feed ES fast enough is important, so the scalability of the pipeline matters
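The AND/OR/NOT, phrase, wildcard, highlighting, and pagination requirements on this slide map naturally onto the `bool` query of the Elasticsearch query DSL. The following is a minimal sketch of such a request body, not the team's custom query extension: the `content` field name and the `build_code_search_query` helper are hypothetical, since the deck does not show the actual index mapping.

```python
def build_code_search_query(phrases=(), any_terms=(), excluded_terms=(),
                            wildcard=None, field="content", page=0, page_size=20):
    """Build an Elasticsearch request body (plain dict) for a code search.

    `field` is a hypothetical field name; the real mapping is not in the deck.
    """
    bool_clause = {
        # AND: every phrase must match, in order (phrase search)
        "must": [{"match_phrase": {field: p}} for p in phrases],
        # OR: matching any of these terms contributes to the score
        "should": [{"match": {field: t}} for t in any_terms],
        # NOT: documents containing these terms are excluded
        "must_not": [{"match": {field: t}} for t in excluded_terms],
    }
    if wildcard:
        # Wildcard search, e.g. "get*Async"
        bool_clause["must"].append({"wildcard": {field: wildcard}})
    return {
        "query": {"bool": bool_clause},
        "highlight": {"fields": {field: {}}},  # highlighting
        "from": page * page_size,              # pagination offset
        "size": page_size,
    }


body = build_code_search_query(phrases=["index alias"], excluded_terms=["test"],
                               wildcard="feed*", page=2)
```

The resulting dict can be serialized to JSON and sent to the `_search` endpoint of an index or alias.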
  • 4. High level Architecture
      • [Diagram: Datacenter A and Datacenter B, each with a Search Service (Search Service Front End, Search Service Backend, REST API, Index Servers, Query Pipeline, Crawl/Parse/Feed Pipeline, Mapper Data) and a VSO Service; Web UX is listed under Datacenter A only]
  • 5. Planned Service Architecture
      • XSS Scripting and circular dependency problems during build force the Search Client
      • This is going to be more and more common as more standalone services come into existence
      • Thanks to Patrick/Phecda for this picture
  • 6. Deployment for MsEng
      • Thanks to Sharad Agarwal for this slide!
      • Elastic Search cluster (Indexer)
          • 3 (Master + Query) nodes (A2)
          • 3 Data nodes (A5)
          • Probably 1 Marvel node (A2); need data from AppInsight’s team
      • Search Service (CPF + Query)
          • 3 Job Agent nodes (A2)
          • 3 App Tier nodes (A2)
          • Config DB (SQL Azure)
          • 1 Azure Storage account
      • Portal UX in TFS
      • The Search Service, ES, and Marvel cluster are all within a VNET
      • Search Service talks to the ES query/ingestion nodes through an ILB
          • This helped take care of DNS issues
  • 7. Logging, Diagnostics and Monitoring
      • Logging
          • All our code will be instrumented, including the code inside ES. Developers can get these logs.
      • Diagnostics
          • Each team provides diagnostics data: higher-level data that gives insight into the usage and activity in the context of the component.
          • Query Pipeline Telemetry
              • Total number of queries
              • Successful queries
              • Failed queries
              • Slow queries
          • Portal Telemetry
              • Total number of queries
              • Queries that don’t result in a click on the facet or result page in the top 20 results
              • Queries that result in a click beyond 20 results
              • Search usage per account
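The four Query Pipeline counters listed on this slide (total, successful, failed, slow) could be collected with something as small as the sketch below. The `QueryTelemetry` class and the 500 ms slow-query threshold are assumptions made for illustration; the deck does not say where the "slow" cutoff sits.

```python
from dataclasses import dataclass


@dataclass
class QueryTelemetry:
    """In-memory counters for the query pipeline (hypothetical helper)."""
    slow_threshold_ms: float = 500.0  # assumed cutoff, not from the deck
    total: int = 0
    successful: int = 0
    failed: int = 0
    slow: int = 0

    def record(self, elapsed_ms: float, ok: bool) -> None:
        """Record one query: its latency and whether it succeeded."""
        self.total += 1
        if ok:
            self.successful += 1
        else:
            self.failed += 1
        if elapsed_ms > self.slow_threshold_ms:
            self.slow += 1


telemetry = QueryTelemetry()
telemetry.record(12.0, ok=True)
telemetry.record(900.0, ok=True)   # succeeded, but slow
telemetry.record(30.0, ok=False)
```

A slow query is counted independently of success or failure, so `slow` can overlap with either of the other two counters.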
  • 8. Diagnostics, Monitoring (cont)
      • Indexing Telemetry
          • Storage used for temporary data (Blobs)
          • Storage used for entity state data (Tables?)
          • Storage used for metadata
          • Storage used for provisioning data
          • Amount of data (MB) indexed in the last hour
          • Number of commits handled in the last hour
          • Number of pending tasks
          • Number of pending pipelines
          • Number of pending commits
          • Cold Start summary
      • Monitoring
          • Use Marvel
  • 9. Query Pipeline
      • Quick, low overhead
      • Query Builder uses the Arriba parser; NEST is used to talk to Elastic Search
      • Mapper (not required): the scope of search defines a unique ES index alias which refers to the appropriate indices
      • Security Trimming (cause of concern): for TFVC at file level, for Work Items at area level
      • [Diagram labels: REST Endpoint, Query Pipeline, Repo Access/Auth Mgmt, Query Builder (Format Checking, Query Parsing), Security Trimmer (only for TFVC), Aggregator, Mapper, FileHash Mapper (for Dedup), Highlighter AddIn, Indexor (Elastic Search)]
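Two ideas on this slide can be sketched briefly: deriving an ES index alias from the search scope (which is why a separate mapper is not required), and trimming results the caller may not read. Everything named here is hypothetical: the `codesearch_` alias prefix, the account/project/repo scope shape, and the `can_read` callback standing in for the real permission check are assumptions, not the team's scheme.

```python
import re


def alias_for_scope(account, project=None, repo=None):
    """Derive a deterministic alias name from the search scope.

    Names are lower-cased and non-alphanumeric runs become '-'.
    The 'codesearch_' prefix is an illustrative convention only.
    """
    parts = [p for p in (account, project, repo) if p]
    cleaned = [re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts]
    return "codesearch_" + "_".join(cleaned)


def trim_hits(hits, can_read):
    """File-level security trimming: keep only hits the caller may read.

    `can_read` is a stand-in for the real permission check.
    """
    return [h for h in hits if can_read(h["path"])]


alias = alias_for_scope("MsEng", "VSOnline", "Search Core")
visible = trim_hits(
    [{"path": "$/Proj/Public/a.cs"}, {"path": "$/Proj/Secret/b.cs"}],
    can_read=lambda path: "/Secret/" not in path,
)
```

Trimming after the query, as sketched here, is simple but can leave result pages short, which may be part of why the slide flags security trimming as a concern.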
  • 10. Query Pipeline Component Diagram
      • Thanks to Bittu and Neeraj for this diagram!
      • [Diagram labels: Search UI, REST API for Query Interaction, Query Builder, Search String & Filters, Search Query Backend, Custom Highlighter, TFS GIT Repo, Query String Parser, ES Client, Query Monitor OI, Query Executor, ES Cluster, Repo Access Management/Authentication, Search Response, ES Search Results, Custom Query, Custom Analyzer]
  • 11. Security
      • Three options
          • Use Remote Security Namespaces for caching artefact permissions
              • GIT +
              • WIT -
          • Index-level permissions
          • Mostly open model
  • 12. Indexing Pipeline
      • Currently built on the VSSF Framework
      • Backend REST APIs: seminal objects are Tasks, Pipelines, Entities
          • Example of task creation: a request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some Entity results in the creation of a task
          • Tasks create pipelines; a task is completed when all pipelines it spawned are finished
      • [Diagram labels: Indexing Pipeline, Crawl, BE REST Endpoint, Meta Data Analysis, Cold Start, Index Prep, Index Provisioning, Parse, Feed, Indexor (Elastic Search), Ready Index for Query, Mapper Update, Cold Start Cleanup, Dedup Detection (opt)]
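The Task/Pipeline/Entity relationship described on this slide (an operation on an entity creates a task, the task spawns pipelines, and the task completes when all of them finish) can be modeled roughly as below. The class and field names are illustrative assumptions and not the backend REST API's actual schema; treating a task with no pipelines as not yet complete is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class Pipeline:
    name: str
    state: State = State.PENDING


@dataclass
class Task:
    operation: str   # e.g. "reindex", "start", "stop"
    entity: str      # the entity the operation targets, e.g. a repository
    pipelines: list = field(default_factory=list)

    def spawn(self, name: str) -> Pipeline:
        """Create a pipeline owned by this task."""
        pipeline = Pipeline(name)
        self.pipelines.append(pipeline)
        return pipeline

    @property
    def completed(self) -> bool:
        # Assumption: a task that has spawned nothing is not yet complete.
        return bool(self.pipelines) and all(
            p.state is State.FINISHED for p in self.pipelines
        )


task = Task(operation="reindex", entity="repo:search-core")
crawl = task.spawn("crawl-parse-feed:master")
other = task.spawn("crawl-parse-feed:feature-x")
crawl.state = State.FINISHED
done_after_one = task.completed
other.state = State.FINISHED
done_after_all = task.completed
```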
  • 13. Indexing Pipeline Component Design
      • Thanks to Tapas for this diagram!
      • [Diagram labels: TFS Commit Sync, TFS Account Sync, Re-indexer, Crawler Abstraction Layer, Crawler Extensions, Parser, Parser Extensions, Feeder, ES Wrapper, CPF Arbitrator, ES Map and Topology Configurator, Index Monitor OI, Job Scheduler, ES Extensions (Custom Analyzer/Plugins), De-dup, Multi-tenancy, Logger/Telemetry, Repo Content DB Abstraction Layer, Parser DB Abstraction Layer, ES Cluster, Data]
  • 14. Indexing Pipeline (cont)
      • Cold Start
          • Crawl spec:
              • For GIT, the ‘default branch’ is enabled for indexing by default
              • Other branches need to be whitelisted explicitly
              • The TFS repo has many topic and feature branches
              • Need closure on the UX for this
              • For TFVC: TBD
              • For Work Items: TBD
  • 15. Performance Summary
      • For up to 5 million files, 90% of queries remained under 60 msec
      • The Feeder quickly ran into issues on A2 configurations because of low memory
      • By not storing the file content, but only term vectors, the performance came down from a range of ~1.5 msec to 20 msec on A5 configurations
      • In progress
          • Multiple smaller indexes on the same node
          • Queries during continuous indexing
          • Indexing performance with multiple replicas
          • Multi-index search
      • Detailed analysis available at https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={83CFFEAA-1C78-46FA-BFE7-9D3E36DCA3CA}&file=Sprint66_PerftAnalysis-1.pptx&action=default
  • 16. UX
      • Requirements:
          • Search UI needs to be uncluttered and simple
          • Users should not lose the context of what they were doing
          • Experience should be largely similar for searching different artefacts
      • SharePoint has a precedent for multi-artefact search
          • Search launches a different page
          • Seems like a reasonable model to follow
  • 17. Indexing Pipeline (cont)
      • Crawler strategy: the current plan is to use LibGit2Sharp
          • The following methods were compared
              • Crawl file by file with GitHttpClient (current implementation)
              • Download zipped trees using GitHttpClient
              • Clone a repo using the Git command line
              • LibGit2Sharp
          • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={43E4F3C7-D54E-432E-BDF7-33F96A912E58}&file=Git%20Repos%20Crawl%20Option%20Comparison.docx&action=default
          • Implications
              • The entire Git repo is brought down to Azure storage (Blob Store)
      • To dedup or not to dedup
          • The TFS repo on mseng has ~35 feature branches and ~300 scope branches
          • Results (10 million files, 0.4 million unique files, duplication ratio 1:25, as seen in Windows SD depots/branches)
              • No deduplication: 60 GB index size, 19 hours indexing time, 9 ms average query time
              • Single document: ~3 GB, 50 minutes, 3.7 ms
              • Parent-child mappings: 11 GB, ~1 hour 50 minutes, 122 ms
          • https://microsoft.sharepoint.com/teams/EngSys/_layouts/15/WopiFrame.aspx?sourcedoc={A19453EF-9EB8-447C-A2E0-85C245D4CA79}&file=Summary%20of%20Deduplication%20Effort.docx&action=default
      • Backend APIs: for diagnostics, dealing with corruption, etc. Seminal objects are Tasks, Pipelines, Entities
          • Example of task creation: a request to perform an indexing-pipeline operation X (e.g. reindex, start, stop) on some ‘Entity’ results in the creation of a task
          • Tasks create pipelines; a task is completed when all pipelines it spawned are finished
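The deck does not spell out how the "Single Document" deduplication option is modeled. One plausible reading, sketched here purely as an assumption, is that each unique file content is indexed once, keyed by a content hash, with every branch path that shares that content attached to the same document. The `dedup_by_content` helper and the document shape are hypothetical.

```python
import hashlib


def dedup_by_content(files):
    """Group files by content hash.

    `files` is an iterable of (path, content_bytes) pairs. Returns a dict
    mapping SHA-1 hex digest -> {"content": bytes, "paths": [path, ...]},
    i.e. one document per unique content (an assumed document shape).
    """
    docs = {}
    for path, content in files:
        digest = hashlib.sha1(content).hexdigest()
        doc = docs.setdefault(digest, {"content": content, "paths": []})
        doc["paths"].append(path)
    return docs


files = [
    ("master/src/a.cs", b"class A {}"),
    ("feature-x/src/a.cs", b"class A {}"),  # same content on another branch
    ("master/src/b.cs", b"class B {}"),
]
docs = dedup_by_content(files)
duplication_ratio = len(files) / len(docs)  # total files per unique document
```

With the slide's figures, 10 million files over 0.4 million unique files gives the stated 1:25 ratio, which is consistent with the index shrinking from 60 GB to about 3 GB in the single-document results.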
  • 18. Indexing Pipeline Scaleout
      • We want to host indexes for different artefacts on the same ES cluster
          • This will enable search aggregation through ES
          • This opens up several interesting scenarios in future
      • Scale-out and isolation for different pipelines based on Job Infra is not possible
      • To leverage efforts across teams
          • Implies that the Crawl/Parse/Feed pipeline should be generalized
          • Potentially we might want to think of extensibility at the query pipeline as well
  • 19. ALM Search Deployment Topology
      • [Diagram labels: AT, Job, ES Load Balancer, Private Network, Internal Load Balancer, ALM Search Service, TR Data Nodes, Search Data Nodes, TR Query/Indexing Nodes, Search Query/Indexing Nodes, Shared Master Nodes]
  • 20. Cross Account Search and Public Repositories
      • There is a desire to include all public repositories in Search, either by default or as an option to the user
      • How will VSO support the notion of a public repository?
      • Will there be public accounts?
  • 21. On Premise and Cloud Search Federation
      • SharePoint and Office 365 have a precedent, supporting 3 models based on OAuth
          • http://msdn.microsoft.com/en-us/library/dn155905.aspx
      • Three models for federation
          • Outbound: searching on the portal for the on-premise service endpoint returns results from the cloud as well
          • Inbound: searching on the portal for the cloud service returns results from the on-premise TFS indexes as well
          • Both ways: search is symmetric
      • Look out for more details in this space next time!
      • [Diagram labels: Code Search Service, Aggregator, Indexor a, Indexor b, Indexor c, Repository, Cloud, VS IDE, VSO Web UX]
  • 22. Futures
      • Semantic Search
      • OSS Search Requirements
      • Extensions to Code Search for Test Cases
  • 24. Perf Testing on 1 Node for up to 4M Files
  • 25. Indexing Rate Analysis (Thanks to Perf Crew)
      • [Chart: Indexing Rate. Y axis: docs indexed/sec (0 to 350). Configurations: A7-1N1S, A7-1N3S, A7-1N5S, A6-1N1S, A6-1N3S, A6-1N5S, A5-1N1S, A5-1N3S, A5-1N5S. Files indexed: 10K, 100K, 500K, 1M, 2M, 3M, 4M]
      • Setup
          • A5: 6 GB allocated to JVM heap
          • A6: 12 GB allocated to JVM heap
          • A7: 20 GB allocated to JVM heap
          • Feeder: A4 machine feeding asynchronously
      • Observation
          • On A5 the indexing rate remains the same across shard counts
          • On A6 and A7, using more than 1 shard improved the indexing rate; behavior with 3 and 5 shards was the same
          • The indexing rate remained linear past 500K files for the whole indexing period
          • On A5 the maximum indexing rate is 160 docs/sec and the minimum is 107 docs/sec
          • On A6 the maximum indexing rate is 200 docs/sec and the minimum is 125 docs/sec
          • On A7 the maximum indexing rate is 302 docs/sec and the minimum is 120 docs/sec
      • Conclusion
          • The indexing rate remained linear once 500K docs were indexed
          • For onboarding a new repo we can clearly predict/estimate the maximum time needed to index the repo
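The conclusion that onboarding time can be predicted follows from the rate staying roughly constant: the worst-case time is the file count divided by the minimum observed rate. A small sketch of that arithmetic, using the minimum and maximum rates reported above; the `estimate_indexing_hours` helper is illustrative, and it ignores crawl and parse time, which the slide does not quantify.

```python
def estimate_indexing_hours(file_count, docs_per_sec):
    """Estimated feed time in hours at a constant indexing rate."""
    if docs_per_sec <= 0:
        raise ValueError("docs_per_sec must be positive")
    return file_count / docs_per_sec / 3600.0


# (min, max) docs/sec per VM size, as reported on this slide
OBSERVED_RATES = {"A5": (107, 160), "A6": (125, 200), "A7": (120, 302)}

repo_files = 4_000_000
worst_case_a5 = estimate_indexing_hours(repo_files, OBSERVED_RATES["A5"][0])
best_case_a5 = estimate_indexing_hours(repo_files, OBSERVED_RATES["A5"][1])
```

For a 4 million file repository on A5 this gives roughly 10.4 hours at the minimum rate and about 6.9 hours at the maximum.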