Presentation from the European SharePoint Conference 2014 in Barcelona: how we built a solution for indexing 3,000 file shares using self-service solutions and automated crawl management.
ESPC14 380 So you think you can crawl? Stretching the Boundaries of SharePoint 2013!
1. So you think you can crawl?
Stretching the Boundaries of SharePoint 2013!
Petter Skodvin-Hvammen
AD-Gruppen, Norway
2. Who am I?
Petter Skodvin-Hvammen
Oseberg ship - Discovered 1904 in Tønsberg, Norway. Buried by Vikings in 834 AD
• Solutions Architect
• SharePoint Consultant
• Search Enthusiast
• Community Lead
@pettersh - psh@adgruppen.no
www.adgruppen.no
3. Enterprise Search
Index thousands of sources
Automate index management
Infrastructure sizing
Challenges and Solutions
Not included: code/scripts, user experience, relevancy, governance
www.sharepointeurope.com
4. Enterprise Search using SharePoint Server 2013
• 30,000 users
• 85 locations in 30 countries
• 15,000 daily searches
• 100,000,000 documents(?)
• 60 core systems, 2,000 applications
The Mission…
5. What do we index?
100,000,000 documents
3,000 file shares
500 servers
6. Where is the data?
• Datacenters
• Time zones
• Bandwidth
8. How do we operate it?
• File shares are created, changed, and deleted every day through a custom self-service solution
• File shares are moved between servers every day by automation rules
• Indexing and crawling of each file share must be managed with minimal manual effort
9. What can SharePoint do?
• Max 50 content sources per service application
– Max 500 with October 2013 CU installed
• Max 100 start addresses per content source
– Max 500 with October 2013 CU installed
• Max 20 concurrent crawls per service application
– Limitation has been removed
http://technet.microsoft.com/en-us/library/cc262787(v=office.15).aspx#Search
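The limits above translate into a minimum number of content sources if every file share is registered as its own start address. A rough sketch of that arithmetic, using the 3,000 file shares from the "What do we index?" slide; the function name is hypothetical and the grouping strategy actually used is described on the later slides.

```python
import math

def content_sources_needed(start_addresses: int, max_per_source: int) -> int:
    """Minimum number of content sources if every file share is its own start address."""
    return math.ceil(start_addresses / max_per_source)

FILE_SHARES = 3000  # figure from the "What do we index?" slide

# Default limit: 100 start addresses per content source
default_limit = content_sources_needed(FILE_SHARES, 100)

# With the October 2013 CU: 500 start addresses per content source
cu_limit = content_sources_needed(FILE_SHARES, 500)

print(default_limit, cu_limit)
```

Even where the count fits within the limits, every created, moved, or deleted share would still mean editing a content source by hand, which is the maintenance burden the rest of the deck addresses.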
10. It’s complicated
• More data than we have space for
• It’s located all over the place
• Everything changes all of the time
• There are limitations in SharePoint
• Someone’s gotta maintain this
• It has to be secure and relevant
11. What did we do?
• Created logical groups of file shares
• Used symbolic linking
fewer content sources
File shares: \\file01\share01, \\file02\share03, \\file03\share03
Symbolic links: \\file00\share\sym01, \\file00\share\sym02, \\file00\share\sym03
Start address: \\file00\share
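The idea of collecting many file shares under one start address through symbolic links can be sketched as follows. This is a minimal illustration, not the code used in the project (the deck states that code is not included). The function names and the hash-based link naming are assumptions; only `os.symlink` and the other standard library calls are real APIs.

```python
import hashlib
import os

def link_name_for(target: str) -> str:
    """Hypothetical naming scheme: first 10 hex characters of a SHA-1 of the target path."""
    return hashlib.sha1(target.encode("utf-8")).hexdigest()[:10]

def sync_links(root: str, targets: list[str]) -> dict[str, str]:
    """Make `root` contain exactly one directory symlink per target.

    Returns a mapping of link path -> target path.
    """
    os.makedirs(root, exist_ok=True)
    wanted = {os.path.join(root, link_name_for(t)): t for t in targets}

    # Remove links whose target is no longer in the list (share deleted or moved).
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.islink(path) and path not in wanted:
            os.unlink(path)

    # Create links that do not exist yet.
    for path, target in wanted.items():
        if not os.path.islink(path):
            os.symlink(target, path, target_is_directory=True)
    return wanted
```

The crawler then only needs `root` as a start address; adding or removing a share changes the links, not the content source. On Windows, creating symbolic links typically requires elevated privileges.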
12. What did we do?
• Grouped file shares based on region
• One content source per region
• Incremental crawls every night
crawling based on time zones
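Scheduling one nightly crawl per region means converting each region's local night into a common clock. A small sketch of that conversion, assuming hypothetical regions, fixed UTC offsets (daylight saving time is ignored), and a 22:00 local start; none of these values come from the deck.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets in hours (DST is ignored here).
REGION_UTC_OFFSET_HOURS = {"europe": 1, "americas": -5, "asia": 8}

def nightly_start_utc(region: str, day: date, local_start: time = time(22, 0)) -> datetime:
    """Return the UTC instant at which `local_start` occurs in `region` on `day`."""
    tz = timezone(timedelta(hours=REGION_UTC_OFFSET_HOURS[region]))
    local = datetime.combine(day, local_start, tzinfo=tz)
    return local.astimezone(timezone.utc)

for region in REGION_UTC_OFFSET_HOURS:
    print(region, nightly_start_utc(region, date(2014, 5, 5)).isoformat())
```

With one content source per region, each source can be given its own incremental crawl schedule so that crawling happens outside local working hours.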
13. What did we do?
• Created one DNS alias per crawler impact rule in the etc/hosts file on the crawl servers
reduced crawler impact
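Crawler impact rules are defined per host name, so giving the crawler several names for its sources lets each name carry its own request limit. The sketch below only renders hosts-file lines from an alias table; the alias names and IP addresses are made up for illustration (the addresses come from the 192.0.2.0/24 documentation range), and how the project named its aliases is not stated in the deck.

```python
def hosts_lines(aliases: dict[str, str]) -> list[str]:
    """Render 'IP<TAB>alias' lines, sorted by alias for stable output."""
    return [f"{ip}\t{alias}" for alias, ip in sorted(aliases.items())]

# Hypothetical aliases, one per crawler impact rule.
ALIASES = {
    "crawl-low-impact": "192.0.2.10",
    "crawl-high-impact": "192.0.2.10",
}

print("\n".join(hosts_lines(ALIASES)))
```

Both aliases may point to the same address; the crawler treats them as different hosts and applies a separate impact rule to each.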
14. What did we do?
• Granted file share access to the crawl account that is a member of the fewest groups
• Monitored group memberships
• Grouped file shares by crawl account
• Crawl rules matched folder structure
managed pool of crawl accounts
Crawl rule path          Rule     Crawl account
file://.*/spcrwl01/.*    Include  SPspcrwl01
file://.*/spcrwl02/.*    Include  SPspcrwl02
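The two ideas on this slide, picking the account with the fewest group memberships and letting the folder name in the path decide which account crawls it, can be sketched as below. The group counts are invented, the function names are hypothetical, and the crawl rule paths are treated as ordinary Python regular expressions for the purpose of the illustration.

```python
import re
from typing import Optional

def pick_crawl_account(group_counts: dict[str, int]) -> str:
    """Pick the account that is a member of the fewest groups (ties: alphabetical order)."""
    return min(sorted(group_counts), key=lambda account: group_counts[account])

# Crawl rule paths from the slide, keyed by the account folder they match.
CRAWL_RULES = {
    "spcrwl01": re.compile(r"file://.*/spcrwl01/.*"),
    "spcrwl02": re.compile(r"file://.*/spcrwl02/.*"),
}

def account_for_url(url: str) -> Optional[str]:
    """Return the crawl account whose rule matches the URL, or None if no rule matches."""
    for account, pattern in CRAWL_RULES.items():
        if pattern.fullmatch(url):
            return account
    return None
```

Because the account name is part of the folder structure, adding a share to an account's folder is enough for the matching crawl rule to pick it up.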
16. How did we manage this?
• Self-service portal for enabling indexing of file shares
• Custom web service integration in the self-service portal
• Custom solution for granting access to crawl accounts
• Custom timer job to get the list of file shares to crawl from the self-service portal
• Custom timer job for creating and removing symbolic links
• Custom lists for mapping server to content source, schedule, and impact; shares to crawl accounts and metadata; and UNC path to symlink
• Content enrichment service for replacing symlinks in paths with actual file paths
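The last component, replacing the symlink part of a crawled path with the real file path, is essentially a prefix lookup. A minimal sketch of that lookup follows; the function name and the example paths are hypothetical, and the real content enrichment service would wrap logic like this in a web service called by the search pipeline.

```python
def resolve_symlink_path(path: str, links: dict[str, str]) -> str:
    """Replace the longest matching symlink prefix in `path` with the real UNC path."""
    for prefix in sorted(links, key=len, reverse=True):
        # Match only whole path segments, so "sym01" does not match "sym010".
        if path == prefix or path.startswith(prefix + "\\"):
            return links[prefix] + path[len(prefix):]
    return path

# Hypothetical UNC-to-symlink mapping, as it might be read from the custom list.
LINKS = {r"\\file00\share\sym01": r"\\file01\share01"}

print(resolve_symlink_path(r"\\file00\share\sym01\docs\report.docx", LINKS))
```

Users then see and open the document at its actual location instead of the crawler's symlink path.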
17. Example: Self Service Portal
Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: Assigned automatically
Crawl Account: Assigned automatically
[Cancel] [Save]
Example: Custom Lists
Title: European SharePoint Conference
Owner: Petter Skodvin-Hvammen
Business Area: Consulting
Classification: Internal
Type: Project
UNC Path: \\file01\share01
Crawl Account: SPspcrawl01
Symlink: defaulteuropedefaultspcrwl01e5dc12a41d
Location: europe (server file01 is located in Oslo DC)
Bandwidth: 5Mbps
19. Capacity testing
Purpose
• Crawling of symbolic links
• Scaling of virtual machines
• Sizing of disk space
• Verify Microsoft's recommendations
Approach
• 4-server farm with 2 partitions
• 8 vCPU, 16 GB RAM, 850 GB
• Crawl 10 file shares (3.7M files)
• Replay top 300 queries
• Apache JMeter
20. Capacity testing – findings
• Crawl rate declined 1% per million items indexed
• Query latency increased exponentially beyond 12 million indexed items per partition
• Database latency was insignificant during crawling
• Successfully crawled file shares via symbolic directory links
• Disk space usage was significantly lower than expected
– Reduced data volume from 850 GB to 450 GB
– 40+ servers => huge cost savings
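The first finding can be turned into a rough planning formula. The sketch below assumes the decline is linear (1% of the initial rate per million items), which is only one reading of the finding; the initial rate of 100 items per second is a made-up number.

```python
def crawl_rate(initial_rate: float, items_indexed_millions: float) -> float:
    """Estimated crawl rate after indexing the given number of items (linear model)."""
    factor = max(0.0, 1.0 - 0.01 * items_indexed_millions)
    return initial_rate * factor

# Hypothetical initial rate of 100 items per second.
for millions in (0, 10, 25, 50):
    print(millions, crawl_rate(100.0, millions))
```

Under this reading, the crawl rate has dropped by a tenth once 10 million items are indexed, which matters when estimating how long a full crawl of a large corpus will take.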
21. Infrastructure – VM sizing
Dedicated ESX Cluster
• 14 x VM for SharePoint 2013
– 4 physical machines
– 4 x 32 = 128 CPUs
– 4 x 256 = 1024 GB memory
• HA max utilization = ¾
– 3 x 32 = 96 CPUs
– 3 x 256 = 768 GB memory
• CPU and memory can be over-committed
• CPU over-commitment factor: 1.34 (1.78 if one physical host fails)
• VMs must wait for physical CPUs to be scheduled
Wait time for an 8-vCPU VM = 2 x wait time for a 4-vCPU VM
• Mitigation:
a) Reduce allocated virtual CPUs, or
b) Increase physical CPUs
• Memory allocation factor: 0.44 (0.59 if one physical host fails)
• Reserved and locked memory prevents HA failover
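The over-commitment factors on this slide are ratios of allocated to physical resources, with and without one host. The sketch below reproduces them under stated assumptions: 32 CPUs and 256 GB per physical host (what the 128 CPU and 1,024 GB totals imply for 4 hosts), and allocated totals of 171 vCPUs and 450 GB. The allocated totals are not given in the deck; they are back-calculated from the quoted factors.

```python
def allocation_factor(allocated: float, per_host: float, hosts: int) -> float:
    """Ratio of allocated virtual resources to physical capacity across `hosts`."""
    return allocated / (per_host * hosts)

CPUS_PER_HOST = 32       # 128 CPUs over 4 physical hosts
MEMORY_GB_PER_HOST = 256 # 1,024 GB over 4 physical hosts
ALLOCATED_VCPUS = 171    # assumption: back-calculated from the 1.34 factor
ALLOCATED_MEMORY_GB = 450  # assumption: back-calculated from the 0.44 factor

for hosts in (4, 3):  # all hosts up, then one host failed
    cpu = allocation_factor(ALLOCATED_VCPUS, CPUS_PER_HOST, hosts)
    mem = allocation_factor(ALLOCATED_MEMORY_GB, MEMORY_GB_PER_HOST, hosts)
    print(hosts, round(cpu, 2), round(mem, 2))
```

A CPU factor above 1 means virtual CPUs compete for physical ones, which is the source of the wait time noted above; a memory factor below 1 means memory is not over-committed even with one host down.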
22. Infrastructure – VM tuning
DC  Role                                             vCPU  Peak    Average  Calculated  Recommended  Change
A   Web, Query, Admin                                8     187.55  37.03    2           4            -4
B   Web, Query, Admin                                8     621.88  92.69    8           8            0
A   Crawl, Analytics, Content, CEWS, Central Admin   8     724.35  210.59   8           8            0
B   Crawl, Analytics, Content, CEWS, Symbolic Links  8     724.56  198.44   8           8            0
A   Index 0, Content, CEWS                           8     486.18  62.55    6           6            -2
B   Index 0, Content, CEWS                           8     520.63  63.98    6           6            -2
A   Index 1, Content, CEWS                           8     547.08  69.30    6           6            -2
B   Index 1, Content, CEWS                           8     546.44  91.74    6           6            -2
A   Index 2, Content, CEWS                           8     491.38  65.60    6           6            -2
B   Index 2, Content, CEWS                           8     532.01  77.83    6           6            -2
A   Index 3, Content, CEWS                           8     540.45  78.72    6           6            -2
B   Index 3, Content, CEWS                           8     621.88  92.69    8           8            0
A   Distributed Cache                                4     91.71   5.99     2           2            -2
B   Distributed Cache* (added later)                 -     -       -        -           -            -
    Total                                            100                    78          80           -20
Peak and average CPU usage is calculated over 30 days
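The deck does not say how the Calculated column was derived. One rule that happens to reproduce every row is: treat Peak as a percentage where 100 equals one fully used core, round up to whole cores, then round up to an even number of vCPUs. The sketch below encodes that guess; both the rule and the function name are assumptions.

```python
import math

def calculated_vcpus(peak_percent: float) -> int:
    """Guess at the 'Calculated' column: whole cores needed at peak, rounded up to an even count."""
    cores = math.ceil(peak_percent / 100)
    return cores + (cores % 2)

# (peak, calculated) pairs taken from the table.
ROWS = [(187.55, 2), (621.88, 8), (724.35, 8), (486.18, 6), (520.63, 6), (91.71, 2)]

for peak, expected in ROWS:
    print(peak, calculated_vcpus(peak), expected)
```

The Recommended column differs from this value in one row (the first web server keeps 4 vCPUs rather than 2), which suggests a manual adjustment on top of the calculation.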
23. Summary
1. Indexing thousands of content sources
2. Automation for rapidly changing index requirements
3. Sizing the infrastructure for performance and HA