Stakeholder Update, April 14: Data Center Outage
1. Data Center Network Preventative Maintenance
(a.k.a. DC Outage)
June 4, 2016
Project Update
Thursday, April 14, 2016, 12:00 PM, Lamont Library
2. Agenda
• Meeting Purpose and Intended Outcomes
• Network Bypass Plan…
• NetApp Solution Identified…
• New Risk Identified…
• Altered Approach…
• Project Update
• Next Steps
• Communication
• Draft June 4 Plan
3. Meeting Purpose and Intended Outcomes

Purpose
To provide an update on the current state of project preparation activities
and to review the communication plan for the scheduled data center network
preventative maintenance (data center outage) on the weekend of June 4.

Intended Outcomes
• Provide an overview of the changes to the plan
• Provide an update on the action items from the last Service Delivery Subgroup
• Share next steps and actions
• Provide an update on the communication approach
• Share the draft “day-of” timeline
4. Network Bypass Plan…

The goal is to maintain current service levels for data center networking
and to avoid unplanned application outages. The original plan was to
accomplish this by temporarily changing network routing and migrating
critical applications to non-impacted storage.

The current plan is to temporarily initiate a network bypass. This is key to
avoiding application outages during the cross-site switching system (VSS)
upgrade. [Diagram: high-level view of the HUIT DC networks and the bypass
router.]
5. NetApp Solution Identified…

The concept of migrating critical applications to non-impacted storage
was abandoned because the amount of storage needed to implement that
solution is significantly greater than the available Tier 1 storage.

An alternative option, connecting the VM environment directly to the
NetApp and eliminating intermediate routing, was identified, designed,
and tested. This option:
● Eliminates the risk of VMs disconnecting from standard storage
● Eliminates the need to promote applications to Tier 1 storage for the maintenance
● Eliminates the need to take down the 60 Oxford Street VM systems
● Is a permanent solution
6. A New Risk Was Identified…

The team identified an additional risk to applications that use Oracle
Real Application Clusters (RAC).
● Oracle RAC cannot tolerate a network outage of more than 30 seconds; beyond
that it will reboot, and some applications may need to be restarted
● We have identified two production application servers at 60 Oxford that
support several College and Athletics applications. These could potentially
be impacted
● We are working with the DBA, SOC, and application teams to identify
appropriate preparation activities
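The 30-second figure above matches Oracle Clusterware's default CSS "misscount" threshold: a node that misses cluster heartbeats for longer than that is evicted and reboots. A minimal sketch of that reasoning, using a hypothetical helper function that is not part of the project plan:

```python
# Hypothetical helper illustrating the risk described above. The
# threshold matches Oracle Clusterware's default CSS misscount of
# 30 seconds; a RAC node that misses network heartbeats longer than
# this is evicted (rebooted), taking its applications down with it.

MISSCOUNT_SECONDS = 30  # Oracle CSS default heartbeat-loss threshold

def eviction_risk(outage_seconds: float, misscount: int = MISSCOUNT_SECONDS) -> bool:
    """Return True if a network outage of the given length would exceed
    the heartbeat threshold and risk an Oracle RAC node eviction."""
    return outage_seconds > misscount

# A bypass cutover blip well under 30 s is tolerable; a longer outage
# risks a node reboot and application restarts.
print(eviction_risk(10))  # -> False (short blip during cutover)
print(eviction_risk(45))  # -> True  (sustained outage)
```

This is why the plan takes the affected College and Athletics servers down for the duration of the network activity rather than riding out the blip.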
7. Altered Approach…

The goal is to maintain current service levels for data center networking
and to avoid unplanned application outages by temporarily bypassing
network routing and permanently implementing a network modification.

[Diagram: temporary bypass in place; router removed permanently.]
8. Project Update

Since the last Service Delivery Subgroup on April 7, the following
progress has been made against the requested action items:

Action 1: Investigate re-architecting standard storage-to-server connectivity alternatives
Description: Investigate the feasibility of creating a new network and moving impacted storage connected to the VM environment.
Status: Complete
Update/Next Steps: The routed network connection between the DC primary switch and the NetApp is being replaced with a direct connection.

Action 2: Contact vendors for on-site support
Description: Contact and schedule key vendors from EMC, NetApp, Cisco, and VMware.
Status: In progress
Update/Next Steps: EMC: Complete; NetApp: In progress; Cisco: In progress; VMware: In progress.

Action 3: Schedule vendor pre-event preparation meeting
Description: Bring all vendors on site to review our plans and activities for the event.
Status: May
Update/Next Steps: Planning underway.

Action 4: Contact ATS release management for impacted Enterprise Apps
Description: Work with Miroslav at ATS to coordinate release management for any impacted enterprise applications.
Status: In progress
Update/Next Steps: Met with the ATS team on March 14. Next step is to share the new Oracle RAC server impact list with College and Athletics.
9. Project Update

Since the last Service Delivery Subgroup on March 24, the following
progress has been made against the requested action items:

Action 5: Decommission FAS UNIX WebMail
Description: Decommission the on-premise WebMail environment to prevent outage impact and retire legacy hardware (UC-run project).
Status: In progress
Update/Next Steps: Underway.

Action 6: Identify mitigation for the Oracle RAC impact
Description: The identified impact is to a few servers that support College and Athletics.
Status: April
Update/Next Steps: Contact Diane Stronach to confirm the previous OK to take down the associated servers/applications.

Action 7: Contact SEAS to determine action (SEAS manages the servers)
Description: Discuss the impact to file shares (p.parker and spidey) due to the bypass network blip; applications could potentially freeze and need to be restarted.
Status: April
Update/Next Steps: Contact the SEAS team to discuss.

Action 8: Determine communication path for IceMail-impacted users
Description: OPEN ITEM: At the March 24 Service Delivery meeting, the question of local departments communicating the June 4 impact was raised.
Status: Open
Update/Next Steps: To be determined.
10. Next Steps

The following activities are the next steps in preparing for the June 4
maintenance:

Action 1: Finalize scope of application shutdown (as required)
Description: (1) Create a list of impacted applications on the DC Oracle RACs and virtual guests at Summer St. that are at risk during the bypass connection and disconnection; the impact would require a manual remount effort, which is significant. (2) Identify application owners and determine the strategy for June 4. (Likely: SEAS, IceMail, College & Athletics, others TBD.)
Status/Mitigation: In progress (Target: April 22)

Action 2: Review impact to Campus Services
Description: Understand the impact and propose implementation support needs for building controls (if needed).
Status/Mitigation: In progress (Target: April)

Action 3: Implement network solution that avoids a 60 Oxford VM outage (RISK)
Description: Implement the connectivity between the virtual hosts (VMs) and the NetApp storage, eliminating the risk of dropping VMs during the maintenance.
Status/Mitigation: 90% complete (End of April)

Action 4: Finalize “day-of” events action plan
Description: Create the implementation sequence and complete a full review of “day-of” events.
Status/Mitigation: April 2016
11. Communication Plan

Date | Goal | Audience
Description of Communication

March 7-11 | HUIT Leadership Comm. | SLT
● HUIT communication of the new June date (HUIT Senior Leadership validates the June date and provides feedback)

April 5 | Comm. with leaders inside and outside of HUIT to validate the maintenance date | Leadership across the University
● Communication with key leaders who depend on HUIT services. Announce the change and share the communication PowerPoint.
● The list includes: School CIOs, HUIT Senior Leadership, Practice Managers, FAS/CA major stakeholders, and Critical 1 & 2 application technical and functional owners.

April 18 | First targeted comm. | Owners of servers, databases, and applications
● Initial communication of the targeted June date, highlighting the impact and details of the maintenance
● Reference the brown bag communication sessions

April 12 - April 21 | Stakeholder brown bag sessions | System owners / IT staff
● Brown bag sessions for stakeholders
● Communicate with NetApp and file share technical contacts to determine availability impact and requirements

April 19 | “Day-of” events | System owners / IT staff
● Continue to receive feedback from stakeholders and business owners to fine-tune the “day-of” events
● Share the “day-of” plan

April 28 | Go-live readiness | SLT
● SLT final review and readiness validation for the update

May 3 - June 3 | Weekly reminders | All populations
● Reminders sent to all populations

June 4 | All-clear message | Leadership / system owners
● All-clear message sent to business stakeholders and the leadership vetting list after the maintenance is declared finished. No all-clear to full account-holder populations unless clarification is needed.

March-April | Engage Support Services | HUIT Support Services
● Conduct a working session with Support Services to outline coverage and needs during implementation

This table represents the proposed communication plan to inform the community
of the network change and potential impacts to applications and services.
12. June 4 Activity Sequencing: Network Bypass & VSS Upgrade

Step 1 | Jun 3, 8:00 - 11:00 PM (estimated) | Prep activities | SOC & self-service IT partners
1. Shut down any Dev/Test/other services as determined by infrastructure investigation and/or IT client request.
2. Self-managed IT partners are welcome to shut down services in preparation for the network activity.
3. Reschedule any impacted system backups.

Step 2 | May 23 - Jun 3 | Prep activities | Network
1. Prepare hardware and cable connections.
2. Verify bypass router readiness.
3. Hand off to the systems teams.

Step 3 | Jun 3, 11:00 - 11:59 PM | Take down: (1) C&A apps on Oracle RAC, (2) IceMail, (3) SEAS apps | (1) DBAs, SOC; (2) SOC (Windows); (3) SEAS IT
1. College and Athletics have several servers on the Oracle RAC that could be affected by the network connection to the bypass router. It has been agreed that these servers will be down for the duration of the network activity.
2. IceMail serves high-profile users. It will be brought down prior to the bypass and back up once the network is on bypass.
3. SEAS IT will make the call and initiate action for their services on the file shares (spidey / p.parker).
4. Hand off to the network team.

Step 4 | Jun 4, 00:00 - 01:00 AM | Enable network bypass | Network
1. Enable network traffic to circumvent the cross-data-center switching mechanism.
2. The network team connects the bypass and disconnects the VSS.
3. The network team verifies traffic is transiting the bypass.
4. Notify the server teams when complete.

Step 5 | Jun 4, 01:05 - 03:00 AM | (1) Upgrade VSS; (2) bring up services from Step 3 | (1) Network; (2) SOC
1. Install and configure the VSS.
2. Bring up services taken down prior to the bypass implementation: (a) IceMail, (b) others TBD with service owners.

Step 6 | Jun 4, 03:10 - 04:10 AM | Repeat Step 3 | Same as Step 3
● Same as Step 3.

Step 7 | Jun 4, 04:15 - 04:45 AM | Reverse network bypass | Network
13. June 4 Activity Sequencing: 60 Oxford DC Primary Switch Patch

Step A | Jun 4, 06:10 - 08:00 AM | Implement security software upgrade | Network
1. Maintain network connectivity.
2. Upgrade backup devices, test, switch traffic.
3. Upgrade the primary device, test, switch traffic.
4. Notify the SOC.

Step B | Jun 4, 06:10 - 08:00 AM | Test apps that were down | DBAs, self-service customers
● Confirm applications are functioning normally.

Step C | Jun 4, 07:00 - 09:00 AM | Verify applications | DBAs, application teams
● Confirm applications are functioning normally.

Step D | Jun 4, 09:00 - 10:00 AM | Stabilization period | Network
● Communicate activity status.
● Staff remain on site for one hour after the last application verification.
● Continue to monitor maintenance channels.

Step E | Jun 4, 10:00 AM | Maintenance window closes | Team
● Communicate the “all clear.”
● Announce the window closed.
● Release infrastructure & network staff.

Communication Manager: Noah Selsby
Project Manager: Vicki Hall
16. Project Update

Since the last Service Delivery Subgroup on March 10, the following
progress has been made against the requested action items:

Action 1: Schedule weekly executive check-ins
Description: Schedule a weekly project check-in with the CTO and key project team members.
Status: Complete
Update/Next Steps: Scheduled weekly on Wednesdays at 10:30 AM.

Action 2: Produce and share list of impacted executive leadership and technical leadership
Description: Document the contact list for key stakeholders and share it with the Service Delivery Subgroup.
Status: Complete
Update/Next Steps: Included as a handout; distribution list built.

Action 3: Provide communication mechanism for “day-of” transition management
Description: Identify and communicate formal channels for reporting against the plan during the maintenance window.
Status: Complete
Update/Next Steps: Plan to open bridges (partner and technical) and send emails throughout the maintenance that align to updates on the wiki.

Action 4: Meet with Support Services to discuss outage planning
Description: Hold an initial meeting with Support Services to review application impact and support needs.
Status: Complete
Update/Next Steps: Confirm additional Help Desk resource needs based on known application impact.