The document outlines LOUIS's disaster recovery process: nightly backups of data to an off-site server, nightly synchronization of data to a redundant "hotsite" server, and the steps for switching production services to the hotsite server during an outage (preparing servers, stopping services, and reversing server roles).
1. KEEP CALM AND BREATHE
DURING DISASTER RECOVERY
Marcy Stevens, Library Consortium Analyst
LOUIS: The Louisiana Library Network
marcy@lsu.edu
Follow along at:
http://tinyurl.com/stevens-cosugi
2.
3. Marcy Stevens
• LOUIS: The Louisiana Library Network
• 33 consortium members use SirsiDynix Symphony
• SDS runs on an AIX server
• Production, Training, BETA and Hotsite servers
• 33 separate instances of Symphony on each server, not 33 servers
• Apache is running a single instance with virtual hosts defined for
each institution's eLibrary interface, EZproxy, and other
administrative needs
• Each member has its own set of databases (MARC records, items,
users, charges, orders, serial control, holds, etc.)
• All share a single set of binaries (the Unicorn/Bin directory); they also share
eLibrary delivered pages, eLibrary helps, and XXX
4. Other critical staff
• LOUIS Staff
• System Administrators at each LOUIS site
• UNIX Support
• Louisiana Tech IT staff
• DNS (ipcontrol) support staff
• Network operations (24/7 support)
5. LOUIS System Configuration
• 8 x PowerPC_POWER5 (CPUs)
• 64 GB memory - at maximum capacity
• Local disk - 2 x 73.4 GB – mirrored
• External disk - 2 x 2.25 TB Xserve RAID
• 4 x PowerPC_POWER5 (CPUs)
• 32 GB memory - at maximum capacity
• Local disk - 2 x 73.4 GB – mirrored
• External disk - 2 x 2.25 TB Xserve RAID
Hotsite / Production
6. LOUIS System Configuration
• 2 x PowerPC_POWER5 (CPUs)
• 8 GB memory
• Local disk - 2 x 140 GB – mirrored
• External disk - 2 x 4.25 TB IBM FasT RAID
• 4 x PowerPC_POWER5 (CPUs)
• 6 GB memory
• Local disk - 2 x 140 GB – mirrored
• External disk - 2 x 4.25 TB IBM FasT RAID
Beta / Training
7. Factors that led us to develop an
organized DR process
• South Louisiana is prone to hurricanes during hurricane
season, which runs from June through November
• LOUIS headquarters are housed at LSU, close to the
Mississippi River, so a levee breach during severe, heavy
rains and flooding is a real possibility
• Any other concern that comes up, whether caused by Mother
Nature, hardware, or software
8. Two successful tests…one on the way!
• August 2009
• April 2012
• June 2014
• During each test we switched to our hotsite server and
used it in production capacity for two days
• With each test we were able to tweak the process and
streamline the steps
• Sites are down for a minimal time (approx. 4-6 hours total
over three days)
9. Nightly backup procedures:
• Every night Symphony data is backed up to an off-site
server using IBM’s Tivoli Storage Manager (TSM)
• The data is also sync’d every night to our hotsite
Symphony server in North Louisiana
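The nightly sync described above can be sketched as one rsync call per member site. This is only an illustration, not the LOUIS script: the site names, the data root, the hotsite host name, and the choice of the `-a` and `--delete` options are all assumptions, and the sketch only builds the command lines without running them.

```python
import shlex
from typing import List


def build_rsync_command(site: str, data_root: str, hotsite_host: str) -> List[str]:
    """Return the rsync argument list that mirrors one site's data directory."""
    # Trailing slash on the source: copy the directory contents, not the directory itself.
    source = f"{data_root}/{site}/"
    target = f"{hotsite_host}:{data_root}/{site}/"
    # -a preserves permissions and timestamps; --delete removes files on the
    # target that no longer exist on the source (an assumed policy).
    return ["rsync", "-a", "--delete", source, target]


if __name__ == "__main__":
    # Hypothetical site names, path, and host, for illustration only.
    for site in ["site_a", "site_b"]:
        print(shlex.join(build_rsync_command(site, "/data/symphony", "hotsite.example.org")))
```

Building the argument list separately from executing it makes it easy to review the exact commands before a cron job is allowed to run them.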
10. Making the switch from production
to hotsite
• If possible, contact the appropriate people weeks in
advance
• Night before:
• Change our internal config file to $START_SITES=NO
• Morning of the switch:
• Make sure all services are stopped
• Make sure backups and rsyncs are complete with no errors
• Make sure sites are indeed halted
• Stop Apache
• Prepare the production server to be sure that it comes up as the
hotsite server after the switch
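The morning-of checks above lend themselves to a small go/no-go script. The sketch below is a hypothetical illustration: the rule that any log line containing "error" counts as a failure is an assumption (the real backup and rsync log formats are not shown in these slides), and the individual checks are placeholders.

```python
from typing import Callable, Dict, List


def log_has_errors(log_text: str) -> bool:
    """Treat any line containing 'error' (case-insensitive) as a failure."""
    return any("error" in line.lower() for line in log_text.splitlines())


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that did not pass."""
    return [name for name, check in checks.items() if not check()]


if __name__ == "__main__":
    sample_log = "rsync started\nrsync finished\n"  # made-up log text
    checks = {
        "services stopped": lambda: True,  # placeholder result
        "backup and rsync logs clean": lambda: not log_has_errors(sample_log),
        "sites halted": lambda: True,  # placeholder result
    }
    failures = failed_checks(checks)
    print("OK to proceed" if not failures else f"Stop: {failures}")
```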
11. Prepare the production server to
come up as the hotsite server
• Create the zero-byte file sirsi.nostart. The /etc/rc.local
script is consulted upon reboot and is designed to skip
the auto_haltrun cycle step when the sirsi.nostart file is
present.
• Prepare root and sirsi cron - cp vs. crontab command
• Set maxreports to 0
• Change the internal config file so that it does not back up
to TSM, does not rsync to hotsite, and does not joinvg
• Reverse the production and hotsite IP addresses
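The sirsi.nostart mechanism is a flag-file pattern: the presence of an empty file tells the boot script to skip a step. A minimal sketch of that pattern follows. The function names and the directory argument are hypothetical (the slides do not say where sirsi.nostart lives), and the real check is done inside /etc/rc.local rather than in Python.

```python
import tempfile
from pathlib import Path

FLAG_NAME = "sirsi.nostart"


def set_nostart(flag_dir: Path) -> Path:
    """Create the flag file (zero bytes if it did not already exist)."""
    flag = flag_dir / FLAG_NAME
    flag.touch()
    return flag


def clear_nostart(flag_dir: Path) -> None:
    """Delete the flag file; do nothing if it is already gone."""
    (flag_dir / FLAG_NAME).unlink(missing_ok=True)


def should_start_sites(flag_dir: Path) -> bool:
    """Mirror the boot-time decision: start sites only when no flag file exists."""
    return not (flag_dir / FLAG_NAME).exists()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        set_nostart(d)
        print("start sites after reboot?", should_start_sites(d))
        clear_nostart(d)
        print("start sites after reboot?", should_start_sites(d))
```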
12. Prepare the hotsite server to
become the production server
• Delete the sirsi.nostart file
• Prepare root and sirsi cron
• Change maxreports to 1
• Set the internal config file to not rsync
• Run an in-house script that blocks the workflows ports
from our sites temporarily
• Start Apache
• Cycle force sites
• Shut down the original production machine
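The two preparation slides amount to moving each server from one set of settings to another. One way to keep track of that is to record the settings per role and list what has to change. In this sketch the class and field names are hypothetical; only the values (sirsi.nostart present or absent, maxreports 0 or 1, rsync off) come from the steps above.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RoleSettings:
    nostart_file_present: bool
    maxreports: int
    rsync_to_hotsite: bool


# Server that will come up in the hotsite role after the switch.
STANDBY = RoleSettings(nostart_file_present=True, maxreports=0, rsync_to_hotsite=False)
# Server that takes over production while the original machine is shut down.
ACTIVE_DURING_SWITCH = RoleSettings(nostart_file_present=False, maxreports=1, rsync_to_hotsite=False)


def differences(current: RoleSettings, target: RoleSettings) -> Dict[str, Tuple[object, object]]:
    """Return {setting: (current value, target value)} for settings that must change."""
    cur, tgt = asdict(current), asdict(target)
    return {name: (cur[name], tgt[name]) for name in cur if cur[name] != tgt[name]}


if __name__ == "__main__":
    for name, (old, new) in differences(STANDBY, ACTIVE_DURING_SWITCH).items():
        print(f"{name}: {old} -> {new}")
```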
14. Back up log files as a precautionary
measure
• Script runs on the active production server via a cron job
• Copies logs from the specified directories to the training
server in case of data loss
• /Marcimport/Bibbackup/
• /Marcimport/Bibwork/
• /Logs/Hist/
• /Logs/Report/
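A log backup like the one above could be sketched as follows. This is not the in-house script: it assumes the training server's storage is reachable as an ordinary filesystem path (the slides do not say how the copy is made), and the root directories passed in are hypothetical. Only the four directory names come from the list above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

# Directory names from the slide, treated here as relative to a root directory.
LOG_DIRS = ["Marcimport/Bibbackup", "Marcimport/Bibwork", "Logs/Hist", "Logs/Report"]


def backup_logs(source_root: Path, dest_root: Path, subdirs: List[str] = LOG_DIRS) -> List[Path]:
    """Copy each existing log directory under dest_root and return the copied paths."""
    copied = []
    for sub in subdirs:
        src = source_root / sub
        if not src.is_dir():
            continue  # skip directories that do not exist on this server
        dst = dest_root / sub
        # dirs_exist_ok lets repeated cron runs overwrite an earlier copy.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(dst)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        (Path(src) / "Logs/Hist").mkdir(parents=True)
        (Path(src) / "Logs/Hist/sample.hist").write_text("sample entry\n")
        print(backup_logs(Path(src), Path(dst)))
```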
15. Prepare the production server to
become the hotsite server again
• Bring hotsite up
• Night before the switchback: set start sites to no and set
rsync to yes on production
• Morning of the switchback: essentially reverse the steps above