BP108 Worst Practices…. Back from the depths of despair


                           Bill Buchan / HADSL
                           Paul Mooney / Bluewave Technology




       © 2013 IBM Corporation


Monday, 4 February 13
STAND UP

I
State your name....
Now your real name....
Pledge solemnly to a deity, non-deity...or possibly spaghetti
To fill out an evaluation for this session
And to fill it out in full
And to buy a beer for the person to my left
Even though I never really liked that person
After that incident.. that time
But it’s best we don’t talk about it anymore...
Because it gets kinda uncomfortable
SO SAY WE ALL!

Paul Mooney
     § Geek
          –Lotus software since R2
          –Symantec Authorised Consultant
          –Google Certified Deployment Specialist


     § Speaker, Author, Blogger, jogger, biker
          –www.pmooney.net


     § Bluewave Technology
          –26 staff
          –Operate globally




Bill Buchan
      § HE’S BACK BABY YEAH!!!


      § Geek
          –cc:Mail
          –Enterprise level Domino consultant since 1995
          –Dual PCLP in v3, v4, v5, v6, v7, v8 and v8.5


      § Speaker, Blogger, Biker
          –http://www.billbuchan.com


      § hadsl
          –IBM BP ISV focused on federated identity
           management
          –http://www.hadsl.com




Let’s get Legal!!
     ● This slide presentation may contain the following
       copyrighted, trademarked, and/or restricted terms:
     ● IBM® Lotus® Domino®, IBM® Lotus® Notes®, IBM
       Lotus Symphony®, LotusScript®
     ● Microsoft® Windows®, Microsoft Excel®, Microsoft
       Office®
     ● Linux®, Java®, Adobe® Acrobat®, Adobe Flash®
     ● Your mileage may vary
     ● This is “Technology Light”
         ● Consider it a rest!
     ● Fill out the evaluations
     ● @IF(enjoy;"buy us beer";"buy us beer")
     ● Try to never be a story in this presentation
     ● Today is “destroy all end users” day
     ● No.. really it is


What is this session about?

    § Mistakes are made by everyone
          –How do you deal with them?
          –Blame?
          –Ignore?
          –Denial?


    § Large or small
          –Enterprise or SMB


    § Know your environment
          –Especially if you inherit it


    § Prevention beats the hell out of the cure
          –As you will see



Agenda

    § For each case study, we shall
          –Look at the errors
          –Diagnose the problem
          –Determine the problem
          –How was it resolved
          –What lessons can be learned
    § We have 10 case studies. All new
          –And 2 of our personal old favourites
    § We cover both infrastructure and development…


    § All true
          –Seriously, you can’t make these up....




Case Studies

   Paul                      Bill

   Archiving “issues”        Designer Disaster!
   Follow the white rabbit   Router Routed
   Does it blend?            Where’s John
   I have a cunning plan     BYOD hell
   (Un)Happy New Year        Server moves to Hell




Story 1 - Designer Disaster




The Story

    § Large multinational Site
          –Over 40k users
          –30+ sites


    § Monday morning, the helpdesk exploded
          –Hundreds and hundreds of calls

             ‘I’m not getting new mail!’


    § Happening across all clusters
    § Happening on mobile devices




The Investigation

    § Checked servers
    § Checked mail routing
    § Checked replication


    § Checked target mailfiles
          –ACL - okay
          –Template - okay




          –Checked Database Properties...




The Cause

    § The ‘Don’t Maintain Unread Marks’ flag was set




    § But how? This problem affected hundreds of users!
    § We then checked the template.. And found that the flag had been set there
    § Designer task ran each night
    § Mailfile inherits from template - including this flag


The Resolution

    § We tracked down the developer who touched the template
          –And shot him


    § We unset the flag on the template
          –Made sure it had replicated around


    § Re-applied the template to the affected user mail databases




Lessons Learned

    § Unread marks are stored in a database table
          –Has worked quite well since 6.0.2 / 6.5.1


    § Uncontrolled changes to templates can quickly cause large scale issues


    § Designer task does NOT have to run every night
          –Running it on demand gives you control
          –Paul totally disagrees with this point......


    § Little things can easily cause large scale issues


    § Ensure that all members of all affected teams understand how to prevent this issue
          –Use it as a blunt weapon to ensure change control processes are adhered to



Story 2 - Archiving issues...




The Story

    § Large multinational site
          –Over 40k users
          –Over 40 countries


    § One region’s mail servers low on disk space
          –4GB left of 1TB data drive


    § Aggressive archiving apparently in place
          –Data moved on schedule
            • Older, large size archive server
          –Mail over 90 days moved using server-server archiving




The Investigation

    § Check mail server settings


    § Program documents
          –Compact -a


    § Check archive server
          –Cannot connect
          –Apparently firewall restricts admin client subnet access to archive?


    § Check logs on mail servers
          –They are having issues too


    § No RDP access
    § No ICMP response (apparently firewall)

The Cause

    § The archive server was down
          –For 8 weeks


    § Ran out of disk space
          –Attempted restore of entire archive directory accidentally on same server


    § Nobody noticed
          –Nobody can access archive server from admin subnet client IPs anyway




The Resolution

    § *very carefully*
    § Disconnect archive server from network
    § Replace directory and key system databases
    § Bring up and check consistency
    § Add to network
    § Test archiving


    § But...there’s more




The Resolution

    § VPN’d in on Friday evening
    § No RDP access to box
    § So.. request access




     § Support gave me access
     § By adding my account to GlobalDomainAdmin group




Lessons Learned

    § All issues here caused by laziness


    § Check your servers are up, daily
          –Monitor your servers?


    § Have a restoration process for data


    § Don’t hand admin rights out to people “as needed”
          –I don’t care how much you like them!


    § Be “PIRK”Y
          –Purge Interval Replication Control
          –See adminblast deck



Story 3 - Router Rooted




The Story

    § Large multinational
          –30k + users
          –50+ sites
    § A large mailserver crashed
          –Thousands of users affected
          –Auto-restart enabled - restarted the server
          –Took 40 minutes
    § It crashed again
          –It restarted it again
    § It crashed again
          –Auto-restart decided it had enough
          –Manually Restarted
    § It crashed again
    § NSDs indicated that Router was crashing


The Investigation

    § Console log - no issue
    § Log files - no issue
    § Transaction log - no issue


    § NSD analysis concluded that the router task was crashing
          –Whilst running LotusScript?




          –LotusScript ???


    § We noticed a new agent...




The Cause

    § Someone had created a new ‘Before Mail Delivery Agent’ in the mail template
          –Designer task enabled
          –All users got the new agent


    § Was it tested?
          –ummm... Yes?


    § A ‘Before Mail Delivery’ agent is run by the Router when it delivers mail to the user’s
       mailbox
          –Very handy hook point for some automated processes
          –Documentation states that this agent has GOT to be quick


    § This agent tried to open a remote database and log the message
          –Thousands of mail messages meant that the router task could not keep up
          –Crashed the server


The Resolution

    § We re-educated the developer
          –With a bat
    § Increased change control around the template. Again.


    § Removed the ‘before mail delivery’ agent
    § Refreshed all user templates




Lessons Learned

    § Change Control
    § Before Mail Delivery agents have to be fast
          –Try not to open remote databases on each message being delivered
          –Over a 400ms wide area network
          –With 64kb/s bandwidth
    § Testing
          –No. Real Testing. Large Scale Testing.
          –Use ‘Agent Profiling’ to give you an idea of the total time it’ll take to run
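
    Some back-of-envelope arithmetic for the 400ms link above. This is a rough sketch, not a model of how the Router actually schedules work: it assumes messages are handled one at a time by a single delivery thread and that each message waits on a fixed number of WAN round trips. The figure of 10 round trips is an assumption for illustration only.

```python
def max_messages_per_hour(round_trip_ms, round_trips_per_message=1):
    """Upper bound on messages one delivery thread can handle per hour
    when every message waits on the WAN for the given number of round trips."""
    seconds_per_message = (round_trip_ms / 1000.0) * round_trips_per_message
    return 3600.0 / seconds_per_message


best_case = max_messages_per_hour(400)        # a single round trip per message
assumed_case = max_messages_per_hour(400, 10) # assumed: 10 round trips per message

print(f"{best_case:.0f} messages/hour at best, "
      f"{assumed_case:.0f} with 10 round trips per message")
```

    Even the best case is a hard ceiling imposed by latency alone, before the 64kb/s bandwidth is taken into account.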




Story 4 - Follow the white rabbit




The Story

    § Global customer
          –5k staff globally
    § Fast moving company
          –Acquisitions / temporary projects
            • Built servers as needed
    § Mail issues
          –Delivery time taking hours
          –Some mail never delivered
            • No NDRs
    § We were asked to investigate




The Investigation

    § Ask for copy of names.nsf
    § Check
          –Connection documents
          –Configuration documents
          –Domain documents
          –NNN
    § Noticed different domain entries
          –Adjacent domain docs
          –Non Adjacent domain docs


    § Asked to vpn to site to investigate domain
          –It got interesting/emotional




The Cause

     [Diagram: three mail domains - Domain 1, Domain 2 and Domain 3]




The Cause

     [Diagram: the domain map grown to 20 domains, Domain 1 to Domain 20]



The Cause

    § At some stage...
          –Someone designed separate domains for projects
            • Separate servers
    § Agents used to add documents to primary nab
    § This became a “standard” without question
    § Nobody knew who did it first
    § Routing hell - some domains linked through 8 hops
          –Some not linked at all
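
    One way to find the "8 hops" and "not linked at all" cases is a breadth-first search over the domain links. The sketch below is illustrative only: the topology is hypothetical (not the customer's), and links are treated as two-way, which is a simplification because real connection documents are directional.

```python
from collections import deque


def hop_counts(links, start):
    """Breadth-first search over undirected domain links.
    Returns {domain: hops from start}; unreachable domains are absent."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, ()):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return hops


# Hypothetical topology: a chain of nine domains plus one island.
links = [(f"Domain {i}", f"Domain {i + 1}") for i in range(1, 9)]
all_domains = [f"Domain {i}" for i in range(1, 11)]

hops = hop_counts(links, "Domain 1")
unreachable = [d for d in all_domains if d not in hops]
```

    Here "Domain 9" sits 8 hops from "Domain 1" and "Domain 10" is unreachable. Fed with the links from a real names.nsf, the same search would show which domains need the hub treatment first.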




The Resolution

    § Designed a primary domain
          –Began a consolidation process
    § Cleaned up routing to HUB/Spoke where possible
          –Some servers could not do this
          –Ended up with four hub domains/servers until consolidation complete
    § Demo’d the directory catalog.....
    § Explained mail routing architecture




Lessons Learned

    § If there is a standard, and it is not traceable back to an “owner”
          –Question it?
          –Validate it?
    § Enable the “delayed mail” feature in configuration document on servers




Story 5 - ‘Where’s John?’




The Story

    § A multinational 30k + user site
          –70+ sites


    § Lots of critical line-of-business Notes applications
          –Been running for years


    § Overnight application processing fails
          –No monitoring
          –No-one notices for a few days
          –Ambiguous help desk calls logged


    § More instances fail
          –No-one notices


    § Finally the business explodes
The Investigation

    § We picked one application
          –No application logs
          –No way of validating critical processing had been performed
          –History of large numbers of document writes
            • But not recently
          –Checked its agents - they looked fine

          –Checked the server logs to see when they should run
            • Tried to confirm from server console when the agents last ran




    § Checked the username associated with the agent...




The Cause

    § One developer, responsible for all these applications, left at the end of his contract
          –He’d been added to the terminations group
          –All the agents he’d signed had failed to run




The Resolution

    § Create a ‘Template Signing’ ID for your organisation
          –Have the Administrators keep control of it


    § Have the administrators sign all templates going into production with this ID
          –No exceptions


    § If it fails, it’s their fault.




Lessons Learned

    § Domino Applications run for years
          –I’ve seen ones in production for 10+ years
          –They need to be monitored
             • Scheduled agents have to run!
             • Use DDM ‘agent failed’ monitor - and check the results!




      § Release control isn’t just for the SOX Audit
          –It’s for life
          –And it’ll save yours




Story 6 - Does it blend




The Story

    § Mid size site
          –1.5k users in one region
    § Recently upgraded to ND8.x on Citrix
    § Full fat version
    § Ongoing issues with personal and recent contacts
          –Everyone had everyone’s recent contacts
          –Some people have other people’s saved contacts
          –Others had no issues
    § Management going berserk




The Investigation

    § Investigate a typical user setup
    § Check location of home directories
          –Majority of users using legacy “shared network drive” for data
    § Purge the recent contacts from personal address books
          –They come back almost instantly
    § Bang head on wall
          –No effect
    § Bang head on desk
          –No effect
    § Bang head against deployment team
          –Some effect




The Cause

    § There were two issues
    § Contacts being shared...
          –All users were set up using a default copy of the client databases (NOT templates)
            • names.nsf, bookmark.nsf etc
          –These were placed in network home folder (e.g. h:\notesdata)
            • TERRIBLE IDEA
          –Some of the users were blackberry users
            • Legacy setup of blackberry for contact sharing
            • Users’ personal directories being replicated to BES server
            • BES used them as sources for contacts
          –As one user replicated their personal directory with BES server
            • All other replicas (i.e. other personal directories) replicated too




The Cause

    § Issue Two
          –Recent contacts appearing everywhere
    § Recent contacts are stored in recent contact view in personal directory
    § That data is ALSO stored..
          –<Notes data directory>\workspace\.metadata\.plugins\com.ibm.notes.dip
          –files called DIP*.SER
    § Citrix deployment was completed incorrectly
          –The plugin directory was being shared by all users
            • Being written to by all users




The Resolution

    § Issue 1
          –Remove the personal address books from the BES server
          –Set up mail policy to sync contacts with mail file
          –Remove personal directory property for each user in the BES administrator
            • Will then default to mail file contacts
          –Start project to change replica id for all personal directories


    § Issue 2
          –Promote RTFM on Citrix deployment
          –Fix Citrix deployment




Lessons Learned

    § “If it works, leave it alone”
          –Not always the best way
          –e.g. BES using replicated personal directories - very old school


    § Citrix is a great tool
          –8.5.3 supports Citrix well
          –But you need to:
            • Understand Citrix
            • Understand the Notes client
            • Read the manual




Story 7 - BYOD Hell




The Story

    § A single user, with an iShiny device


    § One morning, the phone was dead.


    § She had lost everything
          –Family pictures, contacts, text messages
          –No Backup




The Investigation

    § We looked back over the user history


          –She used to be an employee of BigCo
          –Left a number of months ago
          –Had the BigCo MDM profile and mail/PIM data pushed to her iShiny device




    § We looked at the BigCo Mobile Device Management strategy




The Cause

    § BigCo had a rather brutal and primitive MDM


    § They assumed control of the user’s own iShiny device


    § The user couldn’t pull their own data off their own phone
          –Because she wasn’t connected to the enterprise network


    § When the user left, they nuked the device




The Resolution

    § Shoot the administrator


    § We advised BigCo that they should invest in a better MDM architecture


    § We also advised them to at least warn their users that their own phones and iPads were
       rendered useless by their MDM architecture




Lessons Learned

    § ‘Bring Your Own Device’ means
          –The Users own the device
          –BigCo pushes mail to that device


    § BigCo wants to secure mail on that device


    § But there are better ways than just nuking the phone
          –For example, Traveler allows you just to nuke the Traveler data
          –Other systems create encrypted areas on the device which can be remotely nuked




Story 8 - “I have a cunning plan”




The Story

    § Small subsidiary of a large corporate company
    § Two Domino mail servers
    § One (Monday) morning
          –All mail files corrupted
          –All documents marked as Rep/Save conflicts
    § No databases outside of mail directories corrupted
    § FTI’s corrupted




The Investigation

    § Check the Domino servers
          –Program documents
          –Log files
          –Agents
          –Backup software
          –AV software
    § All good, with the exception of corruption errors


    § Retrace logs to last startup
          –Thousands of locking errors


    § Ask a few questions....




The Cause

    § The Administrator was asked to make both servers available for mail access (iNotes)
          –Only had 1 public IP address available
          –Mail files were not replicas


    § GENIUS IDEA
          –Directory links!


    § Administrator decided to map a drive from server A to mail directory on Server B
          –And map a drive from Server B to mail directory on Server A


    § Administrator created directory links on each server to the additional mail directory




The Cause

    § Last restart
          –Server A had started first, and mapped drive to Server B’s mail directory
          –Server B was trying to access mail files, locking errors occurring
            • Corruption
    § Then....
          –Administrator noticed and did restarts in other order
          –Server B started first, mapped drive to Server A’s mail directory
          –Server A was trying to access mail files, locking errors occurring

          –leading to...




The Resolution

    § Stop servers
    § Unmap mappings
    § Delete .dir directory link files
    § Get backup tape
    § Restore
    § Punish Administrator
          –Enthusiastically




Lessons Learned

    § Domino is an application/database server
    § Needs ownership of its data
    § File locking is the result of it not owning its data
          –Always causes issues
          –Backups, AV software


    § Look for the easy solution to the initial problem
          –Saying no?
          –Replicating mail files to central server?
          –Reverse Proxy?


    § Domino doesn't always have to be the solution




Story 9 - Server Moves to Hell




The Story

    § Massive corporate (we mean that...)


          –Moving server images from one physical machine to another
            • Copy data across WAN - setup identical server


    § Server gets brought up and 15 minutes later
          –Mail routing stopped working
          –Replication stopped working
          –Traveler stopped working
          –Agents stop
          –HTTP stops
          –Console log goes crazy


    § Servers still running



The Investigation

    § We opened up the directory and saw...




The Cause

    § A server migration had corrupted the transaction logs on a single server
    § They started the server with no server id
    § This transaction log corruption had resulted in the directory design being corrupted to
       look a bit different.




The real cause

    § An administrator accidentally replaced the Domino Directory design
          –with the document library
    § It replicated
          –There were no survivors....




    § Most servers kept mail routing for a while
    § Some services - such as Traveler - failed
          –The views they relied on were missing





The Resolution

    § Replace design on the directory
          –Rebuild all indexes
          –Restart the server



    § SALT ON THE WOUND
          –Even AFTER discovering...
          –Didn’t do server restarts
          –Problems continued




Lessons Learned

    § Limit your risk
          –Designated master servers for design changes
    § Accidental design replaces can happen
          –Replication requires access
          –Remove the access!
    § Honesty
          –Really - you WILL get found out...
          –Own up - you will sleep easier



    § Detailed disaster recovery processes




Story 10 - (un)Happy New Year!




The Story

    § Large site
          –Regional administration
          –6000 users in “my” region
    § First support call of this year
          –2nd January
    § Replication stopped working for application hub server
          –Nothing replicating




The Investigation

    § Replication task
          – working
    § Cluster replication
          –working
    § Network connectivity
          –working
    § Console
          –nothing obvious reported
    § Log file
          –Unusual dates listed...
    § Replication history
          –Entry dates for 1st January 2020
    § Spoke to developer



The Cause

    § Developer working over holiday season
    § Request from senior executive
          –Automated email to all users to be sent at midnight
          –Wishing them joyous tidings for the new year
    § Developer was enthusiastic
          –wrote an agent
    § Developer couldn’t test
          –Did not have admin rights to the test servers’ OS
    § BUT!
          –He did have RDP access to production servers
            • And nobody was online one night
    § Brought down Domino
          –Reset time to Dec 31st 11:58, 2012 and waited
            • It worked
    § Then...
          –Tried every year to end of 2019
The Cause

    § Brought server up each time
    § Server replicated
    § Applications updated Replication history
          –Ending in Jan 01, 2020
    § Time reset to 2012
          –Applications wouldn’t replicate




The Resolution

    § Clear every replication history
    § Rebuild view indexes
    § Slow repair
          –Will haunt you



    § Educate developer
          –with prejudice




Lessons Learned

    § Changing OS dates is bad
          –For any application server


    § Replication relies on replication history
          –Date/Time stamp based marker for last successful push of data
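
    A deliberately simplified model of why a future-dated replication history stops everything. It is not Domino's actual replicator logic; it only assumes that a replicator considers documents modified after the last recorded replication time. Document names and timestamps are made up for illustration.

```python
from datetime import datetime


def docs_to_replicate(docs, last_history):
    """Simplified model: only documents modified after the last recorded
    replication time are considered for replication."""
    return [name for name, modified in docs if modified > last_history]


# Hypothetical documents changed on 2 January 2013.
docs = [
    ("invoice-1", datetime(2013, 1, 2, 9, 0)),
    ("invoice-2", datetime(2013, 1, 2, 10, 30)),
]

healthy_history = datetime(2013, 1, 1, 23, 0)
poisoned_history = datetime(2020, 1, 1, 0, 5)  # stamped while the clock was wrong

print(docs_to_replicate(docs, healthy_history))   # both documents qualify
print(docs_to_replicate(docs, poisoned_history))  # nothing qualifies
```

    Under this model nothing changed in 2013 is "newer" than a 2020 history entry, which is why clearing every replication history was part of the fix.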




11 Hell’s agent!

    § The Story
          –A critical application sits on all servers
            • 3GB Database / 65,000 documents
            • Replicates from three global hub clusters to all spokes hourly
          –All server communication grinds to a halt
          –No Mail routing/replication
          –Application grows to 28GB
            • Masses of replication conflicts
    § The Investigation
          –Check application for design changes
          –Check replication history and schedule
          –Check server tasks
            • Sniff the bandwidth
    § Gotcha!
          –New scheduled agent



Hell’s agent!

    § The cause
          –Developer wanted to modify all documents
          –Built an all documents view
          –Wrote an agent to modify a field
          –Agent set as scheduled “every hour”
          –Set agent to run on ….

          –ALL SERVERS

          –Ran on Hub first…
          –Hub replicated with all spokes on 1-hour replication schedule
          –Then ran on all servers
          –Then continued to run and replicate for the weekend
          –4.8 Million documents per hour!
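
    Where might 4.8 million come from? The slides give the document count (65,000) and the hourly total, but not the number of servers. Assuming every server runs the agent once an hour and touches every document once, the implied server count can be worked backwards. That assumption is ours, and the result is an inference rather than a figure from the story.

```python
DOCS_IN_DATABASE = 65_000      # from the story
UPDATES_PER_HOUR = 4_800_000   # from the slide


def updates_per_hour(documents, servers, runs_per_hour=1):
    """Document updates generated per hour if every server modifies
    every document on each agent run."""
    return documents * servers * runs_per_hour


# Work backwards to the server count implied by the two figures above.
implied_servers = UPDATES_PER_HOUR / DOCS_IN_DATABASE
print(f"Implied servers running the agent: about {implied_servers:.0f}")
```

    Each of those updates then has to replicate to every other server, which is where the conflicts and the 28GB come from.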




Hell’s agent!

    § Lessons Learned
          –Developers must never change design on production systems
            • Even basic agents
          –Have separate development domain/UAT/Production domains
            • Developers should NOT have designer access on UAT/Production domains


    § Domino is very powerful, and WILL do whatever you tell it to do – no matter how stupid..



    § Never leave new code unsupervised




12 Oh, is that important?

    § The Story
          –Big site
          –Over 90 servers – 65K users
          –One Friday, all replication and routing stops
          –Starts on HUB, and quickly affects all servers
    § The Investigation
          –Check the source of the error
          –Logs, console, WAN/LAN links
          –Is server performance the problem?
    § Gotcha!
          –Checked Server consoles… and the Admin4.nsf




 It Replicates …..

Oh, is that important?

    § The Cause
          –Junior Administrator deleted LocalDomainServers group using Adminp on the HUB server
          –This replicated to all servers
          –All server-server access lost
          –Admins attempted to stop spread by disabling replication of the names.nsf file
            • Forgot admin4.nsf!
    § Resolution
          –Flush out Adminp requests
          –Manually add entries for LocalDomainServers back to all ACLs and directory documents (3
           days to recover)




Oh, is that important?

    § Lessons learned
          –Limit Administrator access to the Domino directory
          –Education, Education, Education!
          –Response procedures for Disaster recovery
          –Have a test environment that simulates your live one…




Shorties

    § Archive Transaction logging relies on the backup routine clearing the transaction log
           –No backup, no cleanup. And the transaction log fills up
    § Don’t let your platform anti-virus scan the Domino Directory
           –It doesn’t half kill performance
    § Domino requires fast disk subsystems
           –0-2ms is fast. 360ms is slow
    § Don't keep “encrypted” text in clear text
    § Please don’t open the mail template and set the Owner Name...
    § Reader/Author fields are multi-value canonicalised names
           –You can try, but NOTHING ELSE will work
           –We say this every year. And we see it every year
    § Don’t switch on ‘Anonymous Access’ on your server security document..
    § Don’t just have a replication stub as your mail template...


Shorties 2

    § Can you define ‘Return on Investment’ for this?




Summary

    § Everyone makes mistakes
    § Put systems in place to prevent the obvious ones
    § It’s how you deal with them that makes you professional
           –No Blame Culture
           –Admit soon, Admit well…
    § Major system disasters
           –Sometimes can’t be prevented
           –Are usually the combination of many small errors


    § Learn from these mistakes!




Thank you




    § Paul Mooney (pmooney@pmooney.net)
    § Bluewave
    § pmooney.net / bluewavegroup.eu




    § Bill Buchan      (bill@billbuchan.com)
    § hadsl
    § billbuchan.com / hadsl.com




Legal Disclaimer

      © IBM Corporation 2009. All Rights Reserved.
      The information contained in this publication is provided for informational purposes only. While efforts were made to
      verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without
      warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and
      strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out
      of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is
      intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or
      licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
      References in this presentation to IBM products, programs, or services do not imply that they will be available in all
      countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change
      at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a
      commitment to future product or feature availability in any way. Nothing contained in these materials is intended to,
      nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales,
      revenue growth or other results.
      IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and
      Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or
      both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.


      Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries,
      or both.
      Other company, product, or service names may be trademarks or service marks of others.


Connections Lotusphere Worst Practices 2013

  • 1. BP108 Worst Practices…. Back from the depths of despair Bill Buchan / HADSL Paul Mooney / Bluewave Technology © 2013 IBM Corporation Monday, 4 February 13
  • 2. STAND UP 2 Monday, 4 February 13
  • 3. I 3 Monday, 4 February 13
  • 4. State your name.... 4 Monday, 4 February 13
  • 5. Now your real name.... 5 Monday, 4 February 13
  • 6. Pledge solemnly to a deity, non-deity...or possibly spaghetti 6 Monday, 4 February 13
  • 7. To fill out an evaluation for this session 7 Monday, 4 February 13
  • 8. And to fill it out in full 8 Monday, 4 February 13
  • 9. And to buy a beer for the person to my left 9 Monday, 4 February 13
  • 10. Even though I never really liked that person 10 Monday, 4 February 13
  • 11. After that incident.. that time 11 Monday, 4 February 13
  • 12. But it’s best we don’t talk about it anymore... 12 Monday, 4 February 13
  • 13. Because it gets kinda uncomfortable 13 Monday, 4 February 13
  • 14. SO SAY WE ALL! 14 Monday, 4 February 13
  • 15. Paul Mooney § Geek –Lotus software since R2 –Symantec Authorised Consultant –Google Certified Deployment Specialist § Speaker, Author, Blogger, jogger, biker –www.pmooney.net § Bluewave Technology –26 staff –Operate globally 15 Monday, 4 February 13
  • 16. Bill Buchan § HE’S BACK BABY YEAH!!! § Geek –cc:Mail –Enterprise level domino consultant since 1995 –Dual PCLP in v3, v4, v5, v6, v7, v8 and v8.5 § Speaker, Blogger, Biker –http://www.billbuchan.com § hadsl –IBM BP ISV focused on federated identity management –http://www.hadsl.com 16 Monday, 4 February 13
  • 17. Let’s get Legal!! ● This slide presentation may contain the following copyrighted, trademarked, and/or restricted terms: ● IBM® Lotus® Domino®, IBM® Lotus® Notes®, IBM Lotus Symphony®, LotusScript® ● Microsoft® Windows®, Microsoft Excel®, Microsoft Office® ● Linux®, Java®, Adobe® Acrobat®, Adobe Flash® ● Your mileage may vary ● This is “Technology Light” ● Consider it a rest! ● Fill out the evaluations ● @IF(enjoy;"buy us beer";"buy us beer") ● Try to never be a story in this presentation ● Today is “destroy all end users” day ● No.. really it is 17 Monday, 4 February 13
  • 18. What is this session about? § Mistakes are made by everyone –How do you deal with them? –Blame? –Ignore? –Denial? § Large or small –Enterprise or SMB § Know your environment –Especially if you inherit it § Prevention beats the hell out of the cure –As you will see 18 Monday, 4 February 13
  • 19. Agenda § For each case study, we shall –Look at the errors –Diagnose the problem –Determine the problem –How was it resolved –What lessons can be learned § We have 10 case studies. All new –And 2 of our personal old favourites § We cover both infrastructure and development… § All true –Seriously, you can’t make these up.... 19 Monday, 4 February 13
  • 20. Case Studies Paul Bill Archiving “issues” Designer Disaster! Follow the white rabbit Router Routed Does it blend? Where’s John I have a cunning plan BYOD hell (Un)Happy New Year Server moves to Hell 20 Monday, 4 February 13
  • 21. Story 1 - Designer Disaster 21 Monday, 4 February 13
  • 22. The Story § Large multinational Site –Over 40k users –Over 30+ sites § Monday morning, the helpdesk exploded –Hundreds and hundreds of calls ‘I’m not getting new mail!’ § Happening across all clusters § Happening on mobile devices 22 Monday, 4 February 13
  • 23. The Investigation § Checked servers § Checked mail routing § Checked replication § Checked target mailfiles –ACL - okay –Template - okay –Checked Database Properties... 23 Monday, 4 February 13
  • 24. The Cause § The ‘Don’t Maintain Unread Marks’ flag was set § But how? This problem affected hundreds of users! § We then checked the template.. And found that the flag had been set there § Designer task ran each night § Mailfile inherits from template - including this flag 24 Monday, 4 February 13
  • 25. The Resolution § We tracked down the developer who touched the template –And shot him § We unset the flag on the template –Made sure it had replicated around § Re-applied the template to the affected user mail databases 25 Monday, 4 February 13
  • 26. Lessons Learned § Unread marks are stored in a database table –Has worked quite well since 6.0.2 / 6.5.1 § Uncontrolled changes to templates can quickly cause large scale issues § Designer task does NOT have to run every night –Running it on demand gives you control –Paul totally disagrees with this point...... § Little things can easily cause large scale issues § Ensure that all members of all affected teams understand how to prevent this issue –Use it as a blunt weapon to ensure change control processes are adhered to 26 Monday, 4 February 13
  • 27. Story 2 - Archiving issues... 27 Monday, 4 February 13
  • 28. The Story § Large multinational site –Over 40k users –Over 40 countries § One region’s mail servers low on disk space –4GB left of 1TB data drive § Apparently aggressive archiving apparently in place –Data moved on schedule • Older, large size archive server –Mail over 90 days moved using server-server archiving 28 Monday, 4 February 13
  • 29. The Investigation § Check mail server settings § Program documents –Compact -a § Check archive server –Cannot connect –Apparently firewall restricts admin client subnet access to archive? § Check logs on mail servers –They are having issues too § No RDP access § No ICMP response (apparently firewall) 29 Monday, 4 February 13
  • 30. The Cause § The archive server was down –For 8 weeks § Ran out of disk space –Attempted restore of entire archive directory accidentally on same server § Nobody noticed –Nobody can access archive server from admin subnet client IPs anyway 30 Monday, 4 February 13
  • 31. The Resolution § *very carefully* § Disconnect archive server from network § Replace directory and key system databases § Bring up and check consistency § Add to network § Test archiving § But...there’s more 31 Monday, 4 February 13
  • 32. The Resolution § VPN’d in on Friday evening § No RDP access to box § So.. request access § Support gave me access § By adding my account to GlobalDomainAdmin group 32 Monday, 4 February 13
  • 33. Lessons Learned § All issues here caused by laziness § Check your servers are up, daily –Monitor your servers? § Have a restoration process for data § Don’t hand admin rights out to people “as needed” –I don’t care how much you like them! § Be “PIRK”Y –Purge Interval Replication Control –See adminblast deck 33 Monday, 4 February 13
  • 34. Story 3 - Router Rooted 34 Monday, 4 February 13
  • 35. The Story § Large multinational –30k + users –50+ sites § A large mailserver crashed –Thousands of users affected –Auto-restart enabled - restarted the server –Took 40 minutes § It crashed again –It restarted it again § it crashed again –Auto-restart decided it had enough –Manually Restarted § It crashed again § NSD’s indicated that Router was crashing 35 Monday, 4 February 13
  • 36. The Investigation § Console log - no issue § Log files - no issue § Transaction log - no issue § NSD analysis concluded that the router task was crashing –Whilst running LotusScript? –LotusScript ??? § We noticed a new agent... 36 Monday, 4 February 13
  • 37. The Cause § Someone had created a new ‘Before Mail Delivery Agent’ in the mail template –Designer task enabled –All users got the new agent § Was it tested? –ummm... Yes? § A ‘Before Mail Delivery’ agent is ran by Router when it delivers the mail to the user mailbox –Very handy hook point for some automated processes –Documentation states that this agent has GOT to be quick § This agent tried to open a remote database and log the message –Thousands of mail messages meant that the router task could not keep up –Crashed the server 37 Monday, 4 February 13
  • 38. The Resolution § We re-educated the developer –With a bat § Increased change control around the template. Again. § Removed the ‘before mail delivery’ agent § Refreshed all user templates 38 Monday, 4 February 13
  • 39. Lessons Learned § Change Control § Before Mail Delivery agents have to be fast –Try not to open remote databases on each message being delivered –Over a 400ms wide area network –With 64kb/s bandwidth § Testing –No. Real Testing. Large Scale Testing. –Use ‘Agent Profiling’ to give you an idea of the total time it’ll take to run 39 Monday, 4 February 13
  • 40. Story 4 -Follow the white rabbit 40 Monday, 4 February 13
  • 41. The Story § Global customer –5k staff globally § Fast moving company –Acquisitions / temporary projects • Built servers as needed § Mail issues –Delivery time taking hours –Some mail never delivered • No NDRs § We were asked to investigate 41 Monday, 4 February 13
  • 42. The Investigation § Ask for copy of names.nsf § Check –Connection documents –Configuration documents –Domain documents –NNN § Noticed different domain entries –Adjacent domain docs –Non Adjacent domain docs § Asked to vpn to site to investigate domain –It got interesting/emotional 42 Monday, 4 February 13
  • 43. The Cause Domain 2 Domain 1 Domain 3 43 Monday, 4 February 13
  • 44. The Cause Domain 8 Domain Domain 9 11 Domain 7 Domain 10 Domain 2 Domain 6 Domain 1 12 15 16 Domain 3 Domain 5 13 18 14 17 19 20 Domain 4 44 Monday, 4 February 13
  • 45. The Cause § At some stage... –Someone designed separate domains for projects • Separate servers § Agents used to add documents to primary nab § This became a “standard” without question § Nobody knew who did it first § Routing hell - some domains linked through 8 hops –Some not linked at all 45 Monday, 4 February 13
  • 46. The Resolution § Designed a primary domain –Began a consolidation process § Cleaned up routing to HUB/Spoke where possible –Some servers could not do this –Ended up with four hub domains/servers until consolidation complete § Demo’d the directory catalog..... § Explained mail routing architecture 46 Monday, 4 February 13
  • 47. Lessons Learned § If there is a standard, and it is not traceable back to an “owner” –Question it? –Validate it? § Enable the “delayed mail” feature in configuration document on servers 47 Monday, 4 February 13
  • 48. Story 5 - ‘Where’s John?’ 48 Monday, 4 February 13
  • 49. The Story § A multinational 30k + user site –70+ sites § Lots of critical line-of business Notes applications –Been running for years § Overnight application processing fails –No monitoring –No-one notices for a few days –Ambiguous help desk calls logged § More instances fail –No-one notices § Finally the business explodes 49 Monday, 4 February 13
  • 50. The Investigation § We picked one application –No application logs –No way of validating critical processing had been performed –History of large numbers of document writes • But not recently –Checked its agents - they looked fine –Checked the server logs to see when they should run • Tried to confirm from server console when the agents last ran § Checked the username associated with the agent... 50 Monday, 4 February 13
  • 51. The Cause § One developer, responsible for all these applications, left at the end of his contract –He’d been added to the terminations group –All the agents he’d signed had failed to run 51 Monday, 4 February 13
  • 52. The Resolution § Create a ‘Template Signing’ ID for your organisation –Have the Administrators keep control of it § Have the administrators sign all templates going into production with this ID –No exceptions § If it fails, its their fault. 52 Monday, 4 February 13
  • 53. Lessons Learned § Domino Applications run for years –I’ve seen ones in production for 10+ years –They need to be monitored • Scheduled agents have to run! • Use DDM ‘agent failed’ monitor - and check the results! § Release control isn’t just for the SOX Audit –its for life –And it’ll save yours 53 Monday, 4 February 13
  • 54. Story 6 - Does it blend 54 Monday, 4 February 13
  • 55. The Story § Mid size site –1.5k users in one region § Recently upgraded to ND8.x on Citrix § Full fat version § Ongoing issues with personal and recent contacts –Everyone had everyone’s recent contacts –Some people have other people’s saved contacts –Others had no issues § Management going berserk 55 Monday, 4 February 13
  • 56. The Investigation § Investigate a typical user setup § Check location of home directories –Majority of users using legacy “shared network drive” for data § Purge the recent contacts from personal address books –They come back almost instantly § Bang head on wall –No effect § Bang head on desk –No effect § Bang head against deployment team –Some effect 56 Monday, 4 February 13
  • 57. The Cause § There were two issues § Contacts being shared... –All users were setup using a default copy of the client databases (NOT templates) • names.nsf, bookmark.nsf etc –These were placed in network home folder (e.g. h:notesdata) • TERRIBLE IDEA –Some of the users were blackberry users • Legacy setup of blackberry for contact sharing • Users’ personal directories being replicated to BES server • BES used them as sources for contacts –As one user replicated their personal directory with BES server • All other replicas (i.e. other personal directories) replicated too 57 Monday, 4 February 13
  • 58. The Cause § Issue Two –Recent contacts appearing everywhere § Recent contacts are stored in recent contact view in personal directory § That data is ALSO stored.. –<Notes data directory>workspace.metadata.pluginscom.ibm.notes.dip –files called DIP*.SER § Citrix deployment was completed incorrectly –The plugin directory was being shared by all users • Being written to by all users 58 Monday, 4 February 13
  • 59. The Resolution § Issue 1 –Remove the personal address books from the BES server –Setup mail policy to sync contacts with mail file –Remove personal directory property for each user in the BES administrator • Will then default to mail file contacts –Start project to change replica id for all personal directories § Issue 2 –Promote RTFM on Citrix deployment –Fix Citrix deployment 59 Monday, 4 February 13
  • 60. Lessons Learned § “If it works, leave it alone” –Not always the best way –e.g BES using replicated personal directories - very old school § Citrix is a great tool –8.5.3 supports Citrix well –But you need to: • Understand Citrix • Understand the Notes client • Read the manual 60 Monday, 4 February 13
  • 61. Story 7 - BYOD Hell 61 Monday, 4 February 13
  • 62. The Story § A single user, with an iShiny device § One morning, the phone was dead. § She had lost everything –Family pictures, contacts, text messages –No Backup 62 Monday, 4 February 13
  • 63. The Investigation § We looked back over the user history –She used to be an employee of BigCo –Left a number of months ago –Had the BigCo MDM profile and mail/PIM data pushed to her iShiny device § We looked at the BigCo Mobile Device Management strategy 63 Monday, 4 February 13
  • 64. The Cause § BigCo had a rather brutal and primitive MDM § They assumed control of the users own iShiny device § The user couldn’t pull their own data off their own phone –Because she wasn’t connected to the enterprise network § When the user left, they nuked the device 64 Monday, 4 February 13
  • 65. The Resolution § Shoot the administrator § We advised BigCo that they should invest in a better MDM architecture § We also advised them to at least warn their users that their own phones and iPads were rendered useless by their MDM architecture 65 Monday, 4 February 13
  • 66. Lessons Learned § ‘Bring Your Own Device’ means –The Users own the device –BigCo pushes mail to that device § BigCo wants to secure mail on that device § But there are better ways than just nuking the phone –For example, Traveler allows you just to nuke the Traveler data –Other systems create encrypted areas on the device which can be remotely nuked 66 Monday, 4 February 13
  • 67. Story 8 - “I have a cunning plan” 67 Monday, 4 February 13
  • 68. The Story § Small subsidiary of a large corporate company § Two Domino mail servers § One (Monday) morning –All mail files corrupted –All documents marked as Rep/Save conflicts § No databases outside of mail directories corrupted § FTI’s corrupted 68 Monday, 4 February 13
  • 69. The Investigation § Check the Domino servers –Program documents –Log files –Agents –Backup software –AV software § All good, with exception to corruption errors § Retrace logs to last startup –Thousands of locking errors § Ask a few questions.... 69 Monday, 4 February 13
  • 70. The Cause § The Administrator was asked to make both servers available for mail access (iNotes) –Only had 1 public IP address available –Mail files were not replicas § GENIUS IDEA –Directory links! § Administrator decided to map a drive from server A to mail directory on Server B –And map a drive from Server B to mail directory on Server A § Administrator created directory links on each server to the additional mail directory 70 Monday, 4 February 13
  • 71. The Cause § Last restart –Server A had started first, and mapped drive to Server B’s mail directory –Server B was trying to access mail files, locking errors occurring • Corruption § Then.... –Administrator noticed and did restarts in other order –Server B started first, mapped drive to Server A’s mail directory –Server A was trying to access mail files, locking errors occurring –leading to... 71 Monday, 4 February 13
  • 72. The Resolution § Stop servers § Unmap mappings § Delete .dir directory link files § Get backup tape § Restore § Punish Administrator –Enthusiastically 72 Monday, 4 February 13
  • 73. Lessons Learned § Domino is an application/database server § Needs ownership of its data § File locking is result of it not owning data –Always causes issues –Backups, AV software § Look for the easy solution to the initial problem –Saying no? –Replicating mail files to central server? –Reverse Proxy? § Domino doesn't always have to be the solution 73 Monday, 4 February 13
  • 74. Story 9 - Server Moves to Hell 74 Monday, 4 February 13
  • 75. The Story § Massive corporate (we mean that...) –Moving server images from one physical machine to another • Copy data across WAN - setup identical server § Server gets brought up and 15 minutes later –Mail routing stopped working –Replication stopped working –Traveler stopped working –Agents stop –HTTP stops –Console log goes crazy § Servers still running 75 Monday, 4 February 13
  • 76. The Investigation § We opened up the directory and saw... 76 Monday, 4 February 13
  • 77. The Cause § A server migration had corrupted the transaction logs on a single server § They started the server with no server id § This transaction log corruption had resulted in the directory design being corrupted to look a bit different. – 77 Monday, 4 February 13
  • 78. The real cause § An administrator accidentally replaced the Domino Directory design –with the document library § It replicated –There were no survivors.... § Most servers kept mail routing for a while § Some services - such as Traveler - failed –The views they relied on were missing
  • 79. The Resolution § Replace design on the directory –Rebuild all indexes –Restart the server § SALT ON THE WOUND –Even AFTER discovering... –Didn’t do server restarts –Problems continued
  • 80. Lessons Learned § Limit your risk –Designated master servers for design changes § Accidental design replaces can happen –Replication requires access –Remove the access! § Honesty –Really - you WILL get found out... –Own up - you will sleep easier § Detailed disaster recovery processes
  • 81. Story 10 - (un)Happy New Year!
  • 82. The Story § Large site –Regional administration –6000 users in “my” region § First support call of this year –2nd January § Replication stopped working for application hub server –Nothing replicating
  • 83. The Investigation § Replication task –working § Cluster replication –working § Network connectivity –working § Console –nothing obvious reported § Log file –Unusual dates listed... § Replication history –Entry dates for 1st January 2020 § Spoke to developer
  • 84. The Cause § Developer working over holiday season § Request from senior executive –Automated email to all users to be sent at midnight –Wishing them joyous tidings for the new year § Developer was enthusiastic –wrote an agent § Developer couldn’t test –Did not have admin rights to test servers' OS § BUT! –He did have RDP access to production servers • And nobody was online one night § Brought down Domino –Reset time to Dec 31st 11:58, 2012 and waited • It worked § Then... –Tried every year to end of 2019
  • 85. The Cause § Brought server up each time § Server replicated § Applications updated their replication history –Ending in Jan 01, 2020 § Time reset to 2012 –Applications wouldn’t replicate
  • 86. The Resolution § Clear every replication history § Rebuild view indexes § Slow repair –Will haunt you § Educate developer –with prejudice
  • 87. Lessons Learned § Changing OS dates is bad –For any application server § Replication relies on replication history –A date/time-stamp-based marker for the last successful push of data
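The effect of a future-dated replication history can be sketched in a few lines of Python. This is a deliberately simplified model, not Domino's actual replicator: the function `docs_to_replicate`, the document IDs and the timestamps are all hypothetical, and the only assumption is the one stated on the slide, that replication uses a date/time marker of the last successful exchange as its cutoff.

```python
from datetime import datetime

def docs_to_replicate(docs, last_replicated):
    """Return IDs of documents modified after the recorded replication time.

    Simplified model: anything modified at or before the marker is assumed
    to have been sent already and is skipped.
    """
    return [doc_id for doc_id, modified in docs.items() if modified > last_replicated]

# Hypothetical documents with their last-modified times.
docs = {
    "doc-a": datetime(2012, 12, 20, 9, 0),
    "doc-b": datetime(2013, 1, 2, 8, 30),
}

# Marker written while the clock was correct.
healthy_history = datetime(2012, 12, 31, 23, 0)
# Marker written while the server clock was set to 1 January 2020.
poisoned_history = datetime(2020, 1, 1, 0, 5)

changed_normally = docs_to_replicate(docs, healthy_history)
changed_after_clock_jump = docs_to_replicate(docs, poisoned_history)

print(changed_normally)
print(changed_after_clock_jump)
```

With the healthy marker the January edit is picked up; with the 2020 marker nothing qualifies, which matches the symptom in the story: every task reports "working" while nothing replicates until the history is cleared.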
  • 88. Story 11 - Hell’s agent! § The Story –A critical application sits on all servers • 3GB Database / 65,000 documents • Replicates from three global hub clusters to all spokes hourly –All server communication grinds to a halt –No Mail routing/replication –Application grows to 28GB • Masses of replication conflicts § The Investigation –Check application for design changes –Check replication history and schedule –Check server tasks • Sniff the bandwidth § Gotcha! –New scheduled agent
  • 89. Hell’s agent! § The cause –Developer wanted to modify all documents –Built an all documents view –Wrote an agent to modify a field –Agent set as scheduled “every hour” –Set agent to run on …. –ALL SERVERS –Ran on Hub first… –Hub replicated with all spokes on 1-hour replication schedule –Then ran on all servers –Then continued to run and replicate for the weekend –4.8 million document modifications per hour!
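As a rough sanity check on the 4.8 million figure, the arithmetic can be sketched as below. The slides give the document count (65,000) and the hourly total, but not the number of servers, so the server count here is an inference, and the assumption that every agent run touches every document once is ours. The helper name `modifications_per_hour` is hypothetical.

```python
DOCUMENTS = 65_000                    # from the story: 65,000 documents
MODIFICATIONS_PER_HOUR = 4_800_000    # from the story: 4.8 million per hour

# Assumption: each hourly agent run rewrites every document exactly once.
implied_passes = MODIFICATIONS_PER_HOUR / DOCUMENTS

def modifications_per_hour(documents, servers, runs_per_hour=1):
    """Documents rewritten per hour if the agent runs on every server."""
    return documents * servers * runs_per_hour

print(f"Implied full passes per hour: {implied_passes:.1f}")
print(f"74 servers, hourly agent: {modifications_per_hour(DOCUMENTS, 74):,}")
```

Under these assumptions the figure corresponds to roughly 74 full passes over the database every hour. Because each server edits its own replica of the same documents independently, every pass also produces a fresh round of replication conflicts, which would account for the growth from 3GB to 28GB.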
  • 90. Hell’s agent! § Lessons Learned –Developers must never change design on production systems • Even basic agents –Have separate development/UAT/production domains • Developers should NOT have designer access on UAT/Production domains § Domino is very powerful, and WILL do whatever you tell it to do – no matter how stupid... § Never leave new code unsupervised
  • 91. Story 12 - Oh, is that important? § The Story –Big site –Over 90 servers – 65K users –One Friday, all replication and routing stops –Starts on HUB, and quickly affects all servers § The Investigation –Check the source of the error –Logs, console, WAN/LAN links –Is server performance the problem? § Gotcha! –Checked server consoles… and admin4.nsf
  • 92. Oh, is that important?
  • 93. Oh, is that important?
  • 94. Oh, is that important?
  • 95. Oh, is that important?
  • 96. Oh, is that important? It Replicates …..
  • 97. Oh, is that important?
  • 98. Oh, is that important? § The Cause –Junior Administrator deleted LocalDomainServers group using Adminp on the HUB server –This replicated to all servers –All server-server access lost –Admins attempted to stop spread by disabling replication of the names.nsf file • Forgot admin4.nsf! § Resolution –Flush out Adminp requests –Manually add entries for LocalDomainServers back to all ACLs and directory documents (3 days to recover)
  • 99. Oh, is that important? § Lessons learned –Limit Administrator access to the Domino Directory –Education, Education, Education! –Response procedures for disaster recovery –Have a test environment that simulates your live one…
  • 100. Shorties § Archive Transaction logging relies on the backup routine clearing the transaction log –No backup, no cleanup. And the transaction log fills up § Don’t let your platform anti-virus scan the Domino Directory –It doesn’t half kill performance § Domino requires fast disk subsystems –0-2ms is fast. 360ms is slow § Don't keep “encrypted” text in clear text § Please don’t open the mail template and set the Owner Name... § Reader/Author fields are multi-value canonicalised names –You can try, but NOTHING ELSE will work –We say this every year. And we see it every year § Don’t switch on ‘Anonymous Access’ on your server security document... § Don’t just have a replication stub as your mail template...
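To make the Reader/Author point concrete, here is a minimal Python sketch of what "multi-value canonicalised names" means. It is an illustration only, not Domino API code: the helpers `to_canonical` and `readers_field` and the names "Jane Doe", "Sales" and "Acme" are hypothetical, and the sketch assumes the common hierarchy of one common name, zero or more organisational units and one organisation, ignoring country codes.

```python
def to_canonical(name: str) -> str:
    """Convert an abbreviated hierarchical name to canonical form.

    Flat entries (groups, roles such as "[Admin]") and names that are
    already canonical are returned unchanged.
    Assumption: no country component; last part is the organisation.
    """
    parts = [p.strip() for p in name.split("/")]
    if len(parts) == 1 or "=" in name:
        return name.strip()
    labels = ["CN"] + ["OU"] * (len(parts) - 2) + ["O"]
    return "/".join(f"{label}={part}" for label, part in zip(labels, parts))

def readers_field(names):
    """Build the field as a list with one canonical name per value."""
    return [to_canonical(n) for n in names]

field_values = readers_field(["Jane Doe/Sales/Acme", "[Admin]", "LocalDomainServers"])
print(field_values)
```

The two things to notice are that each entry is its own value in the list (not one comma-separated string), and that hierarchical names carry their CN=/OU=/O= labels. An abbreviated name or a single delimited string is the kind of "you can try" variant the slide warns about.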
  • 101. Shorties 2 § Can you define ‘Return on Investment’ for this?
  • 102. Summary § Everyone makes mistakes § Put systems in place to prevent the obvious ones § It's how you deal with them that makes you professional –No Blame Culture –Admit soon, Admit well… § Major system disasters –Sometimes can't be prevented –Are usually the combination of many small errors § Learn from these mistakes!
  • 103. Thank you § Paul Mooney (pmooney@pmooney.net) § Bluewave § pmooney.net / bluewavegroup.eu § Bill Buchan (bill@billbuchan.com) § hadsl § billbuchan.com / hadsl.com
  • 104. Legal Disclaimer © IBM Corporation 2009. All Rights Reserved. The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.