Unplanned Outage
November 15th, 2012 by admin
Much as in Benjamin Franklin's controversial statement, "Those who would give up Essential Liberty to purchase a little Temporary Safety, deserve neither Liberty nor Safety," we face a dichotomy between systems stability and systems security.
Most people don't realize that the two are inextricably linked and that an improvement in one usually comes at the expense of the other. In some cases, the benefit clearly outweighs the cost; in others, the benefit is less obvious and the change can even be counterproductive. For this reason we tend to make any changes with great caution.
The most recent outages have been related to our Windows Update process. One school of thought, arguing for security, urges us to update with vigilance and priority, for once an update for a security weakness has been issued it has, in the very same act, announced the existence of the weakness, along with the method to exploit it, to the world. They rightly claim every second unpatched is a second unprotected, with your customers, credit cards, reputation, and every other digital asset vulnerable.
The opposing school warns us to be cautious of these same updates that are meant to protect us. Very often they are issued in great haste, in an attempt to plug the hole in the dike before the next deluge of cyber attacks crashes upon us. In this haste they are often not widely tested across the full spectrum of environments in which both our hardware (virtual or physical machines, and all our many devices) and software (not only the OS itself but also the vast number of third-party applications) live. As such, it is not uncommon for these updates to conflict with the environment. This is the exact scenario that occurred early this morning.
It is interesting to note that although we have chosen a middle path (delaying all updates until off-peak hours), some elements of the operating system may decide that a given security item is so important that it overrides your defined settings and applies the update immediately (in the case of Windows, as outlined by Microsoft in this blog, and to the extent outlined in the article). This, again, is the scenario we encountered.
We are taking steps to help mitigate outages like these. While we do offer limited morning support on EST (and in this case we were working to resolve the issue well before our normal PST hours), we are seeking to enhance our monitoring and alerting solution. We are also considering disabling all automatic updates and moving to a daily, manual alternative performed during off-peak hours. However, know that such changes carry risks of their own, which is why we make them only after great deliberation.
Furthermore, I encourage you to examine your business and determine whether the tradeoffs between security and stability create a unique business scenario for you. If you need more of one or the other than what we feel is the best fit for the majority of our customers, we, along with your local providers, would be glad to work out a cost-benefit analysis with you to determine whether your business would benefit from either an SLA or a custom solution.
We understand there is a loss of productivity associated with unplanned outages, and will continue to strive to minimize that risk while at the same time balancing the concomitant security factors.