Archive for August, 2011

Post outage services

[UPDATE] 08/12/11 8:54 AM: All services for DEWEY have come back online from the restart. If you are still experiencing issues, please call 614.503.7139 option 2 to reach our Help Desk.

[UPDATE] 08/12/11 8:15 AM: The domain controllers for DEWEY have come back online from the restart. We are still waiting for the primary CAS and MBOX server to come back online.

[UPDATE] 08/12/11 8:05 AM: We are going to reboot the CAS servers for DEWEY to clear up any issues before the start of business. Service may be interrupted for 15 minutes.

[UPDATE] 08/12/11 1:46 AM EST – All maintenance on DEWEYMBOX2 has been completed successfully. All services have been restored and queued messages are flushing into users’ mailboxes.

Again, we’d like to take this opportunity to apologize for this issue, and we appreciate your patience with us throughout the process.


[UPDATE] 5:43 PM EST: Tonight we will be performing maintenance on DEWEYMBOX2 to address the performance issues reported by clients throughout the day.

Maintenance is expected to last between 10PM and 2AM.

99% of our services have been restored. Unfortunately, 3 servers are still affected by the outage. Because this was a power outage, and Windows Server’s ability to handle such outages is not on par with Linux, some servers have taken longer to come back online. Services currently affected:

Dewey MBOX2 – A secondary mailbox server on the DEWEY cluster. It has a small user count, but if your clients are on it, their mailboxes are inaccessible.

Daisy – Legacy Exchange 2003 server

VS4 – One of our Virtual servers

These services affect a small but equally important portion of the client base, so we have all hands on deck to restore these servers ASAP. For the Hosted Exchange users still affected, please remember that LiveArchive is online. If you’re having any issues authenticating to it or otherwise, please open a support request and we’ll get you back online on LiveArchive as quickly as possible.

 


Dear Network Logix Clients,

At approximately noon yesterday the data center hosting the hub of our partner’s US operations suffered a major power failure that momentarily interrupted all network services. More redundant services resumed operations immediately (ExchangeDefender) while others started recovering through emergency systems around 5 PM EST. We are incredibly sorry for the impact this has had on your operations and promise to enhance service redundancy so that this never happens again.

On behalf of the whole Network Logix team, I am sorry we put you into this situation. Much like you, we spent a part of yesterday in the dark with no clear ETA on when the services would be restored or how this failure could have happened in the first place.

The facility that our main US operations are in boasts N+2 redundancy, over 30 days of generator fuel and is one of the largest and most reliable in the world. As a matter of fact, we consolidated our central operations here due to the issues we had in California and Florida. The facility has had an incredible service record and has allowed us to provide the same level of service to you. Yesterday the facility experienced a failure in the Automatic Transfer Switch (ATS) equipment designed to switch the power feed from live utility power to the power generators in the event of power loss. This was the piece of equipment that was designed to keep power available, and while it is also redundant with A/B feeds, the data center distribution routers were not connected to both power banks. While our equipment remained powered on, the network connectivity remained down. This was the technical root of the issue.
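To illustrate the gap described above, here is a minimal Python sketch of an audit that flags any device not cabled to both the A and B power feeds. The device names and the inventory format are hypothetical and are not taken from the data center's actual tooling.

```python
from dataclasses import dataclass, field

# Assumption: the facility exposes two independent feeds, labeled A and B.
REQUIRED_FEEDS = frozenset({"A", "B"})


@dataclass
class Device:
    name: str                                  # hypothetical device label
    feeds: set = field(default_factory=set)    # feeds the device is cabled to


def single_feed_devices(inventory):
    """Return the names of devices missing at least one required feed."""
    return [d.name for d in inventory if not REQUIRED_FEEDS <= d.feeds]


# Hypothetical inventory: servers on both feeds, routers on feed A only.
inventory = [
    Device("customer-rack-1", {"A", "B"}),
    Device("distribution-router-1", {"A"}),
    Device("distribution-router-2", {"A"}),
]

at_risk = single_feed_devices(inventory)
print(at_risk)
```

In this made-up inventory, the servers would stay powered through the loss of one feed while the routers would not, which matches the symptom described: equipment on, network down.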

Thankfully, a spare ATS was available, and the utility, data center and supporting vendors were all on site within the hour and completed the replacement within 5 hours of the service interruption. We did our best to keep everyone informed of everything we knew every step of the way through our Facebook page (http://www.facebook.com/NetworkLogix) and our Twitter account @networklogix.

Our operating procedures also call for use of emergency failover systems should the primary systems be down for more than 4 hours. At roughly 3:30 we began restoring services to our web sites, redundant Exchange clusters and continued restoring services well into the night as the data center facility restored full operations.
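As a rough sketch of the rule above, the check below returns true once the primary systems have been down for more than 4 hours. Only the 4-hour threshold comes from our procedures; the function name and the timestamps are hypothetical and chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Threshold stated in the operating procedure: more than 4 hours of downtime.
FAILOVER_THRESHOLD = timedelta(hours=4)


def should_fail_over(outage_start, now):
    """Return True once the primary has been down longer than the threshold."""
    return now - outage_start > FAILOVER_THRESHOLD


# Hypothetical timestamps, not the actual incident timeline.
outage_start = datetime(2011, 1, 1, 9, 0)
after_three_hours = should_fail_over(outage_start, datetime(2011, 1, 1, 12, 0))
after_five_hours = should_fail_over(outage_start, datetime(2011, 1, 1, 14, 0))
print(after_three_hours, after_five_hours)
```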

ExchangeDefender inbound service was not affected by this incident as it’s massively redundant through multiple data centers. However, a disruption to the major central control in Dallas effectively flooded the failover sites, and some of our partners reported email delays from minutes to even two hours. Worse, our Exchange hosting clients were impacted for 4-6 hours, and without ExchangeDefender LiveArchive available to back them up, the service completely failed them. Again, we are sorry for this issue and will address it immediately.

At approximately 10PM EST our partners briefed me in detail on what happened, how they responded, what we learned and what we intend to do to fix it going forward.

Going Forward

First of all, we will begin deployment of redundant control systems for ExchangeDefender: placing additional admin servers across our failover sites, adding more capacity to the existing ones and most importantly providing geographic redundancy to ExchangeDefender LiveArchive.

Second, we will be adding redundancy to our Exchange 2010 networks in the USA.

Expect to see major changes behind the scenes this quarter. While this issue never occurred before and we don’t expect it to occur again, we have learned the hard way that we need to greatly improve certain areas of the product, in particular LiveArchive.

Personally, I put my name and reputation on whom we choose as our partners and on the service we deliver. I believe we are the best and the solutions that we offer are without comparison. We will make sure that all of the features, not just the inbound mail processing, live up to the 100% uptime expectation you should have and we have maintained on our inbound service for the past decade.

My staff and our partners have worked tirelessly throughout the day and night to keep you informed and restore service as fast as possible. I want to personally thank you for your professionalism and the way you treated us during this difficult time. While it’s easy to lose composure and patience when services are down and there is limited visibility/ETA on resolution, almost universally the comments included “Well that sucks but I’m glad it’s you dealing with this and not me.” While I appreciate it, I do feel we failed you.

Our partner operations will remain in Dallas at the existing data center facility because, simply put, they are the best. Even with the power incident, which was the first one that we’ve experienced in nearly a decade of working with them, this is the kind of issue Microsoft, Google and Amazon experience on a weekly basis. Cloud services are about providing an affordable IT solution through massively scalable equipment which is incredibly complex. It is not foolproof, nor easy to fix when it goes down, but the benefits are that you are not on the hook for a repair bill, equipment or the amount of manpower required to manage it all. Best of all, problems such as those experienced yesterday can be minimized and we will begin to work on that today.

Again, I apologize for the inconvenience this has caused you and your clients. I know we are fortunate to earn your business and the trust you put in us and our services that are designed to keep you up and running when your systems are affected. I look forward to talking to everyone that was affected by this and while I work my way through the messages, emails and callback requests I hope this email provides some clarity as to what happened and what we intend to do next.

Thank you for your business.

Sincerely,
Andrew Moon
Chief Technologist | Network Logix


Hosted Exchange email interruption

We are experiencing routing issues in our Dallas data center affecting our Hosted Exchange email. Our engineers are working to resolve the issue as quickly as possible.

[UPDATE] 12:32PM EST: The issue has been identified as power related in the Data Center. Services are slowly coming online. We will update when service is fully restored.

[UPDATE] 2:17PM EST: The datacenter staff has confirmed an outage with the power plant and has individuals on staff attempting to redirect power around the core.

[UPDATE] 3:27PM EST: Unfortunately our hands are tied with power related issues. We will keep this account updated as information comes along.

[UPDATE] 3:47PM EST: It has been determined that the ATS (Automatic Transfer Switch) will need repairs that will take time to perform. Fortunately our Data Center has another ATS that is on-site that can be used as a spare. Contractors are working on a solution right now that will allow us to safely bring that ATS in and use it as a spare while that repair is happening.

That plan is being developed now and we should have an update soon as to the time frame to restore temporary power. We will need to schedule another window when the temp ATS is brought offline and replaced by the repaired ATS.

[UPDATE] 4:32PM EST: Thank you for your patience as we work to address the ATS issue. We apologize for the situation and are working as quickly as possible to restore service.

We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site as part of our emergency recovery plan. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online.

Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business.

Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers. Please expect another update within the hour.

[UPDATE] 4:45PM EST: The data center staff, electricians and all the suppliers have updated us with the provisional service restoration time of about 2:30 from now (about 6:30 PM EST). This is just an estimate, and we will update this posting at that time. At this time we are turning up our emergency sites and failover systems (which also takes some time but will restore most of the services more rapidly).

[UPDATE] 5:10 PM EST: Our emergency failover systems are kicking in and we have restored service to support portals, web sites, outbound ExchangeDefender (inbound was not affected), louie and rockerduck. Please note THESE ARE OUR EMERGENCY FAILOVER systems, not the real thing. We are still waiting for services to be restored by the utility/power/electricians/etc.

[UPDATE]  6:10 PM EST: Our team and electricians are working diligently to get the temporary ATS installed, wired and tested to allow power to be restored. As the ATS involves high-voltage power, we are following the necessary steps to ensure the safety of our personnel and your equipment housed in our facility.

Based on current progress the electricians expect to start powering the equipment on between 6:15 – 7:00pm Central. This is our best estimated time currently. We have thoroughly tested and don’t anticipate any issues in powering up, but there is always the potential for unforeseen issues that could affect the ETA so we will keep you posted as we get progress reports. Our UPS vendor has checked every UPS, and the HVAC has checked every unit and found no issues. Our electrical contractor has also checked everything.

We realize how challenging and frustrating it has been to not have an ETA for you or your customers, but we wanted to ensure we shared accurate and realistic information. We are working as fast as possible to get our customers back online and to ensure it is done safely and accurately. We will provide an update again within the hour.

[UPDATE] 9:16 PM EST: Since this was a power failure, the DEWEY database didn’t cleanly commit its log files. We’re performing an integrity check before bringing it back online.

[UPDATE]  10:32 PM EST: At this time all systems are back online. It may take several hours for the servers to process all the queued mail. If you are still unable to access your email, please call 614.503.7139 option 2. Thank you for your patience today. We sincerely apologize for the inconvenience to you and your business.

[UPDATE] August 11 10:04 AM EST: At this time we are experiencing latency issues with the Dewey server due to the unforeseen power outage we experienced yesterday. Our Exchange engineers are working to fully restore the Dewey server while avoiding as much service interruption as they possibly can for all of our partners and their clients.


Microsoft has issued a Security Bulletin Advance Notification indicating that its August release will contain 13 bulletins. These bulletins will have the severity ratings of critical, important, and moderate. Release of these bulletins is scheduled for Tuesday, August 9, 2011.

Network Logix will provide additional information as it becomes available.

 
