Archive for the ‘Data Center Ops’ Category

As the winter holidays are quickly approaching, Network Logix is republishing this entry to increase awareness.

In the past, Network Logix has received reports of an increased number of phishing scams and malware campaigns that take advantage of the winter holidays and holiday shopping season. Network Logix reminds users to remain cautious when receiving unsolicited email messages that could be part of a potential phishing scam or malware campaign.

These phishing scams and malware campaigns may include but are not limited to the following:

  • electronic greeting cards that may contain malware
  • requests for charitable contributions that may be phishing scams and may originate from illegitimate sources claiming to be charities
  • screensavers or other forms of media that may contain malware
  • credit card applications that may be phishing scams or identity theft attempts
  • online shopping advertisements that may be phishing scams or identity theft attempts from bogus retailers

Network Logix encourages users and administrators to use caution when encountering these types of email messages and take the following preventative measures to protect themselves from phishing scams and malware campaigns:


Microsoft has issued a Security Bulletin Advance Notification indicating that its December release will contain 14 bulletins. These bulletins will have severity ratings ranging from critical to important and will be for Microsoft Windows, Microsoft Office, and Internet Explorer. Release of these bulletins is scheduled for Tuesday, December 13, 2011.


DEWEY Outage Report

11/5 – 12:56:00 pm

The DB integrity check is confirmed as still ongoing.

11/3 – 09:52:00 pm

After talking with engineers from our service provider, we estimate that full mailbox functionality should be restored over the coming weekend. We realize that this is totally unacceptable, and we will be holding our service provider to account once things are restored. Thank you for your patience, and we sincerely apologize for this ongoing issue.

11/3 – 07:46:00 pm

The database checks are still running; engineers are monitoring progress through the read/write activity.

11/3 – 04:11:00 am

The integrity checks are still running on both databases, each in a step that has no progress bar. There is still no numeric value to report at this time.

11/2 – 07:21:00 pm

Both integrity checks are still active and processing. Again, there is no numeric value to report, as explained in the earlier update.

11/2 – 02:53:00 pm

Integrity checks are moving. It’s going to be tough to give numbers, as both DBs are now in steps that have no progress bar. Activity is confirmed.

11/2 – 12:24:00 pm

DB3 completed steps 2 and 3 out of 5, and is now on step 4. DB2 is at 72% of its final step.

11/2 – 10:15:00 am Eastern

Both DB integrity checks are still progressing.

11/2 – 08:31:00 am Eastern

We anticipate DB2 completing by the end of the business day. Unfortunately, the DB3 check is taking longer and may not complete until Friday.

11/2 – 08:24:00 am Eastern

The integrity check on DB2 is in its final process at 80%, while DB3 is on its second pass at 30%.

11/2/2011 6:34AM Eastern

At this time, integrity checks on DB2 and DB3 are progressing smoothly. DB2 is in its final processes, while DB3 is moving along without errors.

11:14PM Eastern

The integrity check on DB3 is on step 4/5 at 80%; DB2 is on step 5/5 at 65%. We anticipate DB2 finishing around 4 AM and DB3 late Wednesday.

8:27 PM Eastern

The integrity checks are still progressing; the last database is 70% through the scanning process.

6:40 PM Eastern

The integrity checks are still progressing; the last database is 62% through the scanning process.

Update 3:38 PM Eastern

The integrity checks are still progressing; the last database is 50% through the scanning process.
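
For readers who want a rough feel for timing, the scanning percentages posted in the updates above (50% at 3:38 PM, 62% at 6:40 PM, 70% at 8:27 PM) can be extrapolated linearly. The sketch below is illustrative only. It assumes progress stays linear, which a multi-step integrity check may not do, and the calendar date and the function name estimate_completion are placeholders chosen for the example, not details from our service provider.

```python
from datetime import datetime, timedelta


def estimate_completion(samples):
    """Linearly extrapolate when progress reaches 100%.

    samples: list of (datetime, percent) tuples, oldest first.
    Returns a datetime, or None if no progress was made between
    the first and last sample.
    """
    (t_first, p_first), (t_last, p_last) = samples[0], samples[-1]
    gained = p_last - p_first
    if gained <= 0:
        return None  # no measurable progress, so no estimate
    elapsed = (t_last - t_first).total_seconds()
    # Seconds still needed if the average rate so far continues.
    remaining = (100.0 - p_last) * elapsed / gained
    return t_last + timedelta(seconds=remaining)


# Percentages from the posted updates; the date is an assumption.
samples = [
    (datetime(2011, 11, 1, 15, 38), 50.0),
    (datetime(2011, 11, 1, 18, 40), 62.0),
    (datetime(2011, 11, 1, 20, 27), 70.0),
]
eta = estimate_completion(samples)
print("Estimated scan completion:", eta)
```

With these three samples the estimate lands at about 3:40 AM the following morning. It covers only the scanning step and says nothing about later steps that report no percentage.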

Update 1:09 PM Eastern

The dial tone migration has completed and users are now able to access their mailboxes on the temporary database.

Update 12:30 PM Eastern

Our service provider will be performing a dial tone migration to DEWEYMBOX2 for users on the affected databases. A dial tone migration will allow users to reconnect to their mailbox on DEWEYMBOX2 via Outlook, OWA, and ActiveSync. However, the mailbox will contain only the mail from the previous day, when the outage occurred, and any newly arriving mail.

Users will see a prompt after restarting Outlook.

To access their new mail, users should select “Use Temporary Mailbox”.

Once the databases are back online, our service provider will move users back to their original databases and then restore mail from the temporary mailbox.

Update 11:14 AM Eastern The integrity checks are still progressing; the last database is 30% through the scanning process. No ETA is available, as the check provides no completion estimate.

Update 9:13 AM Eastern The engineers have successfully mounted one more database and are currently monitoring the final DB integrity check.

Update 7:26 AM Eastern The larger DB is about 25% complete, while the second-largest DB is still being checked because errors were detected.

Update 4:41 AM Eastern The larger DB check has started showing progress, while the second-largest DB is still on the final step of the integrity check.

Update 4:00 AM Eastern There has been no significant change in progress; the checks are still underway.

Update 2:30 AM Eastern The integrity check is still underway. Unfortunately, the first check is on its last step, which does not have a progress bar, so we have had no progress to report.

Update 10:30 PM Eastern The integrity check is performing cleanup and is on its final step. We’ve begun the check on the last DB.

Update 8:55 PM Eastern The integrity check is 90% completed.

Update 8:05 PM Eastern The current DB check is 2/3 completed. At this point we estimate being able to mount the database on or before 10PM EST.

Update 7:30 PM Eastern The integrity check is about 1/3 completed. At this point we are going to wait for it to complete before starting the integrity check on DB3.

Update 4:55 PM Eastern The integrity check is taking slightly longer than calculated. We will continue to monitor the repairs and will update this posting once the next step is reached.

Update 3:32 PM Eastern We’ve mounted half of the databases and are now running consistency checks on the larger databases.

Update 2:59 PM Eastern The databases were taken offline by Exchange and are now in a dirty shutdown state. Before we can remount the databases, we must run a consistency check.

Update 2:53 PM Eastern The databases for DEWEY were forcefully taken offline by Exchange. We are working on diagnosing the issue.

Update 2:16 PM Eastern We’ve replaced a failed drive on the DEWEYMBOX1 server. Unfortunately, to bring the server back to a healthy RAID state, we must rebuild the array. Throughout the rest of the work day, users may see momentary delays in mail delivery if their mailbox is hosted on the affected array.


Slowness on DEWEY server

Initial traces and diagnostics point to a possible issue with memory. We are discussing a plan of action for resolution today.


(10/17 – 09:52:00 am)
We’ve noticed increased mail delivery times for messages inbound to the DEWEY network. This affects mailboxes that are not on DEWEYMBOX2.

(10/17 – 09:52:00 am)
We’ve received a report from a partner about this affecting outbound mail, but we haven’t received additional reports from other partners.

(10/17 – 11:01:00 am)
Mail delivery is back up to speed on DEWEY and is still being monitored.

(10/17 – 11:01:00 am)
Between 2:45 PM and 3:10 PM we saw drastic performance issues with DEWEY that were linked to an unused failed drive that became active.

(10/17 – 11:01:00 am)
In the interest of bringing live mail up to speed we are going to freeze the current queues and move them off for processing later today.

(10/17 – 11:01:00 am)
The reboot has completed and we are finishing service checks for availability.

(10/17 – 11:01:00 am)
DEWEY has gone down for the reboot and is now in POST.

(10/17 – 11:01:00 am)
We are rebooting DEWEY mbox1 to prevent further service issues throughout the day for DEWEY users. We do not anticipate another reboot.

(10/17 – 11:01:00 am)
We are restarting the information store on DEWEY. We will be scheduling a reboot for after hours to apply memory changes.

(10/17 – 11:01:00 am)
Between 2:45 and 4:30 PM some clients reported being unable to send outbound mail. This was resolved by restarting the mail submission service.


Post outage services

[UPDATE] 08/12/11 8:54 AM: All services for DEWEY have come back online from the restart. If you are still experiencing issues, please call 614.503.7139 option 2 to reach our Help Desk.

[UPDATE] 08/12/11 8:15 AM: The domain controllers for DEWEY have come back online from the restart. We are still waiting for the primary CAS and MBOX server to come back online.

[UPDATE] 08/12/11 8:05 AM: We are going to reboot the CAS servers for DEWEY to clear up any issues before the start of business. Service may be interrupted for 15 minutes.

[UPDATE] 08/12/11 1:46 AM EST: All maintenance on DEWEYMBOX2 has been completed successfully. All services have been restored and queued messages are flushing into users’ mailboxes.

Again, we’d like to take this opportunity to apologize for this issue, and we appreciate your patience with us throughout the process.


[UPDATE] 5:43 PM EST: Tonight we will be performing maintenance on DEWEYMBOX2 to address the performance issues reported by clients throughout the day.

Maintenance is expected to last between 10PM and 2AM.

99% of our services have been restored. Unfortunately, 3 servers are still affected by the outage. Because this was a power outage, and Windows Server does not handle such outages as well as Linux does, some servers have taken longer to come back online. Services currently affected:

Dewey MBOX2 – A secondary mailbox server on the DEWEY cluster. It has a small user count, but if your clients are on it, their mailboxes are inaccessible.

Daisy – Legacy Exchange 2003 server

VS4 – One of our Virtual servers

These services affect a small but equally important portion of the client base, so it is all hands on deck to restore these servers ASAP. For Hosted Exchange users still affected, please remember that LiveArchive is online. If you’re having any issues authenticating to it or otherwise, please open a support request and we’ll get you back online on LiveArchive as quickly as possible.

 


Dear Network Logix Clients,

At approximately noon yesterday the data center hosting the hub of our partner’s US operations suffered a major power failure that momentarily interrupted all network services. The more redundant services, such as ExchangeDefender, resumed operations immediately, while others started recovering through emergency systems around 5 PM EST. We are incredibly sorry for the impact this has had on your operations and promise to enhance service redundancy so that this never happens again.

On behalf of the whole Network Logix team, I am sorry we put you into this situation. Much like you, we spent part of yesterday in the dark, with no clear ETA on when the services would be restored or how this failure could have happened in the first place.

The facility that our main US operations are in boasts N+2 redundancy and over 30 days of generator fuel, and is one of the largest and most reliable in the world. As a matter of fact, we consolidated our central operations here due to the issues we had in California and Florida. The facility has had an incredible service record and has allowed us to provide the same level of service to you. Yesterday the facility experienced a failure in the Automatic Transfer Switch (ATS) equipment designed to switch the power feed from live utility power to the power generators in the event of power loss. This was the piece of equipment that was designed to keep power available, and while it is also redundant with A/B feeds, the data center distribution routers were not connected to both power banks. While our equipment remained powered on, network connectivity remained down. This was the technical root of the issue.
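
The gap described above, a device cabled to only one of the A/B feeds, is the kind of thing a simple inventory audit can flag. The following is a minimal sketch using a hypothetical inventory. The Device type, the single_fed function, and the device names are invented for illustration and do not describe the facility's actual equipment or tooling.

```python
from dataclasses import dataclass, field

# Feeds every critical device is expected to be connected to.
REQUIRED_FEEDS = frozenset({"A", "B"})


@dataclass
class Device:
    name: str
    feeds: frozenset = field(default_factory=frozenset)


def single_fed(devices, required=REQUIRED_FEEDS):
    """Return names of devices missing at least one required feed."""
    return [d.name for d in devices if not required <= d.feeds]


# Hypothetical inventory, for illustration only.
inventory = [
    Device("mail-server-1", frozenset({"A", "B"})),
    Device("dist-router-1", frozenset({"A"})),
    Device("dist-router-2", frozenset({"B"})),
]
at_risk = single_fed(inventory)
print("Devices not on both feeds:", at_risk)
```

In this example the two routers are reported because each is on a single feed, while the dual-fed server is not.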

Thankfully, a spare ATS was available, and the utility, data center and supporting vendors were all on site within the hour and completed the replacement within 5 hours of the service interruption. We did our best to keep everyone informed of everything we knew every step of the way through our Facebook page (http://www.facebook.com/NetworkLogix) and our Twitter account @networklogix.

Our operating procedures also call for use of emergency failover systems should the primary systems be down for more than 4 hours. At roughly 3:30 we began restoring services to our web sites and redundant Exchange clusters, and we continued restoring services well into the night as the data center facility restored full operations.

ExchangeDefender inbound service was not affected by this incident, as it’s massively redundant through multiple data centers. However, a disruption to the major central control in Dallas effectively flooded the failover sites, and some of our partners reported email delays ranging from minutes to as long as two hours. Worse, our Exchange hosting clients were impacted for 4-6 hours and, with ExchangeDefender LiveArchive unavailable to back them up, the service completely failed them. Again, we are sorry for this issue and will address it immediately.

At approximately 10 PM EST our partners briefed me in detail on what happened, how they responded, what we learned and what we intend to do to fix it going forward.

Going Forward

First of all, we will begin deployment of redundant control systems for ExchangeDefender: placing additional admin servers across our failover sites, adding more capacity to the existing ones and most importantly providing geographic redundancy to ExchangeDefender LiveArchive.

Second, we will be adding redundancy to our Exchange 2010 networks in the USA.

Expect to see major changes behind the scenes this quarter. While this issue never occurred before and we don’t expect it to occur again, we have learned the hard way that we need to greatly improve certain areas of the product, LiveArchive in particular.

Personally, I put my name and reputation on whom we choose as our partners and on the service we deliver. I believe we are the best and the solutions that we offer are without comparison. We will make sure that all of the features, not just the inbound mail processing, live up to the 100% uptime expectation you should have and we have maintained on our inbound service for the past decade.

My staff and our partners have worked tirelessly throughout the day and night to keep you informed and restore service as fast as possible. I want to personally thank you for your professionalism and the way you treated us during this difficult time. While it’s easy to lose composure and patience when services are down and there is limited visibility/ETA on resolution, almost universally the comments included “Well that sucks but I’m glad it’s you dealing with this and not me.” While I appreciate it, I do feel we failed you.

Our partner operations will remain in Dallas at the existing data center facility because, simply put, they are the best. The power incident was the first one that we’ve experienced in nearly a decade of working with them, and it is the kind of issue Microsoft, Google and Amazon experience on a weekly basis. Cloud services are about providing an affordable IT solution through massively scalable equipment, which is incredibly complex. It is neither foolproof nor easy to fix when it goes down, but the benefit is that you are not on the hook for a repair bill, equipment or the manpower required to manage it all. Best of all, problems such as those experienced yesterday can be minimized, and we will begin to work on that today.

Again, I apologize for the inconvenience this has caused you and your clients. I know we are fortunate to earn your business and the trust you put in us and our services that are designed to keep you up and running when your systems are affected. I look forward to talking to everyone that was affected by this and while I work my way through the messages, emails and callback requests I hope this email provides some clarity as to what happened and what we intend to do next.

Thank you for your business.

Sincerely,
Andrew Moon
Chief Technologist | Network Logix


Hosted Exchange email interruption

We are experiencing routing issues in our Dallas data center affecting our Hosted Exchange email. Our engineers are working to resolve the issue as quickly as possible.

[UPDATE] 12:32PM EST: The issue has been identified as power related in the Data Center. Services are slowly coming online. We will update when service is fully restored.

[UPDATE] 2:17PM EST: The data center staff has confirmed an outage with the power plant and has individuals on staff attempting to redirect power around the core.

[UPDATE] 3:27PM EST: Unfortunately our hands are tied with power related issues. We will keep this account updated as information comes along.

[UPDATE] 3:47PM EST: It has been determined that the ATS (Automatic Transfer Switch) will need repairs that will take time to perform. Fortunately our Data Center has another ATS that is on-site that can be used as a spare. Contractors are working on a solution right now that will allow us to safely bring that ATS in and use it as a spare while that repair is happening.

That plan is being developed now and we should have an update soon as to the time frame to restore temporary power. We will need to schedule another window when the temp ATS is brought offline and replaced by the repaired ATS.

[UPDATE] 4:32PM EST: Thank you for your patience as we work to address the ATS issue. We apologize for the situation and are working as quickly as possible to restore service.

We have determined that the repairs for the ATS will take more time than anticipated, so we are putting into service a backup ATS that we have on-site as part of our emergency recovery plan. We are working with our power team to safely bring the replacement ATS into operation. We will update you as soon as we have an estimated time that the replacement ATS will be online.

Later, once we have repaired the main ATS, we will schedule an update window to transition from the temporary power solution. We will provide advance notice and timelines to minimize any disruption to your business.

Again, we apologize for the loss of connectivity and impact to your business. We are working diligently to get things back online for our customers. Please expect another update within the hour.

[UPDATE] 4:45PM EST: The data center staff, electricians and all the suppliers have updated us with the provisional service restoration time of about 2:30 from now (about 6:30 PM EST). This is just an estimate, and we will update this posting at that time. We are now turning up our emergency sites and failover systems, which also takes some time but will restore most of the services more rapidly.

[UPDATE] 5:10 PM EST: Our emergency failover systems are kicking in and we have restored service to support portals, web sites, outbound ExchangeDefender (inbound was not affected), louie and rockerduck. Please note THESE ARE OUR EMERGENCY FAILOVER systems, not the real thing. We are still waiting for services to be restored by the utility, power team, electricians, etc.

[UPDATE]  6:10 PM EST: Our team and electricians are working diligently to get the temporary ATS installed, wired and tested to allow power to be restored. As the ATS involves high-voltage power, we are following the necessary steps to ensure the safety of our personnel and your equipment housed in our facility.

Based on current progress the electricians expect to start powering the equipment on between 6:15 – 7:00pm Central. This is our best estimated time currently. We have thoroughly tested and don’t anticipate any issues in powering up, but there is always the potential for unforeseen issues that could affect the ETA so we will keep you posted as we get progress reports. Our UPS vendor has checked every UPS, and the HVAC has checked every unit and found no issues. Our electrical contractor has also checked everything.

We realize how challenging and frustrating it has been to not have an ETA for you or your customers, but we wanted to ensure we shared accurate and realistic information. We are working as fast as possible to get our customers back online and to ensure it is done safely and accurately. We will provide an update again within the hour.

[UPDATE] 9:16 PM EST: Since this was a power failure, the DEWEY database didn’t cleanly commit its log files. We’re performing an integrity check before bringing it back online.

[UPDATE]  10:32 PM EST: At this time all systems are back online. It may take several hours for the servers to process all the queued mail. If you are still unable to access your email, please call 614.503.7139 option 2. Thank you for your patience today. We sincerely apologize for the inconvenience to you and your business.

[UPDATE] August 11 10:04 AM EST: At this time we are experiencing latency issues with the Dewey server due to the unforeseen power outage we experienced yesterday. Our Exchange engineers are doing their best to avoid as much service interruption as they possibly can for all of our partners and their clients while they work on fully restoring the Dewey server.


DEWEY slowness

8:04 AM EST: We’ve identified some slowness on DEWEY. Please stand by while we research any issues we identify. We’ll have more details once the issue is resolved.

8:39 AM EST: We’ll be restarting the Information Store shortly. This will cause a 10-15 minute outage. We do apologize for the inconvenience.

8:52 AM EST: We were able to dissipate the delays and queue delivery issues without stopping the Information Store service. We’ll hold off on maintenance until after hours tonight unless a pressing issue arises. Thank you.

9:15 AM EST: We’ve received reports of accounts not being accessible. We’re looking at emergency maintenance.

9:50 AM EST: Unfortunately, we’re now rebooting the server to alleviate the problems. We should be back online shortly.

10:25 AM EST: DEWEY is back online; we’re currently spooling all the queued mail.


DEWEY performance issues

The DEWEY Exchange 2007 server was reported to have delays in processing requests from clients. We are restarting the information store to resolve the issue.

Update 1:00 PM: The information store has finished restarting and we’ve confirmed access to mailboxes on each DB.

Update 3:46 PM: Multiple parties have reported issues in delivery speed. We are rebooting the DEWEY server in order to clear out the cached memory.

Update 4:08 PM: We’ve been monitoring the server reboot via KVM. The server is in the process of installing Windows updates before it completes the reboot.

Update 4:21 PM: The installation of the Windows updates has completed and the server is now stopping the Exchange services. The server should complete the reboot in the next 15 minutes.

Update 4:49 PM: The reboot has completed. We’ve verified that mail is being routed and confirmed that our test users could log in.

Update 5:30 PM: After the reboot, the DEWEY transport started routing mail through one of its alias IPs instead of the primary IP, causing outgoing messages to be rejected with “Relaying denied, proper authentication required”. The issue has been resolved and mail is routing out.


Microsoft has issued a Security Bulletin Advance Notification, indicating that its July release will contain four bulletins. Three bulletins will have the severity rating of critical and will be for Microsoft Windows and Office. The remaining bulletin will have the severity rating of important and will be for Microsoft Office. Release of these bulletins is scheduled for Tuesday, July 13, 2010.

Network Logix will provide additional information as it becomes available.
