I know that many of you have gone through your own harrowing tales of trying to bring environments back online. I always enjoy hearing about these experiences. Why? Because that is where learning happens: problems crop up and solutions have to be found. While my tale doesn’t involve a tremendous amount of learning per se, I did discover a few things along the way that may be useful for someone who has to deal with something similar later. So let’s begin the timeline.
Backstory
The current server is a Microsoft Small Business Server 2011 box. It serves primarily as a DNS/file/Exchange server, housing about 300-400GB of Exchange data and about 700GB of user data. The machine is normally backed up with a product called Replibit, which uses an onsite appliance to house the data and stage it for replication to the cloud, so in theory you have both a local backup snapshot and a remote-site copy. Since backups always seem to come with challenges attached, that seems like an appropriate amount of caution. The server itself is a Dell and is more than robust enough to handle the small business’ needs. There are other issues I would be remiss not to mention, like the fact that the majority of the network sits on a 10/100 switch, with the single gigabit uplink used by the SBS server.
Wednesday, sometime in the wee hours of the morning….
This was when the server was laid low. I don’t know exactly what caused it, as I haven’t performed a root cause analysis yet, and that’s unlikely to happen now. Going forward I will be recommending a new direction for the customer, as I believe there are better options out there now (Office 365, standard Windows Server).
I believe there was some sort of patch that may or may not have been applied around the time the machine went down. Regardless, the server went down and did not come back up. It would not even boot into Safe Mode; it would just reboot continually as soon as Windows began to load. Alerts went off notifying us of the outage, and the immediate action taken was to promote the latest snapshot to a VM on the backup appliance. This is one of the nice features Replibit offers. The appliance itself runs a customized Lubuntu distro, and virtualization duties are handled by KVM. The VM started with no difficulty, and with a few tweaks to Exchange (for some reason it didn’t maintain the DNS forwarding options), everything was up and running.
After 20 minutes of unsuccessfully trying to get the Dell server to start in Safe Mode, Last Known Good Configuration, or any other mode I could, I decided my energies would be better spent working on the restore. Since the users were working fine and happy on the VM, the decision was made to push the restore to Saturday to minimize downtime and disruption.
Saturday 8:00am…….
As much as I hate to get up early on a Saturday and do anything besides drink coffee, I got up and drove to the company’s office. An announcement had been made the day before that everyone should be out of email, off the network, etc. We then shut down the VM. Using the recovery USB, I booted into the recovery console and attempted to start a restore of the snapshot the VM was running from. I was promptly told “No” by the recovery window. The reason? The iSCSI target could not be created. This being the first time I had personally used Replibit, I discovered how it works: the appliance creates an iSCSI target from the snapshot, then uses that to stream the data back to the server being recovered. Apparently, when we promoted the snapshot to a live VM, it created a delta disk holding the changes from Wednesday to Saturday morning. The VM had helpfully found some bad blocks on the six-month-old 2TB Micron SSD in the backup appliance, which corrupted the snapshot delta disk. This was not what I wanted to see.
With the help of Replibit support, we attempted everything we could to start the iSCSI target. We had no luck. We then tried creating an iSCSI target from the previous snapshot. That worked, but it was a problem, because we would lose 3.5 days of email and work. Through some black magic and a couple of small animal sacrifices, we managed to mount the D drive of the corrupted snapshot with the rest of the week’s data (somehow it was able to differentiate the drives inside the snapshot). I was afraid, though, that timestamps would end up screwing us with the DBs on the server. For lack of any other options, we decided to press forward. The revised plan was to restore the C drive backup from Tuesday night and then copy the data from the snapshot’s D drive using WinSCP. We started the restore at about 11am on Saturday. We were only restoring 128GB of data, so we didn’t believe it would take that long. The restore was cooking at first, 200-350MB/min, but as time wore on, the timer kept adding hours to the estimate and the transfer rate kept dropping. Let’s fast forward.
Sunday 9:20pm
Yes…. 30+ hours later for 130GB of data, and we were done with just the C drive. At this point we were sweating bullets. The company was hoping to open as usual Monday morning, and with those sorts of restore times, it wasn’t going to happen. (I would like to send a special shout out to remote access card manufacturers, Dell’s iDRAC in this case, without which I would have been forced to stay onsite the whole time, and that wouldn’t have been fun.) Back to the fun. The first thing now was to see if the restore worked and the server would come up. I was going to bring it up in Safe Mode with Networking, since the main Exchange DB was on the D drive and I didn’t want the Exchange server, or any other service that needed files on the D drive, to try to start without it.
The server started and F8 was pressed. “Safe Mode with Networking” was selected and fingers were crossed. The startup files scrolled all the way down through Classpnp.sys and it paused. The hard drives lit up and pulsed like a Christmas tree. Five minutes later the screen flashed and “Configuring Memory” showed back up on the screen. “Fudge!” This was what happened before the restore, just slower. I rebooted, came back to the selection screen, and this time chose plain “Safe Mode”. For whatever reason, the gods were smiling on us and the machine came up. The first window I opened was a command prompt, where I ran SFC /scannow. That finished with no corrupt files found (of course), so I moved on. I then re-created the D drive, as the restore had overwritten the partition table when the C drive was laid down. I had no network access, of course, and needed it to continue the restoration process, so I rebooted again and chose “…with Networking” again. This time it came up.
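For reference, that Safe Mode session boiled down to something like the sketch below. The disk number, label, and drive letter are assumptions for illustration only; on a real box you would confirm them with diskpart’s list commands before doing anything destructive.

    :: Run from the elevated command prompt in Safe Mode.
    :: Check the restored system files for corruption.
    sfc /scannow

    :: Re-create the data partition that the C: restore wiped out.
    :: "disk 1" and the label are assumptions -- verify with "list disk" first.
    diskpart
    DISKPART> list disk
    DISKPART> select disk 1
    DISKPART> create partition primary
    DISKPART> format fs=ntfs label=Data quick
    DISKPART> assign letter=D
    DISKPART> exit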
Now we moved on to the file copy. The D drive was mounted on the backup appliance in the /tmp folder (just mounted, mind you, not moved there). We connected with WinSCP, chose a couple of folders, and started the copy. Those folders moved fine, so on to some larger ones……Annnnnd an error message. OK, what was the error? The file name was too long. Between the path and the file name, we had files that exceeded 255 characters. This was essentially a Windows Server 2008 R2 box, so there was no real help for files beyond that limit: while the NTFS file system itself can accept a path plus filename of over 32,000 characters, the Windows API that Explorer and most tools use (the infamous MAX_PATH limit) can’t. Well, crap. This was not going the way I wanted it to. Begin thought process here. Hmmm, Microsoft says there is a hotfix that can work around this… but that doesn’t help with the files that were pseudo-moved already. I can’t move, delete, rename, or do anything useful with those files, whether at the command line or in Explorer. (I did discover later that I can delete files locally with names past 255 characters if I use WinSCP to do so. This does create a lock on the folder, though, so you will need to reboot before you can delete everything.) I can’t run the hotfix in Safe Mode, but I don’t really want to start Windows in normal mode. I don’t have much choice at this point, so I move the rest of the Exchange DB files over to the D drive, which will let me start in regular mode without worrying about Exchange. I then go home to let the server finish copying the roughly 350GB of remaining data. A text is sent out informing the company that the server is not done and where things stand.
Monday morning 8am
The server is rebooted and it comes up in regular mode (BIG SIGH OF RELIEF). The hotfix files are retrieved and I try to run them. Every one of them, even though 2008 R2 is specifically called out, informs me that it will not work on my operating system. Well, this is turning back into a curse-inducing moment… again. Through a friend, I learn of a possible registry entry that might let us work with long file names; this doesn’t work either. While frantically culling through websites in search of a solution, I find two programs that do not use the Windows API and so are not hampered by that pesky MAX_PATH limit. (I did find the SUBST command, which I could have used at the CLI to shorten paths manually, but that is not feasible when one user alone has over 50k files that would need to be handled.) Those programs are RoboCopy and FastCopy. FastCopy looks a little dated, I know, but as I found out, it worked really well. On to the next hurdle! These tools need a Windows SMB share to copy from, so we had to create a Samba share on the backup appliance pointing at the mounted snapshot so we could get to it. This worked, and a test copy was set up. 5 minutes in… 10 minutes in… It seems to be working, and FastCopy is averaging a little better than 1GB/min. I set it up for multiple folders and decided to leave it in peace and go to bed (it is 12am at this point).
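For anyone who wants the gist of it, the copy ended up looking something like the sketch below. The appliance hostname, share name, and paths are placeholders, and I’m showing the RoboCopy flavor because it is scriptable; FastCopy does the same job through its GUI.

    :: Map the Samba share the appliance exposes over the mounted snapshot.
    :: Hostname and share name are placeholders for illustration.
    net use Z: \\backup-appliance\snapshot_d

    :: RoboCopy handles paths past the 260-character MAX_PATH limit,
    :: so the long names that broke WinSCP/Explorer copy without complaint.
    :: /E         include subfolders (even empty ones)
    :: /COPYALL   try to bring data, timestamps, and security across
    ::            (ACLs coming through a Samba share may not map cleanly)
    :: /R:1 /W:1  retry once and wait 1 second instead of the huge defaults
    :: /LOG+ /TEE append to a log file and echo progress to the console
    robocopy Z:\ D:\ /E /COPYALL /R:1 /W:1 /LOG+:C:\restore\d-copy.log /TEE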
Tuesday morning
All the files have been moved over by this time. Some of them didn’t bring their NTFS permissions with them for some odd reason, but no big deal, I’ll just re-create those manually. Exchange needs to be started. Eseutil to the rescue! The DBs had been shut down in a dirty state, and the logs were also located on the C drive. We were able to find the missing logs, though, replay everything back together, and get the DBs mounted. At this point there were just a few “mop-up” items left. One user lost about four days of email, since she was on a lone DB by herself and it was hosted on the C drive. She wasn’t happy, but there wasn’t much we could do about a hardware corruption issue, unfortunately.
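For the record, that cleanup looked roughly like the commands below. The database paths, log prefix (E00), share path, domain, and group names are all placeholders; check the header output and your own log prefix before running recovery.

    :: Dump the database header -- "State: Dirty Shutdown" means logs must be replayed.
    eseutil /mh "D:\Exchange\Mailbox Database\Mailbox Database.edb"

    :: Soft recovery: replay the transaction logs (E00 prefix assumed) against the
    :: database so it reaches Clean Shutdown and can be mounted again.
    eseutil /r E00 /l "C:\ExchangeLogs\Mailbox Database" /d "D:\Exchange\Mailbox Database"

    :: Re-apply the NTFS permissions that didn't survive the copy.
    :: Domain, group, and path are placeholders.
    icacls "D:\UserData\Engineering" /grant "CONTOSO\Engineering:(OI)(CI)M" /T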
Lessons learned from this are as follows (the list is not all-inclusive). Test the backup solution you are using before you need it. Some things are unfortunately beyond your control; corruption on the backup device’s hardware is one of those things that just seems like bad luck. You should always have a Restore Plan B, C, and so on, however. Along with that, realistic RPOs and RTOs should be shared with the customer to keep everyone calm. Invest in good whiskey. And MAX_PATH limits suck, but they can be worked around with the programs mentioned above. Happy IT’ing to everyone!