Tuesday, July 12, 2011

The Skinny on SnapManager for SQL Server

Maybe, but not quite yet

I've been implementing SnapManager for SQL Server in our production environment for about 8 months. Every new system coming online during that time is configured to use SMSQL as the standard backup and recovery tool, and we've also managed to retrofit some existing systems to use the tool. All in all, we have roughly 50 servers running and managed by SMSQL.

During our virtualization project we, meaning the DBA team, basically had the tool dictated to us. Our storage was all moving to NetApp, and a lot of dollars had changed hands, so the storage team wanted to squeeze the most out of what NetApp had to offer. We had online conference calls with NetApp discussing the tool, and we had a NetApp engineer onsite to outline how the tool works and how the systems should be configured.

I have to say I was intrigued by the snapshot technology, and I still believe it is a great way to back up databases. Early on, though, I had my reservations. Configuration was difficult (I have an article discussing SMSQL configuration at http://www.mssqltips.com/tip.asp?tip=2294), and I could tell right away that using the SMSQL management console to manage a lot of backups was going to be a problem. Still, I learned it and we implemented it, but today I'm a bit older and wiser, and we've recently gone back and revisited our approach. Let's look at some recent discoveries.

You will begin to have difficulty managing an environment once the number of servers running SMSQL exceeds twenty.
[Image: the dreaded hourglass. You'll see this often in large SMSQL environments.]
This is of course a rough estimate. I think of it as the MS Access rule: just as you wouldn't normally want more than 10 people using a single MS Access database, you aren't going to want more than 20 servers listed in the SMSQL management console. The application bogs down to an unbearably slow pace. We have about 50 servers listed, and if we close the console and reopen it, we won't see the server list again for 30 minutes.

Problems between SnapDrive and virtual machines cause random errors
SMSQL is simply an interface that calls SnapDrive, and there are known stability problems between SnapDrive and vMotion. We constantly face errors where SMSQL cannot see the LUNs or the drives are reported as invalid. This doesn't mean the system is down - it just means SMSQL has lost its way. Unfortunately, backups will not occur. The worst part is that you may fix this one day, and two days later the same problem will occur on the same server.

Disconnect between SMSQL and SQL Server
SMSQL does not actually "talk" to SQL Server. The only time the two ever meet is when you first configure a backup and SMSQL creates and schedules a SQL Server Agent job. The job is a call to a command-line executable that in turn invokes the SMSQL service running on the database server, with the list of databases and other backup instructions hardcoded in the command line. The problem here is that if you add or remove a database, the job has to be manually updated. And just because you manually update the job doesn't mean the SMSQL console configuration will recognize the change. Basically, you have to wait until all user databases are created before configuring SMSQL on the server, and if a database is later added or removed, you have to manually reconfigure SMSQL. Doing this can be a problem for even small shops; a drift check like the one sketched below can at least catch the mismatch. Also keep in mind that you may have a successful SQL job while the snapshot backup still fails, so you should not use job success or failure notification as your primary means of determining whether or not a backup was successful.
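Since SMSQL offers nothing for this, here is a minimal sketch of the kind of drift check you could schedule yourself; the configured list has to be pasted in by hand from the SMSQL job step, and all database names are illustrative:

    -- Minimal sketch: flag user databases missing from the SMSQL job's
    -- hardcoded list (names illustrative, copied from the job step by hand).
    DECLARE @ConfiguredDBs TABLE (name sysname PRIMARY KEY);
    INSERT INTO @ConfiguredDBs (name) VALUES ('SalesDB'), ('HRDB');

    SELECT d.name AS MissingFromSMSQLJob
    FROM sys.databases AS d
    LEFT JOIN @ConfiguredDBs AS c ON c.name = d.name
    WHERE d.database_id > 4      -- skip the system databases
      AND c.name IS NULL;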

Configuration requirements make retrofitting servers impractical.
SMSQL has strict LUN requirements. I'm not going to discuss them here, but let's just say that if you have a large number of servers already in place, reconfiguring them to use SMSQL may not be worth the effort. We have around 200 production servers, and we estimated that reconfiguring them to use SMSQL would take in excess of a thousand man-hours. That's just not going to happen.

Man-hours wasted
With only 50 servers using SMSQL, we have one DBA devoted full time to backup and recovery. Prior to SMSQL, half of one DBA's time covered 300 servers. This is due to the random nature of the errors: the backup-failure list never shortens because the same problems occur over and over, and NetApp does not yet have a resolution. Throw the application slowness on top of this and we have one overworked DBA.

So, what's the answer? We have already decided as an organization to begin moving systems off of SMSQL and back to Idera's SQLSafe. We just cannot risk our ability to recover our systems. I also want to point out that the NetApp engineer we talked to about these problems has been honest and understanding. So far, NetApp has shown a great desire to put forth the effort to produce a better product. I hope to work with NetApp engineers and walk them through our real-world scenarios. My hope is they will gain a better understanding of what DBAs require to successfully manage backup and recovery for large SQL Server environments. We still plan to configure all new systems as if they will run SMSQL. Maybe one day a useful tool will be made to manage a very useful technology.

4 comments:

  1. I would like to ask people who are using SMSQL to add a comment telling me what improvements you would like to see in the tool. My first suggestion to NetApp is to think about using a repository for backup and recovery configuration and history. All good 3rd-party backup tools use one, and it adds speed and scalability. This is also a good start toward a truly centralized management console. Any others?
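    To give a rough idea of what I mean by a repository, here is a purely illustrative sketch of the kind of history table it might hold (every name here is hypothetical, not anything NetApp ships):

        -- Hypothetical central repository table for backup history;
        -- a real design would also cover configuration and retention.
        CREATE TABLE dbo.BackupHistory (
            BackupId     int IDENTITY(1,1) PRIMARY KEY,
            ServerName   sysname       NOT NULL,
            DatabaseName sysname       NOT NULL,
            SnapshotName nvarchar(255) NOT NULL,
            StartTime    datetime      NOT NULL,
            EndTime      datetime      NULL,
            Succeeded    bit           NOT NULL
        );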

  2. Scott, I believe the SnapInfo volume contains most of the backup & recovery configuration and history (backup logs). Or were you thinking about something else?

    Also, how did you have your storage provisioned to your SQL servers? RDMs? MS iSCSI software initiator?

    Have you looked at using newer versions of SMSQL and SnapDrive with NFS-connected datastores? This really simplifies things IMHO. You just leave your databases sitting on a VMDK on an NFS-based datastore. SnapDrive 6.3 now supports this and is aware of such configurations.

    Also, has anyone at NetApp spoken to you about SnapCreator? SnapCreator is a framework by which you can create a single pane of glass to manage snapshot backup utilities in one central location. It also allows you to write your own "SnapManager-ish" scripts that interact with SnapDrive and the applications to take application-consistent snapshots. You can essentially create a SnapManager for Anything using SnapCreator.

  3. Taylor, we store logs in the SnapInfo directory, but we don't normally take log backups. Our databases run in simple mode, and we take hourly full snaps instead. This saves us some configuration hassles and serves the same purpose as taking hourly logs. Our configuration problems are with isolating data files on their own volumes and separating out any application data. Our older systems usually have a single D drive with everything stored there, and that won't work when moving to an SMSQL solution.
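    For reference, the recovery model switch and a quick check across our user databases look like this (database name illustrative):

        -- Put a database in simple recovery (name illustrative).
        ALTER DATABASE SalesDB SET RECOVERY SIMPLE;

        -- Verify the recovery model of every user database.
        SELECT name, recovery_model_desc
        FROM sys.databases
        WHERE database_id > 4;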


    We use RDM and run the latest SMSQL version (5.1.0.676) and SnapDrive 6.3.2.

    I'll ask NetApp about SnapCreator and see if that's an option for us.

    Our real concern is with the SMSQL application and not necessarily with how the storage is configured. SMSQL is just not a prime-time tool for managing backup and recovery in large SQL Server shops. Maybe it will be one day.

    Thanks for your comments.

  4. Hey Scott,

    I would like to point out that taking hourly full snaps does not in any way serve the same purpose or allow you to do the same things that hourly transaction log backups do. A transaction log backup under the FULL recovery model will allow you to roll a database forward to a specific point in time, facilitating point-in-time recovery of the database. For example, if someone drops a table by accident, you can roll the logs forward to the point in time just before the table was dropped and minimize the total risk of data loss. Using full snapshot backups only, you can get back to the point of the last snapshot; if that was 59 minutes before the table was dropped, you have 59 minutes of potential data loss. With log backups under full recovery, at 59 minutes you would take a log backup and use the STOPAT option to stop the redo operation just before the event occurred, for near-zero data loss.
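    To make that concrete, here is a minimal sketch of such a point-in-time restore (database name, file paths, and timestamp are all illustrative):

        -- Restore the last full backup without recovering the database.
        RESTORE DATABASE SalesDB
            FROM DISK = N'D:\Backup\SalesDB_full.bak'
            WITH NORECOVERY;

        -- Roll the log forward, stopping just before the accidental drop.
        RESTORE LOG SalesDB
            FROM DISK = N'D:\Backup\SalesDB_log.trn'
            WITH STOPAT = '2011-07-12 10:58:00', RECOVERY;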

    Jonathan Kehayias
