|Maybe, but not quite yet|
I've been implementing SnapManager for SQL Server in our production environment for about 8 months. Every new system coming online during that time is configured to use SMSQL as the standard backup and recovery tool and we've also managed to retrofit some existing systems to use the tool. All-in-all we have roughly 50 servers runing SMSQL and managed by SMSQL.
During our virtualization project we, meaning the DBA team, were basically dictated to use the tool. Our storage was all moving to NetApp and lots of dollars exchanged hands so the storage team wanted to squeeze the most out of what NetApp had to offer. We had online conference calls with NetApp discussing the tool and we had a NetApp engineer onsite to outline how the tool works and how the systems should be configured.
I have to say I was intrigued by the snapshot technology and I still believe it is a great way to back up databases. Early on though I had my reservations. Configuration was difficult (I have an article discussing SMSQL configuration at http://www.mssqltips.com/tip.asp?tip=2294) and I could tell right away that using the SMSQL management console to manage a lot of backups was going to be a problem. Still, I learned it and we implemented it but as of today I'm a bit older and wiser and we've recently gone back and revisited our approach. Let's look at some recent discoveries.
You will begin to have difficulty managing an environment where the servers running SMSQL exceed twenty.
|The dreaded hourglass. You'll see this |
often in large SMSQL environments
This is of course a rough estimate. I think of it as the MS Access rule. You wouldn't normally want more than 10 people using a single MS Access database, well you aren't going to want more than 20 servers listed in SMSQL management console. The application bogs down to an unbearably slow pace. We have about 50 servers listed and if we close the console and reopen it we won't see the server list again for 30 minutes.
Problems between SnapDrive and virtual machines cause random errors
SMSQL is simply an interface calling Snapdrive. There are known stability problems between Snapdrive and vMotion. We constantly face errors when SMSQL cannot see the LUNs or the drives are invalid. This doesn't mean the system is down - it just mean SMSQL has lost its way. Unfortunately, backups will not occur. The worst problem is you may fix this one day but two days later the same problem will occur on the same server.
Disconnect between SMSQL and SQL Server
SMSQL does not actually "talk" to SQL Server. The only time the two ever meet is when you first configure a job and SMSQL calls the SSMS to create and schedule a SQL Server job. The job is a command like executable which calls the service running on the database server. The list of databases and other backup instructions is hardcoded in the command line. The problem here is if you add or remove a database the job has to be manually updated. But just because you manually update the job doesn't mean the SMSQL console configuration will recognize the change. Basically, you have to wait until all user databases are created prior to configuring SMSQL on the server. If a new database is added or removed you have to manually reconfigure SMSQL. Doing this can be a problem for even small shops. Also keep in mind that you may have a successful SQL job but the snapshot backup might still fail. You should not use job success or failure notification as your primary means of determining whether or not backup was successful.
Configuration requirement make retrofitting servers unpractical.
SMSQL has strict LUN requirements. I'm not going to discuss them here but let's just say if you have a large amount of server already in place then reconfiguring these to use SMSQL may not be worth the effort. We have around 200 production servers and we estimated that to reconfigure these to use SMSQL would take an excess of a thousand man hours. That's just not going to happen.
Man hours wasted
With only 50 servers using SMSQL we have one DBA devoted full time to backup and recovery. Prior to SMSQL we had 1/2 DBA managing 300 servers. This is due to the random nature of the errors. The backup failed list never shortens because the same problems occur over and over and NetApp does not yet have a resolution. Throw on top of this the application slowness and we have one over-worked DBA.
So, what's the answer? We have already decided as an organization to begin moving systems off of SMSQL and back to Idera's SQLSafe. We just cannot risk our ability to recover our systems. I also want to point out that the NetApp engineer we talked to about these problems has been honest and understanding. So far, NetApp has shown a great desire to put forth the effort to produce a better product. I hope to be working with NetApp engineers and walk them through our real-world scenarios. My hope is they will gain a better understanding of what DBAs require to successfully manage backup and recovery for large SQL Server environments. We plan to still keep all new systems configured as if they will run SMSQL. Maybe one day a useful tool will be made to manage a very useful technology.