Tuesday, July 19, 2011

Red Gate's Exceptional DBA Award

Along with four other DBAs, I received notice of my esteemed spot as a finalist for Red Gate's Exceptional DBA Award. It reminds me of the time I watched the show on the Discovery Channel where they film the Navy SEALs training. One of the best lines in the show is when the instructors ask the candidates what second place is and they all reply "First Loser! Sir!". Well, there are times when being in second place, or third, or fourth, or even fifth isn't all that bad.

Don't get me wrong. I'd love nothing more than to win, both for the honor and acknowledgement of years of hard work and for the opportunity to go to PASS, which I've never been able to attend. I met tons of great people at the SQLRally and it would be great to see them again and meet new friends. I also spent an interesting time in my life in Seattle. I lived downtown, worked as a concrete finisher, and basically starved. But it was a great place and I'd love to see how it has changed.

There's some fantastic competition, which makes being a finalist all the more exciting. Colin Stasiuk is extremely active in the community and already has a wonderful following that is hard to compete against. There is also Tom Hill, who I actually had the pleasure of working with at Anheuser-Busch. He also worked with many of my friends at Monsanto, and they all have good things to say about him. I wish all the candidates the best of luck.

As for me, please go to the Red Gate website and take a look. Hopefully you'll see a word or phrase or something familiar in my experiences which may help win your vote. Oh, also, if you vote for me and I end up winning - talk to me and we'll have a beer at PASS. :)

VOTE FOR PEDRO (OR SCOTT)

Tuesday, July 12, 2011

The Skinny on SnapManager for SQL Server

Maybe, but not quite yet

I've been implementing SnapManager for SQL Server (SMSQL) in our production environment for about 8 months. Every new system coming online during that time has been configured to use SMSQL as the standard backup and recovery tool, and we've also managed to retrofit some existing systems to use the tool. All in all, we have roughly 50 servers running and managed by SMSQL.

During our virtualization project we, meaning the DBA team, were basically told to use the tool. Our storage was all moving to NetApp and lots of dollars changed hands, so the storage team wanted to squeeze the most out of what NetApp had to offer. We had online conference calls with NetApp discussing the tool and we had a NetApp engineer onsite to outline how the tool works and how the systems should be configured.

I have to say I was intrigued by the snapshot technology and I still believe it is a great way to back up databases. Early on, though, I had my reservations. Configuration was difficult (I have an article discussing SMSQL configuration at http://www.mssqltips.com/tip.asp?tip=2294) and I could tell right away that using the SMSQL management console to manage a lot of backups was going to be a problem. Still, I learned it and we implemented it. As of today I'm a bit older and wiser, and we've recently gone back and revisited our approach. Let's look at some recent discoveries.

You will begin to have difficulty managing an environment with more than twenty servers running SMSQL
[Image caption: The dreaded hourglass. You'll see this often in large SMSQL environments.]
This is of course a rough estimate. I think of it as the MS Access rule. Just as you wouldn't normally want more than 10 people using a single MS Access database, you aren't going to want more than 20 servers listed in the SMSQL management console. The application bogs down to an unbearably slow pace. We have about 50 servers listed, and if we close the console and reopen it we won't see the server list again for 30 minutes.

Problems between SnapDrive and virtual machines cause random errors
SMSQL is simply an interface calling SnapDrive. There are known stability problems between SnapDrive and vMotion. We constantly face errors where SMSQL cannot see the LUNs or reports the drives as invalid. This doesn't mean the system is down - it just means SMSQL has lost its way. Unfortunately, backups will not occur. The worst part is that you may fix this one day, but two days later the same problem will occur on the same server.

Disconnect between SMSQL and SQL Server
SMSQL does not actually "talk" to SQL Server. The only time the two ever meet is when you first configure a job and SMSQL calls SSMS to create and schedule a SQL Server job. The job is a command-line executable which calls the service running on the database server. The list of databases and other backup instructions is hardcoded in the command line. The problem here is that if you add or remove a database, the job has to be manually updated. But just because you manually update the job doesn't mean the SMSQL console configuration will recognize the change. Basically, you have to wait until all user databases are created before configuring SMSQL on the server. If a database is added or removed, you have to manually reconfigure SMSQL. Doing this can be a problem even for small shops. Also keep in mind that you may have a successful SQL job while the snapshot backup still fails. You should not use job success or failure notification as your primary means of determining whether or not the backup was successful.
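
The drift between the hardcoded job and the real server can be caught with a simple comparison. Here is a minimal sketch in Python. It assumes you can collect, by whatever means you already have, the database list from the job's command line, the current list of user databases, and the time of the last backup you have actually verified. All function names and sample values are hypothetical, and nothing here calls SMSQL or SQL Server.

```python
from datetime import datetime, timedelta


def find_drift(job_databases, server_databases):
    """Compare the databases named in a backup job with those on the server.

    Names are compared case-insensitively, which is an assumption that
    matches a default (case-insensitive) SQL Server collation.
    """
    in_job = {name.lower() for name in job_databases}
    on_server = {name.lower() for name in server_databases}
    return {
        # On the server but never backed up by the job.
        "missing_from_job": sorted(on_server - in_job),
        # Still named in the job but no longer on the server.
        "stale_in_job": sorted(in_job - on_server),
    }


def find_stale_backups(last_verified, now, max_age_hours=24):
    """Return databases whose last verified backup is missing or too old.

    last_verified maps a database name to a datetime, or to None when no
    verified backup exists. Job success is deliberately not an input.
    """
    cutoff = now - timedelta(hours=max_age_hours)
    return sorted(
        name
        for name, finished in last_verified.items()
        if finished is None or finished < cutoff
    )


if __name__ == "__main__":
    # Sample data only; replace with lists gathered from your own servers.
    print(find_drift(["Sales", "HR"], ["Sales", "HR", "Payroll"]))
    now = datetime(2011, 7, 12, 8, 0)
    print(find_stale_backups({"Sales": now - timedelta(hours=2), "HR": None}, now))
```

The point of the second function is that it looks at verified backups rather than job status, so a job that reports success while the snapshot failed still shows up on the list.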

Configuration requirements make retrofitting servers impractical
SMSQL has strict LUN requirements. I'm not going to discuss them here, but let's just say that if you have a large number of servers already in place, reconfiguring them to use SMSQL may not be worth the effort. We have around 200 production servers and we estimated that reconfiguring them to use SMSQL would take in excess of a thousand man hours. That's just not going to happen.

Man hours wasted
With only 50 servers using SMSQL we have one DBA devoted full time to backup and recovery. Prior to SMSQL we had half a DBA managing 300 servers. This is due to the random nature of the errors. The list of failed backups never shortens because the same problems occur over and over and NetApp does not yet have a resolution. Throw the application slowness on top of this and we have one overworked DBA.
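
To put those staffing figures side by side, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above; the function name is just for illustration.

```python
def servers_per_dba(servers, dbas):
    """Number of servers one full-time DBA covers for backup and recovery."""
    return servers / dbas


with_smsql = servers_per_dba(50, 1.0)      # one full-time DBA for 50 servers
before_smsql = servers_per_dba(300, 0.5)   # half a DBA for 300 servers

# How many times more DBA effort each server needs under SMSQL.
effort_ratio = before_smsql / with_smsql

print(f"With SMSQL:   {with_smsql:.0f} servers per DBA")
print(f"Before SMSQL: {before_smsql:.0f} servers per DBA")
print(f"Effort per server is {effort_ratio:.0f}x higher with SMSQL")
```

In other words, one DBA used to cover the equivalent of 600 servers and now covers 50, which is about twelve times the effort per server.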

So, what's the answer? We have already decided as an organization to begin moving systems off of SMSQL and back to Idera's SQLSafe. We just cannot risk our ability to recover our systems. I also want to point out that the NetApp engineer we talked to about these problems has been honest and understanding. So far, NetApp has shown a great desire to put forth the effort to produce a better product. I hope to work with NetApp engineers and walk them through our real-world scenarios. My hope is that they will gain a better understanding of what DBAs require to successfully manage backup and recovery for large SQL Server environments. We still plan to keep all new systems configured as if they will run SMSQL. Maybe one day a useful tool will be made to manage a very useful technology.