Monday, May 2, 2011

Meme Monday: I Got 99 SQL Problems And the Disk Ain’t One

This month, Thomas LaRock (Website | Twitter) declared a Meme Monday inspired by the Hugo song "99 Problems": name 9 problems you frequently see in your shop that are not related to disk issues.

1) Using default install settings for file growth
Despite numerous examples from live systems showing that those settings are not appropriate for our product's databases, and despite the "best practices" documentation telling them otherwise, we frequently see new customers with all of their databases left at the default 10% growth setting.
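The fix is a one-line ALTER DATABASE per file. A minimal sketch (the database and logical file names here are hypothetical, and the growth increments should be sized for your workload):

-- Switch autogrowth from the default 10% to a fixed increment (values are illustrative)
ALTER DATABASE [OurProductDB] MODIFY FILE (NAME = N'OurProductDB_Data', FILEGROWTH = 256MB);
ALTER DATABASE [OurProductDB] MODIFY FILE (NAME = N'OurProductDB_Log', FILEGROWTH = 128MB);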

2) Bloated SQL error logs
Many times when customers report issues and we're called in to examine what's happening with their SQL Server, we find that we can't open the SQL error logs: the instance hasn't been restarted in a long time, and the current errorlog has grown so large that the UI can't open it in a timely manner. The simple fix, of course, is to set up a SQL Agent job that runs sp_cycle_errorlog periodically.
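As a rough sketch (the job name and the weekly 1 AM schedule are just illustrative choices), the Agent job can be created like this:

USE msdb;
-- Create a job whose only step cycles the error log, scheduled weekly at 1 AM
EXEC dbo.sp_add_job @job_name = N'Cycle SQL Error Log';
EXEC dbo.sp_add_jobstep @job_name = N'Cycle SQL Error Log',
    @step_name = N'Cycle errorlog',
    @subsystem = N'TSQL',
    @command = N'EXEC sp_cycle_errorlog;';
EXEC dbo.sp_add_jobschedule @job_name = N'Cycle SQL Error Log',
    @name = N'Weekly',
    @freq_type = 8,              -- weekly
    @freq_interval = 1,          -- on Sunday
    @freq_recurrence_factor = 1, -- every week
    @active_start_time = 10000;  -- 01:00:00
EXEC dbo.sp_add_jobserver @job_name = N'Cycle SQL Error Log';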

3) Not doing ANY index maintenance
Frequently, when I hear about SQL performance issues, I find that the customer has turned off the regular index maintenance jobs "because they take too long". Eventually this results in painfully out-of-date statistics, severely fragmented indexes, and terrible performance.
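If a full nightly rebuild genuinely takes too long, a lighter-weight compromise is to touch only the indexes that are actually fragmented. A sketch (the 30% threshold and the sample table name are my own illustrative choices):

-- Find indexes with significant fragmentation in the current database
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
  AND ips.index_id > 0;  -- skip heaps

-- Rebuild the offenders and refresh their statistics, for example:
ALTER INDEX ALL ON dbo.SomeBusyTable REBUILD;
UPDATE STATISTICS dbo.SomeBusyTable WITH FULLSCAN;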

4) Shrinking databases as "maintenance" to "free up disk space"
I try my best not to use profanity or to scream (loudly) when I see this enabled on customer servers. I just take a deep breath and forward the following links to the guilty party:
Paul Randal: Why You Should Not Shrink your Data Files
Brent Ozar: Stop Shrinking Your Database Files. Seriously. Now.

5) Developers "testing" code on production
Don't get me started....


6) Poor backup plans not even close to SLA requirements
A high-volume OLTP production database in full recovery, with a log backup once a day at midnight and a full backup once a day at 1 AM - but their SLA says they have to completely recover from a failure within one hour. They claim that because the SQL Server is clustered, they don't have to back up the databases more often. Really? Please don't call me when things go south.
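With a one-hour recovery SLA, log backups need to run far more often than once a day - a failover cluster protects against a node failure, not against corruption or accidental deletion, and it still holds only one copy of the data. A minimal sketch (the database name, paths, and 15-minute cadence are illustrative; in practice this would live in Agent jobs or a maintenance plan):

-- Nightly full backup
BACKUP DATABASE [SalesOLTP]
TO DISK = N'X:\Backups\SalesOLTP_full.bak'
WITH INIT, CHECKSUM;

-- Log backup, scheduled every 15 minutes or so, so at most ~15 minutes of work is at risk
BACKUP LOG [SalesOLTP]
TO DISK = N'X:\Backups\SalesOLTP_log.trn'
WITH CHECKSUM;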

7) No disaster recovery plan
... and an office in the middle of Tornado Alley. Again, please don't call me to resurrect your SQL Server when your data center gets destroyed by one of the 500+ tornadoes that went through town. You don't have a copy elsewhere, and I can't create something from nothing.

8) Letting idiots have access to the Server room
Believe me: I can't make this one up - it actually DID happen.
A particular person on the night cleaning crew entered a server room to vacuum it. Because the vacuum's cord was too short to allow him to vacuum the far side of the server room, he unplugged something to free up an outlet so he could vacuum the far corner of the server room. The "something" he unplugged was the main power for the SQL server, which knocked out the customer's website until someone entered the server room in the morning and noticed that the server was unplugged.

9) Not having automated monitoring on servers
You'd think this one would be obvious, but I've been called late at night more times than I can count to hear that someone's server is "down", only to find out that the SQL Server crashed or the backups failed because the disk was full. Automated disk monitoring has been around for over a decade, yet many of our customers have no monitoring at all - and I doubt their IT people check the servers every day, since they always seem so surprised to discover that a disk has filled up completely.
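Even without a monitoring product, a scheduled query like the one below can at least raise a flag before a drive fills up. A sketch (sys.dm_os_volume_stats requires SQL Server 2008 R2 SP1 or later; on older builds, xp_fixeddrives gives a cruder view):

-- Report free space on every volume that hosts a database file
SELECT DISTINCT
       vs.volume_mount_point,
       vs.available_bytes / 1048576 AS free_mb,
       vs.total_bytes / 1048576 AS total_mb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
ORDER BY free_mb;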

After just thinking about those 9 items, it's time for a stress pill.

Thursday, December 2, 2010

Corrupt Database Adventures

Ever notice how a “quick question” can turn into a major project? A coworker IM’d me earlier today with a “quick question” asking whether I knew how to get a database out of Emergency Mode.

Uh oh…

She then reassured me that it was a DEV box and not production. Whew! That meant we didn’t have to worry about losing important customer data and, at worst, we could restore from backup without too much pain. I decided to investigate, since it’s not often that one gets to fiddle with a genuinely corrupt database that isn’t critical to the bottom line.

I did a quick web search and found an article by Paul Randal that talked about the last resorts people try first with a corrupted database. One of the things it mentioned people try is rebuilding the transaction log using DBCC REBUILD_LOG. I tried it, but it’s not a valid DBCC command in SQL Server 2005 (in fact, if I’d read a few more paragraphs of Paul’s article BEFORE trying it, I would’ve seen him mention that very fact – which should teach me the folly of skimming articles).

The next item mentioned was to perform a DBCC CHECKDB using the REPAIR_ALLOW_DATA_LOSS setting. What better time to try this than when you have a broken dev box to play with? Take a deep breath and see what happens.
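One prerequisite worth spelling out: the repair options need the database in single-user mode, so the command below is normally wrapped roughly like this (a sketch, using the same testdb name as the command that follows):

-- Emergency-mode repair wrapper (sketch)
ALTER DATABASE testdb SET EMERGENCY;
ALTER DATABASE testdb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
-- ...run DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS here (see below)...
-- and, if the repair succeeds:
-- ALTER DATABASE testdb SET MULTI_USER;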

DBCC CHECKDB (testdb, repair_allow_data_loss) WITH NO_INFOMSGS;

The results said:

Msg 605, Level 12, State 3, Line 1

Attempt to fetch logical page (1:2576) in database 5 failed. It belongs to allocation unit 72057594060406784 not to 281474980642816.

Msg 8921, Level 16, State 1, Line 1

Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.

That didn’t sound promising. I tweeted #sqlhelp, quoting the text of Msg 605, to see if anyone else had suggestions. Meanwhile, I searched for other possible tricks to try.

One posting I found on a SQL newsgroup (I forget which one and can’t find it again) stated that Msg 605 indicates a physical drive failure. I didn’t verify that, but since the dev box had been troublesome for quite some time, it didn’t sound implausible.

Meanwhile, I got some suggestions from #sqlhelp to try DBCC DBREINDEX and DBCC CHECKTABLE on all the indexes and tables. I snagged a script from SQLServerCentral that loops through every table and performs a DBCC DBREINDEX on all of its indexes. Lots of black text scrolled through the results pane, interspersed with some shocks of red. The red meant something was wrong – but at least the black text following the red indicated that things were being repaired.
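The exact script doesn’t matter much; the general pattern looks something like this (a sketch using the undocumented sp_MSforeachtable in place of the SQLServerCentral cursor-based script):

-- Rebuild every index on every user table in the current database
EXEC sp_MSforeachtable N'PRINT ''Reindexing ?''; DBCC DBREINDEX(''?'');';

-- The same looping trick works for the per-table check mentioned below
EXEC sp_MSforeachtable N'PRINT ''Checking ?''; DBCC CHECKTABLE(''?'');';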

I then ran the DBCC CHECKTABLE using the same script as above to loop through all tables. It crunched through the tables happily for about 75% of the database, then it stopped with an error:

Msg 211, Level 23, State 51, Line 13

Possible schema corruption. Run DBCC CHECKCATALOG

Sweeeeeeeeeeeeeeeeeeeeet. An “Improvement”? No. I ran DBCC CHECKCATALOG as suggested. It completed successfully. THAT seemed promising. I tried rerunning

DBCC CHECKDB (testdb, repair_allow_data_loss) WITH NO_INFOMSGS;

Same results. One more time, with gusto – this time I noticed that I’d lost the SQL connection. I reconnected and retried, and it disconnected again before completing the command. The server had degraded to the point that there was clearly no hope of resuscitating it, so I logged off. A short while later, I received a message saying that they’d located the backup image of the dev box and were ready to rebuild it.

Although I spent several hours working on this, it was a great learning experience for me – most importantly, I learned that if DBCC CHECKDB can’t repair it, you’re likely best off restoring the database from backup.

Before you find yourself in a situation with a corrupt database, take a look at Paul Randal’s “CHECKDB from Every Angle: Emergency Mode Repair”, which is an update to the article mentioned above. I wish I’d found that one first, but one uses the tools that the search engines return. Better yet, simply read through the “Corruption” category on Paul Randal’s blog – the ever-growing set of articles he’s written about database corruption. I’m going to spend a lot of time reading through those posts, and you should, too. It’ll certainly make your life easier the next time someone asks you a “quick question” about how to fix a database in Emergency mode.
