During this time I also went through some troubleshooting processes. Here is one of that:
A particular server failed to respond. Some DBAs tried to restart but the service could not be re-started. Agent was trying to stop before SQL Server stops and that was on waiting. At this moment I got a chance to work with them in resolving the problems. I tried to check on SQL Server error log. I couldn’t even connect. I checked with task manager where it showed that SQL Server was using only a few MB of memory (less than 1 GB) but CPU usage was 70% - 80%
This is how I identified the error: I opened SQL Server Configuration Manager, and selected the SQL Server service. Opened the properties, switched to advanced tab and noticed the startup parameters. One parameter starts with –e option which states the path of error log. I opened the folder specified and opened the “errorlog” file (the file without an extension) using notepad. Even though the file is named as errorlog, the file contains most of the activities occurred on the server. I went right to the bottom and started reading from there. One constant error was like this: (I changed the path name and database name to protect the identity of the owners)
2009-10-09 10:47:14.31 spid52 Error: 823, Severity: 24, State: 2.
2009-10-09 10:47:14.31 spid52 The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x00000000374000 in file 'G:\MSSQL2K5\Data\TheDB_sysData.mdf'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
What this error means? It was SQL Server was unable to read particular page of the database. Does that mean that the page is corrupted? It could be either yes or no. Even a connection failure to the disk while SQL Server was trying to read the particular page may cause this error.
How will you resolve the problem?
My ideal approach would be to connect to SQL Server using dedicated administrator connection and remove the database either my placing the database in emergency mode. When things get settle run DBCC CHECKDB command against the database and resolve the disk issues.
A Couple of things prevented me from doing that. Firstly, during this process someone told me that the team responsible for the storage has upgraded the firmware of the disk subsystem. After that they have restarted the service and then ran into trouble. The next issue was I also saw similar errors against most of the databases. Thirdly, dedicated database connection was not enabled at that server.
So I declared emergency regulations and stopped the service using the crude method. Open using task manager and killed the process. I ran checkdisk against the disk to see whether it has produced any errors. As it was working fine, I restarted the service and things were smooth.
So what caused SQL Server say that it could not read the file? This is one of the common issues when you have large array of disks in your storage device. When you restart the storage device and SQL Server sand box, often SQL Server service gets started before the storage device is ready. At that time, SQL Server will continue to through the same error again and again even after the storage device is ready. The only option I have seen so far is to restart the service.