Jan 15, 2010

Lies, damn lies and statistics

I have been meaning to write something about statistics for a long time. In fact, I even knew what to write about: how reliable statistics are and how they affect query execution plans. I even had a heading in mind, the one you see now. But for some reason or other, I never got around to it.
Finally, here it is.

Statistics are an internal mechanism SQL Server uses to make decisions about query execution plans. For example, assume we have queries like these (I am using the AdventureWorks database that came with SQL Server 2005; it is available for download from Codeplex):


SELECT * FROM HumanResources.Employee WHERE ManagerID = 263
SELECT * FROM HumanResources.Employee WHERE ManagerID = 6

We have a non-clustered index on ManagerID. For queries like these there are two likely execution plans:


  1. Read the index that contains ManagerID, which can be done with an index seek operation. From each index entry, find the clustering key (all non-clustered indexes refer to the clustered index key) and then read the full rows using the clustered index. Joining the two requires a join mechanism; in this case, because the data is ordered, nested loops can be used. That means an index seek, a clustered index seek and a nested loops join are needed to return the data.
  2. Simply scan the clustered index and filter out the rows needed. This mechanism reads all of the data in the table.
You can see the detailed execution plans in the picture below.
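If you would rather see the plan shapes as text than as graphics, one way (a rough sketch; the textual output varies between versions) is to turn on SHOWPLAN_TEXT, which returns the estimated plans instead of executing the queries:

SET SHOWPLAN_TEXT ON
GO
-- These two are not executed now; only their estimated plans are returned
SELECT * FROM HumanResources.Employee WHERE ManagerID = 263
SELECT * FROM HumanResources.Employee WHERE ManagerID = 6
GO
SET SHOWPLAN_TEXT OFF
GO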


You may think that the first method is always faster. It is faster in some cases. The key is how many rows the given ManagerID refers to. As the number of rows increases, SQL Server may decide that the second method is cheaper than the first. (For more details, please refer to the article on the turning point, which I mentioned some time back in a blog post on this site. The author, a well-known SQL Server guru, explains in that article when the decision changes.) To make this decision SQL Server uses statistics. Statistics hold information on data distribution and column lengths, and based on this information SQL Server estimates the cost of an execution plan.
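If you want to look at what the optimizer actually sees, you can inspect the statistics attached to the index on ManagerID. A quick sketch (IX_Employee_ManagerID is what the index is called in my copy of AdventureWorks; adjust the name if yours differs):

-- Shows the header, density vector and histogram of the statistics
-- IX_Employee_ManagerID is assumed to be the name of the index on ManagerID
DBCC SHOW_STATISTICS ('HumanResources.Employee', IX_Employee_ManagerID)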
Statistics can be created automatically if we set the auto-create statistics option at the database level. The following statement turns the option on:
ALTER DATABASE AdventureWorks SET AUTO_CREATE_STATISTICS ON
Once the above setting is in place, SQL Server creates statistics automatically for columns as they are used in queries.
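If you are not sure whether the option is already on for a database, one way to check (a small sketch using the catalog views) is:

-- 1 means the option is on, 0 means it is off
SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases
WHERE name = 'AdventureWorks'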
We can also create statistics manually using the CREATE STATISTICS statement.
You may wonder why I chose that heading. That is what I am coming to.
CREATE STATISTICS can build the statistics from the entire data set or from part of it. Most of the time only part of the data (a sample) is used, because on large tables reading all of the data (the FULLSCAN option) may not be practical. There is a possibility that the sample does not match the data distribution of the entire table.
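As a rough sketch of both choices (the statistics names below are ones I made up for illustration):

-- Build statistics from a 10 percent sample of the rows
CREATE STATISTICS st_Employee_ManagerID_sample
ON HumanResources.Employee (ManagerID)
WITH SAMPLE 10 PERCENT

-- Build statistics by reading every row; more accurate but more expensive
CREATE STATISTICS st_Employee_ManagerID_full
ON HumanResources.Employee (ManagerID)
WITH FULLSCAN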
Additionally, statistics reflect the data distribution only at the time they were built. They are not updated as each row is inserted, updated or deleted. There are two ways to update statistics.




  1. Use an explicit UPDATE STATISTICS statement (see the sketch after this list).
  2. Turn on automatic updating with ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON. With this option, statistics are refreshed only when SQL Server decides that enough data modification has happened since the statistics were last created or updated, and that decision is based on the number of rows changed relative to the size of the table. That means that even if a table goes through extensive changes, if each statement touches only a few rows at a time, no individual statement will trigger an update, and until the number of changed rows becomes significant SQL Server will not consider the table to have changed much.
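A minimal sketch of both approaches, using the table from the earlier example:

-- Option 1: update the statistics on a table explicitly, reading every row
UPDATE STATISTICS HumanResources.Employee WITH FULLSCAN

-- Option 2: let SQL Server refresh statistics when it decides enough rows have changed
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON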
This is the main reason I used the heading. 

If the statistics are out of date, SQL Server may choose sub-optimal execution plans for queries. This can lead to larger locks and unnecessary operations such as page splits.
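One way to see how stale the statistics on a table are (a sketch using the catalog views; STATS_DATE returns when each statistics object was last updated):

-- When was each statistics object on the table last updated?
SELECT name, STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('HumanResources.Employee')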

How can we prevent this? The best way is to have a job that checks the tables and updates the statistics periodically.
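As a rough sketch of what such a job could run, sp_updatestats walks every table in the current database and updates the statistics it considers out of date:

USE AdventureWorks
GO
-- Update out-of-date statistics for every table in the database
EXEC sp_updatestats
GO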
