Adding a NOT NULL column to an existing table requires a default value to be inserted into every existing row. If the table is large, the operation can take considerable time and resources. Importantly, it requires an exclusive table lock. Before SQL Server 2005, this operation prevented other users from accessing the table (unless the NOLOCK query hint was specified or the session was in READ UNCOMMITTED mode) until all the data pages were updated with the default value.
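For context, this is the kind of statement being discussed; the table, column, and constraint names below are made up for illustration.

```sql
-- Hypothetical example: adding a NOT NULL column with a default value.
-- Before SQL Server 2012 this is a size-of-data operation: every existing
-- row is updated with the default while the table is locked.
ALTER TABLE dbo.Orders
ADD OrderStatus tinyint NOT NULL
    CONSTRAINT DF_Orders_OrderStatus DEFAULT (0);
```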
SQL Server 2005 introduced two (some call it one and a half!) new transaction isolation levels, which allow versions of the rows to be kept in tempdb and read by other users. Sessions using READ_COMMITTED_SNAPSHOT or SNAPSHOT isolation can read the data as if the new column does not exist. However, the cost of the operation is still high, and it can also cause page splits.
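For reference, these row-versioning options are enabled at the database level; a minimal sketch with a made-up database name:

```sql
-- Enable row versioning (database name is hypothetical).
ALTER DATABASE SalesDB SET READ_COMMITTED_SNAPSHOT ON;
ALTER DATABASE SalesDB SET ALLOW_SNAPSHOT_ISOLATION ON;

-- A reader can then opt in to snapshot isolation explicitly:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
```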
People generally prefer not to use the NOT NULL clause with the ADD COLUMN statement. Adding the column as nullable and then updating the table with the default value, chunk by chunk, is one of the mechanisms employed to mitigate the issue, as sketched below. This method reduces locking and increases the availability of the table. However, it does not prevent the page splits.
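A minimal sketch of that workaround, using the same hypothetical table as above:

```sql
-- 1. Add the column as nullable (a quick metadata change).
ALTER TABLE dbo.Orders ADD OrderStatus tinyint NULL;
GO

-- 2. Backfill the default value in small batches to keep locks short.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) dbo.Orders
    SET OrderStatus = 0
    WHERE OrderStatus IS NULL;

    IF @@ROWCOUNT = 0 BREAK;
END
GO

-- 3. Enforce NOT NULL and attach the default for future inserts.
ALTER TABLE dbo.Orders ALTER COLUMN OrderStatus tinyint NOT NULL;
ALTER TABLE dbo.Orders
    ADD CONSTRAINT DF_Orders_OrderStatus DEFAULT (0) FOR OrderStatus;
GO
```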
With the introduction of SQL Server 2012, we do not have to worry much about it.
The DEFAULT definition is now part of the metadata and does not affect the existing data pages; it affects only new pages. When a request is made to read existing data pages, SQL Server returns the default value from the metadata. (The only things SQL Server needs to check are the list of columns, the nullable property, and the existence of data in the data pages.)
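One way to see this is to compare the table's page count before and after the ALTER; on SQL Server 2012 it should not change, because no data pages are rewritten. A rough sketch, again with made-up object names:

```sql
-- Page count before the change.
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.Orders') AND index_id IN (0, 1);

-- The same ALTER that used to rewrite every row; now a metadata change.
ALTER TABLE dbo.Orders
ADD OrderStatus tinyint NOT NULL
    CONSTRAINT DF_Orders_OrderStatus DEFAULT (0);

-- Page count after the change: expected to be unchanged.
SELECT used_page_count, row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('dbo.Orders') AND index_id IN (0, 1);
```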
When I shared this with some of my friends, they asked me a ton of questions about how it is done.
- Does it check the metadata for each row when selected?
- What if the default value should be NULL?
- Does it keep any mechanism to check which rows were inserted before and which were inserted after the addition of the new column?
- What is the cost overhead on select queries?
- Yes, when we add a variable-length column (e.g. varchar) it may not store data for NULLs, but what about fixed-length columns?
First of all, the metadata is retrieved before the data pages are read. The metadata describes the data, and SQL Server uses it to interpret the user data. SQL Server always relied on metadata; now one more piece of information is added to it: the default value, if the column is not nullable.
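The default definition itself is visible in the catalog views; a small sketch (object names made up) that lists each column together with any default stored in the metadata:

```sql
SELECT c.name        AS column_name,
       c.is_nullable,
       dc.definition AS default_definition
FROM sys.columns AS c
LEFT JOIN sys.default_constraints AS dc
       ON dc.parent_object_id = c.object_id
      AND dc.parent_column_id = c.column_id
WHERE c.object_id = OBJECT_ID('dbo.Orders');
```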
On the second question: since this behavior applies only to columns with the NOT NULL property, that scenario cannot occur. (You cannot use NULL as the default value for a NOT NULL column.)
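To illustrate, a statement like the following, run against a table that already contains rows, is expected to fail because NULL cannot satisfy the NOT NULL constraint (made-up names again):

```sql
-- Expected to fail on a populated table: the existing rows would have to
-- be filled with NULL, which the NOT NULL constraint does not allow.
ALTER TABLE dbo.Orders
ADD OrderStatus tinyint NOT NULL
    CONSTRAINT DF_Orders_OrderStatus DEFAULT (NULL);
```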
SQL Server does not care when the column was added. It simply checks whether there is data for the column on the page. If there isn't, SQL Server further checks its metadata to see whether the column is not nullable and has a default value; if so, it returns the default value in the results. (All of these operations happen in memory.)
I ran some tests but could not see any noticeable performance degradation in select queries. (I ran different tests 1000 times each against tables ranging from 4,000 rows to 4 million rows. Duration and CPU vary from query to query due to different factors, but the reads remain the same for the same query.) In fact, in most cases select query performance improved, as there are fewer pages and no fragmentation. There was no noticeable difference in performance when the number of pages was the same.
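A rough sketch of how the reads can be compared (the full test scripts are the ones uploaded at the end of this post); the logical reads reported by STATISTICS IO for the same query should match before and after the column is added:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Any representative query against the hypothetical table.
SELECT COUNT(*) FROM dbo.Orders;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```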
Finally, the issue is not related to the data type. SQL Server has changed its mechanism of storing data, at least for this scenario.
I have uploaded the code here. Unfortunately, it is not well structured: I have used upper case, lower case, and mixed case without any consideration for formatting, and I didn't add comments to separate the different tests. But I believe it is self-explanatory. Please let me know if you have any issues with the code or difficulty understanding it.