dbthoughts: Page Splits

Introduction

I thought of writing a series about page splits. Initially I started writing about it and it slowly became a huge article. I was in two minds on whether I should write a series, of blogs or go for a big article. Finally, I have made up my mind for the first option and here is the first part of the five part series.

What is page Split?

In SQL Server data is stored into a page. A data page is 8 kb of size. It means when the rows are inserted they need to be inserted into a page. There is an exception when large blob data is inserted sql server can insert the data separately into another page and keep only the pointer (For more details refer: row overflow). Apart from that, a row should fit into the page and cannot split into multiple pages.

The decision on where the data should be inserted is decided by the clustered key. If the table does not have a clustered key, (which is called a "heap") the data inserted always be inserted at the end. On clustered tables the data is logically stored in clustered key order. (I'll discuss later on why the word "logically" is used here. But for the moment, it is safe to assume that the pages are in clustered key order) When a row need to be inserted or updated but it cannot fit into the balance space within the current page, The data is inserted into a new page and is called page split. There are two types of page splits occur in SQL Server. Firstly, the natural growth of data that periodically requires new pages, irrespective of whether it is a heap of clustered table. Secondly, a data insert or update operation happens within an existing page that the page needs to move some of the existing data in order to accommodate the inserted /updated data. Even though technically the first operation too is page split, as the content of the page dos not split in that type of operation. The second operation also called as mid-page split, which is a concern of DBAs.
There are primarily two reasons for mid-page split to occur:

Data inserts happen not in clustered key order
Update of variable length columns

Let's analyze each scenario:

Scenario 1: Inserts which cause mid page splits

Now let us assume we have a table which has a row size of 1300 bytes. (The example given below takes a few bytes more than 1300, but for this example, it does not make any change) It means a fully occupied page will have 6 rows. Let us assume that we have a table where a particular page has 5 rows having the clustered key values 1,2,3,5 and 6. Note that I purposefully skipped the row with clustered key value 4.

CREATE TABLE dbo.Note

(

NoteID int NOT NULL CONSTRAINT PK_Note PRIMARY KEY CLUSTERED (NoteID),

NoteText char(1300) NOT NULL

)

Insert into dbo.Note Values (1,REPLICATE('a', 1300))

Insert into dbo.Note Values (2,REPLICATE('b', 1300))

Insert into dbo.Note Values (3,REPLICATE('c', 1300))

-- The row with NoteID = 4 is missing here

Insert into dbo.Note Values (5,REPLICATE('e', 1300))

Insert into dbo.Note Values (6,REPLICATE('f', 1300))

As the data can fit into one page the total number of pages needed for data will be one.

EXEC sp_SpaceUsed 'dbo.Note'

Now a new row is inserted with the clustered key of 4. It should be inserted into the same page as the value 4 falls between 3 and 5.

Insert into dbo.Note Values (4,REPLICATE('5', 1300))

EXEC sp_SpaceUsed 'dbo.Note'

Since the page already has space within the page there is no need for page split. Even though the inserted order does not match the clustered key order (NoteID =5 and NoteID =6 were inserted before NoteID =4) they all fall into the same page without a page split. The reason behind this is the insert mechanism employed by SQL Server. During any DML query (select, insert, update or delete) all the required pages will be loaded into memory, not just the rows required. During inserts, SQL Server will identify that the data should be inserted into a particular page and load the page into memory and inserts are done into the page.

Now assume that another table has a row size of 1900 bytes.

CREATE TABLE dbo.Note2

(

NoteID int NOT NULL CONSTRAINT PK_Note2 PRIMARY KEY CLUSTERED (NoteID),

NoteText char(1900) NOT NULL

)

When the above code is executed, SQL Server will create the table definition no data page will be allocated until the first row is inserted.

Insert into dbo.Note2 Values (1,REPLICATE('a', 1900))

Insert into dbo.Note2 Values (2,REPLICATE('b', 1900))

Insert into dbo.Note2 Values (3,REPLICATE('c', 1900))

-- The row with Note2ID = 4 is missing here

Insert into dbo.Note2 Values (5,REPLICATE('e', 1900))

Insert into dbo.Note2 Values (6,REPLICATE('f', 1900))

EXEC sp_SpaceUsed 'dbo.Note2'

The first four rows will take slightly more than 7616 bytes and there is no room for the fifth row. But since the last row inserted does fall as the last entry in the clustered key order, it will get into a new page without making considerable changes into the previous page. Only change happens in the previous page is the pointer to next row is updated.

As mentioned before, page split occurs even when the data is inserted into a new page. However, this is inevitable and does not cause fragmentation or heavy writes.

As usual SQL Server will find the first free page and insert the first row (PK = 1). Since the table is small the free page will be allocated from a mixed extend SGAM (Shared Global Allocation map)

The subsequent three rows too will be inserted into the same page. As a page cannot be shared across multiple tables, any concurrent inserts to other tables will not make any difference here. Let us assume the data was inserted into Page ID 159.

When the row with PK = 6 was inserted it will get into another page, which will be after 159, unless another page before 159 became free. (Suppose you removed a large table or deleted large amount of data or rebuilt the clustered index of a table, this situation can occur) Let us assume that the new row is 165.

Now, let us insert another row into the table

Insert into dbo.Note2 Values (4,REPLICATE('5', 1900))

EXEC sp_SpaceUsed 'dbo.Note2'

Now the data needs to be inserted into middle of first page which is already full. (The word “Full” needs to be taken with a pinch of salt here. Technically the page has some free space left. But the free space is not enough to insert the new data.) It causes mid page split which also leads to fragmentation and heavy writes.

Let us look into how SQL server performs the operation of page split. It will create a new page and copy half of the rows from the existing page into the new page. As mentioned before, the new page usually get allocated after the last allocated page. Let us assume it as 166. Now both pages will be half empty, so that the new page will be inserted into the page it needs to be inserted. In this case the first page which had rows with PK 1, 2, 3 and 5 will split into 2. The first page will now have rows 1 and 2 and the new page will have rows 3 and 5. As the row with PK =4 will logically fit into the new page so that the new page will have rows 3, 4 and 5.

Scenario 2: Updates which cause mid page splits

The following code illustrates the point.

CREATE TABLE dbo.Note3

(

NoteID int NOT NULL CONSTRAINT PK_Note3 PRIMARY KEY CLUSTERED (NoteID),

NoteText varchar(2500) NOT NULL

)

Insert into dbo.Note3 Values (1,REPLICATE('a', 1900))

Insert into dbo.Note3 Values (2,REPLICATE('b', 1900))

Insert into dbo.Note3 Values (3,REPLICATE('c', 1900))

Insert into dbo.Note3 Values (4,REPLICATE('d', 1900))

Insert into dbo.Note3 Values (5,REPLICATE('e', 1900))

Insert into dbo.Note3 Values (6,REPLICATE('f', 1900))

As users may know, there is a difference between varchar and char data types. Char data type always hold the same amount of data on every row, irrespective of the amount of data sent in the INSERT statement. If the statement sends a smaller string, the balance will be padded with spaces. But varchar data types holds variable length strings, which means the amount of data sent will be saved and nothing more. In this example, even though the column can hold upto 4000 bytes, the column in each row will occupy only 1900 bytes. As seen before, the six rows can fit into 2 pages, where PK values from 1 to 4 will fit into the first page and 5 and 6 will fit into the second page. First page will occupy 7616+ bytes (with overhead) and the second page will occupy 3808+ bytes. Now let us issue an update statement:

Update dbo.Note3 Set NoteText = REPLICATE('a', 2500) WHERE NoteID = 1

With this statement, the row updated resides in the first page and the page does not have enough space, page split will occur. Here, even though the maximum amount of data the page holds is around 2504 bytes and even if the other rows are updated to hold the maximum amount of data, the first page can hold 3 rows, page split will not consider it. The data will be divided into half and the second half will be copied to new page.

These are the possible scenarios for page splits. As it is explained in detail how a page split occurs, let us move to the next section.
References
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2007/06/28/what-is-a-page-split-what-happens-why-does-it-happen-why-worry.aspx
This article was written around 5 years ago, but it gives a complete understanding of page splits.

-- To be Continued--

dbthoughts

Feb 19, 2012

Page Splits

No comments:

Post a Comment