Sep 14, 2009

NOT IN, NOT EXISTS and LEFT JOIN ... What To Use?

This blog is to analyze the differences between the above statements and to find what suits where better.
Lets consider this scenario: (This example is taken from AdventureWorks) Find all customers who do not have Sales. It could be written in three ways:


SET ANSI_NULLS ON
-- Method 1: Using NOT IN
SELECT CustomerID, AccountNumber
FROM Sales.Customer
WHERE CustomerID NOT IN
(
SELECT CustomerID
FROM Sales.SalesOrderHeader
)

-- Method 2: Using NOT EXISTS
SELECT CustomerID, AccountNumber
FROM Sales.Customer C
WHERE NOT EXISTS
(
SELECT 1
FROM Sales.SalesOrderHeader S
WHERE S.CustomerID = C.CustomerID
)

-- MEthod 3: Using LEFT JOIN
SELECT c.CustomerID, c.AccountNumber
FROM Sales.Customer c
LEFT JOIN Sales.SalesOrderHeader soh
ON soh.CustomerID = c.CustomerID
WHERE soh.CustomerID IS NULL


All the above queries give the same results, show the same execution plan, take the same amount of resources in IO and executed almost at the same time. However, this example is quite simple. The column CustomerID is not having any null values. When it comes with null values, the results may vary:


Consider this example:


SET ANSI_NULLS ON
-- Method 1: Using NOT IN
SELECT SalesPersonID, SalesOrderID
FROM Sales.SalesOrderHeader
WHERE SalesPersonID NOT IN
(
SELECT SalesPersonID
FROM Sales.Store
)

-- Method 2: Using NOT EXISTS
SELECT SalesPersonID, SalesOrderID
FROM Sales.SalesOrderHeader soh
WHERE NOT EXISTS
(
SELECT 1
FROM Sales.Store st
WHERE soh.SalesPersonID = st.SalesPersonID
)

-- Method 3: Using LEFT JOIN
SELECT soh.SalesPersonID, soh.SalesOrderID
FROM Sales.SalesOrderHeader soh
LEFT JOIN Sales.Store st
ON soh.SalesPersonID = st.SalesPersonID
WHERE st.SalesPersonID IS NULL


The results are different, even though they refer the same data. The first query gives 243 rows where the rest give 27092 rows.
Now if we change the first line to SET ANSI_NULLS OFF, we will get the same results for all queries.

This shows how ansi settings can change the results.

Now let us assume that we keep ANSI_NULLS OFF in our environment. What is the best method to use?

Let us first check the execution plans for both ANSI settings: As ANSI_NULLS OFF gives the same results, we will check that first





This clearlty shows the third query is completely out of the way in performing the task. But the first two queries give the same execution plan.
We can add SET STATISTICS IO ON to check the IO cost. The first two queries give the same IO cost.

When we check for ANSI_NULLS ON, the execution plan changes drastically. Now the first query, eventhough it does not return the null values, it causes huge percentage of the total cost.


If we execute SET STATISTICS_IO ON before the query, we can identify the IO cost.

(27902 row(s) affected)
Table 'SalesOrderHeader'. Scan count 1, logical reads 57, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Store'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

(27902 row(s) affected)
Table 'SalesOrderHeader'. Scan count 1, logical reads 57, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Store'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

(27902 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 1, logical reads 57, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Store'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

This results show that the second query ( IF NOT EXISTS) is the best way from performance point of view
Now what is the best way of writing?
The question depends on the style you prefer and the ANSI settign you need. However, LEFT JOIN is the most unrecommended method of writing.

I prefer IF NOT EXISTS, as it gives the best choice always. We will not loose data, irrespective of ANSI setting.

No comments:

Post a Comment