Using “NOT IN” in a query can provide mixed results
The topic came up at work awhile back with using various includes/excludes such as IN, NOT IN, and EXISTS. A co-worker was working on an ETL process when he discovered that using “NOT IN” in a sub query with a data set that contained NULL values was giving him incorrect data. Lets take a look at the behavior of using “NOT IN”.
Using AdventureWorks lets run the following query.
SELECT * FROM Person.Contact WHERE FirstName = ‘Virginia’
This query will return 24 results. Notice that many of rows of the “MiddleName” column has NULL values. Now lets run the follow query to only return values where the MiddleName has certain values.
SELECT * FROM Person.Contact WHERE FirstName = ‘Virginia’
and MiddleName in (‘A’, ‘C’, ‘E’, ‘J’)
This query returns 7 rows. So you would think if there are 24 rows with FirstName = ‘Virginia’ and only 7 where MiddleName in (‘A’, ‘C’, ‘E’, ‘J’) that there should be 17 rows where MiddleName NOT IN (‘A’, ‘C’, ‘E’, ‘J’) right? Lets check. Lets run the following query.
SELECT * FROM Person.Contact WHERE FirstName = ‘Virginia’
and MiddleName NOT IN (‘A’, ‘C’, ‘E’, ‘J’)
WHAT we only have 4 results. Where are all MiddleNames that have NULL as their value?
When ANSI_NULLS is on MiddleName <> NULL is UNKNOWN, so the predicate evaluates to UNKNOWN so no results are returned for those items. If we run this query setting ANSI_NULLS OFF then we get the results we expect.
SET ANSI_NULLS OFF
SELECT * FROM Person.Contact WHERE FirstName = ‘Virginia’
and MiddleName NOT IN (‘A’, ‘C’, ‘E’, ‘J’)
Or you can manually include NULLS in your query.
SELECT * FROM Person.Contact WHERE FirstName = ‘Virginia’
and (middlename NOT IN (‘A’, ‘C’, ‘E’, ‘J’) OR MiddleName IS NULL)
I am sure there are other options and ways to get the proper result set but my intention with this blog was to raise awareness and explain why this behaviour happens.
One Comment
Thanks for sharing. Always good to find a real expert.