Wednesday, 17 March 2010

Delete duplicate rows using CTE and ROW_NUMBER in SQL 2005

Ever had duplicate rows in your table because you forgot to use a 'Unique' contraint? Finding it hard to delete the duplicate rows as not all of the columns in the rows are the same? Well you can use two of SQL Server 2005's new features to overcome this problem.

I'm going to create a table with duplicate rows to demonstrate this problem:

CREATE TABLE CustomerEmails
(
ID int IDENTITY (1, 1),
CustomerID int,
Email varchar(100),
CreatedDate smalldatetime
)

INSERT INTO CustomerEmails
SELECT 135, 'apple@spamme.com', '2010-03-31' UNION ALL
SELECT 267, 'orange@spamme.com','2010-03-21' UNION ALL
SELECT 222, 'pear@spamme.com', '2010-03-11' UNION ALL
SELECT 333, 'kiwi@spamme.com', '2010-03-01' UNION ALL
SELECT 333, 'lemon@spamme.com', '2010-02-28' UNION ALL
SELECT 333, 'lime@spamme.com', '2010-02-21' UNION ALL
SELECT 222, 'grape@spamme.com', '2010-02-11' UNION ALL
SELECT 492, 'banana@spamme.com','2010-02-01';

So customers 222 and 333 are in there multiple times and if I do a lookup to find their email I get multiple results. Someone has been inserting new rows without trying to update existing rows first... the lunatic!

I remember a DBA faffing with this problem for ages. After realising you can't DISTINCT a single column, trying to DELETE WHERE and ID exists IN a GROUP SELECT statement HAVING COUNT(*) > 1 except for the TOP row with an ORDER BY on the date DESC... *yawn*

Well, here is the way for cool people to do it: You create a CTE using a column for a ROW_NUMBER() to count the number of duplicates for each customer:

WITH Duplicates AS
(
SELECT
CustomerID,
CreatedDate,
Email,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreatedDate DESC) as cnt
FROM
CustomerEmails
)

The ROW_NUMBER() has a partition on CustomerID which means the count will start again for each customer. I have oredered the count by CreatedDate DESC too which means the row with a count of 1 will be the newest.
Now you simply DELETE from the CTE where the count is higher than 1, removing all the duplicates:

DELETE FROM
Duplicates
WHERE
cnt > 1;

Have a look:

SELECT * FROM CustomerEmails;

Great! You no longer have any duplicates! Now quickly whack a UNIQUE contraint on that column before that lunatic developer starts messing with your data again.

ALTER TABLE CustomerEmails ADD UNIQUE (CustomerID);

8 comments:

Ranjan Nishant11 June 2010 at 10:46
Thanks for giving a such a good way to delete the duplicate records....

But what about if we have large amount of records like 8 Lakh records and how much cost this query is taking.

Thanks,
Nishant
ReplyDelete
Replies
Colin Farr12 June 2010 at 09:38
It terms of performance if you are using SQL 2005 or later: ALWAYS ALWAYS ALWAYS use CTE with row_number for this scenario.
ReplyDelete
Replies
Anonymous17 September 2010 at 13:25
Hi, Is 'Duplicates' a CTE or is it just an alias? The reason i ask is because i thought deleting from a CTE would just delete the records in that temporary record set and not the source table (CustomerEmails)?
I am a beginner so you tell me!

Cheers
ReplyDelete
Replies
Colin Farr17 September 2010 at 14:35
Duplicates is the name of the CTE. I know it is probably surprising but deleting from a CTE will also be deleting from its original table.
ReplyDelete
Replies
Anonymous10 November 2010 at 12:18
I'm going into SQL view in Access 2007.
It gives the following error: "Invalid SQL statement; expected 'DELETE','INSERT','PROCEDURE','SELECT',or 'UPDATE'"

I think the problem has something to do with the 'WITH' statement,but I don't know how to fix it.
Does anyone know how to fix this?
ReplyDelete
Replies
Colin Farr10 November 2010 at 23:54
CTEs are designed for SQL Server 2005+ I'd assume you can't use this in an Access Database
ReplyDelete
Replies
Anonymous29 October 2012 at 07:14
Order by Clause is really required?
ReplyDelete
Replies

Add comment

British Developer

Wednesday, 17 March 2010

Delete duplicate rows using CTE and ROW_NUMBER in SQL 2005

8 comments:

Colin Farr

Readers of British Developer

Search

Blog Archive

Labels

My Blog List

British Developer

Wednesday, 17 March 2010

Delete duplicate rows using CTE and ROW_NUMBER in SQL 2005

8 comments:

Colin Farr

Readers of British Developer

Search

Blog Archive

Labels

My Blog List

Subscribe To