find a location for property in a new city

Wednesday 17 March 2010

Delete duplicate rows using CTE and ROW_NUMBER in SQL 2005

Ever had duplicate rows in your table because you forgot to use a 'Unique' contraint? Finding it hard to delete the duplicate rows as not all of the columns in the rows are the same? Well you can use two of SQL Server 2005's new features to overcome this problem.

I'm going to create a table with duplicate rows to demonstrate this problem:
CREATE TABLE CustomerEmails
(
  ID int IDENTITY (1, 1),
  CustomerID int,
  Email varchar(100),
  CreatedDate smalldatetime
)

INSERT INTO CustomerEmails
SELECT 135, 'apple@spamme.com', '2010-03-31' UNION ALL
SELECT 267, 'orange@spamme.com','2010-03-21' UNION ALL
SELECT 222, 'pear@spamme.com''2010-03-11' UNION ALL
SELECT 333, 'kiwi@spamme.com''2010-03-01' UNION ALL
SELECT 333, 'lemon@spamme.com', '2010-02-28' UNION ALL
SELECT 333, 'lime@spamme.com''2010-02-21' UNION ALL
SELECT 222, 'grape@spamme.com', '2010-02-11' UNION ALL
SELECT 492, 'banana@spamme.com','2010-02-01';

So customers 222 and 333 are in there multiple times and if I do a lookup to find their email I get multiple results. Someone has been inserting new rows without trying to update existing rows first... the lunatic!

I remember a DBA faffing with this problem for ages. After realising you can't DISTINCT a single column, trying to DELETE WHERE and ID exists IN a GROUP SELECT statement HAVING COUNT(*) > 1 except for the TOP row with an ORDER BY on the date DESC... *yawn*

Well, here is the way for cool people to do it: You create a CTE using a column for a ROW_NUMBER() to count the number of duplicates for each customer:
WITH Duplicates AS
(
  SELECT
    CustomerID,
    CreatedDate,
    Email,
    ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreatedDate DESC) as cnt
  FROM
    CustomerEmails
)

The ROW_NUMBER() has a partition on CustomerID which means the count will start again for each customer. I have oredered the count by CreatedDate DESC too which means the row with a count of 1 will be the newest.
Now you simply DELETE from the CTE where the count is higher than 1, removing all the duplicates:
DELETE FROM
  Duplicates
WHERE
  cnt > 1;

Have a look:
SELECT * FROM CustomerEmails;

Great! You no longer have any duplicates! Now quickly whack a UNIQUE contraint on that column before that lunatic developer starts messing with your data again.
ALTER TABLE CustomerEmails ADD UNIQUE (CustomerID);

Follow britishdev on Twitter

8 comments:

  1. Thanks for giving a such a good way to delete the duplicate records....

    But what about if we have large amount of records like 8 Lakh records and how much cost this query is taking.

    Thanks,
    Nishant

    ReplyDelete
  2. It terms of performance if you are using SQL 2005 or later: ALWAYS ALWAYS ALWAYS use CTE with row_number for this scenario.

    ReplyDelete
  3. Hi, Is 'Duplicates' a CTE or is it just an alias? The reason i ask is because i thought deleting from a CTE would just delete the records in that temporary record set and not the source table (CustomerEmails)?
    I am a beginner so you tell me!

    Cheers

    ReplyDelete
  4. Duplicates is the name of the CTE. I know it is probably surprising but deleting from a CTE will also be deleting from its original table.

    ReplyDelete
  5. I'm going into SQL view in Access 2007.
    It gives the following error: "Invalid SQL statement; expected 'DELETE','INSERT','PROCEDURE','SELECT',or 'UPDATE'"

    I think the problem has something to do with the 'WITH' statement,but I don't know how to fix it.
    Does anyone know how to fix this?

    ReplyDelete
  6. CTEs are designed for SQL Server 2005+ I'd assume you can't use this in an Access Database

    ReplyDelete
  7. Order by Clause is really required?

    ReplyDelete
    Replies
    1. I have verified and Rank function must have Order by clause

      Delete