I'm going to create a table with duplicate rows to demonstrate this problem:
CREATE TABLE CustomerEmails
(
ID int IDENTITY (1, 1),
CustomerID int,
Email varchar(100),
CreatedDate smalldatetime
)
INSERT INTO CustomerEmails
SELECT 135, 'apple@spamme.com', '2010-03-31' UNION ALL
SELECT 267, 'orange@spamme.com','2010-03-21' UNION ALL
SELECT 222, 'pear@spamme.com', '2010-03-11' UNION ALL
SELECT 333, 'kiwi@spamme.com', '2010-03-01' UNION ALL
SELECT 333, 'lemon@spamme.com', '2010-02-28' UNION ALL
SELECT 333, 'lime@spamme.com', '2010-02-21' UNION ALL
SELECT 222, 'grape@spamme.com', '2010-02-11' UNION ALL
SELECT 492, 'banana@spamme.com','2010-02-01';
(
ID int IDENTITY (1, 1),
CustomerID int,
Email varchar(100),
CreatedDate smalldatetime
)
INSERT INTO CustomerEmails
SELECT 135, 'apple@spamme.com', '2010-03-31' UNION ALL
SELECT 267, 'orange@spamme.com','2010-03-21' UNION ALL
SELECT 222, 'pear@spamme.com', '2010-03-11' UNION ALL
SELECT 333, 'kiwi@spamme.com', '2010-03-01' UNION ALL
SELECT 333, 'lemon@spamme.com', '2010-02-28' UNION ALL
SELECT 333, 'lime@spamme.com', '2010-02-21' UNION ALL
SELECT 222, 'grape@spamme.com', '2010-02-11' UNION ALL
SELECT 492, 'banana@spamme.com','2010-02-01';
So customers 222 and 333 are in there multiple times and if I do a lookup to find their email I get multiple results. Someone has been inserting new rows without trying to update existing rows first... the lunatic!
I remember a DBA faffing with this problem for ages. After realising you can't DISTINCT a single column, trying to DELETE WHERE and ID exists IN a GROUP SELECT statement HAVING COUNT(*) > 1 except for the TOP row with an ORDER BY on the date DESC... *yawn*
Well, here is the way for cool people to do it: You create a CTE using a column for a ROW_NUMBER() to count the number of duplicates for each customer:
WITH Duplicates AS
(
SELECT
CustomerID,
CreatedDate,
Email,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreatedDate DESC) as cnt
FROM
CustomerEmails
)
(
SELECT
CustomerID,
CreatedDate,
Email,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreatedDate DESC) as cnt
FROM
CustomerEmails
)
The ROW_NUMBER() has a partition on CustomerID which means the count will start again for each customer. I have oredered the count by CreatedDate DESC too which means the row with a count of 1 will be the newest.
Now you simply DELETE from the CTE where the count is higher than 1, removing all the duplicates:
DELETE FROM
Duplicates
WHERE
cnt > 1;
Duplicates
WHERE
cnt > 1;
Have a look:
SELECT * FROM CustomerEmails;
Great! You no longer have any duplicates! Now quickly whack a UNIQUE contraint on that column before that lunatic developer starts messing with your data again.
ALTER TABLE CustomerEmails ADD UNIQUE (CustomerID);
Thanks for giving a such a good way to delete the duplicate records....
ReplyDeleteBut what about if we have large amount of records like 8 Lakh records and how much cost this query is taking.
Thanks,
Nishant
It terms of performance if you are using SQL 2005 or later: ALWAYS ALWAYS ALWAYS use CTE with row_number for this scenario.
ReplyDeleteHi, Is 'Duplicates' a CTE or is it just an alias? The reason i ask is because i thought deleting from a CTE would just delete the records in that temporary record set and not the source table (CustomerEmails)?
ReplyDeleteI am a beginner so you tell me!
Cheers
Duplicates is the name of the CTE. I know it is probably surprising but deleting from a CTE will also be deleting from its original table.
ReplyDeleteI'm going into SQL view in Access 2007.
ReplyDeleteIt gives the following error: "Invalid SQL statement; expected 'DELETE','INSERT','PROCEDURE','SELECT',or 'UPDATE'"
I think the problem has something to do with the 'WITH' statement,but I don't know how to fix it.
Does anyone know how to fix this?
CTEs are designed for SQL Server 2005+ I'd assume you can't use this in an Access Database
ReplyDeleteOrder by Clause is really required?
ReplyDeleteI have verified and Rank function must have Order by clause
Delete