Removing Duplicate Entries in Mysql for large dataset

Simple dummy table structure:
CREATE TABLE RssEntry (
rowID INT NOT NULL AUTO_INCREMENT
, rssID INT NOT NULL
, url VARCHAR(254) NOT NULL
, title TEXT
, description TEXT
, PRIMARY KEY (rowID)
, INDEX (rssID)
);

When doing work to remove duplicate entries from a table with a significant number of rows, adding
a temporary, non-unique index on the columns in question can often speed up operations as you go about
removing duplicate entries.

Adding a Non-Unique Index to Speed Up Queries
CREATE INDEX UrlRow ON RssEntry (Url, rowID);

Determining How Many Duplicate URLs Exist in the Data Set:
SELECT COUNT(*), COUNT(*) – COUNT(DISTINCT url) FROM RssEntry;

When removing duplicate entries from a table, first determine which rows having duplicate keys you
wish to keep in the table. For instance, if you are removing a duplicate customer record, will you take the
oldest or newest record? Or will you need to merge the two records? Be sure you have a game plan for what
to do with the redundant data records.

Using GROUP BY to Get Unique URL Records
SELECT MAX(rowID) AS rowID, url FROM RssEntry GROUP BY Url;

Creating a New Table with the Unique Records
CREATE TABLE RssEntry2 (
rowID INT NOT NULL AUTO_INCREMENT
rssID INT NOT NULL
title VARCHAR(255) NOT NULL
url VARCHAR(255) NOT NULL
description TEXT
PRIMARY KEY (rowID)
UNIQUE INDEX Url (url));

Now Insert the data to the new table and delete the old table:
INSERT INTO RssEntry2
SELECT * FROM RssEntry
INNER JOIN (
SELECT MAX(rowID) AS rowID, url
FROM RssEntry
GROUP BY url
) AS uniques
ON RssEntry.rowID = uniques.rowID;

ALTER TABLE RssEntry RENAME TO RssEntry_old;

ALTER TABLE RssEntry2 RENAME TO RssEntry;

These Example was collected from Pro Mysql by Michael Kruckenberg and Jay Pipes…..

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s