Wednesday 14 September 2011

MYSQL dedupe data

If you want to select data from one table, ensuring that it does not (de-duplicate) match another table, I find the following useful.

Code:   
SELECT * FROM data_2
WHERE data_2.email
NOT IN (
SELECT data_1.email
FROM data_1
WHERE data_1.email=data_2.email
);
   
If you have a flat-file, .csv for example and you want to cleanse of duplicate records, you can upload it to a temporary table, setting the field you want to ignore duplicates from, as a UNIQUE data field.

Code:
CREATE TABLE 'example' (
'id' BIGINT(20) NOT NULL AUTO_INCREMENT,
'email' VARCHAR(255) NOT NULL,
'forename' VARCHAR(30) NULL DEFAULT NULL,
'surname' VARCHAR(30) NULL DEFAULT NULL,
PRIMARY KEY ('id'),
INDEX 'Index 2' ('email'),
UNIQUE INDEX 'Index 3' ('email')
)
COLLATE='utf8_general_ci'
ENGINE='InnoDB'
AUTO_INCREMENT=1;


The above example will check and remove any duplicate email addresses it finds during import of the file, thus de-duping the data.


No comments: