How to Find and Remove Duplicate Data in MySQL: Complete Guide with SQL Examples

1. Introduction

When operating a database, it is not uncommon to encounter issues such as “duplicate records being inserted” or “data that should be unique appearing multiple times.” In environments where relational databases like MySQL are used, extracting and managing duplicate data is an essential task to maintain data accuracy and quality.

For example, in core business tables such as member information, product data, and order history, duplicate records may be inserted due to user mistakes or system errors. If left unaddressed, this can reduce the accuracy of aggregation and analysis, and may also lead to unexpected bugs or operational issues.

To resolve this “duplicate data problem,” you must first identify which records are duplicated, and then organize or remove those duplicate records depending on the situation. However, using only a standard SELECT statement in MySQL is often not sufficient to efficiently detect duplicates. Slightly advanced SQL techniques and practical approaches are required.

In this article, we focus on “How to Extract Duplicate Data in MySQL”, covering everything from basic SQL statements to practical applications, performance considerations, and common error handling. Whether you are a database beginner or an engineer who writes SQL daily, this guide aims to provide practical and field‑oriented knowledge.

2. Basics: Detecting Duplicates Using a Key Column

The most basic way to extract duplicate data in MySQL is to identify cases where “multiple records share the same value in a specific column (key column).” In this section, we explain representative SQL queries used to detect duplicate key values and how they work.

2-1. Detecting Duplicates with GROUP BY and HAVING

The fundamental technique for duplicate detection is grouping records by a specific column using the GROUP BY clause, then filtering groups that contain two or more records using the HAVING clause. Here is a typical example:

SELECT key_column, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1;

Example: Extracting Duplicate Member Email Addresses

SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

When this query is executed, if the same email address has been registered multiple times, the email address and the number of duplicates (count) will be displayed in the results.

2-2. Detecting Duplicates Across Multiple Columns

If you need to detect duplicates based on a combination of two or more columns, you can specify multiple columns in the GROUP BY clause using the same logic.

SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;

Using this method, you can detect duplicates where multiple conditions fully match, such as “same full name and birth date” or “same product ID and order date.”

2-3. Calculating the Total Number of Duplicate Records

If you want to understand the overall scale of duplication, you can use a subquery to calculate the total number of duplicate entries.

SELECT SUM(duplicate_count) AS total_duplicates
FROM (
  SELECT COUNT(*) AS duplicate_count
  FROM table_name
  GROUP BY key_column
  HAVING COUNT(*) > 1
) AS duplicates;

This query sums the number of duplicate entries across all duplicate groups.

By combining GROUP BY and HAVING, you can extract duplicate data in MySQL in a simple and efficient manner.

3. Extracting All Records That Share Duplicate Keys

In the previous section, we introduced how to list only the “duplicate key values.” However, in real‑world work, you often need to confirm “which exact records are duplicated, and inspect all their details.” For example, you may want to review the full duplicated user profiles or inspect duplicated product data row by row.

In this section, we explain practical SQL patterns to extract all records that share duplicate keys.

3-1. Extracting Duplicate Records Using a Subquery

The most basic approach is to retrieve the list of duplicate key values in a subquery, then fetch all records that match those keys.

SELECT *
FROM table_name
WHERE key_column IN (
  SELECT key_column
  FROM table_name
  GROUP BY key_column
  HAVING COUNT(*) > 1
);

Example: Extracting All Records with Duplicate Email Addresses

SELECT *
FROM users
WHERE email IN (
  SELECT email
  FROM users
  GROUP BY email
  HAVING COUNT(*) > 1
);

When you run this query, it extracts all rows in the “users” table where the email address is duplicated (including columns such as ID, registration date, etc.).

3-2. Efficient Extraction Using EXISTS

If you need to handle large datasets or care about performance, using EXISTS can also be effective. IN and EXISTS are similar, but depending on data volume and indexing, one may be faster than the other.

SELECT *
FROM table_name t1
WHERE EXISTS (
  SELECT 1
  FROM table_name t2
  WHERE t2.key_column = t1.key_column
    AND t2.id <> t1.id -- assumes a primary key (id) that tells the two rows apart
);

Here a row qualifies as soon as one other row with the same key value is found, which lets MySQL stop scanning the inner table early.

Example: Duplicate Email Records (Using EXISTS)

SELECT *
FROM users u1
WHERE EXISTS (
  SELECT 1
  FROM users u2
  WHERE u2.email = u1.email
    AND u2.id <> u1.id
);

3-3. Notes and Performance Considerations

  • Subquery performance can be significantly affected when the dataset is large. With proper indexing, both IN and EXISTS can perform at a practical level.
  • However, if you need complex conditions or want to determine duplicates across multiple columns, queries can become heavy. Always validate behavior in a test environment first.

In this way, extracting all records matching duplicate keys can be achieved using subqueries or the EXISTS clause.

4. Detecting Duplicates Across Multiple Columns

Duplicate detection conditions aren’t always based on a single column. In practice, it’s common to require uniqueness across a combination of multiple columns. For example, you may consider records duplicates when “full name + birth date” match, or when “product ID + color + size” are all identical.

In this section, we explain in detail how to extract duplicates using multiple columns.

4-1. Detecting Duplicates with GROUP BY Using Multiple Columns

To detect duplicates across multiple columns, list the columns separated by commas in the GROUP BY clause. With HAVING COUNT(*) > 1, you can extract only combinations that appear two or more times.

SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;

Example: Detecting Duplicates by “first_name” and “birthday”

SELECT first_name, birthday, COUNT(*) AS count
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1;

This query helps you identify cases where the combination of “same name” and “same birth date” has been registered multiple times.

4-2. Extracting All Records for Duplicate Multi-Column Keys

If you need all record details for duplicate key combinations, you can extract duplicate pairs in a subquery and then fetch all rows that match those pairs.

SELECT *
FROM table_name t1
WHERE (col1, col2) IN (
  SELECT col1, col2
  FROM table_name
  GROUP BY col1, col2
  HAVING COUNT(*) > 1
);

Example: Full Records for Duplicates in “first_name” and “birthday”

SELECT *
FROM users u1
WHERE (first_name, birthday) IN (
  SELECT first_name, birthday
  FROM users
  GROUP BY first_name, birthday
  HAVING COUNT(*) > 1
);

Using this query, for example, if the combination “Taro Tanaka / 1990-01-01” is registered multiple times, you can retrieve all related detailed rows.

4-3. Detecting Exact Duplicates (COUNT DISTINCT)

If you want to estimate “how many rows are exact duplicates across multiple columns,” you can also use aggregation with COUNT(DISTINCT ...).

SELECT COUNT(*) - COUNT(DISTINCT col1, col2) AS duplicate_count
FROM table_name;

This SQL provides an approximate count of fully duplicated rows within the table. Note that COUNT(DISTINCT ...) ignores combinations that contain NULL, so the figure can be skewed when the key columns allow NULL values.

4-4. Notes

  • Even for multi‑column duplicate detection, proper indexing can significantly improve query speed.
  • If many columns are involved, queries become harder to index. Also note that GROUP BY treats NULL values as equal, so rows sharing a NULL in a key column are grouped together and can appear as unexpected duplicates. Design your conditions carefully, for example by filtering NULLs out as shown below.
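
For example, a minimal way to exclude rows with missing key values from the duplicate check (a sketch assuming the users table used in earlier examples) is:

SELECT first_name, birthday, COUNT(*) AS duplicate_count
FROM users
WHERE first_name IS NOT NULL
  AND birthday IS NOT NULL
GROUP BY first_name, birthday
HAVING COUNT(*) > 1;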

In this way, detecting and extracting duplicates across multiple columns can be handled flexibly with well‑designed SQL.

5. Removing Duplicate Records (DELETE)

Once you can extract duplicate data, the next step is deleting unnecessary duplicates. In practice, a common approach is to keep only one record among duplicates and delete the rest. However, when deleting duplicates automatically in MySQL, you must narrow down the deletion target carefully to avoid unintended data loss.

In this section, we explain common safe methods for deleting duplicate data and key precautions.

5-1. Deleting Duplicates with a Subquery + DELETE

If you want to keep only the “oldest” or “newest” record and delete the others, a DELETE statement with a subquery can be useful.

Example: Keep the smallest (oldest) ID record and delete the others

DELETE FROM users
WHERE id NOT IN (
  SELECT MIN(id)
  FROM users
  GROUP BY email
);

This query is meant to keep only the smallest id (the first registered record) for each email and delete all other rows sharing that email. Note, however, that MySQL rejects this exact form with Error 1093, because the DELETE references the same table in its subquery; the next subsection shows the standard workaround.
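
Before running any DELETE, it is also good practice to preview the targets by rewriting it as a SELECT, which is not subject to Error 1093:

SELECT *
FROM users
WHERE id NOT IN (
  SELECT MIN(id)
  FROM users
  GROUP BY email
);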

5-2. How to Avoid the MySQL-Specific Error 1093

In MySQL, you may encounter Error 1093 when you try to DELETE from a table while also referencing the same table in a subquery. In that case, you can avoid the error by wrapping the subquery result as a derived table (temporary result set).

DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id)
    FROM users
    GROUP BY email
  ) AS temp_ids
);

By wrapping the subquery with SELECT * FROM (...) AS alias, you can prevent the error and delete safely.

5-3. Deleting Duplicates for Multi‑Column Keys

If you want to delete duplicates based on a combination of multiple columns, use GROUP BY with multiple columns and delete everything except the representative record.

Example: For duplicates by “first_name” and “birthday,” delete all but the first record

DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id)
    FROM users
    GROUP BY first_name, birthday
  ) AS temp_ids
);

5-4. Safety Measures and Best Practices for Deletion

Deleting duplicates is a high‑risk operation that can permanently remove data. Make sure to follow these best practices:

  • Take backups: Always save a backup of the entire table or the target records before deleting.
  • Use transactions: If possible, wrap the operation in a transaction so you can roll back immediately if something goes wrong (see the sketch after this list).
  • Confirm counts with SELECT first: Build the habit of verifying “Is the deletion target correct?” by running a SELECT query first.
  • Check indexes: Adding indexes to columns used for duplicate detection improves both performance and accuracy.
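
A minimal sketch of the transactional approach, assuming an InnoDB users table like the one used throughout this article:

START TRANSACTION;

DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id)
    FROM users
    GROUP BY email
  ) AS temp_ids
);

-- Sanity check: no email should appear more than once now
SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1;

COMMIT; -- or ROLLBACK; if the check above still returns rows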

In MySQL, you can safely delete duplicate data by using subqueries and derived tables. Always proceed carefully, with sufficient testing and a solid backup strategy.

6. Performance Considerations and Index Strategy

When extracting or deleting duplicate data in MySQL, query execution time and server load become more problematic as the table grows. Especially in large‑scale systems or batch jobs, performance‑aware SQL design and index optimization are essential. In this section, we explain tips for improving performance and key points for index design in duplicate data processing.

6-1. Choosing Between EXISTS, IN, and JOIN

SQL constructs such as IN, EXISTS, and JOIN are commonly used for extracting duplicate data, but each has different characteristics and performance tendencies.

  • IN – Fast when the subquery result set is small, but performance tends to degrade as the result set grows.
  • EXISTS – Stops searching as soon as a matching record is found, so it is often effective for large tables or when matches are relatively rare.
  • JOIN – Useful for retrieving many pieces of information at once, but it can become slower if you join unnecessary data or lack proper indexing.

Example Performance Comparison

Syntax | Small Data | Large Data         | Comment
IN     | Fast       | Slower             | Slow when the result set is large
EXISTS | Fast       | Fast               | Advantageous for large databases
JOIN   | Fast       | Depends on indexes | Proper indexes required

It is important to choose the optimal syntax based on your actual system and data volume.
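
Since JOIN is mentioned above but not demonstrated elsewhere in this article, here is a sketch of a JOIN-based version of the duplicate-email extraction, using the same hypothetical users table:

SELECT u.*
FROM users u
INNER JOIN (
  SELECT email
  FROM users
  GROUP BY email
  HAVING COUNT(*) > 1
) dup ON u.email = dup.email;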

6-2. Why Index Design Matters

For columns used in duplicate checks or deletion filters, always create indexes. Without indexes, full table scans can occur and performance can become extremely slow.

Example: Creating an Index

CREATE INDEX idx_email ON users(email);

If you detect duplicates across multiple columns, a composite index is also effective.

CREATE INDEX idx_name_birthday ON users(first_name, birthday);

Index design can dramatically change read performance and search efficiency.
Note: Adding too many indexes can slow down writes and increase storage usage, so balance is important.

6-3. Batch Processing for Large Datasets

  • If the dataset is on the order of tens of thousands to millions of rows, it is safer to run processing in smaller batches instead of handling everything at once.
  • For deletes and updates, limit the number of rows processed per execution (e.g., LIMIT 1000) and run multiple times to reduce lock contention and performance degradation.
    DELETE FROM users
    WHERE id IN (
      SELECT id FROM (
        SELECT MAX(id) AS id   -- each pass removes the newest row per duplicate group
        FROM users
        GROUP BY email
        HAVING COUNT(*) > 1
        LIMIT 1000             -- MySQL only allows the LIMIT inside a derived table here
      ) AS batch
    );

    Rerun the statement until it affects zero rows.

6-4. Using Execution Plans (EXPLAIN)

Use EXPLAIN to analyze how a query is executed. This helps you check whether indexes are being used effectively, and whether a full scan (ALL) is occurring.

EXPLAIN SELECT * FROM users WHERE email IN (...);
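
As a concrete example, you can run EXPLAIN against the duplicate-email query from section 2 and inspect the type and key columns of the output: type = ALL indicates a full table scan, while values such as ref, range, or index together with a non-NULL key mean an index is being used.

EXPLAIN
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;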

By keeping performance and index strategy in mind, you can handle duplicate processing safely and efficiently even for large datasets.

7. Advanced Use Cases: Handling Complex Scenarios

In real‑world environments, duplicate detection and deletion are often more complex than simple matching. You may need to add additional conditions, execute operations safely in stages, or meet stricter operational requirements. In this section, we introduce advanced practical techniques for handling duplicate data safely and flexibly.

7-1. Conditional Duplicate Deletion

If you want to delete only duplicates that meet specific conditions, use the WHERE clause strategically.

Example: Delete only duplicate records with the same email and status = 'withdrawn'

DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id)
    FROM users
    WHERE status = 'withdrawn'
    GROUP BY email
  ) AS temp_ids
)
AND status = 'withdrawn';

By adding conditions to WHERE and GROUP BY, you can precisely control which records to keep and which to remove.

7-2. Recommended: Batch Processing and Split Execution

If the dataset is very large or you want to avoid lock contention and performance degradation, use batch processing.

  • Do not process all deletion targets at once—use LIMIT for split execution
  • Use transaction control and roll back in case of unexpected errors
  • Manage risk with backups and logging
    DELETE FROM users
    WHERE id IN (
      SELECT id FROM (
        SELECT MAX(id) AS id   -- newest duplicate 'withdrawn' row per email
        FROM users
        WHERE status = 'withdrawn'
        GROUP BY email
        HAVING COUNT(*) > 1
        LIMIT 500
      ) AS temp_ids
    );

This approach significantly reduces system load.

7-3. Handling Complex Duplicate Definitions

In different business contexts, the definition of “duplicate” varies. You can combine subqueries, CASE expressions, and aggregate functions for flexible handling.

Example: Consider duplicates only when product_id, order_date, and price are all identical

SELECT product_id, order_date, price, COUNT(*)
FROM orders
GROUP BY product_id, order_date, price
HAVING COUNT(*) > 1;

For more advanced requirements such as “keep only the most recent record among duplicates,” you can use subqueries or ROW_NUMBER() (available in MySQL 8.0 and later).
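
As a sketch, assuming MySQL 8.0 or later, the following keeps only the newest row (here, the one with the largest id) per email and deletes the rest:

DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id DESC) AS rn
    FROM users
  ) AS ranked
  WHERE rn > 1 -- everything except the newest row per email
);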

7-4. Best Practices for Transactions and Backups

  • Always wrap DELETE or UPDATE operations in transactions so you can restore data with ROLLBACK if issues occur.
  • If working with important tables or large datasets, always create a backup beforehand.

By mastering these advanced techniques, you can handle duplicate data processing safely and flexibly in any environment.

8. Summary

In this article, we systematically explained how to extract and delete duplicate data in MySQL, from fundamentals to advanced applications. Let’s review the key points.

8-1. Key Takeaways

  • Detecting duplicate data: You can detect duplicates not only in a single column but also across multiple columns. The combination of GROUP BY and HAVING COUNT(*) > 1 is the fundamental pattern for duplicate detection.
  • Extracting all duplicate records: Using subqueries and the EXISTS clause, you can retrieve all records corresponding to duplicate key values.
  • Deleting duplicate records: By using MIN(id) or MAX(id) to keep representative rows and combining subqueries with DELETE statements, you can safely remove unnecessary duplicates. Avoiding MySQL Error 1093 is also important.
  • Performance and indexing: For large datasets or complex conditions, proper indexing, batch processing, and execution-plan checks using EXPLAIN are essential.
  • Practical techniques: Conditional deletion, split execution, transaction management, and backups are key practices to avoid mistakes in production environments.

8-2. Quick Reference by Use Case

Scenario                                 | Recommended Approach
Single-column duplicate detection        | GROUP BY + HAVING
Multi-column duplicate detection         | GROUP BY (multiple columns) + HAVING
Retrieve all duplicate records           | Subquery (IN / EXISTS)
Safe deletion                            | Subquery + derived table + DELETE
High-speed processing of large datasets  | Indexes + batch processing + EXPLAIN
Conditional duplicate deletion           | Combine WHERE clause and transactions

8-3. Preventing Future Duplicate Issues

Preventing duplicates at the time of insertion is equally important.

  • Consider using UNIQUE constraints during table design (see the example after this list).
  • Regular data cleanup and auditing help detect operational issues early.
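
For example, a unique index on email (using the hypothetical users table from this article) makes the database itself reject duplicate inserts. Note that the statement fails if duplicates still exist, so clean them up first:

ALTER TABLE users ADD UNIQUE INDEX uq_users_email (email);

Once the constraint is in place, conflicting inserts can be handled with INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, depending on the desired behavior.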

Extracting and deleting duplicate data in MySQL requires knowledge ranging from basic SQL to advanced techniques. We hope this guide supports your database maintenance and system operations.
If you have specific cases or further questions, check the FAQ below or consult a database specialist.

9. FAQ: Frequently Asked Questions About Extracting and Deleting Duplicate Data in MySQL

Q1. Why use GROUP BY + HAVING instead of DISTINCT?

DISTINCT removes duplicates in the result set, but it cannot tell you how many times a value appears. By combining GROUP BY and HAVING COUNT(*) > 1, you can determine which values appear multiple times and how many duplicates exist.
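
A quick side-by-side illustration:

-- DISTINCT: one row per email, but no indication of how often each occurs
SELECT DISTINCT email FROM users;

-- GROUP BY + HAVING: only repeated emails, together with their counts
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;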

Q2. Should I use IN or EXISTS?

For small datasets, the difference is minimal. For large tables or when indexes are effective, EXISTS often performs better. Test both approaches in your environment and verify execution plans using EXPLAIN.

Q3. How do I detect duplicates across multiple columns?

Specify multiple columns in GROUP BY and use HAVING COUNT(*) > 1 to detect combinations where all specified columns match. Example: GROUP BY first_name, birthday

Q4. I get Error 1093 when running DELETE. What should I do?

MySQL throws Error 1093 when you reference the same table in a subquery within a DELETE statement. Wrap the subquery result in a derived table using SELECT * FROM (...) AS alias to avoid the error.

Q5. How can I safely delete duplicate data?

Always create a backup before deletion, verify targets with a SELECT statement, and use transactions when possible. Batch deletion may also be safer for large datasets.

Q6. What should I do if queries are slow with large data volumes?

Create indexes on columns used for duplicate detection. Use batch processing with LIMIT and check execution plans using EXPLAIN to avoid unnecessary full table scans.

Q7. How can I fundamentally prevent duplicate inserts?

Define UNIQUE constraints or unique keys during table design to prevent duplicate values from being inserted. Also, perform periodic duplicate checks and data cleansing after deployment.

Q8. Can the same methods be used in MariaDB or other RDBMS?

Basic SQL constructs such as GROUP BY, HAVING, and subqueries are also supported in MariaDB and PostgreSQL. However, DELETE subquery restrictions and performance characteristics may vary by product, so always test in advance.