- 1. Introduction
- 2. Basics: Detecting Duplicates Using a Key Column
- 3. Extracting All Records That Share Duplicate Keys
- 4. Detecting Duplicates Across Multiple Columns
- 5. Removing Duplicate Records (DELETE)
- 6. Performance Considerations and Index Strategy
- 7. Advanced Use Cases: Handling Complex Scenarios
- 8. Summary
- 9. FAQ: Frequently Asked Questions About Extracting and Deleting Duplicate Data in MySQL
- 9.1 Q1. Why use GROUP BY + HAVING instead of DISTINCT?
- 9.2 Q2. Should I use IN or EXISTS?
- 9.3 Q3. How do I detect duplicates across multiple columns?
- 9.4 Q4. I get Error 1093 when running DELETE. What should I do?
- 9.5 Q5. How can I safely delete duplicate data?
- 9.6 Q6. What should I do if queries are slow with large data volumes?
- 9.7 Q7. How can I fundamentally prevent duplicate inserts?
- 9.8 Q8. Can the same methods be used in MariaDB or other RDBMS?
1. Introduction
When operating a database, it is not uncommon to encounter issues such as “duplicate records being inserted” or “data that should be unique appearing multiple times.” In environments where relational databases like MySQL are used, extracting and managing duplicate data is an essential task to maintain data accuracy and quality.
For example, in core business tables such as member information, product data, and order history, duplicate records may be inserted due to user mistakes or system errors. If left unaddressed, this can reduce the accuracy of aggregation and analysis, and may also lead to unexpected bugs or operational issues.
To resolve this “duplicate data problem,” you must first identify which records are duplicated, and then organize or remove those duplicate records depending on the situation. However, using only a standard SELECT statement in MySQL is often not sufficient to efficiently detect duplicates. Slightly advanced SQL techniques and practical approaches are required.
In this article, we focus on “How to Extract Duplicate Data in MySQL”, covering everything from basic SQL statements to practical applications, performance considerations, and common error handling. Whether you are a database beginner or an engineer who writes SQL daily, this guide aims to provide practical and field‑oriented knowledge.
2. Basics: Detecting Duplicates Using a Key Column
The most basic way to extract duplicate data in MySQL is to identify cases where “multiple records share the same value in a specific column (key column).” In this section, we explain representative SQL queries used to detect duplicate key values and how they work.
2-1. Detecting Duplicates with GROUP BY and HAVING
The fundamental technique for duplicate detection is grouping records by a specific column using the GROUP BY clause, then filtering groups that contain two or more records using the HAVING clause. Here is a typical example:
SELECT key_column, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1;
Example: Extracting Duplicate Member Email Addresses
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
When this query is executed, if the same email address has been registered multiple times, the email address and the number of duplicates (count) will be displayed in the results.
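For instance, with hypothetical data in which taro@example.com was registered three times and hanako@example.com twice, the result would look like this:

| email | count |
|---|---|
| taro@example.com | 3 |
| hanako@example.com | 2 |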
2-2. Detecting Duplicates Across Multiple Columns
If you need to detect duplicates based on a combination of two or more columns, you can specify multiple columns in the GROUP BY clause using the same logic.
SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;
Using this method, you can detect duplicates where multiple conditions fully match, such as “same full name and birth date” or “same product ID and order date.”
2-3. Calculating the Total Number of Duplicate Records
If you want to understand the overall scale of duplication, you can use a subquery to calculate the total number of duplicate entries.
SELECT SUM(duplicate_count) AS total_duplicates
FROM (
SELECT COUNT(*) AS duplicate_count
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1
) AS duplicates;
This query sums the number of duplicate entries across all duplicate groups.
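A closely related variant, if what you actually need is the number of redundant rows (those that would be removed while keeping one per group), is to subtract the distinct key count. Note that COUNT(DISTINCT) ignores NULL keys, so treat the figure as approximate when NULLs are possible.
SELECT COUNT(*) - COUNT(DISTINCT key_column) AS redundant_rows
FROM table_name;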
By combining GROUP BY and HAVING, you can extract duplicate data in MySQL in a simple and efficient manner.
3. Extracting All Records That Share Duplicate Keys
In the previous section, we introduced how to list only the “duplicate key values.” However, in real‑world work, you often need to confirm “which exact records are duplicated, and inspect all their details.” For example, you may want to review the full duplicated user profiles or inspect duplicated product data row by row.
In this section, we explain practical SQL patterns to extract all records that share duplicate keys.
3-1. Extracting Duplicate Records Using a Subquery
The most basic approach is to retrieve the list of duplicate key values in a subquery, then fetch all records that match those keys.
SELECT *
FROM table_name
WHERE key_column IN (
SELECT key_column
FROM table_name
GROUP BY key_column
HAVING COUNT(*) > 1
);
Example: Extracting All Records with Duplicate Email Addresses
SELECT *
FROM users
WHERE email IN (
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1
);
When you run this query, it extracts all rows in the “users” table where the email address is duplicated (including columns such as ID, registration date, etc.).
3-2. Efficient Extraction Using EXISTS
If you need to handle large datasets or care about performance, using EXISTS can also be effective. IN and EXISTS are similar, but depending on data volume and indexing, one may be faster than the other.
SELECT *
FROM table_name t1
WHERE EXISTS (
SELECT 1
FROM table_name t2
WHERE t1.key_column = t2.key_column
GROUP BY t2.key_column
HAVING COUNT(*) > 1
);
Example: Duplicate Email Records (Using EXISTS)
SELECT *
FROM users u1
WHERE EXISTS (
SELECT 1
FROM users u2
WHERE u1.email = u2.email
GROUP BY u2.email
HAVING COUNT(*) > 1
);
3-3. Notes and Performance Considerations
- Subquery performance can be significantly affected when the dataset is large. With proper indexing, both IN and EXISTS can perform at a practical level.
- However, if you need complex conditions or want to determine duplicates across multiple columns, queries can become heavy. Always validate behavior in a test environment first.
In this way, extracting all records matching duplicate keys can be achieved using subqueries or the EXISTS clause.
4. Detecting Duplicates Across Multiple Columns
Duplicate detection conditions aren’t always based on a single column. In practice, it’s common to require uniqueness across a combination of multiple columns. For example, you may consider records duplicates when “full name + birth date” match, or when “product ID + color + size” are all identical.
In this section, we explain in detail how to extract duplicates using multiple columns.
4-1. Detecting Duplicates with GROUP BY Using Multiple Columns
To detect duplicates across multiple columns, list the columns separated by commas in the GROUP BY clause. With HAVING COUNT(*) > 1, you can extract only combinations that appear two or more times.
SELECT col1, col2, COUNT(*) AS duplicate_count
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1;
Example: Detecting Duplicates by “first_name” and “birthday”
SELECT first_name, birthday, COUNT(*) AS count
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1;
This query helps you identify cases where the combination of “same name” and “same birth date” has been registered multiple times.
4-2. Extracting All Records for Duplicate Multi-Column Keys
If you need all record details for duplicate key combinations, you can extract duplicate pairs in a subquery and then fetch all rows that match those pairs.
SELECT *
FROM table_name t1
WHERE (col1, col2) IN (
SELECT col1, col2
FROM table_name
GROUP BY col1, col2
HAVING COUNT(*) > 1
);
Example: Full Records for Duplicates in “first_name” and “birthday”
SELECT *
FROM users u1
WHERE (first_name, birthday) IN (
SELECT first_name, birthday
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1
);
Using this query, for example, if the combination “Taro Tanaka / 1990-01-01” is registered multiple times, you can retrieve all related detailed rows.
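As a hedged alternative, the same extraction can be written as a derived-table JOIN, which MySQL's optimizer often handles at least as well as a row-constructor IN:
SELECT u.*
FROM users u
JOIN (
  SELECT first_name, birthday
  FROM users
  GROUP BY first_name, birthday
  HAVING COUNT(*) > 1
) d
  ON u.first_name = d.first_name
 AND u.birthday = d.birthday;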
4-3. Detecting Exact Duplicates (COUNT DISTINCT)
If you want to estimate “how many rows are exact duplicates across multiple columns,” you can also use aggregation with COUNT(DISTINCT ...).
SELECT COUNT(*) - COUNT(DISTINCT col1, col2) AS duplicate_count
FROM table_name;
This SQL counts the extra rows beyond each distinct (col1, col2) combination, i.e., the number of fully duplicated rows. Because COUNT(DISTINCT ...) ignores rows where any listed column is NULL, treat the result as approximate when NULLs are possible.
4-4. Notes
- Even for multi‑column duplicate detection, proper indexing can significantly improve query speed.
- If many columns are involved or NULL values are present, you may get unexpected results: GROUP BY treats NULLs as equal to each other, while COUNT(DISTINCT ...) skips them entirely (illustrated below). Design your conditions carefully.
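To make the NULL caveat concrete, here is a sketch (hypothetical rows assumed):
-- Two rows with ('Taro', NULL) fall into the same group, so they are
-- reported as duplicates here:
SELECT first_name, birthday, COUNT(*) AS cnt
FROM users
GROUP BY first_name, birthday
HAVING COUNT(*) > 1;

-- By contrast, COUNT(DISTINCT first_name, birthday) excludes rows where
-- either column is NULL, so the COUNT(*) - COUNT(DISTINCT ...) figure from
-- 4-3 overcounts redundant rows when NULLs are present.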
In this way, detecting and extracting duplicates across multiple columns can be handled flexibly with well‑designed SQL.
5. Removing Duplicate Records (DELETE)
Once you can extract duplicate data, the next step is deleting unnecessary duplicates. In practice, a common approach is to keep only one record among duplicates and delete the rest. However, when deleting duplicates automatically in MySQL, you must narrow down the deletion target carefully to avoid unintended data loss.
In this section, we explain common safe methods for deleting duplicate data and key precautions.
5-1. Deleting Duplicates with a Subquery + DELETE
If you want to keep only the “oldest” or “newest” record and delete the others, a DELETE statement with a subquery can be useful.
Example: Keep the smallest (oldest) ID record and delete the others
DELETE FROM users
WHERE id NOT IN (
SELECT MIN(id)
FROM users
GROUP BY email
);
This query is intended to keep only the smallest id (the first registered record) for each email and delete all other rows sharing that email. Note, however, that MySQL rejects this form with Error 1093, because the subquery reads from the same table being deleted from; the workaround follows in 5-2.
5-2. How to Avoid MySQL‑Specific Error (Error 1093)
In MySQL, you may encounter Error 1093 when you try to DELETE from a table while also referencing the same table in a subquery. In that case, you can avoid the error by wrapping the subquery result as a derived table (temporary result set).
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
GROUP BY email
) AS temp_ids
);
By wrapping the subquery with SELECT * FROM (...) AS alias, you can prevent the error and delete safely.
5-3. Deleting Duplicates for Multi‑Column Keys
If you want to delete duplicates based on a combination of multiple columns, use GROUP BY with multiple columns and delete everything except the representative record.
Example: For duplicates by “first_name” and “birthday,” delete all but the first record
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
GROUP BY first_name, birthday
) AS temp_ids
);

5-4. Safety Measures and Best Practices for Deletion
Deleting duplicates is a high‑risk operation that can permanently remove data. Make sure to follow these best practices:
- Take backups: Always save a backup of the entire table or the target records before deleting.
- Use transactions: If possible, wrap the operation in a transaction so you can roll back immediately if something goes wrong.
- Confirm counts with SELECT first: Build the habit of verifying “Is the deletion target correct?” by running a SELECT query first.
- Check indexes: Adding indexes to columns used for duplicate detection improves both performance and accuracy.
In MySQL, you can safely delete duplicate data by using subqueries and derived tables. Always proceed carefully, with sufficient testing and a solid backup strategy.
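Putting those practices together, here is a hedged end-to-end sketch (assuming the users/email example above and an InnoDB table, since non-transactional engines such as MyISAM cannot roll back):
START TRANSACTION;

-- 1) Preview: how many rows would be deleted? Sanity-check the count first.
SELECT COUNT(*) AS rows_to_delete
FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id) FROM users GROUP BY email
  ) AS keep_ids
);

-- 2) Delete with the same predicate only if the count matches expectations.
DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id) FROM users GROUP BY email
  ) AS keep_ids
);

-- 3) Re-check the data, then COMMIT; if anything looks wrong, ROLLBACK.
COMMIT;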
6. Performance Considerations and Index Strategy
When extracting or deleting duplicate data in MySQL, query execution time and server load become more problematic as the table grows. Especially in large‑scale systems or batch jobs, performance‑aware SQL design and index optimization are essential. In this section, we explain tips for improving performance and key points for index design in duplicate data processing.
6-1. Choosing Between EXISTS, IN, and JOIN
SQL constructs such as IN, EXISTS, and JOIN are commonly used for extracting duplicate data, but each has different characteristics and performance tendencies.
- IN – Fast when the subquery result set is small, but performance tends to degrade as the result set grows.
- EXISTS – Stops searching as soon as a matching record is found, so it is often effective for large tables or when matches are relatively rare.
- JOIN – Useful for retrieving many pieces of information at once, but it can become slower if you join unnecessary data or lack proper indexing.
Example Performance Comparison
| Syntax | Small Data | Large Data | Comment |
|---|---|---|---|
| IN | ◎ | △ | Slow when the result set is large |
| EXISTS | ◯ | ◎ | Advantageous for large databases |
| JOIN | ◯ | ◯ | Proper indexes required |

(◎ = excellent, ◯ = good, △ = poor)
It is important to choose the optimal syntax based on your actual system and data volume.
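Since the comparison above mentions JOIN without showing it, here is a sketch of the JOIN form for the single-column email example:
SELECT u.*
FROM users u
JOIN (
  SELECT email
  FROM users
  GROUP BY email
  HAVING COUNT(*) > 1
) dup ON u.email = dup.email;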
6-2. Why Index Design Matters
For columns used in duplicate checks or deletion filters, always create indexes. Without indexes, full table scans can occur and performance can become extremely slow.
Example: Creating an Index
CREATE INDEX idx_email ON users(email);
If you detect duplicates across multiple columns, a composite index is also effective.
CREATE INDEX idx_name_birthday ON users(first_name, birthday);
Index design can dramatically change read performance and search efficiency.
Note: Adding too many indexes can slow down writes and increase storage usage, so balance is important.
6-3. Batch Processing for Large Datasets
- If the dataset is on the order of tens of thousands to millions of rows, it is safer to run processing in smaller batches instead of handling everything at once.
- For deletes and updates, limit the number of rows processed per execution (e.g., LIMIT 1000) and run multiple times to reduce lock contention and performance degradation.

DELETE FROM users
WHERE id IN (
  -- The first 1000 duplicate record IDs extracted by a subquery
)
LIMIT 1000;
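One hedged way to fill in that placeholder, reusing the email example from section 5: delete up to 1,000 non-representative rows per run and repeat until no rows are affected.
DELETE FROM users
WHERE id NOT IN (
  SELECT * FROM (
    SELECT MIN(id) FROM users GROUP BY email
  ) AS keep_ids
)
LIMIT 1000;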
6-4. Using Execution Plans (EXPLAIN)
Use EXPLAIN to analyze how a query is executed. This helps you check whether indexes are being used effectively, and whether a full scan (ALL) is occurring.
EXPLAIN SELECT * FROM users WHERE email IN (...);
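As a concrete illustration (a sketch reusing the duplicate-detection query from section 2), you can verify whether the idx_email index from 6-2 is chosen:
EXPLAIN
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- In the output, check the 'type' column (ref, range, or index are fine;
-- ALL indicates a full table scan) and the 'key' column (ideally idx_email).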
By keeping performance and index strategy in mind, you can handle duplicate processing safely and efficiently even for large datasets.
7. Advanced Use Cases: Handling Complex Scenarios
In real‑world environments, duplicate detection and deletion are often more complex than simple matching. You may need to add additional conditions, execute operations safely in stages, or meet stricter operational requirements. In this section, we introduce advanced practical techniques for handling duplicate data safely and flexibly.
7-1. Conditional Duplicate Deletion
If you want to delete only duplicates that meet specific conditions, use the WHERE clause strategically.
Example: Delete only duplicate records with the same email and status = 'withdrawn'
DELETE FROM users
WHERE id NOT IN (
SELECT * FROM (
SELECT MIN(id)
FROM users
WHERE status = 'withdrawn'
GROUP BY email
) AS temp_ids
)
AND status = 'withdrawn';
By adding conditions to WHERE and GROUP BY, you can precisely control which records to keep and which to remove.
7-2. Recommended: Batch Processing and Split Execution
If the dataset is very large or you want to avoid lock contention and performance degradation, use batch processing.
- Do not process all deletion targets at once; use LIMIT for split execution
- Use transaction control and roll back in case of unexpected errors
- Manage risk with backups and logging

DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    -- Extract duplicate record IDs filtered by conditions
  ) AS temp_ids
)
LIMIT 500;
This approach significantly reduces system load.
7-3. Handling Complex Duplicate Definitions
In different business contexts, the definition of “duplicate” varies. You can combine subqueries, CASE expressions, and aggregate functions for flexible handling.
Example: Consider duplicates only when product_id, order_date, and price are all identical
SELECT product_id, order_date, price, COUNT(*)
FROM orders
GROUP BY product_id, order_date, price
HAVING COUNT(*) > 1;
For more advanced requirements such as “keep only the most recent record among duplicates,” you can use subqueries or ROW_NUMBER() (available in MySQL 8.0 and later).
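For instance, here is a hedged sketch of the "keep only the most recent" pattern on MySQL 8.0+; the created_at column is an assumption, with id as a tiebreaker. Wrapping the window query in a derived table also avoids Error 1093:
DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY email
             ORDER BY created_at DESC, id DESC  -- created_at is assumed
           ) AS rn
    FROM users
  ) AS ranked
  WHERE rn > 1
);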
7-4. Best Practices for Transactions and Backups
- Always wrap DELETE or UPDATE operations in transactions so you can restore data with ROLLBACK if issues occur.
- If working with important tables or large datasets, always create a backup beforehand.
By mastering these advanced techniques, you can handle duplicate data processing safely and flexibly in any environment.
8. Summary
In this article, we systematically explained how to extract and delete duplicate data in MySQL, from fundamentals to advanced applications. Let’s review the key points.
8-1. Key Takeaways
- Detecting Duplicate Data: duplicates can be detected not only in a single column but also across multiple columns. The combination of GROUP BY and HAVING COUNT(*) > 1 is the fundamental pattern for duplicate detection.
- Extracting All Duplicate Records: using subqueries and the EXISTS clause, you can retrieve all records corresponding to duplicate key values.
- Deleting Duplicate Records: by using MIN(id) or MAX(id) to keep representative rows and combining subqueries with DELETE statements, you can safely remove unnecessary duplicates. Avoiding MySQL Error 1093 is also important.
- Performance and Indexing: for large datasets or complex conditions, proper indexing, batch processing, and execution plan checks using EXPLAIN are essential.
- Practical Techniques: conditional deletion, split execution, transaction management, and backups are key practices to avoid mistakes in production environments.
8-2. Quick Reference by Use Case
| Scenario | Recommended Approach |
|---|---|
| Single-column duplicate detection | GROUP BY + HAVING |
| Multi-column duplicate detection | GROUP BY (multiple columns) + HAVING |
| Retrieve all duplicate records | Subquery (IN / EXISTS) |
| Safe deletion | Subquery + derived table + DELETE |
| High-speed processing of large datasets | Indexes + batch processing + EXPLAIN |
| Conditional duplicate deletion | Combine WHERE clause and transactions |
8-3. Preventing Future Duplicate Issues
Preventing duplicates at the time of insertion is equally important.
- Consider using UNIQUE constraints during table design (see the sketch after this list).
- Regular data cleanup and auditing help detect operational issues early.
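Here is a hedged sketch of that UNIQUE-constraint approach; the index name is illustrative, and the ALTER will fail if duplicates still exist, so remove them first:
-- Enforce uniqueness at the schema level (run only after cleanup).
ALTER TABLE users ADD UNIQUE INDEX uq_users_email (email);

-- With the constraint in place, a duplicate insert can be absorbed rather
-- than raising an error:
INSERT INTO users (email, first_name)
VALUES ('taro@example.com', 'Taro')
ON DUPLICATE KEY UPDATE first_name = VALUES(first_name);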
Extracting and deleting duplicate data in MySQL requires knowledge ranging from basic SQL to advanced techniques. We hope this guide supports your database maintenance and system operations.
If you have specific cases or further questions, consider checking FAQs or consulting a database specialist.
9. FAQ: Frequently Asked Questions About Extracting and Deleting Duplicate Data in MySQL
Q1. Why use GROUP BY + HAVING instead of DISTINCT?
DISTINCT removes duplicates in the result set, but it cannot tell you how many times a value appears. By combining GROUP BY and HAVING COUNT(*) > 1, you can determine which values appear multiple times and how many duplicates exist.
Q2. Should I use IN or EXISTS?
For small datasets, the difference is minimal. For large tables or when indexes are effective, EXISTS often performs better. Test both approaches in your environment and verify execution plans using EXPLAIN.
Q3. How do I detect duplicates across multiple columns?
Specify multiple columns in GROUP BY and use HAVING COUNT(*) > 1 to detect combinations where all specified columns match. Example: GROUP BY first_name, birthday
Q4. I get Error 1093 when running DELETE. What should I do?
MySQL throws Error 1093 when you reference the same table in a subquery within a DELETE statement. Wrap the subquery result in a derived table using SELECT * FROM (...) AS alias to avoid the error.
Q5. How can I safely delete duplicate data?
Always create a backup before deletion, verify targets with a SELECT statement, and use transactions when possible. Batch deletion may also be safer for large datasets.
Q6. What should I do if queries are slow with large data volumes?
Create indexes on columns used for duplicate detection. Use batch processing with LIMIT and check execution plans using EXPLAIN to avoid unnecessary full table scans.
Q7. How can I fundamentally prevent duplicate inserts?
Define UNIQUE constraints or unique keys during table design to prevent duplicate values from being inserted. Also, perform periodic duplicate checks and data cleansing after deployment.
Q8. Can the same methods be used in MariaDB or other RDBMS?
Basic SQL constructs such as GROUP BY, HAVING, and subqueries are also supported in MariaDB and PostgreSQL. However, DELETE subquery restrictions and performance characteristics may vary by product, so always test in advance.