Detecting & Removing Duplicates from Image Datasets
The presence of duplicates in datasets is one of the most underestimated problems, capable of nullifying the efforts of an entire development team. At the data preparation stage, the prevailing belief is often that a larger sample automatically yields better results. In reality, redundancy creates hidden risks that distort the true picture of a model's performance.
Duplicates create artificial skews in model weights. If a certain object or scenario is repeated in a dataset hundreds of times through copies, the neural network begins to treat these features as a priority, ignoring less-represented but no less important cases. This makes AI behavior unstable and biased.
From a business perspective, storing and processing duplicates represents direct financial losses. Computing power for training on bloated datasets, cloud storage costs, and the labor costs of annotators who label the same frames significantly increase the project budget. Cleaning data of repetitions improves prediction accuracy and significantly optimizes resource utilization by focusing on unique and valuable information.
Quick Take
- A large amount of data does not guarantee success if it contains hidden duplicates that distort results.
- Deduplication prevents data leakage between training and test sets, revealing the model's factual accuracy.
- Cleaning the dataset before annotation significantly reduces cloud storage costs and labor costs for labelers.
- Using pHash finds visually altered copies, while CNN embeddings find semantically similar duplicates.
- Specialized tools like FAISS allow for finding repetitions among millions of images in minutes.
Impact of Redundant Data on Training Results
The purity of a dataset determines the reliability of artificial intelligence. When too many identical images enter the system, it creates an illusion of success that vanishes at the first encounter with real-world tasks.
Reasons for Removing Copies
Conducting high-quality dataset cleaning helps avoid technical traps that make the model ineffective. The problem is not just that duplicates distort the algorithm's picture of reality; they lead to several concrete failures:
- Data Leakage. Identical photos end up in both the training and test sets simultaneously, so the model simply recognizes a familiar image rather than analyzing it.
- False Metrics. Because of data leakage, developers see implausibly high accuracy, but in production the system fails because it has merely memorized the correct answers.
- Shift of Attention. Artificial intelligence focuses too much on objects that repeat frequently and begins to ignore rare but important details.
- Waste of Resources. The company spends money on storing redundant files and paying annotators who label the same objects.
Classification of Duplicates in Images
To conduct near-duplicate detection effectively, it is necessary to understand how copies can differ from one another. Identifying such files requires different technical approaches depending on their nature.
| Category | Characteristics | How to Detect |
| --- | --- | --- |
| Exact Duplicates | Files are identical at the byte level | Checksum comparison (e.g., MD5 or SHA) |
| Near Duplicates | The same photo with a different size, brightness, or format | Perceptual hashing algorithms |
| Semantic Duplicates | The same object captured a moment later or from a different angle | Visual similarity (embedding) analysis |
| Augmentation-based Duplicates | Copies created through artificial flipping or rotation | Comparing mathematical image vectors |
Simple methods help quickly remove identical files, but real quality is achieved only when the algorithm also detects visual similarity among altered copies. This leaves only unique examples in the set, forcing the neural network to truly learn rather than copy previously seen data.
To effectively clean data, it is important to choose a method that corresponds to the project's scale and the type of duplicates. Each approach has its own depth of analysis – from simple file checking to understanding the meaning of what is depicted in the image.
Technical Methods for Detecting Repetitions
Modern approaches to dataset cleaning are divided into mathematical methods of file comparison and the use of artificial intelligence to search for visually similar objects.
Mathematical and Intelligent Search Algorithms
The fastest way is to use classic hash functions such as MD5 or SHA. They convert a file into a short string of characters. If two files are identical byte-for-byte, their hashes will match. However, this method is powerless against any alteration: changing the image by even one pixel produces a completely different hash.
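As a minimal sketch, such an exact-match check can be done in a few lines of Python; the `images` directory name below is only a placeholder:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(image_dir: str) -> dict:
    """Group files that are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in Path(image_dir).rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    # Only groups with more than one file are true duplicates.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

duplicates = find_exact_duplicates("images")  # "images" is a placeholder path
```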
For combating visually similar copies, perceptual hashing is used. These algorithms create a "fingerprint" of the image based on its structure. Such a hash does not change with file compression or format changes, making it ideal for finding near-identical photos.
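A sketch of this check using the open-source `imagehash` library might look like the following; the 5-bit Hamming-distance threshold is an assumption that should be tuned per dataset:

```python
from itertools import combinations

from PIL import Image
import imagehash  # pip install imagehash

def phash_near_duplicates(paths, max_distance=5):
    """Flag pairs whose 64-bit perceptual hashes differ by at most `max_distance` bits."""
    hashes = {p: imagehash.phash(Image.open(p)) for p in paths}
    pairs = []
    for (p1, h1), (p2, h2) in combinations(hashes.items(), 2):
        if h1 - h2 <= max_distance:  # subtracting two hashes gives the Hamming distance
            pairs.append((p1, p2))
    return pairs
```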
A more advanced approach is based on CNN embeddings. Using powerful neural networks like ResNet or CLIP, each image is converted into a long list of numbers – a vector. This allows for comparing images based on content rather than pixels. To compare these vectors, cosine similarity is used. If the score is near 1, the images are almost identical in meaning. Here, it is important to correctly set threshold values to avoid deleting necessary data.
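For illustration, here is a minimal sketch using a pretrained ResNet-50 from torchvision; the file names and the 0.95 similarity threshold are assumptions, not fixed recommendations:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Drop the classification head so the network outputs a 2048-dim feature vector.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized embedding for a single image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    v = model(x).squeeze(0)
    return v / v.norm()

# Cosine similarity of normalized vectors is just their dot product.
similarity = float(embed("frame_001.jpg") @ embed("frame_002.jpg"))
is_semantic_duplicate = similarity > 0.95  # threshold must be tuned per dataset
```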
Scaling and Processing Speed
When the number of images reaches the millions, comparing every image with every other becomes computationally prohibitive. In such cases, specialized tools for working with large collections of visual similarity data are used.
| Method | Scaling Characteristics |
| --- | --- |
| Clustering | Groups similar photos into clusters for quick verification |
| ANN (approximate nearest neighbor search) | Ultra-fast search for similar vectors in million-image databases |
| Indexing | Allows instant duplicate detection when new photos are added |
Using libraries like FAISS allows for deduplicating giant datasets in minutes. This becomes part of an automated process where every new photo passes through a filter and is compared with those already in the database. This approach guarantees that the dataset always remains unique and useful for model training.
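A condensed sketch of how such a FAISS pipeline might be wired up, assuming 2048-dimensional embeddings like those from the previous example; the `embeddings.npy` file name and the 0.95 threshold are again assumptions:

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 2048                                   # embedding size (e.g., ResNet-50 features)
embeddings = np.load("embeddings.npy").astype("float32")  # placeholder file name
faiss.normalize_L2(embeddings)               # after normalization, inner product == cosine similarity

index = faiss.IndexFlatIP(dim)               # exact search; an IVF index can replace it at larger scale
index.add(embeddings)

# For every image, retrieve its two nearest vectors: itself plus the closest other image.
scores, ids = index.search(embeddings, k=2)
duplicate_pairs = [
    (i, int(ids[i, 1]))
    for i in range(len(embeddings))
    if scores[i, 1] > 0.95                   # similarity threshold is an assumption to tune
]
```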
Data Cleaning Strategy
Properly removing duplicates requires a balance between automation and human control. This becomes especially critical at the labeling stage, where every redundant operation increases the project cost.
Decisions on Deletion
When an algorithm finds a group of similar images, the question arises: which one to keep as the reference image. The best practice is to save the file with the highest resolution or the lowest compression level. If the images are identical in quality, the one added to the database first is chosen.
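One way to encode that rule, sketched in a few lines of Python; the assumption here is that `paths` is already ordered by the date the files were added, and the file names are hypothetical:

```python
from PIL import Image

def choose_reference(paths):
    """Keep the highest-resolution image; ties go to the file that was added first."""
    def resolution(path):
        with Image.open(path) as img:
            return img.width * img.height
    # max() prefers higher resolution; on a tie, the earlier (lower) index wins.
    return max(paths, key=lambda p: (resolution(p), -paths.index(p)))

duplicate_group = ["scene_a.png", "scene_a_copy.jpg", "scene_a_resized.jpg"]  # hypothetical files
keep = choose_reference(duplicate_group)
discard = [p for p in duplicate_group if p != keep]
```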
During dataset cleaning, it is important to consider class balance. If you remove duplicates only in one class and leave them in another, the model will receive a new bias. Also, for the most critical cases, a human-in-the-loop approach should be used, where a specialist reviews the results of the automated search. This avoids deleting useful but visually similar samples that actually carry unique information for the neural network.
Efficiency in the Data Labeling Process
For annotation companies, deduplication is a direct way to save budget. Labeling identical frames is wasted work that adds no knowledge to the model. Removing repetitions before starting annotation allows the team to focus on diverse data, which significantly speeds up training.
If some duplicates were already labeled, modern systems allow their annotations to be transferred automatically to the retained copies. After cleaning is completed, quality control must be performed to ensure that the deletions did not create gaps in the training logic and that the remaining data is still consistent.
After cleaning, it is important to ensure that removing redundancy actually improved the quality of the dataset and did not create new gaps. Proper verification of results allows for seeing real progress in model training.
Performance Evaluation and Error Handling
The final stage of deduplication is an audit that confirms the data has become cleaner and the model more objective.
How to Understand That It Got Better
The first indicator of success is the percentage of cleaned duplicates. However, the main criterion remains the change in model metrics. If, after removing duplicates, test accuracy drops slightly, but the model performs better on new data, it means you have eliminated overfitting.
A train/val leakage check is mandatory. Using similarity search, you can verify that not a single identical frame remains shared between the training and validation sets. A visual audit also plays an important role: a random sample of deleted images helps confirm that the system did not remove unique and valuable samples.
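A sketch of such a leakage check with FAISS, assuming both splits have already been converted into embedding matrices; the 0.95 threshold is an assumption:

```python
import numpy as np
import faiss

def find_leaked_pairs(train_vecs: np.ndarray, val_vecs: np.ndarray, threshold: float = 0.95):
    """Return (val_index, train_index) pairs whose cosine similarity exceeds the threshold."""
    train_vecs = train_vecs.astype("float32")   # astype copies, so the caller's arrays stay untouched
    val_vecs = val_vecs.astype("float32")
    faiss.normalize_L2(train_vecs)
    faiss.normalize_L2(val_vecs)

    index = faiss.IndexFlatIP(train_vecs.shape[1])
    index.add(train_vecs)
    scores, ids = index.search(val_vecs, k=1)   # closest training image for every validation image

    return [(i, int(ids[i, 0])) for i in range(len(val_vecs)) if scores[i, 0] >= threshold]
```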
Typical Mistakes to Avoid
Even an automated process can fail if near-duplicate detection specifics are not considered. Here are the most common pitfalls:
- Too Strict a Similarity Threshold. If the algorithm is tuned too aggressively, it will start deleting genuinely different images that were simply captured under similar conditions. This leads to a loss of useful diversity.
- Deleting Rare Cases. Sometimes unique objects look similar to one another. If you delete them, the model will never learn to handle underrepresented scenarios.
- Ignoring Semantic Duplicates. Deleting only byte-identical files without considering visual similarity does not solve the problem completely. The model will still "stall" on near-identical scenes.
- Deduplication Only After Annotation. This is the most expensive mistake. Spending funds on labeling what will later be deleted is irrational from a business process perspective.
FAQ
How does deduplication affect computational costs during training?
Cleaning can reduce training time without loss of quality because the model does not need to process redundant iterations with the same data.
What are "soft duplicates", and how do they differ from "near duplicates"?
This is a term often used for images that are not copies but share a very similar context. Removing them requires a deeper level of semantic analysis.
Are there cases where duplicates are useful to keep?
Almost never. Even if you need to increase a class's weight, it is better to use controlled augmentation rather than identical copies, which lead to overfitting.
How does deduplication help in the fight against "adversarial attacks"?
A cleaned dataset makes the model's decision boundaries clearer and more reasoned, which somewhat reduces vulnerability to specially selected visual noise.
What is the connection between deduplication and active learning?
They are an ideal pair. Deduplication removes redundancy at the start, and Active Learning helps annotators choose the most valuable samples from those that remain.
How to check deduplication quality without reviewing the entire database?
Using a statistical method: choose 100 random pairs that the algorithm recognized as duplicates and check them manually. This will provide an idea of the error rate.
Does deduplication affect the model's convergence speed?
Yes, the model usually converges faster because gradient updates are driven by unique examples rather than repeated copies of the same one.
What to do if two objects are different but have the same pHash?
This is called a collision. In such cases, it is worth using a combination of methods: first, a fast pHash, and then refinement through CNN embeddings.