Abstract
Developing computer vision models can quickly become costly. One cost driver is the model itself, which is often large and complex; the other is the size of the dataset. The larger the model and the dataset, the more compute infrastructure is required. Dataset size is also a major driver of storage costs (e.g., cloud storage) and data labeling effort. Yet using smaller datasets normally comes at the cost of inferior model performance. In this paper, we propose a method that selectively samples images to create smaller, high-quality datasets. To do so, we project each image into a feature space and measure its similarity to other data samples using cosine distance. Based on this similarity score, we decide whether to keep or discard the image. We demonstrate our method in two use cases. First, we show that existing initial datasets can be enriched with new, diverse images. Second, we use our method to reduce large existing datasets by identifying clusters of similar images. Our method is completely unsupervised and requires no labels. In experiments, we show that small datasets created by our method yield better model performance than randomly selected data. We also find that models trained on our smaller datasets can even outperform models trained on larger datasets created via random selection. Our method thus provides a valuable tool for developing computer vision models in resource-constrained environments while sacrificing only little performance.
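
The keep-or-discard decision described above can be illustrated with a minimal sketch. It is not the implementation used in this paper. It assumes that feature vectors have already been computed by some feature extractor, that images are processed greedily in the given order, and that an image is kept when its cosine distance to every already kept image exceeds a threshold. The function name `select_diverse`, the parameter `threshold`, and its default value of 0.1 are hypothetical and chosen for illustration only.

```python
import numpy as np


def _normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def select_diverse(features, threshold=0.1, initial=None):
    """Greedy selection sketch (hypothetical helper, not the paper's code).

    features:  (n, d) array of precomputed image feature vectors.
    threshold: assumed minimum cosine distance to all kept images.
    initial:   optional (m, d) features of an existing dataset to enrich.
    Returns the indices into `features` of the images that are kept.
    """
    feats = _normalize(features)
    # Start from the existing dataset when enriching, else from nothing.
    kept_vecs = [] if initial is None else list(_normalize(initial))
    kept_idx = []
    for i, f in enumerate(feats):
        if kept_vecs:
            # Cosine distance to the most similar image kept so far.
            min_dist = 1.0 - np.max(np.stack(kept_vecs) @ f)
        else:
            min_dist = np.inf
        if min_dist > threshold:
            kept_idx.append(i)
            kept_vecs.append(f)
    return kept_idx
```

Calling `select_diverse(features)` corresponds to the dataset reduction use case, while passing the features of an existing dataset as `initial` corresponds to the enrichment use case. In both cases no labels are involved, only feature vectors.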