Computer Vision; Deep Learning; Data Augmentation; Stable Diffusion; Generative AI
Abstract :
[en] Computer vision tasks such as object detection and segmentation rely on the availability of extensive, accurately annotated datasets. In this work, we present CIA, a modular pipeline for (1) generating synthetic images for dataset augmentation using Stable Diffusion, (2) filtering out low-quality samples using defined quality metrics, and (3) forcing the existence of specific patterns in generated images using accurate prompting and ControlNet. To show how CIA can be used to search for an optimal augmentation pipeline of training data, we study human object detection in a data-constrained scenario, using YOLOv8n on the COCO and Flickr30k datasets. We recorded significant improvement using CIA-generated images, approaching the performance obtained when doubling the amount of real images in the dataset. Our findings suggest that our modular framework can significantly enhance object detection systems and enable future research on data-constrained scenarios. The framework is available at: github.com/multitel-ai/CIA.
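The filtering step (2) and the mixing of real and generated data can be illustrated with a minimal Python sketch. It is not the CIA implementation: the names `Sample`, `filter_by_quality`, `build_training_set`, and the `keep_ratio` parameter are hypothetical, the quality score stands in for whatever metric is chosen (assumed here to be higher-is-better), and the top-fraction rule is only one plausible selection strategy, since the abstract does not specify one.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    """A generated image path with a quality score (assumed: higher is better)."""
    path: str
    quality: float


def filter_by_quality(samples: Sequence[Sample], keep_ratio: float) -> List[Sample]:
    """Keep the best-scoring fraction of the generated samples (hypothetical rule)."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = int(len(samples) * keep_ratio)
    # Rank from highest to lowest score, then truncate.
    ranked = sorted(samples, key=lambda s: s.quality, reverse=True)
    return ranked[:n_keep]


def build_training_set(
    real: Sequence[str], synthetic: Sequence[Sample], keep_ratio: float
) -> List[str]:
    """Return real image paths followed by the retained synthetic image paths."""
    kept = filter_by_quality(synthetic, keep_ratio)
    return list(real) + [s.path for s in kept]


# Illustrative data only; the paths and scores are made up.
generated = [
    Sample("gen_0.png", 0.9),
    Sample("gen_1.png", 0.2),
    Sample("gen_2.png", 0.7),
    Sample("gen_3.png", 0.4),
]
train_paths = build_training_set(["real_0.jpg"], generated, keep_ratio=0.5)
```

With these made-up scores and `keep_ratio=0.5`, the two highest-scoring generated images are retained alongside the real one. Varying `keep_ratio` or swapping the score is the kind of knob one would tune when searching for an augmentation setup.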
Disciplines :
Computer science
Author, co-author :
Benkedadra, Mohamed ; Université de Mons - UMONS > Faculté Polytechnique > Service Informatique, Logiciel et Intelligence artificielle
F105 - Information, Signal et Intelligence artificielle - Information, Signal and Artificial Intelligence; F114 - Informatique, Logiciel et Intelligence artificielle
Research institute :
Infortech Numediart R450 - Institut NUMEDIART pour les Technologies des Arts Numériques
Name of the research project :
5443 - ARIAC BY DIGITALWALLONIA4.AI - Applications et Recherche pour une Intelligence Artificielle de Confiance - Région wallonne