Keywords :
3D sensors; depth estimation; monocular; multimodal fusion; object detection; occlusion handling; convolutional neural network; light detection and ranging; three-dimensional object; two-dimensional; Software; Modeling and Simulation; Computer Science Applications
Abstract :
[en] Object detection in occluded environments remains a core challenge in computer vision, especially in domains such as autonomous driving and robotics. While Convolutional Neural Network (CNN)-based two-dimensional (2D) and three-dimensional (3D) object detection methods have made significant progress, both often fall short under severe occlusion: 2D methods suffer from depth ambiguities in 2D imagery, while 3D methods are constrained by the high cost and deployment limitations of sensors such as Light Detection and Ranging (LiDAR). This paper presents a comparative review of recent 2D and 3D detection models, focusing on their occlusion-handling capabilities and on the impact of sensor modalities such as stereo vision, Time-of-Flight (ToF) cameras, and LiDAR. In this context, we introduce FuDensityNet, our multimodal occlusion-aware detection framework that fuses Red-Green-Blue (RGB) images and LiDAR data to enhance detection performance. As a forward-looking direction, we propose a monocular depth-estimation extension to FuDensityNet, aimed at replacing expensive 3D sensors with a more scalable CNN-based pipeline. Although this extension is not experimentally evaluated in this manuscript, we describe its conceptual design and its potential for future implementation.
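To make the proposed monocular extension concrete, the sketch below illustrates the general idea of substituting a CNN-estimated depth channel for LiDAR depth in an RGB-D style input. This is not the authors' FuDensityNet implementation: the choice of the pretrained MiDaS model (loaded via torch.hub), the min-max depth normalization, and the simple channel concatenation are all illustrative assumptions standing in for the unspecified CNN-based pipeline.

import cv2
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumption: a pretrained monocular depth estimator (MiDaS small) stands in
# for the paper's unspecified CNN depth branch.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def rgbd_from_monocular(image_bgr: np.ndarray) -> np.ndarray:
    """Return an H x W x 4 array whose depth channel is estimated from a
    single RGB image instead of being read from a 3D sensor."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    batch = transform(rgb).to(device)
    with torch.no_grad():
        pred = midas(batch)  # relative inverse depth at reduced resolution
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()
    # Normalize to [0, 1] so the estimated depth channel is comparable
    # across frames (illustrative choice, not from the paper).
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return np.dstack([rgb.astype(np.float32) / 255.0, depth[..., None]])

# Usage: the 4-channel output would be fed to an occlusion-aware detector
# in place of a paired RGB image and LiDAR depth map, e.g.:
# rgbd = rgbd_from_monocular(cv2.imread("frame.png"))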