arXiv Analytics

arXiv:2203.04895 [cs.CV]

Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction

Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu

Published 2022-03-09, Version 1

Depth maps offer color independence, illumination invariance and location discrimination, so they can provide important supplementary information for extracting salient objects in complex environments. However, high-quality depth sensors are expensive and cannot be widely deployed, while general depth sensors produce noisy and sparse depth information, which introduces irreversible interference into depth-based networks. In this paper, we propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD). Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks. In this way, the depth information can be completed and purified. Moreover, we introduce a multi-modal filtered transformer (MFT) module, which is equipped with three modality-specific filters to generate the transformer-enhanced feature for each modality. The proposed model works in a depth-free style during the testing phase. Experiments show that it not only significantly surpasses depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time. Moreover, the resulting depth map can help existing RGB-D SOD methods obtain significant performance gains.
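
As a rough illustration of how the three tasks named in the abstract could be trained jointly, the sketch below combines one loss term per task into a single objective. The abstract does not state which loss functions or weights the authors use, so everything here is an assumption: binary cross-entropy for the saliency and contour maps, mean absolute error for the depth map, equal weights, and the hypothetical names `bce`, `l1` and `joint_loss`. Predictions and targets are flat lists of per-pixel values to keep the example dependency-free.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def l1(pred, target):
    """Mean absolute error, used here for the depth regression term."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def joint_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of saliency, depth and contour losses.

    The choice of losses and the equal default weights are assumptions
    made for illustration, not details taken from the paper.
    """
    w_sal, w_depth, w_contour = weights
    return (
        w_sal * bce(outputs["saliency"], targets["saliency"])
        + w_depth * l1(outputs["depth"], targets["depth"])
        + w_contour * bce(outputs["contour"], targets["contour"])
    )


if __name__ == "__main__":
    outputs = {"saliency": [0.9, 0.1], "depth": [0.4, 0.8], "contour": [0.2, 0.7]}
    targets = {"saliency": [1, 0], "depth": [0.5, 0.75], "contour": [0, 1]}
    print(joint_loss(outputs, targets))
```

In a setup like this, the depth and contour targets act only as training-time supervision, which is consistent with the abstract's statement that the model runs depth-free at test time.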

Related articles:
arXiv:1604.07480 [cs.CV] (Published 2016-04-25)
Joint Semantic Segmentation and Depth Estimation with Deep Convolutional Networks
arXiv:2410.11610 [cs.CV] (Published 2024-10-15)
Depth Estimation From Monocular Images With Enhanced Encoder-Decoder Architecture
arXiv:2003.08933 [cs.CV] (Published 2020-03-19)
Depth Estimation by Learning Triangulation and Densification of Sparse Points for Multi-view Stereo