Multi-view stereo (MVS) estimates the 3D geometry of a scene from images captured at multiple viewpoints in order to reconstruct the scene in 3D. To improve reconstruction accuracy in large-scale scenes while reducing the complexity of the reconstruction algorithm, we propose a coarse-to-fine multi-view stereo network based on an attention mechanism. First, a feature pyramid extracts multi-scale features, so that the different pyramid levels supply both richer geometric information and more contextual information, which improves modeling accuracy. Second, we apply position encoding to the coarse-scale feature map and introduce an attention mechanism to capture additional context. Third, a cascade structure constructs the high-resolution depth map. Finally, we use the reference image to refine the final result, enhancing details such as edges. Experiments on the publicly available DTU dataset show that the proposed method improves accuracy over existing algorithms. Experiments on other representative public datasets further validate its effectiveness.
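To make the coarse-to-fine idea concrete, the following is a minimal sketch of how a cascade can narrow the depth search range from one stage to the next. It is not the proposed network: the abstract does not specify the cascade's internals, so the per-stage plane counts, the `shrink` factor, the soft-argmin regression, and the toy `cost_fn` are all illustrative assumptions. A real MVS network would apply this per pixel to a learned cost volume.

```python
import math
from typing import Callable, List, Sequence


def depth_hypotheses(center: float, half_range: float, num_planes: int) -> List[float]:
    """Uniformly sample depth candidates in [center - half_range, center + half_range]."""
    if num_planes < 2:
        return [center]
    step = 2.0 * half_range / (num_planes - 1)
    return [center - half_range + i * step for i in range(num_planes)]


def expected_depth(hypotheses: Sequence[float], costs: Sequence[float]) -> float:
    """Soft-argmin: softmax over negative costs, then the weighted mean depth."""
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, hypotheses)) / total


def coarse_to_fine(
    cost_fn: Callable[[float], float],
    d_min: float,
    d_max: float,
    planes_per_stage: Sequence[int] = (48, 32, 8),  # hypothetical plane counts
    shrink: float = 0.25,                            # hypothetical range reduction
) -> float:
    """Estimate one depth value by repeatedly re-sampling a narrower range."""
    center = 0.5 * (d_min + d_max)
    half_range = 0.5 * (d_max - d_min)
    for num_planes in planes_per_stage:
        hyps = depth_hypotheses(center, half_range, num_planes)
        costs = [cost_fn(d) for d in hyps]
        center = expected_depth(hyps, costs)  # next stage searches around this estimate
        half_range *= shrink                  # and over a smaller interval
    return center


if __name__ == "__main__":
    # Toy matching cost with its minimum at a made-up "true" depth of 612.0.
    estimate = coarse_to_fine(lambda d: (d - 612.0) ** 2, 425.0, 935.0)
    print(f"estimated depth: {estimate:.2f}")
```

In this sketch the stages evaluate 48 + 32 + 8 = 88 hypotheses in total. Sampling the full range uniformly at the spacing used in the second stage would take roughly 125 planes, which illustrates why a cascade can lower cost at a comparable depth resolution.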