Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315801 (2024) https://doi.org/10.1117/12.3033697
This PDF file contains the front matter associated with SPIE Proceedings Volume 13158, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315802 (2024) https://doi.org/10.1117/12.3029465
In recent years, self-supervised 3D face reconstruction methods have demonstrated notable advancements in both quality and efficiency. However, existing self-supervised methods rely on only sparse facial keypoints to constrain the 3D facial shape, and they predominantly emphasize overall facial shape while overlooking local shape details, which leads to inaccuracies in facial feature reconstruction. To address this issue, this paper proposes a self-supervised 3D face reconstruction method based on dense keypoints. In addition to the traditional texture loss and deep feature loss, we employ the Iterative Closest Point (ICP) algorithm to establish correspondences between dense facial keypoints (Face Mesh) and the 3D facial model (BFM09), introducing a dense keypoint loss. By assigning different weights to keypoints on the nose, eyes, lips, and other local areas of the face, the proposed loss effectively constrains local facial information and enhances reconstruction accuracy. Experimental results on the AFLW2000-3D dataset show that the proposed method achieves a normalized mean error of 3.13%. Comparative analysis against mainstream methods shows that our approach yields the best results for small pose changes and outperforms them for medium pose variations, underscoring the effectiveness of the method.
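As a rough illustration of the dense keypoint loss described above, the sketch below computes a region-weighted mean error between Face Mesh landmarks and their ICP-matched BFM09 vertices; the weight values, the index map mesh_to_bfm_idx, and the region labels are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of a region-weighted dense-keypoint loss, assuming ICP has
# already produced an index map from Face Mesh landmarks to BFM09 vertices.
# REGION_WEIGHTS and the region labels are illustrative, not the paper's values.
import numpy as np

REGION_WEIGHTS = {"nose": 2.0, "eyes": 2.0, "lips": 2.0, "other": 1.0}

def dense_keypoint_loss(pred_vertices, target_landmarks, mesh_to_bfm_idx, region_of):
    """pred_vertices: (V, 3) BFM09 vertices; target_landmarks: (K, 3) Face Mesh
    keypoints; mesh_to_bfm_idx: (K,) vertex index per landmark (from ICP);
    region_of: (K,) region label per landmark."""
    pred = pred_vertices[mesh_to_bfm_idx]                  # (K, 3) correspondences
    w = np.array([REGION_WEIGHTS[r] for r in region_of])
    err = np.linalg.norm(pred - target_landmarks, axis=1)  # per-point L2 error
    return float(np.sum(w * err) / np.sum(w))              # weighted mean error
```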
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315803 (2024) https://doi.org/10.1117/12.3029429
Realistic volume shadows of volumetric datasets can improve the perception of shape and depth, and further enhance the efficiency of detecting defects, anomalies, and other issues. In this paper, a novel, high-performance method called slice-based ray casting (SBRC) is proposed to implement volume shadows for volumetric datasets. The first step of SBRC uses the light source as the viewpoint and renders the illumination information of the volumetric dataset, slice by slice, into an illumination attenuation buffer. The second step uses the camera as the viewpoint, renders the volumetric dataset with ray casting, and computes volume shadows from the illumination attenuation buffer. Experiments show that the method obtains much better volume shadows and more scalable performance than other volume illumination methods, owing to the slice-by-slice illumination attenuation calculation and the highly efficient shadow lookup during ray casting. In addition, a genetic algorithm is used to optimize the shadow calculation parameters for CT images, making edges in the rendered images clearer.
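A minimal CPU sketch of the two SBRC passes, assuming a directional light aligned with the volume's z axis so that slices can be processed in light order; the buffer layout and the opacity transfer function are assumptions.

```python
# Minimal sketch of the two passes in slice-based ray casting (SBRC).
# Pass 1 accumulates light transmittance slice by slice from the light;
# pass 2 (ray casting) reduces shadowing to a cheap buffer lookup per sample.
import numpy as np

def build_attenuation_buffer(volume, opacity):
    """volume: (Z, Y, X) densities; opacity: callable density -> [0, 1].
    Returns per-voxel light transmittance."""
    atten = np.ones_like(volume, dtype=np.float32)
    transmit = np.ones(volume.shape[1:], dtype=np.float32)
    for z in range(volume.shape[0]):          # slices ordered from the light
        atten[z] = transmit                   # light reaching this slice
        transmit = transmit * (1.0 - opacity(volume[z]))
    return atten

def shade_sample(color, atten_value):
    # During ray casting, shadowing is just a multiply by the stored value.
    return color * atten_value
```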
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315804 (2024) https://doi.org/10.1117/12.3029513
Realistic rendering involves the creation of visually realistic images through computer graphics techniques, commonly applied in fields such as film, video games, and computer-aided design. Recent advancements in modern computer architectures and graphics cards have propelled the rapid progress of ray-traced realistic rendering: algorithms formerly confined to offline rendering are gradually transitioning into real-time computing. Although realistic rendering has become faster and more accessible, implementing rendering algorithms remains a challenge for beginners who lack coding experience. Generic graphics APIs like Vulkan and DirectX, though powerful, demand a significant investment of time to master, and the implementation process further involves overall architectural design, algorithm selection, and future scalability. In addition, industrial renderers, while offering mature rendering systems, often contain low-level implementations with software-specific APIs and magic numbers; these intricacies can cause confusion and make it nearly impossible for individual developers to extend the system. To overcome these obstacles, this paper aims to make rendering more accessible to enthusiasts by constructing a ray-traced realistic rendering engine based on OpenGL. The engine exhibits strong extensibility and ease of use, and a clear and efficient CPU-GPU data transfer framework is designed.
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315805 (2024) https://doi.org/10.1117/12.3029555
Present stereo depth estimation networks often build 4D cost volumes and employ computationally intensive 3D convolutions for global information fusion. However, this approach is ill-suited for mobile devices with limited computational resources. This paper proposes a novel lightweight stereo depth estimation network designed for efficient inference on mobile devices. The proposed network employs a lightweight CNN for feature extraction and constructs multi-scale 3D cost volumes instead of 4D ones, circumventing 3D convolution operations in the subsequent regularization. Experimental results show that the network achieves real-time performance of 25 FPS even on CPUs, significantly faster than existing methods, with acceptable estimation accuracy. When the network is deployed on the RK3588S mobile platform using its NPU for inference, it achieves 10 FPS after model quantization. To the best of our knowledge, this is the first implementation of a stereo depth estimation network on RK3588S devices.
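To illustrate the distinction, a correlation-style 3D cost volume of shape (D, H, W) can be built from left/right feature maps, in contrast to a 4D concatenation volume (C, D, H, W) that requires 3D convolutions; this dot-product formulation is a standard construction, not code from the paper.

```python
# Sketch of a correlation-style 3D cost volume (D, H, W). Because the feature
# channel is collapsed by the dot product, the volume can be regularized with
# cheap 2D convolutions instead of 3D ones.
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """feat_l, feat_r: (C, H, W) left/right features; returns (D, H, W)."""
    C, H, W = feat_l.shape
    cost = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        # Compare left pixels with right pixels shifted d columns to the left.
        cost[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :W - d]).mean(axis=0)
    return cost
```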
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315806 (2024) https://doi.org/10.1117/12.3029380
As the number of vehicles in cities grows, traffic violations are becoming increasingly serious. Interpretation of vehicle violations currently relies on manual review, which is not only inefficient but also leads to data backlogs and inconsistent law enforcement standards. In response, this paper proposes an artificial-intelligence-based algorithm for automatically interpreting red-light-running violations. The paper first discusses the modeling process, including image data preprocessing, traffic light and vehicle recognition, red-light-running detection, and license plate recognition; an automatic interpretation algorithm for red-light violations is then designed. Finally, the algorithm is tested on real traffic photos. Experimental results show that the algorithm has a high recognition rate and can effectively and automatically identify red-light-running violations, thereby alleviating the inefficiency of manual judgment.
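A hypothetical sketch of the interpretation logic described above; detect_traffic_light, detect_vehicles, and read_plate stand in for the paper's recognition modules and are not real APIs.

```python
# Hypothetical pipeline sketch: the detector functions below are placeholders
# for the paper's recognition modules, and the stop-line test is one simple
# way to decide that a vehicle has crossed while the light is red.
def judge_red_light_violation(frames, stop_line_y):
    violations = []
    for frame in frames:
        light = detect_traffic_light(frame)       # hypothetical: light state
        if light.state != "red":
            continue
        for vehicle in detect_vehicles(frame):    # hypothetical: bounding boxes
            # A vehicle whose box has passed the stop line during a red phase
            # is flagged and then identified by its license plate.
            if vehicle.box.bottom > stop_line_y:
                violations.append(read_plate(frame, vehicle.box))
    return violations
```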
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315807 (2024) https://doi.org/10.1117/12.3029703
Anomaly detection is a research hotspot in the field of object detection, aiming to build models from normal samples only and use them to detect anomalies. The challenge of this task is the extreme imbalance of the dataset: models trained on such data generalize poorly. To address the scarcity of abnormal data, we propose an asymmetric autoencoder network based on knowledge distillation, combined with anomaly detection algorithms. Our method uses only normal samples for training, allowing the encoder to learn the distribution of normal samples in the deep feature space. The decoder reconstructs the deep features and outputs a generated image for each sample. By pairing a lightweight decoder with the encoder in an asymmetric structure, the failure mode of reconstruction-error methods is avoided. Knowledge distillation is used to train the network: the pretrained encoder serves as the teacher network and guides the reconstruction of the asymmetric decoder, which acts as the student network. A new multi-scale loss function is designed, composed of a pixel-level term and a global direction term. Experiments show that the per-category average AUC on the MVTec AD dataset is significantly higher for our method than for other anomaly detection methods; in particular, when the knowledge distillation strategy is used within the reconstruction method, our average AUC is about 2 points higher than MKD, the strongest baseline.
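A sketch of the two loss terms named above, assuming an MSE pixel-level term plus a cosine-based global direction term summed over scales; the weighting factor lam is an assumption.

```python
# Sketch of a multi-scale loss with a pixel-level term and a global direction
# term (cosine distance between flattened feature maps). The per-scale sum
# and the lambda weighting are assumptions about the formulation.
import numpy as np

def multi_scale_loss(teacher_feats, student_feats, lam=1.0):
    """teacher_feats, student_feats: lists of (C, H, W) maps, one per scale."""
    total = 0.0
    for t, s in zip(teacher_feats, student_feats):
        pixel = np.mean((t - s) ** 2)                  # pixel-level term
        tv, sv = t.ravel(), s.ravel()
        cos = np.dot(tv, sv) / (np.linalg.norm(tv) * np.linalg.norm(sv) + 1e-8)
        total += pixel + lam * (1.0 - cos)             # global direction term
    return total
```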
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315808 (2024) https://doi.org/10.1117/12.3030925
Gesture interaction is the most primitive and natural way for humans to interact and plays a crucial role in virtual reality and augmented reality technologies. It enables control of virtual environments, such as selecting, moving, and rotating virtual objects with gestures. While various 2D pose estimation methods based on convolutional neural networks (CNNs) can track and label hands from 2D videos, real-world gesture interactions occur in 3D space. Common 3D pose estimation methods rely on supervised learning and yield accurate results, but obtaining 3D data through camera calibration and annotation is costly. Moreover, limited mobile computing power hinders the deployment of advanced algorithms, posing challenges for industrial applications. To address the difficulty of acquiring 3D annotated data and the limitations of mobile hardware, this paper proposes a lightweight approach that combines hand biomechanics and nonlinear optimization, enabling 3D pose estimation with binocular cameras during training without relying on extensive 3D data labeling. We employ a lightweight CNN-based model to detect and track hand keypoints in the binocular cameras and then compute the reprojection error. The reprojection error serves as the optimization objective in 3D pose estimation: minimizing it yields more accurate 3D camera coordinates. Constraints on palm size and joint lengths prevent unrealistic hand poses. Finally, the Levenberg-Marquardt algorithm performs the nonlinear optimization to obtain the optimal 3D hand pose estimate. We conducted experiments on a test gesture dataset and compared our method with MediaPipe, demonstrating advantages in accuracy and real-time performance. Furthermore, we deployed the system on augmented reality glasses powered by the RK3588 SoC with NPU acceleration, achieving a frame rate of 50 FPS. The proposed 2.5D pose estimation model based on binocular cameras and nonlinear optimization leverages information from multiple viewpoints, resulting in more accurate 3D pose estimation suitable for virtual reality and augmented reality applications. It handles noise, mismatches, and hand occlusions, exhibiting superior robustness in complex scenarios.
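A sketch of the reprojection-error objective with bone-length constraints, refined by Levenberg-Marquardt via scipy.optimize.least_squares; the projection matrices, constraint weighting, and residual stacking are assumptions about the formulation.

```python
# Sketch of reprojection-error minimization with Levenberg-Marquardt, assuming
# pinhole 3x4 projection matrices P_l, P_r for the two cameras. Bone-length
# constraints enter as extra residuals; the weight w_bone is an assumption.
import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """X: (J, 3) points -> (J, 2) pixels under 3x4 projection matrix P."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    uvw = Xh @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def residuals(x, P_l, P_r, kp_l, kp_r, bones, bone_len, w_bone=10.0):
    X = x.reshape(-1, 3)                      # current 3D joint estimate
    r = np.concatenate([(project(P_l, X) - kp_l).ravel(),
                        (project(P_r, X) - kp_r).ravel()])
    seg = np.array([np.linalg.norm(X[i] - X[j]) for i, j in bones])
    return np.concatenate([r, w_bone * (seg - bone_len)])

# With X0 initialized by triangulation, LM refines the stacked residuals:
# sol = least_squares(residuals, X0.ravel(), method="lm",
#                     args=(P_l, P_r, kp_l, kp_r, bones, bone_len))
```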
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 1315809 (2024) https://doi.org/10.1117/12.3029630
Gesture tracking is crucial for human-computer interaction on AR devices. Although many deep learning-based methods offer notable accuracy, their extensive parameters limit efficiency, challenging real-time deployment on low-power platforms. We present a lightweight, real-time 3D gesture tracking solution that determines hand positions and keypoints from a single RGB image in AR/VR devices. Using a two-stage algorithm, the first stage identifies the hand's bounding box; this box then guides the second stage to detect 3D hand joint coordinates. These coordinates, once adjusted with the camera parameters, yield the camera-space coordinates of the hand keypoints. Our solution is optimized for low-power platforms such as the RK3588 board, enabling real-time inference with high detection quality (the speed of conventional models on the RK3588 platform is illustrated in Table 1).
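For reference, the standard pinhole-geometry step that lifts a detected pixel keypoint with depth to camera coordinates might look as follows; the intrinsic values in the usage line are purely illustrative.

```python
# Standard pinhole back-projection: pixel (u, v) plus depth z and intrinsics
# (fx, fy, cx, cy) give the keypoint's position in camera coordinates.
def pixel_to_camera(u, v, z, fx, fy, cx, cy):
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Illustrative values only:
# pixel_to_camera(700, 380, 0.5, fx=600.0, fy=600.0, cx=640.0, cy=360.0)
```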
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580A (2024) https://doi.org/10.1117/12.3029447
Humanoid characters in games and videos move naturally as humans do in the real world. Recently, research has been conducted on generating natural motion using deep learning. Currently, there is a need to generate and control diverse natural motions. Therefore, this study proposes a model for generating human motions using StyleGAN, a deep generative model used in image generation. The performance of the proposed model was experimentally assessed on motion capture datasets; the results confirmed that the model could generate diverse and natural motions through random generation and intermediate latent variable interpolation. Additionally, experiments with style mixing indicated that the low-level layer could control the class of movements and the middle-level layer could control the positions of the arms and legs, posture, and body orientation.
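A hypothetical sketch of the style-mixing experiment: intermediate latents from two motions are swapped over a layer range, mirroring the image-domain StyleGAN recipe; the mapping and synthesis interfaces here are assumed, not the paper's code.

```python
# Illustrative style mixing: two latent codes are mapped to intermediate
# latents w and combined per layer. The mapping/synthesis callables are
# hypothetical stand-ins for the motion generator's components.
def style_mix(mapping, synthesis, z_a, z_b, mix_from, n_layers):
    w_a = mapping(z_a)                        # intermediate latent of motion A
    w_b = mapping(z_b)                        # intermediate latent of motion B
    ws = [w_a if i < mix_from else w_b for i in range(n_layers)]
    # Per the abstract: low layers control the class of movement, while
    # middle layers adjust limb positions, posture, and body orientation.
    return synthesis(ws)
```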
Zhaobing Zhang, Yuehu Liu, Shaorong Wang, Yuanfeng Yan
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580B (2024) https://doi.org/10.1117/12.3029604
It is still difficult to accurately extract smooth, temporally consistent 3D human motion from video footage. While some existing techniques achieve favorable results by combining features of consecutive frames, many of them sacrifice accuracy to reduce jitter or lack a complete understanding of the temporal nature of human movement. To this end, we model the natural smoothness of body motion by learning long-range temporal relationships between the kinematic features of the human body in the video and the enhanced current-frame features. First, we use the velocity and acceleration of key points to capture temporal features as a temporal motion prior; we then design a module that uses a hierarchical attention mechanism to improve the representation of the current frame by selectively attending to important temporal information from both past and future frames. This strengthens the correlation between frames and improves the overall quality of the feature representation. Finally, the two feature streams are aggregated by a global motion-aware network and linearly fused to obtain the final, accurate 3D human motion.
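A sketch of the kinematic temporal prior, assuming it is computed as finite-difference velocity and acceleration over keypoint trajectories; the feature packing is an assumption.

```python
# Sketch of the temporal motion prior: per-joint velocity and acceleration
# from finite differences over keypoint trajectories. How the paper packs
# these into features is assumed here, not specified by the abstract.
import numpy as np

def kinematic_prior(keypoints):
    """keypoints: (T, J, 3) trajectories -> (T, J, 6) [velocity, acceleration]."""
    vel = np.gradient(keypoints, axis=0)      # first temporal derivative
    acc = np.gradient(vel, axis=0)            # second temporal derivative
    return np.concatenate([vel, acc], axis=-1)
```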
Qianyun Song, Hao Zhang, Yanan Liu, Shouzheng Sun, Dan Xu
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580C (2024) https://doi.org/10.1117/12.3029389
Existing human pose estimation methods in videos often rely on sampling strategies to select frames for the estimation task. Common approaches include uniform sparse sampling and keyframe selection. However, the former attends only to fixed positions in the video, omitting dynamic information, while the latter incurs high computational costs by processing every frame. To address these issues, we propose an efficient and effective pose estimation framework named the Joint Misalignment-aware Bilateral Detection Network (J-BDNet). Our framework incorporates a Bilateral Dynamic Attention module (BDA) that uses knowledge distillation for efficiency. BDA detects dynamic information in both the left and right halves of a video segment, guiding the sampling process. Additionally, a bilateral recursive sampling strategy built on BDA extracts more spatiotemporal dependencies from the pose data, reducing computational costs without increasing the pose estimator's usage frequency. Moreover, we enhance the robustness of the existing denoising network by randomly exchanging body-joint positions in the pose data. Experiments demonstrate the robustness of our framework under heavy occlusion, spatial blur, and illumination variations, and it achieves state-of-the-art performance on the Sub-JHMDB dataset.
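A sketch of the joint-exchange augmentation mentioned above; the swap probability and pair selection are assumptions.

```python
# Sketch of the joint-exchange augmentation: randomly swap a pair of body
# joints so the denoising network learns to correct misaligned joints.
# The probability p and uniform pair choice are assumptions.
import numpy as np

def swap_joints(pose, p=0.1, rng=np.random.default_rng()):
    """pose: (J, 2) joint coordinates; returns an augmented copy."""
    pose = pose.copy()
    if rng.random() < p:
        i, j = rng.choice(pose.shape[0], size=2, replace=False)
        pose[[i, j]] = pose[[j, i]]           # exchange the two joints
    return pose
```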
Medical Image Segmentation and Computational Models
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580D (2024) https://doi.org/10.1117/12.3029470
The boundaries of colonoscopy-acquired images are often blurred due to reflections and low contrast, and existing colon polyp segmentation methods fail to effectively represent global contextual information and long-range dependencies, resulting in sub-optimal polyp segmentation accuracy. To address this problem, we propose a novel approach that introduces an uncertainty-guided cross-entropy loss into a Transformer model to achieve precise segmentation of colon polyps. To handle blurred boundaries, we incorporate an uncertainty estimation module into the decoding process; this module assigns lower weights to pixels with higher boundary uncertainty so as to mitigate the influence of erroneous pixels, and a boundary attention module between encoding and decoding guides the network to capture polyp edges more effectively, improving precise boundary localization. To enhance the contextual modeling capability of the model, we employ a Pyramid Vision Transformer v2 (PVTv2) encoder to extract semantic information and capture long-range dependencies in the lesion regions. Furthermore, a feature refinement module captures local detail, and a low-level feature enhancement module highlights the polyp region of interest (ROI), facilitating better discrimination between normal tissue and polyps. Extensive experiments on five public datasets demonstrate the superior accuracy and generalization performance of the proposed model. With minor refinements, the model can be extended to other tumor segmentation tasks.
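One plausible form of the uncertainty-guided cross-entropy, down-weighting pixels with high boundary uncertainty; the exact weighting scheme is an assumption.

```python
# Sketch of an uncertainty-guided cross-entropy for binary segmentation:
# pixels with high estimated uncertainty contribute less to the loss.
# The linear (1 - uncertainty) weighting is an assumption.
import numpy as np

def uncertainty_guided_ce(prob, target, uncertainty, eps=1e-8):
    """prob, target, uncertainty: (H, W) arrays in [0, 1]."""
    ce = -(target * np.log(prob + eps) + (1 - target) * np.log(1 - prob + eps))
    w = 1.0 - uncertainty                     # lower weight where uncertain
    return float(np.sum(w * ce) / (np.sum(w) + eps))
```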
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580E (2024) https://doi.org/10.1117/12.3029607
Brain tumors are among the most dangerous diseases, and automated brain tumor segmentation is particularly important in their diagnosis and treatment. Traditional brain tumor segmentation methods mostly rely on UNet or its variants, and segmentation performance is highly dependent on the quality of feature extraction. Recently, the diffusion probabilistic model (DPM) has received considerable attention and achieved remarkable success in medical image segmentation. However, existing DPM-based brain tumor segmentation methods do not exploit the complementary information between multimodal MRI, and they all constrain the generation of the DPM using only the original images. In this work, we propose a DPM-based brain tumor segmentation method consisting of a DPM, an uncertainty generation module, and a collaborative module. The collaborative module takes multimodal MRI as input and dynamically provides conditional constraints for the DPM, allowing it to capture more detailed brain tumor features. Since previous works largely ignore the influence of the DPM's uncertainty on the results, we propose an uncertainty generation module that calculates the uncertainty at each step of the DPM and assigns corresponding uncertainty weights; the results of each step are fused according to these inferred weights to obtain the final segmentation. The proposed method obtained dice scores of 89.32% and 87.82% on the BraTS2020 and BraTS2021 datasets, respectively, verifying its effectiveness.
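A sketch of the step-wise fusion, assuming per-step segmentations are averaged with weights that decrease with uncertainty; the softmax-of-negative-uncertainty weighting is an assumption.

```python
# Sketch of uncertainty-weighted fusion across diffusion steps: predictions
# from several steps are averaged with weights inversely related to their
# uncertainty. The exponential weighting is an assumption, not the paper's.
import numpy as np

def fuse_steps(step_masks, step_uncertainty):
    """step_masks: (S, H, W) per-step predictions; step_uncertainty: (S,)."""
    w = np.exp(-np.asarray(step_uncertainty))
    w = w / w.sum()                           # normalized fusion weights
    return np.tensordot(w, step_masks, axes=1)
```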
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580F (2024) https://doi.org/10.1117/12.3029445
Cervical cancer is one of the leading causes of death among women. Pap smear testing, one of the most common ways to screen for cervical cancer, currently has a misidentification rate of around 40 percent, which poses serious risks, and existing approaches to classifying Pap smear images are still not accurate enough for practical use. In this paper, we create an ensemble network by combining three CNNs, namely DenseNet-169, VGG-19, and Xception, with a Swin Transformer to perform cervical cytology image classification on the standardized SIPaKMeD dataset and the Mendeley LBC dataset. The proposed framework obtains an accuracy of 95.50% on the SIPaKMeD dataset and 98.65% on the Mendeley LBC dataset, outperforming the majority of methods proposed for cervical cytology classification.
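A minimal sketch of a probability-level ensemble over the four backbones; whether the paper uses equal weights is not stated, so uniform averaging is an assumption.

```python
# Sketch of a probability-level ensemble: each backbone's softmax output is
# averaged with equal weight before taking the argmax. Uniform weights are
# an assumption about the combination rule.
import numpy as np

def ensemble_predict(models, image):
    """models: callables image -> (num_classes,) softmax probabilities."""
    probs = np.mean([m(image) for m in models], axis=0)
    return int(np.argmax(probs))              # consensus class
```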
Proceedings Volume Seventh International Conference on Computer Graphics and Virtuality (ICCGV 2024), 131580G (2024) https://doi.org/10.1117/12.3029557
High-resolution tissue pathology images play a crucial role in the diagnosis of certain diseases. In this paper, we propose a super-resolution reconstruction method based on a denoising diffusion probabilistic model that converts low-resolution images of colonic tissue units into high-resolution images. In the forward process, the conditional generative model introduces Gaussian noise into the input high-resolution image, transforming it into a Gaussian noise distribution; in the reverse inference process, the model takes the low-resolution image as a condition, combines it with Gaussian noise, and generates a high-resolution image. Experimental results demonstrate that, at 4x and 8x magnification, the high-resolution images reconstructed by the proposed diffusion super-resolution model surpass those obtained by other super-resolution methods, and the reconstructed histological images of colonic tissue units preserve complete information and edge details even at large magnification factors. This approach clarifies colonic tissue unit images, helping physicians observe physiological information in pathological images and improving pathology-assisted diagnosis of certain colorectal diseases.
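For context, the forward (noising) process described above has the standard closed form q(x_t | x_0); a sketch under that assumption, with the conditioning on the low-resolution image left to the reverse model:

```python
# Standard DDPM forward step in closed form: the clean high-resolution image
# is mixed with Gaussian noise according to the cumulative schedule. The
# reverse model would additionally take the low-resolution image as condition.
import numpy as np

def forward_noise(x0, alpha_bar_t, rng=np.random.default_rng()):
    """x0: clean HR image array; alpha_bar_t: cumulative product of (1 - beta)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```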