Wolfgang Osten,1 Dmitry P. Nikolaev,2 Jianhong (Jessica) Zhou3
1Institut für Technische Optik (Germany); 2Institute for Information Transmission Problems (Kharkevich Institute) (Russian Federation); 3Univ. of Electronic Science and Technology of China (China)
This PDF file contains the front matter associated with SPIE Proceedings Volume 12701, including the Title Page, Copyright information, Table of Contents and Conference Committee list.
In this research work, we propose a thermal tiny-YOLO multi-class object detection (TTYMOD) system as a smart forward-sensing system intended to remain effective in all weather and harsh environmental conditions, built on an end-to-end YOLO deep learning framework. It provides enhanced safety and improved awareness features for driver assistance. The system is trained on large-scale public thermal datasets as well as a newly gathered, open-source dataset comprising more than 35,000 distinct thermal frames. For optimal training and convergence of the YOLOv5-tiny network variant on thermal data, we employ different optimizers, including stochastic gradient descent (SGD), Adam, and its variant AdamW, which has an improved implementation of weight decay. The performance of the thermally tuned tiny architecture is further evaluated on public as well as locally gathered test data in diverse and challenging weather and environmental conditions. The efficacy of the thermally tuned network is quantified using several quantitative metrics, including mean average precision, frames-per-second rate, and average inference time. Experimental outcomes show that the network achieves a best mAP of 56.4% with an average inference time of 4 milliseconds per frame. The study further incorporates optimization of the tiny network variant using the TensorFlow Lite quantization tool, which is beneficial for deploying deep learning architectures on edge and mobile devices. For this study, we use a Raspberry Pi 4 computing board to evaluate the real-time feasibility of the optimized thermal object detection network for an automotive sensor suite. The source code, trained and optimized models, and complete validation/testing results are publicly available at https://github.com/MAli-Farooq/Thermal-YOLO-And-Model-Optimization-Using-TensorFlowLite.
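As a hedged illustration of the TensorFlow Lite optimization step described above, the sketch below applies post-training quantization to a hypothetical SavedModel export of the thermally tuned detector; the model path, input size, and calibration generator are assumptions, not the authors' artifacts.

```python
import tensorflow as tf

# Hypothetical SavedModel export of the thermally tuned YOLOv5-tiny network.
converter = tf.lite.TFLiteConverter.from_saved_model("thermal_yolo_tiny_savedmodel")

# Post-training quantization with TensorFlow Lite's default optimizations.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_thermal_frames():
    # Yield preprocessed thermal frames for calibration
    # (placeholder: random data stands in for real frames here).
    for _ in range(100):
        yield [tf.random.uniform((1, 416, 416, 3), dtype=tf.float32)]

converter.representative_dataset = representative_thermal_frames
tflite_model = converter.convert()

with open("thermal_yolo_tiny_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file can then be loaded with the TFLite interpreter on an edge board such as the Raspberry Pi 4 mentioned above.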
Taillight detection is of great significance and value for predicting the driving intention of the vehicle ahead in assisted and unmanned driving systems. To address the problems that the YOLOX taillight detection model struggles to detect small targets and that its channel information does not contain location information, a taillight detection algorithm based on an attention mechanism and an improved YOLOX is proposed. First, a micro-scale detection layer is added to further the multi-scale feature fusion and extract more taillight feature information. Second, a mobile-network attention mechanism is inserted into the proposed improved feature fusion network to extract taillight position information. Finally, the receptive field at the output of the feature extraction layer is enlarged to improve the detection accuracy of small targets such as taillights. A taillight dataset of 1,962 images collected in natural environments was used, with the training, validation, and test sets randomly divided in an 8:1:1 ratio to verify the improved network. Experimental results show that the improved model achieves a 91.71% mean average precision (mAP) on taillight images, which is 3.48% higher than the standard YOLOX algorithm.
Recent events in both armed conflict and the civil aviation space continue to highlight the threat of Unmanned Aerial Systems (UAS), often referred to as drones. Most drone countermeasure systems and all early warning systems require drone detection. A number of drone detection techniques, including radar, RF signal capture, and optical sensing, have been developed to provide this capability. These techniques all have different advantages and disadvantages, and a robust counter-UAS (C-UAS) or UAS early warning system should combine several of them. One of the available detection approaches is computer vision using deep learning and optical sensors. Due to the rapid advancement of this area, there are many options for practitioners seeking to apply cutting-edge deep learning techniques to optical UAS detection. In this study, we provide a comparative performance analysis of four state-of-the-art deep learning-based object detection algorithms, namely YOLOv5 small and large, SSD, and Faster R-CNN. We show that the YOLOv5-based models and the Faster R-CNN model are very close to each other in terms of accuracy, while they outperform SSD. The YOLOv5-based models are also significantly faster than both the SSD and Faster R-CNN algorithms. Our analysis suggests that, among the investigated algorithms, YOLOv5 small provides the best trade-off between accuracy and speed for a C-UAS self-protection or early warning system.
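Such a comparison could be scripted roughly as below; this is a minimal sketch that loads the architectures from torchvision and the Ultralytics hub and times inference, with pretrained COCO weights and a random frame standing in for the drone-trained models and test imagery.

```python
import time
import numpy as np
import torch
import torchvision

# Pretrained COCO weights stand in for detectors trained on drone imagery.
yolov5s = torch.hub.load("ultralytics/yolov5", "yolov5s")
frcnn = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
ssd = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT").eval()

frame = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)   # placeholder image
tensor = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0

def mean_latency(fn, arg, runs=20):
    # Average wall-clock time per forward pass over several runs.
    start = time.perf_counter()
    for _ in range(runs):
        with torch.no_grad():
            fn(arg)
    return (time.perf_counter() - start) / runs

print("Faster R-CNN:", mean_latency(frcnn, [tensor]), "s/frame")
print("SSD300:      ", mean_latency(ssd, [tensor]), "s/frame")
print("YOLOv5s:     ", mean_latency(yolov5s, frame), "s/frame")
```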
Falls are a growing social problem and have become a hot topic in healthcare. Thanks to recent advances in deep convolutional neural networks, the accuracy of video-based fall detection has greatly improved. However, such methods are affected by illumination, complex backgrounds, video angles, and other factors that reduce their accuracy and generalization ability. In this paper, a video-based human fall detection method is proposed. First, a 2D joint-point pose sequence is extracted from the video using a pose estimator; this sequence is then lifted to a 3D joint-point pose sequence and classified as a fall or non-fall action by our improved multi-scale unified spatial-temporal graph convolutional network (MS-G3D). The method proves its effectiveness and robustness in the field of action recognition, achieving 99.84% accuracy on the large benchmark action recognition dataset NTU RGB+D and 95.72% accuracy on the LE2I fall dataset.
Despite recent advances in deep learning, object detection and tracking still require considerable manual and computational effort. First, we need to collect and create a database of hundreds or thousands of images of the target objects. Next, we must annotate or curate the images to indicate the presence and position of the target objects within them. Finally, we must train a CNN (convolutional neural network) model to detect and locate the target objects in new images. This training is usually computationally intensive, consists of thousands of epochs, and can take tens of hours for each target object. Even after model training is completed, there is still a chance of failure if the real-time tracking and object detection phases lack sufficient accuracy, precision, and/or speed for many important applications. Here we present a system and approach that minimizes the computational expense of the training and real-time tracking steps outlined above, for applications in the development of mixed-reality science laboratory experiences, by using non-intrusive object-encoding 2D QR codes mounted directly onto the surfaces of the lab tools to be tracked. This system can start detecting and tracking a new lab tool immediately and eliminates the laborious process of acquiring and annotating a new training dataset for every tool to be tracked.
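A minimal sketch of the QR-code-based detection idea using OpenCV's built-in detector is shown below; the image path is hypothetical and the paper's full tracking pipeline is not reproduced.

```python
import cv2

detector = cv2.QRCodeDetector()
frame = cv2.imread("lab_bench.jpg")          # hypothetical camera frame

# detectAndDecode returns the decoded payload, the 4 corner points of the
# code, and a rectified view of the code (None if nothing is found).
payload, corners, _ = detector.detectAndDecode(frame)

if corners is not None and payload:
    # The payload identifies the lab tool; the corners give its image-plane
    # position for tracking, with no per-tool training required.
    print("tool id:", payload)
    print("corner coordinates:", corners.reshape(-1, 2))
```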
The degree of autonomy in vehicles depends directly on the performance of their sensor systems. The transition to even more autonomously driven cars therefore requires the development of robust sensor systems with different skills. Especially in adverse and changing weather conditions (rain, snow, fog, etc.), conventional sensor systems such as cameras perform unreliably. Moreover, data evaluation has to be performed in real time, i.e., within a fraction of a second, in order to safely guide the car through traffic and avoid a crash with any obstacle. Therefore, we propose to use a so-called time-gated single-pixel camera, which combines the principles of time gating and compressed sensing. In a single-pixel camera, the amount of recorded data can be significantly reduced compared to a conventional camera by exploiting the inherent sparsity of scenes. The lateral information is gained with the help of binary masks in front of a simple photodiode. We optimize the pattern of the masks by including them as trainable parameters within our data evaluation neural network. Additionally, our camera is able to cope with adverse weather conditions due to the underlying time-gating principle. The feasibility of our method is demonstrated using both simulated and measured data.
The lack of large-scale datasets has been impeding the advance of deep learning approaches to the problem of F-formation detection. Moreover, most research on this problem relies on input sensor signals of object location and orientation rather than image signals. To address this, we develop a new, large-scale dataset of simulated images for F-formation detection, called the F-formation Simulation Dataset (F2SD). F2SD contains nearly 60,000 images simulated from GTA-5, with bounding boxes and orientation information on images, making it useful for a wide variety of modelling approaches. It is also closer to practical scenarios, where three-dimensional location and orientation information are costly to record. It is challenging to construct such a large-scale simulated dataset while keeping it realistic. Furthermore, existing research relies on conventional methods that do not detect groups directly from the image. In this work, we propose (1) the large-scale simulation dataset F2SD together with a pipeline for F-formation simulation, (2) a first-ever end-to-end baseline model for the task, and (3) experiments on our simulation dataset.
The use of Compressed Sensing (CS) and Deep Learning (DL) techniques in Machine Vision (MV) is an area of significant research interest. It is especially promising for image-free MV, where one deliberately skips the image reconstruction step and performs the information extraction directly on a set of highly compressed raw data, as provided e.g. through CS schemes. These approaches tend to perform well on simplified datasets where there is a single object to detect with otherwise little background. To make them useful in practice, they need to be able to detect multiple objects from this raw data. In this work, we present an expansion of our own DL-based detection scheme that satisfies this condition. Its defining feature is that it works without requiring extra data acquisition steps compared to the single-object case. We discuss trainability and robustness aspects as well as the mathematical background that enables the concept. Finally, we show this implementation in action.
Neuromorphic vision, or event vision, is an advanced vision technology in which, in contrast to visible camera sensors that output pixels, the sensor generates neuromorphic events every time there is a brightness change exceeding a specific threshold in the field of view (FoV). This study focuses on leveraging neuromorphic event data for roadside object detection. It is a proof of concept towards building artificial intelligence (AI) based imaging pipelines that can be used in forward perception systems for advanced vehicular applications. The focus is on building efficient, state-of-the-art object detection networks with better inference results for fast-moving forward perception using an event camera. In this article, the event-simulated A2D2 dataset is manually annotated and trained on two different YOLOv5 networks (small and large variants). To further assess robustness, single-model and ensemble-model testing are carried out.
Nuclei segmentation is an essential intermediate step for automatic cancer detection from H&E-stained histopathology images. The recent rise of the Convolutional Neural Network (CNN) has enabled researchers to detect nuclei automatically from histopathology images with higher accuracy. However, the performance of automatic nuclei segmentation by CNNs is prone to overfitting due to the very small number of annotated segmented images available. Indeed, we treat nuclei segmentation as an unsupervised problem because, to the best of our knowledge, there is still no automatic tool that can produce accurately annotated (nuclei-segmented) images. In this research article, we present a Logarithmic-Base2 of Gaussian (Log-Base2-G) kernel that is able to isolate only the nuclei portions automatically from colorectal cancer H&E-stained histopathology images. First, the Log-Base2-G kernel is applied to the input images. Thereafter, we apply an adaptive Canny edge detector in order to segment only the nuclei edges. Experimental results reveal that our proposed method achieves higher accuracy and F1 score without the help of any annotated data, which is a significant improvement. We have used two different datasets (the Con-SeP dataset and the Glass-contest dataset, both containing colorectal cancer histopathology images) to check the effectiveness and validity of our proposed method. The results show that our proposed method outperforms other image processing and unsupervised methods both qualitatively and quantitatively.
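The sketch below approximates the edge-extraction stage with standard OpenCV calls; it substitutes a plain Gaussian blur for the paper's Log-Base2-G kernel and derives the Canny hysteresis thresholds from the median intensity, so it is an assumption-laden stand-in rather than the authors' method.

```python
import cv2
import numpy as np

# Hypothetical H&E-stained tile loaded as grayscale.
img = cv2.imread("he_stained_tile.png", cv2.IMREAD_GRAYSCALE)

# Stand-in for the paper's Log-Base2-G kernel: plain Gaussian smoothing.
smoothed = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)

# "Adaptive" Canny: derive the hysteresis thresholds from the median intensity.
median = np.median(smoothed)
lower = int(max(0, 0.66 * median))
upper = int(min(255, 1.33 * median))
nuclei_edges = cv2.Canny(smoothed, lower, upper)
```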
This study focuses on pixel-wise semantic segmentation of crop production regions using satellite remote sensing multispectral imagery. One of the principal aims of the study is to find out whether raw multi-channel inputs are more effective for training semantic segmentation models, or whether their formularized counterparts, the spectral indices, are more effective. For this purpose, the vegetation indices NDVI, ARVI, and SAVI and the water indices NDWI, NDMI, and WRI are employed as inputs. Additionally, multi-channel inputs using 8, 10, and 16 channels are utilized. Moreover, all spectral indices are taken as separate channels to form another multi-channel input. We conduct deep learning experiments using two semantic segmentation architectures, namely U-Net and DeepLabV3+. Our results show that, in general, feeding raw multi-channel inputs to semantic segmentation models performs much better than feeding the spectral indices. Hence, regarding crop production region segmentation, deep learning models are capable of encoding multispectral information. When the spectral indices are compared among themselves, ARVI, which reduces atmospheric scattering effects, achieves better accuracy for both architectures. The results also reveal that the spatial resolution of multispectral data has a significant effect on semantic segmentation performance, and therefore the RGB bands, which have the lowest ground sample distance (0.31 m), outperform the multispectral and shortwave infrared bands.
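For reference, the spectral indices named above follow standard definitions; the sketch below computes several of them from hypothetical per-band reflectance arrays (ARVI is shown with the common choice gamma = 1, and WRI is omitted).

```python
import numpy as np

def spectral_indices(blue, green, red, nir, swir, L=0.5, gamma=1.0):
    """Standard index definitions applied to per-band reflectance arrays."""
    eps = 1e-6                                     # avoid division by zero
    ndvi = (nir - red) / (nir + red + eps)
    savi = (nir - red) / (nir + red + L + eps) * (1.0 + L)
    rb   = red - gamma * (blue - red)              # ARVI red-blue correction
    arvi = (nir - rb) / (nir + rb + eps)
    ndwi = (green - nir) / (green + nir + eps)     # McFeeters NDWI
    ndmi = (nir - swir) / (nir + swir + eps)
    return ndvi, savi, arvi, ndwi, ndmi
```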
We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos constructed by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach amalgamates strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design can significantly outperform other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that tracking the pharyngeal bolus accurately is a particularly important application in clinical practice since it constitutes the primary method for diagnostics of swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture via exploiting temporal information and improving segmentation performance by a significant margin. We publish key source code, network weights, and ground truth annotations for simplified performance reproduction.
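For clarity, the Dice coefficient reported above is the standard overlap measure between a predicted and a ground-truth mask; a minimal NumPy implementation on binary masks is sketched below.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * |A intersect B| / (|A| + |B|) on binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```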
This paper presents semantic food segmentation to detect individual food items in an image. The presented approach has been developed in the context of the FoodRec project, which aims to study and develop an automatic framework to track and monitor the dietary habits of people during their smoking-cessation protocol. The goal of food segmentation is to train a model that can look at images of food and infer the semantic information needed to recognize the individual food items present. In this contribution, we propose a novel Convolutional-Deconvolutional Pyramid Network for food segmentation to understand the semantic information of an image at the pixel level. This network employs convolution and deconvolution layers to build a feature pyramid and achieves a high-level semantic feature map representation. As a consequence, the novel semantic segmentation network generates a dense and precise segmentation map of the input food image. Furthermore, the proposed method demonstrates significant improvements on a well-known public benchmark dataset.
LiDAR point cloud segmentation is a key input to downstream tasks such as object recognition and classification, obstacle avoidance, and even 3D reconstruction. A key challenge in the segmentation of large city-scale datasets is the uneven distribution of points among classes and the resulting significant class imbalance. As highly detailed point cloud datasets of urban environments become available, neural networks have shown significant performance in recognizing large, well-defined objects. However, data is fed into these networks in chunks, and the scheme by which data is presented for training and evaluation can have a significant impact on performance. In this work, we establish a method analogous to gradients in image processing to segment the ground in point clouds, achieving an accuracy of 91.4% on the SensatUrban dataset. By isolating the ground, we reduce the number of classes that need to be segmented from structures in urban LiDAR and improve data partitioning schemes when combined with random/grid down-sampling techniques for neural network inputs.
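One way such a gradient-analogous ground filter might look is sketched below: the point cloud is rasterized into a minimum-height grid, an image-style gradient gives the local slope, and flat, low-lying points are kept as ground. The cell size and thresholds are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def ground_mask(points, cell=0.5, slope_thresh=0.15, height_tol=0.2):
    """points: (N, 3) array of x, y, z coordinates; returns a boolean ground mask."""
    xy = points[:, :2]
    ij = np.floor((xy - xy.min(axis=0)) / cell).astype(int)

    # Rasterize the minimum height per cell (a coarse ground elevation map).
    # Empty cells stay at +inf, which conservatively marks their neighbours as non-flat.
    zmin = np.full(ij.max(axis=0) + 1, np.inf)
    np.minimum.at(zmin, (ij[:, 0], ij[:, 1]), points[:, 2])

    # Image-style gradient of the elevation map -> local slope magnitude.
    gx, gy = np.gradient(zmin, cell)
    slope = np.hypot(gx, gy)

    # A point is "ground" if its cell is flat and it sits close to the cell minimum.
    flat = slope[ij[:, 0], ij[:, 1]] < slope_thresh
    low = points[:, 2] - zmin[ij[:, 0], ij[:, 1]] < height_tol
    return flat & low
```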
Semantic segmentation consists of classifying each pixel according to a set of classes. This process is particularly slow for high-resolution images, which are present in many applications ranging from biomedicine to the automotive industry. In this work, we propose a two-stage algorithm targeted at segmenting high-resolution images. During stage 1, a lower-resolution interpolation of the image is fed to a first neural network, whose low-resolution output is resized to the original resolution. Then, in stage 2, the probabilities resulting from stage 1 are divided into contiguous patches, and the less confident ones are collected and refined by a second neural network. The main novelty of this algorithm is the aggregation of the low-resolution result from stage 1 with the high-resolution patches from stage 2. We adopt the U-Net architecture for segmentation and evaluate the method on six databases. Our method shows results similar to the baseline in terms of the Dice coefficient, with fewer arithmetic operations.
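A hedged sketch of the two-stage flow is given below, with hypothetical coarse_net and refine_net standing in for the two U-Nets; the patch size and confidence threshold are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def two_stage_segmentation(image, coarse_net, refine_net,
                           low_res=(512, 512), patch=256, conf_thresh=0.8):
    """image: (1, C, H, W) tensor; coarse_net / refine_net are hypothetical U-Nets."""
    _, _, H, W = image.shape

    # Stage 1: segment a low-resolution interpolation, then upsample the probabilities.
    small = F.interpolate(image, size=low_res, mode="bilinear", align_corners=False)
    probs = torch.softmax(coarse_net(small), dim=1)
    probs = F.interpolate(probs, size=(H, W), mode="bilinear", align_corners=False)

    # Stage 2: refine only the patches whose mean top-class confidence is low.
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            window = probs[:, :, y:y + patch, x:x + patch]
            if window.max(dim=1).values.mean() < conf_thresh:
                tile = image[:, :, y:y + patch, x:x + patch]
                probs[:, :, y:y + patch, x:x + patch] = torch.softmax(refine_net(tile), dim=1)

    return probs.argmax(dim=1)
```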
Action recognition is one of the challenging video understanding tasks in computer vision. Although there has been extensive research on classifying coarse-grained actions, existing methods are still limited in differentiating actions with low inter-class and high intra-class variation. This is particularly true of table tennis, which involves shots with high inter-class similarity, subtle variations, occlusion, and viewpoint variations. While a few datasets are available for event spotting and shot recognition, these benchmarks are mostly recorded in a constrained environment with a clear view of the shots executed by the players. In this paper, we introduce the Table Tennis Shots 1.0 dataset, consisting of 9,000 videos of 6 fine-grained actions collected in an unconstrained manner to analyze the performance of both players. To effectively recognize these different types of table tennis shots, we propose an adaptive spatial and temporal aggregation method that can handle the spatial and temporal interactions concerning the subtle variations among shots and the low inter-class variation. Our method consists of three components: (i) a feature extraction module, (ii) a spatial aggregation network, and (iii) a temporal aggregation network. The feature extraction module is a 3D convolutional neural network (3D-CNN) that captures the spatial and temporal characteristics of table tennis shots. In order to capture the interactions among the elements of the extracted 3D-CNN feature maps efficiently, we employ the spatial aggregation network to obtain a compact spatial representation. We then propose to replace the final global average pooling (GAP) layer with the temporal aggregation network to overcome the loss of motion information caused by averaging temporal features. This temporal aggregation network utilizes the attention mechanism of bidirectional encoder representations from Transformers (BERT) to effectively model the significant temporal interactions among the shots. We demonstrate that our proposed approach improves the performance of existing 3D-CNN methods by ~10% on the Table Tennis Shots 1.0 dataset. We also show the performance of our approach on other action recognition datasets, namely UCF-101 and HMDB-51.
Understanding an image starts with sensing pertinent information and subsequently recognizing domain objects based on a prior conceptualization. Suitable modeling of the image content is therefore essential to make use of the dependency between patterns in a particular domain. Through a computer-interpretable model, i.e. a knowledge-based model, we can better leverage domain knowledge in image interpretation. In this paper, we focus on systemic knowledge modeling for surface defect classification. Accordingly, we exploit the spatial information of the image to build a surface defect domain ontology. A set of statistical texture features is extracted. A systemic approach to conceptualisation is proposed, based on decision tree classification, aiming to fill the gap between low- and medium-level knowledge on the one hand and high-level knowledge, i.e. defect categories, on the other. The proposed ontology is modeled with OWL and SWRL for reasoning and rule inference. The information extracted from the grayscale image, and its significance for deducing surface flaws, is formalized to establish the surface defect ontology. The proposed approach is validated on the industrial dataset NEU-DET. Compared to the state of the art, our method yields a challenging performance of 85.87% mean average precision (mAP) on the same dataset.
This paper introduces deep learning (DL) for leather species identification, exploiting transfer learning on existing Convolutional Neural Networks (ConvNets). Transfer learning fine-tunes the ConvNet parameters to learn the novel leather image data. This research investigates the performance of four ConvNets, namely AlexNet, VGG16, GoogLeNet, and ResNet18, in predicting the leather species. The comparative study affirms the efficacy of ResNet18 in learning the complex pore structure of leather images. It efficiently classifies the leather images into the four respective species with the highest accuracy (99.69%) and outperforms the existing ML-based prediction with a 7% improvement. ConvNets are therefore the best solution to deal with inter-species similarity and intra-species variability, the practical challenges of leather images. This yields a fully automated leather species identification technique that paves the way for biodiversity preservation and consumer protection.
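Transfer learning on ResNet18 for a four-class problem can be set up as in the following sketch; the weights enum and loss choice are standard torchvision/PyTorch usage, not details taken from the paper.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet18 and replace its classifier head
# with a 4-way output, one unit per leather species.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 4)

# Fine-tuning then proceeds on the leather image dataset with a standard
# cross-entropy loss, updating all (or only the later) layers.
criterion = nn.CrossEntropyLoss()
```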
Researchers have been inspired to apply transformer models to machine vision problems after their tremendous success with natural language tasks. Using the straightforward architecture and swift performance of transformers, a variety of computer vision problems can be solved with more ease and effectiveness. However, a comparative evaluation of their prediction uncertainty has not yet been done. Real-world applications require a measure of uncertainty to produce accurate predictions, which allows researchers to handle uncertain inputs and special cases and to prevent overfitting. Our study approaches the unexplored issue of uncertainty estimation for three popular and effective transformer models employed in computer vision: Vision Transformers (ViT), Swin Transformers (SWT), and Compact Convolutional Transformers (CCT). We conduct a comparative experiment to determine which architecture is the most reliable in image classification. We use dropout at the inference phase to measure the uncertainty of these transformer models. This approach, commonly known as Monte Carlo Dropout (MCD), works well as a low-complexity estimate of uncertainty. The MCD-based CCT model is the least uncertain architecture in this classification task. Our proposed MCD-infused CCT model also yields the best results with 78.4% accuracy, while the SWT model with embedded MCD exhibits the maximum performance gain, with accuracy increasing by almost 3% to a final result of 71.4%.
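Monte Carlo Dropout itself is straightforward to reproduce: keep the dropout layers stochastic at inference and aggregate several forward passes, as sketched below (the number of passes and the entropy-based uncertainty measure are common choices, not necessarily those of the study).

```python
import torch

def mc_dropout_predict(model, x, passes=20):
    """Monte Carlo Dropout: keep dropout active at inference and average."""
    model.eval()
    # Re-enable only the dropout layers so any normalization statistics stay frozen.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])

    mean = probs.mean(dim=0)                                   # predictive mean
    entropy = -(mean * torch.log(mean + 1e-12)).sum(dim=-1)    # predictive uncertainty
    return mean, entropy
```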
Image caption generation combines the visual domain with natural language processing, and the transformer framework has become the mainstream approach. This paper combines reinforcement learning and transformer methods for reward backpropagation and normalization in the testing phase. Its characteristic is that, as the reinforcement learning steps increase, the agent model gains more knowledge of the full information, which reduces the computing cost of the system. The experimental results show that the reinforcement transformer structure achieves a certain improvement in speed.
Compared with action recognition in simple scenes, the differences between classes of basketball actions are slight and the backgrounds in the videos are very similar. Therefore, it is not easy to recognize basketball actions directly from short-term temporal information or the scene information in the video. A Global Context-Aware Network (GCA-Net) for basketball action recognition is proposed in this paper to address this problem. It contains a Multi-Time Scale Aggregation (MTSA) module and a Spatial-Channel Interaction (SCI) module to process multiple types of information on the feature layers. The MTSA module uses a temporal pyramid to obtain contextual links in the temporal dimension through one-dimensional convolutions with different dilation rates. The SCI module enhances the feature representation to obtain richer category attributes and spatial information by interacting with information across dimensions. We conducted experiments on the basketball action recognition dataset SpaceJam, and the results show that GCA-Net can effectively classify basketball actions. The average recognition accuracy for the ten types of basketball actions in the dataset is 91.54%, an improvement over current mainstream methods.
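The temporal pyramid of dilated one-dimensional convolutions described for the MTSA module could be sketched as below; channel counts, dilation rates, and the fusion layer are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    """Parallel 1D convolutions with different dilation rates over the time axis."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = dilation keeps the temporal length unchanged for kernel size 3.
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):          # x: (batch, channels, time)
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
```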
With the growing number of solutions based on deep learning methods, there is a need to protect pretrained models against unauthorized distribution. For deep model watermarking, one of the most important criteria is to maintain the accuracy of predictions after embedding the protective information. In this paper, we propose a black-box watermarking method based on fine-tuning image classification models on a watermarking dataset, which is synthesized by superimposing pseudo-holograms on images of the original dataset. The proposed method preserves the initial classification quality; in addition, a series of experiments on five different models showed the method's invariance to the architecture of the deep neural network. A simulation of the most common attacks on watermarked models shows that adversarial attempts to completely remove the watermark are improbable without significant loss of model accuracy. Additionally, the experimental results cover the selection of parameters, such as the number of triggers and original images in the watermarking dataset, that increase the method's efficiency.
Nowadays, the COVID-19 pandemic imposes the use of contactless biometric systems to efficiently prevent the spread of contagious diseases. Hand veins are contactless and independent of the body's appearance. However, research on age and gender estimation from hand veins is very limited and has focused only on age-group discrimination rather than the exact age. Estimating age and gender from vein features is a challenging task, since hand vein images are of poor quality and subject to variations in illumination. In this paper, a finger vein gender and age recognition system based on Pyramidal Histograms of Oriented Gradients (PHOG) is presented. PHOG can better describe both the local shape and the spatial distribution of the veins, as the image is divided into sub-regions at different resolutions to which the HOG descriptor is applied. Experimental validation on the finger vein databases MMCBNU 6000 and UTFVP demonstrates the effectiveness of the extracted features for gender classification and age estimation covering ages from 16 to 72 years with an uncertainty of one year. The middle finger of the left hand provides the best results for both age and gender classification (F-measure 100%) on the MMCBNU 6000 database, whereas for the UTFVP database the F-measure is about 98.62% for age estimation and 99.47% for gender classification. A comparative study with recent approaches shows an improvement of the F-measure by 5.76% for age estimation and 1.38% for gender classification.
Virtual unrolling or unfolding, digital unwrapping, flattening, or unfurling: all these terms describe the process of surface straightening of a tomographically reconstructed digital object. For many objects of historical heritage, tomography is the only way to obtain a hidden image of the original object without its destruction. Digital flattening is no longer considered a unique methodology; it is being applied by many research groups, but AI-based methods are used insignificantly in such projects, despite the amazing success of AI in computer vision, in particular in optical text recognition. This can be explained by the fact that the success of AI depends on large, broad, and high-quality datasets, while very few published CT-based datasets are relevant to the task of digital flattening. Accumulating a sufficient amount of data for training models is a key point for the next technological breakthrough. In this paper, we present the open and cumulative dataset CT-OCR-2022. The dataset includes 6 data packages for different model objects that help to enrich tomographic solutions and to train machine learning models. Each package contains an optically scanned image of the model object, 400 measured X-ray projections, 2687 CT-reconstructed cross-sections of the 3D reconstructed image, and segmentation markups. We believe that the CT-OCR-2022 dataset will serve as a benchmark for digital flattening and recognition systems for reconstructed objects, and that it will prove invaluable for the advancement of CT reconstruction, symbol analysis, and recognition. The data are openly available in Zenodo at doi:10.5281/zenodo.7123495 and linked repositories.
Due to the characteristics of small die size, a large number of dies, and a large initial swing angle of glassivation passivation parts (GPP) wafers, accurately correcting the wafer angle is a troublesome problem. To solve this difficulty, a coarse-to-fine automatic angle correction method for GPP wafers is proposed in this paper. First, we design and implement the GPP wafer automatic angle correction system. Then, to handle the large initial wafer swing angle, a coarse correction method based on Hough transform line detection and K-means clustering is presented. Finally, for accurate correction of the coarse-corrected wafer angle, the edge coordinates of multiple dies in the same row of the image are extracted and a fine correction method based on least-squares line fitting is applied. The experimental results show that, compared with other methods, the proposed method keeps the mean absolute error of angle correction within 0.02 degrees, fulfilling the requirement for stable and reliable automatic wafer angle correction in practice.
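A rough sketch of the coarse-to-fine idea with standard OpenCV, scikit-learn, and NumPy calls is shown below; the thresholds, number of clusters, and edge-coordinate extraction are assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def coarse_angle(gray):
    """Coarse correction: cluster Hough line angles and take the dominant cluster mean."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=150)
    thetas = lines[:, 0, 1].reshape(-1, 1)            # line angles in radians
    km = KMeans(n_clusters=2, n_init=10).fit(thetas)
    dominant = np.argmax(np.bincount(km.labels_))
    return np.degrees(thetas[km.labels_ == dominant].mean())

def fine_angle(edge_xs, edge_ys):
    """Fine correction: least-squares line fit through die-edge coordinates in one row."""
    slope, _ = np.polyfit(edge_xs, edge_ys, deg=1)
    return np.degrees(np.arctan(slope))
```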
Contrary to the World Health Organization's (WHO) and the medical community's projections, Covid-19, which started in Wuhan, China, in December 2019, still does not show any signs of progressing to the endemic stage or slowing down any time soon. It continues to wreak havoc on the lives and livelihoods of thousands of people every day. There is general agreement that the best way to contain this dangerous virus is through testing and isolation. Therefore, in these epidemic times, developing an automated Covid-19 detection method is of utmost importance. This study uses three different machine learning classifiers, namely Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), along with five transfer learning models, namely DenseNet121, DenseNet169, ResNet50, ResNet152V2, and Xception, as feature extraction methods for identifying Covid-19. Five different datasets are used to assess how well the models generalize. The findings are encouraging, the best being the combination of DenseNet121 and DenseNet169 together with SVM and LR.
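The feature-extraction-plus-classifier pattern described above can be sketched as follows, using DenseNet121 without its top layer as a fixed extractor and an SVM on the pooled features; the input shape and the commented training lines are hypothetical.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# DenseNet121 without its classification head acts as a fixed feature extractor;
# global average pooling yields one 1024-dimensional vector per chest image.
extractor = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", pooling="avg")

def extract_features(images):          # images: (N, 224, 224, 3) in [0, 255]
    x = tf.keras.applications.densenet.preprocess_input(
        np.array(images, dtype="float32"))
    return extractor.predict(x)

# Hypothetical arrays of training images and Covid / non-Covid labels:
# X_train = extract_features(train_images)
# clf = SVC(kernel="rbf").fit(X_train, train_labels)
```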
In this paper, we propose two separate and lightweight convolutional neural networks, SobelNet and DesNet, which work in parallel as a keypoint detector and descriptor, respectively. A Sobel filter provides the edge structure map of the grayscale image as the input of SobelNet. The locations of keypoints are obtained by applying non-maximum suppression to the output score map of SobelNet. A Gaussian loss is designed to train SobelNet to detect corner points in the edge structure map as keypoints. In the meantime, a dense descriptor map is produced by DesNet, which is trained with a Circle loss. Besides, the output score map of SobelNet is utilized while training DesNet. The proposed method is evaluated on two widely used datasets, the FM benchmark and the ETH benchmark. Compared with other state-of-the-art methods, SobelNet and DesNet reduce the computation by more than half while achieving comparable or even better performance. The inference times for an image of size 640×480 are 7.59 ms and 1.09 ms for SobelNet and DesNet, respectively, on an RTX 2070 SUPER, which meets the real-time requirement.
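The Sobel edge structure map that feeds SobelNet can be produced with standard OpenCV calls, as in this minimal sketch (the input path and normalization are assumptions).

```python
import cv2

# Hypothetical input frame loaded as grayscale.
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Horizontal and vertical Sobel responses, combined into an edge-magnitude map.
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
edge_map = cv2.magnitude(gx, gy)

# Normalized to [0, 1]; this edge structure map is the detector-network input.
edge_map = edge_map / (edge_map.max() + 1e-6)
```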
In this paper, we present a robust system for identifying and predicting physical conditions such as sleeplessness, tiredness, and potential unwellness in construction workers by observing the visible properties and movement patterns of their eyes. Site managers are responsible for monitoring the physical condition of workers on a construction site, which causes a lot of trouble and is time-consuming. Our system aims to automate this entire process. We rely on the fact that the above-mentioned physical conditions directly affect the visible properties of human eyes, such as color and eye movement. Our approach collects individual eye data over time to determine a normal baseline for the individual and tries to find deviations from this baseline to identify any abnormalities that might be a direct result of sleeplessness or tiredness. For this purpose, we propose an original algorithm. We have successfully built and deployed a system that identifies a worker's physical condition by reading the properties of their eyes.
Skin lesions are a major concern all over the world, particularly melanoma, which is a kind of skin cancer. In recent years, several methods using Convolutional Neural Networks and Vision Transformer models have been proposed for the detection and classification of skin images and have achieved competitive results. In this paper, we introduce and demonstrate the efficiency of the Tiny Convolution Contextual Neural Network (TCC Neural Network), a tiny model with a more lightweight architecture and fewer parameters than popular models, for the classification of nine lesion types from skin images. Our proposal achieves 0.75 accuracy and a 0.55 F1 score with 5.6 million parameters on the skin lesion classification task.
Images captured by mobile camera systems are subject to distortions that can be irreversible. Sources of these distortions vary and can be attributed to sensor imperfections, lens defects, or shutter inefficiency. One form of image distortion is associated with high Parasitic-Light-Sensitivity (PLS) in CMOS Image Sensors when combined with Global Shutters (GS-CIS) in a moving camera system. The resulting distortion appears as widespread semi-transparent purple artifacts, or a complex purple fringe, covering a large area in the scene around high-intensity regions. Most of the earlier approaches addressing the purple fringing problems have been directed towards the simplest forms of this distortion and rely on heuristic image processing algorithms. Recently, machine learning methods have shown remarkable success in many image restoration and object detection problems. Nevertheless, they have not been applied for the complex purple fringing detection or correction. In this paper, we present our exploration and deployment of deep learning algorithms in a pipeline for the detection and correction of the purple fringing induced by high-PLS GS-CIS sensors. Experiments show that the proposed methods outperform state-of-the-art approaches for both problems of detection and color restoration. We achieve a final MS-SSIM of 0.966 on synthetic data, and a distortion classification accuracy of 96.97%. We further discuss the limitations and possible improvements over the proposed methods.
Video anomaly detection refers to the concept of discovering activities in a video feed that deviate from the usual visible pattern. It is a well-studied field in computer vision and deep learning, in which automated learning-based systems are capable of detecting certain kinds of anomalies with an accuracy greater than 90%. Deep-learning-based artificial neural network models, however, suffer from very low interpretability. To address this issue and design a possible solution, this work proposes to shape the given problem by means of graphical models. Given the high flexibility of composing easily interpretable graphs, a great variety of techniques exist to build a model representing the spatial as well as temporal relationships occurring in a given video sequence. The experiments conducted on common anomaly detection benchmark datasets show that significant performance gains can be achieved through simple re-modelling of individual graph components. In contrast to other video anomaly detection approaches, the one presented in this work focuses primarily on exploring the possibility of shifting the way we currently look at and process videos when trying to detect anomalous events.
Deep Learning (DL) algorithms allow fast results with high accuracy in medical imaging analysis solutions. However, to achieve a desirable performance, they require large amounts of high-quality data. Active Learning (AL) is a subfield of DL that aims for more efficient models that ideally require less data, by selecting the most relevant information for training. CheXpert is a Chest X-Ray (CXR) dataset containing labels for different pathologic findings, alongside a "Support Devices" (SD) label. The latter contains several misannotations, which may impact the performance of a pathology detection model. The aim of this work is the detection of SDs in CheXpert CXR images and the comparison of the resulting predictions with the original CheXpert SD annotations, using AL approaches. A subset of 10,220 images was selected, manually annotated for SDs, and used in the experiments. In the first experiment, an initial model was trained on the seed dataset (6,200 images from this subset). The second and third approaches consisted of AL random sampling and least-confidence sampling; in both of these, the seed dataset was used initially and more images were iteratively added. Finally, in the fourth experiment, a model was trained on the full annotated set. The AL least-confidence experiment outperformed the remaining approaches, presenting an AUC of 71.10% and showing that training a model with representative information is preferable to training with all labeled data. This model was used to obtain predictions, which can be useful to limit the use of SD-mislabelled images in future models.
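Least-confidence sampling, the acquisition strategy that performed best here, reduces to ranking the unlabeled pool by the maximum predicted probability; a minimal sketch is shown below.

```python
import numpy as np

def least_confidence_sample(probabilities, k):
    """Pick the k unlabeled images whose top-class probability is lowest.

    probabilities: (N, num_classes) array of model outputs on the unlabeled pool.
    Returns the indices of the k least confident samples to annotate next.
    """
    confidence = probabilities.max(axis=1)
    return np.argsort(confidence)[:k]
```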
Driver stress is a major cause of car accidents and death worldwide. Furthermore, persistent stress is a health problem, contributing to hypertension and other diseases of the cardiovascular system. Stress has a measurable impact on heart and breathing rates and stress levels can be inferred from such measurements. Galvanic skin response is a common test to measure the perspiration caused by both physiological and psychological stress, as well as extreme emotions. In this paper, galvanic skin response is used to estimate the ground truth stress levels. A feature selection technique based on the minimal redundancy-maximal relevance method is then applied to multiple heart rate variability and breathing rate metrics to identify a novel and optimal combination for use in detecting stress. The support vector machine algorithm with a radial basis function kernel was used along with these features to reliably predict stress. The proposed method has achieved a high level of accuracy on the target dataset.
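A minimal scikit-learn sketch of the classification stage is given below; it assumes the mRMR-selected heart-rate-variability and breathing-rate features are already assembled into a design matrix, and the hyperparameters are defaults rather than the paper's tuned values.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical design matrix X of the selected HRV / breathing-rate features
# and ground-truth stress labels y derived from galvanic skin response.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
# model.fit(X_train, y_train)
# accuracy = model.score(X_test, y_test)
```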
Driver monitoring systems (DMS) are a key component of vehicular safety and essential for the transition from semi-autonomous to fully autonomous driving. A key task for a DMS is to ascertain the cognitive state of a driver and to determine their level of tiredness. Neuromorphic vision systems, based on event camera technology, provide advanced sensing of facial characteristics, in particular the behavior of a driver's eyes. This research explores the potential to extend neuromorphic sensing techniques to analyze the entire facial region, detecting yawning behaviors that give a complementary indicator of tiredness. A neuromorphic dataset is constructed from 952 video clips (481 yawns, 471 non-yawns) of 37 subjects captured with an RGB colour camera. A total of 95,200 neuromorphic image frames are generated from this video data using a video-to-event converter. From these data, 21 subjects were selected to provide a training dataset, 8 subjects were used for validation data, and the remaining 8 subjects were reserved for an 'unseen' test dataset. An additional 12,300 frames were generated from event simulations of a public dataset to test against other methods. A convolutional neural network (CNN) with self-attention and a recurrent head was trained and tested on these data. Respective precision and recall scores of 95.9% and 94.7% were achieved on our test set, and 89.9% and 91% on the simulated public test set, demonstrating the feasibility of adding yawn detection as a sensing component of a neuromorphic DMS.
Vehicle in-cabin occupant monitoring is becoming a crucial feature in the automobile industry and a challenging research topic for enhancing the safety, security, and comfort of conventional and future intelligent vehicles. Precise information about the number, position, and characteristics of occupants, as well as objects located inside the vehicle, must be available. Current industrial systems for seat occupancy detection are based on multiple weight sensors, capacitive sensors, electric field sensors, or ultrasonic sensors, and they cannot always make the right distinction in borderline cases: a simple pressure sensor cannot tell whether the weight on the seat comes from a person or an inanimate object. Recently, Artificial Intelligence (AI) based systems have attracted attention in various fields, including the automobile industry, especially as deep learning has shown far higher classification accuracies than hand-crafted features on many computer vision tasks. For these reasons, we propose a new automatic AI occupant monitoring system based on two cameras installed inside the vehicle. Our goal is an automatic detection and recognition system with high accuracy, low computational cost, and a small model size. Our system fuses a modified deep convolutional Yolo detection model with deep reinforcement learning to detect and classify passengers and objects inside the vehicle, and predicts the gender, age, and emotion of occupants using our proposed multi-task convolutional neural network. In our end-to-end system, this approach is more efficient in time and memory, since all tasks are solved in the same process and a single CNN is stored instead of one CNN per task. Principal applications of our system are intelligent airbag management, seat belt reminders, life-presence detection, and shared-cabin preferences. We perform a comparative evaluation on the public SVIRO, TiCaM, Aff-Wild and Adience datasets to demonstrate the superior performance of the proposed system.
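A minimal sketch of the "single shared CNN, multiple task heads" idea mentioned above; the MobileNetV2 backbone and the class counts for age groups and emotions are assumptions, not the authors' model:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MultiTaskHead(nn.Module):
    """Shared backbone with per-task heads for gender, age group and emotion."""

    def __init__(self, n_age_groups=8, n_emotions=7):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features   # shared feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gender = nn.Linear(1280, 2)
        self.age = nn.Linear(1280, n_age_groups)
        self.emotion = nn.Linear(1280, n_emotions)

    def forward(self, x):
        f = self.pool(self.backbone(x)).flatten(1)
        # One forward pass serves all three tasks, so only one CNN is stored.
        return self.gender(f), self.age(f), self.emotion(f)
```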
This paper proposes a growing-based floor-plan generation method that creates the global layout of buildings from noisy point clouds obtained by a stereo camera. We introduce a PCA-based line-growing concept with a subsequent filtering step, which is able to robustly handle the high noise levels in input point clouds. Experimental results show that this method outperforms the state-of-the-art techniques in floor-plan generation. The average F1 score for building layouts has increased from 0.38 to 0.66 on our test dataset, compared to the previous best floor-plan generation method. Furthermore, the resulting floor plans are multiple thousands of times smaller in memory size than the input point clouds, while still preserving the main building structures.
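The core ingredient of PCA-based line growing, estimating the dominant direction and "line-likeness" of a point cluster, can be sketched as follows (a generic illustration, not the full growing and filtering algorithm):

```python
import numpy as np

def principal_line_direction(points):
    """Estimate the dominant line direction of a 2D point cluster via PCA.

    points: array of shape (n_points, 2).
    Returns the unit direction vector and a linearity score in (0, 1].
    """
    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, np.argmax(eigvals)]
    # Ratio of the largest eigenvalue to the total spread: close to 1 means line-like.
    linearity = eigvals.max() / (eigvals.sum() + 1e-12)
    return direction, linearity
```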
Nowadays, pathologists have to analyse blood cells manually in order to diagnose diseases. To perform this manual task, blood samples are collected from the patient and placed on a microscope slide, which is then studied to detect the presence of abnormalities. To automate this process and assist pathologists, in this paper we propose the adoption of deep learning to automatically count and localise red blood cells, white blood cells and platelets by analysing blood microscopic images. We resort to the YOLO object detection model, which looks at the whole image so that its predictions are informed by global context. To show the effectiveness of the proposed method, we evaluate our model on a dataset composed of 874 microscopic blood images, obtaining interesting results. Furthermore, we show several examples of how the proposed method can help pathologists in their real-world work.
Motion and dynamic environments, especially under challenging lighting conditions, are still an open issue in the field of computer vision. In this paper, we propose an online, end-to-end pipeline for real-time, low-latency, 6 degrees-of-freedom pose estimation and tracking of fiducial markers. We employ the high-speed abilities of event-based sensors to directly refine spatial transformations. Furthermore, we introduce a novel two-way verification process for detecting tracking errors by backtracking the estimated pose, allowing us to evaluate the quality of our tracking. This approach allows us to achieve pose estimation with an average latency lower than 3 ms and an average error lower than 5 mm.
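One simple way to realise a backtracking check of this kind is to compose the forward and backward pose estimates and verify that the residual transform is close to identity; the tolerances below are illustrative, not the paper's values:

```python
import numpy as np

def two_way_verification(T_forward, T_backward, trans_tol=0.005, rot_tol_deg=2.0):
    """Return True if composing the forward and backtracked 4x4 poses is
    approximately the identity (translation in metres, rotation in degrees)."""
    residual = T_forward @ T_backward
    trans_err = np.linalg.norm(residual[:3, 3])
    # Rotation angle of the residual rotation matrix.
    cos_angle = np.clip((np.trace(residual[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return trans_err <= trans_tol and rot_err_deg <= rot_tol_deg
```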
Deep Metric Learning trains a neural network to map input images to a lower-dimensional embedding space such that similar images are closer together than dissimilar images. When used for item retrieval, a query image is embedded using the trained model and the closest items from a database storing their respective embeddings are returned as the most similar items for the query. Especially in product retrieval, where a user searches for a certain product by taking a photo of it, the image background is usually not important and thus should not influence the embedding process. Ideally, the retrieval process always returns fitting items for the photographed object, regardless of the environment the photo was taken in. In this paper, we analyze the influence of the image background on Deep Metric Learning models by utilizing five common loss functions and three common datasets. We find that Deep Metric Learning networks are prone to so-called background bias, which can lead to a severe decrease in retrieval performance when changing the image background during inference. We also show that replacing the background of images during training with random background images alleviates this issue. Since we use an automatic background removal method to do this background replacement, no additional manual labeling work and model changes are required while inference time stays the same. Qualitative and quantitative analyses, for which we introduce a new evaluation metric, confirm that models trained with replaced backgrounds attend more to the main object in the image, benefiting item retrieval systems.
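The background-replacement augmentation described above amounts to compositing the segmented foreground onto a random background image; a minimal sketch, assuming a binary mask from any automatic background-removal tool:

```python
import numpy as np

def replace_background(image, mask, background):
    """Paste the foreground object (mask == 1) onto a replacement background.

    image, background: arrays of shape (H, W, 3); mask: (H, W) with values {0, 1}.
    """
    mask3 = mask[..., None].astype(image.dtype)
    return image * mask3 + background * (1 - mask3)
```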
Image distortion is a problem of wide field-of-view cameras, and camera calibration is a fundamental step in overcoming it for applications such as image undistortion, 3D reconstruction, and camera motion estimation. In camera calibration, intrinsic parameters such as focal length and distortion are estimated, and the quality of the undistorted/enhanced image depends on the correctness of these estimates. Existing methods fall into two groups: checkerboard-based approaches, which require manual interaction, and deep learning approaches. Most deep learning approaches are based on the Convolutional Neural Network (CNN) framework, which fails to capture long-range dependencies in a distorted image. This paper proposes a fully automated EnsembleNet method to infer the focal length and distortion parameters and overcome this problem. The proposed model extracts various contexts (local patches) by exploiting a ViT (Vision Transformer) and spatial features from various CNN-based models using a single input image. The proposed model uses the differential evolution (DE) approach to learn the ensemble weights. The experiments show that the proposed EnsembleNet outperforms state-of-the-art deep learning-based models in terms of mean squared error.
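A generic sketch of learning ensemble weights with differential evolution, here minimising mean squared error over the member models' predictions (the convex-combination constraint and SciPy optimiser are assumptions, not the exact EnsembleNet procedure):

```python
import numpy as np
from scipy.optimize import differential_evolution

def learn_ensemble_weights(member_preds, targets):
    """member_preds: (n_models, n_samples) predicted parameter values,
    targets: (n_samples,) ground truth. Returns normalised ensemble weights."""
    n_models = member_preds.shape[0]

    def mse(w):
        w = np.abs(w) / (np.abs(w).sum() + 1e-12)   # normalise to a convex combination
        blended = w @ member_preds
        return np.mean((blended - targets) ** 2)

    result = differential_evolution(mse, bounds=[(0.0, 1.0)] * n_models, seed=0)
    weights = np.abs(result.x) / np.abs(result.x).sum()
    return weights
```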
The rise of sophisticated in-car multimedia solutions has had both positive and negative impacts on the road-user's driving experience. A drastic increase in the number of road accidents due to drivers' inattention is a clear negative consequence. Thus, there has been increased interest lately in measuring real-time driver cognitive load to alert drivers to focus on driving. Quantifying the ability to perform a task such as driving safely is difficult, given the diversity of subjects and their emotional state or fatigue at a given time. In this paper, a pipeline is presented that obtains ground truth labels for cognitive load from video and biosignal data. The experimental design for inducing the cognitive load state and the data processing are presented as part of the pipeline. This methodology was validated using biosignal data collected from 31 subjects and conducting a comparative analysis between cognitive and non-cognitive states.
Multiple sclerosis (MS) is a chronic autoimmune inflammatory disease that damages the central nervous system by causing small lesions in the brain. In this study, we present the fusion of four feature extraction methods, namely the 3D Local Binary Pattern (3D-LBP), 3D Decimal Descriptor Patterns (3D-DDP), Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) and Decimal Descriptor Patterns from Three Orthogonal Planes (DDP-TOP), with Convolutional Neural Networks (CNN) for MS classification using the T1, T2 and PD 3D MRI sequences from the BrainWeb dataset. We implement twelve CNN models and apply each method with each of the CNN models on the T1, T2 and then PD MRI sequences. The experimental results demonstrate that the 3D-DDP and DDP-TOP methods are the most robust and, regarding the effect of MRI sequence contrast on the classification results, that T2 yields the best performance.
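For intuition, the plain 2D LBP histogram of a single MRI slice can be computed as below; this is only the basic texture-descriptor idea, while the paper uses 3D and three-orthogonal-plane (TOP) extensions of LBP/DDP:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(slice_2d, n_points=8, radius=1):
    """Uniform-LBP histogram of one 2D slice (parameters are illustrative)."""
    codes = local_binary_pattern(slice_2d, n_points, radius, method="uniform")
    # The 'uniform' method yields n_points + 2 distinct codes.
    hist, _ = np.histogram(codes, bins=n_points + 2,
                           range=(0, n_points + 2), density=True)
    return hist
```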
Runway and taxiway pavements are exposed to high stress during their projected lifetime, which inevitably leads to a decrease in their condition over time. To make sure airport pavements support uninterrupted and resilient operations, it is of utmost importance to monitor their condition and conduct regular inspections. UAV-based inspection has recently been gaining importance due to its wide-ranging monitoring capabilities and reduced cost. In this work, we propose a vision-based approach to automatically identify pavement distress using images captured by UAVs. The proposed method is based on Deep Learning (DL) to segment defects in the image. The DL architecture accommodates the low computational capacities of embedded systems in UAVs by using an optimised implementation of EfficientNet feature extraction and Feature Pyramid Network segmentation. To deal with the lack of annotated data for training, we have developed a synthetic dataset generation methodology to extend the available distress datasets. We demonstrate that the use of a mixed dataset composed of synthetic and real training images yields better results when testing the trained models in real application scenarios.
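An EfficientNet encoder with a Feature Pyramid Network decoder is available off the shelf in the third-party segmentation_models_pytorch library; the snippet below is one generic way to instantiate such a model (the encoder variant, pretrained weights and single distress class are assumptions, and the authors' optimised embedded implementation may differ):

```python
# segmentation_models_pytorch is a third-party library offering this
# encoder/decoder combination; not necessarily the authors' implementation.
import segmentation_models_pytorch as smp

model = smp.FPN(
    encoder_name="efficientnet-b0",   # lightweight encoder suited to embedded UAV hardware
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,                        # binary distress / no-distress mask
)
```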
Style transfer aims to render a new artistic image based on a content image and given artwork style. Recent style transfer techniques often suffer structure distortion and artifact problems that abate the quality of stylized images. Motivated by these observations and the previous works, we introduce a novel GAN framework to enhance the aesthetics, faithfulness and flexibility in the style transfer process. The key factor of our model is the Laplacian Pyramid loss that naturally forces the content preservation and the ResidualStyle discriminator block to capture the artwork’s painting style better. In contrast to existing methods that calculate the Euclidean distance between the features of generated image and content image, our Laplacian Pyramid loss better captures the content representation by different frequency bands of the content image. As evaluated by experimental results, our framework surmounts the unrealistic artifacts to synthesize the photorealistic artworks in real-time, hence attaining striking visual effects.
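The Laplacian Pyramid loss mentioned above compares band-pass decompositions of the stylised output and the content image rather than raw features; a minimal PyTorch sketch, with the pyramid depth and the average-pool downsampling as assumptions:

```python
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """Build a Laplacian pyramid of a (B, C, H, W) image tensor."""
    pyramid, current = [], img
    for _ in range(levels):
        down = F.avg_pool2d(current, 2)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)
        pyramid.append(current - up)   # band-pass residual at this scale
        current = down
    pyramid.append(current)            # low-frequency residual
    return pyramid

def laplacian_pyramid_loss(generated, content, levels=3):
    """L1 distance between the pyramids, encouraging structure preservation."""
    gp = laplacian_pyramid(generated, levels)
    cp = laplacian_pyramid(content, levels)
    return sum(F.l1_loss(g, c) for g, c in zip(gp, cp))
```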
Optical coherence tomography (OCT) is a non-invasive technique that allows the retina to be studied with precision, including the analysis of the features of its layers and of other structures such as the macula or the optic nerve. This is why it is used in the diagnosis and monitoring of eye diseases such as glaucoma and optic neuritis. A crucial step in this process is the segmentation of the different layers, which is a great challenge due to its complexity. In this work, a methodology based on deep learning and transfer learning is developed to automatically segment nine retinal layers in OCT images centred on the optic disc. In addition, the thickness of each retinal layer is measured along each B-scan. For this purpose, OCT images from a public dataset and a dataset collected from depth-enhanced images are used. The proposed method achieves a Dice score of 83.6%, similar to that obtained in the state of the art, segmenting the nine retinal layers and the optic disc in both sets of images. In addition, the different layers are represented in three different graphical formats.
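The Dice score reported above is the usual overlap measure between predicted and ground-truth masks; the paper reports it over the nine layers and the optic disc, while the sketch below shows the per-mask computation:

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-7):
    """Dice coefficient between a predicted and a ground-truth binary mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```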
Real-time semantic segmentation is an important field in computer vision. It is widely employed in real-world scenarios such as mobile devices and autonomous driving, requiring networks to achieve a trade-off between efficiency, performance, and model size. This paper proposes a lightweight network with multi-scale information interaction attention (MSIANet) to address this issue. Specifically, we design a multi-scale information interaction module (MSI) as the main component of the encoder, used to densely encode contextual semantic features. Moreover, we design a multi-channel attention fusion module (MAF) in the decoder, realizing multi-scale information fusion through channel and spatial attention mechanisms. We verify our method through extensive experiments and show that our network has fewer parameters and faster inference than most existing real-time semantic segmentation methods on multiple datasets.
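A generic sketch of fusing two feature maps with channel attention followed by spatial attention, to illustrate the kind of mechanism an attention fusion module relies on (this is not the exact MAF design; the reduction ratio and kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """Fuse two same-shape feature maps with channel then spatial attention."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(), nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, low, high):
        x = low + high                               # fuse encoder/decoder features
        x = x * self.channel_fc(x)                   # channel attention
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_conv(pooled)         # spatial attention
```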
An electroencephalogram (EEG) signal is a dominant indicator of brain activity that contains conspicuous information about the underlying mental state. Classification of EEG signals is desirable in order to comprehend the objective behavior of the brain in various diseased or control activities. Even though many studies have been done to find the best analytical EEG system, they all focus on domain-specific solutions and cannot be extended to more than one domain. This study introduces a multidomain adaptive broad learning EEG system (MABLES) for classifying four different EEG groups under a single sequential framework. In particular, this work expands the applicability of three previously proposed modules, namely the empirical Fourier decomposition (EFD), improved empirical Fourier decomposition (IEFD), and multidomain features selection (MDFS) approaches, for the realization of MABLES. A feed-forward neural network classifier is used in extensive trials on four different datasets utilizing a 10-fold cross-validation technique. Results compared to previous research show that the mental imagery, epilepsy, slow cortical potentials, and schizophrenia EEG datasets achieve the highest average classification accuracies, with scores of 94.87%, 98.90%, 92.65% and 95.28%, respectively. The entire qualitative and quantitative study verifies that the suggested MABLES framework exceeds the existing domain-specific methods in classification accuracy and multi-role adaptability, and can therefore be recommended as an automated real-time brain rehabilitation system.
An electroencephalogram (EEG) is a set of time series, each of which can be represented as a 2D image (spectrogram), so that an EEG recording can be mapped to a C-channel image (where C equals the number of electrodes in the EEG montage). In this paper, a novel approach for automated feature extraction from the spectrogram representation is proposed. The method involves autoencoder models based on 3-dimensional convolution layers and 2-dimensional deformable convolution layers. The features extracted by the autoencoders can be used to distinguish patients with Major Depressive Disorder (MDD) from healthy controls based on resting-state EEG. The proposed approach outperforms baseline ML models trained on manually extracted spectral features.
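The channel-wise spectrogram "image" described above can be built as follows; the sampling rate, window length and log scaling are illustrative choices, not the paper's parameters:

```python
import numpy as np
from scipy.signal import spectrogram

def eeg_to_image(eeg, fs=256, nperseg=256, noverlap=128):
    """Map a multi-channel EEG recording (channels x samples) to a
    C-channel spectrogram image of shape (channels, freq_bins, time_bins)."""
    images = []
    for channel in eeg:
        _, _, sxx = spectrogram(channel, fs=fs, nperseg=nperseg, noverlap=noverlap)
        images.append(np.log1p(sxx))   # log scale for dynamic-range compression
    return np.stack(images)
```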
We propose a new model for learning to rank two images with respect to their relative strength of expression of a given attribute. We address this problem, called relative attribute learning, using a vision transformer backbone. The embedded representations of the two images to be compared are extracted and used for comparison by a ranking head, in an end-to-end fashion. The results demonstrate the strength of vision transformers and their suitability for relative attribute classification. Our proposed approach outperforms the state of the art by a large margin, achieving 90.40% and 98.14% mean accuracy over the attributes of the LFW-10 and Pubfig datasets, respectively.
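A minimal sketch of the overall shape of such a model, with a shared ViT backbone and a small ranking head over the concatenated embeddings; the torchvision backbone, head sizes and scoring convention are assumptions, not the paper's exact network:

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class RelativeAttributeRanker(nn.Module):
    """Siamese ViT: both images share one backbone; the head scores which
    image expresses the attribute more strongly."""

    def __init__(self):
        super().__init__()
        self.backbone = vit_b_16(weights=None)
        self.backbone.heads = nn.Identity()          # use the 768-d class-token embedding
        self.rank_head = nn.Sequential(
            nn.Linear(2 * 768, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_a, img_b):
        fa, fb = self.backbone(img_a), self.backbone(img_b)
        # Positive score: img_a shows the attribute more strongly than img_b.
        return self.rank_head(torch.cat([fa, fb], dim=1)).squeeze(-1)
```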
Over the last decade, the automotive industry has introduced advanced driving assistance systems (ADAS) and automated driving (AD) features onto the roads to reduce fatality rates. One of these ADAS is the surround-view system, which provides an orthographic view of the vehicle by using at least four fish-eye lens cameras embedded in it. Small bumps or temperature changes may modify these cameras' relative poses, leading to geometrical mismatches between views in the top-view projection plane. In addition, terrain irregularities may misalign the orthographic view with the ground plane surface. Both problems can be solved by re-estimating the relative poses of the cameras with respect to a single common point in the vehicle. This procedure, known as recalibration, is performed offline in technical garages or by online calibration mechanisms on engine start. However, it is a slow and cumbersome process. Research to date studies how to optimally recalibrate these cameras in an online manner, neglecting the practical aspect of when this procedure should be undertaken. Therefore, depending on the functionalities for which the embedded cameras are required, a compromise must be considered between using out-of-calibration cameras and the consequences derived from the recalibration process. This would prevent re-estimating the cameras' relative poses in situations where the misalignment between adjacent cameras may not be noticeable. For this reason, a novel approach that measures the degree of calibration between cameras embedded in a vehicle is proposed. This method extracts relevant features from predefined regions of interest of each camera using the histogram of oriented gradients (HOG) descriptor. Features that belong to adjacent cameras are then compared using the cosine similarity metric. The proposed method is evaluated on the open-source AD research simulator CARLA, providing a detailed analysis to objectively highlight the usefulness of this method in studying the degree of calibration of a camera array in a surround-view system.
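A minimal sketch of the HOG-plus-cosine-similarity comparison between the overlapping regions of interest of two adjacent cameras; the ROI extraction step and the HOG parameters are assumptions:

```python
from skimage.feature import hog
from scipy.spatial.distance import cosine

def calibration_score(patch_cam_a, patch_cam_b):
    """Cosine similarity between HOG descriptors of two adjacent-camera ROIs.

    Both patches are grayscale arrays of identical size; a value near 1.0
    suggests the cameras are still well aligned in the top-view projection.
    """
    desc_a = hog(patch_cam_a, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))
    desc_b = hog(patch_cam_b, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2))
    return 1.0 - cosine(desc_a, desc_b)
```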
In X-ray computed tomography (CT), the real rotation axis position often does not coincide with the assumed one: technical imperfections of the tomographic setup and the fast movement of the gantry and goniometer cause rotation axis displacements and inclinations. The use of incorrect axis location parameters during reconstruction leads to the appearance of so-called tuning-fork artifacts in the form of stripes and blurs at the object boundary. The existing rotation axis alignment methods for cone-beam CT require a large amount of computing resources, are laborious to implement, are not able to accurately determine several axis location parameters at once, or are based on processing additional post-scans and shots of objects equipped with reference markers, which are not always available. Thus, the development of rotation axis alignment methods for cone-beam CT remains relevant. In this paper, a model for parameterizing the rotation axis position is described and justified, and a novel multi-stage automatic method for determining the rotation axis parameters is presented. The proposed method is based on the mean projection image and is tested on both synthetic and real data in parallel-beam and cone-beam geometric schemes. On the simulated data, the absolute error of the method is no more than 1 pixel for the shift and 1 degree for the slope.
The area of computer vision has gone through exponential growth and advancement over the past decade, mainly due to the introduction of effective deep-learning methodologies and the availability of massive data. This has resulted in the incorporation of intelligent computer vision schemes to automate a number of different tasks. In this paper, we work along similar lines and propose an integrated system for the development of robotic arms, covering fruit identification, classification, counting, and mask generation through semantic segmentation. The current method of performing these processes manually is time-consuming and not feasible for large fields. For this reason, multiple works have been proposed to automate harvesting tasks and minimize the overall overhead; however, there is a lack of an integrated system that can automate all of these processes together. We therefore propose one such approach based on different machine learning techniques, using the most effective learning technique with computer vision capability for each process. The result is an integrated, intelligent, end-to-end computer vision-based system to detect, classify, count, and identify apples. In this system, we modified the YOLOv3 algorithm to detect and count apples effectively, and the proposed scheme works even under variable lighting conditions. The system was trained and tested using the standard MinneApple benchmark. Experimental results show an average accuracy of 91%.
Event cameras, also known as neuromorphic sensors, are a relatively new technology with some advantages over RGB cameras. The most important one is the way they capture light changes in the environment: each pixel fires independently of the others when it detects a change in scene illumination. To give the user more freedom in controlling the output of these cameras, such as changing the sensitivity of the sensor to light changes or controlling the number of generated events, camera manufacturers usually provide tools for making sensor-level changes to the camera settings. The contribution of this research is to examine and document the effects of changing these sensor settings on sharpness as an indicator of the quality of the generated stream of event data. To gain this understanding, the event stream is converted to frames, and the average image gradient magnitude, as an index of the number of edges and hence of sharpness, is calculated for these frames. Five different bias settings are explained, and the effect of changing each on the event output is surveyed and analyzed. In addition, the operation of the event camera sensing array is explained with an analogue circuit model, and the functions of the bias foundations are linked with this model.
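The sharpness index described above, the mean gradient magnitude of an accumulated event frame, can be computed as follows:

```python
import numpy as np

def average_gradient_magnitude(frame):
    """Mean gradient magnitude of an event frame, used as a simple
    sharpness / edge-density indicator."""
    gy, gx = np.gradient(frame.astype(np.float64))
    return np.mean(np.hypot(gx, gy))
```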
3D shape reconstruction from images is an active topic in computer vision. Shape-from-Focus (SFF) is an important approach that requires an image stack acquired in a focus-controlled manner to infer the 3D shape. In this article, 3D reconstruction of synthetic gastrointestinal (GI) regions is performed using SFF. An image stack is generated in Blender with a focus-controlled camera. A colour focus measure is applied for shape recovery, followed by a weighted L2 regularizer to correct inaccurate depth values. A precise comparison between the recovered shape and the ground truth data is made by measuring the depth error and the correlation between them. Results show that the SFF technique is practical for 3D reconstruction of GI regions with focus- and motion-controlled pillcams, which are technologically feasible to implement.
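The basic SFF principle, picking, per pixel, the focus position that maximises a focus measure, can be sketched as below; the absolute-Laplacian measure on grayscale frames is a simple stand-in for the colour focus measure and regularizer used in the paper:

```python
import numpy as np
from scipy.ndimage import laplace

def depth_from_focus(stack):
    """Per-pixel index of best focus from a (n_focus_positions, H, W) stack."""
    focus = np.stack([np.abs(laplace(img.astype(np.float64))) for img in stack])
    return np.argmax(focus, axis=0)
```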
It is well known that seed quality strongly affects the germination of rice seeds, and the age of the seed is one of the primary factors in assessing seed quality. Therefore, this study aims to develop an AI-based machine-learning model to classify rice seeds by age. This study employs the SURF-BOF-based Cascaded-ANFIS algorithm to implement the classifier, and the performance of the proposed model is compared with VGG16. Moreover, this research contributes a novel Japanese rice seed dataset to the scientific community. Furthermore, a 10-fold cross-validation is performed to evaluate the robustness of the novel approach; its mean accuracy confirms the proposed algorithm's higher robustness in the age-wise classification of rice seeds. In addition, the results are evaluated using the confusion matrix and metrics such as precision, recall, and F1-score. The accuracy for Akitakomachi, Koshihikari, Yandao-8, and rice variety classification is 99%, 99%, 92%, and 97%, respectively. Analysis of the results demonstrates the ability to classify rice by age and the general robustness of the algorithm.
In this paper, a brand-new approach to image encryption is put forth that is based on the new Beta chaotic map, the Beta wavelet, and the Latin square. The proposed strategy is made up of various steps. The Wavelet Beta map is used to produce the random key after generating the Latin square S-box. The encryption stage uses the obtained key. The ciphered images have undergone numerous tests after the encryption procedures, including histogram analysis, information entropy analysis, and differential analysis. The results, which demonstrate that the proposed method has high efficiency and satisfactory security, are promising when compared to earlier systems and demonstrate that it is appropriate for image data transmission.
We address the problem of uncertainty quantification in the domain of face attribute classification using the Evidential Deep Learning (EDL) framework. The proposed EDL approach leverages the strength of Convolutional Neural Networks (CNNs), with the objective of representing the uncertainty in the output predictions. Predominantly, softmax/sigmoid activation functions are applied to map the output logits of a CNN to target class probabilities in multi-class classification problems. By replacing the standard softmax/sigmoid output of a CNN with the parameters of an evidential distribution, EDL learns to represent the uncertainty in its predictions. The proposed approach is evaluated on the CelebA and LFWA datasets. The quantitative and qualitative analyses demonstrate the suitability and strength of EDL for estimating the uncertainty in the output predictions without hindering the accuracy of CNN-based models.
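A sketch of the commonly used evidential output head (softplus evidence, Dirichlet concentration alpha = evidence + 1), shown only to illustrate how class probabilities and an uncertainty value are derived from the logits; the training loss terms are omitted and this is not necessarily the paper's exact formulation:

```python
import torch.nn.functional as F

def evidential_prediction(logits):
    """Map output logits of shape (B, K) to expected class probabilities and
    a per-sample uncertainty in (0, 1]."""
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)
    probs = alpha / strength                  # expected class probabilities
    uncertainty = logits.shape[1] / strength  # K / sum(alpha)
    return probs, uncertainty
```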
The results of automatic learning algorithms based on deep neural networks are impressive, and they are extensively used in a variety of fields. However, using them requires access to private information that is frequently confidential (financial, medical, etc.). This calls for good precision as well as special attention to the privacy and security of the data. In this paper, we propose a novel approach to this issue by running a Convolutional Neural Network (CNN) model over encrypted data. To achieve this, we focus on approximating the commonly used activation functions that are key to CNNs: ReLU, Sigmoid and Tanh. We start by creating a low-degree polynomial, which is essential for efficient homomorphic encryption (HE); this polynomial, based on the Beta function and its primitive, is used as the activation function. The next step is to build a CNN model using batch normalization to ensure that the data remain inside a limited interval. Finally, MNIST is used to evaluate our methodology and assess the effectiveness of the proposed approach; the experimental results support its efficacy.
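For intuition only, a generic low-degree polynomial stand-in for an activation can be obtained by least-squares fitting over a bounded interval (batch normalization keeps inputs in range); the paper instead derives its polynomial from the Beta function and its primitive:

```python
import numpy as np

def fit_activation_polynomial(act=lambda x: np.maximum(x, 0.0), degree=3,
                              interval=(-4.0, 4.0)):
    """Least-squares polynomial approximation of an activation function
    (ReLU by default) over a bounded interval."""
    x = np.linspace(*interval, 1000)
    coeffs = np.polyfit(x, act(x), degree)   # highest-degree coefficient first
    return np.poly1d(coeffs)

# poly = fit_activation_polynomial(); poly(0.5) approximates ReLU(0.5)
```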
Deep learning models for computer vision in remote sensing, such as Convolutional Neural Networks (CNNs), benefit from acceleration through the use of multiple CPUs and GPUs. There are several ways to make the training stage more effective in utilizing multiple cores at the same time: processing different image mini-batches with a duplicated model, called Distributed Data Parallelization (DDP), and computing the parameters in lower-precision floating point, called Automatic Mixed Precision (AMP). We investigate the impact of the DDP and AMP training modes on the overall utilization and memory consumption of CPU and GPU, as well as the accuracy of a CNN model. The study is performed on the EuroSAT dataset, a Sentinel-2-based benchmark satellite image dataset for land-cover image classification. We compare training using one CPU, using DDP, and using both DDP and AMP over 100 epochs with the ResNet-18 architecture. The hardware used is an Intel Xeon Silver 4116 CPU with 24 cores and an NVIDIA V100 GPU. We find that although CPU parallelization with DDP takes less time to train on the images, it can take 50 MB more memory than using only a single CPU. The combination of DDP and AMP can free up to 160 MB of memory and reduce computation time by 20 seconds. The test accuracy is slightly higher for both DDP and DDP-AMP, at 90.61% and 90.77% respectively, than without DDP and AMP, at 89.84%. Hence, training with Distributed Data Parallelization (DDP) and Automatic Mixed Precision (AMP) offers lower GPU memory consumption, faster training execution time, faster convergence towards solutions, and, finally, higher accuracy.
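A minimal sketch of a DDP + AMP training step in PyTorch (process-group setup, data loaders and the ResNet-18 model are assumed to exist already; this is a generic illustration, not the paper's training script):

```python
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model_ddp, batch, labels, optimizer, scaler, loss_fn):
    """One mixed-precision training step on a DDP-wrapped model."""
    optimizer.zero_grad(set_to_none=True)
    with autocast():                       # mixed-precision forward pass
        loss = loss_fn(model_ddp(batch), labels)
    scaler.scale(loss).backward()          # scaled backward to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Typical setup (per process):
# model_ddp = DDP(model.cuda(), device_ids=[local_rank]); scaler = GradScaler()
```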
Bioassay data classification is an important task in drug discovery. However, the data used in classification are highly imbalanced, leading to inaccuracies in classification for the minority class. We propose a novel approach in which we train separate models using different features derived by training stacked autoencoders (SAE). Experiments are performed on 7 bioassay datasets, in which each data file consists of feature descriptors for every compound along with a class label indicating whether the compound is active or inactive. We first perform data cleaning using the borderline synthetic minority oversampling technique (SMOTE) followed by removal of Tomek links, and then learn different features hierarchically from the cleaned feature vectors. We then train separate cost-sensitive feed-forward neural network (FNN) classifiers using the hierarchical features to obtain the final classification. To increase the True Positive Rate (TPR), a test sample is labeled as active if at least one classifier predicts it as active. We demonstrate that data cleaning and learning separate classifiers improve the TPR and F1 score compared with other machine learning approaches. To the best of our knowledge, SAEs and FNNs have not previously been applied to bioassay data classification.
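The data-cleaning step (borderline SMOTE oversampling followed by Tomek-link removal) is available in the third-party imbalanced-learn library; a sketch with default parameters, which may differ from those used in the paper:

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

def clean_bioassay_data(X, y, random_state=0):
    """Oversample the active (minority) class with Borderline-SMOTE, then
    remove Tomek links from the resampled data."""
    X_over, y_over = BorderlineSMOTE(random_state=random_state).fit_resample(X, y)
    X_clean, y_clean = TomekLinks().fit_resample(X_over, y_over)
    return X_clean, y_clean
```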
Current encoder-decoder methods for remote sensing image captioning (RSIC) avoid fine-grained structural representation of objects due to the lack of suitable encoding frameworks. This paper proposes a novel structural representative network (SRN) for acquiring fine-grained structures of remote sensing images (RSI) to generate semantically meaningful captions. Initially, we employ the SRN on top of the final layers of a convolutional neural network (CNN) to attain spatially transformed RSI features. A multi-stage decoder is applied to the features extracted by the SRN to produce fine-grained, meaningful captions. The efficacy of our proposed methodology is demonstrated on two RSIC datasets, i.e., the Sydney-Captions and UCM-Captions datasets.
Image matching and wireless signal fingerprinting are two methods for indoor localization. However, image-matching methods need a large-scale image database, and the matching process tends to be computationally complex, which cannot satisfy real-time requirements. Meanwhile, WiFi signal fingerprints are easily affected by changing surroundings, so positioning accuracy and stability suffer, and building the fingerprint database is time-consuming and laborious. To address these issues, we propose a crowd-sourced optical indoor positioning algorithm updated by WiFi fingerprints. First, we use a WiFi fingerprint with the K Weighted Nearest Neighbor (KWNN) algorithm to make a coarse estimate, which reduces the scope of image retrieval during the image-matching stage; we then fuse image and posture data with a mean-weighted exponent algorithm to refine this coarse estimate. We also update the positioning database in a crowd-sourced way. Experimental results show that the mean error of the proposed algorithm reaches 1.71 m under real-time operation, a 50% decrease compared with the standard KWNN algorithm, while positioning stability is greatly improved.
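A minimal sketch of the weighted K-nearest-neighbour fingerprint step used for the coarse estimate, with inverse-distance weighting as an assumed weighting scheme:

```python
import numpy as np

def kwnn_position(query_rssi, db_rssi, db_positions, k=4, eps=1e-6):
    """Coarse (x, y) estimate from a WiFi fingerprint database.

    query_rssi: (n_aps,) measured RSSI vector; db_rssi: (n_db, n_aps)
    reference fingerprints; db_positions: (n_db, 2) reference coordinates.
    """
    dists = np.linalg.norm(db_rssi - query_rssi, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    return weights @ db_positions[nearest]
```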
In the search for low-power communication technology for future battery-free devices and the IoT, ambient backscatter communication is a promising solution. To this end, this work theoretically analyzes the performance of an ambient backscatter system using a LoRa transmission as the excitation signal. The tag data are encoded using FSK modulation; at the receiver side, a quadrature demodulator generates the baseband signal, and a correlator demodulator combined with square-law detection decodes the tag data. A theoretical BER calculation is provided through numerical computation of an approximate union-bound equation, and shows good performance of LoRa backscatter over the bi-static backscatter configuration for SNR values between 2 dB and 9 dB.
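For context only: the textbook bit error rate of non-coherently detected binary FSK over an AWGN channel is given below; the paper's union-bound analysis of the LoRa backscatter link is more involved and is not reproduced here.

```latex
P_b = \frac{1}{2}\exp\!\left(-\frac{E_b}{2N_0}\right)
```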
Recent developments in deep learning have shown great possibilities for many computer vision tasks. Image compression is a core field of computer vision, and deep learning is gradually being used for image compression and decompression. Lossy compression achieves a higher compression rate than lossless compression, but its main disadvantage is the loss of data during compression, and a higher compression rate causes higher data loss. Recent advances in deep learning techniques for computer vision, such as image noise reduction and image super-resolution, have shown great promise for image enhancement. If these techniques can be utilized to mitigate the impact of a higher compression rate on the output of lossy compression, then nearly the same image quality can be achieved. In this paper, an image compression and decompression framework based on two convolutional neural network (CNN) autoencoders is proposed. Images and videos are the main sources of unstructured data, and storing and transmitting them in the cloud can be costly and resource-consuming; if lossy compression can compress an image before it is stored in the cloud, it can reduce storage cost and space. We achieve a 4x compression ratio, which means an image occupies only 25% of the space of the original image. The original image is retrieved using a joint operation of deconvolution and image enhancement algorithms. The proposed framework achieves state-of-the-art performance in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) at 4x compression.
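The two reported quality metrics can be computed with scikit-image as below (channel_axis=-1 assumes colour images and a recent scikit-image version):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compression_quality(original, reconstructed):
    """PSNR and SSIM between the original image and the decompressed,
    enhanced reconstruction (both arrays with the same dtype and range)."""
    psnr = peak_signal_noise_ratio(original, reconstructed)
    ssim = structural_similarity(original, reconstructed, channel_axis=-1)
    return psnr, ssim
```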
Despite the great success of state-of-the-art deep neural networks, several studies have reported models to be over-confident in their predictions, indicating miscalibration. Label Smoothing has been proposed as a solution to the over-confidence problem and works by softening hard targets during training, typically by distributing part of the probability mass from a 'one-hot' label uniformly to all other labels. However, neither model nor human confidence in a label is likely to be uniformly distributed in this manner, with some labels more likely to be confused than others. In this paper, we integrate notions of model confidence and human confidence with label smoothing, respectively Model Confidence LS and Human Confidence LS, to achieve better model calibration and generalization. To enhance model generalization, we show how our model and human confidence scores can be successfully applied to curriculum learning, a training strategy inspired by learning 'easier to harder' tasks. A higher model or human confidence score indicates a more recognisable and therefore easier sample and can therefore be used as a scoring function to rank samples in curriculum learning. We evaluate our proposed methods with four state-of-the-art architectures for image and text classification tasks, using datasets with multi-rater label annotations by humans. We report that integrating model or human confidence information in label smoothing and curriculum learning improves both model performance and model calibration. The code is available at https://github.com/AoShuang92/Confidence Calibration CL.
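A sketch of the general idea of confidence-driven label smoothing, keeping a per-sample (model- or human-derived) confidence on the true class and spreading the remainder uniformly; this illustrates the concept and is not the paper's exact scheme:

```python
import torch

def confidence_label_smoothing(targets, num_classes, confidence):
    """Build soft targets from per-sample confidence scores.

    targets: (B,) class indices; confidence: (B,) values in (1/num_classes, 1].
    Returns a (B, num_classes) soft-label matrix for a soft-target cross entropy.
    """
    smooth = (1.0 - confidence).unsqueeze(1) / (num_classes - 1)
    soft = smooth.repeat(1, num_classes)
    soft.scatter_(1, targets.unsqueeze(1), confidence.unsqueeze(1))
    return soft
```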
Heart rate is one of the most vital health metrics and can be used to investigate and gain insight into various human physiological and psychological states. Estimating heart rate without the constraints of contact-based sensors is therefore a very attractive field of research, as it enables well-being monitoring in a wider variety of scenarios. Consequently, various techniques for camera-based heart rate estimation have been developed, ranging from classical image processing to convoluted deep learning models and architectures. At the heart of such research efforts lie health and visual data acquisition, cleaning, transformation, and annotation. In this paper, we discuss how to prepare data for developing or testing an algorithm or machine learning model for heart rate estimation from images of facial regions. The prepared data include camera frames as well as readings from an electrocardiograph sensor. The proposed pipeline is divided into four main steps, namely removal of faulty data, frame and electrocardiograph timestamp de-jittering, signal denoising and filtering, and frame annotation creation. Our main contributions are a novel technique for eliminating jitter from health sensor and camera timestamps and a method to accurately time-align visual frames with electrocardiogram sensor data, which is also applicable to other sensor types.
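As a simple stand-in for the de-jittering step (not the paper's novel technique), jittered timestamps can be replaced by a uniform grid fitted by linear regression against the sample index:

```python
import numpy as np

def dejitter_timestamps(timestamps):
    """Fit a uniform timestamp grid to a jittered, monotonically increasing
    sequence of frame or ECG sample timestamps."""
    idx = np.arange(len(timestamps))
    slope, intercept = np.polyfit(idx, timestamps, 1)   # nominal period and start time
    return intercept + slope * idx
```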