Human-Object Interaction (HOI) detection aims to locate and recognize HOI instances in images or videos. However, the significant cost of manpower and resources for annotation poses challenges, particularly for the long-tail or even zero-shot distribution problem. Additionally, generating visual features from fixed semantic information suffers from a lack of diversity. In response to these challenges, we develop a novel Visual-Semantic and Multi-source Feature Generation (VSMG) network for zero-shot human-object interaction detection. First, through a visual-semantic GAN, the model not only generates visual/semantic features from the corresponding semantic/visual ones but also introduces diverse unseen visual features during the training phase. Second, using a knowledge-aware relation graph, the model encodes the relations among objects, actions, and interactions, covering both seen and unseen classes. Based on the relation weights in the graph, a dynamic multi-source feature generation strategy produces diverse virtual visual features for unseen classes. Finally, experimental results on the HICO-DET dataset validate the effectiveness of the proposed method, demonstrating improvements in the detection performance of the trained HOI detector.
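As a rough illustration of the semantic-to-visual generation step, the sketch below shows a conditional generator that maps a class embedding plus random noise to a virtual visual feature. All layer sizes, names, and the choice of PyTorch are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of semantic-to-visual feature generation; the dimensions
# (300-d embeddings, 2048-d visual features) are assumptions for illustration.
import torch
import torch.nn as nn

class SemanticToVisualGenerator(nn.Module):
    def __init__(self, sem_dim=300, noise_dim=100, vis_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, vis_dim),
            nn.ReLU(),  # mimic non-negative post-ReLU CNN features
        )

    def forward(self, sem_emb):
        # The noise vector is what lets one fixed semantic vector yield
        # many diverse virtual visual features.
        z = torch.randn(sem_emb.size(0), self.noise_dim, device=sem_emb.device)
        return self.net(torch.cat([sem_emb, z], dim=1))

# Generate eight diverse virtual features for one unseen interaction class.
gen = SemanticToVisualGenerator()
sem = torch.randn(1, 300).repeat(8, 1)  # stand-in for a class word embedding
virtual_feats = gen(sem)                # shape: (8, 2048)
```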
Automatic emotion recognition for video clips has become a popular research area in recent years. Previous studies have explored emotion recognition through monomodal approaches such as voice, text, facial expression, and physiological signals. We focus on the complementarity of these information sources and construct an automatic emotion recognition model based on deep learning and a multimodal fusion strategy. In this model, visual, audio, and text features are extracted from the video clips. A decision-level fusion strategy based on the theory of evidence is proposed to fuse the multiple classification results. To solve the problem of evidence conflict in evidence theory, we study a compatibility algorithm that corrects conflicting evidence based on the similarity matrix of the evidence. This approach is shown to improve the accuracy of emotion recognition.
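For readers unfamiliar with evidence-theory fusion, here is a minimal sketch of Dempster's rule of combination over singleton hypotheses, the operation such a decision-level fusion builds on. The paper's similarity-matrix conflict correction is not reproduced; the function and example names are illustrative.

```python
# Minimal sketch of Dempster's rule of combination for two mass functions
# defined over the same singleton hypotheses (emotion classes).
def dempster_combine(m1, m2):
    """m1, m2: dicts mapping class label -> belief mass (masses sum to 1)."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            if a == b:
                combined[a] = combined.get(a, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass assigned to incompatible hypotheses
    if conflict >= 1.0:
        raise ValueError("Total conflict: evidence cannot be combined")
    # Normalize by (1 - K), where K is the total conflict mass.
    return {a: m / (1.0 - conflict) for a, m in combined.items()}

# Example: fusing a visual classifier and an audio classifier.
visual = {"happy": 0.7, "sad": 0.2, "neutral": 0.1}
audio  = {"happy": 0.5, "sad": 0.4, "neutral": 0.1}
print(dempster_combine(visual, audio))  # "happy" dominates after fusion
```

High conflict mass K is exactly the failure mode the compatibility algorithm targets: as K approaches 1, the normalization amplifies whatever little agreement remains, which can produce counterintuitive fusion results.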
Understanding human facial expressions is a key step toward human-computer interaction. A facial expression, however, combines an expressive component, called facial behavior, with a person's neutral component. The most commonly used taxonomy for describing facial behaviors is the Facial Action Coding System (FACS), which segments the visible effects of facial muscle activation into 30+ action units (AUs). We introduce a method that recognizes AUs by extracting information about the expressive component through a de-expression learning procedure called De-expression Residue Learning (DeRL). First, we train a conditional Generative Adversarial Network (cGAN) to filter out the expressive information and generate the corresponding neutral face image. Then, we use the intermediate layers, which contain the action unit information, to recognize AUs. Our work alleviates the problems of AU recognition based on pixel-level differences, which are unreliable due to variations between images (e.g., rotation, translation, and lighting changes), and on feature-level differences, which are also unstable because the expression information may vary with identity. In the experiments, we use data augmentation to avoid overfitting and train a deep network to recognize AUs on the CK+ dataset. The results reveal that our work achieves more competitive performance than several other popular approaches.
Due to deep learning's heavy demand for data, semi-supervised learning (SSL) has very important application prospects because of its successful use of unlabeled data. Existing SSL algorithms have achieved high accuracy on the MNIST, CIFAR-10, and SVHN datasets, sometimes even outperforming fully supervised algorithms. However, because these three datasets have balanced categories and simple recognition tasks, properties that cannot be ignored for classification problems, the effectiveness of SSL remains uncertain on imbalanced datasets and specific recognition tasks. We analyze the datasets and find that "disgust" samples are scarcer than other categories in the expression dataset, as are "discussion" samples in the classroom action recognition dataset. Therefore, we experiment with a novel SSL model, the Deep Co-Training (DCT) model, on the expression recognition database (FER2013) as well as our own classroom student action database (BNU-LCSAD), and analyze the algorithm's effectiveness in specific application scenarios. Moreover, we adopt a training signal annealing (TSA) strategy when training our model to mitigate the overfitting that is more likely to occur when data categories are imbalanced. The experimental results prove the effectiveness of the SSL algorithm in practical applications and the significance of using TSA.
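A sketch of the TSA threshold schedule is shown below, assuming the linear/log/exp variants from the original training signal annealing formulation (Xie et al.'s UDA paper); which variant this work uses is not stated in the abstract.

```python
# Sketch of the training signal annealing (TSA) threshold schedule.
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    t = step / total_steps
    if schedule == "linear":
        alpha = t
    elif schedule == "log":
        alpha = 1.0 - math.exp(-5.0 * t)
    elif schedule == "exp":
        alpha = math.exp(5.0 * (t - 1.0))
    else:
        raise ValueError(schedule)
    # The threshold grows from 1/K (uninformative) to 1.0 over training.
    return alpha * (1.0 - 1.0 / num_classes) + 1.0 / num_classes

# Labeled examples whose correct-class probability already exceeds the
# threshold are masked out of the loss, so easy majority-class samples
# cannot dominate early training and cause overfitting.
eta = tsa_threshold(step=1000, total_steps=10000, num_classes=7)
```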
With the development and application of digital cameras, especially in education, a great number of digital video recordings are produced in classrooms. Taking Beijing Normal University as an example, 3.4 TB of video is recorded every day in more than 200 classrooms. Such huge data enables computer vision researchers to automatically recognize students' classroom actions and even evaluate the quality of classroom teaching. To focus action recognition on students, we propose the Beijing Normal University Large-scale Classroom Student Action Database version 1.0 (BNU-LCSAD), the first large-scale database for student action recognition, consisting of 10 classroom student action classes from digital camera recordings at BNU. We describe the construction and labeling process of this database in detail. In addition, we provide baseline student action recognition results on the new database using the C3D network.
Smile intensity estimation plays an important role in applications such as affective disorder prediction, life satisfaction prediction, and camera technique improvement. In recent studies, many researchers applied only traditional features, such as local binary patterns and local phase quantization (LPQ), to represent smile intensity. To improve the performance of spontaneous smile intensity estimation, we introduce a feature set that combines a saliency map (SM)-based handcrafted feature with non-low-level convolutional neural network (CNN) features. We take advantage of the opponent-color characteristic of SMs and of features from multiple convolutional levels, which are assumed to be mutually complementary. Experiments were conducted on the Binghamton-Pittsburgh 4D (BP4D) database and the Denver Intensity of Spontaneous Facial Action (DISFA) database. With local binary patterns on three orthogonal planes (LBP-TOP) as a baseline, the experimental results show that the CNN features estimate smile intensity better. Finally, by fusing the proposed SM-LBP-TOP feature with the mid- and high-level CNN features, we obtained the best results (52.08% on BP4D, 70.55% on DISFA), demonstrating that our hypothesis is reasonable: the SM-based handcrafted feature is a good supplement to CNNs in spontaneous smile intensity estimation.
Face recognition is clearly useful for capturing students' learning behaviors in class, which supports both teaching quality evaluation and individualized teaching. Accurate face detection is the first and necessary step in such an application, and it is challenging in the real setting of a classroom. Careful study shows that the particular camera positions in a classroom lead to varied poses and severe occlusion, problems that also occur in other indoor surveillance settings, such as large gatherings. In this paper, a forehead-based face detection model for such environments is proposed. The key idea is to obtain faces by detecting the forehead area, which sits relatively high and is rich in shape, color, and texture information, instead of using the common facial landmarks. The method consists of a first classifier based on an extended Haar-like feature and a second classifier based on a color feature called the Multi-Channel Color-Frequency Feature (MCCFF). To make the detector more efficient, we combine them in the same cascade framework. Experiments on BNULSVED, a database collected from real classes, show that the proposed approach is effective and efficient.
Human emotions are known to pass through four phases in the temporal domain: neutral, onset, apex, and offset. This structure has been demonstrated to be of great benefit for emotion recognition, so temporal segmentation has attracted considerable research interest. Although state-of-the-art techniques use recurrent neural networks to substantially increase performance, they ignore the relevance of individual frames (time steps) of a video and do not consider the changing contributions of different features when fusing them. We propose a framework called the dual-level attention-aware bidirectional gated recurrent unit, which integrates ideas from attention models to discover the most important frames and features for improving temporal segmentation. Specifically, it applies attention mechanisms at two levels: frame and feature. A significant advantage is that the two-level attention weights provide meaningful values depicting the importance of each frame and feature. Experiments demonstrate that the proposed framework outperforms state-of-the-art methods.
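A minimal sketch of the two attention levels on top of a bidirectional GRU might look as follows; the layer sizes, scoring functions, and clip-level pooling are illustrative assumptions rather than the authors' exact architecture.

```python
# Sketch of frame-level and feature-level attention around a BiGRU.
import torch
import torch.nn as nn

class DualAttentionBiGRU(nn.Module):
    def __init__(self, feat_dim=128, hidden=64, num_phases=4):
        super().__init__()
        self.feat_attn = nn.Linear(feat_dim, feat_dim)   # scores each feature dim
        self.gru = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.frame_attn = nn.Linear(2 * hidden, 1)       # scores each frame
        self.classifier = nn.Linear(2 * hidden, num_phases)

    def forward(self, x):                                # x: (batch, frames, feat_dim)
        w_feat = torch.sigmoid(self.feat_attn(x))        # feature-level weights
        h, _ = self.gru(x * w_feat)
        w_frame = torch.softmax(self.frame_attn(h), dim=1)  # frame-level weights
        context = (w_frame * h).sum(dim=1)               # weighted sum over frames
        return self.classifier(context), w_frame.squeeze(-1)

model = DualAttentionBiGRU()
logits, frame_weights = model(torch.randn(2, 30, 128))
# frame_weights exposes which frames the model deems most important.
```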
For education or management, it is often necessary to identify students by matching their identification (ID) photos against surveillance videos of college classrooms. This is a typical application of ID-photo-based single-sample-per-person face recognition (SSPP-ID). After analyzing the main challenges, we propose a framework that combines a deep learning method with a label propagation algorithm. It consists of three sequential steps: the first partitions the face image into several patches and extracts an unbalanced-patch-based feature using ConvNets; the second selects a few key frames using the log-likelihood ratio computed by a Joint Bayesian model; the last uses label propagation to spread the labels from the key frames to the whole video while simultaneously incorporating constraints in the temporal and feature spaces. The performance of the proposed method is evaluated on the Movie Trailer Face Dataset and on practical college class surveillance videos. Experiments with these challenging datasets validate the utility of the proposed method.
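As a rough sketch of the third step, the snippet below implements classic graph-based label propagation from a few labeled key frames to all frames. For brevity, the paper's separate temporal and feature-space constraints are folded into a single feature-affinity matrix, which is a simplification.

```python
# Sketch of graph-based label propagation (Zhou et al.-style iteration).
import numpy as np

def propagate_labels(features, labels, alpha=0.99, sigma=1.0, iters=50):
    """features: (n, d) frame features; labels: (n,) ints, -1 for unlabeled."""
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))       # pairwise frame affinity
    np.fill_diagonal(W, 0.0)
    D = np.diag(1.0 / np.sqrt(W.sum(1) + 1e-12))
    S = D @ W @ D                            # symmetrically normalized graph
    classes = sorted(set(labels) - {-1})
    Y = np.zeros((n, len(classes)))
    for i, y in enumerate(labels):
        if y != -1:
            Y[i, classes.index(y)] = 1.0     # clamp key-frame labels
    F = Y.copy()
    for _ in range(iters):                   # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * S @ F + (1 - alpha) * Y
    return np.array(classes)[F.argmax(1)]
```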
Constrained by physiology, the temporal dynamics of human behavior, whether facial movement or body gesture, are described by four phases: neutral, onset, apex, and offset. Although these segments can benefit related recognition tasks, they are not easy to detect accurately. We present an automatic temporal segment detection framework that uses bidirectional long short-term memory recurrent neural networks (BLSTM-RNN) to learn high-level temporal-spatial features, synthesizing local and global temporal-spatial information more efficiently. The framework is evaluated in detail on the face and body database (FABO). The comparison shows that the proposed framework outperforms state-of-the-art methods for temporal segment detection.
Face detection is important for face localization in face or facial expression recognition. The basic idea is to determine whether there is a face in an image, along with its location and size. It can be seen as a binary classification problem, which can be well solved by the support vector machine (SVM). Though the SVM has strong generalization ability, it has some limitations, which are analyzed in depth in this paper. To address them, we study the principles and characteristics of Multiple Kernel Learning (MKL) and propose an MKL-based face detection algorithm. We describe the proposed algorithm from the interdisciplinary perspective of machine learning and image processing. After analyzing the limitations of describing a face with a single feature, we apply several features; to fuse them well, we try different kernel functions on different features. The MKL method determines the weight of each kernel, yielding the face detection model at the core of the proposed method. Experiments on a public dataset and real-life face images are performed. We compare the performance of the proposed algorithm with single kernel-single feature and multiple kernels-single feature algorithms, illustrating its effectiveness.
Keywords: face detection, feature fusion, SVM, MKL
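To make the kernel-fusion idea concrete, the sketch below builds a combined Gram matrix k(x, y) = sum_m beta_m * k_m(x_m, y_m) from per-feature kernels. The kernel choices and the fixed weights are illustrative assumptions; a real MKL solver (e.g., SimpleMKL) would learn the weights jointly with the SVM.

```python
# Sketch of a convex combination of per-feature kernels for MKL-style fusion.
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def poly(X, Y, degree=2):
    return (X @ Y.T + 1.0) ** degree

def combined_kernel(feats_x, feats_y, betas=(0.6, 0.4)):
    # feats_x / feats_y: per-feature-type matrices (e.g., Haar-like, LBP),
    # each shaped (n_samples, feature_dim); betas would normally be learned.
    kernels = [rbf(feats_x[0], feats_y[0]), poly(feats_x[1], feats_y[1])]
    return sum(b * K for b, K in zip(betas, kernels))

# The combined Gram matrix can be passed to sklearn's SVC with
# kernel="precomputed" to train the fused face/non-face classifier.
```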
To study the learning behaviors of college students, we need to identify students in surveillance videos recorded in college classrooms. We analyze the main challenges of face recognition under classroom conditions and propose a face recognition method that combines multiple features with alignment preprocessing. Using more identification photos per student may improve recognition accuracy, but it increases computational complexity at the same time. Aiming at practical use, we study not only the accuracy but also the efficiency of the proposed method. We use practical classroom videos and standard face databases to explore the best feature fusion strategies and to demonstrate the universality of the method.
In the field of pedagogy and educational psychology, emotions are treated as very important factors that are closely associated with cognitive processes. Hence, it is meaningful for teachers to analyze students' emotions in classrooms and thereby adjust their teaching activities and promote students' individual development. To provide a benchmark for different expression recognition algorithms, a large collection of training and test data captured in classroom environments has become an acute need. In this paper, we present a multimodal spontaneous database collected in a real learning environment. To collect the data, students watched seven kinds of teaching videos while being filmed by a camera. Trained coders assigned one of five learning expression labels to each image sequence extracted from the captured videos. The subset consists of 554 multimodal spontaneous expression image sequences (22,160 frames) recorded in real classrooms. The database has four main advantages: 1) because it was recorded in real classroom environments, the subjects' distance from the camera and the lighting vary considerably between image sequences; 2) all the data are natural, spontaneous responses to teaching videos; 3) the database also contains nonverbal behavior, including eye movement, head posture, and gestures, from which a student's affective state during the courses can be inferred; and 4) the video sequences exhibit different kinds of temporal activation patterns. In addition, we demonstrate the high reliability of the labels for the image sequences using Cronbach's alpha.
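For reference, Cronbach's alpha over k coders can be computed as below; this is a generic implementation of the standard formula alpha = k/(k-1) * (1 - sum of per-coder variances / variance of summed scores), not the authors' code.

```python
# Generic Cronbach's alpha for inter-coder label reliability.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_sequences, n_coders) matrix of numeric labels."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1).sum()   # per-coder variances
    total_var = ratings.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (k / (k - 1.0)) * (1.0 - item_vars / total_var)
```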
It is well known that rapid building damage assessment is necessary for postdisaster emergency relief and recovery. Based on an analysis of very high-resolution remote-sensing images, we propose an automatic building damage assessment framework for rainfall- or earthquake-induced landslide disasters. The framework consists of two parts that implement landslide detection and the damage classification of buildings, respectively. In this framework, an approach based on modified object-based sparse representation classification and morphological processing is used for automatic landslide detection. Moreover, we propose a building damage classification model, which is a classification strategy designed for affected buildings based on the spectral characteristics of the landslide disaster and the morphological characteristics of building damage. The effectiveness of the proposed framework was verified by applying it to remote-sensing images from Wenchuan County, China, in 2008, in the aftermath of an earthquake. It can be useful for decision makers, disaster management agencies, and scientific research organizations.
Facial expression recognition in the wild is a very challenging task. We describe our work on static and continuous facial expression recognition in the wild. We evaluate the recognition results of gray and color deep features and explore the fusion of multimodal texture features. For continuous facial expression recognition, we design two temporal-spatial dense scale-invariant feature transform (SIFT) features and combine multimodal features to recognize expressions from image sequences. For static facial expression recognition based on video frames, we extract dense SIFT and several deep convolutional neural network (CNN) features, including those from our proposed CNN architecture. We train linear support vector machine and partial least squares classifiers for these features on the static facial expression in the wild (SFEW) and acted facial expression in the wild (AFEW) datasets, and we propose a fusion network to combine all the extracted features at the decision level. We achieve 56.32% on the SFEW test set and 50.67% on the AFEW validation set, much better than the baseline recognition rates of 35.96% and 36.08%.
Face detection and alignment are two crucial tasks in face recognition, which is a hot topic in the field of defense and security, concerning public safety and personal property as well as information and communication security. Recent approaches to these tasks fall into three types: template matching-based, knowledge-based, and machine learning-based, and they tend to be multi-step, computationally expensive, or lacking in robustness. After a deep analysis of a large number of Chinese face images without hats, we propose a novel face detection and coarse alignment method inspired by all three types: multi-feature fusion with the Simple Multiple Kernel Learning (Simple-MKL) algorithm. The proposed method is contrasted with competitive and related algorithms and is demonstrated to achieve promising results.
Landslide and mudflow detection is an important application of aerial and high-resolution remote sensing images, and it is crucial for national security and disaster relief. Since high-resolution images are often large, an efficient algorithm for landslide and mudflow detection is necessary. Based on the theory of sparse representation, we propose a novel automatic landslide and mudflow detection method that combines multi-channel sparse representation with an eight-neighbor judgment method. The whole detection process is fully automatic. We ran the experiment on a high-resolution image of Zhouqu County, Gansu Province, China, from August 2010 and obtained a promising result, which proves the effectiveness of sparse representation for landslide and mudflow detection.
In recent years, earthquakes and heavy rain have triggered more and more landslides, which have caused serious economic losses. Timely detection of the disaster area and assessment of the damage are necessary first steps for disaster mitigation and relief. As high-resolution satellite and aerial images have been widely used in environmental monitoring and disaster management, damage assessment by processing such images has become a research hotspot. The focus of this article is the rapid assessment of building damage caused by landslides using high-resolution satellite or aerial images. After analyzing the morphological characteristics of landslide disasters, we propose a set of criteria for rating building damage and design a semi-automatic evaluation system, which is applied to satellite and aerial image processing. The experiments demonstrate the effectiveness of our system.
This work proposes a weighted joint sparse representation (WJSR)-based classification method for robust alignment-free face recognition, in which an image is represented by a set of scale-invariant feature transform (SIFT) descriptors. The proposed method considers both the correlation and the reliability of the query descriptors. The reliability is measured by the similarity between the query descriptors and the atoms in the dictionary, and it is incorporated into the ℓ0\ℓ2-norm minimization to seek the optimal WJSR. Compared with related state-of-the-art methods, the performance is superior, as verified by experiments on benchmark face databases.
In recent years, sparse representation-based classification (SRC) has received significant attention due to its high recognition rate. However, the original SRC method requires rigid alignment, which is crucial for its application. Features such as SIFT descriptors have therefore been introduced into SRC, resulting in alignment-free methods. A feature-based dictionary, however, always contains considerable information that is useful for recognition. We explore the relationship between the similarity of SIFT descriptors and multitask recognition and propose a clustering-weighted SIFT-based SRC method (CWS-SRC). The proposed approach is considerably more suitable for multitask recognition with sufficient samples. Using two public face databases (AR and Yale) and a self-built car-model database, the performance of the proposed method is evaluated and compared with that of SRC, SIFT matching, and MKD-SRC. Experimental results indicate that the proposed method performs better in the alignment-free scenario with sufficient samples.
The sparse representation classification method has been increasingly used in computer vision and pattern analysis due to its high recognition rate, little dependence on features, and robustness to corruption and occlusion. However, most existing methods aim to find the sparsest representation of the test sample y in an overcomplete dictionary and do not particularly consider the relevant structure between the atoms in the dictionary. Moreover, the sparse representation method always requires sufficient training samples for effective recognition. In this paper, we formulate classification as a group-structured sparse representation problem via sparsity-inducing norm minimization and propose a novel sparse representation-based automatic target recognition (ATR) framework for practical applications in which the training samples are drawn from simulation models of real targets. The experimental results show that the proposed approach improves the recognition rate of standard sparse models and that our system can effectively and efficiently recognize targets in real environments while preserving the good characteristics of sparse representation-based classification.
This paper applies sparse representation-based classification (SRC) to face recognition with disguise or illumination variation. Having analyzed the characteristics of general object recognition and the principle of the SRC classifier, we focus on evaluating blocks of a probe sample and propose an optimized SRC method based on position-preserving weighted blocks and a maximum likelihood model. The principle and implementation of the proposed method are introduced, and experiments on the Yale and AR face databases are reported. The experimental results show that the proposed optimized SRC method performs better than existing methods.
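For context, the plain SRC decision rule that such block-weighted variants build on can be sketched as follows, with orthogonal matching pursuit standing in for the ℓ1 solvers typically used; the function names and parameters are illustrative.

```python
# Sketch of the basic SRC decision rule: sparse-code the probe over the
# stacked training dictionary, then pick the class whose atoms reconstruct
# it with the smallest residual.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_classify(D, labels, y, n_nonzero=10):
    """D: (d, n) column-normalized dictionary; labels: (n,); y: (d,) probe."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(D, y)
    x = omp.coef_
    residuals = {}
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)          # keep class-c coefficients
        residuals[c] = np.linalg.norm(y - D @ xc)   # class-wise reconstruction
    return min(residuals, key=residuals.get)
```

The position-preserving weighted-block variant described above would apply this rule per block and weight each block's vote, rather than coding the whole face at once.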
Texture classification is a fundamental yet difficult task in machine vision and image processing. In recent years, the sparse representation-based classification (SRC) method and its associated dictionary design have drawn increasing attention in the pattern recognition community, owing to SRC's high recognition rate, robustness to corruption and occlusion, and little dependence on features. In this paper, we present a discriminative dictionary learning approach and apply it within the SRC framework to image texture representation and classification. Experimental results on different test data demonstrate the promise of our new approach compared with previous algorithms.
This paper applies sparse representation-based classification (SRC) to general objects of a certain scale. We analyze the characteristics of general object recognition, propose a position-weighted block dictionary (PWBD) based on sparse representation, and design an SRC framework around it (PWBD-SRC). The principle and implementation of PWBD-SRC are introduced, and experiments on car models are reported. The experimental results show that the PWBD not only reduces the dictionary scale effectively but also reflects, to a certain extent, the roles that image blocks play in representing a whole image. In recognition applications, an image containing only part of an object can be identified with PWBD-SRC, and robustness to rotation and perspective changes is achieved. Finally, some remaining problems are briefly discussed.
Since target recognition is often applied outdoors under natural conditions, the presence and variation of illumination cannot be neglected. When common target recognition algorithms are applied to images under diverse illumination, the results are undesirable. We therefore apply Retinex theory to amend a target recognition algorithm based on wavelet moments. Applying the amended algorithm to marine images, the experimental results show a notable improvement.
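The illumination-normalization idea behind such a Retinex amendment can be sketched as single-scale Retinex: reflectance ≈ log(image) − log(Gaussian-blurred image). The snippet below is a generic implementation under that assumption, not the authors' exact algorithm.

```python
# Generic single-scale Retinex: suppress slowly varying illumination
# before feature extraction or recognition.
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(img, sigma=80.0):
    img = img.astype(np.float64) + 1.0          # avoid log(0)
    illumination = gaussian_filter(img, sigma)  # low-frequency illumination estimate
    r = np.log(img) - np.log(illumination)      # log-domain reflectance
    # Rescale to [0, 255] for display or downstream processing.
    r = (r - r.min()) / (r.max() - r.min() + 1e-12)
    return (255 * r).astype(np.uint8)
```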
In recent decades, hyperspectral images (HSIs) have been widely exploited in many fields for the rich information they contain. Many algorithms have been proposed for endmember extraction, among which the vertex component analysis (VCA) algorithm offers better precision and lower complexity. However, the endmembers extracted from the same HSI by the traditional VCA algorithm are not always the same across runs. After a deep analysis, we propose an improved VCA algorithm to resolve this shortcoming. For verification, experiments and a comparative study were performed; the improved VCA algorithm shows higher efficiency and accuracy than the traditional one.
In this paper, we use the aspect ratio of a vessel for recognition and classification in overhead images. For aspect ratio extraction, a morphology-based local adaptive thresholding method is applied to obtain a more accurate outline. By applying Radon transforms to the minimum bounding rectangle regions of the extracted outlines, the central axis of each vessel can be obtained. The aspect ratio of a vessel can then be accurately calculated by scanning the boundary contour of each target with lines along and perpendicular to the central axis. If remote sensing information such as the shooting height and pitch angle is also considered, the real dimensions of a vessel can be calculated as well.
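A sketch of the Radon-based axis estimation might look as follows; the variance-peak heuristic and the use of scikit-image are assumptions for illustration, not necessarily the paper's procedure.

```python
# Estimate the dominant (central-axis) orientation of an elongated vessel
# mask via the Radon transform: the angle whose projection is most peaked
# aligns with the long axis.
import numpy as np
from skimage.transform import radon

def central_axis_angle(binary_mask):
    """binary_mask: 2-D array, nonzero inside the extracted vessel outline."""
    theta = np.arange(180.0)                  # probe all orientations in degrees
    sinogram = radon(binary_mask.astype(float), theta=theta, circle=False)
    # The sharpest (highest-variance) projection marks the dominant axis.
    return theta[np.argmax(sinogram.var(axis=0))]

# With the axis angle known, scanning boundary contours along and
# perpendicular to it yields the length and width, hence the aspect ratio.
```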
Container inspection systems are characterized by a greatly changing dynamic range, geometric distortion, counting fluctuation, and interference data. This paper introduces an approach that generates two-view images for comparison by means of image data acquisition, a method to reconstruct 3D views, and a processing technology with a special image correction algorithm: the acquired image data are corrected first, and then the image gray datum line and contrast are adjusted in combination with other image processing methods. This greatly improves the image quality of a Cobalt-60-based inspection system compared with ordinary image processing methods.
KEYWORDS: Personal digital assistants, Geographic information systems, General packet radio service, Global Positioning System, Signal processing, Navigation systems, Data centers, Telecommunications, Data processing, Mobile communications
Though personal navigation systems based on PDAs are convenient for individuals, they do not satisfy groups with special purposes. Therefore, a real-time geographical information exchange system based on PDAs is presented in this article. The structure and elements of the system are described, and an experimental example is given that proves the effectiveness of the system.
In this paper, we provide a segmentation-based Retinex for improving the visual quality of aerial images obtained under complex weather conditions. With this method, an aerial image is first segmented into different regions, and then an adaptive Gaussian based on the segmentation is used to process each region. The method addresses problems of previously developed Retinex algorithms, such as halo artifacts and graying-out artifacts. The experimental results also show its superior effect.