This paper presents an innovative super-resolution (SR) method for Optical Coherence Tomography (OCT), enhancing image resolution and reducing noise without retraining for different scales. Traditional SR techniques (interpolation-, reconstruction-, and learning-based) are surpassed by our approach, which combines a "shifted steered mixture of experts" with an autoencoder. This method outperforms the latest algorithms in both subjective and objective evaluations, including PSNR and perceptual metrics. A distinctive feature is the adjustable sharpness, enabling targeted edge sharpening or defocusing through adjustments of the kernel experts' bandwidths. This adaptability removes the need for data-specific retraining, offering a robust solution to improve OCT image quality and medical imaging analysis.
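For illustration, a minimal sketch of the core idea: an SMoE model is continuous in the pixel coordinates, so any magnification amounts to sampling a denser grid, and sharpness can be tuned by scaling the kernel bandwidths. All names, parameter values, and the sharpness convention below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def smoe_reconstruct(coords, centers, covs, weights, experts, sharpness=1.0):
    """Evaluate a 2D SMoE model at arbitrary (continuous) coordinates.

    coords:  (P, 2) query positions; any grid density -> any SR scale
    centers: (K, 2) kernel centers
    covs:    (K, 2, 2) steering covariance matrices
    weights: (K,) mixing weights
    experts: (K,) constant expert values (gray levels)
    sharpness > 1 narrows the kernels (sharper edges), < 1 defocuses.
    """
    gates = np.empty((len(coords), len(centers)))
    for k, (mu, cov) in enumerate(zip(centers, covs)):
        # Scaling the covariance by 1/sharpness adjusts kernel bandwidth.
        prec = np.linalg.inv(cov / sharpness)
        d = coords - mu
        gates[:, k] = weights[k] * np.exp(-0.5 * np.einsum('pi,ij,pj->p', d, prec, d))
    gates /= gates.sum(axis=1, keepdims=True)   # softmax-like gating
    return gates @ experts

# Toy model; 4x denser sampling of the unit square, no retraining needed.
centers = np.array([[0.3, 0.3], [0.7, 0.7]])
covs = np.array([np.eye(2) * 0.02, np.eye(2) * 0.02])
ys, xs = np.mgrid[0:1:256j, 0:1:256j]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
img = smoe_reconstruct(coords, centers, covs, np.ones(2),
                       np.array([0.2, 0.9]), sharpness=2.0).reshape(256, 256)
```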
In Optical Coherence Tomography (OCT), speckle noise significantly hampers image quality, affecting diagnostic accuracy. Current methods, including traditional filtering and deep learning techniques, have limitations in noise reduction and detail preservation. Addressing these challenges, this study introduces a novel denoising algorithm, Block-Matching Steered-Mixture of Experts with Multi-Model Inference and Autoencoder (BM-SMoE-AE). This method combines a block-matched implementation of the SMoE algorithm with an enhanced autoencoder architecture, offering efficient speckle noise reduction while retaining critical image details. Our method stands out by providing improved edge definition and reduced processing time. Comparative analysis with existing denoising techniques demonstrates the superior performance of BM-SMoE-AE in maintaining image integrity and enhancing OCT image usability for medical diagnostics.
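A rough sketch of the block-matching stage alone, assuming plain SSD matching in a local search window (function and parameter names are hypothetical; the joint SMoE/autoencoder modeling of the matched stack is not shown):

```python
import numpy as np

def match_blocks(img, ref_xy, bsize=8, search=12, n_best=8):
    """Collect the n_best blocks most similar (by SSD) to the reference
    block, searched within a local window around it."""
    y0, x0 = ref_xy
    ref = img[y0:y0 + bsize, x0:x0 + bsize].astype(np.float64)
    candidates = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if 0 <= y <= img.shape[0] - bsize and 0 <= x <= img.shape[1] - bsize:
                blk = img[y:y + bsize, x:x + bsize]
                candidates.append((np.sum((blk - ref) ** 2), y, x))
    candidates.sort(key=lambda t: t[0])   # best (smallest SSD) first
    return np.stack([img[y:y + bsize, x:x + bsize]
                     for _, y, x in candidates[:n_best]])
```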
Research in past years introduced Steered Mixture-of-Experts (SMoE) as a framework to form sparse, edge-aware models for 2D and higher-dimensional pixel data, applicable to compression, denoising, and beyond, and capable of competing with state-of-the-art compression methods. To circumvent the computationally demanding, iterative optimization method used in prior works, an autoencoder design is introduced that reduces the run-time drastically while simultaneously improving reconstruction quality for block-based SMoE approaches. Coupling a deep encoder network with a shallow, parameter-free SMoE decoder enforces an efficient and explainable latent representation. Our initial work on the autoencoder design presented a simple model with limited applicability to compression and beyond. In this paper, we build on the foundation of the first autoencoder design and improve the reconstruction quality by expanding it to models of higher complexity and different block sizes. Furthermore, we improve the noise robustness of the autoencoder for SMoE denoising applications. Our results reveal that the newly adapted autoencoders allow ultra-fast estimation of parameters for complex SMoE models with excellent reconstruction quality, both for noise-free input and under severe noise. This enables the SMoE image model framework for a wide range of image processing applications, including compression, noise reduction, and super-resolution.
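As a sketch of the overall design, a deep encoder can regress per-block kernel parameters while the decoder remains a fixed, parameter-free SMoE reconstruction. Layer sizes, the precision parameterization via Cholesky factors, and all hyperparameters below are assumptions for illustration:

```python
import torch
import torch.nn as nn

K, B = 4, 8   # kernels per block, block size (illustrative values)

class SMoEAutoencoder(nn.Module):
    """Deep encoder predicts SMoE kernel parameters; the decoder is a fixed,
    parameter-free SMoE reconstruction over the block's pixel grid."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * B * B, K * 6),  # per kernel: 2D mean, 3 Cholesky terms, expert value
        )
        g = torch.linspace(0, 1, B)
        self.register_buffer(
            'grid', torch.stack(torch.meshgrid(g, g, indexing='xy'), -1).reshape(-1, 2))

    def forward(self, x):                  # x: (N, 1, B, B)
        p = self.encoder(x).view(-1, K, 6)
        mu, chol, expert = p[..., :2], p[..., 2:5], p[..., 5]
        # Positive-definite steering via lower-triangular Cholesky factor L.
        L = torch.zeros(*chol.shape[:2], 2, 2, device=x.device)
        L[..., 0, 0] = torch.nn.functional.softplus(chol[..., 0])
        L[..., 1, 1] = torch.nn.functional.softplus(chol[..., 1])
        L[..., 1, 0] = chol[..., 2]
        d = self.grid[None, :, None, :] - mu[:, None, :, :]   # (N, P, K, 2)
        e = torch.einsum('npki,nkij->npkj', d, L)
        gates = torch.softmax(-0.5 * (e ** 2).sum(-1), dim=-1)  # parameter-free gating
        return (gates * expert[:, None, :]).sum(-1).view(-1, 1, B, B)
```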
KEYWORDS: Video, Motion models, Education and training, Video coding, Image compression, Video compression, Data modeling, Video processing, Modeling, Image restoration
Steered-Mixtures-of-Experts (SMoE) present a unified framework for sparse representation and compression of image data with arbitrary dimensionality. Recent work has shown great improvements in the performance of such models for image and light-field representation. For videos, however, the straightforward application yields limited success, as the SMoE framework leads to a piecewise-linear representation of the underlying imagery that is disrupted by nonlinear motion. We incorporate a global motion model into the SMoE framework which allows for improved temporal steering of the kernels. This drastically increases its capability to exploit correlations between adjacent frames: adding only 2 to 8 motion parameters per frame to the model decreases the required number of kernels on average by 54.25% while maintaining the same reconstruction quality, yielding higher compression gains. By halving the number of necessary kernels, we achieve a significant reduction in complexity on the decoder side, a crucial step towards real-time processing.
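Conceptually, the global motion model can be thought of as warping the spatial kernel positions from frame to frame with a single low-parameter transform; a hedged sketch (the warp convention and parameter layout are assumptions):

```python
import numpy as np

def warp_centers(centers_xy, h):
    """Apply a global perspective motion model (up to 8 parameters per frame)
    to the spatial components of the kernel centers.

    h = (h0..h7); a pure translation uses only h2 and h5 with
    h = [1, 0, tx, 0, 1, ty, 0, 0]."""
    x, y = centers_xy[:, 0], centers_xy[:, 1]
    denom = h[6] * x + h[7] * y + 1.0
    xw = (h[0] * x + h[1] * y + h[2]) / denom
    yw = (h[3] * x + h[4] * y + h[5]) / denom
    return np.stack([xw, yw], axis=1)
```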
Steered Mixture-of-Experts (SMoE) is a novel framework for representing multidimensional image modalities. In this paper, we propose a coding methodology for SMoE models that is readily extendable to SMoE models of any dimension, thus representing any image modality. We evaluate the coding performance on SMoE models of light field video, a 5D image modality with time, two angular, and two spatial dimensions. The coding exploits the redundancy between the parameters of SMoE models, i.e. a set of multivariate Gaussian distributions. We compare against three multi-view HEVC (MV-HEVC) configurations that differ in terms of random access, where each subaperture view of the light field video is interpreted as a single view in MV-HEVC. Experiments validate excellent coding performance compared to MV-HEVC for low- to mid-range bitrates in terms of PSNR and SSIM, with bitrate savings of up to 75%.
In this paper, we address the handling of independently moving objects (IMOs) in automatic 2D to stereoscopic 3D
conversion systems based on structure-from-motion (SfM) techniques. Exploiting the different viewing positions of a
moving camera, these techniques yield excellent 3D results for static scene objects. However, the independent motion of
any foreground object requires a separate conversion process. We propose a novel segmentation approach that estimates
the occluded static background and segments the IMOs based on advanced change detection. The background estimation
is achieved by applying 2D registration and blending techniques, representing an approximation of the underlying scene
geometry. The segmentation process itself uses anisotropic filtering applied to the difference image between the original
frame and the estimated background frame. In order to render the segmented objects into the automatically generated 3D
scene properly, a small amount of user interaction will be necessary, e.g. an assignment of intra-object depth or the
object's absolute z-position. Experiments show that the segmentation method achieves accurate mask results for a
variety of scenes, similar to the masks obtained manually using state-of-the-art rotoscoping tools. Thus, this work
contributes to the extension of SfM-based automatic 3D conversion methods for application to dynamic scenes.
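A simplified sketch of such a background-estimation and change-detection pipeline, assuming grayscale frames, precomputed homographies, median blending, and a plain median blur standing in for the anisotropic filtering:

```python
import cv2
import numpy as np

def estimate_background(frames, homographies, ref_shape):
    """Warp all grayscale frames into the reference view and blend them;
    the static background dominates the per-pixel median, so independently
    moving objects are averaged out."""
    warped = [cv2.warpPerspective(f, H, ref_shape[::-1])
              for f, H in zip(frames, homographies)]
    return np.median(np.stack(warped), axis=0).astype(np.uint8)

def segment_imo(frame, background, thresh=25):
    """Change detection on the frame/background difference image."""
    diff = cv2.absdiff(frame, background)
    diff = cv2.medianBlur(diff, 5)   # stand-in for anisotropic filtering
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```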
For object analysis in videos such as in video surveillance systems, the preliminary segmentation step is very
important. Many segmentation methods using a static camera have been proposed in the last decade, but they
all suffer when object reflections occur, especially on the ground, i.e. reflected regions are also segmented
as foreground. We present a new method which detects the border between the real object and its reflection.
Experiments show that an outstanding improvement of segmentation results is obtained by removing the
reflection part of the over-segmented objects.
Forward error correction (FEC) improves the quality of compressed video transmitted over a lossy network for real-time applications such as video streaming. FEC techniques are generally applied to source video packets at the frame level. In this paper, we propose a technique where the FEC is applied to source packets at the group-of-pictures (GoP) level, assuming an MPEG-like compression scheme. We derive analytically an estimate of the average playable frame rate for a given packet loss probability. Our analysis over a range of network conditions indicates that in most practical network conditions, the proposed technique provides a larger playable frame rate than frame-level FEC. The analytical results are also validated by video streaming simulations conducted on the NS-2 network simulator.
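For intuition, with a systematic RS-style code an entire GoP is decodable as long as at most n−k of its n packets are lost; a toy computation under i.i.d. packet loss (values are examples, and equating GoP decodability with playability is a simplification of the paper's analysis):

```python
from math import comb

def p_decodable(n, k, p):
    """Probability that at least k of n packets arrive (RS(n,k)-style FEC),
    i.e. at most n-k losses, with i.i.d. packet loss probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n - k + 1))

# GoP-level FEC: one code spanning all source packets of a GoP.
k_src, n_fec, p_loss, fps = 30, 34, 0.05, 30.0
rate_gop = fps * p_decodable(n_fec, k_src, p_loss)
print(f"GoP-level expected playable frame rate ~ {rate_gop:.1f} fps")
# Frame-level FEC with the same overhead protects each frame separately,
# but a frame is playable only if all frames it predicts from are too.
```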
We present a robust method of low computational complexity to estimate the physical camera parameters, intrinsic and extrinsic, for scene shots captured by cameras applying pan, tilt, rotation, and zoom. These parameters are then used to split a sequence of frames into several subsequences in an optimal way to generate multiple sprites. Here, optimal means minimal memory usage while keeping or even improving the reconstruction quality of the scene background. Since wide angles between two frames of a scene shot cause geometrical distortions under a perspective mapping, it is necessary to split the shot into several subsequences. In our approach it is not mandatory that all frames of a subsequence are adjacent frames in the original scene. Furthermore, the angle-based classification allows frame reordering, which makes our approach very powerful.
Speaker change detection (SCD) is a preliminary step for many audio applications such as speaker segmentation
and recognition. Thus, its robustness is crucial to achieving good performance in the later steps. In particular,
misses (false negatives) degrade the results. For some applications, domain-specific characteristics can be used to
improve the reliability of the SCD. In broadcast news and discussions, the co-occurrence of shot boundaries and
change points provides a robust clue for speaker changes.
In this paper, two multimodal approaches are presented that utilize the results of a shot boundary detection
(SBD) step to improve the robustness of the SCD. Both approaches clearly outperform the audio-only approach
and are exclusively applicable to TV broadcast news and plenary discussions.
Image- and video-based rendering technologies are receiving growing attention due to their photo-realistic rendering capability for free viewpoints. However, their sampling-based mechanism suffers from two major limitations: ghosting and blurring. Scene geometry that supports the selection of accurate sampling positions can be obtained with a global method (i.e. an approximate depth plane) or a local method (i.e. disparity estimation). This paper focuses on the local method, since it can yield more accurate rendering quality without a large number of cameras. Local scene geometry has two difficulties: low geometrical density and uncovered areas containing hidden information. Both are serious obstacles to reconstructing an arbitrary viewpoint without aliasing artifacts. To solve these problems, we propose an anisotropic diffusive resampling method based on tensor theory. Isotropic low-pass filtering accomplishes anti-aliasing of the scene geometry, while anisotropic diffusion prevents the filtering from blurring visual structures. Apertures in coarse samples are estimated by diffusion on the pre-filtered space, where nonlinear weighting of gradient directions suppresses the amount of diffusion. Aliasing artifacts caused by low sample density are efficiently removed by isotropic filtering, and edge blurring is avoided by the anisotropic method in the same process. Because sampling gaps differ in size, the resampling condition is defined considering the causality between filter scale and edges. Using a partial differential equation (PDE) in Gaussian scale-space, we iteratively achieve coarse-to-fine resampling. At a large scale, apertures and uncovered holes can be overcome because only strong and meaningful boundaries are selected at that resolution. The coarse-level resampling at a large scale is then iteratively refined to recover detailed scene structure. Simulation results show marked improvements in rendering quality.
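The edge-stopping diffusion described here is in the spirit of the classic Perona-Malik scheme; the following generic sketch shows that scheme (conductance function and constants are illustrative, not the paper's tensor-based formulation):

```python
import numpy as np

def anisotropic_diffusion(u, n_iter=20, kappa=0.1, lam=0.2):
    """Perona-Malik-style diffusion: smooth flat regions (anti-aliasing)
    while the edge-stopping function suppresses diffusion across edges."""
    u = u.astype(float).copy()
    for _ in range(n_iter):
        dn = np.roll(u, -1, 0) - u   # forward differences to 4 neighbors
        ds = np.roll(u,  1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u,  1, 1) - u
        # Nonlinear weighting of gradients: small conductance at strong edges.
        g = lambda d: np.exp(-(d / kappa) ** 2)
        u += lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u
```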
This paper presents a novel approach for automatic and robust object detection. It utilizes a component-based approach that combines techniques from both the statistical and structural pattern recognition domains. While the component detection relies on Haar-like features and an AdaBoost-trained classifier cascade, the topology verification is based on graph matching techniques. The system was applied to face detection, and the experiments show its outstanding performance in comparison to other face detection approaches; especially in the presence of partial occlusions, uneven illumination, and out-of-plane rotations, it yields higher robustness.
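A rough OpenCV sketch of the two stages, cascade-based component detection followed by a topology check; the bundled cascade files and the simple plausibility rule below stand in for the paper's trained detectors and graph matching:

```python
import cv2

face_cc = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cc = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye.xml')

def detect_faces(gray):
    """Detect face candidates, then verify component topology (simplified:
    at least two eye components must lie in the upper half of the face)."""
    results = []
    for (x, y, w, h) in face_cc.detectMultiScale(gray, 1.1, 5):
        roi = gray[y:y + h, x:x + w]
        eyes = eye_cc.detectMultiScale(roi, 1.1, 5)
        ok = sum(1 for (ex, ey, ew, eh) in eyes if ey + eh / 2 < h / 2) >= 2
        if ok:
            results.append((x, y, w, h))
    return results
```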
KEYWORDS: RGB color model, Video, Image segmentation, Reflectivity, Motion models, Data modeling, Video surveillance, Cameras, Space operations, Video compression
A new segmentation approach usable for fixed or motion-compensated cameras is described. Instead of the often-used RGB color space, we operate with the invariant Gaussian color model proposed by Geusebroek, combined with temporal information that eliminates unsteady regions surrounding the moving objects. To our knowledge, the Gaussian color model has not previously been used in video segmentation. Comparisons with state-of-the-art methods, using both subjective and objective evaluation, prove the good performance of the proposed method.
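The Gaussian color model is commonly computed as a linear transform of RGB; the matrix below is the approximation usually attributed to Geusebroek et al., quoted from memory and therefore to be treated as an assumption:

```python
import numpy as np

# RGB -> (E, E_lambda, E_lambda_lambda): intensity plus two spectral
# derivatives, the measurable basis of the Gaussian color model.
RGB2E = np.array([[0.06,  0.63,  0.27],
                  [0.30,  0.04, -0.35],
                  [0.34, -0.60,  0.17]])

def gaussian_color_model(img_rgb):
    """img_rgb: (H, W, 3) float array; returns the three model channels."""
    return img_rgb @ RGB2E.T
```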
KEYWORDS: RGB color model, Video, Image segmentation, Video compression, Fuzzy logic, Cameras, Video surveillance, Visualization, Quality measurement, Video processing
In the case of a static or motion-compensated camera, static background segmentation methods can be applied to
segment the foreground objects of interest from the background. Although many methods have been proposed,
a general assessment of the state of the art is not available. An important issue is to compare various
state-of-the-art methods in terms of quality (accuracy) and computational complexity (time and memory consumption).
A representative set of recent techniques is chosen, implemented, and compared. An extensive set of videos,
both indoor and outdoor with different environmental conditions, is used to achieve comprehensive results.
While visual analysis is used for subjective assessment of the quality, pixel-based measures based on available
ground truth data are used for the objective assessment. Furthermore, the computational complexity is estimated
by measuring the elapsed time and memory requirements of each algorithm. The paper summarizes the experiments
and considers the assets and drawbacks of the various techniques. Moreover, it gives hints for selecting the
optimal approach for a specific environment and directions for further research in this field.
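The pixel-based measures against ground truth typically reduce to confusion-matrix statistics; a minimal sketch (the exact measures used in the paper may differ):

```python
import numpy as np

def pixel_measures(mask, gt):
    """Precision, recall, and F1 of a boolean foreground mask vs. a boolean
    ground-truth mask of the same shape."""
    tp = np.sum(mask & gt)
    fp = np.sum(mask & ~gt)
    fn = np.sum(~mask & gt)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```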
In this paper, we present an automatic extraction of goal events in soccer videos using audio track features alone, without relying on expensive-to-compute video track features. The extracted goal events can be used for high-level indexing and selective browsing of soccer videos. The detection of soccer video highlights using audio content comprises three steps: 1) extraction of audio features from a video sequence, 2) candidate detection of highlight events based on the information provided by the feature extraction methods and a Hidden Markov Model (HMM), 3) goal event selection to finally determine the video intervals to be included in the summary. For this purpose we compared the performance of the well-known Mel-scale Frequency Cepstral Coefficients (MFCC) feature extraction method vs. the MPEG-7 Audio Spectrum Projection (ASP) feature extraction method, based on three different decomposition methods, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NMF). To evaluate our system we collected five soccer game videos from various sources, in total seven hours of soccer games comprising eight gigabytes of data. One of the five soccer games is used as training data for the audio classes (e.g., announcers' excited speech, audience ambient speech noise, audience clapping, environmental sounds). Our goal event detection results are encouraging.
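A compact sketch of the MFCC-plus-HMM part of such a pipeline, using librosa and hmmlearn as stand-ins for the paper's feature extraction and HMM training (sampling rate and model sizes are assumptions):

```python
import librosa
import numpy as np
from hmmlearn import hmm

def train_class_hmm(wav_paths, n_states=3):
    """Fit one Gaussian HMM per audio class (e.g. excited speech, clapping)
    on MFCC features extracted from that class's training clips."""
    feats = [librosa.feature.mfcc(y=librosa.load(p, sr=16000)[0],
                                  sr=16000, n_mfcc=13).T
             for p in wav_paths]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag', n_iter=20)
    model.fit(np.concatenate(feats), lengths=[len(f) for f in feats])
    return model

# Classification: score a segment with each class HMM and pick the best, e.g.
# label = max(models, key=lambda c: models[c].score(segment_mfcc))
```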
KEYWORDS: Video coding, Video, Automatic repeat request, Quantization, Forward error correction, Data compression, Distortion, Video processing, Signal attenuation, Computer programming
Multiple Description Video Coding (MDC) and Layered Coding (LC) are both error-resilient source coding techniques used for transmission over error-prone channels. Both techniques generate multiple streams. The streams generated by MDC correspond to different descriptions of the same source, whereas the streams produced by LC are differentiated as base and enhancement layer streams. Moreover, whereas the MDC streams are independently decodable, the decoding of the enhancement layer stream depends on the decoding of the base layer stream. In this work we concentrate on specific MDC and LC schemes, i.e. Multi-State Video Coding (MSVC) and Temporal Layered Coding (TLC). MSVC was introduced by John Apostolopoulos, who showed that if each frame is transmitted in a separate packet and if the motion information for each lost frame is also lost, MSVC outperforms Single Layer Coding (SC) in recovering from single as well as burst losses. Here we compare MSVC with TLC, as an extension of SC, based on transmission simulations over lossy channels under the assumption that the motion vectors are always available. Using different coding modes and specific reconstruction methods, the average reconstructed frame PSNR (peak signal-to-noise ratio) is measured and compared. Results show that when motion vectors are received, TLC performs better than MSVC for every coding option tested. The performance difference is larger for low-motion sequences.
This paper presents a novel approach to human body posture recognition based on the MPEG-7 contour-based shape descriptor and the widely used projection histogram. A combination of the two is used to recognize the main posture and the view of a human based on the binary object mask obtained by the segmentation process. The recognition is treated as a typical pattern recognition task and is carried out through a hierarchy of classifiers. Various structures, both hierarchical and non-hierarchical, in combination with different classifiers, are compared to each other with respect to recognition performance and computational complexity. Based on this, an optimal system design with recognition rates of 95.59% for the main posture, 77.84% for the view, and 79.77% for the combination is achieved.
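Projection histograms of the binary mask are straightforward to compute; a minimal sketch (the fixed feature length and normalization are assumptions):

```python
import numpy as np

def projection_histograms(mask, n_bins=32):
    """Horizontal/vertical projections of a binary object mask, resampled
    to fixed-length, scale-normalized feature vectors."""
    ys, xs = np.nonzero(mask)
    crop = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h_proj = crop.sum(axis=1).astype(float)   # per-row foreground counts
    v_proj = crop.sum(axis=0).astype(float)   # per-column foreground counts
    resample = lambda v: np.interp(np.linspace(0, len(v) - 1, n_bins),
                                   np.arange(len(v)), v)
    feat = np.concatenate([resample(h_proj), resample(v_proj)])
    return feat / feat.max()
```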
In this paper, dimension-reduced, decorrelated spectral features for general sound recognition are applied to segment conversational speech from both broadcast news audio and panel discussion television programs. Without a priori information about the number of speakers, the audio stream is segmented by a hybrid metric-based and model-based segmentation algorithm. To measure performance, we compare the segmentation results of the hybrid method versus metric-based segmentation, with both the MPEG-7 standardized features and Mel-scale Frequency Cepstrum Coefficients (MFCC). Results show that the MFCC features yield better performance compared to the MPEG-7 features. The hybrid approach significantly outperforms direct metric-based segmentation.
In this paper, we present a classification and retrieval technique targeted at the retrieval of home video abstracts using dimension-reduced, decorrelated spectral features of the audio content. The feature extraction based on MPEG-7 descriptors consists of three main stages: Normalized Audio Spectrum Envelope (NASE), a basis decomposition algorithm, and basis projection, obtained by multiplying the NASE with a set of extracted basis functions. A classifier based on continuous hidden Markov models is applied. For accurate retrieval, the system uses a two-level hierarchy of speech recognition and sound classification. To measure performance, we compare the classification results of the MPEG-7 standardized features vs. Mel-scale Frequency Cepstrum Coefficients (MFCC). Results show that the MFCC features yield better performance compared to the MPEG-7 features.
Multiple State Video Coding (MSVC) is a multiple description coding scheme where the video is coded into multiple independently decodable streams, each with its own prediction process and state. The system subject to this work is composed of two subsystems: 1) multiple state encoding/decoding, 2) a path diversity transmission system. In [1] we discussed how to optimize the rate allocation of such a system, maximizing the average reconstructed frame PSNR at the decoder while minimizing the PSNR variations between the streams, given the total bitrate RT and the balanced (equal) or unbalanced (unequal) loss probabilities p1 and p2 over the two paths. In our current work we establish a theoretical framework to estimate the rate-(decoder) distortion (R-Dd) function, taking into account the MSVC structure, the rate allocation, channel impairments, and reconstruction strategies. The video sequence is modeled as an AR(1) source, and the distortion associated with each reconstructed frame in both threads in a lossy transmission environment is estimated recursively depending on the system parameters.
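As a generic illustration of recursive decoder-distortion estimation for an AR(1) source with previous-frame concealment (not the paper's derived R-Dd model; all symbols are illustrative):

```python
def expected_decoder_distortion(n_frames, p, d_q, sigma2, rho):
    """Toy recursion: a received frame contributes quantization distortion
    d_q; a lost frame is concealed from its predecessor, so its distortion
    adds the AR(1) innovation variance (1 - rho^2) * sigma2 on top of the
    predecessor's expected distortion."""
    d = [d_q]
    for _ in range(1, n_frames):
        d_lost = d[-1] + (1 - rho ** 2) * sigma2  # concealment error accumulates
        d.append((1 - p) * d_q + p * d_lost)
    return d
```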
Multiple Description Coding is a forward error correction scheme where two or more descriptions of the source are sent to the receiver over different channels. If only one channel is received, the signal can be reconstructed with distortion D1 or D2. On the other hand, if both channels are received, the combined information is used to achieve a lower distortion D0. Our approach is based on Multiple State Video Coding, with the novelty that we achieve a flexible unbalanced rate between the two streams by varying the quantization step size while keeping the original frame rate constant. The total bitrate Rτ, which is to be allocated between the two streams, is fixed. If the assigned bitrates are not balanced, there will be PSNR variations between neighboring frames after reconstruction. Our goal is to find the optimal rate allocation, maximizing the average reconstructed frame PSNR while minimizing the PSNR variations, given the total bitrate Rτ and the packet loss probabilities p1 and p2 over the two paths. The reconstruction algorithm is also taken into account in the optimization process. The paper reports results presenting optimal system designs for balanced as well as unbalanced path conditions.
In recent years, there has been growing interest in developing effective methods for searching large image databases based on image content. A commonly used method is search-by-query, which is often not satisfactory: it is frequently difficult to find or produce good query images, and repetitive queries tend to become trapped among a small group of undesirable images. To overcome these problems, the user should be provided with easy and intuitive access to the information in image databases. In this paper we present a new browsing environment which uses the metaphor of maps. Like street maps with different scales, from a world map to a city map, the image space is represented through
Multimedia database interfaces should be designed to be highly user-adaptive, since there is no generally applicable model of a user's search behavior or search intention. First, the challenging task for the interface is to present the most representative objects in an appealing and concise manner. Second, the interface has to identify the user's search intention from very few positive feedbacks. In particular for the latter, many Relevance Feedback implementations exist.
While most of them can be considered more or less heuristically motivated parameter adjustment procedures, we treat Relevance Feedback as direct probability density estimation. Our density is defined as the
In this paper we address user navigation through large volumes of image data. Similarity measures based on different MPEG-7 descriptors are introduced, and multidimensional scaling is employed to display images in three dimensions according to their mutual similarities. With such a view the user can easily see similarity relations between images and understand the structure of the database. In order to cope with large volumes of images, a k-means clustering technique is introduced which identifies representative image samples for each cluster. Representative images (up to 100) are then displayed in three dimensions using multidimensional scaling. The proposed clustering technique produces a hierarchical structure of clusters, similar to street maps with various resolutions of details. The user can zoom into various cluster levels to obtain more or less detail as required. Furthermore, a new query refinement method is introduced: the retrieval process is controlled by learning from positive examples from the user, often called relevance feedback. The combination of the three techniques, 3D visualization, relevance feedback, and the hierarchical structure of the image database, leads to an intuitive browsing environment. The results obtained verify the attractiveness of the approach for navigation and retrieval applications.
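A compact sketch with scikit-learn stand-ins for the MDS projection and the k-means selection of representatives, assuming the MPEG-7 descriptor dissimilarity matrix is precomputed:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

def browse_layout(dissimilarity, n_representatives=100):
    """Project images to 3D by metric MDS on a precomputed dissimilarity
    matrix, then pick one representative image per k-means cluster."""
    xyz = MDS(n_components=3, dissimilarity='precomputed').fit_transform(dissimilarity)
    km = KMeans(n_clusters=n_representatives, n_init=10).fit(xyz)
    # Representative = cluster member closest to its cluster centroid.
    reps = [int(np.argmin(np.linalg.norm(xyz - c, axis=1)
                          + 1e9 * (km.labels_ != i)))
            for i, c in enumerate(km.cluster_centers_)]
    return xyz, reps
```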
In this paper we address user navigation through large volumes of image data. A similarity measure based on MPEG-7 color histograms is introduced, and multidimensional scaling concepts are employed to display images in two dimensions according to their mutual similarities. With such a view the user can easily see relations and color similarity between images and understand the structure of the database. In order to cope with large volumes of images, a modified version of the k-means clustering technique is introduced which identifies representative image samples for each cluster. Representative images (up to 100) are then displayed in two dimensions using MDS structuring. The modified clustering technique produces a hierarchical structure of clusters, similar to street maps with various resolutions of details. The user can zoom into various cluster levels to obtain more or less detail as required. The results obtained verify the attractiveness of the approach for navigation and retrieval applications.
A layered pyramid image coding scheme suitable for interworking of videoconferencing and videophone services is presented in this paper. The proposed scheme takes advantage of inter-layer embedded motion prediction to compress image data on the upper pyramid enhancement layer. An important advantage of this scheme is that the coding of the two layers remains independent despite the embedded motion compensation: no information about the coding procedure on the lower-resolution layer is needed to encode the upper-layer enhancement information. The results are encouraging, showing that the scheme reconstructs images of both CIF and QCIF resolution with good quality and that it is more data-efficient than a comparable simulcast approach.
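A minimal sketch of the two-layer pyramid structure (QCIF base plus CIF enhancement residual); the inter-layer embedded motion prediction is omitted here:

```python
import cv2
import numpy as np

def encode_two_layers(frame_cif):
    """Base layer: downsampled QCIF image. Enhancement layer: residual
    between the CIF frame and the upsampled base (inter-layer prediction)."""
    base = cv2.pyrDown(frame_cif)                              # CIF -> QCIF
    prediction = cv2.pyrUp(base, dstsize=frame_cif.shape[1::-1])
    residual = frame_cif.astype(np.int16) - prediction.astype(np.int16)
    return base, residual

def decode_cif(base, residual):
    prediction = cv2.pyrUp(base, dstsize=(residual.shape[1], residual.shape[0]))
    return np.clip(prediction.astype(np.int16) + residual, 0, 255).astype(np.uint8)
```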