The ability to summarize and abstract information will be an essential part of intelligent behavior in consumer devices. Various summarization methods have been the topic of intensive research in the content-based video analysis community. Summarization in traditional information retrieval is a well-understood problem, but although there has been a lot of research in the multimedia community, there is no agreed-upon terminology and classification of the problems in this domain. Although the problem has been approached from different angles, the various dimensions of summarization are usually not distinguished. The goal of this paper is to provide basic definitions of widely used terms such as skimming, summarization, and highlighting. The different levels of summarization (local, global, and meta-level) are made explicit. We distinguish among the dimensions of task, content, and method and provide an extensive classification model for them. We map the existing summary-extraction approaches onto this model and classify the aspects of the systems proposed in the literature. In addition, we outline the evaluation methods and provide a brief survey. Finally, we propose future research directions based on the gaps identified by our analysis of existing systems in the literature.
The goal of the current paper is to introduce a novel clustering algorithm designed for grouping textual documents transcribed from audio and video segments. Since audio transcripts are normally highly erroneous documents, one of the major challenges at the text-processing stage is to reduce the negative impact of errors introduced at the speech recognition stage. Other difficulties come from the nature of conversational speech. In the paper we describe the main difficulties posed by spoken documents and suggest an approach for limiting their negative effects. We also present a clustering algorithm that groups transcripts on the basis of the informative closeness of documents. To carry out such partitioning we give an intuitive definition of the informative field of a transcript and use it in our algorithm. To assess the informative closeness of the transcripts, we apply a chi-square similarity measure, which is also described in the paper. Our experiments with the chi-square similarity measure showed its robustness and high efficacy. In particular, a performance analysis carried out against three other similarity measures (Cosine, Dice, and Jaccard) showed that chi-square is more robust to the specific features of spoken documents.
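To make the measure concrete, the following is a minimal sketch of a chi-square style similarity between two tokenized transcripts, computed over their term-frequency profiles. The function name and the exact formulation are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def chi_square_similarity(doc_a, doc_b):
    """Compare two tokenized transcripts by a chi-square statistic over
    their term-frequency profiles (illustrative formulation)."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    total_a, total_b = sum(tf_a.values()), sum(tf_b.values())
    chi2 = 0.0
    for term in set(tf_a) | set(tf_b):
        observed = tf_a[term]
        # expected count in doc_a if both documents shared one distribution
        expected = (tf_a[term] + tf_b[term]) * total_a / (total_a + total_b)
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    # map the distance-like statistic into a similarity in (0, 1]
    return 1.0 / (1.0 + chi2)

a = "the market report covers quarterly earnings".split()
b = "quarterly earnings dominated the market report".split()
print(chi_square_similarity(a, b))
```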
Current advanced television concepts envision data broadcasting along with the video stream, to be used by interactive applications at the client end. However, these applications do not proactively personalize the experience and may not allow user requests for additional information. We propose content enhancement using automatic retrieval of additional information based on video content and user interests. This paper describes Video Retriever Genie, a system that enhances content with additional information based on metadata that provides semantics for the content. The system is built on a digital TV (Philips TriMedia) platform. We enhance content through user queries that define information extraction tasks for retrieving information from the Web. We present several examples of content enhancement, such as additional movie character/actor information, financial information, and weather alerts. Our system builds a bridge between traditional TV viewing and the domain of personal computing and the Internet. The boundaries between these domains are dissolving, and this system demonstrates one effective approach to content enhancement. In addition, we illustrate our discussion with examples from two existing standards, MPEG-7 and TV-Anytime.
Today consumers face an ever-increasing number of television programs. The problem, however, is that the content of video programs is opaque. The existing options for viewers are to watch the whole video, to fast-forward and try to find the relevant portion, or to use electronic program guides to get additional information. In this paper we present a summarization system that processes incoming video, extracts and analyzes closed-caption text, determines the boundaries of program segments as well as commercial breaks, and extracts a program summary from a complete broadcast to make the video transparent. The system consists of a transcript extractor, program type classifier, cue extractor, knowledge database, temporal database, inference engine, and summarizer. The main topics discussed are video summarization, video categorization, and retrieval tools.
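As a toy illustration of one cue such a cue extractor could plausibly key on, the sketch below splits extracted closed captions at the conventional '>>>' topic-change marker used in US captioning. The function and sample data are hypothetical and far simpler than the described system.

```python
def segment_transcript(caption_lines):
    """Split extracted closed captions into candidate segments.
    US captions conventionally mark a new topic with '>>>' and a new
    speaker with '>>'; this toy segmenter keys only on the topic marker."""
    segments, current = [], []
    for line in caption_lines:
        if line.lstrip().startswith(">>>") and current:
            segments.append(current)
            current = []
        current.append(line)
    if current:
        segments.append(current)
    return segments

captions = [
    ">>> Good evening, here are tonight's top stories.",
    ">> The mayor announced a new transit plan today.",
    ">>> In sports, the home team clinched the series.",
]
print(len(segment_transcript(captions)))  # 2 candidate segments
```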
In this research, we studied the joint use of visual and audio information for identifying persons in real video. A person identification system, able to identify characters in TV shows by fusing audio and visual information, is constructed based on two different fusion strategies. In the first strategy, speaker identification is used to verify the face recognition result. The second strategy uses face recognition and tracking to supplement speaker identification results. To evaluate our system's performance, an information database was generated by manually labeling the speaker and the main person's face in every I-frame of a video segment of the TV show 'Seinfeld'. By comparing the output from our system with this information database, we evaluated the performance of each analysis channel and of their fusion. The results show that the first fusion strategy is suitable for applications where precision is much more critical than recall, while the second fusion strategy yields the best overall identification performance. It greatly outperforms either analysis channel alone in both precision and recall and is applicable to more general tasks, such as, in our case, identifying persons in TV programs.
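The two strategies can be caricatured in a few lines. The decision rule below is a hypothetical sketch, not the authors' fusion algorithm: in the first mode a face label is accepted only when the speaker channel confirms it, and in the second mode the two channels supplement each other.

```python
def fuse_identities(face_id, face_conf, speaker_id, speaker_conf,
                    precision_mode=True):
    """Toy illustration of two audio-visual fusion strategies.
    precision_mode=True  -> strategy 1: speaker ID verifies the face result.
    precision_mode=False -> strategy 2: the channels supplement each other."""
    if precision_mode:
        # accept a face label only when the speaker channel agrees
        return face_id if face_id == speaker_id else None
    if face_id == speaker_id:
        return face_id
    # on disagreement, fall back to the more confident channel
    return face_id if face_conf >= speaker_conf else speaker_id

print(fuse_identities("jerry", 0.8, "jerry", 0.9))          # agreement -> 'jerry'
print(fuse_identities("jerry", 0.8, "kramer", 0.9, False))  # supplement -> 'kramer'
```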
In this paper we propose an omni-face tracking system for video annotation, designed to find faces from arbitrary views in complex scenes. The face detector first locates potential faces in the input by performing skin-tone detection. The subsequent processing consists of two largely independent components, the frontal-face module and the side-view face module, responsible for finding frontal-view and side-view faces, respectively. The frontal-face module uses a region-based approach in which regions of skin-tone pixels are analyzed for a gross oval shape and the presence of facial features. In contrast, the side-view face module follows an edge-based approach to look for curves resembling a side-view profile. To extract face trajectories, the temporal continuity between consecutive frames within video shots is exploited to speed up the tracking process. The main contribution of this work is the ability to find faces irrespective of their pose, whereas contemporary systems deal with frontal-view faces only. Information regarding human faces is encoded in XML format for semantic video content representation. The usefulness of face information for video annotation is demonstrated in a TV program classification system that categorizes the input video clip into predefined types. It is shown that the classification accuracy is improved significantly by the use of face information.
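As a hedged illustration of the skin-tone detection step, the sketch below applies a common per-pixel RGB heuristic; the thresholds and the rule itself are assumptions and may differ from the module described here.

```python
import numpy as np

def skin_tone_mask(rgb_image):
    """Rough per-pixel skin-tone test used to locate candidate face regions
    (a common RGB heuristic; the paper's exact rule may differ)."""
    img = rgb_image.astype(np.int32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (np.abs(r - g) > 15)

frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[0, 0] = (200, 140, 120)   # skin-like pixel
frame[1, 1] = (30, 60, 200)     # blue background pixel
print(skin_tone_mask(frame))
```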
Consumer digital video devices are becoming computing platforms. As computing platforms, digital video devices are capable of crunching the compressed bits into the best displayable picture and delivering enhanced services. Although these devices will primarily continue their traditional functions of display and storage, there are additional functions the device could also handle, such as content management for real-time and stored video, tele-shopping, banking, Internet connectivity, and interactive services.
The Personal News Retrieval System is a client-server application that delivers news segments on demand over a variety of information networks. At the server side, news stories are segmented out of the digitized TV broadcast, then classified and filtered based on consumers' preferences. At the client side, the user can access the preferred video news through the Web and watch the stored video news in the preferred order. Browsing preferences can be set based on anchorperson, broadcaster, category, location, top stories, and keywords. The system can be used to set up a news service run by content providers or independent media distribution companies. However, in the new era of enhanced PC/TV appliances, it is foreseeable that the whole system could run in the living room on a personal device. This paper describes the chosen server architecture, the limitations of the system, and solutions that can be implemented in the future.
Abstracting video information automatically from TV broadcasts requires reliable methods for isolating program and commercial segments from the full broadcast material. In this paper, we present results from cut, static-sequence, black-frame, and text detection for the purpose of isolating non-program segments. These results are evaluated by comparison to human visual inspection on more than 13 hours of varied program content. Using cut-rate detection alone produced high recall with medium precision. Text detection was then performed on the commercials and the false-positive segments. Adding text detection slightly lowers the recall but achieves much higher precision. A new fast black-frame detector algorithm is presented; black-frame detection is important for identifying commercial boundaries. The results indicate that adding text detection to cut-rate detection, in order to reduce the number of false positives, is a promising method. Furthermore, adding information about the position and size of text, and tracking it through an area, should further increase reliability.
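A black-frame test of the kind mentioned above can be sketched as follows; the luminance threshold, the dark-pixel fraction, and the function name are illustrative assumptions rather than the paper's fast detector.

```python
import numpy as np

def is_black_frame(frame, luma_threshold=20, dark_fraction=0.98):
    """Flag a frame as 'black' when nearly all luminance values fall below
    a darkness threshold (illustrative criterion, not the paper's test)."""
    luma = np.asarray(frame, dtype=np.uint8)
    return np.mean(luma < luma_threshold) >= dark_fraction

# synthetic 8-bit luminance frames: one dark, one mid-grey
dark = np.full((288, 352), 5, dtype=np.uint8)
grey = np.full((288, 352), 90, dtype=np.uint8)
print(is_black_frame(dark), is_black_frame(grey))  # True False
```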
KEYWORDS: Video, Visualization, Multimedia, Human-machine interfaces, Data storage, Data modeling, Data communications, Image segmentation, Internet, Whole body imaging
In the convergence of information and entertainment there is a conflict between the consumer's expectation of fast access to high-quality multimedia content through narrow-bandwidth channels and the size of this content. During retrieval and presentation in a multimedia application, two problems have to be solved: the limited bandwidth available for transmitting the retrieved multimedia content and the limited memory available for temporary caching. In this paper we propose an approach for latency optimization in information-browsing applications: a method for flattening hierarchically linked documents in a manner convenient for network transport over slow channels, minimizing browsing latency. Flattening the hierarchy involves linearization, compression, and bundling of the document nodes. After the transfer, the compressed hierarchy is stored on a local device, where it can be partly unbundled to fit the caching limits at the local site while still giving the user access to the content.
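A minimal sketch of the linearize-compress-bundle idea follows, assuming JSON-serializable document nodes and using zlib as a stand-in codec; the traversal order and compression scheme are assumptions, not the system's actual design.

```python
import json
import zlib

def flatten_hierarchy(node, order=None):
    """Linearize a hierarchically linked document into a flat node list
    (depth-first here; the real order could match expected browsing)."""
    if order is None:
        order = []
    order.append({"id": node["id"], "content": node["content"]})
    for child in node.get("children", []):
        flatten_hierarchy(child, order)
    return order

def bundle(node):
    """Bundle and compress the linearized nodes for transport over a
    slow channel (zlib stands in for the real codec)."""
    return zlib.compress(json.dumps(flatten_hierarchy(node)).encode("utf-8"))

doc = {"id": "root", "content": "index", "children": [
    {"id": "a", "content": "chapter A", "children": []},
    {"id": "b", "content": "chapter B", "children": []},
]}
payload = bundle(doc)
print(len(payload), "bytes to transfer")
```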
KEYWORDS: Video, Databases, Video compression, Video processing, Distance measurement, Video coding, Genetic algorithms, Image compression, Signal processing, Human-machine interfaces
This paper presents a novel approach for video retrieval from a large archive of MPEG or Motion JPEG compressed video clips. We introduce a retrieval algorithm that takes a video clip as a query and searches the database for clips with similar contents. Video clips are characterized by a sequence of representative frame signatures, which are constructed from DC coefficients and motion information ("DC+M" signatures). The similarity between two video clips is determined using their respective signatures. This method facilitates retrieval of clips for the purpose of video editing, broadcast news retrieval, or copyright violation detection.
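The matching step can be illustrated with a small sketch that treats each clip as a sequence of per-frame signature vectors and slides the query against an archive clip; the alignment scheme and distance function are assumptions for illustration, not the paper's exact "DC+M" matching procedure.

```python
import numpy as np

def clip_similarity(sig_a, sig_b):
    """Compare two clips represented as sequences of per-frame signature
    vectors (e.g. DC coefficients plus motion features). The shorter
    sequence is slid over the longer one and the best average
    frame-to-frame distance is returned (illustrative scheme)."""
    short, long_ = (sig_a, sig_b) if len(sig_a) <= len(sig_b) else (sig_b, sig_a)
    best = np.inf
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        dist = np.mean(np.linalg.norm(window - short, axis=1))
        best = min(best, dist)
    return best

query = np.random.rand(10, 64)    # 10 representative frames, 64-dim signatures
archive_clip = np.vstack([np.random.rand(5, 64), query, np.random.rand(5, 64)])
print(clip_similarity(query, archive_clip))  # near zero: query is embedded
```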