2    State-of-the-art in visual search of video


A. Hanjalic

2.1   Introduction

We are witnessing an immense growth in the development of digital libraries. Collected in such libraries are large volumes of digitized multimedia data, which are reachable via electronic networks by any user world-wide. Libraries containing a variety of well-organised digital data bring many advantages, such as preservation of quality of stored information and almost unlimited possibilities to manipulate and browse through data using comfortable user interfaces. However, with steadily increasing amount of data stored in such systems, the problem of efficiently navigating through these data and quickly and reliably finding and retrieving any part of it, becomes increasingly important.



Figure 2.1: Scheme with video-processing steps needed for visual search of video and links to other components of the SMASH system as well as to the user's home environment

The most efficient way to interact with huge amounts of data, stored in modern digital storage systems, is to interact not with the data itself, but with the abstract of that data, which contains a compact representation of the whole stored content. The need for an abstract is especially strong in systems with distributed storage, like SMASH, where the user-interaction with the storage part (tape) containing the actual data is not efficient enough. Much easier user-interaction would be possible on the other storage component (disk) of the SMASH system, which then does not contain the actual data, but only the abstract. This data-representation should enable the user to get an impression about the actual stored content, i.e. to enable content-based browsing/query through data without the need to interact with the data itself. By making an abstract, special problems appear when dealing with stored visual data, primarily due to their high semantical complexity. This chapter is confined to the visual search done on stored video-data.

Figure 2.1 shows the idea of making the video-abstract directly on the incoming (compressed) video-stream. There are three major steps in this process:

In the area of visual search of video considerable amount of research results can be found in literature, providing steadily more efficient and reliable ways of performing the steps mentioned above. The importance of this research comes from the fact that the efficiency of the actual browsing/query process on stored video data is highly dependent on the way how video-processing in Figure 2.1 is done. In the following, the state-of-the art for each of these steps will be discussed in more details.

2.2  Video-parsing

The first video-processing item given in Figure 2.1 divides a given video stream in certain elementary segments, called video-shots. A video-shot can be defined as an unbroken sequence of frames, e.g. a zoom of a person talking [18]. There are several reasons why this process of video-parsing must be performed. Firstly, breakpoints between consecutive shots are semantically logical entrance points for video-retrieval. Secondly, shot-boundaries determine sequence parts each having one and the same content, i.e. they are elementary content units which are objects of representation, as it will be described in Section 2.3. Just like in the case of text, where elementary index units are words and phrases, also in the video some elementary entities are needed to be indexed [25]. Since one separate frame of the sequence has practically no meaning, the next longer elemental unit is the defined specific sequence of frames - video-shot. Defining these elementary units is done by detecting transitions between them.

There are three major classes of shot-transitions to be detected by video-parsing: sharp transitions (camera breaks), gradual transitions (dissolves, fade-out followed by fade-in) and special effects (fade-in, fade-out, wipe) combined with sharp transitions (e.g. fade-out followed by a camera break).

For each of these transitions, different techniques must be applied for their detection. The first step in the detection process is generally performed by measuring content-changes between each two consecutive frames of the sequence. The result is the curve of frame-to-frame differences (in the following text referred as FFD). This curve has, in case of sharp transitions, peaks at each place the difference is measured between frames from different shots, i.e. on the actual sharp shot-break. The logic behind this effect is that content changes between consecutive frames within a shot are much smaller than content changes between frames belonging to two different shots. If the transition is not sharp, its detection becomes more problematic. The dissolve is, for instance, spread over several frames (e.g. 15) resulting in FFD values between frames which are generally bigger than those within the shot but which cannot be defined as peaks. Better in this case is to talk about plateau's, i.e. gradual increase of FFD values followed by their gradual decrease.

low res figure 2.2a(click here to see full resolution)

low res figure 2.2b  (click here to see full resolution)

Figure 2.2: Some possible cases of shot-transitions and behaviour of FFD values along different sequences. The sequence left is a part of one music video-spot, and the one on the left the start of a soccer game. Both sequences are in MPEG-1 format, whereby the analysis is done on DC images of I frames and approximated DC-images of P-frames.

Other cases including special effects result again in other ways of FFD behaviour, which are first to be modelled in order to be detected. A further complicated factor by detecting shot-transitions are cases of fast camera movements causing large difference values. This causes a very similar behaviour of FFD-values like in the case of dissolve and could therefore be mixed up with an actual (gradual) shot-change. Also the case of a strong motion or fade followed by a sharp break or vice-versa causes considerable problems by the detection. Figure 2.2 shows the behaviour of FFD-values for two different sequences and with some characteristic transitions.

Two major issues are to be considered:

2.2.1  Features

In order to ignore small changes between subsequent frames of a sequence, which are not relevant for shot-transitions, large spatial averaging on frame data before the actual video-parsing process is proposed very frequently (e.g.[30]). For the same reason, the feature used to represent a frame should be characteristic for the entire frame. Among such global frame-features a big majority of authors prefers histograms, which can be computed for each of colour components separately or for the frame-colour in combination (e.g. 3D-RGB histograms). Histograms appear to be a widely used tool for describing a general frame content also due to easiness of their computation. In spite of many different metrics used in literature for comparing histograms and obtaining FFD values, good performance of this feature is often reported [22], [25], [30], [31]. The quality of performance is measured by sharpness of peaks, by sharp transitions and by distinguishability of plateau's of gradual transitions. However, due to the fact that such plateau's can appear also by strong object or camera motion, it is necessary to distinguish between these effects to avoid false detections in high-action scenes. Therefore an additional feature is needed which is specific for action effects: the motion information. Motion can also be used for video-parsing, e.g. for detecting shot-transitions occurring on P or B-frames in MPEG-compressed streams. When working with MPEG streams this information is easily available in form of motion vectors. Usage of this feature is demonstrated in [21], [31], [32].

2.2.2  Metrics

In several recent publications [33], [34], [30], [31], [25] different metrics for comparing frame features (histograms) have been evaluated.

In [22] and [23] a good performance is reported by using the absolute difference of histograms h(i) of two consecutive frames (k, k-1) and merging characteristics of histograms of all three separate colour components. While in [22] this metric is applied to components of the RGB-space, in [23] the YUV space was used. The metric can be given as


Although many authors including [22], [25], [30] claim that it is enough to use the histogram of only one component, the experience shows that the merging effect leads to much higher detection reliability. Some experiments about this have been made in [22].

In their frequently referenced comparative study, Nagasaka and Tanaka concluded in [33] that the so-called - comparison of colour-histograms, defined by


performs best, whereby [25] and [31] claim that its performance in not necessarily better than the performance of normal absolute differences.

As explained in the last section, motion information can also be used as a feature for video-parsing. For detecting shot-transitions occurring on P or B frames of an MPEG-stream, [32] proposes to measure ratios of numbers of macro-blocks without and with motion compensation (P-frames) and numbers of backward and forward motion vectors (B-frames). By detecting dissolves, motion vectors are used to model different camera movements ([25] and [31]).

2.2.3 Thresholding

While for finding an appropriate feature and metric for comparison of consecutive frames a number of acceptable solutions already exist, the problem of interpreting difference values, i.e. actual selection of certain values to be associated with shot-changes, still remains the major obstacle in practice. The proper selection of difference values is done by setting thresholds. Furth et al. give in [25] a statistical approach for determining the threshold, based on measuring mean-value and standard deviation of frame-to-frame differences for the whole sequence. The global threshold T is then estimated as

.

Experimental results in [25] suggest that should have values between 5 and 6. Statistical parameters needed for the formula are obtained after analysing the complete sequence and claiming that the distribution of all FFD values along the sequence (not considering FFD values on shot-transitions and in segments with camera/object motion) is Gaussian. Problem with this and with other approaches with global threshold is the case where a distinguishable break-peak can be observed in one stationary part of the sequence, but whose height is similar to FFD values along a high-action-shot in an other sequence part. The concept of global thresholds can also not be used in systems where video-parsing process is to be done "on-the-fly".

The described statistical approach can be adapted for "on-the-fly" processing in a sense that all statistical parameter are reset after each detected shot-change. This provides at the same time a local threshold selection. However, every missed detection or "false alarm" influences threshold-values for coming shots in a negative way, causing a burst of detection-mistakes.

The approach given in [22] partially avoids this problem, because the investigation of frame-to-frame differences is done within a sliding window - not dependent on missed detection or false alarms before the starting point of the window. In this case the threshold is determined locally for a (non-)causal sliding window, and is therefore time variant.

2.3    Video-content representation

2.3.1   Introduction

The process of representing video-content starts with partitioning of a given sequence in elementary content units, as described in previous section. Characteristic information of all shots in a sequence form the abstract of a sequence, and will be used in the browsing or query process.

The browsing or query method to be applied on the abstract is directly dependent on the way how the shot-content is represented [31]. This representation can be done by manual annotation of the video-content using key-word ,by extracting characteristic features (e.g. shape, colour, size, texture, temporal changes along each video-shot, etc.). We consider one such representation method, which is of large practical importance and based on key-frames.

2.3.2   Key-frame based video-content representation

Representation through characteristic frames (key-frames) has been addressed very frequently in literature (e.g. [21], [22], [20]). For each detected video-shot, a number of key-frames is chosen. All extracted key-frames out of the sequence are organized semantically (clustered) before being presented to the user. Video-browsing based on key-frames is done by going through cluster-tree, each step giving a set of key-frames representing corresponding video-parts. Due to the "action-based" video-sampling, the content-development of each video part can be recognized very easily only by looking at given samples. Therefore, this concept of video-content representation appears to be very suitable for video-browsing purposes. However, also when considering query processes where the search for video-parts containing some specific objects, persons or features is performed, the key-frame based concept can be very efficient since features collected from key-frames can be used.

A simple method to select key-frames is to take the first frame of each shot [36]. More reliable content-representation requires, however, non-uniform sampling of the frames in the video-shot. Current approaches [21], [20] mainly work by setting thresholds within each shot, measuring frame-to-frame differences and comparing them to that threshold. Most of the time, the first frame of the shot is chosen as one key-frame. For the rest of the shot, each time the difference value exceeds the threshold, the coming frame is proclaimed to be a key-frame. However, the described procedure is typically a sequential process leading generally to unpredictable results. In particular, the final amount of key-frames for the whole sequence cannot be estimated for any given threshold. We can end up with a huge number of key-frames or simply with too few key-frames - not enough for browsing. This problem is also related to the available space for storing extracted key-frames (abstract). Secondly, it is rather difficult to relate any particular parameter value by threshold setting to the key-frame collection resulting from that setting. Furthermore, it is based on "subjective" thresholds which is not acceptable in fully automated and widely used systems.

Aiming at more objective and controlled key-frame based video-representation, suitable for applications in SMASH, we have developed a new key-frame allocation method ([23], [24]), which has following major properties:

Details of this method are given in Section 4.

2.4 Clustering

2.4.1 Introduction

Key-frames are chosen to give the user a compact overview of all the stored video-material. The word "compact" is used here primarily to show the relation between sizes of the abstract and the original video-data. However, when analysing the absolute size of the abstract even for only one movie of a normal length (e.g. 2h) the usage of attribute "compact" seems to be not very suitable. By a frame-rate of 25 frame/sec and an approximate shot-length of 90 frames, 2000 video-shots appear as objects of representation through key-frames. After considering this fact, two possible solutions can be approached: either to find ways how to cope with a large number of key-frames or to try to represent the whole movie through a couple of key-frames which are positioned on "right" places. The problem is that also in the second case the total amount of key-frames will not be small, if an acceptable content-representation is desired, but also concerning the fact that the COMBO capacity is far larger than to store only one movie.

Figure 2.3: DVB application in the SMASH with the clustering block. All information needed for clustering video-shots can be seen

Defining efficient methods for organising the abstract data into meaningful clusters is a very important and research intensive area, either in case of stored image or video data. In case of video, there are two basic objectives of clustering:

The Figure 2.3 shows the place of clustering in the visual search of video. This figure also shows that clustering procedures may use non-visual information, for instance supporting DVB information.

2.4.2   Methods of clustering

All clustering methods can be divided into two major classes: hierarchical methods (leading to a tree-structure of data) and partitioned methods (leading to several parallel groups of data). Depending on the way the tree-structure is obtained, hierarchical methods can be further divided in agglomerative methods and divisive methods (going successively from coarse to fine splitting or vice versa).

Each clustering method is characterised by used features and metrics for comparing these features. The objective of each clustering method is to achieve maximal similarity of all objects belonging to one and the same cluster, where the similarity corresponds to the defined metrics and applied feature(s). Detailed study about colour-feature extraction can be found in [17]. In [19] a new image matching method in the subband domain is reported.

In [26] the creation of hierarchical scene transition graphs is proposed - a method of clustering all shots in a sequence for browsing purposes. Clustering is done by using colour, simple shape information and correlation of corresponding DC-images, as well as temporal variations in each shot to measure their similarity. Also a method for evaluating these mentioned similarity features, i.e. method of building proximity matrices for video-shots is presented. In [35], an overview of clustering methods for video-browsing is presented.

2.5  Conclusions

As discussed in all previous sections, there is a number of problems to be solved in order to obtain the video-abstract, whereby all mentioned requirements ("on-the-fly" processing, minimized parameter dependency, acceptable performance for general user, etc.) must be taken into account. Current research directions in the WP320 deal with these problems, and some results - like those in section 2.3 - can be reported. Future plans include further improvement of the key-frame extraction algorithm through more sophisticated modelling of the actual content-development along the sequence, improvement of video-parsing related to its current dependency on subjective, sequence-specific parameters. However, notwithstanding the technical hurdles that remain to be taken for a fully automated key-frame extraction procedure, clustering of key-frames and the associated browsing/query process is still in its infancy but will determine the successful development of elegant user-interfaces to a great extent. Therefore, main emphasis within future plans for the WP320 will be put on research in the clustering area.


back to table of contents