A. Hanjalic (4.1), G.C. Langelaar (4.2), M. Ceccarelli (4.3)
Based on the state of the art in visual search, copy protection, and combo architecture and performance, this section lists requirements and conceptual solutions for the various aspects of the combo system. We discuss, in turn, the requirements and solution concepts for visual search, copy protection and the multimedia database.
In large public digital libraries different preparatory annotation activities are carried out on the stored data so that its representation is optimal for subsequent user browsing processes. These activities are usually time-consuming and may include manual insertion of key-words, and the selection of audio and video clips.

Figure 4.1: Automated system enabling search through stored video-data
For consumer storage systems, however, such annotation and selection activities should be carried out fully automatically, albeit with somewhat lower reliability and flexibility. This automated concept has also been chosen for the SMASH project.
To obtain a system suitable for consumer applications, two requirements must be met: (1) user involvement in setting up the system must be minimized and (2) good performance must at the same time be guaranteed for any video sequence entering the system. The system must also be able to process the incoming stream directly, in real time (or faster than real time, depending on the speed at which the stream is downloaded to the storage unit). This "on-the-fly" processing contributes strongly to the user-friendliness of the entire storage system.
In the remainder of this text, the term automated covers both the "on-the-fly" processing requirement and the generalization of performance obtained by minimizing (and generalizing) the parameter dependency.
Further requirements on the visual search engine relate to specific SMASH applications and to the SMASH architecture. An important application in SMASH is the recording of DVB services containing MPEG-2 compressed video. Since no content-based video processing is possible directly on the compressed stream, the stream must be partially decoded up to a certain level. Real-time processing requires dedicated hardware. The important interfaces needed to realize the system are the bus for sending MPEG-2 transport streams to the COMBO, access to the multimedia database, and the COMBO-PC interface for developing browsing and postprocessing tools.
To fully automate the process of extracting the key information, two issues
need to be studied, namely automated video parsing and automated key frame
extraction. This can be observed in Figure 4.1. The following two sections
contain proposed solutions for these two parts of the visual search engine.
For consumer application systems, a video-parsing algorithm is needed that is capable of "on-the-fly" video analysis and has high detection reliability. Because of these requirements, global thresholding, as often proposed in the literature, cannot be applied. For SMASH purposes, a modified version of the approach from [22] with an adaptive threshold will be used, in which frame-to-frame differences (FFD) are examined within a sliding window containing only a certain number of the most recently computed FFD values. The results therefore do not depend on missed detections or false alarms that occurred before the start of the window. In this approach it is not the last computed FFD value but the middle one of the window that is compared with the second largest window element; a sharp shot break is declared if the ratio between these two values exceeds a given thresholding parameter. FFD values along the sequence are measured by comparing histograms of all three components of the YUV colour space. Histograms are formed with 32 bins for the Y values and with 64 bins for each of the two chrominance components. The metric used can be given as
![]()
employing histograms h(i) of two consecutive DC-frames (k,k-1),
applied to Y, U and V components.
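The sliding-window test described above can be sketched as follows. This is a minimal illustration, not the SMASH implementation: the summed absolute bin difference used in `ffd`, the window length of 5, the ratio of 3.0, the additional check that the middle value is the window maximum, and all function names are assumptions made for the example. Frames are represented as hypothetical dictionaries holding 8-bit sample lists for Y, U and V.

```python
from collections import deque

def histogram(values, bins, max_value=256):
    """Count how many 8-bit sample values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[v * bins // max_value] += 1
    return counts

def ffd(frame_a, frame_b):
    """Frame-to-frame difference from Y (32 bins), U and V (64 bins each) histograms.

    Assumption: the bin-wise absolute difference is summed over all components.
    """
    total = 0
    for component, bins in (("Y", 32), ("U", 64), ("V", 64)):
        hist_a = histogram(frame_a[component], bins)
        hist_b = histogram(frame_b[component], bins)
        total += sum(abs(a - b) for a, b in zip(hist_a, hist_b))
    return total

def detect_cuts(ffd_values, window=5, ratio=3.0):
    """Return indices of FFD values declared to be sharp shot breaks.

    Only the last `window` values are kept, so earlier decisions cannot
    influence the current one. The middle value must be the window maximum
    and exceed `ratio` times the second largest window element.
    """
    recent = deque(maxlen=window)
    middle_offset = window // 2
    cuts = []
    for index, value in enumerate(ffd_values):
        recent.append(value)
        if len(recent) < window:
            continue
        ordered = sorted(recent, reverse=True)
        middle = recent[middle_offset]
        if middle == ordered[0] and middle > ratio * ordered[1]:
            # Index of the middle element in the original sequence.
            cuts.append(index - (window - 1 - middle_offset))
    return cuts
```

Because the decision is always taken for the middle element, a cut is reported with a delay of half a window, which is compatible with processing the stream as it arrives.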
The original algorithm, as described in [22],
detects sharp changes between consecutive shots, but also reacts to gradual
transitions and special effects. The main disadvantage of this method is
that a thresholding parameter must be specified for distinguishing
the maximal from the next largest window element, as described above. The performance
of the value proposed by the authors of [22] can be
guaranteed only for the sequences used in their tests. Our current efforts are directed
towards minimizing this parameter dependency and obtaining good performance
for general sequences. First improved results have been obtained
for detecting sharp shot changes, and these will be used in the first
implementation phase of the project. Improved results for gradual changes
and special effects are expected soon and will be implemented in the system
in a later project phase.
For use in the project, we propose our novel automatic key-frame extraction procedure [23], [24].
Figure 4.1 shows the SMASH reference framework as far as the key frame extraction
and storage concept is concerned. Among other components of the automated
storage unit, two essential components of our automated approach can be seen,
namely key frame assignment to each shot and key frame distribution
along each shot. In the first component, "on-the-fly" assignment of the
number of key-frames per shot is carried out depending on the content of the
shot and on the past content development. This key frame assignment is done
such that the sum of all assigned key-frames along the sequence is close
to a given maximal number of allowable key-frames N for the entire
sequence. As a measure
of content, we have
found the sum of frame-to-frame differences along the shot i
sufficiently representative:
(1)
Here L is the number of frames within the shot and d(n,n-1) is the
measured content difference between frames n and n-1, here
based on a comparison of the frame histograms h of the Y, U and V
colour components:
(2)
Such histogram-based measurement of content changes along the sequence, although shown to be a powerful tool for video parsing (Section 4.1.1), is not the final choice for key-frame extraction. Work is currently under way within WP320 to define much more sophisticated measures of the content development of a video sequence (also called action measures).
The "on-the-fly" assignment algorithm, using the content measure defined
in (1), takes the following form:
(3)
Here K_i is the number of key-frames assigned to
the shot i, T is the total sequence length, and
L_u is the length of the shot u.
C_u is the content of the shot u as defined in
(1) and S is the number of shots in the entire sequence.
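One plausible reading of such an on-the-fly rule, consistent with the quantities listed above but not necessarily identical to (3), is sketched below. The assumption is that the total content of the sequence is extrapolated from the content density observed so far, and that each shot receives a share of the budget N proportional to its own content. The function name, the rounding and the minimum of one key-frame per shot are choices made for this example only.

```python
def assign_key_frames(shots, total_length, budget):
    """Assign a number of key-frames to each shot as it arrives.

    shots        -- iterable of (length, content) pairs in arrival order
    total_length -- total sequence length T, assumed known in advance
    budget       -- maximal number of allowable key-frames N
    """
    seen_length = 0
    seen_content = 0.0
    counts = []
    for length, content in shots:
        seen_length += length
        seen_content += content
        # Extrapolate the content of the whole sequence from the part seen so far.
        estimated_total = seen_content * total_length / seen_length
        share = content / estimated_total if estimated_total > 0 else 0.0
        # Assumption: every shot is represented by at least one key-frame.
        counts.append(max(1, round(budget * share)))
    return counts
```

For three shots of lengths 100, 100 and 200 frames with contents 10, 20 and 30, a total length of 400 and a budget of 20, this sketch assigns 5, 7 and 10 key-frames. The sum of 22 is close to, but not exactly, the budget, which is the price of deciding without knowledge of the shots still to come.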
The assignment step in (3) is followed by a threshold-independent and objective procedure for the content-based distribution of the assigned key-frames along each video shot [23], [24]. The key-frame distribution along the shot results from minimising the following criterion function:
(4)
Here k_j (j=1,...,K_i) are the temporal
positions of the key frames, while t_{j-1} and
t_j are the breakpoints delimiting the shot segment
that is represented by key frame k_j. Note that
t_0 and t_{K_i} are the (known) temporal
start and end points of the i-th shot. The non-decreasing function in (4)
represents the content-development of the shot i,
where each function value is obtained by accumulating the frame-to-frame
differences (2) along the shot up to the frame x. With (4) we indicate
that we wish to approximate the actual content-development by a curve
composed of rectangles, each one defined by
k_j and t_j and each corresponding to one
key-frame. The minimisation process gives the optimal positions of the key-frames
along the shot, such that their (variable) density best reflects the actual
content-development. Figure 4.2a illustrates the distribution approach for
3 key-frames assigned to a shot with a given content-development. Figure
4.2b shows the result of the key-frame distribution over a single shot after
applying the described approach.
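The distribution step can be illustrated with a small discrete sketch. It assumes that the criterion is the summed absolute deviation between the accumulated content curve and its piecewise-constant approximation, and it finds the breakpoints by dynamic programming over frame indices. Both the cost function and the search method are assumptions for this example; the procedure of [23], [24] may differ. Breakpoints are returned as half-open frame ranges.

```python
from itertools import accumulate

def distribute_key_frames(differences, num_key_frames):
    """Place key-frames along one shot.

    differences    -- frame-to-frame differences of the shot, one per frame
    num_key_frames -- number of key-frames assigned to the shot
    Returns (key_frame_positions, breakpoints); segment j covers the frames
    breakpoints[j] .. breakpoints[j + 1] - 1.
    """
    curve = list(accumulate(differences))        # non-decreasing content development
    n = len(curve)
    k_count = min(num_key_frames, n)
    prefix = [0.0] + list(accumulate(curve))     # prefix[i] == sum(curve[:i])

    def segment_cost(a, b):
        # For a non-decreasing curve the median frame of [a, b) minimises
        # the summed absolute deviation within the segment.
        k = (a + b - 1) // 2
        left = curve[k] * (k - a) - (prefix[k] - prefix[a])
        right = (prefix[b] - prefix[k + 1]) - curve[k] * (b - 1 - k)
        return left + right, k

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k_count + 1)]
    back = [[0] * (n + 1) for _ in range(k_count + 1)]
    best[0][0] = 0.0
    for j in range(1, k_count + 1):
        for b in range(j, n + 1):
            for a in range(j - 1, b):
                candidate = best[j - 1][a] + segment_cost(a, b)[0]
                if candidate < best[j][b]:
                    best[j][b], back[j][b] = candidate, a

    # Walk back from the end of the shot to recover segments and key-frames.
    keys, breaks, b = [], [n], n
    for j in range(k_count, 0, -1):
        a = back[j][b]
        keys.append(segment_cost(a, b)[1])
        breaks.append(a)
        b = a
    return keys[::-1], breaks[::-1]
```

No threshold appears anywhere in the sketch: the only inputs are the measured differences and the number of key-frames, and segments become shorter wherever the accumulated content rises quickly.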
Results of the combined key-frame assignment and key-frame distribution show that automated key-frame extraction is indeed feasible. Nonetheless, as in any other automated process, errors occur. The dominant errors that we have to deal with are failures of the video-parsing process, i.e. the detection of false shot changes and the missing of actual shot changes. Experimental results indicate that, owing to the use of objective criteria in the key-frame assignment and distribution, the system is fairly robust to errors in detecting shot changes.

Figure 4.2a: Illustration of the proposed approach for key-frame allocation within a video shot with 3 assigned key-frames. The accumulation curve is approximated by 3 flexible rectangles.

Figure 4.2b: Application of the approach to a fictitious video shot. The resulting variable key-frame densities reflect the actual content-development.
A clustering procedure, suitable for the SMASH system, has not been proposed
yet. Intensive research in this area is planned for the next phase of the
project.
The protection mechanism for the SMASH system must ensure that service
providers can keep control over their stored data and that copyrights are
not violated. On the other hand, the system must give consumers the possibility
to protect their own data against others (e.g. parental control).
The protection scheme for the SMASH recorder should meet the following requirements:
Figure 4.3 shows the interface of the SMASH recorder to other devices. A digital P1394 bus is used for the interconnection. Currently only the DVC camcorder and the Sony DHR-1000 VCR support the new IEEE 1394 communications protocol, but in the future the Set Top Box, D-VHS, SMASH system and PC will also be equipped with this interface. The remaining devices will probably follow later. The PC has its own dedicated interface to other digital storage devices; most of these devices are connected to the internal PCI bus or SCSI bus. All devices access the data as real-time bit streams. The PC is the only device that can also access the data as files.
At present, interfaces exist only between digital audio devices, camcorders and digital VCRs, and between PCs and peripherals.
If the SCMS protocol [2] were used for copy protection, all data would have to be divided into data packets for transmission over the data bus. Together with each packet, a separate subcode packet would have to be transmitted, containing the copyright status of this packet (among other things the copy prohibit bit).
Every recording device connected to the bus checks the copyright fields to determine whether a stream may be recorded, and refuses to record a stream or file in which a copy prohibit bit occurs. Every packet needs its own subcode packet, because the storage device can start recording at a random position in the stream.
This system has two disadvantages. If the copyright status is located in a subcode packet with a fixed position in the stream, a hacker can easily trace the subcode packets and toggle the copy prohibit bit to change the status of the stream. The other problem is the PC. If PC software can use a video stream, it accesses the data as a file. But if the stream is accessed as a file, it is also possible to copy it to the hard disk without using the SCMS protocol (and without checking the subcode packets containing the copyright status). So, if a PC has stored a stream, that stream is always accepted by the recorder, because no copyright information is found.
Figure 4.3: Interconnection and protection concept
Therefore the copyright status must be stored in the stream itself: a label or watermark must be added to the data. In this case the copy prohibit bit is inseparable from the data. This approach has many advantages:
The reliability of the copy protection system depends, of course, on standardisation of the system. Every storage device must be able to extract the copyright information from the data and must refuse to record data in which a copy prohibit bit occurs.
The complete copy protection mechanism, which must be implemented in every
storage device, is shown in Figure 4.4. The copy
protection system is active only when data is being stored.
Figure 4.4: Copy protection scheme
To be compatible with the S/PDIF protocol, the following scenario can be used. If data is offered according to the S/PDIF protocol, the copy prohibit bit is tested and the data is refused if this bit is set. If the data contains no copy prohibit bit, the data is labelled and stored. If labelled data is played back according to the S/PDIF protocol, the copy prohibit bit is set. A DAT or DCC recorder will not copy the data further in this case.
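The scenario above amounts to a small decision procedure, sketched below. The `Stream` class and its two boolean fields are hypothetical stand-ins: in a real recorder the label would be a watermark extracted from the payload itself and the S/PDIF bit would be read from the channel status, neither of which is modelled here.

```python
from dataclasses import dataclass

@dataclass
class Stream:
    payload: bytes
    spdif_copy_prohibit: bool = False  # copy prohibit bit signalled on the S/PDIF interface
    label_copy_prohibit: bool = False  # stand-in for a label embedded in the data itself

def record(stream):
    """Return the stored stream, or None if the recorder refuses to record."""
    if stream.spdif_copy_prohibit or stream.label_copy_prohibit:
        return None
    # Data without a copy prohibit bit is labelled before it is stored.
    return Stream(stream.payload, spdif_copy_prohibit=False, label_copy_prohibit=True)

def play_back(stored):
    """Play back a stored stream; labelled data leaves with the S/PDIF bit set."""
    return Stream(
        stored.payload,
        spdif_copy_prohibit=stored.label_copy_prohibit,
        label_copy_prohibit=stored.label_copy_prohibit,
    )
```

With these rules a first recording of unprotected data succeeds, while an attempt to record the played-back copy is refused, both by a recorder that reads the label and by a DAT or DCC recorder that only looks at the S/PDIF bit.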
If service providers want to prepare data in such a way that all recorders
refuse to store it, they simply add a label to their data in which
the copy prohibit bit is set.
If an encryption module is implemented in the system to protect the private data of the consumer, and the recorder allows encrypted data to be played back, the copy protection mechanism can be circumvented, because the watermark cannot be detected in an encrypted stream. To avoid this, the recorder must not allow such streams to be played back and must also not give PCs access to these files.
Since strong encryption algorithms are forbidden for consumer applications
in some countries, it is better to protect the consumer data by a conditional
access system. In this case there are also no problems with the labelling.
Users can protect their data for instance with a password or PIN-code.
A simple representation of the complete protection system can be found in Figure 4.3. The system has the following properties:
The strongest point of the SMASH combo is the combination of its three main characteristics, which make it a candidate for market success: large capacity, fast access and low cost. Once large capacity has been provided via a low-cost tape system, the main challenge is to make the system react quickly to the user's commands, so that it appears to the user as a fast-access device (such as an 'expensive' array of disks). This requires a careful and clever organisation of the stored information, in order to manage efficiently the resources that the application requests.
Fast access and large amount of data are also the keywords of Database Management Systems (DBMS). A DBMS is a tool for manipulating a database, which is made available through special software for archiving, retrieving and modifying objects.
When data grow beyond a certain degree in quantity and complexity, a system for the centralised management of data and applications is needed. This is certainly the case for a storage system for multimedia applications such as the SMASH Combo, where large amounts of data in different formats may enter the system. We can also say that, ultimately, storage would make no sense if no retrieval system were provided.
Database Management Systems not only provide data integrity, consistency and optimisation of both storage and processing resources, but also offer the user attractive features such as access speed, flexibility of retrieval and total availability of the stored information. These capabilities have so far been reserved to companies with centralised computing systems, where large amounts of information about items or people could be organised in alphanumeric tables. As the computer world has moved from large mainframes with centralised resources to networked, distributed and personal solutions, a new, fast-growing market is drawing the database world away from large-scale enterprise solutions towards lightweight systems for small-scale or distributed applications.
In the development of a Multimedia Database Management System for the SMASH Combo, the challenge is represented by the necessity of developing a fast, small footprint, embedded DBMS, capable of managing large and complex objects for applications with critical requirements.
Furthermore, the system must not need an administrator and must automatically take care of all the issues involved in the management of data. It must store not only the objects, but also their attributes and their relationships. Extraction and annotation of metadata will be performed in real time during the storing operation, and queries will be embedded in the applications.
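To make the distinction between objects, attributes and relationships concrete, the following sketch stores all three in an embedded database. SQLite, the table layout, the relation name `key_frame_of` and the location strings are assumptions chosen purely for illustration; the report does not prescribe a particular engine or schema.

```python
import sqlite3

def create_catalogue():
    """Create an in-memory catalogue with objects, attributes and relations."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE object (
            id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,          -- e.g. 'video' or 'key_frame'
            location TEXT NOT NULL       -- where the data lives on tape or disk
        );
        CREATE TABLE attribute (
            object_id INTEGER NOT NULL REFERENCES object(id),
            name TEXT NOT NULL,
            value TEXT NOT NULL
        );
        CREATE TABLE relation (
            source_id INTEGER NOT NULL REFERENCES object(id),
            target_id INTEGER NOT NULL REFERENCES object(id),
            kind TEXT NOT NULL
        );
    """)
    return db

def add_object(db, kind, location, **attributes):
    """Store an object together with its attributes and return its id."""
    cursor = db.execute(
        "INSERT INTO object (kind, location) VALUES (?, ?)", (kind, location)
    )
    object_id = cursor.lastrowid
    db.executemany(
        "INSERT INTO attribute VALUES (?, ?, ?)",
        [(object_id, name, str(value)) for name, value in attributes.items()],
    )
    return object_id

def key_frames_of(db, video_id):
    """Embedded query: ids of all key-frames related to the given video."""
    rows = db.execute(
        "SELECT source_id FROM relation "
        "WHERE target_id = ? AND kind = 'key_frame_of' ORDER BY source_id",
        (video_id,),
    )
    return [row[0] for row in rows]
```

An application would call `add_object` while a recording is being stored and issue queries such as `key_frames_of` from its browsing code, so that no administrator or separate query tool is involved.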
The DBMS will act as an interface between the physical repository and the client application. An interfacing mechanism for client-server communication also has to be defined.