Rich Media Retrieval on the Web – a Multi-level Indexing Approach

Jian Zhai1       Jun Yang1       Qing Li1           Liu Wenyin2      Bo Feng1

1Dept. of Computer Engineering and Information Technology

2Dept. of Computer Science

City University of Hong Kong, HKSAR, China


Rich media is experiencing a breathtaking growth in recent years. Because of the potentially very large index size, it is hard to adopt and adapt content-based method to search rich media files on the Web. In this work, we describe a multi-level indexing method to solve this problem. Specifically, we propose a novel technique of indexing rich media at different levels in a large collection. Query examples are presented to show that our approach can find more relevant rich media files and filter out noise data more effectively than traditional search engines without notable loss in efficiency.


rich media retrieval, multi-level index, XML representation.


Rich media is a new media format that usually integrates with audio, video, hypertext, graphic, vector objects, animations, and user interactions. For example, SMIL [1], Flash [2], and PowerPoint [3] are all instances of rich media. According to the latest statistics, the three major rich media formats, SMIL, Flash, and PowerPoint, are supported by 99%, 98%, and 96% [4] of the web browsers worldwide, respectively.

However, there are still many difficulties when we try to implement a practical content-based rich media retrieval system in the Web environment. Though Google can retrieve the content of PowerPoint and PDF files, it is shown by the latest statistics that these two types of files account for only 5% of all rich media files on the Web. [4]

Particularly, the index size would be too massive if we had simply applied traditional indexing techniques on top of the hundreds of thousands of rich media files (approximately 9,537,000 files, according to Google’s search result) appeared on the Web, not to mention the fact that this number keeps increasing these years. As a result, the gigantic index size entails tremendous calculation in the retrieval process.

To solve this problem, we make use of the vector based information of rich media to reduce the size of indices. The extracted feature is quite small in size but can still precisely capture the characteristics of the original rich media file. Also, several similar rich media files can share the same feature vector, which exactly outlines the characteristics of the group of rich media files. In addition, we provide a multi-level indexing approach as another means to solve the large index size problem. With properly designed data structure and algorithm, the computational complexity of finding the matched results is relatively low.


In our model, five levels are provided at which rich media files are indexed. From the top to bottom, they are domain level, category level, semantic level, structural level, and featured level, respectively.

Ÿ    Domain level index is the highest level of index in our model. Its function is similar to that of the web site directory of Yahoo! i.e., indicating the subject or topic of the media file. Index at domain level cannot be changed.

Ÿ    Category level index specifies the categories of rich media files. Indices are subject to change as the result of user feedbacks, as in the case of many existing retrieval systems. . Its classification is more fine-grained than Domain level index and can be modified according to newly inserted indices.

Ÿ    Semantic level index emphasizes the semantics of rich media files. This level and below are created automatically.

Ÿ    Structural level index focuses on the internal characteristics of rich media files.

Ÿ    Feature level is the lowest level for index of rich media files. At this level, the index describes the main objects existing in the rich media files.

At each level, the indices are all mapped to a numeric range. The query keyword is also converted to a query id by a lexicon. ª Then the remaining query can be processed in the same way as the traditional (database) multi-level index operations.

Figure 1. The systematic view of multi-level index

The organization of the multi-level index is shown in Figure 1. Each item in rich media library (Records) includes the URL and several character ids.


In order to make the abstract approach idiographic, we have implemented a prototype system to validate our algorithm. To illustrate our approach of multi-level indexing method, we present a four-tier architecture to address the representation, indexing, retrieval, and query specification of rich media files. As shown in Figure 2, the architecture is constituted by representation module, indexing module, retrieval module and user interface from bottom to top.

Figure 2. The 4 tier architecture of rich media retrieval system

To allow this architecture to support different types of rich media files, we only need to adapt/extend its representation module. Currently, our system only targets at Flash files because they account for the biggest portion of rich media on the Web (about 90%), although it is effectively applicable to the content-based retrieval of other rich media formats.

In our experiment, we don’t realize a crawler. The testing data is achieved from Google search with “.swf” in the link area.


We compared the performance of our system with other existing Web search engines, including Google and Flashkit.

4.1     Accuracy Test

The selected rich media file for testing is a Flash shooting game [6]. In this experiment, the keyword “M4A1 gun” (which is the name of a weapon in the game) was used as a query. In our test, the file is found and returned as the result expectedly.

Then we tried to search the same file using Google and FlashKit. It is shown that none of them find the file ones we are looking for. The reason is that our advanced method in multi-level index approach embeds a feature vector (such as “M4A1”) in the lexicon or even in the description field of index data, so that the required target can be located more accurately.

4.2     Noise Filter Test

When a user input “shoot game” as the searching keyword in FlashKit, more than 200 Flash movies were returned as the results but we can easily find that some of them are not a shooting game or not even a game.  Such noisy results should be filtered out in the desired scenario.

When we tried the same query in our system, the result is positive because actually there is no “shoot game” text object in this Flash movie, and our multi-level index mechanism is able to filter out such noise data.

4.3     Response Time Test

As an online content-based search engine, searching speed is an important aspect. In order to see how much the multi-level indexing algorithm can improve the searching speed, we carried out ten keyword-based experiments with and without the multi-level index algorithm being applied to the system. Table 1 gives the detailed comparisons.

Table 1. Test results for speed

Input keyword

Number of movies found

Scratch time (in second)

Stop level

Multi-level indexing

Non-multi-level indexing

age of empire





lion king










birthday card





classic music










yesterday once more










Jacky Chan










As shown above, when the multi-level index is not used the system, the query processing is based on a straightforward string comparison, which makes the response time too long to be acceptable in practice. In the case of using our multi-level indexing algorithm, the responding time is evidently shortened.

In fact, our response time is at the same level as Google’s. By making use of the vector characteristics of rich media, our multi-level index method is moreover suitable for content based retrieval on rich media.


We have presented an approach to content-based rich media retrieval using multi-level indexing method, and demonstrated its effectiveness through experimental studies. Admittedly, there are still some valuable research problems to be addressed in the future. Currently, Hoffman encoding is used in the lexicon, in which the efficiency bottleneck still exists. If the multi-level index is also applied to the lexicon, the efficiency may be improved. However, a fully-fledged algorithm is yet to be designed for maintaining the indices of the lexicon. Our future work will address this issue too.

6.     Acknowledgement

This work was partially supported by a grant from CityU (Project No.







[5]     Yannis Labrou and Tim Finin, “Yahoo! as an Ontology - Using Yahoo! Categories to Describe Documents”, Eighth International Conference on Knowledge and Information Management (CIKM-99), Kansas City, MO (October 1999).


ª Details of our algorithm description can be found at