Scalable Vector Graphics Indexing and Retrieval: A Knowledge Representation Approach

Eugenio Di Sciascio
Politecnico di Bari, Bari, Italy
Via G. Re David, 200, 70125 Bari, Italy
Second Author
Università della Tuscia, Viterbo, Italy
via San Carlo 32, 01100, Viterbo, Italy
Marina Mongiello
Politecnico di Bari, Bari, Italy
Via G. Re David, 200, 70125 Bari, Italy


We propose a description logic approach to the semantic content-based retrieval of graphics described in Scalable Vector Graphics (SVG), the XML-based, W3C-approved standard language for describing two-dimensional graphics. We provide a syntax to describe basic shapes, complex objects as compositions of basic ones, and transformations. An extensional, compositional semantics is also introduced for defining retrieval, classification, and subsumption services. Using our logical approach as a formal specification, we implemented a prototype system. A set of experiments, carried out on a testbed of SVG documents to assess the retrieval capabilities of the system, is presented.


SVG, image retrieval, multimedia


Syntax. Adapting the formalism originally introduced in [2], the syntax of our language is defined through basic shapes, positions of shapes, and geometric transformations. The basic shapes we consider are those proposed by the SVG standard, namely circle, rectangle, ellipse, line, polyline, and polygon, described through their attributes and the transformations applied in the SVG language. Formally, basic shapes are denoted with the letter B, and have a contour e(B) characterizing them. We assume that e(B) is described in a space whose origin coincides with the centroid of B. The possible transformations are those defined in SVG. We globally denote a transformation as t. Transformations can be composed in sequences t1 ∘ ... ∘ tn, and they form a mathematical group. To make the syntax uniform, we also consider the pose of a basic shape as a transformation.

The basic building block of our syntax is a basic shape component B[c,t], which represents a basic shape B with color c and contour t(e(B)). With t(e(B)) we denote the pointwise transformation t of the whole contour of B. Composite shape descriptions are conjunctions of basic shape components denoted as

C = B1[c1,t1] ⊓ ... ⊓ Bn[cn,tn]

Semantics. The semantics of our language is defined as an interpretation of the syntactic elements of the language on a domain. The domain D is a set of SVG fragments, and the interpretation of an object composed of basic shapes is the subset of D in which the basic shapes have the proper features. Hence, a collection of SVG fragments is itself a domain of interpretation, and a complex shape C denotes a subset of such a domain - the fragments to be retrieved from the collection when C is viewed as a query. Formally, an interpretation is a pair (·^I, D), where D is a set of SVG fragments, and ·^I is a mapping from shapes and components to subsets of D. We identify each fragment F with the set of shapes {s1,...,sn} it is composed of. Each shape s comes with its own edge contour e(s). An SVG fragment F belongs to the interpretation B[t]^I of a basic shape component if F contains a shape whose contour matches t(e(B)). The definition can be extended to approximate recognition as follows. Recall that the characteristic function f_S of a set S is a function whose value is either 1 or 0: f_S(x) = 1 if x ∈ S, f_S(x) = 0 otherwise. We consider now the characteristic function of the set B[t]^I.

Let F be an SVG fragment; if F belongs to B[t]^I, then the characteristic function computed on F has value 1, otherwise it has value 0. To keep the number of symbols low, we use the expression B[t]^I also to denote the characteristic function (with an argument (F) to distinguish it from the set).

B[t]^I(F) = 1 if ∃ s ∈ F : e(s) = t(e(B)), and 0 otherwise

Now we reformulate this function in order to make it return a real number in the range [0,1] - as usual in fuzzy logic. Let sim(·,·) be a similarity measure from pairs of contours into the range [0,1] of real numbers (where 1 is perfect matching). We use sim(·,·) instead of equality to compare shapes. Moreover, the existential quantification can be replaced by a maximum over all shapes in F. Then, the characteristic function for the approximate recognition in an image F of a basic component is:

B[t]^I(F) = max_{s ∈ F} sim(e(s), t(e(B)))
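As an illustration, the formula above can be sketched in Python. The contour representation (a list of 2-D points) and the concrete choice of sim - here a similarity derived from the symmetric Hausdorff distance - are our assumptions, since the paper leaves sim generic:

```python
import math

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (contours)."""
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

def sim(contour_a, contour_b):
    """Similarity in [0, 1]; 1 means perfect matching (distance 0)."""
    return 1.0 / (1.0 + hausdorff(contour_a, contour_b))

def match_component(fragment, template_contour):
    """B[t]^I(F): best similarity over all shapes s in fragment F,
    comparing each shape's contour against the transformed template
    contour t(e(B)) (here already applied to template_contour)."""
    return max(sim(s, template_contour) for s in fragment)
```

A fragment containing a shape whose contour coincides exactly with the transformed template yields a score of 1.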

Note that sim depends on transformations, since we are looking for shapes in F whose contour matches e(B) with the position and size specified by t. The interpretation of basic shapes, instead, amounts to transformation-invariant recognition. We define the interpretation of a basic shape in approximate recognition as the function

B^I(F) = max_t max_{s ∈ F} sim(e(s), t(e(B)))

The maximization over all possible transformations max_t can be effectively computed by using a similarity measure sim that is invariant with respect to transformations. In this way, a basic shape B can be used as a query to retrieve all fragments that are in B^I. Composite shape descriptions are interpreted as sets of SVG fragments that contain all components of the composite shape. Components can be anywhere in the fragment, as long as they are in the described arrangement relative to each other. Let C be a composite shape description B1[t1] ⊓ ... ⊓ Bn[tn]. For approximate matching, the interpretation of a composite shape is:

C^I(F) = max_t min_{i=1,...,n} { B_i[t ∘ t_i]^I(F) }    (1)
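Equation (1) can be sketched in code: a global transformation t is applied on top of each component's own pose t_i, and the composite score is the worst (min) component score under the best (max) global transformation. In this sketch we restrict t to a small discrete grid of translations and use a simple point-wise contour similarity - both are our assumptions, since the paper maximizes over the full transformation group:

```python
import math
from itertools import product

def sim(a, b):
    """Toy contour similarity (assumed): inverse of mean point distance
    between two equal-length point lists."""
    d = sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
    return 1.0 / (1.0 + d)

def translate(contour, dx, dy):
    return [(x + dx, y + dy) for x, y in contour]

def composite_score(fragment, components):
    """Eq. (1): max over candidate global translations t of the min,
    over components (e(B_i), t_i), of the best per-shape similarity.
    components is a list of (contour, (dx_i, dy_i)) pairs."""
    best = 0.0
    for dx, dy in product(range(-2, 3), repeat=2):  # candidate t
        score = min(
            max(sim(s, translate(translate(c, *ti), dx, dy))
                for s in fragment)
            for c, ti in components
        )
        best = max(best, score)
    return best
```

The min enforces that every component must be found; the max over t lets the whole arrangement float anywhere in the fragment while preserving the relative poses t_i.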


We have carried out a set of experiments on a test dataset of SVG documents using a prototype system. The experimental framework is largely based on the one proposed in [3].

Figure 1: A sample of graphics used in the experiments picturing electronic circuits.

The test data set consists of a collection of 48 SVG documents picturing chemical structures, electronic circuits, and general-subject graphics not concerning a specific theme; a sample of them is shown in Figure 1. We selected from the test data set 20 documents to be used as queries. We separately asked two volunteers to rank, in decreasing order of similarity according to their judgment, the 48 documents with respect to each SVG graphic of the selected query set. The volunteers had never used the system and were only briefly instructed that rank orderings had to be based on the degree of conformance of the database documents with the queries. They were allowed to group documents considered equivalent in content and, for each query, to discard documents judged wholly dissimilar from the query.

Having obtained two classifications, which did not fully agree, we created the final ranking by merging the two similarity rankings according to a minimum ranking criterion: the final rank of each document with respect to a query was the minimum of the two available ranks.
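The minimum ranking criterion amounts to keeping, for each document, the better (lower) of the two expert-assigned rank positions. A minimal sketch, assuming rankings are stored as document-to-position maps (1 = most similar):

```python
def merge_rankings(rank_a, rank_b):
    """Merge two expert rankings by a minimum ranking criterion:
    each document keeps the best (minimum) of its two rank positions.
    Documents discarded by one expert are treated as unranked (inf)."""
    inf = float("inf")
    docs = rank_a.keys() | rank_b.keys()
    return {d: min(rank_a.get(d, inf), rank_b.get(d, inf)) for d in docs}
```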

Then we submitted the same set of 20 queries to the system, whose knowledge base was loaded only with the 48 documents of the test set. The resulting classification gave us the system-provided ranking. Again following the framework in [3], we adopted Rnorm as the quality measure of retrieval effectiveness. The Rnorm measure was first introduced in the LIVE-Project [1] for the evaluation of textual information retrieval systems, and it has been widely used in experiments on image retrieval based on spatial relationships.

Rnorm values are in the range [0,1]; a value of 1 corresponds to a system-provided ordering of the database images that is either identical to the one provided by the human experts or has a higher degree of resolution; lower values correspond to a proportional disagreement between the two.
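One common formulation of Rnorm, in the line of [1], is Rnorm = (1 + (S+ - S-)/S+max)/2, where S+ (respectively S-) counts document pairs that the system orders in the same (respectively opposite) relative order as the expert ranking, and S+max is the number of pairs the expert strictly orders. A sketch under that assumption:

```python
from itertools import combinations

def rnorm(expert_rank, system_rank):
    """Rnorm = (1 + (S+ - S-) / S+max) / 2.
    Rankings are dicts doc -> position (1 = best); ties allowed.
    Expert ties contribute nothing; system ties on an expert-ordered
    pair count neither for nor against."""
    s_plus = s_minus = s_max = 0
    for a, b in combinations(expert_rank, 2):
        e = expert_rank[a] - expert_rank[b]
        if e == 0:
            continue
        s_max += 1
        s = system_rank[a] - system_rank[b]
        if s == 0:
            continue
        if (e > 0) == (s > 0):
            s_plus += 1
        else:
            s_minus += 1
    return 0.5 * (1 + (s_plus - s_minus) / s_max)
```

An identical ordering gives 1, and a fully reversed ordering gives 0, matching the stated [0,1] range.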

Figure 3 shows the Rnorm results for the queries shown in Figure 2; the final average is Rnorm = 0.967. Rnorm values are nearly always equal to 1, meaning that the system-provided rankings almost always agree with the user-provided ones.

Figure 2: The set of tested queries. 

Figure 3: Detail of Rnorm values.

To provide other typical information retrieval measures, we also report results in terms of precision and recall, determined using the same user-provided rankings. The average measures are: Precision = 0.95, Recall = 0.87.
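For completeness, the standard set-based definitions used for these measures can be sketched as follows (the cut-off policy applied to the system output is an assumption on our part):

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall:
    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```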

It is noteworthy that we also carried out a simple test using a full-text retrieval tool. Element types were included in the indexed terms, while various other tags were treated as stop-words. We indexed the same SVG documents and carried out retrieval with the same queries used previously, with a cut-off at 6 documents. The average measures were: Precision = 0.58, Recall = 0.54.


  1. P. Bollmann, F. Jochum, U. Reiner, V. Weissmann, and H. Zuse. The LIVE-Project-Retrieval experiments based on evaluation viewpoints. In Proc. of ACM SIGIR '85, pages 213-214, 1985.
  2. E. Di Sciascio, F.M. Donini, and M. Mongiello. Structured knowledge representation for image retrieval. J. of Artificial Intelligence Research, 16:209-257, 2002.
  3. V.N. Gudivada and J.V. Raghavan. Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Trans. on Information Systems, 13(2):115-144, 1995.
  4. Bernhard Nebel. Reasoning and Revision in Hybrid Representation Systems. Number 422 in LNAI. Springer-Verlag, 1990.