Web Image Mining toward Generic Image Recognition

Keiji Yanai
Department of Computer Science, The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi, Tokyo, JAPAN


We describe a generic image classification system with an automatic knowledge acquisition mechanism from the Web. The processing in the system consists of three steps. In the gathering stage, the system gathers images related to given class keywords from the Web automatically. In the learning stage, it extracts image features from gathered images and associates them with each class. In the classification stage, the system classifies an unknown image into one of the classes corresponding to the class keywords by using the association between the image features and the classes.


Web image mining, image-gathering, image classification


Due to the recent widespread availability of digital imaging devices, we can easily obtain digital images of many kinds of real-world scenes, so the demand for recognition of such diverse images is growing. It is, however, hard to apply conventional image recognition methods to such generic recognition, because most of them are restricted to narrow target domains. Hence, semantic processing of images, such as automatic keyword annotation, classification, and search based on the semantic content of images, is desired.

So far, automatic keyword annotation [1,4,6] and semantic search [2] for image databases have been proposed. Since these works required images with correct keywords for learning an association between images and words, commercial image collections such as the Corel Image Library were used for learning. However, most images in commercial collections are well-arranged photographs taken by professionals, and many of them are similar to each other. They differ from the real-world scenes that ordinary people capture with digital cameras.

In this paper, we propose utilizing images gathered from the Web, instead of commercial image collections, for training a generic image classification system. In other words, this research is Web image mining for generic image classification. We can easily extract keywords related to an image on the Web (a Web image) from the HTML file linking to it, so we can regard a Web image as an image with related keywords. Web images are as diverse as real-world scenes, since they are taken by a large number of people for many different purposes.
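To illustrate how related keywords can be extracted from the HTML that links to a Web image, here is a minimal, hypothetical sketch (not the actual system's code): it pulls candidate keywords for each `img` tag from its `alt` text and its file name.

```python
# Hypothetical sketch (not the actual Image Collector code): extract
# candidate keywords for each image referenced in an HTML page, using
# the img tag's alt text and its file name.
from html.parser import HTMLParser
import os
import re

class ImageKeywordParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # list of (src, sorted keyword list)

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src", "")
        words = set()
        # Words from the alt attribute.
        words.update(re.findall(r"[a-z]+", (attrs.get("alt") or "").lower()))
        # Words from the image file name.
        stem = os.path.splitext(os.path.basename(src))[0]
        words.update(re.findall(r"[a-z]+", stem.lower()))
        self.images.append((src, sorted(words)))

page = '<p>Farm animals</p><img src="img/brown_cow.jpg" alt="a cow in a field">'
parser = ImageKeywordParser()
parser.feed(page)
print(parser.images)  # [('img/brown_cow.jpg', ['a', 'brown', 'cow', 'field', 'in'])]
```

A real gathering module would also consider the surrounding text of the linking page and filter out stop words, but the basic idea of pairing each Web image with keyword candidates is the same.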

The processing in our system consists of three steps. In the gathering stage, the system automatically gathers images related to given class keywords from the Web. In the learning stage, it extracts image features from gathered images and associates them with each class. In the classification stage, the system classifies an unknown image into one of the classes corresponding to the class keywords by using the association between image features and classes. The system is constructed as an assembly of three modules, which are an image-gathering module, an image-learning module, and an image classification module (Figure 1).

Figure 1: Proposed System.


At first, we need to decide on some class keywords, which represent the classes into which unknown images are classified; for example, ``cow'', ``dog'', and ``cat''. For each class keyword, we gather related images from the Web. To do so, we use the Image Collector [10], which we proposed previously, as the image-gathering module.

In the gathering stage, the image-gathering module gathers images related to the class keywords from the Web. Note that we call our module image ``gathering'' rather than image ``search'', since it has the following properties: (1) it does not search for images over the whole Web directly, (2) it does not build an index of Web images in advance, and (3) it makes use of the results of commercial keyword-based search engines for the class keywords. These properties distinguish it from conventional Web image search systems such as WebSeer [5], WebSEEk [9] and Image Rover [8]. These systems first search for images based on the query keywords, and then a user selects query images from the search results; that is, they operate interactively. Our module differs in that it needs only a one-time input of query keywords, thanks to its automatic image selection mechanism. The details are described in [10].


In the system, image classification is performed by image-feature-based search. First, in the learning stage, an image-learning module extracts image features from gathered images and associates image features with the classes represented by the class keywords. Next, in the classification stage, we classify an unknown image into one of the classes by comparing image features.

In our method of image classification, image features of not only a target object but also non-target objects such as the background are used as clues for classification, since non-target objects usually have a strong relation to the target object. For example, a cow usually appears with a grass field and/or a fence on a farm, and a lion usually appears in a savanna or a zoo. Although the number of combinations of a target object and non-target objects is large, we believe we can cope with it by gathering a large number of images from the Web and using them for learning. Note that we do not set up a ``reject'' class, so every test image is classified into one of the given classes.

In the experiments, we used two kinds of image features for learning and classification: color signatures and region signatures. A signature describes a multi-dimensional discrete distribution, represented by a set of vectors and weights. In the case of color signatures, each vector is the mean color of a cluster and each weight is the ratio of pixels belonging to that cluster, where the color clusters are obtained in advance by clustering the color distribution of the whole image. In the case of region signatures, the signature is a set of region feature vectors together with the pixel ratios of the regions, where the regions are obtained in advance by an image segmentation method. To compute the dissimilarity between two signatures, we use the Earth Mover's Distance (EMD) [7].
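As a concrete illustration, the following pure-Python sketch extracts a color signature from a list of RGB pixel tuples. It is hypothetical: a real system would use an optimized k-means implementation, and the EMD between two such signatures would then be computed with a transportation-problem solver [7], which is omitted here.

```python
# Hypothetical color-signature extraction: k-means clusters the pixel
# colors, and the signature is the set of (mean color, pixel ratio) pairs.
import random

def color_signature(pixels, k=3, iters=20, seed=0):
    rng = random.Random(seed)
    # Initialize with k distinct colors for determinism.
    centers = rng.sample(sorted(set(pixels)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    # Signature: (mean color vector, ratio of pixels in the cluster).
    return [(centers[i], len(cl) / len(pixels))
            for i, cl in enumerate(clusters) if cl]

# Toy image: mostly green pixels with some brown.
pixels = [(60, 160, 60)] * 70 + [(120, 80, 40)] * 30
sig = color_signature(pixels, k=2)
print(sig)
```

For this toy image the signature converges to the two true colors with weights 0.7 and 0.3; a real photograph would of course yield a less clean clustering.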

In the classification stage, we employ the k-nearest neighbor (k-NN) method to classify an unknown input image into one of the classes. The value of k was set to 5 based on preliminary experiments.
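The k-NN step can be sketched as follows. This is a hypothetical illustration: Euclidean distance on toy 2-D features stands in for the EMD between signatures, and the distance function is pluggable.

```python
# Minimal k-NN classification sketch; `dist` is pluggable (the paper
# would use EMD between signatures, here Euclidean distance stands in).
from collections import Counter

def knn_classify(query, labeled, dist, k=5):
    """labeled: list of (feature, class_label) pairs."""
    neighbors = sorted(labeled, key=lambda fc: dist(query, fc[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy training set: two well-separated classes in 2-D.
train = [((0, 0), "cow"), ((1, 0), "cow"), ((0, 1), "cow"),
         ((9, 9), "lion"), ((8, 9), "lion"), ((9, 8), "lion")]
print(knn_classify((1, 1), train, euclid, k=5))  # -> cow
```

With k=5 the three nearest ``cow'' samples outvote the two ``lion'' samples, which is why majority voting over several neighbors is more robust than taking the single nearest image.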


We carried out six classification experiments, no.1 to no.6: no.1 on 10 classes of gathered images; no.2 on the same 10 classes restricted to correct images only (selected by hand); no.3 on 10 classes of images selected from a commercial image database (Corel Image Gallery); no.4 on 20 classes of gathered images; no.5 on the same 20 classes restricted to correct images only; and no.6 on 50 classes of gathered images. In the experiments, we exploited three kinds of image features: color signatures, region signatures using the k-means clustering method, and region signatures using the JSEG segmentation algorithm [3].

In experiment no.1, we gathered images from the Web for the 10 class keywords related to animals shown in Table 1. The total number of gathered images was 4582, and the precision (pri.) by subjective evaluation was 68.2%, which is defined as NOK/(NOK+NNG), where NOK and NNG are the numbers of images relevant and irrelevant to their keywords, respectively.

Table 1: Results of image-gathering (left) and classification (right) in experiment no.1, no.2, and no.3.

Table 1 shows the classification results evaluated by 5-fold cross-validation for experiments no.1, no.2, and no.3. Only the per-class results by color signatures are shown, since the results by color signatures were superior to those by region signatures using k-means and JSEG in most cases. In the tables, ``(r1)'' and ``(r2)'' denote region signatures using the k-means clustering method and region signatures using the JSEG region segmentation method, respectively. The recall (rec.) is defined as MOK/Mtest, the precision (pri.) as MOK/(MOK+MNG), and the F-measure (F) as the harmonic mean of recall and precision, where MOK, MNG, and Mtest are the number of correctly classified images, the number of incorrectly classified images, and the number of test images for each class, respectively. All values are given as percentages. In experiment no.1, we obtained an F-measure of 34.3 by color signatures.
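The per-class evaluation measures defined above can be computed as follows (a hypothetical helper using the same MOK, MNG, and Mtest counts; values are returned as percentages):

```python
# Per-class recall, precision, and F-measure (harmonic mean),
# following the definitions in the text; results in percent.
def class_metrics(m_ok, m_ng, m_test):
    recall = 100.0 * m_ok / m_test
    precision = 100.0 * m_ok / (m_ok + m_ng)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# Example: 40 of 100 test images of a class classified correctly,
# 30 images of other classes wrongly assigned to this class.
rec, pre, f = class_metrics(m_ok=40, m_ng=30, m_test=100)
print(round(rec, 1), round(pre, 1), round(f, 1))  # 40.0 57.1 47.1
```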

In experiment no.2, we selected by hand only the correct images for each class from the gathered images, and carried out the classification experiment on them by 5-fold cross-validation. Compared to no.1, the F-measure increased. In particular, the result for ``whale'' was good, since most ``whale'' images on the Web depict ``whale watching'' scenes.
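The 5-fold cross-validation protocol can be sketched as follows (a hypothetical illustration, not the authors' exact code; each image serves as a test item exactly once):

```python
# Simple k-fold cross-validation split: each item lands in exactly
# one test fold, and all remaining items form the training set.
def k_fold_splits(items, k=5):
    for i in range(k):
        test = items[i::k]  # every k-th item, starting at offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    print(test)
```

In practice one would shuffle the images once before splitting, so that each fold has a similar class distribution.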

In experiment no.3, as a control experiment, we carried out classification not on Web images but on 500 images of 10 classes from the Corel Image Gallery. Since the Corel Image Gallery includes many images that are similar to each other, a high F-measure of 68.1 was obtained by region signatures with the k-means method.

In experiments no.4 and no.5, we carried out experiments for the 20 class keywords (Table 2), which cover many different kinds of concepts. Figure 2 shows part of the images gathered from the Web. We obtained F-measures of 42.3 and 46.7, respectively, as shown in Table 3. These results are superior to the results of experiments no.1 and no.2 with only 10 classes, because all the classes used in no.1 and no.2 are related to animals, so their images include many similar images even across different classes.

Table 2: 20 class keywords.
apple, bear, bike, lake, car, cat, entrance ceremony, house, Ichiro, Ferris wheel, lion, Moai, Kinkaku Temple, note PC, bullet train, park, penguin, noodle, wedding, Mt.Yari

Table 3: Results of experiment no.4, no.5 and no.6.

Figure 2: Part of the images gathered from the Web in the experiments.

In experiment no.6, we carried out a classification experiment for 50 class keywords. We obtained an F-measure of 40.3 by color signatures. This result is comparable to the results of the 20-class experiments. It indicates that the difficulty of classification depends on the dispersion of each class's image features in the feature space, not simply on the number of classes.


The results of all the experiments are comparable to conventional work on generic image recognition. However, unlike that work, we provide training images not by hand but by automatically gathering images related to the class keywords from the Web. Collecting as diverse a set of images as the one used in experiment no.6 used to be a troublesome task; Web image gathering turns it into an easy one. By providing only class keywords at the start, with no human intervention during processing, we obtained an F-measure of about 40. This shows the effectiveness of our method.


  1. K. Barnard and D. A. Forsyth. Learning the semantics of words and pictures. In Proc. of IEEE International Conference on Computer Vision, volume II, pages 408--415, 2001.
  2. S. Belongie, C. Carson, H. Greenspan, and J. Malik. Recognition of images in large databases using a learning framework. Technical Report 07-939, UC Berkeley CS Tech Report, 1997.
  3. Y. Deng and B. S. Manjunath. Unsupervised segmentation of color-texture regions in images and video. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(8):800--810, 2001.
  4. P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. of European Conference on Computer Vision, 2002.
  5. C. Frankel, M. J. Swain, and V. Athitsos. WebSeer: An image search engine for the World Wide Web. Technical Report TR-96-14, University of Chicago, 1996.
  6. Y. Mori, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In Proc. of First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
  7. Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99--121, 2000.
  8. S. Sclaroff, M. LaCascia, et al. Unifying textual and visual cues for content-based image retrieval on the World Wide Web. Computer Vision and Image Understanding, 75(1/2):86--98, 1999.
  9. J. R. Smith and S. F. Chang. Visually searching the Web for content. IEEE Multimedia, 4(3):12--20, 1997.
  10. K. Yanai, M. Shindo, and K. Noshita. A fast image-gathering system on WWW using a PC cluster. In Proc. of International Conference on Web Intelligence (LNAI 2198), pages 324--334, 2001.