Image Collector II: An Over-One-Thousand-Images-Gathering System

Keiji Yanai
Department of Computer Science, The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi, Tokyo, JAPAN


We describe a system gathering over one thousand images from the World-Wide Web for one keyword. It is called the Image Collector II. The system has the following properties: (1) integrating a sophisticated keyword-based search method and an image-feature-based search method in a non-interactive manner, (2) on-demand image gathering using commercial text search engines, and (3) second image gathering for gathering over one thousand images by extracting words with high frequency from all HTML files with embedded output images in an initial image gathering. This three properties enable us to gather a large number of images related to one keyword easily.


Web image gathering, image search engine, image database


Due to the recent explosive growth of the World-Wide Web, we can easily access a large number of images from the Web. Therefore, we can consider the Web as a huge image database. However, most of those images on the Web are not categorized in terms of their contents and are not labeled with related keywords. We can use commercial image search engines such as Google Image Search and Ditto to search the Web for image files by giving them keywords. However, most of image search engines search for images based only on keywords in HTML documents that include images, and they do so without analyzing the content of the images.

To achieve an image search for the Web based on not only keywords but also the content of the images, we proposed an automatic image-gathering system from the Web that is constructed by integrating a keyword-based search method and an image-feature-based search method, which is called the Image Collector [5]. In the system, a user first gives query keywords to the system, and then obtains images associated with the keywords. First, using the existing commercial Web search engines for HTML documents, the system gathers images embedded in HTML documents related to the query keywords. Next, the system selects output images from collected images based on extracted image features. However, the system gather only several hundreds images by such one-time processing.

Our initial objective for this system was to gather a large number of images for web image mining research, but the number of gathered images turned out to be insufficient [6].

Thus, in this paper we propose a new method to gather more and more images from the Web. First, we extract words with high frequency from all HTML files with embedded output images in the first image-gathering processing, and using them as keywords we gather images from the Web again. Finally, we can obtain more than one thousand images for one keyword. In addition, in this paper we also propose word-feature-based image selection for improving accuracy of final output images.

Some Web image search systems such as WebSeer [1], WebSEEk [4] and Image Rover [3] have been reported so far, which can be regarded as an integration of a keyword-based search and an image-based-feature-based search. However, these systems carry out two kinds of searches one after the other in an interactive manner, and require gathering images over the Web in advance and making big indices of images on the Web. In contrast to those systems, our system only needs one-time input of query keywords and does not require making a large index in advance due to exploiting existing Web text search engines. Then, we call our system not a ``search'' system but a ``gathering'' system.


The processing of the Image Collector II consists of collection and selection stages. In addition, it extracts additional keywords and repeats two kinds of stages during a second image gathering. Figure 1 shows the processing flow.

Figure 1: Processing flow of the Image Collector II.

2.1 Collection Stage

In the collection stage, the system obtains URLs using some commercial text Web search engines, and by using those URLs, it gathers images from the Web. The algorithm is as follows:
  1. A user provides the system with two kinds of query keywords. One is a main keyword that best represents an image, and the other is an optional subsidiary keyword. For example, when we gather ``lion'' images, we use ``lion'' as a main keyword and ``animal'' as a subsidiary keyword.
  2. The system sends the main and subsidiary keywords as queries to the commercial search engines and obtains the URLs of the HTML documents related to the keywords.
  3. It fetches the HTML documents indicated by the URLs.
  4. It analyzes the HTML documents, and extracts the URLs of images embedded in the HTML documents with image-embedding tags (`` IMG SRC'' and `` A HREF''). For each of those images, the system calculates a score that represents the intensity of the relation between the image and the query keywords [3].
  5. Based on the score, the system classifies an image into three groups, A, B and C in the order from highly-evaluated one. The system fetches only image-files whose image belongs to either group A or B.

2.2 Selection Stage

In the selection stage, the system selects more appropriate images for the query keywords out of ones gathered in the collection stage. The selection is based on the image features described below.

  1. The system first makes image feature vectors for all the collected images. Our system uses a 6x6x6 color histogram in the Lu*v* color space. The dimensions of two kinds of features are 216.
  2. For images in group A, the distance (dissimilarity) between two images is calculated based on the quadratic form distance [2].
  3. Based on the distance between images, images in group A are grouped by the hierarchical cluster analysis method. Our system uses the farthest neighbor method (FN). In the beginning, each cluster has only one image, and the system repeats merging clusters until all distances between them are more than a certain threshold.
  4. Based on the distance between images, images in group A are grouped by the cluster analysis method.
  5. It throws away small clusters that have fewer images than a certain threshold value, regarding them as being irrelevant. It stores all images in the remaining clusters as output images.
  6. It selects images from group B whose distances to the images in the remaining clusters of group A are small and adds them to the output images.

After this image-feature-based selection, our system carries out the second selection for group B images by using word vectors extracted from the HTML documents with embedded images. Introducing the word vectors enables it to eliminate images embedded in the HTML documents whose topics are irrelevant and to ignore them.

  1. The system eliminates HTML tags and extracts words (only nouns, adjectives, and verbs) from HTML documents with embedded images selected by the aforementioned image feature selection. It counts the frequency of appearance of each word in the HTML documents, selects the top 500 words in terms of the frequency, and makes a 500-dimensional word vector whose elements are word frequencies weighted by Term Frequency and Inverse Document Frequency (TFIDF) for each of the images. All word vectors $W_i$ are normalized so that $|W_i|=1$. \verb| |In addition, we also made experiments using the Latent Semantic Indexing (LSI) methods. These methods compress word vectors with singular value decomposition like Principal Component Analysis (PCA). We compressed a 500-dimensional word vector into one of 100 dimensions.
  2. For selected images in group A, it clusters their word vectors based on the $k$-means clustering method. We used the inner product as the dissimilarity (distance) between word vectors.
  3. From selected images in group B, it picks up images whose distance to the masses of clusters in group A is less than a certain threshold in terms of word vectors. They are output images of group B found by the word-feature-based selection.

2.3 Second Image Gathering

In our image-gathering method, the more URLs of HTML documents we obtained, the more images we could gather. However, for one set of query keywords, the number of URLs obtained from Web search engines was limited because commercial search engines restrict the maximum number of URLs returned for one query. Thus, we propose a method to generate automatically new sets of query keywords for search engines.

The system extracts the top ten words (only nouns, adjectives, and verbs) with high frequency except for initial query keywords from all HTML files with embedded output images of the initial image gathering, and regards them as subsidiary query keywords. It generates ten sets of query keywords by adding each of ten subsidiary words to a main keyword, and then obtains a large number of URLs for the ten sets of query keywords. Then, for the second image gathering, using obtained URLs, the system goes through the collection and selection stages again.


Table 1 shows experimental results of the initial and second image gathering for eight keywords. It describes the number of URLs of HTML documents obtained from search engines, the number of images collected from the Web, and the number of selected images by three methods, image-feature-based, combination of image-feature-based and word-feature-based, and combination of image-feature-based and LSI-compressed-word-feature-based. It also describes the precision rate of the collected and selected images in parentheses. The precision represents the ratio of relevant images and was computed by the subjective evaluation.

Table 1: Results of the initial image gathering (in the second row) and the second image gathering (below the third row).

Thanks to introducing the second image gathering, the number of URLs obtained from the Web search engine and the number of selected images are 5.1 times and 2.2 times as many as ones in case of the initial image gathering on average. For ``lion'', we extracted ten words as new subsidiary keywords. They were ``safari'', ``zoo'', ``park'', ``elephant'', ``tiger'', ``Africa'', ``group'', ``giraffe'', ``mane'', and ``head''. Finally, we obtained 1171 images with a 64.4% precision on average for the LSI-based selection.


In this paper, we described a method, an implementation, and the experiments of an automatic image-gathering system for the web. We gathered more than one thousand images for one keyword and achieved the high precision of about 70% without any knowledge about target images by using word vectors and the LSI method.

In the present implementation, we use simple image features for the image selection. In future work, we plan to exploit more sophisticated image features to improve the precision rate. Moreover, we will apply gathered images for our web image mining project.


  1. C. Framkel, M. J. Swain, and V. Athitsos. WebSeer: An image search engine for the World Wide Web. Technical Report TR-96-14, University of Chicago, 1996.
  2. J. Hafner, H. Sawhney, et al. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(7):729--736, 1995.
  3. S. Sclaroff, M. LaCascia, et al. Unifying textual and visual cues for content-based image retrieval on the World Wide Web. Computer Vision and Image Understanding, 75(1/2):86--98, 1999.
  4. J. R. Smith and S. F. Chang. Visually searching the Web for content. IEEE Multimedia, 4(3):12--20, 1997.
  5. K. Yanai. Image collector: An image-gathering system from the World-Wide Web employing keyword-based search engines. In Proc. of IEEE International Conference on Multimedia and Expo, pages 704--707, 2001.
  6. K. Yanai. An experiment on generic image classification using Web images. In Proc. of the Third IEEE Pacific-Rim Conference on Multimedia (Springer LNCS no.2532), pages 303--310, 2002.