Framing Image Description as a Ranking Task:

Data, Models and Evaluation Metrics


The ability to associate images with natural language sentences that describe what is depicted in them is a hallmark of image understanding, and a prerequisite for applications such as sentence-based image search. In analogy to image search, we propose to frame sentence-based image annotation as the task of ranking a given pool of captions. We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. We introduce a number of systems that perform quite well on this task, even though they are only based on features that can be obtained with minimal supervision. Our results clearly indicate the importance of training on multiple captions per image, and of capturing syntactic (word order-based) and semantic features of these captions. We also perform an in-depth comparison of human and automatic evaluation metrics for this task, and propose strategies for collecting human judgments cheaply and on a very large scale, allowing us to augment our collection with additional relevance judgments of which captions describe which image. Our analysis shows that metrics that consider the ranked list of results for each query image or sentence are significantly more robust than metrics that are based on a single response per query. Moreover, our study suggests that the evaluation of ranking-based image description systems may be fully automated.




“NN-5” corresponds to a Nearest Neighbor baseline that incorporates all 5 available captions per image of our dataset at training time and uses IDF weighting on words. “BoW1” corresponds to KCCA with a simple bag-of-words representation using 1 training caption per image; “BoW5” uses all 5 training captions instead. “TagRank” uses the text kernel from the related work of Hwang and Grauman (2011), and “Tri5” uses a kernel that incorporates word sequences of up to length 3. The final line corresponds to our final model, which combines IDF weighting, trigram sequences, and our semantic kernels, and yields a significant increase in performance on all retrieval metrics and most annotation metrics.

Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Hardoon, D. R., Szedmak, S. R., and Shawe-Taylor, J. R. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16, 2639–2664.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Hwang, S. and Grauman, K. (2011). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. International Journal of Computer Vision, pages 1–20. doi:10.1007/s11263-011-0494-3.

Sample Results

In order to evaluate the quality of the induced semantic space, we cast annotation and retrieval as IR tasks: a better semantic space should place an image closer to its gold caption. For annotation, we take each test image and rank all test captions by their distance to it (for retrieval, the roles of images and captions are reversed). In the first experiment, we measure how often the gold caption (or image) is among the top 1/5/10 closest results. However, more than one caption may describe the same image, so in the second experiment we had people judge the quality of image–caption pairs and measure how often an acceptable caption appears among the top 1/5/10 results. (See the paper for more experiments, analysis, and details.)
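The annotation protocol above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the function name and the convention that image i's gold caption has the same index i are assumptions for the sketch.

```python
import numpy as np

def recall_at_k(image_vecs, caption_vecs, ks=(1, 5, 10)):
    """For each test image, rank all test captions by cosine similarity in
    the induced space and check whether the gold caption (assumed to share
    the image's row index) appears among the top-k results."""
    # Normalize rows so dot products equal cosine similarities.
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    sims = imgs @ caps.T                       # (n_images, n_captions)
    # Rank of the gold caption = number of captions scored at least as high.
    gold = np.diag(sims)
    ranks = (sims >= gold[:, None]).sum(axis=1)
    return {k: float((ranks <= k).mean()) for k in ks}
```

Retrieval is the same computation with the roles of the two matrices swapped.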

Canonical Correlation Analysis (CCA) (Hotelling, 1936) takes a set of paired data (i.e., the representation of an image and its corresponding caption) and learns a linear projection into a new induced space for both types of data so as to maximize the correlation of corresponding points in the new space. Kernel Canonical Correlation Analysis (Bach and Jordan, 2002; Hardoon et al., 2004) allows CCA to operate in a higher-dimensional feature space without having to compute that space explicitly.
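The linear case can be sketched in a few lines. This is a minimal, regularized linear CCA (not the kernelized variant the paper actually uses), and the function name and regularization constant are illustrative assumptions:

```python
import numpy as np

def cca_first_direction(X, Y, reg=1e-3):
    """Minimal linear CCA sketch: finds one projection vector per view
    that maximizes the correlation of paired samples in X and Y."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Whiten both views, then take the top singular pair of the
    # whitened cross-covariance: its singular value is the first
    # canonical correlation.
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, S, Vt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is)
    wx = Cxx_is @ U[:, 0]
    wy = Cyy_is @ Vt[0]
    return wx, wy, S[0]
```

KCCA replaces the feature matrices with kernel matrices and adds regularization on the kernels, but the optimization objective is the same.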

We utilize KCCA to induce a common “semantic space” between images and captions in order to produce captions for images and to retrieve related images for a given caption.

Our final models extend beyond the standard bag-of-words representation of the captions by utilizing subsequence kernels and kernels that capture semantic similarity, which increases the quality of the induced space.
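To make the idea concrete, a drastically simplified word-sequence kernel might count shared contiguous sequences of up to three words. This sketch omits the gap-weighting and semantic matching of the actual subsequence kernels; the function name is hypothetical:

```python
from collections import Counter

def ngram_kernel(a, b, max_n=3):
    """Similarity between two captions: the number of shared contiguous
    word sequences of length 1..max_n (a simplified stand-in for the
    subsequence kernels, which also allow gaps and weighting)."""
    def ngrams(sentence):
        words = sentence.lower().split()
        return Counter(
            tuple(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)
        )
    ca, cb = ngrams(a), ngrams(b)
    # Intersection count over all shared n-grams.
    return sum(min(ca[g], cb[g]) for g in ca if g in cb)
```

Unlike a bag-of-words kernel, this captures word order: "dog bites man" and "man bites dog" share unigrams but no bigrams or trigrams.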


The paper

Downloading The Flickr 8k Dataset

The Flickr 8K dataset includes images obtained from the Flickr website. Use of these images must respect the Flickr terms of use.

M. Hodosh, P. Young and J. Hockenmaier (2013) "Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics", Journal of Artificial Intelligence Research, Volume 47, pages 853-899

We do not own the copyright of the images. We solely provide the link below for researchers and educators who wish to use the dataset for non-commercial research and/or educational purposes.

Model Demos