Framing image description as a ranking task: data, models and evaluation metrics

References

Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

Hardoon, D. R., Szedmak, S. R., and Shawe-Taylor, J. R. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16, 2639–2664.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Hwang, S. and Grauman, K. (2011). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. International Journal of Computer Vision, 1–20. doi:10.1007/s11263-011-0494-3.

Models

Canonical Correlation Analysis (CCA) (Hotelling, 1936) takes a set of paired data (e.g., the representation of an image and the representation of its caption) and learns linear projections of both types of data into a new, induced space such that the correlation between corresponding points in that space is maximized. Kernel Canonical Correlation Analysis (KCCA) (Bach and Jordan, 2002; Hardoon et al., 2004) performs CCA in a higher-dimensional feature space that is defined implicitly by a kernel function, so this space never has to be computed explicitly.
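As a rough illustration of regularized KCCA, the NumPy sketch below assumes Gaussian (RBF) kernels, a ridge parameter `kappa`, and the regularized objective of maximizing αᵀK_xK_yβ subject to αᵀ(K_x + κI)²α = βᵀ(K_y + κI)²β = 1. Substituting u = (K_x + κI)α and v = (K_y + κI)β reduces this objective to a singular value decomposition. The helper names (`rbf_kernel`, `center_kernel`, `kcca`), the synthetic data and the parameter values are hypothetical. This is not the paper's implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2) for rows a of A, b of B
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_kernel(K):
    # Center a square training kernel in feature space: H K H with H = I - 11^T / n
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, kappa=0.1, dim=2):
    # Regularized KCCA: maximize alpha^T Kx Ky beta subject to
    # alpha^T (Kx + kappa I)^2 alpha = beta^T (Ky + kappa I)^2 beta = 1.
    # With u = (Kx + kappa I) alpha and v = (Ky + kappa I) beta, the optimum is given by
    # the top singular vectors of M = (Kx + kappa I)^{-1} Kx Ky (Ky + kappa I)^{-1}.
    I = np.eye(Kx.shape[0])
    Rx = np.linalg.solve(Kx + kappa * I, Kx)
    Ry = np.linalg.solve(Ky + kappa * I, Ky)  # equals Ky (Ky + kappa I)^{-1}: the factors commute
    U, s, Vt = np.linalg.svd(Rx @ Ry)
    alpha = np.linalg.solve(Kx + kappa * I, U[:, :dim])
    beta = np.linalg.solve(Ky + kappa * I, Vt[:dim].T)
    return alpha, beta, s[:dim]  # s: regularized canonical correlations, largest first

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))              # stand-in image features (hypothetical)
Y = X + 0.01 * rng.normal(size=(60, 2))   # stand-in caption features paired with X
Kx = center_kernel(rbf_kernel(X, X, gamma=0.5))
Ky = center_kernel(rbf_kernel(Y, Y, gamma=0.5))
alpha, beta, s = kcca(Kx, Ky, kappa=0.1, dim=2)
img_coords, cap_coords = Kx @ alpha, Ky @ beta  # training points in the shared space
```

Without regularization (`kappa = 0`), KCCA on invertible kernel matrices trivially finds perfect correlations. The ridge term is what makes the induced space meaningful. Its value here is only illustrative.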

We use KCCA to induce a common “semantic space” between images and captions. This allows us both to produce captions for images and to retrieve related images for a given caption.
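Once images and captions share one space, producing a caption for an image (or retrieving images for a caption) can be framed as ranking candidates by their similarity to the query. The sketch below makes three assumptions. It scores candidates by cosine similarity. It maps new items into the space through their kernel values against the training set, and it assumes those values are already centered consistently with the training kernel. The function names are hypothetical.

```python
import numpy as np

def project(k_new, coeffs):
    # Map new items into the shared space. Each row of k_new holds kernel values
    # between one new item and the n training items (assumed centered consistently
    # with the training kernel); coeffs is an n x d matrix such as alpha or beta.
    return k_new @ coeffs

def cosine_scores(query, candidates):
    # Cosine similarity between one query vector and every row of candidates
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

def rank(query, candidates):
    # Candidate indices ordered from most to least similar to the query
    return np.argsort(-cosine_scores(query, candidates), kind="stable")

# Caption search: an image's coordinates are the query, captions' coordinates the candidates.
# Image search for a caption uses the same functions with the roles swapped.
image_vec = np.array([1.0, 0.0])
caption_vecs = np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
print(rank(image_vec, caption_vecs))  # prints [1 0 2]
```

Ranking candidates instead of generating new text keeps the output restricted to existing captions. Retrieval quality then depends entirely on how well the induced space aligns the two modalities.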

Our final models go beyond the standard bag-of-words representation of captions. They use subsequence kernels and kernels that capture semantic similarity to improve the quality of the induced space.
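The exact subsequence kernel used here is not specified. As one hedged illustration, the sketch below implements a word-level gap-weighted subsequence kernel computed by dynamic programming. It sums over all common word subsequences of length `p`, which need not be contiguous. Each occurrence is discounted by `lam` raised to the number of words it spans in each caption. The function names, `p` and `lam` are illustrative assumptions, not the authors' settings.

```python
def subsequence_kernel(s, t, p, lam):
    """Gap-weighted subsequence kernel over word lists s and t (dynamic programming)."""
    n, m = len(s), len(t)
    # kp[i][a][b]: auxiliary K'_i on prefixes s[:a], t[:b] (spans counted to the prefix ends)
    kp = [[[0.0] * (m + 1) for _ in range(n + 1)] for _ in range(p)]
    for a in range(n + 1):
        for b in range(m + 1):
            kp[0][a][b] = 1.0
    for i in range(1, p):
        kpp = [[0.0] * (m + 1) for _ in range(n + 1)]  # auxiliary K''_i
        for a in range(1, n + 1):
            for b in range(1, m + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp[a][b] = lam * kpp[a][b - 1] + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp[a][b]
    # Sum over every pair of matching words that can end a common length-p subsequence
    k = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[p - 1][a - 1][b - 1]
    return k

def normalized_kernel(s, t, p=2, lam=0.5):
    # Cosine-normalize so that longer captions are not favored for having more subsequences
    denom = (subsequence_kernel(s, s, p, lam) * subsequence_kernel(t, t, p, lam)) ** 0.5
    return subsequence_kernel(s, t, p, lam) / denom if denom > 0 else 0.0

a = "a dog runs on the grass".split()
b = "a dog is running on grass".split()
print(normalized_kernel(a, b))
```

Unlike bag-of-words, this kernel rewards shared word order ("a dog", "on grass") while tolerating inserted words, at a cost of O(p · |s| · |t|) per pair. Normalized scores lie in [0, 1].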

Data

Downloading The Flickr 8k Dataset