Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections

Yunchao Gong1,  Liwei Wang2,  Micah Hodosh2,  Julia Hockenmaier2,  and  Svetlana Lazebnik2

1Department of Computer Science, University of North Carolina at Chapel Hill

2Department of Computer Science, University of Illinois at Urbana-Champaign

ECCV 2014



Abstract: This paper studies the problem of associating images with descriptive sentences by embedding them in a common latent space. We are interested in learning such embeddings from hundreds of thousands or millions of examples. Unfortunately, it is prohibitively expensive to fully annotate this many training images with ground-truth sentences. Instead, we ask whether we can learn better image-sentence embeddings by augmenting small fully annotated training sets with millions of images that have weak and noisy annotations (titles, tags, or descriptions). After investigating several state-of-the-art scalable embedding methods, we introduce a new algorithm called Stacked Auxiliary Embedding that can successfully transfer knowledge from millions of weakly annotated images to improve the accuracy of retrieval-based image description.
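The abstract describes embedding images and sentences into a common latent space and using it for retrieval-based image description. The paper's actual method (Stacked Auxiliary Embedding) is not reproduced here, but the following toy sketch illustrates the general idea of a learned linear map into a shared space plus nearest-neighbor sentence retrieval. All data, dimensions, and function names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired data: each image and its sentence derive from a shared latent code,
# standing in for real image features and sentence features.
n, d_img, d_txt, d_lat = 60, 20, 15, 5
latent = rng.normal(size=(n, d_lat))
X_img = latent @ rng.normal(size=(d_lat, d_img)) + 0.05 * rng.normal(size=(n, d_img))
X_txt = latent @ rng.normal(size=(d_lat, d_txt)) + 0.05 * rng.normal(size=(n, d_txt))

def fit_linear_map(X, Y, lam=1e-3):
    """Ridge-regression map W so that X @ W approximates Y (a crude shared space)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

W = fit_linear_map(X_img, X_txt)

def retrieve(img_feat):
    """Index of the most cosine-similar sentence for one image feature vector."""
    q = img_feat @ W
    q = q / np.linalg.norm(q)
    T = X_txt / np.linalg.norm(X_txt, axis=1, keepdims=True)
    return int(np.argmax(T @ q))

# Fraction of images whose own sentence is retrieved first.
accuracy = np.mean([retrieve(X_img[i]) == i for i in range(n)])
```

On this clean synthetic data the matched sentence is retrieved for nearly every image; the paper's contribution is making such embeddings work at scale with weak, noisy annotations rather than with fully paired data.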





Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier and Svetlana Lazebnik. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections. In ECCV 2014. [PDF LINK]

Code and Data

To download the full implementation of our method, please click the link below. The package contains (1) code to reproduce the results, (2) deep activation features for Flickr1M and Flickr30K, and (3) training data indexes. The package size is around 20 GB.


The webpage of the original Flickr30K dataset [1] is available here: [Flickr30K]

[1] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.