The 2016 Placing Task: Multimodal geo-location prediction
Register to participate in this challenge on the MediaEval 2016 registration site.

The placing task requires participants to estimate the locations where multimedia items (photos or videos) were captured solely by inspecting the content and metadata of these items, and optionally exploiting additional knowledge sources such as gazetteers. The purpose of this challenge is to further the area of multimedia retrieval. For example, the developed methods could help a rescue team to infer where exactly a family disappeared in a remote area by discovering the locations shown in videos uploaded to a social network before they lost contact.
We maintain the focus on human geography in this year’s task, where we consider not only geographic coordinates, but also geographic places like neighborhoods and cities. The placing task integrates all aspects of multimedia: text, audio, photo, video, location, time, users and context.

The following subtasks are offered:
• Estimation-based placing task. Participants are given a hierarchy of places across the world, ranging from neighborhoods to continents, and are asked to pick the node in the hierarchy in which they most confidently believe the photo or video was taken. While the ground truth locations of the photos and videos will be associated with the most specific nodes (i.e. leaves) in the hierarchy, participants can express reduced confidence in their location estimates by selecting nodes at higher levels in the hierarchy. If their confidence is sufficiently high, participants may instead directly estimate the geographic coordinate of the photo or video rather than choosing a node from the hierarchy.
• Verification-based placing task. Participants are given a set of photos and videos, each with a corresponding node in the hierarchy where it was supposedly captured. Participants are asked to determine whether or not each photo or video was really taken in the place the node corresponds to and, if so, at which exact geographic coordinate.
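The idea of trading precision for confidence in the estimation-based subtask can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy hierarchy, the per-leaf scores, the 0.6 threshold and the function names are all hypothetical and are not part of the task data or any official baseline. The sketch sums leaf scores up the hierarchy and returns the deepest node whose accumulated score reaches the threshold.

```python
from collections import defaultdict

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "Europe": None,
    "Netherlands": "Europe",
    "Amsterdam": "Netherlands",
    "Jordaan": "Amsterdam",
    "De Pijp": "Amsterdam",
    "Delft": "Netherlands",
}


def depth(node):
    """Number of edges between a node and the root of the hierarchy."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d


def pick_node(leaf_scores, threshold=0.6):
    """Return the deepest node whose accumulated score reaches the threshold.

    leaf_scores maps leaf names to scores (assumed to sum to at most 1).
    Returns None when no node is confident enough.
    """
    mass = defaultdict(float)
    for leaf, score in leaf_scores.items():
        node = leaf
        while node is not None:  # propagate the score to every ancestor
            mass[node] += score
            node = PARENT[node]
    candidates = [n for n, m in mass.items() if m >= threshold]
    if not candidates:
        return None
    # Prefer the deepest node; break ties by the larger accumulated score.
    return max(candidates, key=lambda n: (depth(n), mass[n]))
```

With scores spread over two Amsterdam neighborhoods, such as `pick_node({"Jordaan": 0.45, "De Pijp": 0.35, "Delft": 0.2})`, no single leaf reaches the threshold, so the sketch backs off to the city level.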

Target group
The task is of interest to researchers in the area of geographic multimedia information retrieval, social media, human mobility, and media analysis.

The dataset for this year’s task is a subset of the YFCC100M collection. You will find all data related to the task on our Google Drive, as well as instructions on how to read and interpret the data. We provide several visual, aural, and motion features to the participants so they can focus on solving the task rather than on reinventing the wheel.
We have only included photos and videos taken within any of the GADM boundaries, supplemented with neighborhood data for several cities obtained from ClickThatHood. Photos taken in or above international waters are therefore excluded; their locations are generally challenging to predict accurately anyway.

Ground truth and evaluation
The training set for the estimation-based task contains on the order of 16 million photos and 100 thousand videos, while the test set contains about 75 thousand photos and 1 thousand videos. The verification-based task will be formed by a subset of the estimation-based training and test sets. Note that no user appears in both the training set and the test set. To minimize user and location bias, each user was limited to contributing at most 250 photos and 50 videos, and we excluded photos/videos taken by the same user less than 10 minutes apart.
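The per-user limits described above can be pictured with a small Python sketch. It is one possible reading, not the organizers' actual procedure: the tuple layout, the function name and the rule of comparing each item against the previously kept item of the same user are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_PER_USER = {"photo": 250, "video": 50}  # caps stated in the task description
MIN_GAP = timedelta(minutes=10)             # minimum spacing between kept items


def filter_items(items):
    """Apply per-user caps and a minimum time gap (assumed interpretation).

    items: iterable of (user_id, kind, taken_at) tuples, with kind being
    "photo" or "video" and taken_at a datetime. Returns the kept tuples.
    """
    kept = []
    counts = defaultdict(int)  # (user_id, kind) -> number of items kept so far
    last_kept = {}             # user_id -> timestamp of the last kept item
    for user_id, kind, taken_at in sorted(items, key=lambda it: (it[0], it[2])):
        if counts[(user_id, kind)] >= MAX_PER_USER[kind]:
            continue  # this user already reached the cap for this media type
        previous = last_kept.get(user_id)
        if previous is not None and taken_at - previous < MIN_GAP:
            continue  # taken too soon after the previously kept item
        kept.append((user_id, kind, taken_at))
        counts[(user_id, kind)] += 1
        last_kept[user_id] = taken_at
    return kept
```

Under this reading, a burst of photos taken within a few minutes by one user contributes a single item, which limits how much any one user or spot can dominate the data.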

The evaluation of the runs submitted by participating groups will be similar to last year's, although this time we will evaluate the photos separately from the videos. We will measure the distances between the predicted and the actual geographic nodes/coordinates using Karney’s algorithm. This algorithm assumes that the shape of the Earth is an oblate spheroid, and therefore produces more accurate distances than methods such as the great-circle distance, which assume the shape of the Earth is a sphere.
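To give a feel for the difference between the two distance models, here is a minimal Python sketch. The great-circle distance is computed with the haversine formula on a sphere of mean radius. The ellipsoidal distance is delegated to the third-party geographiclib package, which implements Karney's algorithm, and is only attempted when that package is installed. The function names and the choice of the WGS84 ellipsoid are assumptions for illustration; this is not the organizers' evaluation code.

```python
import math

EARTH_MEAN_RADIUS_KM = 6371.0088  # spherical approximation of the Earth


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_MEAN_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def geodesic_km(lat1, lon1, lat2, lon2):
    """Ellipsoidal (WGS84) distance in km, or None if geographiclib is missing."""
    try:
        from geographiclib.geodesic import Geodesic  # optional third-party package
    except ImportError:
        return None
    # Inverse() solves the inverse geodesic problem; "s12" is the distance in meters.
    return Geodesic.WGS84.Inverse(lat1, lon1, lat2, lon2)["s12"] / 1000.0
```

Calling both functions on the same pair of coordinates shows how far the spherical estimate drifts from the ellipsoidal one; the gap is small in relative terms but grows with the distance between the points.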

We will provide several baseline methods (source code and performance evaluation) to give participants a starting point. In addition, we will ask previous participants who have open-sourced their approaches to contribute them as additional baselines. We will inform all participants when the baselines are released.

We will have a running leaderboard system, where participants can submit up to two runs a day and view their standing relative to others, as evaluated on a representative development set (i.e. a part of the test set, but not the complete test set). Participants are not required to submit their runs to the leaderboard, and may hide their identity if they so desire.

Recommended reading

[1] Hays, J., Efros, A. A. “IM2GPS: Estimating Geographic Information from a Single Image”. In Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, 2008.
[2] Cao, L., Yu, J., Luo, J., Huang, T. “Enhancing Semantic and Geographic Annotation of Web Images Via Logistic Canonical Correlation Regression”. In Proceedings of the ACM International Conference on Multimedia, 2009, pp. 125-134.
[3] Yin, Z., Cao, L., Han, J., Zhai, C., Huang, T. “Geographical Topic Discovery and Comparison”. In Proceedings of the ACM International Conference on World Wide Web, 2011, pp. 247-256.
[4] Larson, M., Soleymani, M., Serdyukov, P., Rudinac, S., Wartena, C., Murdock, V., Friedland, G., Ordelman, R., Jones, G. J.F. “Automatic Tagging and Geotagging in Video Collections and Communities”. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2011, pp. 51-54.
[5] Luo, J., Joshi, D., Yu, J., Gallagher, A. “Geotagging in Multimedia and Computer Vision - A Survey”. In Springer Multimedia Tools and Applications, Special Issue: Survey Papers in Multimedia by World Experts, 51(1), 2011, pp. 187-211.
[6] Van Laere, O., Schockaert, S., Dhoedt, B. “Georeferencing Flickr resources based on textual meta-data”. In Journal of Information Sciences, 238, 2013, pp. 52-73.
[7] Penatti, O.A.B., Li, L. T., Almeida, J., Torres, R. da S. “A Visual Approach for Video Geocoding using Bag-of-Scenes”. In Proceedings of the ACM International Conference on Multimedia Retrieval. ACM, 2012, article 53.
[8] Choi, J., Lei, H., Ekambaram, V., Kelm, P., Gottlieb, L., Sikora, T., Ramchandran, K., Friedland, G. “Human vs. Machine: Establishing a Human Baseline for Multimodal Location Estimation”. In Proceedings of the ACM International Conference on Multimedia, 2013, pp. 866-867.
[9] Kelm, P., Schmiedeke, S., Choi, J., Friedland, G., Ekambaram, V., Ramchandran, K., Sikora, T. “A Novel Fusion Method for Integrating Multiple Modalities and Knowledge for Multimodal Location Estimation”. In Proceedings of the ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, 2013, pp. 7-12.
[10] Trevisiol, M., Jégou, H., Delhumeau, J., Gravier, G. “Retrieving Geo-location of Videos with a Divide & Conquer Hierarchical Multimodal Approach”. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2013.

Task organizers
General contact: ​
Bart Thomee, Yahoo Labs, San Francisco, CA, USA
Olivier Van Laere, Blueshift Labs, San Francisco, CA, USA
Claudia Hauff, TU Delft, Netherlands
Jaeyoung Choi, ICSI, Berkeley, CA, USA / TU Delft, Netherlands

Task schedule
9 May 2016: Data (development + test) released.
2 Sept. 2016: Run submission deadline.
16 Sept. 2016: Results returned.
30 Sept. 2016: Working notes paper deadline.
20-21 Oct. 2016: MediaEval 2016 Workshop, right after ACM MM 2016 in Amsterdam.