Invited Speaker

Dr. Karan Sikka
Center for Vision Technologies, SRI International, Princeton, New Jersey

Karan Sikka is a Computer Vision Scientist at the Center for Vision Technologies, SRI International, in Princeton, New Jersey. He received his PhD from the Machine Perception Lab at UCSD, where he was advised by Dr. Marian Bartlett. Before joining UCSD, he completed his bachelor's degree in ECE at the Indian Institute of Technology Guwahati. His research interests span joint multimodal analytics and computer vision problems related to classification and detection in both images and videos. During his PhD he worked primarily on action classification in videos, for recognizing both human facial behavior and human actions. At SRI he has developed innovative prototypes and algorithms for deep multimodal (vision, language, and audio) learning to understand social media structure and content under the DARPA M3I, AFRL MESA, and ONR CEROSS programs.

Abstract
Food classification is a fine-grained classification problem, and obtaining manually curated training data for a large number of classes is prohibitively expensive. In this talk I will discuss our prior work on using noisy images from the web to train such models. We tackle a key problem with web food images: they often contain multiple co-occurring food types yet are weakly labeled with a single label. We first demonstrate that sequentially adding a few manually curated samples to a larger uncurated dataset drawn from two web sources increases top-1 classification accuracy from 50.3% to 72.8%. To address the weak labels, we augment the deep model with Weakly Supervised Learning (WSL), which further increases accuracy to 76.2%. I will then discuss our efforts on, and the outcome of, the first large-scale food classification challenge in images (the iFood challenge), held as part of the fifth Fine-Grained Visual Categorization Workshop at CVPR 2018. We introduce a new dataset of 211 fine-grained (prepared) food categories with 101,733 training images collected from the web, along with human-verified labels for both the validation set (10,323 images) and the test set (24,088 images). The challenge is currently ongoing, with 20 participating teams; the best team has obtained 91.5% top-3 accuracy on the public test set.
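As a rough illustration of the WSL idea described above, the sketch below shows a common multiple-instance-learning formulation: a 1x1 convolution turns the backbone's feature map into per-region class scores, and max-pooling over regions lets a single image-level label supervise only the most responsive region, so co-occurring foods need not all match the weak label. This is a minimal sketch under assumed choices (a PyTorch ResNet-50 backbone, max-pooling aggregation; the class count of 211 echoes the iFood dataset), not the speaker's actual implementation.

```python
# Minimal MIL-style weakly supervised classifier sketch (illustrative, not the
# talk's actual model). Backbone, pooling choice, and names are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class WeaklySupervisedFoodNet(nn.Module):
    def __init__(self, num_classes: int = 211):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Keep the convolutional trunk; drop global pooling and the FC head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv converts the feature map into a per-region class score map.
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        score_map = self.classifier(self.features(x))    # (B, C, H, W)
        # MIL aggregation: an image is positive for a class if at least one
        # spatial region responds, so pool with max over locations.
        logits = score_map.flatten(2).max(dim=2).values  # (B, C)
        return logits

model = WeaklySupervisedFoodNet()
images = torch.randn(2, 3, 224, 224)
labels = torch.tensor([3, 17])  # one weak (single) label per image
loss = nn.CrossEntropyLoss()(model(images), labels)
```

The max over regions is one of several aggregation choices; softer pooling (e.g., log-sum-exp or average of top-k region scores) trades off the same idea against noisier gradients.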