Publications

My research broadly lies at the intersection of vision and language. Specifically, I am interested in grounding language in images and videos, which entails associating language phrases with visual concepts. Such visual-linguistic associations encompass objects, actions, and their relations, and are crucial to rich image and video understanding.

2024

WACV 2024

Vision-Language Pre-training Generalization: From Image-Text Pairs to Diverse Vision-Text Tasks

Arka Sadhu, Ram Nevatia

Studies how vision-language pre-training on image-text pairs can generalize to a diverse set of image and video reasoning tasks.

Presented at WACV 2024.

2023

Preprint 2023

Unaligned Video-Text Pre-training using Iterative Alignment

Arka Sadhu, Licheng Yu, Animesh Sinha, Yu Chen, Ram Nevatia, Ning Zhang

Explores iterative alignment strategies for pre-training on video-text pairs that are only weakly or noisily aligned.

2021

NeurIPS 2021

Gradient-based Memory Editing for Task-Free Continual Learning

Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren

Introduces a gradient-based memory-editing mechanism for task-free continual learning, mitigating forgetting without relying on explicit task boundaries.

ICIP 2021

Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction

Zhaoheng Zheng, Arka Sadhu, Ram Nevatia

Investigates how reducing entanglement between feature representations can jointly improve object detection and attribute recognition.

WACV 2021

Utilizing Every Image Object for Semi-supervised Phrase Grounding

Haidong Zhu, Arka Sadhu, Zhaoheng Zheng, Ram Nevatia

Shows how object-level structure can improve semi-supervised phrase grounding by making fuller use of image content.

NAACL 2021

Video Question Answering with Phrases via Semantic Roles

Arka Sadhu, Kan Chen, Ram Nevatia

Connects phrase-level supervision and semantic-role structure to improve video question answering beyond surface text matching.

CVPR 2021

Visual Semantic Role Labeling for Video Understanding

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Presents a semantic-role-centered formulation for video understanding, pairing structured roles with action-centric video reasoning.

Accompanied by the VidSitu benchmark and project site.

2020

CVPR 2020

Video Object Grounding using Semantic Roles in Language Description

Arka Sadhu, Kan Chen, Ram Nevatia

Grounds objects in video through the semantic structure of language descriptions, tying textual roles to visual evidence over time.

EMNLP 2020

Visually Grounded Continual Learning of Compositional Phrases

Xisen Jin, Junyi Du, Arka Sadhu, Ram Nevatia, Xiang Ren

Combines visually grounded language learning with continual learning, focusing on compositional phrase understanding over time.

2019

ICCV 2019

Zero-Shot Grounding of Objects from Natural Language Queries

Arka Sadhu, Kan Chen, Ram Nevatia

Addresses phrase grounding in the zero-shot setting, focusing on generalizing to unseen queries and object categories.

Accepted as an oral presentation at ICCV 2019.