Publications

My research broadly lies at the intersection of vision and language. Specifically, I am interested in grounding language in images and videos, which entails associating language phrases with visual concepts. Such visual-linguistic associations encompass objects, actions, and their relations, and are crucial to rich image and video understanding.

2026

Agentic Very Long Video Understanding
ACL 2026

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

Presents an agentic framework for very long video understanding that combines visual search, transcript search, and entity-graph reasoning.

Accepted to ACL 2026.

DiVE-k: Differential Visual Reasoning for Fine-Grained Image Recognition
ICLR 2026

Raja Kumar, Arka Sadhu, Ram Nevatia

Introduces a differential visual reasoning framework that uses preference-driven rollouts to improve fine-grained image recognition.

Published as a conference paper at ICLR 2026.

2024

Vision-Language Pre-training Generalization: From Image-Text Pairs to Diverse Vision-Text Tasks
WACV 2024

Arka Sadhu, Ram Nevatia

Studies how task-specific vision-language pre-training can transfer from image-text supervision into broader image and video reasoning tasks.

Presented at WACV 2024.

2023

Unaligned Video-Text Pre-training using Iterative Alignment
Preprint 2023

Arka Sadhu, Licheng Yu, Animesh Sinha, Yu Chen, Ram Nevatia, Ning Zhang

Explores iterative alignment strategies for learning from video and text pairs that are only weakly or noisily aligned.

2021

Gradient-based Memory Editing for Task-Free Continual Learning
NeurIPS 2021

Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren

Introduces memory editing mechanisms for task-free continual learning to better control forgetting without relying on explicit task boundaries.

Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction
ICIP 2021

Zhaoheng Zheng, Arka Sadhu, Ram Nevatia

Investigates how disentangling feature representations can jointly improve object detection and attribute recognition.

Utilizing Every Image Object for Semi-supervised Phrase Grounding
WACV 2021

Haidong Zhu, Arka Sadhu, Zhaoheng Zheng, Ram Nevatia

Shows how object-level structure can improve semi-supervised phrase grounding by making fuller use of image content.

Video Question Answering with Phrases via Semantic Roles
NAACL 2021

Arka Sadhu, Kan Chen, Ram Nevatia

Connects phrase-level supervision and semantic-role structure to improve video question answering beyond surface text matching.

Visual Semantic Role Labeling for Video Understanding
CVPR 2021

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Presents a semantic-role-centered formulation for video understanding, pairing structured roles with action-centric video reasoning.

Accompanied by the VidSitu benchmark and project site.

2020

Video Object Grounding using Semantic Roles in Language Description
CVPR 2020

Arka Sadhu, Kan Chen, Ram Nevatia

Grounds video objects through the semantic structure of language, tying textual roles to video evidence over time.

Visually Grounded Continual Learning of Compositional Phrases
EMNLP 2020

Xisen Jin, Junyi Du, Arka Sadhu, Ram Nevatia, Xiang Ren

Combines visually grounded language learning with continual learning, focusing on compositional phrase understanding over time.

2019

Zero-Shot Grounding of Objects from Natural Language Queries
ICCV 2019

Arka Sadhu, Kan Chen, Ram Nevatia

Addresses phrase grounding in the zero-shot setting, focusing on transferring to queries and object categories unseen during training.

Accepted as an oral presentation at ICCV 2019.