Publications

My research broadly lies at the intersection of vision and language. Specifically, I am interested in grounding language in images and videos, which entails associating language phrases with visual concepts. Such visual-linguistic associations encompass objects, actions, and their relations, and are crucial to rich image and video understanding.

2026

Agentic Very Long Video Understanding
ACL 2026

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

Presents an agentic framework for very long video understanding that combines visual search, transcript search, and entity-graph reasoning.

Accepted to ACL 2026.

DiVE-k: Differential Visual Reasoning for Fine-Grained Image Recognition
ICLR 2026

Raja Kumar, Arka Sadhu, Ram Nevatia

Introduces a differential visual reasoning framework that uses preference-driven rollouts to improve fine-grained image recognition.

Published as a conference paper at ICLR 2026.

2024

Vision-Language Pre-training Generalization: From Image-Text Pairs to Diverse Vision-Text Tasks
WACV 2024

Arka Sadhu, Ram Nevatia

Studies how task-specific vision-language pre-training can transfer from image-text supervision into broader image and video reasoning tasks.

Presented at WACV 2024.

2023

Unaligned Video-Text Pre-training using Iterative Alignment
Preprint 2023

Arka Sadhu, Licheng Yu, Animesh Sinha, Yu Chen, Ram Nevatia, Ning Zhang

Explores iterative alignment strategies for learning from video and text pairs that are only weakly or noisily aligned.

2021

Gradient-based Memory Editing for Task-Free Continual Learning
NeurIPS 2021

Xisen Jin, Arka Sadhu, Junyi Du, Xiang Ren

Introduces memory editing mechanisms for task-free continual learning to better control forgetting without relying on explicit task boundaries.

Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction
ICIP 2021

Zhaoheng Zheng, Arka Sadhu, Ram Nevatia

Investigates how disentangling feature representations can jointly improve object detection and attribute recognition.

Utilizing Every Image Object for Semi-supervised Phrase Grounding
WACV 2021

Haidong Zhu, Arka Sadhu, Zhaoheng Zheng, Ram Nevatia

Shows how object-level structure can improve semi-supervised phrase grounding by making fuller use of image content.

Video Question Answering with Phrases via Semantic Roles
NAACL 2021

Arka Sadhu, Kan Chen, Ram Nevatia

Connects phrase-level supervision and semantic-role structure to improve video question answering beyond surface text matching.

Visual Semantic Role Labeling for Video Understanding
CVPR 2021

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Presents a semantic-role-centered formulation for video understanding, pairing structured roles with action-centric video reasoning.

Accompanied by the VidSitu benchmark and project site.

2020

Video Object Grounding using Semantic Roles in Language Description
CVPR 2020

Arka Sadhu, Kan Chen, Ram Nevatia

Grounds video objects through the semantic structure of language, tying textual roles to video evidence over time.

Visually Grounded Continual Learning of Compositional Phrases
EMNLP 2020

Xisen Jin, Junyi Du, Arka Sadhu, Ram Nevatia, Xiang Ren

Combines visually grounded language learning with continual learning, focusing on compositional phrase understanding over time.

2019

Zero-Shot Grounding of Objects from Natural Language Queries
ICCV 2019

Arka Sadhu, Kan Chen, Ram Nevatia

Addresses phrase grounding in the zero-shot setting, focusing on transferring to queries and object categories unseen during training.

Accepted as an oral presentation at ICCV 2019.