About me

I am a Senior Research Scientist at Meta Reality Labs. I completed my PhD at the University of Southern California (USC), Los Angeles under the supervision of Prof. Ram Nevatia. My primary field of research was at the intersection of Computer Vision and Natural Language Processing. Specifically, I focused on grounding language in vision, with an emphasis on videos.

Broadly, my work grounds language in images and videos: the resulting visual-linguistic associations span objects, actions, and their relations, and are important for richer image and video understanding.

Prior to this, I completed my undergraduate degree in the Department of Electrical Engineering (EE) at the Indian Institute of Technology Bombay in 2018. For my BTech project, I worked with Prof. Subhasis Chaudhuri on Graph CNNs for disease detection from ECG signals.

Throughout my academic journey, I have gained valuable experience through internships at Meta AI, PRIOR@AI2, Wadhwani AI, USC, and Aalto University.


News and updates

  • June 2024

Joined Meta as part of the Surreal team.

  • May 2024

    Attended the hooding ceremony and officially submitted the thesis.

  • April 2024

    Defended the thesis.

  • Jan 2024

    Presented "Leveraging Task-Specific Pre-Training To Reason Across Images and Videos" at WACV 2024.

  • 2023

    Served as a reviewer for ICML, ACL, ICCV, EMNLP, NeurIPS, BMVC, WACV, CVPR, AURO, and TPAMI.

  • Jun 2021

    Presented "Visual Semantic Role Labeling for Video Understanding" at CVPR 2021.

  • Jun 2021

    Presented "Video Question Answering with Phrases via Semantic Roles" at NAACL 2021.

  • May 2021

    Recognized as an Outstanding Reviewer at CVPR 2021.

  • April 2021

    Released Video-QAP and VidSitu materials publicly.

  • June 2020

    Presented "Video Object Grounding using Semantic Roles in Language Description" at CVPR 2020.

  • Oct 2019

    Presented "Zero-Shot Grounding of Objects from Natural Language Queries" at ICCV 2019.

Selected publications

WACV 2024

Vision-Language Pre-training Generalization: From Image-Text Pairs to Diverse Vision-Text Tasks

Arka Sadhu, Ram Nevatia

Studies how task-specific vision-language pre-training can transfer from image-text supervision into broader image and video reasoning tasks.

Presented at WACV 2024.

Preprint 2023

Unaligned Video-Text Pre-training using Iterative Alignment

Arka Sadhu, Licheng Yu, Animesh Sinha, Yu Chen, Ram Nevatia, Ning Zhang

Explores iterative alignment strategies for learning from video and text pairs that are only weakly or noisily aligned.

NAACL 2021

Video Question Answering with Phrases via Semantic Roles

Arka Sadhu, Kan Chen, Ram Nevatia

Connects phrase-level supervision and semantic-role structure to improve video question answering beyond surface text matching.

CVPR 2021

Visual Semantic Role Labeling for Video Understanding

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Presents a semantic-role-centered formulation for video understanding, pairing structured roles with action-centric video reasoning.

Introduces the VidSitu benchmark, with an accompanying project site.