About me

I am a Senior Research Scientist at Meta Reality Labs. I completed my PhD at the University of Southern California (USC), Los Angeles under the supervision of Prof. Ram Nevatia. My primary field of research was at the intersection of Computer Vision and Natural Language Processing. Specifically, I focused on grounding language in vision, with an emphasis on videos.

Broadly, my work grounds language in images and videos: the resulting visual-linguistic associations span objects, actions, and their relations, and are important for richer image and video understanding.

Prior to this, I completed my undergraduate degree in the Department of Electrical Engineering (EE) at the Indian Institute of Technology Bombay in 2018. For my BTech project, I worked with Prof. Subhasis Chaudhuri on Graph CNNs for disease detection from ECG signals.

Throughout my academic journey, I have gained valuable experience through internships at Meta AI, PRIOR@AI2, Wadhwani AI, USC, and Aalto University.


News and updates

  • June 2024

Joined Meta as part of the Surreal team.

  • May 2024

    Attended the hooding ceremony and officially submitted the thesis.

  • April 2024

    Defended the thesis.

  • Jan 2024

    Presented "Leveraging Task-Specific Pre-Training To Reason Across Images and Videos" at WACV 2024.

  • 2023

    Served as a reviewer for ICML, ACL, ICCV, EMNLP, NeurIPS, BMVC, WACV, CVPR, AURO, and TPAMI.

  • Jun 2021

    Presented "Visual Semantic Role Labeling for Video Understanding" at CVPR 2021.

  • Jun 2021

    Presented "Video Question Answering with Phrases via Semantic Roles" at NAACL 2021.

  • May 2021

    Recognized as an Outstanding Reviewer at CVPR 2021.

  • April 2021

    Released Video-QAP and VidSitu materials publicly.

  • June 2020

    Presented "Video Object Grounding using Semantic Roles in Language Description" at CVPR 2020.

  • Oct 2019

    Presented "Zero-Shot Grounding of Objects from Natural Language Queries" at ICCV 2019.

Selected publications

WACV 2024

Vision-Language Pre-training Generalization: From Image-Text Pairs to Diverse Vision-Text Tasks

Arka Sadhu, Ram Nevatia

Studies how task-specific vision-language pre-training can transfer from image-text supervision into broader image and video reasoning tasks.

Presented at WACV 2024.

Preprint 2023

Unaligned Video-Text Pre-training using Iterative Alignment

Arka Sadhu, Licheng Yu, Animesh Sinha, Yu Chen, Ram Nevatia, Ning Zhang

Explores iterative alignment strategies for learning from video and text pairs that are only weakly or noisily aligned.

NAACL 2021

Video Question Answering with Phrases via Semantic Roles

Arka Sadhu, Kan Chen, Ram Nevatia

Connects phrase-level supervision and semantic-role structure to improve video question answering beyond surface text matching.

CVPR 2021

Visual Semantic Role Labeling for Video Understanding

Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi

Presents a semantic-role-centered formulation for video understanding, pairing structured roles with action-centric video reasoning.

Introduces the VidSitu benchmark, with an accompanying project site.