We introduce the task of audio-visual scene-aware dialog (AVSD). In AVSD, the agent's task is to answer, in natural language, questions about a short video. In the Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7, the agent grounds its responses in the dynamic scene, the audio, and the history (previous rounds) of the dialog.


  • Audio Visual Scene-Aware Dialog

  • End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

  • Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7


Audio-Visual Scene Aware Dialog Dataset V-0.1

Samples of the collected dialogs can be viewed here


Please cite this paper if you use the shared datasets

    title={Audio-Visual Scene-Aware Dialog},
    author={Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},


  • The AVSD dataset consists of text-based human conversations about short videos from the Charades dataset
  • Charades videos can be downloaded here
  • Each dialog consists of 10 rounds of questions and answers
  • We followed the data split provided in the Charades dataset
  • Annotation files for the train and validation sets are available for download

  • AVSD Dataset

  • AVSD_train download
  • AVSD_val download
  • AVSD_train_options download
  • AVSD_val_options download

AVSD_train/val format
{ "Dialogs":
    [ { "image_id": "YSE1G",
        "Summary": "the girl walks into a room with a dog with a towel around her neck . she does some stretches and then drops the towel ",
        "Caption": "a person walked through a doorway into the living room with a towel draped around their neck , and closed the door . the person stretched and threw the towel on the floor.",
        "Dialog": [
            { "Question": "is there only one person ?",
              "Answer": "there is only one person and a dog .",
              "Question 2": ....,
              "Answer 2": ....,
              ....
              "Question 10": ....,
              "Answer 10": .... }
        ] } ] }
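A minimal Python sketch of how the question/answer pairs could be read from a file in this layout. It assumes the keys are spelled as shown above ("Dialogs", "image_id", "Dialog", "Question", "Answer", "Question 2", ...) and strips stray whitespace from key names; the sample text is a hypothetical miniature, with a placeholder second round, and the real files may differ.

```python
import json

# Hypothetical miniature of the train/val layout shown above.
# The second question/answer pair is a placeholder, not real data.
sample_text = """
{ "Dialogs": [
    { "image_id": "YSE1G",
      "Summary": "the girl walks into a room with a dog with a towel around her neck .",
      "Dialog": [
        { "Question": "is there only one person ?",
          "Answer": "there is only one person and a dog .",
          "Question 2": "placeholder question",
          "Answer 2": "placeholder answer" }
      ] }
] }
"""


def iter_qa_pairs(data):
    """Yield (image_id, question, answer) for every question key found."""
    for dialog in data["Dialogs"]:
        for entry in dialog["Dialog"]:
            # Normalize key names in case they carry leading/trailing spaces.
            fields = {key.strip(): value for key, value in entry.items()}
            for key, question in fields.items():
                if key.startswith("Question"):
                    # "Question 2" pairs with "Answer 2", "Question" with "Answer".
                    answer_key = "Answer" + key[len("Question"):]
                    yield dialog["image_id"], question, fields.get(answer_key)


pairs = list(iter_qa_pairs(json.loads(sample_text)))
for image_id, question, answer in pairs:
    print(f"[{image_id}] Q: {question} | A: {answer}")
```

To use it on a downloaded file, replace `json.loads(sample_text)` with `json.load(open(path))`, where `path` is wherever the annotation file was saved.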
AVSD_train/val_options format
{ 'Data': {
      'questions': [
          'Is the guy in the red shirt dancing?',
          'Do you hear any audio at all?',
          … ],
      'answers': [
          'No music is heard there',
          'She is looking at the glass of water in her hands',
          … ],
      'dialogs': [
          { 'question': the index of the question in the questions list,
            'answer': the index of the answer in the answers list,
            'answer_options': indices of 100 candidate answers from the answers list,
            'gt_index': index of the ground-truth answer in answer_options,
            'id': index of the question and answer in the dialog },
          … (10 rounds of QAs) ],
      'split': 'train',
      'version': '1.0' } }
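Because the options files store indices rather than text, each round has to be resolved against the questions and answers lists. The following Python sketch shows one way to do that. The `data` dictionary is a hypothetical miniature (two candidates per round instead of 100, and an invented pairing of question and answer), and `resolve_round` is an illustrative helper, not part of any released tooling.

```python
# Hypothetical miniature of the options layout; real files list
# 100 candidate answers per round.
data = {
    "questions": [
        "Is the guy in the red shirt dancing?",
        "Do you hear any audio at all?",
    ],
    "answers": [
        "No music is heard there",
        "She is looking at the glass of water in her hands",
    ],
    "dialogs": [
        {"question": 1, "answer": 0, "answer_options": [1, 0], "gt_index": 1, "id": 0},
    ],
}


def resolve_round(data, qa_round):
    """Replace the integer indices of one QA round with the text they point to."""
    candidates = [data["answers"][i] for i in qa_round["answer_options"]]
    return {
        "question": data["questions"][qa_round["question"]],
        "answer": data["answers"][qa_round["answer"]],
        "candidates": candidates,
        # gt_index points into answer_options, not into the answers list.
        "ground_truth": candidates[qa_round["gt_index"]],
    }


resolved = resolve_round(data, data["dialogs"][0])
print(resolved["question"], "->", resolved["ground_truth"])
```

Note the two levels of indirection: `gt_index` selects a position in `answer_options`, and that entry in turn selects a string from the answers list.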
Dataset Stats
                 Training     Validation   Test
# of Dialogs     7,985        1,863        1,968
# of Turns       123,480      14,680       14,660
# of Words       1,163,969    138,314      138,790
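As a quick sanity check, per-split averages can be derived from the counts above. The sketch below only divides the reported numbers; the dictionary keys and the `per_split_averages` helper are illustrative names.

```python
# Counts copied from the Dataset Stats table.
stats = {
    "Training": {"dialogs": 7985, "turns": 123480, "words": 1163969},
    "Validation": {"dialogs": 1863, "turns": 14680, "words": 138314},
    "Test": {"dialogs": 1968, "turns": 14660, "words": 138790},
}


def per_split_averages(stats):
    """Compute turns per dialog and words per turn for each split."""
    return {
        split: {
            "turns_per_dialog": counts["turns"] / counts["dialogs"],
            "words_per_turn": counts["words"] / counts["turns"],
        }
        for split, counts in stats.items()
    }


averages = per_split_averages(stats)
for split, avg in averages.items():
    print(f"{split}: {avg['turns_per_dialog']:.2f} turns/dialog, "
          f"{avg['words_per_turn']:.2f} words/turn")
```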


Dialog System Technology Challenges 7 (DSTC7) Track 3

We introduce the Audio Visual Scene-Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.

Dialog System Technology Challenges 7 (DSTC7)

Challenge details can be found on GitHub

Contact Information

halamri3@gatech.edu & chori@merl.com


Huda Alamri

Georgia Tech

Vincent Cartillier

Georgia Tech

Abhishek Das

Georgia Tech

Jue Wang


Stefan Lee

Georgia Tech

Peter Anderson

Georgia Tech

Irfan Essa

Georgia Tech

Dhruv Batra

Georgia Tech