Audio Visual Scene-Aware Dialog

Data

Audio-Visual Scene Aware Dialog Dataset V-0.1

Samples of the collected dialogs can be viewed here

Citation

Please cite this paper if you will use the shared data sets

   @inproceedings{alamri2019audiovisual,
             		 title={Audio-Visual Scene-Aware Dialog},
              		author={Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
              		booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
              		year={2019}
 			 }

ReadMe

AVSD Dataset consists of text-based human converstaions about short videos from the Charades Dataset
Charades Videos can be downloaded here
Each dialog consists of 10 round of questions/answeres
We followed the data split provided in the Charades dataset.
File annotations for train and validation sets are available to download

AVSD Dataset

AVSD_train download
AVSD_val download
AVSD_train_options download
AVSD_val_options download

AVSD_train/val format

-  {  "Dialogs":
		[  "image_id" : ""YSE1G",
		   "Summary": "the girl walks into a room with a dog with a towel around her neck . she does some stretches and then drops the towel ",
		   "Caption": "a person walked through a doorway into the living room with a towel draped around their neck , and closed the door . the person stretched and threw the towel on the floor."

		   "Dialog": [
		 {    " Question": "is there only one person ?"
     		      " Answer": "there is only one person and a dog .",
		 },
		 {
     	        	" Question 2": ....
     		        " Answer 2": .....
    			 ..
    			 ..
		 }
     			 " Question 10": ....
     			 " Answer 10": .....
		 }
   		 ] }.

AVSD_train/val_options format

		 {
		    ‘Data’ {
			‘questions’ {
					'Is the guy in the red shirt dancing?'
 					'Do you hear any audio at all?'
				     … }
			‘answers’: {
					'No music is heard there'
					'She is looking at the glass of water in her hands'
				    … }

			‘dialogs’ {

					'image_id': 
					‘caption’: 
					‘dialog’:
						 {    'question': the index of the question in questions list
	   					      'answer': the index of the answer in the answers list
	 				     	      'answer_options': 100 candidate answers indices from the answers list
	  				              'gt_index': index of the groundtruth answer in the answer_options
	   					      'id': index of the question and answer in the dialog
							……
	 						……
						 }
					10 rounds of QAs
				}
			‘split’: ‘train’
			‘version’: ‘1.0’
		}
		}

Dataset Stats

	Training	Validation	Test
# of Dialogs	7985	1863	1968
# of Turns	123,480	14,680	14,660
# of Words	1,163,969	138,314	138,790

Challenge

Dialog System Technology Challenges 7 (DSTC7) Track 3

We introduce the Audio Visual Scene-Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop1, the task is to build a system that generates responses in a dialog about an input video.

Dialog System Technology Challenges 7 (DSTC7)

Challenge Details can be found: GitHub