We introduce the task of Audio-Visual Scene-Aware Dialog (AVSD). In AVSD, the agent's task is to answer, in natural language, questions about a short video. In the Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7, the agent grounds its responses on the dynamic scene, the audio, and the history (previous rounds) of the dialog.
Samples of the collected dialogs can be viewed here
Please cite this paper if you use the shared datasets:
@inproceedings{alamri2019audiovisual,
  title={Audio-Visual Scene-Aware Dialog},
  author={Huda Alamri and Vincent Cartillier and Abhishek Das and Jue Wang and Anoop Cherian and Irfan Essa and Dhruv Batra and Tim K. Marks and Chiori Hori and Peter Anderson and Stefan Lee and Devi Parikh},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2019}
}
{
  "Dialogs": [
    {
      "image_id": "YSE1G",
      "Summary": "the girl walks into a room with a dog with a towel around her neck . she does some stretches and then drops the towel ",
      "Caption": "a person walked through a doorway into the living room with a towel draped around their neck , and closed the door . the person stretched and threw the towel on the floor.",
      "Dialog": [
        { "Question": "is there only one person ?", "Answer": "there is only one person and a dog ." },
        { "Question 2": ...., "Answer 2": ..... },
        ..
        { "Question 10": ...., "Answer 10": ..... }
      ]
    }
  ]
}
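As a rough illustration of how a file in the format above might be consumed, the following Python sketch flattens a record into (video id, question, answer) triples. The helper name `iter_qa_pairs` is hypothetical, and the inline sample is hand-built from the example above with the elided rounds left out. Because later rounds appear to use numbered keys such as "Question 2", the sketch matches keys by prefix; that is an assumption about the real files, not a documented guarantee.

```python
import json

# Hand-built sample mirroring the example record above; only the first
# round is spelled out there, so only one round is included here.
sample_text = """
{
  "Dialogs": [
    {
      "image_id": "YSE1G",
      "Dialog": [
        {"Question": "is there only one person ?",
         "Answer": "there is only one person and a dog ."}
      ]
    }
  ]
}
"""


def iter_qa_pairs(data):
    """Yield (image_id, question, answer) for every round of every dialog.

    Hypothetical helper: keys are matched by prefix because later rounds
    may be stored as "Question 2", "Answer 2", and so on (an assumption).
    """
    for record in data["Dialogs"]:
        for turn in record["Dialog"]:
            question = next(
                v for k, v in turn.items() if k.strip().startswith("Question")
            )
            answer = next(
                v for k, v in turn.items() if k.strip().startswith("Answer")
            )
            yield record["image_id"], question, answer


pairs = list(iter_qa_pairs(json.loads(sample_text)))
for image_id, question, answer in pairs:
    print(image_id, "|", question, "|", answer)
```

Matching on the key prefix keeps the loop independent of how many rounds a dialog has.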
{
  'Data': {
    'questions': [
      'Is the guy in the red shirt dancing?',
      'Do you hear any audio at all?',
      …
    ],
    'answers': [
      'No music is heard there',
      'She is looking at the glass of water in her hands',
      …
    ],
    'dialogs': [
      {
        'image_id': …,
        'caption': …,
        'dialog': [
          {
            'question': index of the question in the questions list,
            'answer': index of the answer in the answers list,
            'answer_options': indices of 100 candidate answers from the answers list,
            'gt_index': index of the ground-truth answer in answer_options,
            'id': index of the question and answer in the dialog
          },
          …  (10 rounds of QAs)
        ]
      },
      …
    ],
    'split': 'train',
    'version': '1.0'
  }
}
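Since this second format stores questions and answers as indices into shared lists, a small lookup step is needed to recover the text. The sketch below shows one way this might be done in Python. The function name `resolve_dialog` is hypothetical, and the toy data (its video id, caption, the pairing of question with answer, and the two candidate options instead of 100) is made up for illustration and is not taken from the dataset.

```python
def resolve_dialog(data, dialog_entry):
    """Turn the index-based rounds of one dialog into text.

    `data` is assumed to hold the 'questions' and 'answers' lists, and
    `dialog_entry` one element of 'dialogs', as described above.
    """
    questions = data["questions"]
    answers = data["answers"]
    resolved = []
    for turn in dialog_entry["dialog"]:
        options = turn["answer_options"]
        resolved.append({
            "question": questions[turn["question"]],
            "answer": answers[turn["answer"]],
            # gt_index points into answer_options, which in turn holds
            # indices into the answers list.
            "ground_truth": answers[options[turn["gt_index"]]],
        })
    return resolved


# Toy data: strings come from the format description, everything else
# (ids, caption, pairing, option count) is illustrative only.
toy = {
    "questions": [
        "Is the guy in the red shirt dancing?",
        "Do you hear any audio at all?",
    ],
    "answers": [
        "No music is heard there",
        "She is looking at the glass of water in her hands",
    ],
    "dialogs": [
        {
            "image_id": "toy_video",
            "caption": "toy caption",
            "dialog": [
                {"question": 1, "answer": 0,
                 "answer_options": [1, 0], "gt_index": 1, "id": 0},
            ],
        }
    ],
}

rounds = resolve_dialog(toy, toy["dialogs"][0])
print(rounds[0]["question"], "->", rounds[0]["answer"])
```

In this toy example the ground-truth lookup through `answer_options` and `gt_index` lands on the same text as the direct `answer` index, which is a cheap consistency check to run when loading real files.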
| | Training | Validation | Test |
|---|---|---|---|
| # of Dialogs | 7,985 | 1,863 | 1,968 |
| # of Turns | 123,480 | 14,680 | 14,660 |
| # of Words | 1,163,969 | 138,314 | 138,790 |
We introduce the Audio Visual Scene-Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.
Dialog System Technology Challenges 7 (DSTC7)
Challenge details can be found on GitHub.
Georgia Tech