Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Abstract

Recently, joint video-language modeling has attracted increasing attention. However, most existing approaches focus on exploring the language model on top of a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model, and a joint embedding model. In our language model, we propose a dependency-tree structure model that embeds a sentence into a continuous vector space, preserving visually grounded meanings and word order. In the visual model, we leverage deep neural networks to capture essential semantic information from videos. In the joint embedding model, we minimize the distance between the outputs of the deep video model and the compositional language model in the joint space, and update the two models jointly. Based on these three parts, our system is able to accomplish three tasks: 1) natural language generation, 2) video retrieval, and 3) language retrieval. In the experiments, the results show that our approach outperforms SVM, CRF, and CCA baselines in predicting Subject-Verb-Object triplets and in natural sentence generation, and is better than CCA in the video retrieval and language retrieval tasks.
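The joint embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: two plain linear projections stand in for the deep video model and the dependency-tree language model, the distance is assumed to be squared Euclidean, and every name (joint_embedding_loss, joint_embedding_grads, retrieve, W_v, W_t) is hypothetical. The abstract does not say how degenerate solutions such as all-zero projections are avoided, and the sketch does not address that either.

```python
import numpy as np


def joint_embedding_loss(video_feats, text_feats, W_v, W_t):
    """Mean squared Euclidean distance between paired projections.

    Row i of video_feats and row i of text_feats describe the same clip.
    W_v and W_t are assumed linear stand-ins for the two models.
    """
    diff = video_feats @ W_v - text_feats @ W_t
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def joint_embedding_grads(video_feats, text_feats, W_v, W_t):
    """Gradients of the loss above with respect to both projections."""
    n = video_feats.shape[0]
    diff = video_feats @ W_v - text_feats @ W_t
    grad_W_v = 2.0 * video_feats.T @ diff / n
    grad_W_t = -2.0 * text_feats.T @ diff / n
    return grad_W_v, grad_W_t


def retrieve(query_emb, candidate_embs):
    """Index of the candidate closest to the query in the joint space."""
    dists = np.sum((candidate_embs - query_emb) ** 2, axis=1)
    return int(np.argmin(dists))


# Toy data (random, for illustration only): 8 paired video/sentence features.
rng = np.random.default_rng(0)
videos = rng.normal(size=(8, 5))
sentences = rng.normal(size=(8, 4))
W_v = rng.normal(size=(5, 3))
W_t = rng.normal(size=(4, 3))

# One joint gradient step: both projections are updated together.
loss_before = joint_embedding_loss(videos, sentences, W_v, W_t)
g_v, g_t = joint_embedding_grads(videos, sentences, W_v, W_t)
W_v, W_t = W_v - 0.01 * g_v, W_t - 0.01 * g_t
loss_after = joint_embedding_loss(videos, sentences, W_v, W_t)
print(loss_before, loss_after)
```

Under these assumptions, retrieval in either direction reduces to a nearest-neighbor lookup: embed a sentence and call retrieve against the video embeddings for video retrieval, or embed a video and search the sentence embeddings for language retrieval.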

Author and article information

Journal

Title: Proceedings of the AAAI Conference on Artificial Intelligence

Abbreviated Title: AAAI

Publisher: Association for the Advancement of Artificial Intelligence (AAAI)

ISSN (Electronic): 2374-3468

ISSN (Print): 2159-5399

Publication date (Created): March 01 2015

Publication date (Electronic): February 19 2015

Volume: 29

Issue: 1

Article

DOI: 10.1609/aaai.v29i1.9512

SO-VID: 91aa66b0-4796-4e82-b6f3-5267c8b99cbc
