
Visually Grounded Neural Syntax Acquisition

Shi et al., "Visually Grounded Neural Syntax Acquisition", ACL 2019.

The Visually Grounded Neural Syntax Learner (VG-NSL) is an approach for learning syntactic representations and structures without explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of the captions, recursively composes a representation for each constituent, and matches these representations with images.

Intuition

Consider the figure below, paired with descriptive English captions. Given no prior knowledge of English and sufficiently many such pairs, one can infer the correspondence between spans of text and the visual concepts they describe, and from these correspondences, both the meaning of words and how they group into constituents.

This intuition motivates the use of image-text pairs to facilitate automated language learning, covering both syntax and semantics. The paper focuses on learning syntactic structures, and proposes the Visually Grounded Neural Syntax Learner (VG-NSL), which acquires syntax, in the form of constituency parsing, from images and their captions.

Methodology

VG-NSL consists of two modules, sketched in the sections below.

  1. Given an input caption, build a constituency parse tree, and recursively compose a representation for every constituent.
  2. Match the textual constituent representations with visual inputs, using visual-semantic embeddings.

The two modules are jointly optimized in an alternating fashion.

Textual Representation and Structures
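
As a rough sketch of this module, assuming PyTorch: the parser works bottom-up over the caption's word embeddings. A small scoring network rates every adjacent pair of constituents, one pair is selected (sampled during training, argmax at test time), and the selected pair is composed by adding the two embeddings and L2-normalizing the sum, which is the composition function the paper uses. The layer sizes and names below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class ConstituentParser(nn.Module):
    """Sketch of the bottom-up parser: repeatedly merge a pair of
    adjacent constituents until a single root span remains."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Scores how "mergeable" each adjacent pair of constituents is
        # (hidden size 128 is an assumption).
        self.score_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    @staticmethod
    def compose(a, b):
        # Composition from the paper: add the two constituent
        # embeddings, then L2-normalize the sum.
        merged = a + b
        return merged / merged.norm()

    def forward(self, word_embs):
        spans = list(word_embs)   # current constituents, leaves first
        merges = []               # merge positions; these define the tree
        logp = torch.zeros(())    # log-prob of sampled merges (for training)
        while len(spans) > 1:
            scores = torch.cat([
                self.score_mlp(torch.cat([spans[i], spans[i + 1]]))
                for i in range(len(spans) - 1)
            ])
            probs = torch.softmax(scores, dim=0)
            i = torch.multinomial(probs, 1).item()  # sample; argmax at test time
            logp = logp + probs[i].log()
            merges.append(i)
            spans[i:i + 2] = [self.compose(spans[i], spans[i + 1])]
        return spans[0], merges, logp  # root embedding, tree, log-likelihood
```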

Visual-Semantic Embeddings
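
The matching module follows the standard visual-semantic embedding setup: an (image, constituent) pair is scored by cosine similarity, and training uses a margin-based ranking loss in which every mismatched pair within a batch serves as a negative sample. A minimal sketch; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def match_score(image_emb, text_emb):
    # Image-constituent alignment: cosine similarity of the embeddings.
    return F.cosine_similarity(image_emb, text_emb, dim=-1)

def hinge_triplet_loss(image_embs, text_embs, margin=0.2):
    """Margin-based ranking loss over a batch. Rows of image_embs and
    text_embs (batch, dim) are matched pairs, assumed L2-normalized so
    the dot product equals cosine similarity."""
    scores = image_embs @ text_embs.t()        # (batch, batch) similarities
    pos = scores.diag().unsqueeze(1)           # matched-pair scores
    # Penalize negatives that come within `margin` of the positive score.
    cost_txt = (margin + scores - pos).clamp(min=0)      # wrong caption, same image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # wrong image, same caption
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    return (cost_txt[off_diag].sum() + cost_img[off_diag].sum()) / scores.size(0)
```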

Training

The visual-semantic representations (phi) and the constituency structures (theta) are optimized in an alternating fashion. At each iteration, given the constituency parse of a caption, phi is optimized to match the visual and textual representations. Next, given the visual grounding of the constituents, theta is optimized to produce constituents that can be better matched with the images. Since the parsing decisions are discrete, theta is trained with REINFORCE.
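
Below is a minimal sketch of one such iteration, reusing the components sketched above. Because the merge decisions are discrete, theta cannot be updated by backpropagation alone; the reward for REINFORCE is how well the sampled constituents align with the paired image (simplified here to the root constituent only). The encoder and embedder arguments, the optimizer setup, and the folding of both updates into one function are assumptions for brevity.

```python
import torch
import torch.nn.functional as F

def train_step(parser, image_encoder, word_embedder,
               images, captions, opt_phi, opt_theta):
    """One training iteration. `image_encoder` and `word_embedder` are
    hypothetical stand-ins for the phi (visual-semantic) side."""
    # Run the parser once per caption; reuse the results for both updates.
    roots, logps = [], []
    for cap in captions:
        root, _tree, logp = parser(word_embedder(cap))
        roots.append(root)
        logps.append(logp)
    img = F.normalize(image_encoder(images), dim=-1)  # (batch, dim)
    txt = torch.stack(roots)          # compose() already L2-normalizes

    # Step 1: given the parses, make phi match captions to images.
    loss_phi = hinge_triplet_loss(img, txt)

    # Step 2: given the grounding, reinforce merge decisions whose
    # constituents align well with the image (REINFORCE: reward times
    # the log-probability of the sampled parse).
    reward = match_score(img, txt).detach()           # (batch,)
    loss_theta = -(reward * torch.stack(logps)).mean()

    opt_phi.zero_grad()
    opt_theta.zero_grad()
    (loss_phi + loss_theta).backward()
    opt_phi.step()   # updates the visual-semantic embeddings (phi)
    opt_theta.step() # updates the parser's scoring network (theta)
```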

Evaluation and Results