[Arxiv 1805.11592] Playing hard exploration games by watching YouTube [PDF] [notes]
Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas
read 2019/07/03
Goal: learn, via self-supervised learning, embeddings that are robust to visual variation, so they can serve downstream reinforcement learning objectives
Domain shift exists between arcade video games and their video recordings at the pixel level (color shift, resizing, game/display version, ...). The learned embedding should therefore be robust to changes in pixel space, while still capturing something about the underlying world model (/rules)
Surrogate task (temporal distance classification, TDC): predict the temporal distance between two frames of the same video, which forces the model to learn how the "world" evolves over time
This objective is framed as a classification task over frame-distance buckets [0], [1], [2], [3-4], [5-20], [21-200]. The embedding and classification networks are trained jointly with a cross-entropy loss (see the sketch below).
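A minimal PyTorch sketch of this objective, assuming a toy conv encoder over 84x84 grayscale frames; the architecture, the `bucket_of` helper, and combining the pair by concatenation are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Distance buckets from the paper: [0], [1], [2], [3-4], [5-20], [21-200]
BUCKETS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]

def bucket_of(dt: int) -> int:
    """Map a temporal distance (in frames) to its class index."""
    for k, (lo, hi) in enumerate(BUCKETS):
        if lo <= dt <= hi:
            return k
    raise ValueError(f"distance {dt} outside all buckets")

class TDC(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Placeholder visual encoder phi: 84x84 grayscale frame -> embedding.
        self.phi = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )
        # Classifier over the 6 distance buckets, applied to the pair embedding.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, len(BUCKETS)),
        )

    def forward(self, frame_a, frame_b):
        za, zb = self.phi(frame_a), self.phi(frame_b)
        return self.head(torch.cat([za, zb], dim=-1))

# Toy training step on random frame pairs with known temporal distances.
model = TDC()
a, b = torch.randn(8, 1, 84, 84), torch.randn(8, 1, 84, 84)
labels = torch.tensor([bucket_of(d) for d in [0, 1, 2, 4, 7, 30, 150, 200]])
loss = F.cross_entropy(model(a, b), labels)
loss.backward()
```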
Sounds reflect salient events in the game, so correlating visual inputs with audio should let the network learn about important occurrences.
The surrogate task here is again temporal distance prediction, but this time between visual and audio inputs (cross-modal temporal distance classification, CMC); a sketch follows.
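A sketch of this cross-modal variant under the same assumptions as above; the audio-spectrogram encoder, its input shape, and the number of distance buckets (`n_buckets`) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CMC(nn.Module):
    """Classify the temporal distance between a frame and an audio snippet."""
    def __init__(self, embed_dim=128, n_buckets=2):
        super().__init__()
        # Placeholder encoders; the paper's architectures differ.
        self.phi_v = nn.Sequential(  # visual encoder over 84x84 frames
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        self.phi_a = nn.Sequential(  # encoder over a [B, 1, F, T] spectrogram
            nn.Conv2d(1, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        self.head = nn.Linear(2 * embed_dim, n_buckets)

    def forward(self, frame, spec):
        z = torch.cat([self.phi_v(frame), self.phi_a(spec)], dim=-1)
        return self.head(z)  # logits over audio-visual distance buckets

logits = CMC()(torch.randn(4, 1, 84, 84), torch.randn(4, 1, 64, 64))
```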
Final loss: a weighted combination of the two original objectives (TDC and CMC)
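As a one-function sketch, where the weight `lam` is a hyperparameter whose value here is assumed:

```python
import torch.nn.functional as F

def combined_loss(tdc_logits, tdc_labels, cmc_logits, cmc_labels, lam=1.0):
    # Weighted sum of the two cross-entropy objectives.
    return (F.cross_entropy(tdc_logits, tdc_labels)
            + lam * F.cross_entropy(cmc_logits, cmc_labels))
```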
Cycle-consistency is used for model selection, favoring models with high temporal alignment across videos. Starting from a frame in one video, a nearest-neighbour query in embedding space selects the closest frame in another video; querying back from that frame should return to within one time step of the original frame.
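A sketch of this check, assuming the frame embeddings of both videos are precomputed as `[T, D]` tensors and that Euclidean distance is the nearest-neighbour metric:

```python
import torch

def cycle_consistent(emb1: torch.Tensor, emb2: torch.Tensor, t: int) -> bool:
    """emb1, emb2: [T, D] frame embeddings of two videos; t: index in video 1."""
    v = emb1[t]
    j = torch.cdist(v[None], emb2).argmin().item()       # nearest frame in video 2
    w = emb2[j]
    t_back = torch.cdist(w[None], emb1).argmin().item()  # map back to video 1
    return abs(t_back - t) <= 1                          # within one time step

def cycle_consistency_rate(emb1: torch.Tensor, emb2: torch.Tensor) -> float:
    """Fraction of video-1 frames whose cycle returns close to the start."""
    n = emb1.shape[0]
    return sum(cycle_consistent(emb1, emb2, t) for t in range(n)) / n
```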
They use this embedding to provide dense supervision (an imitation reward) for a reinforcement learning agent
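A sketch of one way such dense supervision could look: checkpoints taken every `every_n` frames along an embedded demonstration pay a fixed bonus when the agent's embedded observation gets close enough to the next one. The values of `every_n`, `thresh`, and `bonus`, the distance metric, and the closure-based bookkeeping are illustrative assumptions:

```python
import torch

def make_checkpoint_reward(demo_emb: torch.Tensor,
                           every_n: int = 16,
                           thresh: float = 0.5,
                           bonus: float = 0.5):
    """demo_emb: [T, D] embeddings of one demonstration video."""
    checkpoints = demo_emb[::every_n]   # [K, D] checkpoint embeddings
    state = {"next": 0}                 # index of the next unmatched checkpoint

    def reward(agent_emb: torch.Tensor) -> float:
        k = state["next"]
        if k >= len(checkpoints):
            return 0.0
        # Pay the bonus once per checkpoint, in order (Euclidean distance assumed).
        if torch.dist(agent_emb, checkpoints[k]) < thresh:
            state["next"] += 1
            return bonus
        return 0.0

    return reward
```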