[Arxiv 1805.11592] Playing hard exploration games by watching YouTube [PDF] [notes]
Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas
read 2019/07/03
Goal: learn, via self-supervised learning, embeddings that are robust to visual variation, so they can serve downstream reinforcement learning objectives
Domain shift exists between arcade video games and their video recordings at the pixel level (color shift, resizing, game/display version, ...). The learned embedding should therefore be robust to changes in pixel space, while still capturing something about the underlying world model (/rules)
Surrogate task (temporal distance classification, TDC): predict the temporal distance between two frames of the same video, which forces the model to learn how the "world" evolves over time
This objective is framed as a classification task over frame-distance buckets [0], [1], [2], [3-4], [5-20], [21-200]. The embedding and classification networks are trained jointly with a cross-entropy loss (see the sketch below).
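A minimal PyTorch sketch of this objective, assuming a toy conv encoder over 84x84 grayscale frames; the architecture, the `bucket_of` helper, and combining the pair by concatenation are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Distance buckets from the paper: [0], [1], [2], [3-4], [5-20], [21-200]
BUCKETS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]

def bucket_of(dt: int) -> int:
    """Map a temporal distance (in frames) to its class index."""
    for k, (lo, hi) in enumerate(BUCKETS):
        if lo <= dt <= hi:
            return k
    raise ValueError(f"distance {dt} outside all buckets")

class TDC(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Placeholder visual encoder phi: 84x84 grayscale frame -> embedding.
        self.phi = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim),
        )
        # Classifier over the 6 distance buckets, applied to the pair embedding.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, len(BUCKETS)),
        )

    def forward(self, frame_a, frame_b):
        za, zb = self.phi(frame_a), self.phi(frame_b)
        return self.head(torch.cat([za, zb], dim=-1))

# Toy training step on random frame pairs with known temporal distances.
model = TDC()
a, b = torch.randn(8, 1, 84, 84), torch.randn(8, 1, 84, 84)
labels = torch.tensor([bucket_of(d) for d in [0, 1, 2, 4, 7, 30, 150, 200]])
loss = F.cross_entropy(model(a, b), labels)
loss.backward()
```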
Sounds reflect salient events in the game, so correlating visual inputs with audio should let the network learn about important occurrences.
The surrogate task here is again temporal distance prediction, but this time between visual and audio inputs (cross-modal temporal distance classification, CMC); a sketch follows.
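A sketch of this cross-modal variant under the same assumptions as above; the audio-spectrogram encoder, its input shape, and the number of distance buckets (`n_buckets`) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class CMC(nn.Module):
    """Classify the temporal distance between a frame and an audio snippet."""
    def __init__(self, embed_dim=128, n_buckets=2):
        super().__init__()
        # Placeholder encoders; the paper's architectures differ.
        self.phi_v = nn.Sequential(  # visual encoder over 84x84 frames
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        self.phi_a = nn.Sequential(  # encoder over a [B, 1, F, T] spectrogram
            nn.Conv2d(1, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )
        self.head = nn.Linear(2 * embed_dim, n_buckets)

    def forward(self, frame, spec):
        z = torch.cat([self.phi_v(frame), self.phi_a(spec)], dim=-1)
        return self.head(z)  # logits over audio-visual distance buckets

logits = CMC()(torch.randn(4, 1, 84, 84), torch.randn(4, 1, 64, 64))
```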
Final loss: a weighted combination of the two original objectives (TDC and CMC)
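As a one-function sketch, where the weight `lam` is a hyperparameter whose value here is assumed:

```python
import torch.nn.functional as F

def combined_loss(tdc_logits, tdc_labels, cmc_logits, cmc_labels, lam=1.0):
    # Weighted sum of the two cross-entropy objectives.
    return (F.cross_entropy(tdc_logits, tdc_labels)
            + lam * F.cross_entropy(cmc_logits, cmc_labels))
```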
Cycle-consistency is used for model selection, favoring models with high temporal alignment across videos. Starting from a frame in one video, a nearest-neighbour query in embedding space selects the closest frame in another video; querying back from that frame should return to within one time step of the original frame.
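A sketch of this check, assuming the frame embeddings of both videos are precomputed as `[T, D]` tensors and that Euclidean distance is the nearest-neighbour metric:

```python
import torch

def cycle_consistent(emb1: torch.Tensor, emb2: torch.Tensor, t: int) -> bool:
    """emb1, emb2: [T, D] frame embeddings of two videos; t: index in video 1."""
    v = emb1[t]
    j = torch.cdist(v[None], emb2).argmin().item()       # nearest frame in video 2
    w = emb2[j]
    t_back = torch.cdist(w[None], emb1).argmin().item()  # map back to video 1
    return abs(t_back - t) <= 1                          # within one time step

def cycle_consistency_rate(emb1: torch.Tensor, emb2: torch.Tensor) -> float:
    """Fraction of video-1 frames whose cycle returns close to the start."""
    n = emb1.shape[0]
    return sum(cycle_consistent(emb1, emb2, t) for t in range(n)) / n
```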
They use this embedding to provide dense supervision (an imitation reward) for a reinforcement learning agent
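A sketch of one way such dense supervision could look: checkpoints taken every `every_n` frames along an embedded demonstration pay a fixed bonus when the agent's embedded observation gets close enough to the next one. The values of `every_n`, `thresh`, and `bonus`, the distance metric, and the closure-based bookkeeping are illustrative assumptions:

```python
import torch

def make_checkpoint_reward(demo_emb: torch.Tensor,
                           every_n: int = 16,
                           thresh: float = 0.5,
                           bonus: float = 0.5):
    """demo_emb: [T, D] embeddings of one demonstration video."""
    checkpoints = demo_emb[::every_n]   # [K, D] checkpoint embeddings
    state = {"next": 0}                 # index of the next unmatched checkpoint

    def reward(agent_emb: torch.Tensor) -> float:
        k = state["next"]
        if k >= len(checkpoints):
            return 0.0
        # Pay the bonus once per checkpoint, in order (Euclidean distance assumed).
        if torch.dist(agent_emb, checkpoints[k]) < thresh:
            state["next"] += 1
            return bonus
        return 0.0

    return reward
```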