Deep Learning of Invariant Features via Simulated Fixations in Video

Authors: Will Y. Zou (wzou@cs.stanford.edu), Shenghuo Zhu (zsh@sv.nec-labs.com), Andrew Y. Ng (ang@cs.stanford.edu), Kai Yu (kyu@sv.nec-labs.com)

Abstract

We apply salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision. With the tracked sequences as input, a hierarchical network of modules learns invariant features using a temporal slowness constraint. The invariances the network encodes become increasingly complex higher in the hierarchy. Although learned from videos, our features are spatial rather than spatio-temporal, and are well suited to extracting features from still images. We applied our features to four datasets (COIL-100, Caltech 101, STL-10, PubFig) and observed a consistent improvement of 4% to 5% in classification accuracy. With this approach, we achieve a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
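The abstract describes two stages: a tracking front end that simulates fixations and smooth pursuit, and a learning objective that penalizes feature change across the tracked frames (temporal slowness). The sketch below is illustrative only, not the authors' released code: OpenCV's corner detector and Lucas-Kanade tracker stand in for the saliency-and-tracking front end, and a single-layer linear-feature slowness penalty stands in for the paper's hierarchical objective. Names such as track_fixations, slowness_loss, and the weight lam are assumptions introduced here.

    import cv2
    import numpy as np

    def track_fixations(gray_frames, max_corners=50):
        """Detect salient points in the first frame and follow them with
        pyramidal Lucas-Kanade optical flow, mimicking smooth pursuit."""
        pts = cv2.goodFeaturesToTrack(gray_frames[0], maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=8)
        tracks = [pts]
        for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
            pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, nxt, pts, None)
            tracks.append(pts)
        return np.stack(tracks)  # (T, N, 1, 2) point trajectories over time

    def slowness_loss(W, seq, lam=1.0):
        """Slowness objective on one tracked patch sequence (sketch).
        W: (n_features, n_pixels) linear features; seq: (T, n_pixels)."""
        z = seq @ W.T                          # feature response per frame
        recon = np.sum((z @ W - seq) ** 2)     # keep features informative
        slow = np.sum(np.abs(z[1:] - z[:-1]))  # adjacent frames -> similar codes
        return recon + lam * slow

Because the slowness penalty is applied only during learning on tracked sequences, only the spatial filters W are needed at test time, which is why the learned features apply directly to still images.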

