This year Deep Vision Consulting was among the lucky ones that made it to the European Conference on Computer Vision (ECCV) in Munich (see https://eccv2018.org). The conference sold out quickly and unfortunately many people couldn’t attend. Our field is witnessing unprecedented and unpredictable growth, and as a company it’s exciting to sit on the edge between research and industry: ECCV 2016 counted 1500 attendees, while this year there were 3400 plus a waiting list of 1000.
As always there was an overwhelming number of outstanding works, and it is never easy to take home the right messages. Below is our take: a list of the topics that trended upwards this year, plus our favourite papers from the conference.
AUTOMATIC NETWORK ARCHITECTURE TUNING
A large number of interesting papers addressed the topic of AutoML, i.e. how can I (meta)learn the network design and not only its parameters? How many layers should I stack, and of what kind? What’s the most appropriate block design for a 50-layer net? Google showed a genetic approach (the DNA code is the network structure) where the exploration of new models was driven by an external regressor, i.e. only the models predicted to be good were actually trained and evaluated; based on the results of the trained models, the external regressor could then improve its accuracy at selecting the truly best candidates. Other methods didn’t select a structure at training time but dynamically routed the information at inference time by switching the paths of active layers within the network. Finally, many works on pruning an already trained network were presented, claiming impressive speed gains at the cost of a small drop in accuracy, if any.
- Coreset-Based Neural Network Compression
- MaskConnect: Connectivity Learning by Gradient Descent
- Progressive Neural Architecture Search
- SkipNet: Learning Dynamic Routing in Convolutional Networks
- Quantization Mimic: Towards Very Tiny CNN for Object Detection
- ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
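To make the predictor-guided search described in this section a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the 4-gene search space, the `train_and_evaluate` stand-in (a real system would train a network here) and the nearest-neighbour `predict` function, which merely plays the role of the external regressor and is not the predictor used in any of the papers above.

```python
import random

# Hypothetical search space: an architecture is a tuple of 4 genes in 0..4
# (think "which block type to use at each stage").
GENES, CHOICES = 4, 5
HIDDEN_BEST = (3, 1, 4, 2)  # unknown to the search; used only by the stand-in scorer


def train_and_evaluate(arch):
    """Stand-in for an expensive training run: higher is better, maximum is 0."""
    return -sum((a - b) ** 2 for a, b in zip(arch, HIDDEN_BEST))


def predict(arch, history, k=3):
    """Surrogate regressor: mean score of the k nearest evaluated architectures."""
    def distance(other):
        return sum(abs(a - b) for a, b in zip(arch, other))

    nearest = sorted(history, key=distance)[:k]
    return sum(history[n] for n in nearest) / len(nearest)


def mutate(arch, rng):
    """Re-draw one randomly chosen gene."""
    genes = list(arch)
    genes[rng.randrange(GENES)] = rng.randrange(CHOICES)
    return tuple(genes)


def search(rounds=15, candidates=30, budget=3, seed=0):
    rng = random.Random(seed)
    # Bootstrap the predictor with a few random, fully evaluated models.
    history = {}
    while len(history) < 5:
        arch = tuple(rng.randrange(CHOICES) for _ in range(GENES))
        history[arch] = train_and_evaluate(arch)
    for _ in range(rounds):
        parents = sorted(history, key=history.get, reverse=True)[:5]
        pool = {mutate(rng.choice(parents), rng) for _ in range(candidates)}
        pool.difference_update(history)  # skip architectures already trained
        # Only the candidates the predictor likes best are actually trained.
        ranked = sorted(pool, key=lambda a: predict(a, history), reverse=True)
        for arch in ranked[:budget]:
            history[arch] = train_and_evaluate(arch)  # new data for the predictor
    best = max(history, key=history.get)
    return best, history[best], len(history)


best_arch, best_score, n_trained = search()
```

The point of the design is the budget: each round proposes 30 candidates but trains at most 3 of them, so the search trains at most 50 models out of the 625 possible ones, and every trained model is added to `history`, which is the data the predictor relies on.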
GEOMETRY AND DEEP LEARNING
Deep learning finally got into the geometry business too. There were works on learning to SLAM, but the majority predicted the 6-DOF pose (3 for position + 3 for rotation) of an object with respect to the camera. While the problem has typically been approached through descriptor matching, alignment with outlier robustness and so on, this was the year of differentiable rendering. If you have a differentiable renderer, then you can let the network predict the camera parameters and the position and rotation of the object, render the object from those predicted values and then compare the rendering with your input image; because the renderer is differentiable, the comparison error can be backpropagated to the predictions. If you don’t have an exact model of the object (say, a car), you can still work on the silhouette and drop the texture. Also worth mentioning is the increased number of papers on deep learning models that work directly on point clouds. Previous approaches to 3D parsing were based either on voxelization or on multiple-view images, both somewhat inefficient representations. By working on raw 3D points instead, you can really get down to the semantics of the represented 3D object and build faster and more effective models.
- Fully-Convolutional Point Networks for Large-Scale Point Clouds
- SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters
- Deep Fundamental Matrix Estimation
- Deep Virtual Stereo Odometry
- CubeNet: Equivariance to 3D Rotation and Translation
- Learning SO(3) Equivariant Representations with Spherical CNNs
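The render-and-compare idea behind differentiable rendering can be sketched with a deliberately tiny toy. We assume a 1-D "image" and a "renderer" that draws a soft blob (a silhouette, no texture) at a single position parameter; real systems render 3D meshes and recover full 6-DOF poses. The names `render`, `loss_and_grad` and `fit_position` are ours, and the gradient is written out by hand where a deep learning framework would compute it automatically.

```python
import math

WIDTH, SIGMA = 32, 3.0  # toy image size and blob softness (both arbitrary)


def render(x0):
    """Toy differentiable 'renderer': a soft 1-D silhouette centred at x0."""
    return [math.exp(-((i - x0) ** 2) / (2 * SIGMA**2)) for i in range(WIDTH)]


def loss_and_grad(x0, target):
    """Squared image difference and its analytic derivative w.r.t. x0."""
    image = render(x0)
    loss = sum((r - t) ** 2 for r, t in zip(image, target))
    # d(render_i)/d(x0) = render_i * (i - x0) / SIGMA^2, chained with 2 * (r - t)
    grad = sum(2 * (r - t) * r * (i - x0) / SIGMA**2
               for i, (r, t) in enumerate(zip(image, target)))
    return loss, grad


def fit_position(target, x0, lr=0.5, steps=200):
    """Gradient descent on the position parameter, through the renderer."""
    for _ in range(steps):
        _, grad = loss_and_grad(x0, target)
        x0 -= lr * grad
    return x0


observed = render(20.0)                      # the "input image", true position 20
estimate = fit_position(observed, x0=14.0)   # start from a wrong guess
```

Starting from a guess of 14, the estimate moves towards the true position 20 using nothing but the difference between the rendered and the observed image, which is exactly the supervision signal described above.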
Semantic segmentation was everywhere, from streets to dresses to natural images. It just seems natural to move beyond the bounding box representation of objects, and people are really pushing in that direction. The list is so long that we won’t even try to cite just a few. Unlike a few years ago, when semantic segmentation was a task per se, today we see it more and more coupled with other tasks, and that totally makes sense. If you have a mask, you can extract a better visual representation of your object for classification, matching, pose estimation and so on. And if you have a better representation, you eventually make the other task easier to solve.
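One simple way to see why a mask gives a better representation than a box is masked average pooling: average the feature vectors over the object pixels only, instead of over the whole crop. The sketch below uses plain nested lists and made-up feature values; `masked_average_pool` is a hypothetical helper of ours, not code from any of the papers.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors of the pixels where mask == 1.

    features: H x W grid of C-dimensional feature vectors (nested lists).
    mask:     H x W grid of 0/1 values, e.g. from a segmentation model.
    """
    selected = [vec
                for feat_row, mask_row in zip(features, mask)
                for vec, keep in zip(feat_row, mask_row) if keep]
    if not selected:
        raise ValueError("mask selects no pixels")
    channels = len(selected[0])
    return [sum(vec[c] for vec in selected) / len(selected) for c in range(channels)]


# Made-up 2x2 feature map: top row is background, bottom row is the object.
features = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 4.0], [0.0, 2.0]]]
box_mask = [[1, 1], [1, 1]]      # bounding box: object plus background
object_mask = [[0, 0], [1, 1]]   # segmentation mask: object pixels only

box_descriptor = masked_average_pool(features, box_mask)        # [0.5, 1.5]
object_descriptor = masked_average_pool(features, object_mask)  # [0.0, 3.0]
```

The box descriptor mixes in the background channel, while the masked one describes the object alone.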
CLOSING THE REALITY GAP
The reality gap made its way back this year: how do you go from synthetic data (training) to the real world (testing)? How can you align the two obviously different distributions? In constrained environments, papers successfully applied GANs (e.g. pix2pixHD) with cyclic embedding mapping, as in people re-identification, where you have a tight bounding box around the person. But leaving that line of work aside, compared to ICCV 2017, where we saw a lot of fancy ideas, this year the problem was taken more seriously from an engineering perspective, reasoning about first and second order image statistics, Fourier spectrum, absolute values and so on. Did we solve it yet? No.
- Modeling Camera Effects to Improve Visual Learning from Synthetic Data
- DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation
- Effective Use of Synthetic Data for Urban Scene Semantic Segmentation
- Domain Adaptation through Synthesis for Unsupervised Person Re-Identification
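As a taste of the engineering flavour mentioned in this section, here is a minimal sketch of statistics alignment. We assume the simplest possible reading, matching the mean and standard deviation of synthetic pixel intensities to those measured on real images; the papers above do considerably more than this, and `match_mean_std` as well as the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def match_mean_std(synthetic, real):
    """Shift and scale synthetic values so their mean and std match the real ones."""
    mu_s, sd_s = fmean(synthetic), pstdev(synthetic)
    mu_r, sd_r = fmean(real), pstdev(real)
    scale = sd_r / sd_s if sd_s > 0 else 1.0  # constant input: only shift the mean
    return [(x - mu_s) * scale + mu_r for x in synthetic]


# Made-up intensities for one channel: bright, flat renders vs. darker real photos.
synthetic_pixels = [0.9, 0.8, 1.0, 0.7, 0.6]
real_pixels = [0.30, 0.50, 0.40, 0.45, 0.35, 0.20]

aligned = match_mean_std(synthetic_pixels, real_pixels)
```

In practice this would be applied per channel over whole images, and it only removes the crudest part of the gap, which is consistent with the "No" above.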
HUMAN PARSING AND GENERATION AT LARGE
Human parsing was probably the most tackled topic at this conference. Joint position estimation (aka pose estimation), either on the image plane or in a 3D bounding box positioned around the person, dense pose estimation (mapping texture from the image to a 3D body model), limb segmentation and action recognition have all made significant steps forward. We could almost say we now have the technology to solve these tasks. Even when trained on separate datasets, these models are incredibly good at generalizing and work almost out of the box in many cases. One of the most awaited works of ECCV also falls into this category: “Everybody Dance Now” from A. Efros’s group at Berkeley. How does it work? You feed the model two videos, each showing a person: the first one acts as a body pose controller for the person in the second video. You can record a video of yourself and let the network change your dance steps into the ones performed by Michael Jackson. Needless to say, a GAN architecture was the key to producing realistic-looking output.
- Inner Space Preserving Generative Pose Machine
- BodyNet: Volumetric Inference of 3D Human Body Shapes
- Integral Human Pose Regression
- SwapNet: Image-based Garment Transfer
- Towards Learning a Realistic Rendering of Human Behavior
- Recycle-GAN: Unsupervised Video Retargeting
- Everybody Dance Now
SELF SUPERVISION FOR FEATURE REPRESENTATION
When you don’t have enough labeled data to train a model from scratch, but you have tons of unlabeled data (e.g. YouTube videos), a good idea is to learn the representation part of your network in a self-supervised way and then train the classification part at the end with much less data. It’s a little bit like pre-training without labels. Specifically, self-supervision means devising a task for which the label can be easily and automatically extracted from the original data with almost no ambiguity. An old but good example is colorization: you take an image, turn it to grayscale and train a network to predict the original colors. In learning to colorize the image, the network must also learn a good representation of objects, as it should never predict a red elephant or a blue face (probably not gonna work with the Blue Man Group). Interesting approaches to self-supervision this year came from audio-video disentanglement, video colorization and object tracking, and future frame prediction.
- Learning to Separate Object Sounds by Watching Unlabeled Videos
- The Sound of Pixels
- Objects that Sound
- Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
- Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
- Deep Clustering for Unsupervised Learning of Visual Features
- Fighting Fake News: Image Splice Detection via Learned Self-Consistency
- Tracking Emerges by Colorizing Videos
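The colorization example from this section shows nicely how the labels come for free. Below is a minimal sketch of the data side only: every unlabeled RGB image yields an (input, target) pair with no human annotation. Images are assumed to be nested lists of (r, g, b) tuples, the helper names are ours, and the grayscale conversion uses the standard Rec. 601 luma weights; the network that learns to map one to the other is left out.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) pixels to grayscale with Rec. 601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def colorization_pair(rgb_image):
    """Self-supervised sample: grayscale input, original colors as the target."""
    return to_gray(rgb_image), rgb_image


def build_dataset(unlabeled_images):
    """Turn a pile of unlabeled images into supervised training pairs."""
    return [colorization_pair(image) for image in unlabeled_images]


# Two made-up 1x2 "images" standing in for frames grabbed from unlabeled videos.
frames = [
    [[(255, 255, 255), (0, 0, 0)]],
    [[(255, 0, 0), (0, 0, 255)]],
]
dataset = build_dataset(frames)
gray_input, color_target = dataset[0]
```

The same pattern applies to the other pretext tasks mentioned above: the "label" (the audio track, the next frame, the colors of a reference frame) is already sitting in the raw data.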