DVC @ ECCV 2018 in Munich – Deep Vision Consulting

This year Deep Vision Consulting was among the lucky ones that made it to the European Conference on Computer Vision (ECCV) in Munich (see https://eccv2018.org). The conference sold out quickly and unfortunately many people couldn’t attend. Our field is witnessing unprecedented and unpredictable growth, and as a company it’s exciting to be at the edge between research and industry (ECCV 2016 counted 1500 attendees; this year there were 3400, plus a waiting list of 1000).
As always there was an overwhelming number of outstanding works, and it is never easy to take home the right messages. Below is our list of the topics that trended upwards this year, along with our favourite papers from the conference.

AUTOMATIC NETWORK ARCHITECTURE TUNING

A large number of interesting papers addressed auto ML, i.e. how can I (meta)learn the network design and not only its parameters? How many layers should I stack, and of what kind? What’s the most appropriate block design for a 50-layer net? Google showed a genetic approach (the DNA code is the network structure) where the exploration of new models was driven by an external regressor: only models predicted to be good were actually trained and evaluated, and the results of those trained models were fed back to the regressor, improving its ability to pick the truly best candidates. Other methods didn’t fix a structure at training time but dynamically routed the information at inference time by switching which layers are active within the network. Finally, many works on pruning an already trained network were presented, claiming impressive speed gains at the cost of a small performance drop, if any.
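To make the "train only what the regressor likes" loop concrete, here is a toy sketch in plain Python. It is not the method of any paper listed in this section: the search space (`WIDTHS`, `DEPTH`), the scoring function `train_and_evaluate` and the nearest-neighbour surrogate `predict` are all hypothetical stand-ins we made up for illustration.

```python
import random

# Hypothetical search space: an architecture is a tuple of layer widths.
WIDTHS = (16, 32, 64, 128)
DEPTH = 4

def train_and_evaluate(arch):
    """Stand-in for the expensive step (training + validation).
    Toy score that peaks at 0 when every layer has width 64."""
    return -sum(abs(w - 64) for w in arch) / 64.0

def mutate(arch, rng):
    """Change one randomly chosen layer to a randomly chosen width."""
    i = rng.randrange(len(arch))
    child = list(arch)
    child[i] = rng.choice(WIDTHS)
    return tuple(child)

def predict(arch, history):
    """Surrogate regressor: score of the closest architecture trained so far."""
    def distance(other):
        return sum(abs(a - b) for a, b in zip(arch, other))
    nearest = min(history, key=distance)
    return history[nearest]

def search(rounds=5, candidates=20, budget=3, seed=0):
    rng = random.Random(seed)
    start = tuple(rng.choice(WIDTHS) for _ in range(DEPTH))
    history = {start: train_and_evaluate(start)}  # arch -> measured score
    trained = 1
    for _ in range(rounds):
        # Mutate the best architectures measured so far.
        parents = sorted(history, key=history.get, reverse=True)[:budget]
        pool = {mutate(rng.choice(parents), rng) for _ in range(candidates)}
        pool -= set(history)  # never retrain a known architecture
        # Only the candidates the surrogate ranks highest get trained.
        ranked = sorted(pool, key=lambda a: predict(a, history), reverse=True)
        for arch in ranked[:budget]:
            history[arch] = train_and_evaluate(arch)  # also improves the surrogate
            trained += 1
    best = max(history, key=history.get)
    return best, history[best], trained
```

Calling `search()` returns the best architecture found, its score and the number of models that were actually trained, which is at most `1 + rounds * budget` even though `rounds * candidates` mutations were proposed. A real system would replace the nearest-neighbour lookup with a learned regressor.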

Coreset-Based Neural Network Compression
MaskConnect: Connectivity Learning by Gradient Descent
Progressive Neural Architecture Search
SkipNet: Learning Dynamic Routing in Convolutional Networks
Quantization Mimic: Towards Very Tiny CNN for Object Detection
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design

GEOMETRY AND DEEP LEARNING

Deep learning finally got into the geometry business too. There were works on learning to SLAM, but the majority predicted the 6-DOF pose (3 for position + 3 for rotation) of an object with respect to the camera. While the problem has typically been approached through descriptor matching, alignment with outlier robustness and so on, this was the year of differentiable rendering. If you have a differentiable renderer, then you can let the network predict the camera parameters and the position and rotation of the object, render the object from those predicted values and compare the rendering with your input image. If you don’t have an exact model of the object (let’s say a car), you can still work on the silhouette and drop the texture. Also worth mentioning is the increased number of papers on deep learning models that work directly on point clouds. Previous approaches to 3D parsing were based either on voxelization or on multiple view images, both somewhat inefficient representations. By working on raw 3D points instead, you can really get down to the semantics of the represented 3D object and build faster and more effective models.
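The render-and-compare idea can be shown with a deliberately tiny example. The sketch below assumes a 1D "image" and a hypothetical renderer that draws a soft Gaussian silhouette at position `x`; the names `render`, `loss_and_grad` and `fit_position` are ours, and the derivative is written by hand where a real system would rely on automatic differentiation through a 3D renderer.

```python
import math

WIDTH = 20    # number of pixels in the toy 1D "image"
SIGMA = 2.0   # blur of the rendered blob

def render(x):
    """Soft 1D silhouette: a Gaussian blob centred at position x."""
    return [math.exp(-((i - x) ** 2) / (2 * SIGMA ** 2)) for i in range(WIDTH)]

def loss_and_grad(x, target):
    """Squared image difference and its analytic derivative w.r.t. x."""
    image = render(x)
    loss = sum((r - t) ** 2 for r, t in zip(image, target))
    # d(render_i)/dx = render_i * (i - x) / SIGMA^2, chained through the loss.
    grad = sum(2 * (r - t) * r * (i - x) / SIGMA ** 2
               for i, (r, t) in enumerate(zip(image, target)))
    return loss, grad

def fit_position(target, x0, lr=0.5, steps=200):
    """Gradient descent on the position so the rendering matches the target."""
    x = x0
    for _ in range(steps):
        _, grad = loss_and_grad(x, target)
        x -= lr * grad
    return x
```

For example, with `target = render(12.0)` as the observed image, `fit_position(target, x0=8.0)` moves the initial guess towards the true position using nothing but the pixel-wise difference between rendering and input. The blur matters: a hard-edged silhouette would give no useful gradient.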

Fully-Convolutional Point Networks for Large-Scale Point Clouds
SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters
Deep Fundamental Matrix Estimation
Deep Virtual Stereo Odometry
CubeNet: Equivariance to 3D Rotation and Translation
Learning SO(3) Equivariant Representations with Spherical CNNs

SEMANTIC SEGMENTATION

Semantic segmentation was everywhere, from streets to dresses to natural images. It just seems natural to move beyond the bounding box representation of objects, and people are really pushing in that direction. The list is so long that we won’t even try to cite just a few. Unlike a few years ago, when semantic segmentation was a task per se, today we see it more and more coupled with other tasks, and that totally makes sense. If you have a mask, you can extract a better visual representation of your object for classification, matching, pose estimation and so on. And if you have a better representation, you make the other task easier to solve.
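One simple way a mask can improve a representation is to pool features over object pixels only, so the background inside the bounding box does not pollute the descriptor. The function below is a minimal sketch of that idea; `masked_average` and its nested-list input format are assumptions for illustration, not taken from any specific paper.

```python
def masked_average(features, mask):
    """features: H x W grid of feature vectors (lists of floats).
    mask: H x W grid of 0/1 flags marking object pixels.
    Returns the mean feature vector over masked pixels only."""
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for k in range(dim):
                    total[k] += vec[k]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]
```

The resulting vector can then be passed to whatever classification, matching or pose estimation head follows.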

CLOSING THE REALITY GAP

The reality gap made its way back this year: how do you go from synthetic data (training) to the real world (testing)? How can you align the two obviously different distributions? In constrained environments, papers successfully applied GANs (e.g. pix2pixHD) with cyclic embedding mapping, as in people re-identification, where you have a tight bounding box around the person. But leaving that work aside, compared with ICCV 2017, where we saw a lot of fancy ideas, this year the problem was taken more seriously from an engineering perspective, reasoning about first and second order image statistics, Fourier spectrum, absolute values and so on. Did we solve it yet? No.
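As a flavour of what reasoning about image statistics can mean in practice, here is a minimal sketch that aligns the mean and the standard deviation of synthetic pixel values with those of real ones. We are assuming that "first and second order" refers to these two quantities; `match_statistics` is a hypothetical helper and none of the papers below is claimed to work exactly this way.

```python
import statistics

def match_statistics(synthetic, real):
    """Shift and rescale synthetic pixel values so that their mean and
    standard deviation equal those of the real pixel values."""
    mu_s, mu_r = statistics.fmean(synthetic), statistics.fmean(real)
    sd_s, sd_r = statistics.pstdev(synthetic), statistics.pstdev(real)
    # A constant synthetic image has no spread to rescale.
    scale = sd_r / sd_s if sd_s > 0 else 1.0
    return [(v - mu_s) * scale + mu_r for v in synthetic]
```

In practice this would be applied per channel, and it only removes the most superficial part of the gap, which is consistent with the "No" above.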

Modeling Camera Effects to Improve Visual Learning from Synthetic Data
DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation
Effective Use of Synthetic Data for Urban Scene Semantic Segmentation
Domain Adaptation through Synthesis for Unsupervised Person Re-Identification

HUMAN PARSING AND GENERATION AT LARGE

Human parsing was probably the most tackled topic at this conference. Joint position estimation (aka pose estimation), either on the image plane or in a 3D bounding box positioned around the person, dense pose estimation (mapping texture from the image to a 3D body model), limb segmentation and action recognition have all made significant steps forward. We could almost say we now have the technology to solve these tasks. Even when trained on separate datasets, these models are incredibly good at generalization and work almost out of the box in many cases. One of the most awaited works of ECCV also falls into this category: “Everybody Dance Now” from A. Efros’s group at Berkeley. How does it work? You feed the model two videos, each showing a person: the first one acts as a body pose controller for the person in the second video. You can record a video of yourself and let the network change your dance steps into the ones performed by Michael Jackson. Needless to say, a GAN architecture was the key to producing realistic output.

Inner Space Preserving Generative Pose Machine
BodyNet: Volumetric Inference of 3D Human Body Shapes
Integral Human Pose Regression
SwapNet: Image-based Garment Transfer
Towards Learning a Realistic Rendering of Human Behavior
Recycle-GAN: Unsupervised Video Retargeting
Everybody Dance Now

SELF SUPERVISION FOR FEATURE REPRESENTATION

When you don’t have enough labeled data to train a model from scratch, but you have tons of unlabeled data (e.g. YouTube videos), a good idea is to learn the representation part of your network in a self supervised way and then train the classification part at the end with much less data. It’s a little bit like pre-training without labels. Specifically, self supervision means devising a task for which the label can be extracted easily and automatically from the original data with almost no ambiguity. An old but good example is colorization: you take an image, turn it to grayscale and train a network to predict the original colors. In learning to colorize the image, the network must also learn a good representation of objects, as it should never predict a red elephant or a blue face (probably not gonna work with the Blue Man Group). Interesting approaches to self supervision this year came from video audio disentanglement, video colorization and object tracking, and future frame prediction.
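The colorization example shows how cheap the labels are: the training pair comes straight from the unlabeled image. The sketch below assumes images stored as rows of `(r, g, b)` tuples with values in [0, 1] and uses the common BT.601 luma weights for the grayscale conversion; the function names are ours.

```python
def to_gray(pixel):
    """Luminance of an (r, g, b) pixel using ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def make_colorization_pair(image):
    """image: rows of (r, g, b) tuples in [0, 1].
    Returns (network input, training label) with no human annotation."""
    gray = [[to_gray(p) for p in row] for row in image]
    return gray, image

def make_dataset(images):
    """Turn a pile of unlabeled images into supervised training pairs."""
    return [make_colorization_pair(img) for img in images]
```

The network trained on these pairs is then stripped of its colorization head and its representation layers are reused for the task that really matters.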

Learning to Separate Object Sounds by Watching Unlabeled Videos
The Sound of Pixels
Objects that Sound
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Deep Clustering for Unsupervised Learning of Visual Features
Fighting Fake News: Image Splice Detection via Learned Self-Consistency
Tracking Emerges by Colorizing Videos