By: Assaf Shocher, Weizmann Institute 

Hi everyone,
I was asked to highlight some trends and papers from CVPR 2018, which just ended. Here are some insights I can share, all from my point of view and according to my interests and opinions.
I hope you find it useful.
(This is being sent to the entire vision lab, the Advanced Topics in Computer Vision and Deep Learning mailing list, and some others.)

1. Biggest trend: Ways to use less labeled data:
The biggest trend in this year’s conference, as I see it, is the search for ways to practice deep learning without having to train on huge labeled datasets.
The most common approach was using a learning regime located somewhere along the continuum between “supervised” and “unsupervised” learning.
There are 55 papers with “something-supervised” (or similar) in their title, of which 30 are weakly-supervised, 5 semi-supervised, 9 self-supervised, 2 webly-supervised (one of them is interesting), 1 omni-supervised (an interesting approach by Kaiming He’s group at FAIR), and some others (with some overlap). Another 30 papers have “unsupervised” in their title.
A different approach was to generate training data or to transform training data from one domain to another. A common example is rendering graphics and then translating them to real-world scenes. 23 papers had “domain adaptation” (or similar) in their title. Naturally, there is also the approach of transfer learning: 10 papers have “transfer learning” or “knowledge transfer” in their title, among them this year’s best paper award winner, “Taskonomy: Disentangling Task Transfer Learning”. Lastly for this category, there is also one paper with the approach of “Deep Internal Learning” (shameless plug). A similar approach can also be found in the Deep Image Prior paper.
Altogether, 119 papers try to escape from traditional supervised learning on huge labeled datasets, which makes up 12% of all accepted papers! On the other hand, one should also be aware of this paper by Ian Goodfellow’s group at Google Brain (not a CVPR paper), which argues that semi-supervised learning fails to address many issues that these algorithms would face in real-world applications.

2. This year’s improved architectures:
These are some papers, elegant to my taste, that address classical challenges with clever improvements to the network architecture.

  • The first example, Decorrelated Batch Normalization, improves batch normalization by whitening its output so that there is no correlation between the elements. It also shows, through an interesting analysis, why PCA whitening fails at this task (there is a small whitening sketch right after this list).
  • Another great example, Learning Steerable Filters for Rotation Equivariant CNNs, tries to create rotation equivariance by having different rotated versions of the same filters. The most common approaches use data-driven rotation invariance, which basically means that there is a lot of data covering all possible rotations; this fails for missing instances and, moreover, deprives the network of a valuable understanding of pose. Just as conv layers are translation equivariant (they keep track of the location of objects), this work achieves equivariance under rotations.
  • A nice mini-trend: non-locality in convnets, for exploiting self-correlations and using information beyond the receptive field (a minimal self-attention sketch follows this list). Some really great papers are part of this:
    Nonlocal Neural Networks (Wang, Girshick, Gupta, He) (FAIR+CMU)
    Squeeze-and-Excitation Networks, the winner of the ILSVRC 2017 challenge (ImageNet classification)
    Image Transformer, although not a CVPR paper, is worth noticing too; it uses a mechanism called self-attention, originally developed for text.
  • For those interested in efficiency, CondenseNet: An Efficient DenseNet using Learned Group Convolutions is a clever implementation of DenseNet that can be applied at test time using efficient group convolutions and can be used on mobile devices.
  • A paper with an idea I really liked is Interpretable Convolutional Neural Networks, which forces filters to create feature maps that are well localized, so that eventually each filter can be interpreted by the area in the image that activates it.
  • Decoupled Networks is not yet a practical tool to use, but a very intriguing approach that splits the simple dot product into its algebraic-geometric interpretation (the product of the norms with the cosine of the angle between the vectors) and argues that, intuitively, each element has a different meaning: the angle accounts for semantic/label difference and the feature norm accounts for intra-class variation (see the short numeric sketch after this list).
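To make the whitening idea behind Decorrelated Batch Normalization concrete, here is a minimal numpy sketch of ZCA-whitening a mini-batch of features. It is my own illustration under simplified assumptions (a plain (batch, features) layout, no learned scale/shift, no running statistics), not the authors' code; the function name and the eps value are made up for the example.

```python
# Minimal sketch of ZCA whitening of a mini-batch (illustration only).
import numpy as np

def zca_whiten_batch(x, eps=1e-5):
    """x: (batch, features). Returns a batch whose covariance is ~identity."""
    x_centered = x - x.mean(axis=0, keepdims=True)
    cov = x_centered.T @ x_centered / x.shape[0]           # (features, features)
    eigvals, eigvecs = np.linalg.eigh(cov)                  # symmetric eigendecomposition
    # ZCA rotates to the eigenbasis, rescales, and rotates back, so the
    # whitened features stay aligned with the originals (unlike plain PCA).
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return x_centered @ whitening

x = np.random.randn(128, 16) @ np.random.randn(16, 16)      # correlated features
print(np.round(np.cov(zca_whiten_batch(x), rowvar=False), 2))  # close to the identity matrix
```

Standard batch normalization only divides by each feature's standard deviation (the diagonal of this covariance); whitening also removes the cross-correlations.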
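For the non-locality mini-trend, here is a minimal PyTorch sketch of a non-local (self-attention) block: every spatial position aggregates information from all other positions, weighted by feature similarity, so the block is not limited by a local receptive field. Layer names, the softmax pairing, and the channel reduction are my own simplifications, not the paper's reference implementation.

```python
# Minimal non-local (self-attention) block sketch in PyTorch (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)   # "query" embedding
        self.phi = nn.Conv2d(channels, inner, 1)     # "key" embedding
        self.g = nn.Conv2d(channels, inner, 1)       # "value" embedding
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # (b, h*w, inner)
        k = self.phi(x).flatten(2)                             # (b, inner, h*w)
        v = self.g(x).flatten(2).transpose(1, 2)               # (b, h*w, inner)
        attn = F.softmax(q @ k, dim=-1)                        # every position attends to all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)    # back to a feature map
        return x + self.out(y)                                 # residual connection

x = torch.randn(2, 64, 16, 16)
print(NonLocalBlock(64)(x).shape)                              # torch.Size([2, 64, 16, 16])
```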
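Finally, a tiny numeric sketch of the decomposition Decoupled Networks builds on. Everything here is just the standard identity; the learnable h and g mentioned in the comments are my paraphrase of the paper's general idea, not its actual operators.

```python
# The identity Decoupled Networks starts from: a dot product is the product
# of the two norms with the cosine of the angle between the vectors.
import numpy as np

w = np.random.randn(5)                       # e.g. a classifier weight vector
x = np.random.randn(5)                       # e.g. a feature vector

dot = w @ x
norm_product = np.linalg.norm(w) * np.linalg.norm(x)
cos_angle = dot / norm_product

print(np.isclose(dot, norm_product * cos_angle))   # True: w.x = |w| * |x| * cos(theta)
# The paper's idea is to replace this fixed coupling with learnable functions
# h(|w|, |x|) * g(theta), so magnitude and angle can be shaped independently.
```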

3. Another important trend: Image quality assessment
There is much criticism of traditional evaluation for image restoration and image generation tasks. L2, SSIM and others seem not to capture the human perception of quality. Moreover, in recent years there is a growing number of methods that generate novel images, so there is no ground truth to compare to. Assessing the quality of generated images is now a challenge that many try to solve. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric is a paper by a great set of authors that uses distances between learned semantic features to assess quality (a minimal sketch of this idea appears at the end of this section). They also used a human study to evaluate how close their metric is to human perception and got impressive results.
The Perception-Distortion Tradeoff, by Yochai Blau and Tomer Michaeli from the Technion, which got an oral presentation, shows mathematically that distortion and perceptual quality are at odds with each other, which I think is an important fundamental fact about image restoration. Some more nice examples I bumped into: Hallucinated-IQA: No-Reference Image Quality Assessment via Adversarial Learning, Blind Predicting Similar Quality Map for Image Quality Assessment, and PieAPP: Perceptual Image-Error Assessment through Pairwise Preference.
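As an illustration of the "distances between learned features" idea, here is a minimal PyTorch sketch of a perceptual distance computed from intermediate activations of a pretrained VGG-16. This is a simplified version of the concept, not the authors' LPIPS implementation (which additionally learns per-channel weights calibrated on human judgments); the layer indices and normalization details are assumptions made for the example.

```python
# Minimal deep-feature perceptual distance sketch (illustration only, not LPIPS).
import torch
import torch.nn.functional as F
import torchvision.models as models

vgg_features = models.vgg16(pretrained=True).features.eval()

def perceptual_distance(img_a, img_b, layers=(3, 8, 15, 22)):
    """img_a, img_b: (1, 3, H, W) tensors, already normalized for VGG."""
    dist, x, y = 0.0, img_a, img_b
    with torch.no_grad():
        for i, layer in enumerate(vgg_features):
            x, y = layer(x), layer(y)
            if i in layers:
                # unit-normalize each spatial feature across channels,
                # then accumulate the mean squared difference
                dist = dist + (F.normalize(x, dim=1) - F.normalize(y, dim=1)).pow(2).mean()
            if i >= max(layers):
                break
    return dist

a, b = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
print(perceptual_distance(a, b).item())   # small for perceptually similar images, larger otherwise
```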

4. Emerging: Interactive deep networks
A direction that is still not very common, but in my opinion promising, is letting humans take part in the process through smart interactivity. To my taste this is highly practical, and I think it will become a powerful trend in AI. Learning by Asking Questions, by some well-known authors from FAIR and CMU, addresses the VQA challenge by generating its own questions during training, and thereby does something that is more similar to the human way of learning. The interactive segmentation challenge involves cleverly inferring a segmentation from human clicks; here is an impressive approach: Interactive Image Segmentation with Latent Diversity. Image description can be customized for a user by asking them questions to reflect their interests in the image: Customized Image Narrative Generation via Interactive Visual Question Generation and Answering.

5. 10 more great papers (in random order):

  1. Embodied Question Answering is an amazing work by Dhruv Batra’s group (FAIR + Georgia Tech): an agent is spawned at a random location in a 3D environment and asked a question (‘What color is the car?’). In order to answer, the agent must first intelligently navigate to explore the environment, gather the necessary visual information through first-person (egocentric) vision, and then answer the question (‘orange’).
  2. Tell Me Where to Look creates attention maps as an explicit part of the network by comparing the last layer of regular classification to the one produced when classification is applied to a masked version of the image (see the sketch after this list).
  3. Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos is a paper by Fei-Fei Li’s group at Stanford that got an oral presentation. It addresses the challenging task of translating instructional videos into instruction text with temporal context and dependency.
  4. Iterative Visual Reasoning Beyond Convolutions combines conv-nets with graph models for visual reasoning. It is refreshing to see such novel architectures.
  5. Neural Baby Talk by Dhruv Batra’s group (FAIR + Georgia Tech) demonstrates image captioning where text words correspond to objects detected in the image.
  6. SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text does image captioning while controlling style properties of the generated text.
  7. Low-Shot Learning from Imaginary Data makes machine vision systems perform better at low-shot learning by generating additional instances similar to the given examples, inspired by the human way of using imagination to understand.
  8. SPLATNet by Jan Kautz’s group at NVIDIA won an honorable mention at the best papers ceremony. It is a network architecture for processing point clouds in a sparse representation for 3D reconstruction.
  9. Extreme 3D Face Reconstruction: Seeing Through Occlusions, by authors including Tal Hassner and Gerard Medioni, demonstrates 3D face reconstruction that includes fine details and is robust to occlusions.
  10. Pix2PixHD by NVIDIA improves the famous Pix2Pix. The results and applications are breathtaking (Watch the video!).
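Regarding Tell Me Where to Look (item 2 above), here is a rough sketch of how comparing the classification output on the original image with the output on an attention-masked copy can be turned into a training signal: if erasing the attended regions still leaves the class recognizable, the attention map did not cover the whole object. This is my own simplified reading of the one-line description above, not the authors' implementation; `attention_fn` is a hypothetical helper that returns a soft (1, 1, H, W) attention map in [0, 1] (e.g. a Grad-CAM style map), and the unweighted sum of the two losses is an arbitrary choice.

```python
# Sketch of an attention-mining style loss (my reading of the idea, illustration only).
import torch
import torch.nn.functional as F

def attention_mining_loss(model, attention_fn, image, label):
    logits_full = model(image)                         # (1, num_classes)
    attn = attention_fn(model, image, label)           # (1, 1, H, W), hypothetical helper
    masked = image * (1.0 - attn)                      # erase the attended regions
    logits_masked = model(masked)
    # if the masked image is still recognized as the target class,
    # the attention map was too small
    mining_loss = torch.sigmoid(logits_masked[:, label]).mean()
    cls_loss = F.cross_entropy(logits_full, torch.tensor([label]))
    return cls_loss + mining_loss
```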

6. More big trends I haven’t mentioned yet:
Of course, this does not cover everything. There are some more big trends that I did not mention:

  • Deep 3D: I used to think geometry would stay relatively “neural-clean”; I was wrong. 85 papers include “3D” in their title. It is hard to characterize how many of them use deep learning, but I would carefully estimate that at least 75% do.
  • Visual Question Answering, which has grown to be one of the most popular challenges (is classification too easy?). 25 papers relate to VQA.
  • Last year’s trends, such as attention and RL for non-typical RL problems, maintain their strength.
