This blog explains the first deep learning-based method to tackle the task of image matting.
The Deep Image Matting (DIM) paper by Xu et al. was pivotal in image matting research: it used deep neural networks to solve some key challenges of the task and established design choices that later works took inspiration from. The goal of the paper is to take an input image and a user-specified refinement region and predict an alpha matte, which represents the opacity values in the refinement region. For an in-depth introduction to this task and some of its applications, have a look at this blog. The method presented in this paper is not fully automatic and requires a user-generated trimap.
In the figure below, we can see the input image in the first column and an image indicating the refinement region by grey values in the second. Both these images are input to the DIM model. The third and fourth columns are the predicted alpha mattes of a classical method, Closed-form Matting, and their proposed method. We see that the classical method has difficulty with the similar dark colors of the hair and tower.
The existing approaches at the time, like Bayesian Matting or Matting Laplacian, largely relied on color information to solve the matting problem. Purely color-based approaches are bound to fail in many natural scenes, as foregrounds and backgrounds often have similar colors. Instead, the authors propose to use deep neural networks, which can consider structural and textural information as well as color. In cases where the foreground and background share similar colors, the network can rely on structural and textural information to predict the correct matte. In Fig 1 above, we see that the classical method has difficulty with the similar dark colors of the hair and tower, while the deep network relies on its structural knowledge of hair to predict the correct alpha matte. The major contributions of this paper are:
- Formulating the task of image matting as a deep learning problem
- Creating a dataset and augmentation techniques for the problem
To create the dataset, the authors collect images of simple subjects on simple backgrounds and manually create ground truth alpha mattes and foreground images in Photoshop. The foreground image represents the true colors of the subject as if it were fully opaque. The resulting dataset contains 493 unique foreground subjects. Each subject has a foreground image (Fig 2c) and a ground truth alpha matte (Fig 2b).
Using the matting equation, we can then take the foreground F with its alpha matte α and place it on a new background B to form a composite image I as:

$$I = \alpha F + (1 - \alpha) B$$

The second row in Fig 2 shows examples of such compositions on different backgrounds.
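The compositing step is a per-pixel blend, so it can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the function name `composite` and the toy 2×2 arrays are assumptions made for the example.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background with the matting equation I = a*F + (1-a)*B.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W) in [0, 1].
    """
    a = alpha[..., np.newaxis]  # broadcast alpha across the colour channels
    return a * foreground + (1.0 - a) * background

# Toy example: a pure red foreground over a pure blue background.
fg = np.zeros((2, 2, 3))
fg[..., 0] = 1.0
bg = np.zeros((2, 2, 3))
bg[..., 2] = 1.0
alpha = np.array([[1.0, 0.5], [0.25, 0.0]])

image = composite(fg, bg, alpha)
```

Pixels with alpha 1 keep the foreground color, pixels with alpha 0 show only the background, and fractional values mix the two.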
During training, the network takes in a user-generated trimap which specifies a refinement region for which the network should output its prediction. This means that the network does not need to learn about object classes like humans or dogs; the semantic information of the image is provided to the network by the trimap. The authors train the model on random crops, as the network only needs to predict transparency values, which can be determined locally.
As creating matting data is labour-intensive and expensive, they follow another work and create a training scheme which uses synthetic data made from the small dataset they created.
The training dataset is created as follows:
- A random crop of size 320×320, 480×480 or 640×640, centred on a pixel in the refinement region, is sampled from the image
- Each crop is resized to a fixed input size of 320×320
- The crop is then composited on a COCO background to create a synthetic image using the matting equation
For each image, they sample 100 crops, creating a total dataset of 493 × 100 = 49,300 images
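The steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the trimap encoding (128 for the refinement region), the nearest-neighbour resize, the helper names, and the random array standing in for a COCO background are all choices made for the example. It also assumes the source image is at least 640 pixels on each side.

```python
import numpy as np

CROP_SIZES = (320, 480, 640)
OUTPUT_SIZE = 320
UNKNOWN = 128  # assumed trimap value for the refinement region

def sample_crop_box(trimap, rng):
    """Pick a square crop centred on a random pixel of the refinement region."""
    ys, xs = np.nonzero(trimap == UNKNOWN)
    i = rng.integers(len(ys))
    size = int(rng.choice(CROP_SIZES))
    h, w = trimap.shape
    # Shift the box where needed so that it stays inside the image.
    top = int(np.clip(ys[i] - size // 2, 0, h - size))
    left = int(np.clip(xs[i] - size // 2, 0, w - size))
    return top, left, size

def resize_nearest(array, out_size):
    """Nearest-neighbour resize of the first two axes to out_size x out_size."""
    h, w = array.shape[:2]
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return array[rows][:, cols]

def make_training_sample(foreground, alpha, trimap, background, rng):
    """Crop, resize, then composite onto the background."""
    top, left, size = sample_crop_box(trimap, rng)
    window = (slice(top, top + size), slice(left, left + size))
    fg = resize_nearest(foreground[window], OUTPUT_SIZE)
    a = resize_nearest(alpha[window], OUTPUT_SIZE)
    tri = resize_nearest(trimap[window], OUTPUT_SIZE)
    bg = resize_nearest(background, OUTPUT_SIZE)
    image = a[..., None] * fg + (1.0 - a[..., None]) * bg  # matting equation
    return image, tri, a

# Synthetic stand-ins for one dataset entry and one background image.
rng = np.random.default_rng(0)
foreground = rng.random((800, 800, 3))
background = rng.random((500, 700, 3))
alpha = np.zeros((800, 800))
alpha[300:500, 300:500] = 1.0
trimap = np.zeros((800, 800), dtype=np.uint8)
trimap[280:520, 280:520] = UNKNOWN  # band around the subject boundary
trimap[320:480, 320:480] = 255      # definite foreground

image, tri, a = make_training_sample(foreground, alpha, trimap, background, rng)
```

Centring the crop on a refinement-region pixel guarantees that every training sample contains pixels the network actually has to predict.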
Their model is a fairly straightforward encoder-decoder architecture. The network takes in a 4-channel input of the image and trimap concatenated. The encoder is initialised as the first 14 layers of a pretrained VGG16 with 5 max pools. The decoder has six convolution layers and outputs an alpha matte as a single-channel map.
The large receptive field of the encoder-decoder architecture leads to smooth and continuous predictions rather than sharp ones. The authors therefore include a small refinement block after the encoder-decoder to further refine its predicted alpha matte. The refinement network contains 4 convolutional layers operating at the full input resolution so that low-level details are not compressed. The input to this block is a 4-channel concatenation of the original image and the alpha matte predicted by the encoder-decoder. The refinement block is formulated as a residual learning problem to sharpen the previously predicted matte.
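To get a feel for how much the encoder compresses the input, the feature map shapes can be traced through the five pooling stages. This is a rough sketch that assumes the standard VGG16 stage widths and stride-2 pooling; it only tracks shapes and does not build the network.

```python
# Channel widths of the five VGG16 convolution stages; each stage ends in a max pool.
VGG16_STAGE_CHANNELS = (64, 128, 256, 512, 512)

def encoder_shapes(input_size=320, in_channels=4):
    """Return (channels, height, width) at the input and after each max pool."""
    shapes = [(in_channels, input_size, input_size)]
    size = input_size
    for channels in VGG16_STAGE_CHANNELS:
        size //= 2  # a stride-2 max pool halves height and width
        shapes.append((channels, size, size))
    return shapes

shapes = encoder_shapes()
for shape in shapes:
    print(shape)
```

A 320×320 input therefore reaches the decoder as a 10×10 feature map, so the decoder has to recover a factor of 32 in resolution.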
The paper uses two losses - the Alpha Prediction Loss and the Compositional Loss.
The Alpha Prediction Loss is a fairly straightforward L1 loss between the predicted alpha matte α_p and the ground truth alpha matte α_g, written in a smoothed form so that it stays differentiable at zero:

$$\mathcal{L}_\alpha = \sqrt{(\alpha_p - \alpha_g)^2 + \epsilon^2}$$

where ε is a small constant.
Once the network predicts an alpha matte, we can use it with the ground truth foreground and a random background to make a new composition. We can do the same with the ground truth alpha matte. The Compositional Loss then computes the L1 loss between these two compositions, which pushes the network to predict an alpha that is correct for the foreground map.
The final loss is a linear combination of the two losses. Both losses are computed only over the refinement (unknown) region of the trimap.
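Both losses and their combination can be sketched in NumPy as follows. This is a minimal sketch rather than the authors' implementation: it uses a smoothed absolute difference, sqrt(d² + ε²), as a differentiable stand-in for L1, and the value of `EPS`, the default weight `w_alpha`, and the averaging over color channels are assumptions made for the example.

```python
import numpy as np

EPS = 1e-6  # small constant (assumed value) that keeps the loss smooth at zero

def smooth_abs(diff):
    """Differentiable stand-in for |diff|."""
    return np.sqrt(diff ** 2 + EPS ** 2)

def matting_loss(alpha_pred, alpha_gt, foreground, background, unknown, w_alpha=0.5):
    """Weighted sum of the alpha prediction and compositional losses.

    alpha_pred, alpha_gt: (H, W) floats in [0, 1].
    foreground, background: (H, W, 3) floats in [0, 1].
    unknown: (H, W) boolean mask of the trimap's refinement region.
    w_alpha: weight of the alpha loss (hypothetical default).
    """
    mask = unknown.astype(np.float64)
    n = max(mask.sum(), 1.0)  # avoid dividing by zero on an empty mask

    alpha_loss = (smooth_abs(alpha_pred - alpha_gt) * mask).sum() / n

    # Composite once with the predicted alpha and once with the ground truth alpha.
    comp_pred = alpha_pred[..., None] * foreground + (1 - alpha_pred[..., None]) * background
    comp_gt = alpha_gt[..., None] * foreground + (1 - alpha_gt[..., None]) * background
    comp_err = smooth_abs(comp_pred - comp_gt).mean(axis=-1)  # average over RGB
    comp_loss = (comp_err * mask).sum() / n

    return w_alpha * alpha_loss + (1 - w_alpha) * comp_loss

rng = np.random.default_rng(0)
fg, bg = rng.random((8, 8, 3)), rng.random((8, 8, 3))
alpha_gt = rng.random((8, 8))
unknown = np.zeros((8, 8), dtype=bool)
unknown[2:6, 2:6] = True

perfect = matting_loss(alpha_gt, alpha_gt, fg, bg, unknown)
noisy = matting_loss(np.clip(alpha_gt + 0.2, 0, 1), alpha_gt, fg, bg, unknown)
```

Because of the mask, errors in regions the trimap already marks as definite foreground or background do not contribute to the loss.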
The paper reports excellent results compared to other techniques at the time. Their method is more robust to similar foreground and background colors and can predict sharper alpha mattes. The method also achieves low SAD (sum of absolute differences) error on the alphamatting.com test set, where lower values are better. Kindly check the paper for more metrics and visual comparisons.
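For reference, SAD is the sum of absolute differences between the predicted and ground truth alpha mattes, so lower values mean a better matte. A minimal sketch is shown below; restricting the sum to the refinement region and the function name `sad` are assumptions for the example, and benchmark scripts may scale the value differently.

```python
import numpy as np

def sad(alpha_pred, alpha_gt, unknown):
    """Sum of absolute differences over the refinement region; lower is better."""
    return float(np.abs(alpha_pred - alpha_gt)[unknown].sum())

alpha_gt = np.full((4, 4), 0.5)
alpha_pred = alpha_gt + 0.1
unknown = np.zeros((4, 4), dtype=bool)
unknown[1:3, 1:3] = True  # four pixels to evaluate

score = sad(alpha_pred, alpha_gt, unknown)
```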
Overall this paper introduced several ideas that have greatly contributed to the field of image matting research. Kindly check the paper for more details.
To see image matting in action, try out our background remover EraseBG which is set to launch a big upgrade on 10th March 2021.