Video Matting Benchmark

Contact person: Mikhail Erofeev

Team: Dmitriy Vatolin, Mikhail Erofeev


Video matting is the problem of accurately decomposing a given video sequence into a background layer, a foreground layer, and a transparency map. The VideoMatting project is the first online benchmark of video matting methods. Its goal is to provide a better understanding of current progress in the field of video matting and to aid in developing new methods.

Full comparison results are available at the project website.

Key Contributions

  • 10 test sequences with ground-truth transparency maps
  • Objective evaluation of 12 image and video matting methods
  • Subjective evaluation of 12 matting methods
  • Perceptually motivated metric for the temporal coherency of transparency maps


Our evaluation data set consists of 10 test cases with various levels of complexity. Each test case includes the following parts:

  • Background
  • Trimap
  • Ground-truth alpha (transparency map)
  • Ground-truth foreground
  • Composite

To obtain trimaps, we perform morphological operations on ground-truth (GT) alpha mattes; to obtain composites, we place ground-truth foregrounds over new backgrounds. Having ground-truth alpha and foreground information makes obtaining trimaps and composites easy; the hard part is obtaining high-quality ground-truth alpha and foreground sequences. To this end, we use two approaches: chroma keying and stop-motion capture.
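
The trimap and composite steps described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual tooling: the function names, the square neighbourhood, the `radius` parameter, and the `lo`/`hi` thresholds are all assumptions, and images are plain nested lists holding a single channel.

```python
def make_trimap(alpha, radius=1, lo=0.0, hi=1.0):
    """Label each pixel 0.0 (background), 1.0 (foreground) or 0.5 (unknown).

    A pixel stays definite foreground (background) only if every pixel in
    its (2 * radius + 1)^2 neighbourhood has alpha >= hi (<= lo). This
    erodes both definite regions, which widens the unknown band.
    The radius and thresholds are illustrative assumptions.
    """
    h, w = len(alpha), len(alpha[0])
    trimap = [[0.5] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                alpha[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            if min(window) >= hi:
                trimap[y][x] = 1.0
            elif max(window) <= lo:
                trimap[y][x] = 0.0
    return trimap


def composite(alpha, foreground, background):
    """Blend per pixel: C = alpha * F + (1 - alpha) * B."""
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(row_a, row_f, row_b)]
        for row_a, row_f, row_b in zip(alpha, foreground, background)
    ]
```

A larger `radius` gives a wider unknown band and therefore a harder test case for the matting method.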

Chroma Keying

[Figure panels: Green screen, Stop motion]

Alpha mattes from chroma keying and stop-motion capture for the same image region. The stop-motion result is significantly better at preserving details.

Chroma keying is a common practice in the cinema industry: the cinematographer captures an actor in front of a green or blue screen, then a VFX expert replaces the background using special software. Our evaluation uses five green-screen video sequences with a significant amount of semitransparency (e.g., hair or motion blur), provided to us by Hollywood Camera Work. We extract alpha mattes and the corresponding foregrounds using The Foundry Keylight. Chroma keying enables us to get alpha mattes of natural-looking objects with arbitrary motion. Nevertheless, this technique can’t guarantee that the alpha maps are natural, because it assumes the screen color is absent from the foreground object. To get alpha maps that have a very natural appearance, we use the stop-motion method.

Stop Motion

We designed the following procedure to perform stop-motion capture: A fuzzy toy is placed on the platform in front of an LCD monitor. The toy rotates in small, discrete steps along a predefined 3D trajectory, controlled by two servos connected to a computer. After each step the digital camera in front of the setup captures the motionless toy against a set of background images. At the end of this process, the toy is removed and the camera again captures all of the background images.

We paid special attention to avoiding reflections of the background screen in the foreground object. These reflections can lead to false transparency that is especially noticeable in nontransparent regions. To reduce the amount of reflection we used checkerboard background images instead of solid colors, thereby adjusting the mean color of the screen to be the same for each background.
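
The text above does not say how alpha is computed from the captured images. One standard possibility, shown here purely as an assumption, is triangulation matting: with the same motionless object photographed against several known backgrounds, each pixel satisfies observed = premultiplied foreground + (1 - alpha) * background, so alpha follows from a least-squares line fit. The function name and the single-channel simplification are hypothetical.

```python
def triangulate_pixel(observed, backgrounds):
    """Estimate (alpha, premultiplied foreground) for one pixel, one channel.

    Assumed model: observed[k] = fg_premult + (1 - alpha) * backgrounds[k].
    The slope of the least-squares line through the (background, observed)
    points equals (1 - alpha); the intercept is the premultiplied foreground.
    """
    n = len(observed)
    mean_b = sum(backgrounds) / n
    mean_c = sum(observed) / n
    var_b = sum((b - mean_b) ** 2 for b in backgrounds)
    if var_b == 0.0:
        # Identical backgrounds carry no information about transparency.
        raise ValueError("backgrounds must differ at this pixel")
    cov = sum((b - mean_b) * (c - mean_c)
              for b, c in zip(backgrounds, observed))
    slope = cov / var_b
    alpha = min(1.0, max(0.0, 1.0 - slope))  # clamp to the valid range
    fg_premult = mean_c - (1.0 - alpha) * mean_b
    return alpha, fg_premult
```

Under this model, a reflection of the screen in the object adds a background-dependent term to the observed color, which the fit cannot distinguish from transparency. That matches the false-transparency concern raised above.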

Subjective Evaluation

To support our choice of quality metric, we collected subjective pairwise comparisons of 12 matting methods applied to the videos from our data set. In particular, we showed participants a sequence of video pairs; for each pair, we asked them to choose the video with better quality or to indicate that the videos are approximately equal. For the sake of contrast, we composed the results of the methods over blue and yellow checkerboard images, and since many pairs differ only during brief intervals, we divided long sequences into shorter segments of 50 to 60 frames. Among our viewers were 101 volunteers and 442 paid participants from Amazon Mechanical Turk. We offered paid participants $0.05 for every 23 pairs, 3 of which were hidden quality-control comparisons between ground truth and a low-quality method; to accept the data from a given individual, we required correct choices for all control comparisons. In total, we collected 12,629 comparisons (3,789 from volunteers and 8,840 from paid participants), equally distributed among methods and sequences.
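
The acceptance rule described above (keep a participant's answers only if every hidden control comparison was answered correctly) can be sketched as follows. The record layout, with keys `participant`, `is_control`, `correct`, and `choice`, is a hypothetical format chosen for illustration, as are the function names.

```python
from collections import defaultdict


def accepted_comparisons(responses):
    """Keep non-control answers from participants who passed every control.

    Each response is a dict with the hypothetical keys 'participant',
    'is_control', 'correct' (controls only) and 'choice' (non-controls).
    """
    failed = {
        r["participant"]
        for r in responses
        if r["is_control"] and not r["correct"]
    }
    return [
        r for r in responses
        if not r["is_control"] and r["participant"] not in failed
    ]


def tally(comparisons):
    """Count how often each method was preferred.

    A choice of None stands for "approximately equal" and adds no win.
    """
    wins = defaultdict(int)
    for r in comparisons:
        if r["choice"] is not None:
            wins[r["choice"]] += 1
    return dict(wins)
```

With 3 control pairs in each batch of 23, a single wrong control answer discards that participant's remaining answers under this rule.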

Objective Evaluation

Our comparison includes both image- and video-matting methods. We apply each matting method to the videos in our data set: we provide the trimap and composite sequences to each method and use the ground-truth alpha and foreground to measure the method’s performance. We compare quality along two dimensions: accuracy, estimated by the mean squared per-frame error, and temporal coherency, estimated by a motion-aware squared difference of temporal derivatives.

Both metrics are defined in terms of the number of frames, the transparency values of the matting result under consideration and of the ground truth at each pixel of each frame, and the motion vector at each pixel.
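
The expressions below are one plausible reading of the two metrics as they are described in words, not a transcription of the original formulas. The symbol names are assumptions: $N$ for the number of frames, $\alpha_{p,t}$ and $\alpha^{GT}_{p,t}$ for the transparency of the evaluated result and of the ground truth at pixel $p$ of frame $t$, and $\mathbf{v}_p$ for the motion vector at pixel $p$. The normalization by $N$ and the absence of a square root are assumptions as well.

```latex
% Accuracy: per-frame squared error, averaged over frames (assumed form)
E_{\mathrm{acc}} = \frac{1}{N} \sum_{t} \sum_{p}
    \left( \alpha_{p,t} - \alpha^{GT}_{p,t} \right)^{2}

% Temporal coherency: squared difference of motion-compensated
% temporal derivatives (assumed form)
E_{\mathrm{temp}} = \frac{1}{N} \sum_{t} \sum_{p}
    \left(
        \left( \alpha_{p+\mathbf{v}_p,\,t+1} - \alpha_{p,t} \right)
        -
        \left( \alpha^{GT}_{p+\mathbf{v}_p,\,t+1} - \alpha^{GT}_{p,t} \right)
    \right)^{2}
```

In this reading, a result that differs from the ground truth by a constant offset in every frame is penalized by the first metric but not by the second, which only responds to differences in how transparency changes along the motion trajectory.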

In all experiments in this section we use a block-based motion estimation algorithm, computed on the ground-truth sequence composited over a solid green background. The figure below shows method ranks for all of our test sequences.
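
A minimal sketch of how the two quality measures in this section could be computed once motion vectors are available. The exact formulas are assumptions based on the verbal descriptions (per-frame squared error, and squared difference of motion-compensated temporal derivatives, both divided by the number of frames). The function names and the nested-list layout `alpha[t][y][x]` are hypothetical, and motion estimation itself is not shown: integer vectors are taken as input.

```python
def accuracy_error(alpha, alpha_gt):
    """Mean over frames of the per-frame sum of squared alpha differences."""
    total = 0.0
    for frame, frame_gt in zip(alpha, alpha_gt):
        for row, row_gt in zip(frame, frame_gt):
            total += sum((a - g) ** 2 for a, g in zip(row, row_gt))
    return total / len(alpha)


def temporal_error(alpha, alpha_gt, motion):
    """Squared difference of motion-compensated temporal derivatives.

    motion[t][y][x] is an integer (dy, dx) vector from frame t to t + 1.
    Pixels whose target falls outside the frame are skipped (an assumption).
    """
    n, h, w = len(alpha), len(alpha[0]), len(alpha[0][0])
    total = 0.0
    for t in range(n - 1):
        for y in range(h):
            for x in range(w):
                dy, dx = motion[t][y][x]
                y2, x2 = y + dy, x + dx
                if not (0 <= y2 < h and 0 <= x2 < w):
                    continue
                d = alpha[t + 1][y2][x2] - alpha[t][y][x]
                d_gt = alpha_gt[t + 1][y2][x2] - alpha_gt[t][y][x]
                total += (d - d_gt) ** 2
    return total / n
```

Taking the motion vectors from the ground-truth sequence, as described above, means every method is scored along the same trajectories regardless of its own errors.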