Abstract and 1. Introduction
Related Works
MaGGIe
3.1. Efficient Masked Guided Instance Matting
3.2. Feature-Matte Temporal Consistency
Instance Matting Datasets
4.1. Image Instance Matting and 4.2. Video Instance Matting
Experiments
5.1. Pre-training on image data
5.2. Training on video data
Discussion and References
\ Supplementary Material
Architecture details
Image matting
8.1. Dataset generation and preparation
8.2. Training details
8.3. Quantitative details
8.4. More qualitative results on natural images
Video matting
9.1. Dataset generation
9.2. Training details
9.3. Quantitative details
9.4. More qualitative results
Matting methods can be categorized in many ways; here we review previous works based on their primary input types. Table 1 provides a brief comparison between prior approaches and our MaGGIe.
\ Image Matting. Traditional matting methods [4, 24, 25] rely on color sampling to estimate the foreground and background, often producing noisy results because they lack high-level object features. Deep learning-based methods [9, 11, 31, 37, 46, 47, 54] have significantly improved results by combining image and trimap inputs or by jointly learning high-level and detailed features. However, these methods are sensitive to trimap inaccuracies and typically assume a single-object scenario. Recent approaches [5, 6, 22] require only image inputs but struggle when multiple salient objects are present. MGM [56] and its extension MGM-in-the-wild [39] introduce binary mask-based matting, which addresses multi-salient-object issues and reduces trimap dependency. InstMatt [49] further adapts this setting to multi-instance scenarios with a complex refinement algorithm. Our work extends these developments, focusing on efficient, end-to-end instance matting with binary mask guidance. Image matting also benefits from diverse datasets [22, 26, 27, 29, 33, 50, 54], supplemented by background augmentation from sources such as BG20K [29] or COCO [35]. We likewise leverage currently available datasets to build a robust benchmark for mask-guided human instance matting.
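The background augmentation mentioned above is usually plain alpha compositing: an extracted foreground is blended onto a new background (e.g., from BG20K or COCO) using its ground-truth alpha matte. The sketch below is our own minimal illustration of this standard recipe; the function name and array layout are illustrative choices, not code from any of the cited works.

```python
import numpy as np

def composite_onto_background(fg, alpha, bg):
    """Alpha-composite a matted foreground onto a new background image.

    fg, bg : float arrays of shape (H, W, 3), values in [0, 1]
    alpha  : float array of shape (H, W, 1), values in [0, 1]
    Returns I = alpha * F + (1 - alpha) * B.
    """
    return alpha * fg + (1.0 - alpha) * bg

# Example: place a matted foreground onto a gray background.
h, w = 256, 256
fg = np.random.rand(h, w, 3).astype(np.float32)       # stand-in foreground
alpha = np.random.rand(h, w, 1).astype(np.float32)     # stand-in alpha matte
bg = np.full((h, w, 3), 0.5, dtype=np.float32)         # new background
image = composite_onto_background(fg, alpha, bg)
```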
\ Video Matting. Temporal consistency is a key challenge in video matting. Trimap-propagation methods [17, 45, 48] and background-knowledge-based approaches such as BGMv2 [33] aim to reduce trimap dependency. Recent techniques [28, 32, 34, 53, 57] incorporate ConvGRU, attention-based memory matching, or transformer architectures for temporal feature aggregation. SparseMat [50] instead focuses on fusing outputs for consistency. Our approach builds on these foundations, combining feature and output fusion for better temporal consistency in alpha maps. Video matting datasets remain scarce because collecting such data is difficult: VideoMatte240K [33] and VM108 [57] consist of composited videos, while CRGNN [52] is the only one offering natural videos for human matting. To address the gap in instance-aware video matting datasets, we adapt existing public datasets for training and evaluation, focusing on human subjects.
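For the recurrent temporal aggregation used by several of the methods cited above, a convolutional GRU is a common building block: per-frame features are fused with a hidden state carried across frames. The PyTorch sketch below is our own minimal illustration of that idea; it is not the exact cell used by any cited method or by MaGGIe.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell that fuses current-frame features with a
    hidden state propagated from previous frames."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces both the update (z) and reset (r) gates.
        self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=pad)
        # A second convolution produces the candidate hidden state.
        self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # x: current-frame features, h: previous hidden state; both (B, C, H, W).
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1.0 - z) * h + z * h_tilde

# Roll the cell over a short clip of per-frame feature maps.
cell = ConvGRUCell(channels=32)
frames = torch.randn(8, 1, 32, 64, 64)   # (T, B, C, H, W)
h = torch.zeros(1, 32, 64, 64)           # initial hidden state
for x in frames:
    h = cell(x, h)                       # temporally aggregated features
```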
:::info Authors:
(1) Chuong Huynh, University of Maryland, College Park ([email protected]);
(2) Seoung Wug Oh, Adobe Research ([email protected]);
(3) Abhinav Shrivastava, University of Maryland, College Park ([email protected]);
(4) Joon-Young Lee, Adobe Research ([email protected]).
:::
:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.
:::


