๐Ÿˆ CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Sun Yat-Sen University, Pixocial Technology, Peng Cheng Laboratory, SIAT
[Teaser figure]

CatVTON is a simple and efficient virtual try-on diffusion model with 1) a lightweight network (899.06M parameters in total), 2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (< 8 GB VRAM at 1024×768 resolution).

Abstract

Virtual try-on methods based on diffusion models achieve realistic try-on effects but replicate the backbone network as a ReferenceNet or leverage additional image encoders to process condition inputs, resulting in high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and redesign the interaction between garment and person, proposing CatVTON, a simple and efficient virtual try-on diffusion model. It facilitates the seamless transfer of in-shop or worn garments of arbitrary categories to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network. Only the original diffusion modules are used, without additional network modules. The text encoder and the cross-attentions for text injection in the backbone are removed, further reducing the parameters by 167.02M. (2) Parameter-efficient training. We identify the try-on-relevant modules through experiments and achieve high-quality try-on effects by training only 49.57M parameters (~5.51% of the backbone network's parameters). (3) Simplified inference. CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, a target person image, and a mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite being trained on open-source datasets with only 73K samples.
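
To make the 167.02M figure concrete, the sketch below counts the parameters dropped when text conditioning is removed: the CLIP text encoder plus the UNet's cross-attention layers. This is a rough illustration, not the release code; it assumes the Diffusers naming convention in which cross-attention modules are called "attn2", and it loads Stable Diffusion v1.5 inpainting weights (availability of the runwayml checkpoint may vary).

        # Rough sketch (not the release code): count parameters removed along
        # with text conditioning, i.e. the text encoder plus the UNet's
        # cross-attention layers.
        from diffusers import UNet2DConditionModel
        from transformers import CLIPTextModel

        REPO = "runwayml/stable-diffusion-inpainting"  # SD v1.5 inpainting

        unet = UNet2DConditionModel.from_pretrained(REPO, subfolder="unet")
        text_encoder = CLIPTextModel.from_pretrained(REPO, subfolder="text_encoder")

        # In Diffusers UNets, "attn2" names the text cross-attention
        # (self-attention is "attn1").
        cross_attn = sum(p.numel() for n, p in unet.named_parameters() if ".attn2." in n)
        text = sum(p.numel() for p in text_encoder.parameters())
        print(f"removed: {(cross_attn + text) / 1e6:.2f}M")  # should be close to 167.02M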

Architecture

Our method achieves high-quality try-on by simply concatenating the conditional image (garment or reference person) with the target person image along the spatial dimension, ensuring they remain in the same feature space throughout the diffusion process. Only the self-attention parameters, which provide global interaction, are learnable during training. The cross-attention for text interaction is omitted as unnecessary, and no additional conditions, such as pose and parsing, are required. Together, these choices yield a lightweight network with minimal trainable parameters and simplified inference.
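
As a deliberately simplified illustration of these two ideas, the sketch below concatenates person and garment latents along a spatial axis and freezes every UNet parameter except the self-attention weights. It is not the release code; it assumes the Diffusers convention that "attn1" names self-attention, and the concatenation axis is chosen here only for illustration.

        # Simplified sketch (not the release code) of the two core ideas:
        # (1) condition by spatially concatenating garment and person latents,
        # (2) make only the self-attention parameters trainable.
        import torch
        from diffusers import UNet2DConditionModel

        # Toy VAE latents: (batch, channels, H/8, W/8) for 1024x768 images.
        person_latent = torch.randn(1, 4, 128, 96)   # target person (masked try-on region)
        garment_latent = torch.randn(1, 4, 128, 96)  # in-shop garment or reference person

        # (1) Spatial concatenation keeps both images in the same feature space,
        # so the UNet's self-attention alone can transfer garment details.
        x = torch.cat([person_latent, garment_latent], dim=-1)  # (1, 4, 128, 192)

        # (2) Freeze everything except self-attention ("attn1" in Diffusers;
        # the text cross-attention "attn2" is removed entirely in CatVTON).
        unet = UNet2DConditionModel.from_pretrained(
            "runwayml/stable-diffusion-inpainting", subfolder="unet")
        trainable = 0
        for name, p in unet.named_parameters():
            p.requires_grad = ".attn1." in name
            trainable += p.numel() if p.requires_grad else 0
        print(f"trainable: {trainable / 1e6:.2f}M")  # should be near 49.57M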

Structure Comparison

We illustrate a simple structure comparison of different kinds of try-on methods below. Our approach neither relies on warped garments nor requires a heavy ReferenceNet for additional garment encoding; it needs only a simple concatenation of the garment and person images as input to produce high-quality try-on results.

Efficiency Comparison

We represent each method by two concentric circles, where the outer circle denotes the total parameters and the inner circle denotes the trainable parameters, with the area proportional to the parameter count. CatVTON achieves a lower FID on the VITON-HD dataset with fewer total parameters, fewer trainable parameters, and lower memory usage.

Online Demo

Since GitHub Pages does not support embedded web pages, please visit our Demo.

Acknowledgement

Our code is built on Diffusers. We adopt Stable Diffusion v1.5 inpainting as the base model, and we use SCHP and DensePose to automatically generate masks in our Gradio App. Thanks to all the contributors!

BibTeX


        @misc{chong2024catvtonconcatenationneedvirtual,
          title={CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models}, 
          author={Zheng Chong and Xiao Dong and Haoxiang Li and Shiyue Zhang and Wenqing Zhang and Xujie Zhang and Hanqing Zhao and Xiaodan Liang},
          year={2024},
          eprint={2407.15886},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2407.15886}, 
        }