Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they replicate the backbone
network as a ReferenceNet or rely on additional image encoders to process condition inputs, resulting in high training
and inference costs.
In this work, we rethink the necessity of ReferenceNet and image encoders and streamline the interaction between garment
and person, proposing CatVTON, a simple and efficient virtual try-on diffusion model. It enables the seamless
transfer of in-shop or worn garments of arbitrary categories to target persons by simply concatenating the garment and
person images along spatial dimensions as inputs (see the sketch after the list below). The efficiency of our model is
demonstrated in three aspects:
(1) Lightweight network. Only the original diffusion modules are used, without additional network modules. The text
encoder and the cross-attention layers for text injection in the backbone are removed, further reducing the parameter
count by 167.02M.
(2) Parameter-efficient training. We identified the try-on-relevant modules through experiments and achieved
high-quality try-on effects by training only 49.57M parameters (~5.51% of the backbone network's parameters), as
sketched after this list.
(3) Simplified inference. CatVTON eliminates all unnecessary conditions and preprocessing steps, including
pose estimation, human parsing, and text input; only a garment reference, a target person image, and a mask are
required for the virtual try-on process.
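To make the spatial-concatenation and mask-only-inference ideas concrete, the following is a minimal PyTorch-style sketch, not the released implementation: `vae`, `unet`, and `scheduler` are placeholders for a standard latent-diffusion stack with simplified call signatures, and the inpainting-style channel layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def build_concatenated_input(person, garment, mask, vae):
    """Concatenate masked-person and garment latents along the width axis."""
    # Encode both images into the latent space (assumed shape: B x C x h x w).
    person_latent = vae.encode(person * (1 - mask))   # inpainting-style masked person
    garment_latent = vae.encode(garment)
    # Spatial concatenation: garment and person share one canvas, so the
    # backbone's self-attention can exchange information between them
    # without a ReferenceNet or an extra image encoder.
    x_cond = torch.cat([person_latent, garment_latent], dim=-1)       # concat on width
    # Downsample the mask to latent resolution; the garment half gets zeros
    # because nothing is inpainted there.
    mask_latent = F.interpolate(mask, size=person_latent.shape[-2:])
    mask_cond = torch.cat([mask_latent, torch.zeros_like(mask_latent)], dim=-1)
    return x_cond, mask_cond

@torch.no_grad()
def try_on(person, garment, mask, vae, unet, scheduler):
    """Denoise over the concatenated canvas; no text, pose, or parsing inputs."""
    x_cond, mask_cond = build_concatenated_input(person, garment, mask, vae)
    latent = torch.randn_like(x_cond)                  # start from pure noise
    for t in scheduler.timesteps:
        # Channel-wise input: noisy latent + mask + masked condition latent
        # (an assumed 4 + 1 + 4 layout, mirroring common inpainting backbones).
        model_in = torch.cat([latent, mask_cond, x_cond], dim=1)
        noise_pred = unet(model_in, t)                 # no text embedding is passed
        # `scheduler.step` is assumed to return the updated latent directly.
        latent = scheduler.step(noise_pred, t, latent)
    person_half = latent[..., : latent.shape[-1] // 2] # keep only the person side
    return vae.decode(person_half)
```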
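Likewise, a hedged sketch of the parameter-efficient training setup: the abstract does not name the try-on-relevant modules, so the snippet assumes, purely for illustration, that they are the backbone's self-attention layers (named `attn1` in the common Stable Diffusion convention); the actual module selection may differ.

```python
def mark_trainable(unet):
    """Freeze the backbone except the (assumed) try-on-relevant self-attention layers."""
    trainable = []
    for name, param in unet.named_parameters():
        # Train only self-attention parameters, which handle the
        # garment-person interaction on the concatenated canvas.
        param.requires_grad = "attn1" in name
        if param.requires_grad:
            trainable.append(param)
    total = sum(p.numel() for p in unet.parameters())
    tuned = sum(p.numel() for p in trainable)
    print(f"training {tuned / 1e6:.2f}M of {total / 1e6:.2f}M parameters")
    return trainable  # pass these to the optimizer, e.g. torch.optim.AdamW(trainable)
```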
Extensive experiments demonstrate that CatVTON achieves superior qualitative and
quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore,
CatVTON generalizes well to in-the-wild scenarios despite being trained only on open-source datasets with 73K samples.