TL;DR: IMAGEdit is a training-free and plug-and-play framework that aligns prompts and retargets masks to enable any-subject video editing.
Abstract
We present IMAGEdit, a training-free framework for video editing that transforms subjects of designated categories, for any number of subjects. IMAGEdit provides robust multimodal conditioning and precise mask motion sequences through two key components: a prompt-guided multimodal alignment module and a prior-based mask retargeting module. By leveraging the understanding and generation capabilities of large pretrained models, these components produce aligned multimodal signals and time-consistent masks, which remedy insufficient prompt-side conditioning and overcome mask boundary entanglement in crowded scenes. The framework then conditions a pretrained mask-driven video generator to synthesize the edited video. IMAGEdit is plug-and-play with a wide range of mask-driven backbones and consistently improves their overall performance. Extensive experiments on the new multi-subject benchmark MSVBench show that IMAGEdit surpasses state-of-the-art methods.
How does it work?
IMAGEdit is a training-free, plug-and-play framework that combines prompt-guided multimodal alignment with prior-based mask retargeting to enable any-subject video editing. It first grounds user prompts with strong textual–visual alignment, then refines instance masks using depth and temporal priors to ensure smooth and consistent motion boundaries. These enhanced conditions are fed into a pretrained mask-driven video generator, producing subject-accurate, temporally coherent, and background-preserving editing results without the need for additional training.
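To make the mask refinement idea more concrete, here is a minimal sketch of how depth and temporal priors could be used to clean up instance masks. It is a toy illustration under stated assumptions, not the authors' implementation: the function names `resolve_overlaps` and `temporal_majority` are hypothetical, depth is assumed to be a per-pixel map where smaller values mean closer to the camera, and the heuristics (nearest instance wins contested pixels, sliding-window majority vote over time) are stand-ins chosen for simplicity.

```python
import numpy as np


def resolve_overlaps(masks, depth):
    """Give each contested pixel to the nearest instance (toy depth prior).

    masks: (N, H, W) bool array, one mask per instance for a single frame.
    depth: (H, W) float array; smaller values are assumed to be closer.
    """
    # Median depth per instance; empty masks are treated as infinitely far.
    med = np.array([np.median(depth[m]) if m.any() else np.inf for m in masks])
    order = np.argsort(med)  # nearest instance first

    claimed = np.zeros(masks.shape[1:], dtype=bool)
    out = np.zeros_like(masks)
    for i in order:
        # An instance keeps only pixels not already taken by a nearer one.
        out[i] = masks[i] & ~claimed
        claimed |= masks[i]
    return out


def temporal_majority(masks_t, window=3):
    """Smooth one instance's masks over time (toy temporal prior).

    masks_t: (T, H, W) bool array for a single instance across T frames.
    A pixel is kept at frame t if it is set in more than half of the
    frames inside the window centered on t (clipped at sequence ends).
    """
    num_frames = masks_t.shape[0]
    half = window // 2
    out = np.empty_like(masks_t)
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        votes = masks_t[lo:hi].sum(axis=0)
        out[t] = votes * 2 > (hi - lo)
    return out
```

In this sketch, `resolve_overlaps` would be applied per frame to separate entangled boundaries, and `temporal_majority` per instance to suppress single-frame flicker before the masks are passed to a mask-driven generator.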
Comparison with Other Video Editing Methods
Shown for each example: Original, FateZero, TokenFlow, VideoPainter, VideoGrain, DMT, IMAGEdit.
Two [Puppies -> Kittens] together on a weighing scale.
One black sheepdog herding four [Ducks -> Robot Ducks] on green field.
Four [Female Runners -> Spider-Men] sprint on track.
Seven [Kabaddi Players -> Gokus] facing each other on a purple mat in stadium lighting.
[Players -> Spider-Men] on trampolines throwing dodgeballs during an intense match.
[Relay Runners -> Astronauts] sprinting on track.
[Volleyball Players -> Robots] compete on indoor court with net.
Main Video Results
Three [People -> Super Mario] sitting in car backseat.
Four [People -> Robots] standing on football court.
Four [Hungry Dogs -> Robot Wolves] surrounding a bowl of food outdoors.
A group of [People -> Astronauts] practicing boxing in a fitness studio.
A team of [Men -> Spider-Men] rowing together on a river.
Eight [Hurdlers -> Iron Men] leap mid-race over purple hurdles.
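The captions above mark each edit with a bracketed `[source -> target]` span. Assuming that reading of the notation (the page does not define it formally), a small helper such as the hypothetical `split_caption` below could expand a caption into a source description, a target description, and the list of category swaps. This is only a sketch of the notation, not part of the released code.

```python
import re

# Matches one bracketed edit of the form "[source -> target]".
EDIT_PATTERN = re.compile(r"\[\s*([^\[\]]+?)\s*->\s*([^\[\]]+?)\s*\]")


def split_caption(caption):
    """Return (source_prompt, target_prompt, edits) for a bracketed caption."""
    # Collect every (source, target) pair in order of appearance.
    edits = [(m.group(1), m.group(2)) for m in EDIT_PATTERN.finditer(caption)]
    # Replace each bracketed span with its source or target side.
    source = EDIT_PATTERN.sub(lambda m: m.group(1), caption)
    target = EDIT_PATTERN.sub(lambda m: m.group(2), caption)
    return source, target, edits
```

For example, `split_caption("Two [Puppies -> Kittens] together on a weighing scale.")` yields the source sentence with "Puppies", the target sentence with "Kittens", and the single pair `("Puppies", "Kittens")`.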
Multi-Scenario Applications
Autumn Forest -> Winter Forest
Snowy Forest -> Lunar Surface
The Eiffel Tower -> The Space Needle
Glasses -> Sunglasses
Left -> Ultraman; Right -> Robot
Left -> Gorilla; Right -> Polar Bear
Left -> Lightning McQueen; Right -> Yellow Cartoon Porsche
Two People (arm wrestling) -> Two Supermen
Turn 1: Horse Riders -> Gokus
Turn 2: The two above (Gokus -> Iron-Men)
Add Glasses
Change Face To "Kevin Durant"
Change Face To "LeBron James"
Remove Glasses
Plaid Shirt -> Business Suit
Plaid Shirt -> Hawaiian Shirt
Disclaimer
We make this project openly available to the research community. Most of the datasets and resources involved are either generated by us or obtained under proper licenses. If you believe any material infringes your rights, please contact us, and we will take immediate action to remove or replace the content. Any components derived from third-party models or datasets must strictly follow their original license agreements.
This work is intended to promote academic exploration and innovation in the area of controllable generation. Users are welcome to experiment with the released code and models under the condition that they act in accordance with local regulations and use them responsibly. The authors and contributors disclaim any liability for potential misuse or unintended applications of this tool.
BibTeX
@article{shen2025imagedit,
title={IMAGEdit: Let Any Subject Transform},
author={Shen, Fei and Xu, Weihao and Yan, Rui and Zhang, Dong and Shu, Xiangbo and Tang, Jinhui},
journal={arXiv preprint arXiv:2510.01186},
year={2025}
}