IMAGEdit: Let Any Subject Transform

Fei Shen¹, Weihao Xu², Rui Yan², Dong Zhang³, Xiangbo Shu², Jinhui Tang²,⁴

¹National University of Singapore ²Nanjing University of Science and Technology

³Hong Kong University of Science and Technology ⁴Nanjing Forestry University

arXiv Code Dataset

TL;DR: IMAGEdit is a training-free and plug-and-play framework that aligns prompts and retargets masks to enable any-subject video editing.

Abstract

We present IMAGEdit, a training-free framework for video editing with any number of subjects that changes designated categories. IMAGEdit provides robust multimodal conditioning and precise mask motion sequences through two key components: a prompt-guided multimodal alignment module and a prior-based mask retargeting module. By leveraging the understanding and generation capabilities of large pretrained models, these components produce aligned multimodal signals and time-consistent masks that remedy insufficient prompt-side conditioning and overcome mask boundary entanglement in crowded scenes. The framework then conditions a pretrained mask-driven video generator to synthesize the edited video. IMAGEdit is plug-and-play with a wide range of mask-driven backbones and consistently improves overall performance. Extensive experiments on the new multi-subject benchmark MSVBench verify that IMAGEdit surpasses state-of-the-art methods.

How does it work?

IMAGEdit is a training-free, plug-and-play framework that combines prompt-guided multimodal alignment with prior-based mask retargeting to enable any-subject video editing. It first grounds user prompts with strong textual–visual alignment, then refines instance masks using depth and temporal priors to ensure smooth and consistent motion boundaries. These enhanced conditions are fed into a pretrained mask-driven video generator, producing subject-accurate, temporally coherent, and background-preserving editing results without the need for additional training.
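To make the idea of refining instance masks with depth and temporal priors more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `resolve_overlaps` and `temporal_majority`, the "smaller depth means nearer" convention, the median-depth rule, and the three-frame voting window are all assumptions chosen for illustration. The first function assigns each contested pixel to the nearest overlapping instance; the second removes frame-to-frame flicker with a majority vote.

```python
from statistics import median


def resolve_overlaps(masks, depth):
    """Untangle overlapping instance masks in one frame (hypothetical helper).

    masks: list of N masks, each an H x W grid of booleans.
    depth: H x W grid of floats; smaller is assumed to mean nearer.
    Each pixel claimed by several instances goes to the instance whose
    median depth is smallest.
    """
    h, w = len(depth), len(depth[0])
    med = []
    for m in masks:
        vals = [depth[y][x] for y in range(h) for x in range(w) if m[y][x]]
        med.append(median(vals) if vals else float("inf"))
    out = [[[False] * w for _ in range(h)] for _ in masks]
    for y in range(h):
        for x in range(w):
            owners = [i for i, m in enumerate(masks) if m[y][x]]
            if owners:
                nearest = min(owners, key=lambda i: med[i])
                out[nearest][y][x] = True
    return out


def temporal_majority(frames, window=3):
    """Smooth one instance's masks over time (hypothetical helper).

    frames: list of T masks (H x W grids of booleans) for a single instance.
    A pixel is kept in frame i if at least half of the frames in the
    window centred on i contain it; ties keep the pixel.
    """
    t = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    half = window // 2
    out = []
    for i in range(t):
        lo, hi = max(0, i - half), min(t, i + half + 1)
        out.append([
            [2 * sum(frames[k][y][x] for k in range(lo, hi)) >= hi - lo
             for x in range(w)]
            for y in range(h)
        ])
    return out
```

In this sketch the two steps are independent: overlaps are resolved per frame first, and each instance's mask sequence is then smoothed over time before being passed to a mask-driven generator.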

Comparison with Other Video Editing Methods

Panels for each clip: Original, FateZero, TokenFlow, VideoPainter, VideoGrain, DMT, IMAGEdit.

Two [Puppies -> Kittens] together on a weighing scale.

One black sheepdog herding four [Ducks -> Robot Ducks] on green field.

Four [Female Runners -> Spider-Men] sprint on track.

Seven [Kabaddi Players -> Gokus] facing each other on a purple mat in stadium lighting.

[Players -> Spider-Men] on trampolines throwing dodgeballs during an intense match.

[Relay Runners -> Astronauts] sprinting on track.

[Volleyball Players -> Robots] compete on indoor court with net.
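The captions in this section mark each edit with a bracketed `[source -> target]` pair. As a small illustration of how such a caption could be split into a source prompt and a target prompt, here is a sketch in Python. The helper name `split_caption` and the regular expression are assumptions made for this example; the page does not say how IMAGEdit parses its prompts.

```python
import re

# Matches "[source -> target]" and captures both sides (assumed notation).
EDIT = re.compile(r"\[\s*(.+?)\s*->\s*(.+?)\s*\]")


def split_caption(caption):
    """Return (source_prompt, target_prompt, pairs) for a bracketed caption."""
    # Replace every bracketed edit with its left side, then its right side.
    source = EDIT.sub(lambda m: m.group(1), caption)
    target = EDIT.sub(lambda m: m.group(2), caption)
    # List of (source subject, target subject) tuples, in order of appearance.
    pairs = EDIT.findall(caption)
    return source, target, pairs


source, target, pairs = split_caption(
    "Two [Puppies -> Kittens] together on a weighing scale."
)
print(source)  # Two Puppies together on a weighing scale.
print(target)  # Two Kittens together on a weighing scale.
print(pairs)   # [('Puppies', 'Kittens')]
```

Hyphenated targets such as "Spider-Men" are handled because the pattern only splits on the two-character arrow `->`.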

Main Video Results

Three [People -> Super Mario] sitting in car backseat.

Four [People -> Robots] standing on football court.

Four [Hungry Dogs -> Robot Wolves] surrounding a bowl of food outdoors.

A group of [People -> Astronauts] practicing boxing in a fitness studio.

A team of [Men -> Spider-Men] rowing together on a river.

Eight [Hurdlers -> Iron Men] leap mid-race over purple hurdles.

Multi-Scenario Applications

Autumn Forest -> Winter Forest

Snowy Forest -> Lunar Surface

The Eiffel Tower -> The Space Needle

Glasses -> Sunglasses

Left -> Ultraman; Right -> Robot

Left -> Gorilla; Right -> Polar Bear

Left -> Lightning McQueen; Right -> Yellow Cartoon Porsche

Two People (arm wrestling) -> Two Supermen

Turn 1: Horse Riders -> Gokus

Turn 2: The two above (Gokus -> Iron Men)

Add Glasses

Change Face To "Kevin Durant"

Change Face To "LeBron James"

Remove Glasses

Plaid Shirt -> Business Suit

Plaid Shirt -> Hawaiian Shirt

Disclaimer

We make this project openly available to the research community. Most of the datasets and resources involved are either generated by us or obtained under proper licenses. If you believe any material infringes your rights, please contact us, and we will take immediate action to remove or replace the content. Any components derived from third-party models or datasets must strictly follow their original license agreements. This work is intended to promote academic exploration and innovation in the area of controllable generation. Users are welcome to experiment with the released code and models under the condition that they act in accordance with local regulations and use them responsibly. The authors and contributors disclaim any liability for potential misuse or unintended applications of this tool.

BibTeX

@article{shen2025imagedit,
  title={IMAGEdit: Let Any Subject Transform},
  author={Shen, Fei and Xu, Weihao and Yan, Rui and Zhang, Dong and Shu, Xiangbo and Tang, Jinhui},
  journal={arXiv preprint arXiv:2510.01186},
  year={2025}
}
