¹HKUST · ²ZODA · ³UCF · ⁴BAAI · ⁵CUHK · ⁶HKUST-GZ
†Corresponding author
Baseline text-guided image-to-video models (here, FramePack) exhibit semantic negligence — failing to realize prompt-specified edits. In (a), the sunflower mentioned in the prompt is entirely missing. In (b), the person remains static instead of climbing onto the tank as instructed. AlignVid corrects both, without any retraining.
Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (e.g., object addition, removal, or modification). Empirically, our analysis reveals that this stems from visual dominance, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose AlignVid, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (ASM) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (GS) to maintain generation stability. To rigorously assess this capability, we present OmitI2V, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead.
AlignVid recalibrates attention inside the model — no fine-tuning, no extra parameters, negligible inference overhead. It has two complementary components:
Attention Scaling Modulation (ASM): rescales query/key representations by a single coefficient γ (default 1.35) to sharpen the attention energy landscape, lowering entropy and amplifying text tokens over visual priors.
Guidance Scheduling (GS): applies ASM selectively across foreground-sensitive transformer blocks and early denoising steps, stabilizing generation and limiting visual-quality degradation.
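The two components can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes γ is applied to the query only (so every pre-softmax logit is multiplied by γ), and the names `attention_weights`, `scheduled_gamma`, `target_blocks`, and `last_step` are hypothetical, with placeholder values for the block set and step cutoff.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, gamma=1.0):
    """Attention weights of one query over all keys.

    Assumption: gamma scales the query only, so every pre-softmax logit
    is multiplied by gamma (equivalent to a softmax temperature of 1/gamma).
    """
    dim = len(query)
    scaled_query = [gamma * q for q in query]
    logits = [
        sum(q * k for q, k in zip(scaled_query, key)) / math.sqrt(dim)
        for key in keys
    ]
    return softmax(logits)


def scheduled_gamma(block_idx, step_idx, gamma=1.35,
                    target_blocks=frozenset({0, 1, 2}), last_step=10):
    """Return gamma inside the scheduled window, else 1.0 (no scaling).

    target_blocks and last_step are placeholders, not values from the paper.
    """
    in_window = block_idx in target_blocks and step_idx < last_step
    return gamma if in_window else 1.0


# Toy example: one query attending over three keys.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Block 7 is outside the assumed window, so attention is left untouched.
baseline = attention_weights(query, keys, gamma=scheduled_gamma(7, 0))
# Block 1 at step 0 is inside the window, so gamma = 1.35 is applied.
sharpened = attention_weights(query, keys, gamma=scheduled_gamma(1, 0))

print("baseline :", [round(w, 3) for w in baseline])
print("sharpened:", [round(w, 3) for w in sharpened])
```

In this toy case the scaled call puts more weight on the best-matching key, which is the sharpening behavior described above. Returning 1.0 outside the window keeps all other blocks and later denoising steps identical to the unmodified model.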
Unlike input-level perturbations such as blurring, which visibly corrupt the reference image, AlignVid reallocates attention entirely within the model, giving a tunable trade-off between semantic fidelity and visual quality. The default coefficient γ = 1.35 transfers across backbones without per-model search (analogous to choosing CFG strength).
Videos and attention maps generated from the original input image (top) and from the same image after Gaussian blur (bottom). In the original setting the model exhibits visual dominance — excessive focus on the reference image suppresses text constraints and temporal dynamics. Blurring weakens this dominance, shifting attention toward text tokens and temporal neighbors, increasing attention scores, and lowering entropy across all modalities. AlignVid mimics this sharpening effect internally, with no input corruption.
ASM sharpens attention (lower entropy), boosts focus on text tokens, and suppresses the dominant image regions — exactly the effect blurring achieves at the input, but realized inside the model.
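The two quantities discussed here, attention entropy and the share of attention landing on text tokens, can be measured on a toy attention row. The sketch below is illustrative only: the logits are made up, `text_mass` is a hypothetical helper, and the toy assumes each text token has a higher logit than each image token while image tokens dominate by sheer count. Sharpening favors whichever tokens already score highest, so the gain in text mass depends on that assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def text_mass(probs, is_text):
    """Total attention weight assigned to tokens flagged as text."""
    return sum(p for p, flag in zip(probs, is_text) if flag)


# Toy row: 2 text tokens with logit 2.0, 20 image tokens with logit 1.0.
# Image tokens win on count, which loosely mimics attention dispersion.
logits = [2.0] * 2 + [1.0] * 20
is_text = [True] * 2 + [False] * 20
gamma = 1.35  # default coefficient quoted in the text

base = softmax(logits)
sharp = softmax([gamma * x for x in logits])  # logits scaled by gamma

print(f"entropy   {entropy(base):.3f} -> {entropy(sharp):.3f}")
print(f"text mass {text_mass(base, is_text):.3f} -> {text_mass(sharp, is_text):.3f}")
```

Under these assumptions the scaled row has lower entropy and a larger text share than the unscaled one.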
Baseline vs. AlignVid (ours) across three text-guided image-to-video backbones.
OmitI2V is a new benchmark of 367 human-annotated samples spanning modification, addition, and deletion scenarios. Each sample ships with VQA-style yes/no questions for fine-grained edit compliance.
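One way to turn per-sample yes/no questions into a compliance score is sketched below. The schema (`YesNoQuestion`, `EditSample`), the example prompt and questions, and the aggregation rule (fraction of answers matching the expected answer, then a per-category mean) are all assumptions made for illustration; the benchmark's actual data format and metric may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YesNoQuestion:
    text: str
    expected: bool  # answer that indicates the edit was realized


@dataclass(frozen=True)
class EditSample:
    category: str  # "modification", "addition", or "deletion"
    prompt: str
    questions: tuple  # tuple of YesNoQuestion


def compliance(sample, answers):
    """Fraction of VQA answers that match the expected answers."""
    if not sample.questions:
        raise ValueError("sample has no questions")
    if len(answers) != len(sample.questions):
        raise ValueError("one answer is required per question")
    hits = sum(1 for q, a in zip(sample.questions, answers) if q.expected == a)
    return hits / len(sample.questions)


def mean_by_category(scored):
    """Average compliance per category from (sample, score) pairs."""
    buckets = {}
    for sample, score in scored:
        buckets.setdefault(sample.category, []).append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


# Hypothetical sample and VQA answers, not taken from the benchmark.
sample = EditSample(
    category="addition",
    prompt="A sunflower grows next to the vase.",
    questions=(
        YesNoQuestion("Does a sunflower appear in the video?", True),
        YesNoQuestion("Is the vase still present?", True),
    ),
)
answers = [True, False]  # e.g. produced by a VQA model on the generated video
score = compliance(sample, answers)
print(score, mean_by_category([(sample, score)]))
```

Keeping the expected answer on each question lets deletion samples use questions whose correct answer is "no" without changing the scoring code.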
Sample distribution across categories and domains.
An OmitI2V sample, with its reference image and prompt-specified edit.
@article{liu2025alignvid,
title = {AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation},
author = {Liu, Yexin and Shu, Wen-Jie and Huang, Zile and Zheng, Haoze and Wang, Yueze and Zhang, Manyuan and Lim, Ser-Nam and Yang, Harry},
journal = {arXiv preprint arXiv:2512.01334},
year = {2025}
}