When we try to generate timelapse videos from single images, we find that existing generative video models struggle to preserve subject consistency over long time horizons. We introduce a technique called first-order representation alignment (dREPA) that, with limited finetuning data, substantially improves subject consistency across a range of photographic and artistic styles.
Long-range coherence is a key problem in image-to-video generation. Unlike the text-to-video setting, conditioning on an input image imposes the additional challenge of faithfully preserving image content, which often requires a model to understand semantics and geometry in order to avoid unnatural morphing and sudden substitutions. Advancing long-range coherence to hundreds or thousands of frames would unlock new applications in world modeling, simulation, and interactive agents.
Some methods extend temporal duration by stitching short clips together, often autoregressively, while trying to maintain the alignment of subjects and scenes.
A potential alternative would be to generate a fast-forward (i.e., timelapse) version of a long video as a single clip, and then to use inbetweening and interpolation to create finer temporal details. This approach is becoming more feasible thanks to video generation backbones that can produce clips with around 50–90 frames in a single pass.
This raises the question: Can existing image-to-video models generate convincing long-horizon content within a single clip? To study this, we focus on generating timelapse videos, which compress hours or days of change (plant growth, dough proofing, melting, etc.) into seconds. Subject appearance and geometry can evolve substantially during a timelapse video, so success requires the ability to model physically plausible transitions and stable long-horizon dependencies. This makes image-to-timelapse generation a convenient benchmark for improving long-horizon consistency in general.
We show that current models often fail when generating timelapse clips, and we introduce an alignment technique that leads to improvements. Our technique is called first-order representation alignment (dREPA), and it builds on the representation alignment scheme of Yu et al.
There has been relatively little work on generating timelapse videos. The closest we know of is MagicTime
We tested several current image-to-video models and observed two recurring failure modes:
Limited shape dynamics: The generated content is largely static, failing to realize continuous deformations like a blooming flower or rising bread.
Lack of subject consistency: There are sudden appearance shifts, with abrupt, irrelevant substitutions, even in a strong model like Veo3.
These failure cases highlight the core challenge of timelapse generation and, more generally, of long-range video generation: achieving meaningful progression over time while preserving subject identity and scene layout.
We show that we can improve both shape dynamics and subject consistency with our proposed training-time regularization method dREPA. Besides learning better first-order spatio-temporal dynamics, a key advantage of dREPA is that it only applies during training and so does not affect inference time. We find that dREPA is helpful even with limited training data, and that it reduces reliance on detailed text prompts for image-to-video generation, often allowing short and generic text to suffice.
We also extend our framework to allow any-frame conditioning during image-to-video generation, by allowing the input image to be used as a keyframe at any temporal location within a clip. This is critical for timelapse generation because an input image may capture, say, a flower at any moment of the lifecycle that we want the generated timelapse to portray.
Before we train the generation model, we first collect a set of timelapse videos, curated from two data sources. The first is the ChronoMagic-ProH dataset
Our goal is to filter for timelapse-style content where objects undergo gradual, long-horizon changes, such as plant growth or baking processes. We use the following curation pipeline:
Curation pipeline
Content verification. For each candidate video, we sample a five-frame snapshot spanning its duration. We then query a multimodal LLM (Gemini-2.5-pro) to verify that the video actually depicts a timelapse-like process, ensuring that the subject remains consistent while undergoing meaningful change.
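As a rough illustration, the verification step could be sketched as follows; the frame-sampling logic matches the description above, while the prompt wording is ours and query_multimodal_llm is a hypothetical placeholder for the actual Gemini-2.5-pro API call.

import cv2

VERIFICATION_PROMPT = (
    "These five frames span a single video in temporal order. "
    "Does the video depict a timelapse-like process in which the same subject "
    "remains present while undergoing gradual, meaningful change "
    "(e.g., plant growth, dough rising, melting)? Answer YES or NO."
)

def sample_snapshot(video_path: str, num_frames: int = 5):
    """Sample num_frames evenly spaced frames spanning the video's duration."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def query_multimodal_llm(prompt: str, images) -> str:
    """Hypothetical wrapper around a multimodal LLM (e.g., Gemini-2.5-pro).
    Replace with a real API call; should return the model's text response."""
    raise NotImplementedError

def is_timelapse(video_path: str) -> bool:
    frames = sample_snapshot(video_path)
    answer = query_multimodal_llm(VERIFICATION_PROMPT, frames)
    return answer.strip().upper().startswith("YES")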

After preprocessing, we are left with around 3,800 unique timelapse videos. From each, we extract subclips at three different time intervals as data augmentation, yielding around 11k short clips in total for training, each with a paired text caption.
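One plausible reading of this augmentation is sampling each video at several temporal strides, so that the three subclips show the same process at different speeds. A minimal sketch under that assumption (the 49-frame clip length and the strides (1, 2, 4) are our choices, not reported values):

import numpy as np

def extract_subclips(frames: np.ndarray, clip_len: int = 49, strides=(1, 2, 4)):
    """Extract one subclip per temporal stride from a decoded video.

    frames: [T, H, W, 3] array of decoded frames.
    Each stride s covers a span of clip_len * s source frames, so the
    resulting subclips depict the same process at different speeds.
    """
    subclips = []
    for s in strides:
        span = clip_len * s
        if frames.shape[0] < span:
            continue  # video too short for this stride
        start = (frames.shape[0] - span) // 2  # center the sampling window
        subclips.append(frames[start:start + span:s])
    return subclips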
Given the limited amount of domain-specific data, we finetune from a pretrained video generation model, CogVideoX-5B-I2V
In the latent space, the video generation backbone is a diffusion transformer with 42 multimodal transformer blocks, each applying self-attention over a concatenation of video latents and text-prompt latents. During training, the model predicts the velocity $\hat{v}$ of the diffusion process, which is then converted into the clean latents $\hat{x}_0$ for optimization against the ground truth.
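The exact noise schedule is backbone-specific; assuming the standard v-prediction parameterization with a variance-preserving schedule, the conversion from predicted velocity to clean latents is a one-liner:

import torch

def velocity_to_x0(x_t: torch.Tensor, v_hat: torch.Tensor,
                   alpha_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
    """Recover predicted clean latents from a velocity prediction.

    Assumes the v-prediction parameterization with alpha_t**2 + sigma_t**2 = 1:
        x_t = alpha_t * x_0 + sigma_t * eps,   v = alpha_t * eps - sigma_t * x_0,
    which gives x_0 = alpha_t * x_t - sigma_t * v.
    alpha_t and sigma_t must be broadcastable to x_t (e.g., shape [B, 1, 1, 1, 1]).
    """
    return alpha_t * x_t - sigma_t * v_hat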

Extending first-frame to any-frame conditioning allows the model to generate an entire timelapse from a single image taken at any point in the lifecycle, while preserving a sufficient degree of physical morphing across time.
The backbone model implements first-frame conditioning by passing the conditioning image through the same 3D VAE used to encode and decode the video latents. The resulting conditioning latent is then padded with zeros across the remaining 12 latent frames.

To instead achieve conditioning on any frame $k$, we start with a tensor that has the same shape as the pixel-space video clip (49×480×720). We insert the conditioning image as the $k$th frame while masking all other frames with randomly sampled zero-mean Gaussian noise of variance 0.07. (Empirically, we observed that directly masking the remaining frames with zeros leads to poor VAE reconstruction quality.) Note that this approach allows reusing the pretrained 3D VAE without retraining.
During training, we randomly select the $k$th frame (with $k \in \{1, \dots, 49\}$) from a ground-truth video as the conditioning frame, and train the diffusion model to predict the entire video given the conditioning frame and the text prompt. Additionally, we introduce a fallback mechanism: when the selected conditioning frame is the first frame, the remaining latents are zero-padded. This ensures consistency with the original pretraining scheme of CogVideoX-5B-I2V and stabilizes training.
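A minimal sketch of how the pixel-space conditioning clip could be assembled before it is passed through the frozen 3D VAE; the noise variance of 0.07 is the value reported above, while the tensor layout and function name are assumptions (frame index is 0-based here).

import torch

def build_anyframe_pixels(image: torch.Tensor, k: int,
                          num_frames: int = 49,
                          noise_var: float = 0.07) -> torch.Tensor:
    """Assemble the pixel-space conditioning clip for frame index k.

    image: [C, H, W] conditioning image in the VAE's input range.
    Every frame except frame k is filled with zero-mean Gaussian noise of
    variance noise_var; zero-filling was found to hurt VAE reconstruction.
    The result is encoded by the frozen 3D VAE to obtain conditioning latents.
    """
    C, H, W = image.shape
    pixels = torch.randn(num_frames, C, H, W, dtype=image.dtype) * noise_var ** 0.5
    pixels[k] = image
    return pixels

# Fallback: if k == 0 (first-frame conditioning), the original scheme is used
# instead, i.e., encode only the image and zero-pad the remaining latent frames,
# matching the pretraining of CogVideoX-5B-I2V.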
As described in the introduction, a critical limitation of existing video generation models is the lack of intra-clip subject consistency. We hypothesize that this limitation is caused by the per-frame loss that is used for optimizing video diffusion models.
The typical regression-style diffusion loss computes the distance (e.g., $L_1$ or $L_2$) between the predicted clean frames and ground-truth frames at each timestep, and then averages over spatial locations and across all frames. This formulation primarily penalizes zeroth-order, per-frame errors and largely ignores temporal dynamics.
A toy 1D example can intuitively illustrate the issue. Consider three signals (green lines) compared against a ground truth (orange line) over four frames $f_0$ to $f_3$:

Despite their clear differences in temporal behavior, all three signals incur the same per-frame aggregated regression loss. In practice, the shifted linear signal should be preferred because it better preserves the first-order dynamics of the ground truth. This motivates additional regularization that specifically targets first-order spatio-temporal patterns.
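A concrete numerical version of this toy example (the specific values are ours): a constant-shifted copy of a linear ramp and a per-frame perturbed copy can have identical per-frame error while differing sharply in their frame-to-frame differences.

import torch

# Ground truth: a linear ramp over four frames f0..f3.
gt = torch.tensor([0.0, 1.0, 2.0, 3.0])

# Candidate A: the same ramp shifted by a constant offset.
shifted = gt + 0.5
# Candidate B: the ground truth with per-frame perturbations of equal magnitude.
perturbed = gt + torch.tensor([0.5, -0.5, 0.5, -0.5])

per_frame = lambda x: (x - gt).pow(2).mean()                   # zeroth-order loss
first_order = lambda x: (x.diff() - gt.diff()).pow(2).mean()   # loss on frame deltas

print(per_frame(shifted), per_frame(perturbed))      # 0.25 and 0.25 -> indistinguishable
print(first_order(shifted), first_order(perturbed))  # 0.0 vs 1.0 -> shifted ramp preferred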
dREPA: First-Order Alignment via VFMs
To better learn spatio-temporal dynamics, we align the first-order patterns of the model’s latent features with those of features extracted from pretrained Vision Foundation Models (VFMs), inspired by the prior work REPA

To represent the spatio-temporal changes within the VFM’s features, we use the ground-truth pixel-space video as input to the VFM and compute pairwise cosine similarities densely across all pairs of patch features $y_i, y_j$ with $i, j \in \{1, \dots, N\}$ from different frames and spatial locations ($i \neq j$).
For the spatio-temporal pattern of the video generation model, we apply a lightweight MLP that projects the generative model’s latents (from an intermediate DiT block) into the VFM feature resolution. We then compute pairwise cosine similarities across all patch features $h_i, h_j$.
The alignment loss is then computed as the distance between the pairwise patch similarities of the generation model and the VFM features:
\[\mathcal{L}_{\text{dREPA}} = \frac{1}{N(N-1)} \sum_{i \ne j} \big( \langle h_i, h_j \rangle - \langle y_i, y_j \rangle \big)^2\]
where the features $h_i, h_j, y_i, y_j$ are normalized before computing the inner products. During training, we optimize both the MLP and the backbone transformer with the original diffusion loss together with a weighted alignment (dREPA) loss:
\[\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda \cdot \mathcal{L}_{\text{dREPA}}\]
During inference, we detach the MLP, so inference incurs the same cost as the original backbone model.
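As a sketch of how the projector and combined objective fit together: dREPA_loss is the function defined at the end of this post, while the feature dimensions, hidden width, and loss weight below are assumptions rather than reported values.

import torch
import torch.nn as nn

class DREPAProjector(nn.Module):
    """Lightweight MLP mapping intermediate DiT features to the VFM feature dimension."""
    def __init__(self, dit_dim: int = 3072, vfm_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dit_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, vfm_dim),
        )

    def forward(self, dit_feats: torch.Tensor) -> torch.Tensor:
        # dit_feats: [B, f, hw, dit_dim] features from an intermediate DiT block
        return self.mlp(dit_feats)

def training_loss(diffusion_loss: torch.Tensor,
                  dit_feats: torch.Tensor,   # [B, f, hw, dit_dim]
                  vfm_feats: torch.Tensor,   # [B, f, hw, vfm_dim], from the frozen VFM
                  projector: DREPAProjector,
                  lam: float = 0.5) -> torch.Tensor:
    """Combined objective: diffusion loss plus the weighted dREPA alignment loss."""
    h = projector(dit_feats)                          # [B, f, hw, vfm_dim]
    align = dREPA_loss(vfm_feats, h, reshape=False)
    return diffusion_loss + lam * align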
Practical Considerations and Findings
Following prior work
To evaluate the effectiveness of dREPA, we finetuned the same any-frame-conditioned backbone model on our curated timelapse dataset under two settings:
Without dREPA (baseline finetuning)
With dREPA (our proposed method)
Both training settings use identical random seeds and learning hyperparameters. At inference time, we compare sampled videos generated with the same conditioning frame and random seeds from the two models.
When trained without dREPA, the model frequently produces discontinuities in a single generated clip, such as the sudden introduction of new content, as seen in the tulip video. In contrast, finetuning with dREPA leads to improved temporal smoothness and more realistic morphing behavior.
The benefits of dREPA are especially pronounced for out-of-distribution inputs such as paintings. Although the model never observes painting-style timelapse during training, dREPA-regularized models generalize well, producing consistent and realistic blooming sequences from paintings. In this artistic domain, our approach even surpasses several closed-source commercial models (e.g., Google Veo3, Runway Gen4) in terms of subject consistency and faithful preservation of the painting’s texture.
Beyond real photos, the model generalizes well to different artistic styles and requires only a simple text prompt such as “timelapse of flower blooming”. We demonstrate results across watercolor, anime, and oil paintings, showing the benefit of dREPA-regularized training.
Despite these improvements, the model still exhibits some limitations:
It is unstable for complex backgrounds. Since most timelapse training videos have simple backgrounds, conditioning on images with cluttered backgrounds sometimes leads to unstable generations.
It has limited physical accuracy for unfamiliar objects. Due to the small size of the timelapse dataset, unseen objects (e.g., blue hydrangea) are not always generated with correct physical dynamics.
For future work, it may be possible to improve the results with synthetic training data. It may also be useful to extend the conditioning mechanism to multiple keyframes, which would enable users to provide sparse temporal anchors from which the model can generate longer-context video.
Our key findings are:
Existing first-frame-conditioned image-to-video models can easily be extended to any-frame conditioning, which provides greater flexibility by enabling generation from arbitrary points in time.
dREPA regularization can improve subject consistency and the physical realism of temporal dynamics with limited finetuning data.
dREPA generalizes well beyond training data, producing high-quality results from images that are out-of-distribution, such as artistic styles.
Overall, dREPA highlights the benefit of aligning first-order spatio-temporal representations rather than relying solely on zeroth-order, frame-wise alignment. We believe this direction opens the door to more controllable, physically realistic video generation systems, with applications ranging from world modeling to digital art.
PyTorch code for the dREPA loss:
import torch
import torch.nn.functional as F
from torch import Tensor
from einops import rearrange


def dREPA_loss(
    yv: Tensor,            # VFM features: [B, F, C, H, W] if reshape, else [B, f, hw, D]
    h: Tensor,             # generation-model features, same layout as yv
    reshape: bool = True,
) -> Tensor:
    if reshape:
        # Flatten the spatial grid into patch tokens: [B, F, C, H, W] -> [B, f, hw, D]
        yv = rearrange(yv, 'B F C H W -> B F (H W) C')
        h = rearrange(h, 'B F C H W -> B F (H W) C')
    B, f, hw, D = yv.shape
    device = yv.device

    # 1) Normalize features along D so inner products become cosine similarities
    yv_norm = F.normalize(yv, p=2, dim=-1)  # [B, f, hw, D]
    h_norm = F.normalize(h, p=2, dim=-1)

    # 2) Spatial similarities between patches within each frame
    y_spatial = torch.einsum('bfid,bfjd->bfij', yv_norm, yv_norm)
    h_spatial = torch.einsum('bfid,bfjd->bfij', h_norm, h_norm)
    L_spatial = (h_spatial - y_spatial).pow(2).mean()

    # 3) Full cross-frame similarities between all patch pairs
    y_temp = torch.einsum('bfid,bgjd->bfgij', yv_norm, yv_norm)
    h_temp = torch.einsum('bfid,bgjd->bfgij', h_norm, h_norm)
    diff_temp = (h_temp - y_temp).pow(2)

    # 4) Mask out same-frame pairs (keep only pairs from different frames)
    diag = torch.eye(f, dtype=torch.bool, device=device)
    mask = ~diag

    # 5) Expand mask to match diff_temp: [B, f, f, hw, hw]
    mask = mask.view(1, f, f, 1, 1).expand(B, f, f, hw, hw)

    # 6) Apply mask and average over the remaining cross-frame pairs
    L_temporal = diff_temp[mask].mean()

    return L_spatial + L_temporal
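For a quick shape check, the loss can be exercised on random tensors (the dimensions below are arbitrary):

# Sanity check with random features standing in for VFM and projected DiT features.
B, nF, C, H, W = 2, 4, 64, 6, 6
yv = torch.randn(B, nF, C, H, W)   # stand-in for VFM features
h = torch.randn(B, nF, C, H, W)    # stand-in for projected generation-model features
print(dREPA_loss(yv, h, reshape=True))  # scalar tensor >= 0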
PLACEHOLDER FOR BIBTEX