Cross-view completion (CroCo) has proven effective as pre-training for geometric downstream tasks such as stereo depth, optical flow, and point cloud prediction. In this paper we show that it also learns photometric understanding, because its training pairs differ in illumination. We propose a method to disentangle CroCo latent representations into a single latent vector representing illumination and patch-wise latent vectors representing intrinsic properties of the scene. To do so, we use self-supervised cross-lighting and intrinsic consistency losses on a dataset two orders of magnitude smaller than that used to train CroCo, comprising pixel-wise aligned image pairs captured under different illumination. We further show that the lighting latent can be manipulated for tasks such as interpolation between lighting conditions, shadow removal, and albedo estimation. This clearly demonstrates the feasibility of using cross-view completion as pre-training for photometric downstream tasks where training data is more limited.
Left: data flow through CroCo. Right: the relighting that we hypothesise CroCo must implicitly perform. To predict a masked patch (?), the target illumination must be estimated from unmasked patches (green). Patches containing the same scene content (blue) must be delit using the source lighting estimated from the second view (purple) and relit (orange) using the estimated illumination.
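The cross-lighting and intrinsic consistency losses can be sketched with toy stand-ins. Everything below — the latent dimensions, the `disentangle`/`entangle` functions, and the mean-pooling used for the lighting vector — is a hypothetical illustration of the idea, not the actual CroCoDiLight architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen purely for illustration.
D = 16          # latent dimension
N_PATCHES = 4   # patches per image

def disentangle(patch_latents):
    """Toy stand-in: split patch latents into per-patch intrinsic
    vectors and a single pooled lighting vector for the image."""
    lighting = patch_latents.mean(axis=0)
    intrinsic = patch_latents - lighting
    return intrinsic, lighting

def entangle(intrinsic, lighting):
    """Toy inverse: recombine intrinsics with a (possibly swapped)
    lighting latent to produce relit patch latents."""
    return intrinsic + lighting

# Two pixel-wise aligned views of the same scene under different lighting.
view_a = rng.normal(size=(N_PATCHES, D))
view_b = rng.normal(size=(N_PATCHES, D))

intr_a, light_a = disentangle(view_a)
intr_b, light_b = disentangle(view_b)

# Cross-lighting: relight view A's intrinsics with view B's lighting;
# a reconstruction loss would compare the decoded result to view B.
relit_a = entangle(intr_a, light_b)

# Intrinsic consistency: aligned views should share intrinsics, so the
# difference between the two intrinsic codes can be penalised directly.
intrinsic_loss = np.mean((intr_a - intr_b) ** 2)
```

In this toy version, recombining an image's own intrinsics and lighting exactly recovers its latents, mirroring the requirement that disentanglement lose no information.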
For shadow removal and albedo estimation, we train components S and A to learn transformations in the lighting latent space that map the input latent to the desired output latent, derived from the ground-truth output image.
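As a rough sketch of learning such a latent-space transformation, the toy example below fits a hypothetical component `S` to map input lighting latents to target latents with plain gradient descent on an L2 loss. The linear form of `S`, the latent dimension, and the synthetic data are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # hypothetical lighting-latent dimension

# Synthetic training pairs: lighting latents of shadowed inputs and the
# latents derived from the corresponding ground-truth shadow-free images.
true_map = rng.normal(size=(D, D)) * 0.1
inputs = rng.normal(size=(256, D))
targets = inputs @ true_map

# Component S modelled as a linear map, fit by gradient descent on the
# mean squared error between predicted and target lighting latents.
S = np.zeros((D, D))
lr = 0.05
for _ in range(500):
    pred = inputs @ S
    grad = inputs.T @ (pred - targets) / len(inputs)
    S -= lr * grad

final_loss = np.mean((inputs @ S - targets) ** 2)
```

At inference, applying the fitted `S` to a new input latent yields the latent of the desired shadow-free output, which a decoder would then turn back into an image.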
Drag the divider to compare input (left) with predicted albedo (right)
IIW (597)
IIW (2504)
IIW (10291)
IIW (104664)
IIW (117618)
IIW (118411)
Here we show outputs of the shadow removal component S, which transforms the lighting latents of input images to remove shadows. S was jointly trained on the SRD, ISTD+, and WSRD+ datasets to produce a single general-purpose shadow removal model, which we compare against other major shadow removal methods. Although our model was trained once across multiple datasets, it achieves comparatively strong shadow removal even against models fine-tuned on specific datasets.
Drag the divider to compare input (left) with shadow-free output (right)
ISTD+ (113-2)
ISTD+ (116-11)
SRD (MG_6507)
WSRD+ (0006)
WSRD+ (0009)
WSRD+ (0051)
Each pair of rows shows the method outputs (top) and a signed difference heatmap against the ground truth (bottom).
| Input / GT | CroCoDiLight (Ours) | StableShadowRemoval | OmniSR | HomoFormer |
|---|---|---|---|---|

(Qualitative comparison images omitted.)
Intrinsic patches from a single frame are kept fixed and relit with the lighting latents from subsequent frames. While the clock hands move in the original timelapse, the frozen intrinsics keep them static in the relit frames while the shadows change. Original timelapse from this video.
The lighting latent is extracted from one reference frame and applied to all other frames. Each frame retains its own intrinsic content (geometry, materials) but is relit to match the reference lighting. Original timelapse from this video.
Reference
@inproceedings{foggin2026crocodilight,
title={{CroCoDiLight}: Repurposing Cross-View Completion Encoders for Relighting},
author={Foggin, Alistair J and Smith, William A P},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=GKvb3HCyNk}
}