A self-supervised autoencoder trained without reconstruction loss.
The core idea is simple: the representation of the output should look like the representation of the input.
This project began with one question:
Can an autoencoder learn to reconstruct images without any reconstruction loss at all?
Autoencoders are usually trained the obvious way: reconstruct the image, compare it to the original, punish the error. In practice, people often add extra supervision on top to improve fidelity, whether that is pixel losses, feature-space losses like LPIPS, or adversarial setups with a discriminator.
Here I wanted to try the opposite approach. Can the model teach itself to reconstruct without any image-space supervision at all, using only its own latent agreement as the training signal?
TL;DR: yes!
An autoencoder is just an encoder and a decoder. The encoder compresses an image into a latent representation, and the decoder maps that latent back into image space. In ordinary autoencoders, the interesting question is not whether this loop can reconstruct, but which direct supervision makes the reconstructions good.
In the standard setup, the forward pass is
\[
z = E(x), \qquad \hat{x} = D(z).
\]
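What changes is the objective. Usually it is a direct comparison between the reconstruction and the original image: a pixel loss, often with a perceptual term such as LPIPS or an adversarial term on top.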
This post removes all of that. The image goes in, the decoder produces an image, and after that the only question is whether the model internally treats the output as matching the input.
So the real proposal is not a new architecture. It is a new source of pressure on an otherwise ordinary encoder-decoder.
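Formally, the model is still just \(z = E(x)\) and \(\hat{x} = D(z)\). The core self-teaching objective is latent agreement between the input and the reconstruction under a transformation, plus a SIGReg term on the latents.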
That is the whole game. SIGReg prevents the trivial constant-code collapse by pushing latents toward a Gaussian distribution (see this research note). Everything else has to emerge from latent consistency under transformations.
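One way to write that objective down, reconstructed from the description in this post rather than copied from the training code. It assumes a transformation \(T\), drawn from some family \(\mathcal{T}\), is applied identically to the input and to the reconstruction. The weight \(\lambda\), the choice of norm, and the exact form of the SIGReg term are left unspecified:

\[
\mathcal{L}(x) = \mathbb{E}_{T \sim \mathcal{T}} \, \big\lVert E(T(x)) - E(T(\hat{x})) \big\rVert^2 + \lambda \, \mathrm{SIGReg}\big(E(\cdot)\big), \qquad \hat{x} = D(E(x)).
\]

Setting \(T\) to the identity reduces the first term to a plain comparison of \(E(x)\) with \(E(D(E(x)))\).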
At first glance, the transformation looks like an odd extra. Why not just require \(E(x)\) and \(E(\hat{x})\) to match directly? If the decoder output lands at the same latent point as the input, shouldn’t that already mean it has reconstructed the image?
Without transformations, the objective barely constrains the decoder. The untransformed loss \(\lVert E(x) - E(D(E(x))) \rVert\) only requires the decoder output to land somewhere the encoder treats as equivalent to the input. It does not require a faithful reconstruction.
Without transformations, the model does not need to reconstruct the world. It only needs to reconstruct something the encoder recognises as one of its own.
That is the private-language loophole. The encoder and decoder can collude on codes that survive re-encoding without corresponding to real images. SIGReg stops the degenerate constant solution, but it does not stop this collusion.
The cleanest way to formalise this is with equivalence classes. The encoder partitions images into sets it treats as the same, so without transformations the decoder only has to produce something in the same class as the input.
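Without transformations, the relevant set is
\[
S(x) = \{\, x' : E(x') = E(x) \,\}
\]
and that set can be enormous. With transformations, the requirement becomes
\[
\hat{x} \in \bigcap_{T \in \mathcal{T}} \{\, x' : E(T(x')) = E(T(x)) \,\},
\]
where \(T\) ranges over the family of transformations \(\mathcal{T}\).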
Each transformation checks a different aspect of the image, so the intersection gets tighter. In the ideal case, the only thing that keeps passing all those checks is the real image itself. That is why transformations are necessary: they turn latent agreement from an easily hacked objective into meaningful pressure toward faithful reconstruction. Once I started thinking about it this way, the experimental question became much clearer: which transformations tighten the right constraints without pushing the corrupted view too far off the natural image distribution?
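A toy illustration of the loophole, not the model from this post: the hypothetical `encode` below keeps only the mean intensity, and the hypothetical `decode` paints a flat image with that mean. The untransformed latent gap is essentially zero even though the output looks nothing like the input, while comparing latents of matching crops exposes the mismatch. The 8x8 ramp image and the four quadrant crops are assumptions chosen to keep the arithmetic easy to follow.

```python
def encode(img):
    """Toy encoder (assumption): the latent is the mean pixel intensity."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def decode(z, height, width):
    """Toy colluding decoder (assumption): a flat image whose mean is z."""
    return [[z] * width for _ in range(height)]


def crop(img, top, left, size):
    """Square crop, standing in for one transformation T."""
    return [row[left:left + size] for row in img[top:top + size]]


# 8x8 intensity ramp in [0, 1]; clearly not a flat image.
x = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
x_hat = decode(encode(x), 8, 8)

# Untransformed check: |E(x) - E(D(E(x)))|.
plain_gap = abs(encode(x) - encode(x_hat))

# Transformed check: worst latent gap over the four 4x4 quadrants.
quadrants = [(0, 0), (0, 4), (4, 0), (4, 4)]
transformed_gap = max(
    abs(encode(crop(x, t, l, 4)) - encode(crop(x_hat, t, l, 4)))
    for t, l in quadrants
)

print(f"plain gap: {plain_gap:.2e}, transformed gap: {transformed_gap:.3f}")
```

The flat image passes the untransformed check but fails as soon as any quadrant is inspected on its own, which is the sense in which each extra transformation shrinks the set of acceptable outputs.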
At this point the transformation is the method, so the next step was really a refinement process. Start on toy data, scale up to CIFAR-10, see what stops working, and keep only the variants that survive that transition.
The lesson from the toy setting was that sparse pixel masking works when structure is almost all that matters. The lesson from CIFAR-10 was that this is not enough for natural images, where colour and local texture also need to be constrained.
On grayscale shapes over black backgrounds, sparse pixel masking worked well. It preserved enough local structure for the model to recover the shape without introducing obvious artefacts, so in that regime the basic transformed objective already looked clean.
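A minimal sketch of sparse pixel masking on a toy grayscale image, assuming masked pixels are set to zero and each pixel is kept independently with probability `keep_prob`. The function name, the zero fill value, the keep rate, and the idea of sharing one mask between input and reconstruction are illustrative assumptions; the post does not specify them.

```python
import random


def sparse_pixel_mask(img, keep_prob, seed):
    """Zero out each pixel independently with probability 1 - keep_prob.

    A fixed seed makes it possible to apply the identical mask to an input
    and to its reconstruction (an assumption about how views are paired).
    """
    rng = random.Random(seed)
    return [
        [p if rng.random() < keep_prob else 0.0 for p in row]
        for row in img
    ]


# Toy 4x4 grayscale "shape" on a black background.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Stand-in for a decoder output: slightly dimmer than the input.
reconstruction = [[0.9 * p for p in row] for row in image]

masked_input = sparse_pixel_mask(image, keep_prob=0.75, seed=0)
masked_recon = sparse_pixel_mask(reconstruction, keep_prob=0.75, seed=0)
```

Because both calls use the same seed, the same pixel positions survive in both views, so any remaining latent difference comes from the reconstruction rather than from the mask.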
But that only solved the toy problem. It showed that the idea could work when structure is almost all that matters.
CIFAR-10 exposed the limitation immediately. Pixel masking did not fail completely, but it asked the wrong question for natural images. The model mostly learned luminance and coarse layout, while colour and local texture remained underconstrained.
In practice, the grayscale shape of the scene was often roughly right while colour remained weak or washed out. So the synthetic result was informative, but it did not scale.
Crop-resize was the first version that really worked on CIFAR-10. Small natural crops force agreement on colour and local texture as well as coarse shape, so they constrain exactly the parts that pixel masking was missing. It was the first latent-consistency variant that produced reconstructions with both sensible structure and sensible colour.
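A minimal sketch of a crop-resize view, assuming it means cutting a window out of the image and scaling it back to the original resolution. Nearest-neighbour resampling, the square window, and the function name are assumptions made to keep the example dependency-free; the post does not say which interpolation or crop sizes were used.

```python
def crop_resize(img, top, left, size, out_size):
    """Crop a size x size window and resize it to out_size x out_size.

    Uses nearest-neighbour sampling (an assumption, chosen for simplicity).
    """
    window = [row[left:left + size] for row in img[top:top + size]]
    return [
        [window[(r * size) // out_size][(c * size) // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]


# Toy 4x4 single-channel image with a distinct value per pixel.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Zoom into the bottom-right 2x2 corner and scale it back to 4x4.
view = crop_resize(image, top=2, left=2, size=2, out_size=4)
```

A small window like this fills the whole view with local detail, which is one plausible reason crops put pressure on colour and texture that a global view leaves loose.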
The MSE baseline is still slightly better here, but that was not the point. Crop-resize made the approach look plausible on real images, so it became the first version worth scaling further.
But crop-resize still left a second loophole. Once reconstructions became decent, the encoder could become invariant to the decoder's remaining artefacts instead of forcing the decoder to fix them. The important method change was therefore not a new corruption by itself, but a new judge: freeze the encoder during the comparison step so the decoder has to absorb more of the mismatch.
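A scalar toy sketch of that idea, under the assumption that freezing means no gradient reaches the encoder through the comparison, so only the decoder is updated. The linear encoder weight `e`, decoder weight `d`, learning rate, and step count are all hypothetical; the real models are networks and the actual freezing schedule is not described here.

```python
def latent_gap_loss(e, d, x):
    """Squared latent gap between x and its reconstruction d * (e * x)."""
    target = e * x            # latent of the input
    judged = e * (d * e * x)  # latent of the reconstruction
    return (target - judged) ** 2


def decoder_grad(e, d, x):
    """Derivative of the loss w.r.t. d, with e treated as a constant."""
    return -2.0 * e * (e * x) ** 2 * (1.0 - d * e)


e, d, x = 2.0, 0.1, 1.0  # frozen encoder weight, initial decoder weight, input
lr = 0.01

for _ in range(200):
    d -= lr * decoder_grad(e, d, x)  # only the decoder moves

reconstruction = d * e * x
```

With the judge held fixed, the only way to shrink the gap in this toy is for the decoder to invert the encoder, so the reconstruction converges to the input instead of the encoder learning to ignore the error.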
On CIFAR-10, step-freezing worked much better.
Within that step-frozen setup, crop-resize was still the strongest transformation overall. I also tried several other transformations in the same setup, including a small detail-sensitive stack plus a plain full-view term, but the main takeaway was unchanged: step-freezing improved the method, and crop-resize remained the most reliable corruption.
Once the method looked plausible on CIFAR-10, the next question was whether the step-frozen crop-resize variant would survive on a more realistic dataset. On CelebA, that naturally split into two settings: ordinary autoencoding, where the input still determines the target, and masked autoencoding, where the target is only partially observed.
In ordinary autoencoding, the job is still basically compression: preserve enough information in the latent to reconstruct the image. In masked autoencoding, that is no longer true. If the masked observation is \(y = m \odot x\), then the visible pixels do not determine a unique target, so the problem becomes conditional prediction rather than compression.
Formally, the conditional distribution \(p(x \mid y,m)\) has nonzero entropy, and under squared error the optimal predictor is the conditional mean
\[
\hat{x}^{\star}(y, m) = \mathbb{E}\,[\, x \mid y, m \,].
\]
That is exactly why ambiguous regions become blurry under MSE: different plausible completions get averaged together. This is the point where the setup starts to look much closer to leJEPA. The masked input is being asked to land at the same latent target as the clean input, while the decoder, viewed as just another internal layer, is still being constrained onto the dataset distribution.
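A small numeric check of the averaging effect under squared error, assuming a single ambiguous pixel with two equally likely completions (0.0 and 1.0). The values and the grid search are illustrative only.

```python
def expected_squared_error(prediction, outcomes):
    """Mean squared error of one prediction against equally likely outcomes."""
    return sum((prediction - o) ** 2 for o in outcomes) / len(outcomes)


# Two equally plausible completions for one masked pixel: dark or bright.
completions = [0.0, 1.0]

# Search candidate predictions 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: expected_squared_error(p, completions))

conditional_mean = sum(completions) / len(completions)
```

The minimiser is the mid-grey value 0.5, which matches the conditional mean and is a value that neither plausible completion actually takes.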
So leAutoencoder compares both the clean reconstruction \(\hat{x}(x)\) and the masked reconstruction \(\hat{x}(y)\) against the same transformed clean view. The masked branch is pulled toward the clean latent target, while the decoded branch is still pressured to stay on the image manifold.
The practical question, then, is how the same method behaves in those two settings on CelebA: first in ordinary autoencoding, then in masked autoencoding.
First, the easier setting: ordinary autoencoding at roughly 6x compression. Here the question is just whether the step-frozen crop-resize method still reconstructs cleanly once the data are more structured and higher resolution.
This mostly confirmed the CIFAR story. The step-frozen crop-resize variant remained stable enough to push into the genuinely harder masked setting.
Here I tried two much harder masked-autoencoding variants. With latent size 128, the model has about 11M parameters and the bottleneck is roughly 400x compressed. With latent size 512, it has about 36M parameters and the bottleneck is still roughly 96x compressed. So this is a very hard conditional reconstruction problem in both cases.
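The quoted ratios are consistent with 128x128 RGB inputs, although that resolution is an assumption here since the post does not state it. A quick check of the arithmetic, counting values rather than bits:

```python
def compression_ratio(height, width, channels, latent_size):
    """Number of input values divided by number of latent values."""
    return (height * width * channels) / latent_size


# Assumed input resolution (not stated in the post): 128x128 RGB.
H, W, C = 128, 128, 3

ratio_128 = compression_ratio(H, W, C, latent_size=128)  # 49152 / 128
ratio_512 = compression_ratio(H, W, C, latent_size=512)  # 49152 / 512
```

Under that assumption the ratios come out to 384x for latent size 128, in line with the "roughly 400x" figure, and 96x for latent size 512.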
For the masking task I also switched to global pooling in the latent pathway. Once the input is only partially observed, forcing the representation through a globally pooled bottleneck made more sense than keeping a more spatially local code.
One important caveat is that the baseline autoencoder is trained directly from masked input to clean output, so its encoder is not really trained to handle clean inputs at all.
The results are not a clear victory for our method, but it is competitive. The fact that it also handles clean, unmasked inputs natively is a bonus. It also seems to handle hair a bit better when it covers the face (latent size 512, columns 7 and 8). Unlike the baseline, it never creates isolated strands of hair. Still, I would not say it fully resolves the issue of averaging instead of choosing a mode. That may also be partly due to architectural constraints.
The main point is that this is basically a leJEPA-style objective with one extra ingredient: an internal layer is constrained to stay on the input distribution. That matters because JEPA-style methods usually do not come with a decoder for free, and adding one later often damages the representation by forcing reconstruction into a latent space that was never meant to be invertible in the first place. I wrote more about that trade-off in this post on leJEPA and SIGReg. Here, the latent objective and the image-distribution constraint are learned together rather than bolted together afterward.
The other important point is why this behaves differently from a plain MSE autoencoder. It is not chasing weak pixel averages. It is chasing an abstract representation of what the image should contain under a family of transformations. That changes the failure mode. When it misses, in theory it is more likely to hallucinate something plausible than to wash everything into a blurry average, because the objective is organised around semantic agreement rather than direct pixel matching. The fact that those semantics are learned during training, without requiring a pretrained model, is the cherry on the cake.