Self-Teaching Autoencoder

A self-supervised autoencoder trained without reconstruction loss.

The core idea is simple: the representation of the output should look like the representation of the input.

This project began with one question:

Can an autoencoder learn to reconstruct images without any reconstruction loss at all?

Autoencoders are usually trained the obvious way: reconstruct the image, compare it to the original, punish the error. In practice, people often add extra supervision on top to improve fidelity, whether that is pixel losses, feature-space losses like LPIPS, or adversarial setups with a discriminator.

Here I wanted to try the opposite approach. Can the model teach itself to reconstruct without any image-space supervision at all, using only its own latent agreement as the training signal?

TL;DR: yes!


01 — Setup

What is being removed?

An autoencoder is just an encoder and a decoder. The encoder compresses an image into a latent representation, and the decoder maps that latent back into image space. In ordinary autoencoders, the interesting question is not whether this loop can reconstruct, but which direct supervision makes the reconstructions good.

In the standard setup, the forward pass is

\[ z = E(x), \qquad \hat{x} = D(z) \]

What changes is the objective. Usually it looks something like

\[ \mathcal{L}_{\mathrm{AE}} = \mathcal{L}_{\mathrm{pixel}}(x,\hat{x}) + \lambda_{\mathrm{perc}} \, \mathcal{L}_{\mathrm{perc}}(x,\hat{x}) + \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}}(\hat{x}) \]

This post removes all of that. The image goes in, the decoder produces an image, and after that the only question is whether the model internally treats the output as matching the input.

So the real proposal is not a new architecture. It is a new source of pressure on an otherwise ordinary encoder-decoder.

02 — Core Objective

Self-Teaching Autoencoder

Formally, the model is still just \(z = E(x)\) and \(\hat{x} = D(z)\). The core self-teaching objective is

\[ \mathcal{L} = \mathbb{E}_T \left[\lVert E(T(x)) - E(T(\hat{x})) \rVert\right] + \mathcal{L}_{\mathrm{SIGReg}} \]

That is the whole game. SIGReg prevents the trivial constant-code collapse by pushing latents toward a Gaussian distribution (see this research note). Everything else has to emerge from latent consistency under transformations.
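
To make the shape of that objective concrete, here is a minimal NumPy sketch. It rests on several assumptions that are not from the experiments: the encoder and decoder are toy linear maps, the transformations are fixed random pixel masks, and `gaussian_moment_penalty` is a crude stand-in for SIGReg (it only matches the first two moments), not the real regulariser. All function and variable names are hypothetical.

```python
import numpy as np

def encode(x, W_enc):
    """Toy linear encoder: flatten each image, project to a latent vector."""
    return x.reshape(x.shape[0], -1) @ W_enc

def decode(z, W_dec, image_shape):
    """Toy linear decoder: map latents back to image space."""
    return (z @ W_dec).reshape((z.shape[0],) + tuple(image_shape))

def gaussian_moment_penalty(z):
    """Stand-in for SIGReg, NOT the real regulariser: penalise batch
    statistics that drift from zero mean and unit variance per dimension."""
    return float(np.mean(z.mean(axis=0) ** 2) + np.mean((z.var(axis=0) - 1.0) ** 2))

def self_teaching_loss(x, W_enc, W_dec, transforms):
    """Latent agreement between T(x) and T(x_hat), averaged over T, plus the
    anti-collapse penalty. No term compares x and x_hat in image space."""
    x_hat = decode(encode(x, W_enc), W_dec, x.shape[1:])
    agreement = 0.0
    for T in transforms:
        diff = encode(T(x), W_enc) - encode(T(x_hat), W_enc)
        agreement += np.mean(np.linalg.norm(diff, axis=1))
    agreement /= len(transforms)
    return float(agreement + gaussian_moment_penalty(encode(x, W_enc)))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))                 # batch of 8 tiny "images"
W_enc = rng.normal(size=(16, 3)) * 0.1         # 16 pixels -> 3 latents
W_dec = rng.normal(size=(3, 16)) * 0.1         # 3 latents -> 16 pixels
masks = [rng.random((4, 4)) < 0.5 for _ in range(4)]
# Each transformation is applied identically to x and to x_hat.
transforms = [lambda img, m=m: img * m for m in masks]
loss = self_teaching_loss(x, W_enc, W_dec, transforms)
```

The one property worth noticing is that a perfect reconstruction makes the agreement term exactly zero, so only the regulariser remains. The converse does not hold, which is the subject of the next section.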

At first glance, the transformation looks like an odd extra. Why not just require \(E(x)\) and \(E(\hat{x})\) to match directly? If the decoder output lands at the same latent point as the input, shouldn’t that already mean it has reconstructed the image?

03 — Why Transformations Are Necessary

Private language and equivalence classes

Without transformations, the objective barely constrains the decoder. The untransformed loss \(\lVert E(x) - E(D(E(x))) \rVert\) only requires the decoder output to land somewhere the encoder treats as equivalent to the input. It does not require a faithful reconstruction.

Without transformations, the model does not need to reconstruct the world. It only needs to reconstruct something the encoder recognises as one of its own.

That is the private-language loophole. The encoder and decoder can collude on codes that survive re-encoding without corresponding to real images. SIGReg stops the degenerate constant solution, but it does not stop this collusion.

The cleanest way to formalise this is with equivalence classes. The encoder partitions images into sets it treats as the same, so without transformations the decoder only has to produce something in the same class as the input.

Without transformations, the relevant set is

\[ [x]_E = \{y : E(y) = E(x)\} \]

and that set can be enormous. With transformations, the requirement becomes

\[ [x]_{E,T} = \{y : E(T(y)) = E(T(x))\}, \qquad \hat{x} \in \bigcap_T [x]_{E,T} \]

Each transformation checks a different aspect of the image, so the intersection gets tighter. In the ideal case, the only thing that keeps passing all those checks is the real image itself. That is why transformations are necessary: they turn latent agreement from an easily hacked objective into meaningful pressure toward faithful reconstruction. Once I started thinking about it this way, the experimental question became much clearer: which transformations tighten the right constraints without pushing the corrupted view too far off the natural image distribution?
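
The shrinking intersection can be seen in a deliberately tiny example. The sketch below assumes a toy encoder that keeps only the pixel sum of a 4-pixel binary "image" and uses sliding 2-pixel crops as the transformations. Both choices are illustrative assumptions, not the models or transformations used in the experiments.

```python
from itertools import product

def encoder(pixels):
    """Toy encoder (illustrative assumption): keeps only the pixel sum."""
    return sum(pixels)

def crop(start, length):
    """Transformation T: keep a contiguous window of the 1-D 'image'."""
    return lambda pixels: pixels[start:start + length]

x = (1, 0, 1, 1)
candidates = list(product((0, 1), repeat=4))   # every binary 4-pixel image

# Without transformations: anything with the same sum counts as "equal" to x.
plain_class = [y for y in candidates if encoder(y) == encoder(x)]

# With transformations: y must agree with x under every crop.
crops = [crop(start, 2) for start in range(3)]
intersection = [y for y in candidates
                if all(encoder(T(y)) == encoder(T(x)) for T in crops)]

print(len(plain_class))   # 4 images share the encoder output of x
print(intersection)       # [(1, 0, 1, 1)] -> only x itself survives
```

The encoder is unchanged between the two cases. Only the set of views it is asked to judge has grown, and that alone is enough to remove the three impostors.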

04 — Refining The Method

From toy data to CIFAR-10

At this point the transformation is the method, so the next step was really a refinement process. Start on toy data, scale up to CIFAR-10, see what stops working, and keep only the variants that survive that transition.

The lesson from the toy setting was that sparse pixel masking works when structure is almost all that matters. The lesson from CIFAR-10 was that this is not enough for natural images, where colour and local texture also need to be constrained.

Synthetic dataset

On grayscale shapes over black backgrounds, sparse pixel masking worked well. It preserved enough local structure for the model to recover the shape without introducing obvious artefacts, so in that regime the basic transformed objective already looked clean.
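
As a rough sketch of what a sparse pixel mask could look like in NumPy: the `keep_prob` value and the function names are placeholders, and whether the original masking keeps or drops the sparse set of pixels is an assumption here.

```python
import numpy as np

def sample_sparse_mask(image_shape, keep_prob, rng):
    """Boolean mask that keeps each pixel independently with prob keep_prob."""
    return rng.random(image_shape) < keep_prob

def apply_mask(images, mask):
    """Zero out dropped pixels; the mask broadcasts over the batch axis."""
    return images * mask

rng = np.random.default_rng(0)
x = rng.random((4, 16, 16))        # stand-in batch of grayscale images
x_hat = rng.random((4, 16, 16))    # stand-in for decoder outputs
mask = sample_sparse_mask(x.shape[1:], keep_prob=0.25, rng=rng)

# The same mask is applied to both views, so the encoder compares like with like.
t_x, t_x_hat = apply_mask(x, mask), apply_mask(x_hat, mask)
```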

Figure: Synthetic grayscale-shape reconstructions. On the synthetic grayscale data, the pixel transformation was the only method that reconstructed cleanly without obvious artefacts.

But that only solved the toy problem. It showed that the idea could work when structure is almost all that matters.

CIFAR-10

CIFAR-10 exposed the limitation immediately. Pixel masking did not fail completely, but it asked the wrong question for natural images. The model mostly learned luminance and coarse layout, while colour and local texture remained underconstrained.

In practice, the grayscale shape of the scene was often roughly right while colour remained weak or washed out. So the synthetic result was informative, but it did not scale.

Figure: CIFAR-10 reconstructions under the pixel transformation. It mostly preserved luminance and coarse structure, but it did not constrain colour and finer detail well enough.

Crop-resize

Crop-resize was the first version that really worked on CIFAR-10. Small natural crops force agreement on colour and local texture as well as coarse shape, so they constrain exactly the parts that pixel masking was missing. It was the first latent-consistency variant that produced reconstructions with both sensible structure and sensible colour.
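
A crop-resize transformation of this kind might look roughly like the following NumPy sketch. The crop size, the nearest-neighbour resize, and the function names are all assumptions made for illustration. The key detail is that one sampled crop box is reused for the input and the reconstruction.

```python
import numpy as np

def sample_crop_box(height, width, crop_size, rng):
    """Pick the top-left corner of a square crop that fits inside the image."""
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    return top, left

def crop_resize(images, top, left, crop_size, out_size):
    """Crop a window from (N, H, W, C) images, then resize it to
    out_size x out_size with nearest-neighbour sampling."""
    patch = images[:, top:top + crop_size, left:left + crop_size, :]
    idx = np.arange(out_size) * crop_size // out_size   # source row/col per output pixel
    return patch[:, idx][:, :, idx]

rng = np.random.default_rng(0)
x = rng.random((2, 32, 32, 3))        # stand-in for a CIFAR-10 batch
x_hat = rng.random((2, 32, 32, 3))    # stand-in for decoder outputs

top, left = sample_crop_box(32, 32, crop_size=8, rng=rng)
view_x = crop_resize(x, top, left, crop_size=8, out_size=32)
view_x_hat = crop_resize(x_hat, top, left, crop_size=8, out_size=32)
```

Because the crop is resized back to the full input size, the encoder sees a small region at high magnification, which is one plausible reason colour and local texture become harder to ignore.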

Figure: CIFAR-10 reconstructions under crop-and-resize. It produced reconstructions with substantially better colour and overall fidelity than the other transformation variants.

The MSE baseline is still slightly better here, but that was not the point. Crop-resize made the approach look plausible on real images, so it became the first version worth scaling further.

Step-frozen judging

But crop-resize still left a second loophole. Once reconstructions became decent, the encoder could become invariant to the decoder's remaining artefacts instead of forcing the decoder to fix them. The important method change was therefore not a new corruption by itself, but a new judge: freeze the encoder during the comparison step so the decoder has to absorb more of the mismatch.
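
One way to write that down, as a reading of the paragraph above rather than the exact formulation used, is to let \(\bar{E}\) denote a copy of the encoder whose parameters are held fixed for the current update. The bar notation is introduced here for illustration, and whether the input-side encoding is also frozen is left open.

\[ \mathcal{L}_{\mathrm{frozen}} = \mathbb{E}_T \left[\lVert E(T(x)) - \bar{E}(T(D(E(x)))) \rVert\right] + \mathcal{L}_{\mathrm{SIGReg}} \]

Under this reading, the judge \(\bar{E}\) receives no gradient from the comparison, so it cannot lower the loss by learning to ignore the decoder's artefacts. The mismatch can only shrink if \(D\) produces a better \(\hat{x}\).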

On CIFAR-10, step-freezing worked much better:

Figure: CIFAR-10 reconstructions under the step-frozen formulation, shown for crop-resize and several other transformations.

Within that step-frozen setup, crop-resize was still the strongest transformation overall. I also tried several other transformations in the same setup, including a small detail-sensitive stack plus a plain full-view term, but the main takeaway was unchanged: step-freezing improved the method, and crop-resize remained the most reliable corruption.

05 — Scaling The Task

CelebA autoencoding and masked autoencoding

Once the method looked plausible on CIFAR-10, the next question was whether the step-frozen crop-resize variant would survive on a more realistic dataset. On CelebA, that naturally split into two settings: ordinary autoencoding, where the input still determines the target, and masked autoencoding, where the target is only partially observed.

leAutoencoder

In ordinary autoencoding, the job is still basically compression: preserve enough information in the latent to reconstruct the image. In masked autoencoding, that is no longer true. If the masked observation is \(y = m \odot x\), then the visible pixels do not determine a unique target, so the problem becomes conditional prediction rather than compression.

Formally, the conditional distribution \(p(x \mid y,m)\) has nonzero entropy, and under squared error the optimal predictor is

\[ \hat{x}(y,m) = \mathbb{E}[x \mid y,m]. \]

That is exactly why ambiguous regions become blurry under MSE: different plausible completions get averaged together. This is the point where the setup starts to look much closer to leJEPA. The masked input is being asked to land at the same latent target as the clean input, while the decoder, viewed as just another internal layer, is still being constrained onto the dataset distribution.
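
The averaging effect is easy to check numerically. The sketch below assumes a single masked pixel with two equally plausible completions, 0 and 1, and searches a grid of predictions for the one that minimises expected squared error. The numbers are illustrative only.

```python
completions = [0.0, 1.0]   # two equally plausible values for a masked pixel
probs = [0.5, 0.5]

def expected_sq_error(pred):
    """Expected squared error of a single prediction over the completions."""
    return sum(p * (pred - v) ** 2 for p, v in zip(probs, completions))

grid = [i / 100 for i in range(101)]
best = min(grid, key=expected_sq_error)

print(best)                      # 0.5, the conditional mean
print(expected_sq_error(0.5))    # 0.25
print(expected_sq_error(1.0))    # 0.5
```

Committing to either plausible value costs twice as much under squared error as predicting the grey average that never occurs in the data.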

So leAutoencoder compares both the clean reconstruction \(\hat{x}(x)\) and the masked reconstruction \(\hat{x}(y)\) against the same transformed clean view. The masked branch is pulled toward the clean latent target, while the decoded branch is still pressured to stay on the image manifold.
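
Written as a formula, and again as one reading of the description rather than the exact loss used, with \(\hat{x}(x) = D(E(x))\) and \(\hat{x}(y) = D(E(y))\):

\[ \mathcal{L}_{\mathrm{masked}} = \mathbb{E}_T \left[ \lVert E(T(x)) - E(T(\hat{x}(x))) \rVert + \lVert E(T(x)) - E(T(\hat{x}(y))) \rVert \right] + \mathcal{L}_{\mathrm{SIGReg}} \]

Both terms share the target \(E(T(x))\). The equal weighting of the two branches is an assumption made here for simplicity.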

The practical question, then, is how the same method behaves in those two settings on CelebA: first in ordinary autoencoding, then in masked autoencoding.

Ordinary autoencoding

First, the easier setting: ordinary autoencoding at roughly 6x compression. Here the question is just whether the step-frozen crop-resize method still reconstructs cleanly once the data are more structured and higher resolution.

Figure: CelebA step-frozen crop-resize result. The ordinary-autoencoding results on CelebA all use the same 2M parameter architecture at roughly 6x compression. I kept the bottleneck fairly gentle because this stage was only meant to validate the method.

This mostly confirmed the CIFAR story. The step-frozen crop-resize variant remained stable enough to push into the genuinely harder masked setting.

Masked autoencoding

Here I tried two much harder masked-autoencoding variants. With latent size 128, the model has about 11M parameters and the bottleneck is roughly 400x compressed. With latent size 512, it has about 36M parameters and the bottleneck is still roughly 96x compressed. So this is a very hard conditional reconstruction problem in both cases.
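
For reference, the compression figures can be read as a ratio of input values to latent values. The input resolution is not stated here. The 128x128 RGB shape below is an assumption, chosen only because it is consistent with both quoted ratios.

```python
def compression_ratio(height, width, channels, latent_size):
    """Number of input values divided by number of latent values."""
    return height * width * channels / latent_size

# ASSUMPTION: 128x128 RGB inputs (the resolution is not given in the text).
print(compression_ratio(128, 128, 3, latent_size=128))   # 384.0, i.e. roughly 400x
print(compression_ratio(128, 128, 3, latent_size=512))   # 96.0
```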

For the masking task I also switched to global pooling in the latent pathway. Once the input is only partially observed, forcing the representation through a globally pooled bottleneck made more sense than keeping a more spatially local code.

Figure: Masked-autoencoding comparison on CelebA for the `latent=128` and `latent=512` runs. Each panel shows original images, masked inputs, and the masked and clean reconstructions from leAutoencoder side by side with those from the baseline.

One important caveat is that the baseline autoencoder is trained directly from masked input to clean output, so its encoder is not really trained to handle clean inputs at all.

The results are not a clear victory for our method, but it is competitive. The fact that it natively handles the clean input distribution as well is a bonus. It also seems to handle hair a bit better when it covers the face (latent size 512, columns 7 and 8). Unlike the baseline, it never creates isolated strands of hair. Still, I would not say it fully resolves the issue of averaging instead of choosing a mode. That may also be partly due to architectural constraints.

Takeaway

What Actually Matters

The main point is that this is basically a leJEPA-style objective with one extra ingredient: an internal layer is constrained to stay on the input distribution. That matters because JEPA-style methods usually do not come with a decoder for free, and adding one later often damages the representation by forcing reconstruction into a latent space that was never meant to be invertible in the first place. I wrote more about that trade-off in this post on leJEPA and SIGReg. Here, the latent objective and the image-distribution constraint are learned together rather than bolted together afterward.

The other important point is why this behaves differently from a plain MSE autoencoder. It is not chasing weak pixel averages. It is chasing an abstract representation of what the image should contain under a family of transformations. That changes the failure mode. When it misses, in theory it is more likely to hallucinate something plausible than to wash everything into a blurry average, because the objective is organised around semantic agreement rather than direct pixel matching. The fact that those semantics are learned during training, without requiring a pretrained model, is the cherry on the cake.
