[2024-06-15] SSL cond indep is easy and fun
so we've all heard of Lee et al. 2020
by now where yadda yadda if you train a model to predict one part of
the input x1 from another part of the input x2
... (in the authors' example, x1 might be the foreground of an image
while x2 is the background) ...
then this model is automatically good at predicting any variable y as
long as x1 and x2 are conditionally independent given y, which means
self-supervised learning is great
and this sounds great at first, except some like to point out that this conditional independence assumption doesn't seem reasonable given that practically speaking, x1 and x2 are usually heavily correlated inputs like two augmentations of the same input image
Lee et al. 2020 tries to get around this with some analysis of e.g. what if x1 and x2 are only approximately cond indep given y and whatnot
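but tbh that's totally unnecessary---everything is fine because, uh, if x1, x2 are augmentations of some image x, we can just take y=x. once you fix x, the two augmentations are sampled independently of each other, so x1 and x2 are exactly cond indep given y, no approximation needed.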
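quick sanity check of the y=x trick, as a toy sketch and nothing more. everything here is made up for illustration: the "image" x is a scalar gaussian, an "augmentation" is just x plus independent gaussian noise, and the function name and noise level are hypothetical. the point is only that x1 and x2 are heavily correlated marginally, but the correlation vanishes once you condition on x.

```python
import numpy as np

def augmentation_correlations(n=100_000, noise_std=0.5, seed=0):
    """Toy model: x is a scalar 'image'; x1, x2 are independent noisy 'augmentations' of it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    x1 = x + noise_std * rng.standard_normal(n)  # first augmentation
    x2 = x + noise_std * rng.standard_normal(n)  # second augmentation, independent noise
    # marginal correlation: large, because both views share x
    marginal = np.corrcoef(x1, x2)[0, 1]
    # condition on y = x by subtracting the known conditional mean E[x_i | x] = x
    conditional = np.corrcoef(x1 - x, x2 - x)[0, 1]
    return marginal, conditional

marginal, conditional = augmentation_correlations()
print(f"corr(x1, x2)     = {marginal:.3f}")
print(f"corr(x1, x2 | x) = {conditional:.3f}")
```

zero correlation isn't the same thing as independence in general, but in this toy setup the residual noises are independent by construction, so the second number being near zero is just the finite-sample version of that.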
at first glance that might seem a bit sus, like, how does that even lead to a useful representation
but think about contrastive learning algs, like SimCLR.
what simclr and most other contrastive algs do is basically just, pull the representations of augmentations of the same image together, and make the representations of different images have low cosine similarity
it's basically the same as trying to assign each image in the training dataset its own direction in representation space
of course, if the dataset size is much larger than the representation dimensionality, then the best you can do is some kind of superposition
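but as long as you've achieved some sort of approximate orthogonalish superposition of the training dataset in representation space, then you can carve out more or less arbitrary binary labelings of the training dataset with a hyperplane (exactly arbitrary when the directions are linearly independent, and progressively less so as the superposition gets more crowded)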
i.e. you can learn any classifier with a linear probe
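here's a small sketch of the "any labeling is linearly separable" claim, under assumptions that are all mine: the "representations" are random unit vectors (so approximately orthogonal), the labels are random coin flips, and the probe is a least-squares fit with no bias term, i.e. a hyperplane through the origin. `probe_accuracy` is a hypothetical helper name.

```python
import numpy as np

def probe_accuracy(n_points, dim, seed=0):
    """Fit a linear probe to random +/-1 labels on random unit-vector 'representations'."""
    rng = np.random.default_rng(seed)
    reps = rng.standard_normal((n_points, dim))
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)  # one direction per training point
    labels = rng.choice([-1.0, 1.0], size=n_points)      # an arbitrary binary labeling
    # least-squares probe; when n_points <= dim this interpolates the labels
    w, *_ = np.linalg.lstsq(reps, labels, rcond=None)
    preds = np.sign(reps @ w)
    return float(np.mean(preds == labels))

print("fewer points than dims:", probe_accuracy(64, 128))
print("crowded superposition: ", probe_accuracy(512, 128))
```

the first call is the clean regime where every point really does get its own direction. the second call packs 4x more points than dimensions, and is there so you can see how much a plain least-squares probe degrades on a totally random labeling when things get crowded.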
simclr attempting to map each point of the dataset to its own approximate direction / dimension in representation space, from a certain perspective, is the same as trying to make it easy to determine the identity of the original image in the training dataset, given an augmentation of that image. if anything the simclr objective is basically just a direct translation of, "make the nearest neighbor classifier for 'identity of original image' as accurate as possible"
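to make the "identity of original image" reading concrete, here's a minimal sketch. it is a simplified one-directional InfoNCE-style loss, not the exact SimCLR NT-Xent loss (which also uses negatives from within the same view and symmetrizes over both views). the function name, temperature, and toy data are all assumptions for illustration. the loss is literally a softmax cross-entropy where the "class" of row i is "image i".

```python
import numpy as np

def instance_discrimination(z1, z2, temperature=0.1):
    """Cross-entropy for 'which original image did this augmentation come from?'.

    z1[i] and z2[i] are representations of two augmentations of image i.
    Returns (loss, nearest-neighbor accuracy under cosine similarity).
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))
    loss = -log_probs[idx, idx].mean()                    # correct "class" for row i is image i
    nn_accuracy = np.mean(logits.argmax(axis=1) == idx)   # nearest neighbor = same image?
    return float(loss), float(nn_accuracy)

rng = np.random.default_rng(0)
z1 = rng.standard_normal((32, 64))
z2 = z1 + 0.05 * rng.standard_normal((32, 64))            # positives: small perturbation
loss_good, acc_good = instance_discrimination(z1, z2)
loss_bad, acc_bad = instance_discrimination(z1, rng.standard_normal((32, 64)))
print(loss_good, acc_good)
print(loss_bad, acc_bad)
```

the first pair of numbers is an encoder that's invariant-ish to augmentation and spreads images apart, the second is an encoder whose two views have nothing to do with each other. driving the loss down and driving the nearest neighbor accuracy up are the same job.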
of course this doesn't at all explain why simclr representations generalize outside of the training dataset---that likely comes down to smoothness / continuity constraints on the encoder, arising from its parameterization as a neural network and how the loss function encourages invariance on positive pairs
i guess this is all to say, the downstream task variable, y, discussed in Lee et al. doesn't need to be thought of as some kind of high-level info like class label, nor does the conditional independence assumption need to be thought of as approximate. this conclusion is empirically supported by the fact that practical self-supervised learning algs seem to work fine while essentially just learning to be able to predict the "identity of original image" variable
as a sidenote, i suppose simclr can intuitively be thought of as the same thing as learning a representation with a VAE, except the noise in latent space is a nonparametric standard normal rather than a gaussian with parameterized diagonal. the main difference being, simclr no longer parameterizes a high dimensional decoder, but instead decodes as a weighted mean according to the nearest images in the training dataset. to make the decoder as accurate as possible, simclr simply tries to make the representations of different images consistently far apart from each other, where "far" means euclidean / cosine distance in this context
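a rough sketch of what that nonparametric "decoder" could look like, just to pin down the intuition. this is my own illustration and not something simclr actually computes: the softmax-over-cosine-similarity weighting, the temperature, the function name, and the fake 10-pixel "images" are all assumptions.

```python
import numpy as np

def nonparametric_decode(z, train_reps, train_images, temperature=0.05):
    """Decode a representation as a similarity-weighted mean of training images."""
    z = z / np.linalg.norm(z)
    train_reps = train_reps / np.linalg.norm(train_reps, axis=1, keepdims=True)
    scores = train_reps @ z / temperature          # scaled cosine similarity to each training rep
    weights = np.exp(scores - scores.max())        # softmax weights, computed stably
    weights /= weights.sum()
    return weights @ train_images                  # weighted mean of the training images

rng = np.random.default_rng(0)
train_images = rng.standard_normal((16, 10))       # 16 fake "images", 10 pixels each
train_reps = rng.standard_normal((16, 64))         # well-separated random representations
query = train_reps[3] + 0.05 * rng.standard_normal(64)  # noisy representation of image 3
recon = nonparametric_decode(query, train_reps, train_images)
error = float(np.abs(recon - train_images[3]).max())
print("max reconstruction error:", error)
```

the decoder is only accurate because the training representations are far apart from each other: if two images shared nearly the same direction, the weights would split between them and the "reconstruction" would be a blend.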
and BYOL is that but the neural network predictor parameterizes some nonlinear distance function as a drop-in replacement for euclidean.