
[2024-06-13] jepa == byol

like sure jepa uses masking, byol samples two global augmentations

but predicting the embedding of the disjoint masked-out region is exactly as difficult as predicting the embedding of the full unmasked image
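
to ground the comparison, here's roughly the shape of a BYOL-style step. toy pytorch with made-up names and shapes (`encoder`, `predictor`, `aug` are mine, not anyone's actual training code); the jepa-flavored version at the bottom of the post only swaps the view operator

```python
# minimal sketch of a BYOL-style step, assuming toy encoders / augmentations.
# all names and shapes here are placeholders for illustration, not any repo's real API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
target_encoder = copy.deepcopy(encoder)          # EMA copy, no gradients through it
for p in target_encoder.parameters():
    p.requires_grad_(False)

def aug(x):
    # stand-in for the augmentation operator: random horizontal flip or additive noise
    return torch.flip(x, dims=[-1]) if torch.rand(()) < 0.5 else x + 0.1 * torch.randn_like(x)

x = torch.randn(8, 3, 32, 32)                    # a toy batch of images
v1, v2 = aug(x), aug(x)                          # two global views of the same image

online = predictor(encoder(v1))                  # predict, from view 1 ...
with torch.no_grad():
    target = target_encoder(v2)                  # ... the target embedding of view 2

loss = F.mse_loss(F.normalize(online, dim=-1), F.normalize(target, dim=-1))
loss.backward()                                  # (EMA update of target_encoder happens elsewhere)
```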

to be fair masking is possibly a marginally better operator than standard image augmentations for this kind of representation learning

but it still feels disingenuous to describe jepa as a completely new paradigm

"what is this guy talking about" uhh have some links

  1. I-JEPA paper
  2. V-JEPA paper
  3. BYOL paper
  4. probably the most succinct explanation for why BYOL-likes collapsen't: just ignore the RL part and read P^π as the augmentation operator (the shared objective is spelled out right after this list)
  5. what "augmentation operator"?
  6. "oh but jepa conditions on the thing" and it doesn't really matter

non-contrastive learning with a nonlinear predictor is just contrastive learning with a spicier nonlinear distance function [citation needed]
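
the handwave, written down (still my notation, still [citation needed]): fold the nonlinear predictor into the distance and the non-contrastive loss reads as minimizing a learned, asymmetric distance between the two views' embeddings

```latex
% the nonlinear predictor q_theta absorbed into a learned, asymmetric "distance"
d_\theta(u, v) = \big\| q_\theta(u) - v \big\|_2^2
\quad\Longrightarrow\quad
\mathcal{L}(\theta) = \mathbb{E} \, d_\theta\big(f_\theta(t(x)), \, \mathrm{sg}[f_\xi(t'(x))]\big)
```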

it's all about finding a low dimensional representation which is unique-ish to your pre-augmentation / pre-masking input

unfortunately you can say the same thing about classic reconstruction-based models, contrastive algorithms, or even literal next-token prediction

they're still cool algorithms tho don't get me wrong i heckin love byol-likes / jepa-likes

but for real jepa is literally just byol, except the views the encoder sees are per-instance shards from the masking operator instead of views from the standard image augmentation operator
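
same skeleton as the BYOL sketch above, with the augmentation pair swapped for a (context mask, target mask) pair. toy shapes and names again, not I-JEPA's actual architecture (which predicts per-patch targets with ViTs); the point is that only the view operator changed

```python
# jepa-ish step: identical skeleton to the BYOL sketch, but the "view operator" is masking.
# toy shapes / names for illustration only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
target_encoder = copy.deepcopy(encoder)          # EMA copy, no gradients through it
for p in target_encoder.parameters():
    p.requires_grad_(False)

def mask_pair(x):
    # stand-in for the masking operator: zero out disjoint halves of the image.
    # "context" keeps the left half, "target" keeps the (disjoint) right half.
    context, target = x.clone(), x.clone()
    context[..., :, 16:] = 0.0
    target[..., :, :16] = 0.0
    return context, target

x = torch.randn(8, 3, 32, 32)                    # a toy batch of images
ctx, tgt = mask_pair(x)                          # the only part that changed vs the BYOL sketch

online = predictor(encoder(ctx))                 # predict, from the visible context ...
with torch.no_grad():
    target = target_encoder(tgt)                 # ... the embedding of the masked-out region

loss = F.mse_loss(F.normalize(online, dim=-1), F.normalize(target, dim=-1))
loss.backward()                                  # (EMA update of target_encoder happens elsewhere)
```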