[2024-06-11] reading [:3]
i wonder how many people actually read the first 3 sections of a deep learning research paper
by 'first 3' i mean introduction, background / related work, and motivation (/ derivation, although this post is more directed towards empirical research)
yeah yeah they aren't always actually located in the first 3 sections but anyway
i generally skip straight to the ~4th section where the actual practical algorithm is described
i would guesstimate that, conditioned on making it past the abstract, the optimal policy spends at most 20 seconds on the first skim of a paper, and doesn't read any further 90% of the time
i probably drop around the same proportion of papers after reading just the abstract, so given that i've clicked on a link to a paper, the probability that i'll read it in detail is maybe around 1%?
which i'm guessing is pretty standard
anyway i'm getting sidetracked. the point of this post is, i would recommend trying to derive the motivation of the algorithm from the algorithm itself, before reading the author's motivation
e.g. today i reread the iq-learn paper (great paper btw)
boiled down to minimal pseudocode, iq-learn with uhhh the chi-square divergence in a discrete action space is basically
loss = V(s) - Q(s, a) + 0.5 * td_error
     = logsumexp Q(s, .) - Q(s, a) + 0.5 * td_error
td_error = (Q(s, a) - γ * V(s')).pow(2)
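in runnable form, a minimal numpy sketch of the loss above for a tabular Q (function and variable names are mine, not the authors'):

```python
import numpy as np

def logsumexp(x):
    # numerically stable logsumexp over the action axis
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def iq_chi2_loss(Q, s, a, s_next, gamma=0.99):
    # iq-learn loss (chi-square divergence, discrete actions) on a
    # single expert transition (s, a, s'); Q is [n_states, n_actions].
    # my reconstruction of the pseudocode above, not the authors'
    # reference implementation.
    V = logsumexp(Q[s])            # V(s) = logsumexp_a Q(s, a)
    V_next = logsumexp(Q[s_next])  # V(s')
    td_error = (Q[s, a] - gamma * V_next) ** 2
    return V - Q[s, a] + 0.5 * td_error
```

note that logsumexp ≥ max ≥ Q(s, a), so the V(s) - Q(s, a) part is nonnegative, and the whole loss is nonnegative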
which, as the authors note, is just a variant of the cql loss with zero reward
but hey, if you treat Q like some kind of log visitation energy
Q(s, a) = log q(s, a) = log p(s, a) + log Z
then it's basically
loss = log p(s) + log Z - log p(s, a) - log Z + 0.5 (log p(s, a) + log Z - γlog p(s') - γlog Z).pow(2)
= log p(s) - log p(s, a) + 0.5 (log p(s, a) - γlog p(s') + (1 - γ) log Z).pow(2)
~= -log p(a | s) + 0.5 (log p(s, a) - γlog p(s')).pow(2)
for γ close to one...
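(the substitution is easy to sanity-check numerically; the values below are arbitrary, i'm just checking the algebra lines up:)

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
log_Z = 0.7
log_p_sa = rng.normal()      # log p(s, a)
log_p_s = log_p_sa + 0.3     # log p(s) >= log p(s, a); exact value arbitrary
log_p_next = rng.normal()    # log p(s')

# original form: V(s) - Q(s, a) + 0.5 * (Q(s, a) - gamma * V(s'))^2
# with Q(s, a) = log p(s, a) + log Z and V(s) = log p(s) + log Z
Q_sa = log_p_sa + log_Z
V_s = log_p_s + log_Z
V_next = log_p_next + log_Z
original = V_s - Q_sa + 0.5 * (Q_sa - gamma * V_next) ** 2

# rewritten form from the derivation above
rewritten = (log_p_s - log_p_sa
             + 0.5 * (log_p_sa - gamma * log_p_next + (1 - gamma) * log_Z) ** 2)

assert np.isclose(original, rewritten)
```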
idk, maybe it can be interpreted as behavioral cloning, except the model also parameterizes an implicit estimate of the expert's visitation p(s, a), plus a penalty term that regularizes log p(s, a) towards γlog p(s')?
(and i guess the (1 - γ) log Z term sort-of encourages Z to be close to one...?)
and that would make sense, right? like in terms of loss functions that constrain an estimated visitation towards being an actual valid visitation, given only a single transition sample, this is sort of the best you can do
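the bc reading isn't even approximate for the first part, by the way: with the softmax policy log π(a|s) = Q(s, a) - V(s), the V(s) - Q(s, a) piece of the loss is exactly a cross-entropy term. quick check (names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 5, 3, 0.99
Q = rng.normal(size=(n_states, n_actions))
s, a, s_next = 0, 1, 2

def lse(x):
    # numerically stable logsumexp
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

V = np.array([lse(Q[i]) for i in range(n_states)])

# iq-learn chi-square loss on one transition
loss = V[s] - Q[s, a] + 0.5 * (Q[s, a] - gamma * V[s_next]) ** 2

# behavioral-cloning reading: -log pi(a|s) for the softmax policy,
# plus the squared consistency penalty
log_pi = Q[s, a] - V[s]   # log softmax
bc_term = -log_pi
penalty = 0.5 * (Q[s, a] - gamma * V[s_next]) ** 2

assert np.isclose(loss, bc_term + penalty)
```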
maybe this is a subset of what the iq-learn authors describe in their section 3, or what the cql paper describes; maybe it's novel, maybe i messed up the math, who knows
but either way it's a good exercise to try and explain why an algorithm works before reading the author's explanation for how they derived the algorithm
it forces you to understand at a high level what the algorithm is doing
and maybe occasionally this kind of thought exercise might lead to interesting ideas, yknow what i mean?
errr of course i am obligated to tell you to always go back and read the entire, mathematically rigorous derivation in detail afterward, ahaha...