Overall comments.
I think the contents of ch 7 should be "folded in" to the earlier
chapters, rather than being kept to the end of the book.
Concretely, I think sec 7.2 on SimDino should be added to ch 4 or 5,
sec 7.3 on CRATE for image classification should be added to ch 4,
and sec 7.7 on CRATE for MAE should be added to sec 6.3.1 (regular MAE).
(And I think 6.3.1 should be moved to ch 5, whose focus is
unsupervised rep learning.)
Secs 7.4-7.5 are the only parts of the book that discuss language
modeling. They should probably be added to ch 4 (CRATE), but could
perhaps be their own short chapter.
Detailed comments (typos etc)
p196. This introductory stuff on CE should really be covered in ch 1,
where you should also introduce linear and logistic regression
(as supervised analogs of PCA).
p198. "We discuss these three parts presently."
Four parts :)
p203. Eq 7.2.34. Maybe worth spelling out how to compute the R term.
Also, the data processing steps are the same as regular DINO,
so you could skip repeating that.
Finally, it is worth explaining that you minimize the distance
from z(g|teacher) to both z(l|student) and z(g|student).
It is not clear why you should include z(g|student) as one
of the contrastive terms.
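To illustrate the kind of spelling-out I have in mind for the R term, here is a minimal sketch. It assumes R is the usual coding rate from the rate reduction literature, R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z), for n features of dimension d. The function name, the default eps, and the toy data are my own choices; the book's Eq 7.2.34 may apply R to a covariance or use different constants.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate of n features (rows of Z) in d dimensions.

    Assumed form: 0.5 * logdet(I + d / (n * eps^2) * Z^T Z).
    """
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix, unnormalized
    # slogdet is numerically safer than log(det(...))
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
n, d = 64, 8

# Features spread over the unit sphere.
spread = rng.normal(size=(n, d))
spread /= np.linalg.norm(spread, axis=1, keepdims=True)

# Collapsed features: every row is the same unit vector.
collapsed = np.tile(spread[:1], (n, 1))

rate_spread = coding_rate(spread)
rate_collapsed = coding_rate(collapsed)
rate_zero = coding_rate(np.zeros((n, d)))
```

Under this assumed definition, the spread features get a larger rate than the collapsed ones, which is presumably why the term discourages representation collapse.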
p204. Eqn 7.2.37. Here you have implicitly used CE(p,logits(q)).
Maybe add a softmax term to W*z?
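To make the point concrete, here is a small sketch showing that CE(p, logits) with an internal log-softmax gives the same number as CE(p, softmax(W*z)). The names W, z, and p are placeholders of mine with made-up shapes, not the book's notation.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_from_logits(p, logits):
    # The softmax is applied implicitly, inside the loss.
    return -(p * log_softmax(logits)).sum(axis=-1)

def cross_entropy_from_probs(p, q):
    # Textbook form: both arguments are probability vectors.
    return -(p * np.log(q)).sum(axis=-1)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # hypothetical linear head, 3 classes
z = rng.normal(size=4)         # hypothetical feature vector
p = np.array([0.0, 1.0, 0.0])  # one-hot target

logits = W @ z
q = np.exp(log_softmax(logits))  # explicit softmax(W z)

ce_logits = cross_entropy_from_logits(p, logits)
ce_probs = cross_entropy_from_probs(p, q)
```

The two values agree, so writing softmax(W*z) explicitly in the equation changes nothing numerically; it only makes the second argument of CE a probability vector, as the definition requires.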
p204. "The usual practice is to train the model first on
a large dataset (such as ImageNet-1K)".
Emphasize that pre-training is unlabeled.
p208. "We can then do linear probing, attention
map visualization, and detection/segmentation benchmarks,
given the output of this view."
Given that you trained the model with a logistic regression
classification head, it's odd that you don't also evaluate that.
p210. The segmentation results in fig 7.11 are indeed impressive.
Maybe it's worth speculating about which aspects of CRATE vs ViT
lead to this performance. For example, one possible ablation
is to tie Uq, Uk, Uv to be the same in each layer of
the ViT (to match the MSSA operator), and to make the MLP 1 layer (to
roughly match the dictionary layer of ISTA). Is this the magic sauce?
Or is it the use of the sparsifying ISTA operation vs the MLP?
p212. Footnote 7. This goes to show that 1-based indexing
(used by Matlab/Julia) is better than Python's 0-based :)
p213. "Since E is so large (and the gradient update is very
sparse w.r.t. it since only a small fraction of the vocabulary
is used in each sample), specialized software is
used to make sure the memory updates are not too onerous."
Maybe cite the SparseCore paper?
N. P. Jouppi et al., “TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings,” in International Symposium on Computer Architecture, Apr. 2023.
p213. "hte forward". Typo
p213. "causal constraint". This is not really for efficiency reasons,
since non-causal attention can also be computed in batch mode.
It's because we want to train the model in a way which is similar
to how it will be used at prediction time, where the future is unknown.
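A toy sketch of the point: both masked and unmasked attention are computed as one batched matrix product, so the mask is not what buys efficiency. What the mask changes is which inputs each output can see. This is single-head attention with identity projections, a simplification of mine and not the book's layer.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Z, causal):
    """Single-head attention over N tokens (rows of Z), identity projections."""
    N, d = Z.shape
    scores = Z @ Z.T / np.sqrt(d)
    if causal:
        # Position n may not attend to positions > n.
        future = np.triu(np.ones((N, N), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ Z

rng = np.random.default_rng(0)
Z_a = rng.normal(size=(5, 4))
Z_b = Z_a.copy()
Z_b[-1] += 10.0  # perturb only the last ("future") token

causal_a = self_attention(Z_a, causal=True)
causal_b = self_attention(Z_b, causal=True)
full_a = self_attention(Z_a, causal=False)
full_b = self_attention(Z_b, causal=False)
```

With the mask, the outputs at the first four positions are unchanged by the perturbation of the last token; without it, they change. That matches the train/test argument above: at prediction time the future token does not exist yet.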
p213. Eq 7.4.8. In a way it's odd that we only use the n'th embedding
to predict the next token, since it forces z(n) to contain information
about the entire past. Obviously z(n) is a function of the entire past,
due to self attention, but it might not have the capacity to store
this info. Has anyone considered using an extraction network
that pools over the last K tokens?
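To be concrete about what I mean by pooling over the last K tokens, here is a purely hypothetical sketch (not something from the book, and I do not know whether it helps in practice): replace z(n) by the mean of z(n-K+1), ..., z(n) before the prediction head. The function name is mine.

```python
import numpy as np

def last_k_mean_pool(Z, K):
    """For each position n, average the embeddings at positions n-K+1..n.

    Positions with fewer than K predecessors average over what is available,
    so the pooling stays causal.
    """
    N, d = Z.shape
    # Prefix sums with a leading zero row: csum[i] = sum of Z[:i].
    csum = np.vstack([np.zeros((1, d)), np.cumsum(Z, axis=0)])
    ends = np.arange(1, N + 1)
    starts = np.maximum(ends - K, 0)
    counts = (ends - starts)[:, None]
    return (csum[ends] - csum[starts]) / counts

Z = np.arange(12, dtype=float).reshape(6, 2)  # 6 toy token embeddings
pooled = last_k_mean_pool(Z, K=3)
```

Setting K=1 recovers the usual scheme in which only the n'th embedding feeds the head; a learned (e.g. attention-based) pooling would be the obvious generalization.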
p214. "All these architectural choices mean that causal training
is extremely efficient relative to non-causal training".
It's not just the architecture, it's also because the prediction target
is fixed ahead of time, since we are using teacher forcing,
so we reduce the problem to N classification tasks that can be solved
in parallel (with a shared pre-processing step).
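A sketch of what I mean: with teacher forcing the targets are just the input tokens shifted by one, so the loss is N-1 independent classification losses evaluated on the output of a single forward pass. Here random logits stand in for that shared forward pass; the sizes and names are made up.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, tokens):
    """Mean next-token cross-entropy; logits[n] is the prediction for tokens[n+1]."""
    targets = tokens[1:]             # fixed ahead of time (teacher forcing)
    logp = log_softmax(logits[:-1])  # last position has no target
    return -logp[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
N, V = 8, 20                         # toy sequence length and vocabulary size
tokens = rng.integers(0, V, size=N)
logits = rng.normal(size=(N, V))     # stand-in for one shared forward pass

parallel_loss = next_token_loss(logits, tokens)

# The same quantity, written as separate classification problems.
per_position = [
    -log_softmax(logits[n])[tokens[n + 1]] for n in range(N - 1)
]
loop_loss = float(np.mean(per_position))
```

The two losses agree. Nothing in the per-position loss depends on the model's own earlier predictions, which is what lets all positions be trained at once.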
p215. Sec 7.4.5. You should define the word "perplexity"
since that is a common eval metric.
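For reference, the standard definition is short: perplexity is the exponential of the average negative log-likelihood per token, exp(-(1/N) sum_n log p(x_n | x_<n)). A minimal sketch, with a function name of my own choosing:

```python
import numpy as np

def perplexity(token_log_probs):
    """exp of the mean negative log-probability (natural log) per token."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model that is uniform over a vocabulary of V tokens has perplexity V.
V = 50
uniform_log_probs = np.log(np.full(10, 1.0 / V))
ppl_uniform = perplexity(uniform_log_probs)

# A model that assigns probability 1 to every observed token has perplexity 1.
ppl_perfect = perplexity(np.zeros(10))
```

This also gives the usual reading of the metric: a perplexity of k means the model is, on average, as uncertain as a uniform choice among k tokens.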
p220. Table 7.9. It would be good to measure or visualize
the running time, as well as validation loss. Is
the context length in these datasets long enough to show
the benefits of linear time scaling? You also have to compete
with KV caches, etc. Is TOST more efficient in practice?
p220. Table 7.10. Caption says "regular (ViT) transformer "
but it's not ViT.
p222. "each patch feature should contain both information about the patch
and information about the statistics of the whole image".
It sounds like MAE would work much better if there were one or more
global CLS-like token embeddings to capture global semantics.
Do people do this?
(Oh, I see later that you do use a CLS token, even though MAE is
unsupervised. Maybe you should mention this when you define the encoder.)
p222. "We use a CRATE encoder and decoder, depicted in Figure 7.7"
Fig 7.7 shows standard transformer, not CRATE.
Maybe you mean fig 6.5?
p222. Eq 7.6.4. Emphasize that it is - MSSA, not + MSSA.
p222. "discretizations of a forward- and reverse-time diffusion
process". Where does diffusion come in?
p225. References ZLG+ and JJB+ have no year.
p225. Reference Tea is woefully incomplete.