Recommended reading: Dive into Deep Learning (D2L), Chapter 11 through Section 11.5.
Homework note: Submit with Module 9. Add cross-attention to your prior GRU-based seq2seq model so that the decoder attends over all encoder hidden states at each decoding step (same translation task). Report BLEU (and accuracy) and compare against your previous best; a sketch of the attention step follows below.
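As a starting point, here is a minimal sketch of one way the cross-attention step could be wired into a GRU decoder, assuming PyTorch and batch-first tensors. The names `CrossAttention` and `AttnGRUDecoder` are illustrative, not from the course starter code; scaled dot-product scoring is used here, but additive (Bahdanau) scoring would also satisfy the assignment.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Scaled dot-product attention: the decoder state attends over encoder states."""
    def __init__(self, hidden_size):
        super().__init__()
        self.scale = hidden_size ** -0.5

    def forward(self, dec_hidden, enc_outputs, enc_mask=None):
        # dec_hidden: (batch, hidden); enc_outputs: (batch, src_len, hidden)
        scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2) * self.scale
        if enc_mask is not None:                       # mask out source padding positions
            scores = scores.masked_fill(~enc_mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)         # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights                        # context: (batch, hidden)

class AttnGRUDecoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.attention = CrossAttention(hidden_size)
        # GRU input at each step = token embedding concatenated with attention context
        self.gru = nn.GRU(embed_size + hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_tokens, enc_outputs, hidden, enc_mask=None):
        # tgt_tokens: (batch, tgt_len). Decode one step at a time so the current
        # decoder state can attend over all encoder hidden states at every step.
        logits = []
        for t in range(tgt_tokens.size(1)):
            emb = self.embedding(tgt_tokens[:, t])                         # (batch, embed)
            context, _ = self.attention(hidden[-1], enc_outputs, enc_mask)  # hidden[-1]: top layer
            step_in = torch.cat([emb, context], dim=1).unsqueeze(1)        # (batch, 1, embed+hidden)
            output, hidden = self.gru(step_in, hidden)
            logits.append(self.out(output.squeeze(1)))
        return torch.stack(logits, dim=1), hidden                          # (batch, tgt_len, vocab)
```

If your existing decoder already loops over time steps for teacher forcing, only the context computation and the widened GRU input size need to change; keeping the encoder untouched makes the before/after BLEU comparison clean.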
Cross-Attention diagram preview: