Description
I am trying to understand the implementation of the coverage mechanism, and after debugging for a while I still cannot figure out why the procedure for producing the coverage vector in decode mode is NOT the same as in training/eval mode.
The relevant code is here: this line
Note that this attention decoder passes each decoder input through a linear layer with the previous step's context vector to get a modified version of the input. If initial_state_attention is False, on the first decoder step the "previous context vector" is just a zero vector. If initial_state_attention is True, we use initial_state to (re)calculate the previous step's context vector. We set this to False for train/eval mode (because we call attention_decoder once for all decoder steps) and True for decode mode (because we call attention_decoder once for each decoder step).
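To check my understanding of that comment, here is how I would sketch the initial_state_attention branch. The function and argument names below are mine, made up for illustration; this is not the actual code in attention_decoder.py:

```python
def previous_context_before_step0(initial_state_attention, initial_state,
                                  encoder_states, prev_coverage,
                                  attention, zero_context):
    """My reading of how the 'previous context vector' is obtained before
    the first decoder step handled by an attention_decoder call."""
    if initial_state_attention:
        # Decode mode: re-run attention on initial_state to (re)calculate the
        # previous step's context vector; as far as I can tell, this call is
        # also what advances the coverage vector.
        context, _, coverage = attention(initial_state, encoder_states, prev_coverage)
    else:
        # Train/eval mode: on the first decoder step the "previous context
        # vector" is just a zero vector and coverage stays untouched.
        context, coverage = zero_context, prev_coverage
    return context, coverage
```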
IMHO, the training and decode procedures mismatch to some extent in this implementation (please correct me if I am wrong).
For example (see the toy sketch after this list):
Let H be all encoder hidden states (a list of tensors). Then:
In training/eval mode, each decode step uses the attention network only once:
Input: H, current_decoder_hidden_state, previous_coverage (None for the first decode step)
Output: next coverage, next context, and attention weights (i.e. attn_dist in the code).
In decode mode, every step applies the attention mechanism twice:
(1) The first time:
Input: H, previous_decoder_hidden_state, previous_coverage (0s for the first decode step)
Output: modified previous context and next coverage (the attention weights are discarded here)
(2) The second time:
Input: H, current_decoder_hidden_state, next coverage
Output: next context and attention weights (next coverage is NOT updated here)
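To make the comparison concrete, here is a toy numpy sketch of both paths as I read them. toy_attention, the shapes, and all variable names are invented for illustration and heavily simplified (no decoder cell, no linear layers), so this is only my mental model of the control flow, not the repo's TensorFlow code:

```python
import numpy as np

def toy_attention(decoder_state, H, coverage):
    """Very simplified coverage attention: one score per encoder state;
    coverage is accumulated as the running sum of attention distributions."""
    scores = H @ decoder_state + (0.0 if coverage is None else coverage)
    attn_dist = np.exp(scores - scores.max())
    attn_dist /= attn_dist.sum()
    context = attn_dist @ H
    new_coverage = (np.zeros_like(attn_dist) if coverage is None else coverage) + attn_dist
    return context, attn_dist, new_coverage

hidden = 4
H = np.random.randn(6, hidden)                 # 6 encoder hidden states
decoder_states = np.random.randn(3, hidden)    # pretend these come from the decoder cell

# Training/eval mode: one attention_decoder call covers all decoder steps,
# so attention runs (and coverage is updated) exactly once per step.
coverage = None                                # None for the first decode step
for state in decoder_states:
    context, attn_dist, coverage = toy_attention(state, H, coverage)

# Decode mode: attention_decoder is called once per step with
# initial_state_attention=True, so attention runs twice inside each call.
prev_state = decoder_states[0]                 # previous step's decoder state
curr_state = decoder_states[1]                 # current step's decoder state
prev_coverage = np.zeros(6)                    # zeros for the first decode step

# (1) First call: recompute the previous context; this is the call that
#     advances coverage in decode mode.
prev_context, _, next_coverage = toy_attention(prev_state, H, prev_coverage)

# (2) Second call: compute the context/attention actually used for the
#     prediction, but coverage is NOT updated again here.
context, attn_dist, _ = toy_attention(curr_state, H, next_coverage)
```

If this sketch matches the intended behavior, my question is whether the coverage vector seen by the second call in decode mode is guaranteed to match the one used at the corresponding step in training/eval mode.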