“This property ensures that the model can only attend to previous positions in the sequence, not future positions, in order to generate predictions sequentially.”
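The causal masking described in the quote above can be sketched minimally with NumPy (the function name and uniform scores are illustrative assumptions, not from the original): future positions are set to negative infinity before the softmax, so each position attends only to itself and earlier positions.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions (strict upper triangle) before the
    softmax, so position i can only attend to positions j <= i."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # numerically stable softmax along the last axis
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))  # uniform raw scores, purely for illustration
w = causal_attention_weights(scores)
# row i has nonzero weight only on columns 0..i:
# row 0 -> [1, 0, 0], row 1 -> [0.5, 0.5, 0], row 2 -> [1/3, 1/3, 1/3]
```

Because exp(-inf) is 0, masked positions receive exactly zero weight, which is equivalent to the additive -inf mask used in common attention implementations.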
“Regarding batched inference: I believe that all sequences in the batch are unrolled together, so there is no need to pad anything during inference; at the end you just have to slice the shorter outputs. I really couldn't find any reason for having the mask, and yet these "reference" implementations that