The total feature dimension of CMU-MOSEI in your paper is 883 (768 for text, 80 for audio, and 35 for visual). In your code, the concatenated features are first passed through a transformer (named SeqContext), and the num_heads search range for this transformer is [7, 15].
But 883 is a prime number, so no head count in that range can divide it.
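A quick check (assuming the [7, 15] range is inclusive) confirms that no candidate head count divides 883:

# 768 + 80 + 35 = 883; list every nhead in the search range that divides it
print([h for h in range(7, 16) if 883 % h == 0])  # prints [] -> none works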
For example, with nhead=14 the encoder layer cannot even be constructed:

import torch.nn as nn

# d_model = 768 (text) + 80 (audio) + 35 (visual) = 883
nn.TransformerEncoderLayer(
    d_model=883,
    nhead=14,
)
# raises: AssertionError: embed_dim must be divisible by num_heads
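For reference, one possible workaround is to project the 883-dim concatenated features to a head-divisible width before the SeqContext transformer. This is only a sketch under my own assumptions (the module name ProjectedSeqContext and the hidden size 896, which is divisible by both 7 and 14, are hypothetical), not your implementation:

import torch
import torch.nn as nn

class ProjectedSeqContext(nn.Module):
    """Hypothetical sketch: linear projection to a head-divisible width."""
    def __init__(self, in_dim=883, hidden_dim=896, nhead=14):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)  # 883 -> 896
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):  # x: (batch, seq_len, 883)
        return self.encoder(self.proj(x))

x = torch.randn(2, 10, 883)
print(ProjectedSeqContext()(x).shape)  # torch.Size([2, 10, 896])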