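The dropout uses the inverted version, but in the predict stage it still scales the output. With inverted dropout the kept activations are already scaled by 1/(1 - p) during training, so the predict stage should pass the output through unchanged.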
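For reference, here is a minimal sketch of what inverted dropout is expected to do, written with NumPy. The function name `inverted_dropout` and its parameters (`drop_prob`, `training`, `rng`) are hypothetical and are not taken from the code being discussed; it only illustrates that the scaling belongs in the training branch and that the predict branch should be an identity.

```python
import numpy as np


def inverted_dropout(x, drop_prob, training, rng=None):
    """Illustrative inverted dropout (hypothetical helper, not the original code)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")

    # Predict stage: return the input unchanged, with no extra scaling.
    if not training or drop_prob == 0.0:
        return x

    keep_prob = 1.0 - drop_prob
    rng = np.random.default_rng() if rng is None else rng

    # Keep each unit with probability keep_prob.
    mask = rng.random(x.shape) < keep_prob

    # Scale the kept units at training time so the expected value matches x.
    return x * mask / keep_prob
```

If the predict stage multiplies by `keep_prob` (or divides by it) on top of this, the output is scaled twice relative to training, which would match the behaviour described above.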