I have 184 A800 GPUs and 120 million training samples, and I am training LLaVA-OneVision-1.5-4B-stage0. The training configuration is as follows:
TP="${1:-1}"
PP="${2:-1}"
SEQ_LEN="${3:-8192}"
MBS="${4:-1}"
GBS="${5:-5888}"
NSTEP="${6:-23000}"
The estimated training time is 7 days. Why is it taking so long? Is there anything I should change?
Additionally, I noticed that MBS only supports a value of 1. Could this be related?
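For reference, here is a rough back-of-the-envelope check of what the configuration above implies. It is only a sketch: it assumes pure data parallelism (data-parallel size = GPUs / (TP * PP)), a 7-day run, and that every sequence is packed to the full SEQ_LEN, so the token figures are upper bounds. The derived names (DP, GRAD_ACC, and so on) are illustrative and do not come from the training script.

```shell
#!/usr/bin/env bash
# Values copied from the post; DAYS is the 7-day estimate.
NUM_GPUS=184
TP=1
PP=1
SEQ_LEN=8192
MBS=1
GBS=5888
NSTEP=23000
DAYS=7

# Assumption: all GPUs not used for TP/PP act as data-parallel replicas.
DP=$(( NUM_GPUS / (TP * PP) ))
# Micro-batches each GPU processes sequentially per optimizer step.
GRAD_ACC=$(( GBS / (MBS * DP) ))
# Samples consumed over the whole run.
TOTAL_SAMPLES=$(( GBS * NSTEP ))
# Wall-clock budget per optimizer step (integer seconds, rounded down).
SEC_PER_STEP=$(( DAYS * 86400 / NSTEP ))
# Upper bound on tokens per step, if every sequence fills SEQ_LEN.
MAX_TOKENS_PER_STEP=$(( GBS * SEQ_LEN ))
# Upper bound on tokens per second per GPU implied by the 7-day estimate.
TOK_PER_GPU_SEC=$(( MAX_TOKENS_PER_STEP * NSTEP / (DAYS * 86400) / NUM_GPUS ))

echo "data-parallel size:        $DP"
echo "grad accumulation steps:   $GRAD_ACC"
echo "samples over the run:      $TOTAL_SAMPLES"
echo "seconds per step:          $SEC_PER_STEP"
echo "max tokens per step:       $MAX_TOKENS_PER_STEP"
echo "max tokens/s per GPU:      $TOK_PER_GPU_SEC"
```

Under these assumptions, each GPU runs 32 micro-batches of one sequence per step, the run consumes about 135 million samples (a little over one pass through 120 million), and 7 days corresponds to roughly 26 seconds per step.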
Thanks