Thanks for your excellent work!
In stage 1, the dataset used is LLaVA-558K-Webdataset, which has 558K samples. However, NSTEP and GBS are set to 2500 and 8 respectively in the file stage_1_alignment_llava_ov_4b.sh. Does this mean that stage 1 uses only 20K samples (2500 × 8) to train the projector?
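
For reference, here is the arithmetic behind my estimate. This is only a sketch: I am assuming NSTEP is the number of training steps, GBS is the global batch size (samples consumed per step), and the dataset holds roughly 558,000 samples. I have not verified how the script actually interprets these variables.

```python
# Assumptions (not confirmed against the training code):
#   NSTEP = number of training steps
#   GBS   = global batch size, i.e. samples consumed per step
NSTEP = 2500
GBS = 8
DATASET_SIZE = 558_000  # approximate size of LLaVA-558K

# Total samples seen during stage 1 under these assumptions
samples_seen = NSTEP * GBS

# Fraction of the dataset covered by those samples
fraction = samples_seen / DATASET_SIZE

# Steps that one full pass over the data would need at this batch size
steps_for_one_epoch = -(-DATASET_SIZE // GBS)  # ceiling division

print(f"samples seen:        {samples_seen}")
print(f"fraction of dataset: {fraction:.2%}")
print(f"steps for one epoch: {steps_for_one_epoch}")
```

If these assumptions hold, a full pass over the data would need far more steps than 2500 at GBS = 8, which is why I am asking whether the short schedule is intentional.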