Why is the training speed so slow? #72

@tang-doudou1105


I am training LLaVA-OneVision-1.5-4B-stage0 on 184 A800 GPUs with 120 million training samples. The training configuration is as follows:
TP="${1:-1}"
PP="${2:-1}"
SEQ_LEN="${3:-8192}"
MBS="${4:-1}"
GBS="${5:-5888}"
NSTEP="${6:-23000}"
Training is expected to take about 7 days. Why is it so slow, and is there anything I should change?
Additionally, I noticed that MBS only supports a value of 1. Could this be related?
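
For reference, here is a rough back-of-the-envelope sketch of what the settings above imply. It assumes the usual Megatron-style relation GBS = MBS x data-parallel size x gradient-accumulation steps, that all 184 GPUs are used, and that the 7 days covers all NSTEP steps. The function and key names are illustrative only and are not taken from the training script.

```python
def derive_schedule(num_gpus, tp, pp, mbs, gbs, nstep, seq_len,
                    num_samples, train_days):
    """Derive step-level quantities from the launch settings (sketch only)."""
    # With TP=1 and PP=1 every GPU is its own data-parallel rank.
    dp_size = num_gpus // (tp * pp)

    # Assumed relation: GBS = MBS * DP * gradient-accumulation steps.
    if gbs % (mbs * dp_size) != 0:
        raise ValueError("GBS must be divisible by MBS * DP size")
    grad_accum_steps = gbs // (mbs * dp_size)

    samples_seen = gbs * nstep
    seconds_per_step = train_days * 24 * 3600 / nstep

    return {
        "dp_size": dp_size,
        "grad_accum_steps": grad_accum_steps,
        "samples_seen": samples_seen,
        "epochs": samples_seen / num_samples,
        # Upper bound: reached only if every sample fills SEQ_LEN tokens.
        "max_tokens_per_step": gbs * seq_len,
        "seconds_per_step": seconds_per_step,
        "samples_per_sec_per_gpu": gbs / seconds_per_step / num_gpus,
    }


if __name__ == "__main__":
    # Values copied from the configuration in this issue.
    result = derive_schedule(
        num_gpus=184, tp=1, pp=1, mbs=1, gbs=5888, nstep=23000,
        seq_len=8192, num_samples=120_000_000, train_days=7,
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

Under these assumptions, each GPU runs 32 micro-batches of size 1 per optimizer step, the run sees about 135.4 million samples (a little over one pass through the 120 million), and 7 days corresponds to roughly 26.3 seconds per step.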
Thanks
