@baominghelly (Contributor)
Description

Add an Ascend Megatron-LM pretraining script.

Test evidence

```
 [2025-09-26 08:00:57] iteration        2/   10000 | consumed samples:           64 | elapsed time per iteration (ms): 6912.0 | average overall token/sec : 18963.0 | average token/sec/GPU : 2370.4 | learning rate: 6.000000E-08 | global batch size:    32 | TFLOPS per GPU: 145.651594 | lm loss: 1.118608E+01 | loss scale: 1.0 | grad norm: 16.350 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:04] iteration        3/   10000 | consumed samples:           96 | elapsed time per iteration (ms): 6903.5 | average overall token/sec : 18986.3 | average token/sec/GPU : 2373.3 | learning rate: 9.000000E-08 | global batch size:    32 | TFLOPS per GPU: 145.830070 | lm loss: 1.119505E+01 | loss scale: 1.0 | grad norm: 74.449 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:11] iteration        4/   10000 | consumed samples:          128 | elapsed time per iteration (ms): 6897.7 | average overall token/sec : 19002.4 | average token/sec/GPU : 2375.3 | learning rate: 1.200000E-07 | global batch size:    32 | TFLOPS per GPU: 145.954090 | lm loss: 1.119183E+01 | loss scale: 1.0 | grad norm: 15.605 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:18] iteration        5/   10000 | consumed samples:          160 | elapsed time per iteration (ms): 6935.7 | average overall token/sec : 18898.2 | average token/sec/GPU : 2362.3 | learning rate: 1.500000E-07 | global batch size:    32 | TFLOPS per GPU: 145.153511 | lm loss: 1.119216E+01 | loss scale: 1.0 | grad norm: 15.761 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:25] iteration        6/   10000 | consumed samples:          192 | elapsed time per iteration (ms): 6908.9 | average overall token/sec : 18971.6 | average token/sec/GPU : 2371.4 | learning rate: 1.800000E-07 | global batch size:    32 | TFLOPS per GPU: 145.717298 | lm loss: 1.119343E+01 | loss scale: 1.0 | grad norm: 74.597 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:32] iteration        7/   10000 | consumed samples:          224 | elapsed time per iteration (ms): 6900.9 | average overall token/sec : 18993.4 | average token/sec/GPU : 2374.2 | learning rate: 2.100000E-07 | global batch size:    32 | TFLOPS per GPU: 145.884806 | lm loss: 1.118990E+01 | loss scale: 1.0 | grad norm: 15.641 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:39] iteration        8/   10000 | consumed samples:          256 | elapsed time per iteration (ms): 6903.3 | average overall token/sec : 18986.8 | average token/sec/GPU : 2373.4 | learning rate: 2.400000E-07 | global batch size:    32 | TFLOPS per GPU: 145.834351 | lm loss: 1.117223E+01 | loss scale: 1.0 | grad norm: 74.422 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:46] iteration        9/   10000 | consumed samples:          288 | elapsed time per iteration (ms): 6901.5 | average overall token/sec : 18991.7 | average token/sec/GPU : 2374.0 | learning rate: 2.700000E-07 | global batch size:    32 | TFLOPS per GPU: 145.872126 | lm loss: 1.119384E+01 | loss scale: 1.0 | grad norm: 16.069 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-09-26 08:01:53] iteration       10/   10000 | consumed samples:          320 | elapsed time per iteration (ms): 6908.6 | average overall token/sec : 18972.2 | average token/sec/GPU : 2371.5 | learning rate: 3.000000E-07 | global batch size:    32 | TFLOPS per GPU: 145.721718 | lm loss: 1.117562E+01 | loss scale: 1.0 | grad norm: 16.322 | num zeros: 0 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration      10 to /home/libaoming/workplace/InfiniPerf/benchmarks/compatibility/Megatron-LM/checkpoints/llama2-7b_pretrain_WS8_TP4_PP2 in torch_dist format
```
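For reference, below is a minimal sketch of the kind of launch command a script like this typically wraps; it is not the exact script added in this PR. The world size, parallel layout (WS8, TP=4, PP=2), global batch size 32, 10000 train iterations, save interval 10, and torch_dist checkpoint format are taken from the log and checkpoint name above; the Llama-2-7B model dimensions, learning rate, data path, and tokenizer settings are illustrative placeholders, and the Ascend device setup (e.g. torch_npu adaptation) is assumed to be handled by the environment.

```bash
# Hypothetical launch sketch -- NOT the exact script added in this PR.
# WS8 / TP4 / PP2, global batch size 32, train iters, save interval, and
# torch_dist format come from the log above; model dims, lr, paths, and the
# tokenizer are placeholders for a Llama-2-7B pretraining run.
GPUS_PER_NODE=8     # 8 Ascend NPUs per node (WS8 in the checkpoint name)
CKPT_DIR=./checkpoints/llama2-7b_pretrain_WS8_TP4_PP2

torchrun --nproc_per_node $GPUS_PER_NODE pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2 \
    --num-layers 32 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 1 \
    --global-batch-size 32 \
    --train-iters 10000 \
    --lr 3.0e-4 \
    --bf16 \
    --save-interval 10 \
    --ckpt-format torch_dist \
    --save $CKPT_DIR \
    --data-path <preprocessed-data-prefix> \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model <path-to-tokenizer.model>
```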

baominghelly self-assigned this on Sep 26, 2025