Skip to content

Conversation

@loadams
Copy link
Collaborator

@loadams loadams commented Sep 28, 2023

This reverts commit 388c848.

@tjruwase
Copy link
Contributor

@delock and @Yejing-Lai do you have any idea on the problem with #4263? Thanks!

@delock
Copy link
Collaborator

delock commented Sep 29, 2023

@delock and @Yejing-Lai do you have any idea on the problem with #4263? Thanks!

Hi @tjruwase @loadams Do you have a description to the problem with #4263 ?

@loadams
Copy link
Collaborator Author

loadams commented Sep 29, 2023

@delock and @Yejing-Lai do you have any idea on the problem with #4263? Thanks!

Hi @tjruwase @loadams Do you have a description to the problem with #4263 ?

@delock - it appears after that PR was merged we see the cpu_inference tests either failing or failing and timing out at the 6 hour mark. I'm not sure why/how this passed on the merge queue, but reverting the PR fixes it, so I'm trying to isolate what the change was that caused this, or if these are new test failures that should be skipped. Let me know if you have any thoughts on this?

This looks to be when it ran in the merge queue, it seems to have failed but somehow GitHub marked that as a pass? Its also running more tests than the previous one that ran in the merge queue before it. Is it worth skipping these tests for now, or should we revert then re-apply this PR?

@delock
Copy link
Collaborator

delock commented Sep 30, 2023

Hi @loadams, sorry for breaking the workflow. From the error message (https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538), it means the UT tries to start test with FP16 data format which is not supported on CPU. How did it escaped merge CI is not clear to me. A proper fix should skip these tests for FP16.

From the error log, TestModelTask should have the following check as other tests. I need to run a few tests in my fork to verify.

if dtype not in get_accelerator().supported_dtypes():                                                
    pytest.skip(f"Acceleraor {get_accelerator().device_name()} does not support {dtype}.")           

@delock and @Yejing-Lai do you have any idea on the problem with #4263? Thanks!

Hi @tjruwase @loadams Do you have a description to the problem with #4263 ?

@delock - it appears after that PR was merged we see the cpu_inference tests either failing or failing and timing out at the 6 hour mark. I'm not sure why/how this passed on the merge queue, but reverting the PR fixes it, so I'm trying to isolate what the change was that caused this, or if these are new test failures that should be skipped. Let me know if you have any thoughts on this?

This looks to be when it ran in the merge queue, it seems to have failed but somehow GitHub marked that as a pass? Its also running more tests than the previous one that ran in the merge queue before it. Is it worth skipping these tests for now, or should we revert then re-apply this PR?

@delock
Copy link
Collaborator

delock commented Oct 1, 2023

@loadams @tjruwase This PR fix the not implemented error
#4430
This fix is verified on my local repo, cpu_inference can pass.
delock#15

Hi @loadams, sorry for breaking the workflow. From the error message (https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538), it means the UT tries to start test with FP16 data format which is not supported on CPU. How did it escaped merge CI is not clear to me. A proper fix should skip these tests for FP16.

From the error log, TestModelTask should have the following check as other tests. I need to run a few tests in my fork to verify.

if dtype not in get_accelerator().supported_dtypes():                                                
    pytest.skip(f"Acceleraor {get_accelerator().device_name()} does not support {dtype}.")           

@delock and @Yejing-Lai do you have any idea on the problem with #4263? Thanks!

Hi @tjruwase @loadams Do you have a description to the problem with #4263 ?

@delock - it appears after that PR was merged we see the cpu_inference tests either failing or failing and timing out at the 6 hour mark. I'm not sure why/how this passed on the merge queue, but reverting the PR fixes it, so I'm trying to isolate what the change was that caused this, or if these are new test failures that should be skipped. Let me know if you have any thoughts on this?
This looks to be when it ran in the merge queue, it seems to have failed but somehow GitHub marked that as a pass? Its also running more tests than the previous one that ran in the merge queue before it. Is it worth skipping these tests for now, or should we revert then re-apply this PR?

@loadams
Copy link
Collaborator Author

loadams commented Oct 2, 2023

Thanks, @delock - I'll close this PR and we can move any other discussion to your PR. I'm still not sure how this showed as passed on the GitHub checks, but thanks for looking into this.

@loadams loadams closed this Oct 2, 2023
github-merge-queue bot pushed a commit that referenced this pull request Jan 9, 2024
This PR fix UT test error as described in this PR and the following test
job. This PR skips `TestModelTask` if dtype is not supported by
accelerator, or `InferenceBuilder` is not implemented by accelerator.
#4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
This PR fix UT test error as described in this PR and the following test
job. This PR skips `TestModelTask` if dtype is not supported by
accelerator, or `InferenceBuilder` is not implemented by accelerator.
deepspeedai#4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants