Description
Hello,
I encountered an issue while running a script and would appreciate your help.
I need to generate longer token sequences, so I changed the MAX_NUM_TOKENS value in include/flexflow/batch_config.h from its default (1024) to 16384. However, when I run the script with the sequence length increased to 1024, the program crashes with the following error:
```
{gpu}: CUDA error reported on GPU 0: an illegal memory access was encountered (CUDA_ERROR_ILLEGAL_ADDRESS)
python: /home/wutong/SpecInfer_24_new2/deps/legion/runtime/realm/cuda/cuda_module.cc:343: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
```
After debugging, I found that the issue originates in the serve_spec_infer function in src/runtime/request_manager.cc, where the call to get_void_result() causes the crash.
Steps I have tried:
1. Modified MAX_NUM_TOKENS and recompiled the program.
2. Checked GPU memory availability.
3. Reduced the sequence length: shorter sequences run without errors.
Despite these efforts, the error persists.
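For reference, this is how I have been trying to localize the faulting kernel, using standard CUDA tooling (CUDA_LAUNCH_BLOCKING and compute-sanitizer ship with the CUDA toolkit; the script name below is a placeholder for my actual serve script):

```shell
# Force synchronous kernel launches so the error is reported at the
# offending launch instead of at a later synchronization point.
export CUDA_LAUNCH_BLOCKING=1
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"

# Then run under NVIDIA's memory checker to get the exact kernel and
# address of the illegal access (placeholder script name):
# compute-sanitizer --tool memcheck python your_serve_script.py
```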
My questions:
1. Are there additional configuration values or files that need to be updated to support longer sequence lengths (such as 1024 or 2048)?
2. Could this error be related to GPU memory allocation or CUDA kernel settings? How can I further debug or resolve it?
Any guidance would be greatly appreciated. Thank you!