Merge EmbeddedLLM/vllm-rocm into vLLM main #1749
Conversation
* port dtype_float16.cuh and cache_kernels.cu
* port dtype_bfloat16.cuh
* port attention_utils.cuh
* port more kernels
* fix typo
* add cuda_compat.h
* sync branches
* update
* update
* fixes
* cleanup
* update
* update
* update
* fmt
* cleanup
* refactor
* update
* detecting rocm and adding flag for compiling
* using asm volatile instead of hip api
* using asm volatile for type casting of f16

Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Amir Balwel <[email protected]>
Thank you for upstreaming this! We will review soon.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
Please make a new Dockerfile.
&& git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git \
&& cd flash-attention \
&& git submodule update --init \
&& sed -i -e "s/--offload-arch=native/--offload-arch=$(/opt/rocm/llvm/bin/amdgpu-offload-arch)/g" setup.py \
Thank you for the pull request.
This line is a no-op, since I don't see any reference to offload-arch in the setup.py file. As a result, when I tested this pull request and built the Docker image using this Dockerfile, the build failed.
Can you check the setup.py file?
/opt/rocm/bin/hipcc -I/app/libs/flash-attention/csrc/flash_attn_rocm -I/app/libs/flash-attention/csrc/flash_attn_rocm/src -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.10/include/python3.10 -c -c /app/libs/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /app/libs/flash-attention/build/temp.linux-x86_64-cpython-310/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
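For illustration, one way to make that check explicit in the build itself: a minimal guard, assuming the flash-attention checkout lives under /app/libs/flash-attention as in the log above, that fails fast when the sed target is missing instead of silently becoming a no-op.

```dockerfile
# Illustrative guard only (not part of this PR): abort the image build early
# if setup.py no longer contains the flag that the sed command is meant to rewrite.
RUN grep -q -- "--offload-arch=native" /app/libs/flash-attention/setup.py \
    || (echo "--offload-arch=native not found in setup.py; the sed would be a no-op" && false)
```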
We implemented a temporary solution during the build process with ROCm/flash-attention@edc7698. The issue with the hardcoded --offload-arch=native has been resolved by commit ROCm/flash-attention@5f1ae07, so the temporary fix appears to be no longer necessary. After testing the most recent version of flash-attention, we plan to revise the Dockerfile accordingly.
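To make the workaround concrete, here is a hedged sketch of the pattern under discussion: substitute an explicitly chosen GPU architecture for the hardcoded native value instead of detecting it inside the build container, where the clang++ error above shows amdgpu-arch may find no device. The GFX_ARCH build argument and its gfx90a default are illustrative assumptions, and which file actually carries the flag depends on the flash-attention revision; setup.py is used here only to mirror the sed line in this PR.

```dockerfile
# Illustrative sketch, not the Dockerfile from this PR: select the target GPU
# architecture at build time, e.g.
#   docker build --build-arg GFX_ARCH=gfx90a -f Dockerfile.rocm .
ARG GFX_ARCH=gfx90a

# Rewrite the hardcoded "native" offload arch to the requested target before building.
RUN cd /app/libs/flash-attention \
    && sed -i "s/--offload-arch=native/--offload-arch=${GFX_ARCH}/g" setup.py \
    && python3 setup.py install
```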
Yes, use a specific commit of a named branch to achieve a stable and reproducible result rather than the default branch, since the default branch may keep changing. Regarding the name of the Dockerfile, you might want to rename it to Dockerfile.rocm_xxx, with xxx corresponding to the ROCm version you are using.
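As a hedged illustration of that suggestion, here is a fragment of what a pinned, ROCm-specific Dockerfile (for example Dockerfile.rocm_5.7, matching the rocm5.7 base image above) might look like; the FLASH_ATTENTION_COMMIT argument is a placeholder introduced for this sketch and would be set to the revision that was actually tested.

```dockerfile
# Hypothetical Dockerfile.rocm_5.7 fragment: pin dependencies to fixed revisions
# so the image builds reproducibly even if upstream default branches move.
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1

# Required build argument: the flash-attention commit that was actually tested.
ARG FLASH_ATTENTION_COMMIT

RUN git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git /app/libs/flash-attention \
    && cd /app/libs/flash-attention \
    && git checkout "${FLASH_ATTENTION_COMMIT}" \
    && git submodule update --init
```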
@hongxiayang @WoosukKwon @simon-mo
The full version is merged in #1836!