Fastsafetensors #298
base: main
Conversation
Signed-off-by: Jefferson Fialho <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fialhocoelho. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.

@fialhocoelho: The following tests failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```diff
@@ -307,7 +308,15 @@ def _get_weights_iterator(
                 hf_weights_files,
             )
         elif use_safetensors:
-            weights_iterator = safetensors_weights_iterator(hf_weights_files)
+            use_fastsafe_tensor = os.getenv('USE_FASTSAFETENSOR',
```
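The env-var gate in the diff is truncated, but the intent can be sketched with stdlib Python only. The variable name `USE_FASTSAFETENSOR` comes from the diff; treating `"true"`/`"1"` (case-insensitive) as enabled is an assumption about the PR's parsing, not confirmed by the visible hunk:

```python
import os

def use_fastsafetensors() -> bool:
    # Env-var name taken from the PR diff. Which values count as
    # "enabled" is an assumption, since the diff is cut off here.
    value = os.getenv("USE_FASTSAFETENSOR", "false")
    return value.strip().lower() in ("true", "1")
```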
My suggestion here would be to enable loader extra config for the DefaultLoader: https://github.com/vllm-project/vllm/blob/e784c6b9984e8f8116f74000b863d941495acb0b/vllm/model_executor/model_loader/loader.py#L190
So that you can enable fastsafetensors and GDS with
--model-loader-extra-config='{"use_fastsafetensors": true, "enable_gds": true}'
```diff
 RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=cache,target=/root/.cache/uv \
     --mount=type=bind,from=build,src=/workspace/dist,target=/workspace/dist \
-    HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0
+    HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0 nvidia-cufile-cu12
```
Looking at the fastsafetensors repo's pyproject.toml, `nvidia-cufile-cu12` is not listed as a dependency. Should we open a PR there to add it as a dep instead of manually adding it here?
The `cufile` lib isn't listed as a direct installation requirement, but it's necessary because importing the main object relies on a piece of code (`fastcpp`) that depends on `cufile`. The key point is that `cufile` is required for using the library with GDS support, which isn't our use case. However, due to how the lib is structured, we still need this dependency even if we're not using GDS. According to the docs, setting `nogds=True` disables GDS support, but the `cufile` requirement is triggered immediately upon import. I believe the best course of action is to open an issue in the repo to address this condition (I can take care of that).
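One way to address the condition described here is to defer the cufile-dependent import until GDS is actually requested, so a plain import no longer fails without `cufile`. This is a sketch only: the module name below is illustrative, not fastsafetensors' actual layout, where the dependency lives in the `fastcpp` extension:

```python
import importlib

_gds_backend = None
_gds_checked = False

def _load_gds_backend():
    # Lazy, cached import of a cufile-dependent backend. The module
    # name "fastcpp_gds_stub" is a placeholder for illustration.
    global _gds_backend, _gds_checked
    if not _gds_checked:
        _gds_checked = True
        try:
            _gds_backend = importlib.import_module("fastcpp_gds_stub")
        except ImportError:
            _gds_backend = None
    return _gds_backend

def open_weights(nogds: bool = True) -> str:
    # With nogds=True, no cufile-dependent code is ever imported.
    if nogds:
        return "bounce-buffer path (no cufile needed)"
    if _load_gds_backend() is None:
        raise RuntimeError("GDS requested but the cufile backend is unavailable")
    return "gds path"
```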
Sorry @fialhocoelho, I missed this reply.
Thanks for the explanation. Are we sure we're not interested in GDS though? I don't know much about it, but it seems to be one of the performance improvements that `fastsafetensors` brings.
@dtrifiro I don’t have much experience with how GDS would be installed in an OpenShift env, since it operates at a low level. I’ve only used it in a DGX environment. I can look into it a bit.
Looking at https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html#software-architecture it doesn't seem like there's much needed apart from `cufile` and the NVIDIA driver.
Using the fastsafetensors library enables loading model weights to GPU memory by leveraging GPU Direct Storage. See https://github.com/foundation-model-stack/fastsafetensors for more details. To enable this feature, set the environment variable ``USE_FASTSAFETENSOR`` to ``true``.
Is this still relevant?
This PR adds support for using `fastsafetensors`.

Implementation considerations

`Dockerfile.ubi`:

- The `numactl-devel` package is required for installing `fastsafetensors`. However, it is not available in the repository (only `numactl-libs` is). To address this, we compile `numa` (`numactl`, `numactl-devel`, and `numactl-libs`) and install it in the `python-cuda-base` layer immediately before the installation of the packages in `requirements-cuda.txt` (where `fastsafetensors` is listed for installation).
- In the `vllm-openai` layer, we add the `LD_LIBRARY_PATH` environment variable to include the path for the `nvidia-cufile` library, which is required for using `fastsafetensors`. Normally, `cufile` is installed alongside `cuda-toolkit` (version >12) or `nvidia-GDS`. Even though we are not using GDS, the binaries compiled by `fastsafetensors` require the `cufile` library to be loaded.
- In the final layer (`vllm-grpc-adapter`), we install `numactl` and `numactl-libs`, which were compiled in the `python-cuda-base` layer, to ensure the `fastsafetensors` library works properly with the same version used in the compilation step.
- Lastly, we install the `nvidia-cufile` package, which was already linked in the previous step/layer. While it may not be "aesthetically pleasing" to install the library alongside the `RUN uv pip install` command that installs the adapter, I also don't find it very functional to create a `requirements-fastsafetensors.txt` file or something similar, considering this is the only library besides the `numa` dependencies.

`vllm/model_executor/model_loader/weight_utils.py`:

- The `SafeTensorsFileLoader()` function initializes `fastsafetensors` and includes the `nogds` parameter, which defaults to `False`. According to the docs, when set to `True`, it disables support for `nvidia-gds`. Since we do not use `nvidia-gds`, I set it to `True`.
- If `nogds` is `False` and `gds` is unavailable, the loader will fail. However, during testing, the function worked with `nogds` set to both `True` and `False`. It might be necessary to conduct more tests/benchmarks to evaluate the functionality and performance impact of using this library.
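The iterator selection this PR describes can be mocked with stdlib Python only. The iterator bodies below are stand-ins (the real code calls vLLM's `safetensors_weights_iterator` and a fastsafetensors-backed loader), but the branching, an env-var gate plus `nogds=True`, mirrors what the description says:

```python
import os
from collections.abc import Iterator

def _safetensors_iter(files: list[str]) -> Iterator[str]:
    # Stand-in for vLLM's existing safetensors_weights_iterator.
    for f in files:
        yield f"{f}: safetensors"

def _fastsafetensors_iter(files: list[str], nogds: bool = True) -> Iterator[str]:
    # Stand-in for a fastsafetensors-backed iterator; nogds=True mirrors
    # the PR's choice to run without GPU Direct Storage.
    for f in files:
        yield f"{f}: fastsafetensors(nogds={nogds})"

def get_weights_iterator(files: list[str]) -> Iterator[str]:
    # Gate on the env var introduced by the PR diff.
    if os.getenv("USE_FASTSAFETENSOR", "false").lower() == "true":
        return _fastsafetensors_iter(files, nogds=True)
    return _safetensors_iter(files)
```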