Fastsafetensors #298

Open · wants to merge 3 commits into base: main
24 changes: 23 additions & 1 deletion Dockerfile.ubi
@@ -5,6 +5,7 @@ ARG PYTHON_VERSION=3.12
ARG TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
ARG vllm_fa_cmake_gpu_arches='80-real;90-real'


## Base Layer ##################################################################
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} as base
ARG PYTHON_VERSION
@@ -50,12 +51,22 @@ ENV CUDA_HOME="/usr/local/cuda" \
PATH="${CUDA_HOME}/bin:${PATH}" \
LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH}"


## Python cuda base #################################################################
FROM cuda-base AS python-cuda-base

ENV VIRTUAL_ENV=/opt/vllm
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# install numactl and common dependencies for fastsafetensors
RUN microdnf install autoconf automake libtool make rpm-build -y && \
microdnf download --source numactl.src && \
NUMACTL_V=$(rpm -qp --qf "%{VERSION}-%{RELEASE}\n" numactl-*.rpm | sort -V | tail -n 1) && \
rpm -i numactl-${NUMACTL_V}.src.rpm && \
rpmbuild -ba /root/rpmbuild/SPECS/numactl.spec && \
rpm -i /root/rpmbuild/RPMS/x86_64/{numactl-libs-${NUMACTL_V}.x86_64.rpm,numactl-${NUMACTL_V}.x86_64.rpm,numactl-devel-${NUMACTL_V}.x86_64.rpm} && \
microdnf clean all

# install cuda and common dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/uv \
@@ -80,6 +91,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
-r requirements-cuda.txt \
-r requirements-dev.txt


## Builder #####################################################################
FROM dev AS build

@@ -122,6 +134,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
CMAKE_BUILD_TYPE=Release \
python3 setup.py bdist_wheel --dist-dir=dist


#################### libsodium Build IMAGE ####################
FROM base as libsodium-builder

@@ -139,6 +152,7 @@ RUN curl -LO https://github.com/jedisct1/libsodium/releases/download/${LIBSODIUM
RUN CFLAGS="-O3 -Wall -Werror=format-security -Wno-unused-function -Wp,-D_GLIBCXX_ASSERTIONS -fstack-protector-strong -fstack-clash-protection -fcf-protection"\
./configure --prefix="/usr/" && make -j $MAX_JOBS && make check


## Release #####################################################################
FROM python-install AS vllm-openai
ARG PYTHON_VERSION
@@ -152,6 +166,7 @@ ENV PATH=$VIRTUAL_ENV/bin:$PATH
ENV LD_LIBRARY_PATH="${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages/nvidia/cuda_nvrtc/lib:${LD_LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages/nvidia/cuda_runtime/lib:${LD_LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages/nvidia/nvtx/lib:${LD_LIBRARY_PATH}"
ENV LD_LIBRARY_PATH="${VIRTUAL_ENV}/lib/python${PYTHON_VERSION}/site-packages/nvidia/cufile/lib:${LD_LIBRARY_PATH}"

# Triton needs a CC compiler
RUN microdnf install -y gcc \
@@ -202,14 +217,21 @@ WORKDIR /home/vllm
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]


## Last image ##################################################################
FROM vllm-openai as vllm-grpc-adapter

USER root

# Installing numactl and numactl-libs for fastsafetensors
RUN --mount=type=bind,from=python-cuda-base,src=/root/rpmbuild/RPMS/x86_64,target=/workspace/RPMS \
NUMACTL_V=$(rpm -qp --qf "%{VERSION}-%{RELEASE}\n" /workspace/RPMS/numactl-*.rpm | sort -V | tail -n 1) && \
rpm -i /workspace/RPMS/numactl-${NUMACTL_V}.x86_64.rpm /workspace/RPMS/numactl-libs-${NUMACTL_V}.x86_64.rpm

# Ensure correct vLLM version, vllm-tgis-adapter and cufile for fastsafetensors
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,from=build,src=/workspace/dist,target=/workspace/dist \
HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0
HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0 nvidia-cufile-cu12
@dtrifiro commented on Jan 31, 2025:

Looking at the fastsafetensors repo's pyproject.toml, nvidia-cufile-cu12 is not listed as a dependency. Should we open a PR there to add it as a dependency instead of manually adding it here?

@fialhocoelho (Author) replied:

The cufile lib isn't listed as a direct installation requirement, but it is needed because importing the main object pulls in a piece of code (fastcpp) that depends on cufile. The key point is that cufile is only required for using the library with GDS support, which isn't our use case; however, because of how the library is structured, we still need the dependency even when GDS is not used. According to the docs, setting nogds=True disables GDS support, but the cufile requirement is triggered immediately upon import. I believe the best course of action is to open an issue in the repo to address this condition (I can take care of that).
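For illustration, a minimal sketch of the pattern described above, mirroring the SafeTensorsFileLoader calls used by this PR's iterator; the model path is hypothetical and GDS stays disabled via nogds=True:

import torch
# Importing fastsafetensors already loads the fastcpp extension, which is the
# piece that links against libcufile -- hence the nvidia-cufile-cu12 wheel is
# needed even though GDS itself is never enabled below.
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

pg = SingleGroup()              # single-process group; no torch.distributed needed
device = torch.device("cuda:0")

# nogds=True keeps GPUDirect Storage disabled; weights go through the regular
# host bounce-buffer path instead.
loader = SafeTensorsFileLoader(pg, device, nogds=True, debug_log=False)
loader.add_filenames({0: ["/models/example/model-00001-of-00001.safetensors"]})  # hypothetical path
fb = loader.copy_files_to_device()
for name in fb.key_to_rank_lidx:
    tensor = fb.get_tensor(name)  # torch.Tensor, already on `device`
fb.close()
loader.close()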

@dtrifiro replied:

Sorry @fialhocoelho, I missed this reply.

Thanks for the explanation. Are we sure we're not interested in GDS, though? I don't know much about it, but it seems to be one of the performance improvements that fastsafetensors brings.

@fialhocoelho (Author) replied:

@dtrifiro I don't have much experience with how GDS would be installed in an OpenShift environment, since it operates at a low level; I've only used it in a DGX environment. I can look into it a bit.

@dtrifiro replied:

Looking at https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html#software-architecture, it doesn't seem like much is needed apart from cufile and the NVIDIA driver.


ENV GRPC_PORT=8033 \
PORT=8000 \
5 changes: 5 additions & 0 deletions docs/source/serving/weights_loading_with_fastsafetensor.rst
@@ -0,0 +1,5 @@
Loading model weights with fastsafetensors
===================================================================

Using the fastsafetensors library enables loading model weights to GPU memory by leveraging GPU Direct Storage. See https://github.com/foundation-model-stack/fastsafetensors for more details.
To enable this feature, set the environment variable ``USE_FASTSAFETENSOR`` to ``true``.
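For instance, a usage sketch not included in the proposed page (the model name is only illustrative); the variable has to be set before the engine starts loading weights:

import os
os.environ["USE_FASTSAFETENSOR"] = "true"   # must be set before weights are loaded

from vllm import LLM

llm = LLM(model="facebook/opt-125m")        # any model distributed as safetensors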

Reviewer comment:

Is this still relevant?

1 change: 1 addition & 0 deletions requirements-cuda.txt
@@ -8,3 +8,4 @@ torch == 2.5.1
# These must be updated alongside torch
torchvision == 0.20.1 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
xformers == 0.0.28.post3; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch 2.5.1
fastsafetensors # Required for model loading via gpu direct storage
17 changes: 13 additions & 4 deletions vllm/model_executor/model_loader/loader.py
Expand Up @@ -42,9 +42,10 @@
set_default_torch_dtype)
from vllm.model_executor.model_loader.weight_utils import (
download_safetensors_index_file_from_hf, download_weights_from_hf,
filter_duplicate_safetensors_files, filter_files_not_needed_for_inference,
get_gguf_extra_tensor_names, gguf_quant_weights_iterator,
initialize_dummy_weights, np_cache_weights_iterator, pt_weights_iterator,
fastsafetensors_weights_iterator, filter_duplicate_safetensors_files,
filter_files_not_needed_for_inference, get_gguf_extra_tensor_names,
gguf_quant_weights_iterator, initialize_dummy_weights,
np_cache_weights_iterator, pt_weights_iterator,
runai_safetensors_weights_iterator, safetensors_weights_iterator)
from vllm.model_executor.utils import set_weight_attrs
from vllm.platforms import current_platform
@@ -307,7 +308,15 @@ def _get_weights_iterator(
                hf_weights_files,
            )
        elif use_safetensors:
            weights_iterator = safetensors_weights_iterator(hf_weights_files)
            use_fastsafe_tensor = os.getenv('USE_FASTSAFETENSOR',
                                            'False').lower() == 'true'
            if use_fastsafe_tensor:
                logger.info("Using fastsafetensor for loading weights")
                weights_iterator = fastsafetensors_weights_iterator(
                    hf_weights_files)
            else:
                weights_iterator = safetensors_weights_iterator(
                    hf_weights_files)
        else:
            weights_iterator = pt_weights_iterator(hf_weights_files)

@dtrifiro commented on this hunk:

My suggestion here would be to enable loader extra config for the DefaultLoader: https://github.com/vllm-project/vllm/blob/e784c6b9984e8f8116f74000b863d941495acb0b/vllm/model_executor/model_loader/loader.py#L190

So that you can enable fastsafetensors and GDS with
--model-loader-extra-config='{"use_fastsafetensors": true, "enable_gds": true}'
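A rough sketch of what that suggestion could look like; the config keys use_fastsafetensors and enable_gds come from the suggested CLI flag and are assumptions, not existing vLLM options:

from typing import Any, Dict, Optional, Tuple


def resolve_fastsafetensors_flags(
        model_loader_extra_config: Optional[Dict[str, Any]]) -> Tuple[bool, bool]:
    """Read the proposed switches from --model-loader-extra-config
    instead of an environment variable (hypothetical keys)."""
    cfg = model_loader_extra_config or {}
    return (bool(cfg.get("use_fastsafetensors", False)),
            bool(cfg.get("enable_gds", False)))


# e.g. --model-loader-extra-config='{"use_fastsafetensors": true, "enable_gds": false}'
use_fastsafetensors, enable_gds = resolve_fastsafetensors_flags(
    {"use_fastsafetensors": True, "enable_gds": False})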

32 changes: 32 additions & 0 deletions vllm/model_executor/model_loader/weight_utils.py
@@ -14,6 +14,7 @@
import huggingface_hub.constants
import numpy as np
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup
from huggingface_hub import HfFileSystem, hf_hub_download, snapshot_download
from safetensors.torch import load_file, safe_open, save_file
from tqdm.auto import tqdm
@@ -418,6 +419,37 @@ def safetensors_weights_iterator(
yield name, param


def fastsafetensors_weights_iterator(
    hf_weights_files: List[str]
) -> Generator[Tuple[str, torch.Tensor], None, None]:
    """Iterate over the weights in the model safetensor files
    using the fastsafetensors library."""
    pg = SingleGroup()
    if torch.distributed.is_initialized():
        pg = torch.distributed.group.WORLD

    device = torch.device(f'cuda:{pg.rank()}')
    weight_files_sub_lists = [
        hf_weights_files[i:i + pg.size()]
        for i in range(0, len(hf_weights_files), pg.size())
    ]

    for f_list in weight_files_sub_lists:
        # nogds=True disables NVIDIA GDS support in fastsafetensors
        loader = SafeTensorsFileLoader(pg, device,
                                       nogds=True,
                                       debug_log=False)
        rank_file_map = {i: [f] for i, f in enumerate(f_list)}
        loader.add_filenames(rank_file_map)
        fb = loader.copy_files_to_device()
        keys = list(fb.key_to_rank_lidx.keys())
        for k in keys:
            t = fb.get_tensor(k)
            yield k, t
        fb.close()
        loader.close()
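A small driver showing how the iterator above would be consumed on a single GPU (file paths are hypothetical):

files = [
    "/models/example/model-00001-of-00002.safetensors",
    "/models/example/model-00002-of-00002.safetensors",
]
state_dict = {}
for name, tensor in fastsafetensors_weights_iterator(files):
    state_dict[name] = tensor  # tensors arrive already on the GPU
print(f"loaded {len(state_dict)} tensors")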


def runai_safetensors_weights_iterator(
hf_weights_files: List[str]
) -> Generator[Tuple[str, torch.Tensor], None, None]: