Thanks to the LMDeploy team for providing AWQ quantization support (https://github.com/InternLM/lmdeploy/blob/main/docs/en/multi_modal/xcomposer2d5.md#quantization). We compare the GPU memory usage of the FP16 model and the 4-bit model, setting `cache_max_entry_count=0.01` to shrink the KV cache allocation so that the savings from weight quantization are easier to observe. The comparison was run with PyTorch 2.2.2+cu118.
| Model | GPU Memory (GB) |
| --- | --- |
| IXC2.5-lmdeploy | 31.81 |
| IXC2.5-lmdeploy-4bit | 23.21 |
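
Below is a minimal sketch of how such a run can be reproduced with the LMDeploy pipeline API. The model ID (`internlm/internlm-xcomposer2d5-7b-4bit`) and the prompt are illustrative assumptions, not values taken from the measurement above.

```python
# Minimal sketch: load the 4-bit AWQ model with a tiny KV cache budget.
from lmdeploy import pipeline, TurbomindEngineConfig

# Keep the KV cache small so measured memory mostly reflects the model weights.
engine_config = TurbomindEngineConfig(
    model_format='awq',           # load 4-bit AWQ weights
    cache_max_entry_count=0.01,   # fraction of GPU memory reserved for the KV cache
)

# Assumed model ID for illustration; use the FP16 repo and drop
# `model_format='awq'` to measure the FP16 baseline instead.
pipe = pipeline('internlm/internlm-xcomposer2d5-7b-4bit',
                backend_config=engine_config)

response = pipe('Describe this image in one sentence.')
print(response.text)
```

Memory usage for each configuration can then be observed with `nvidia-smi` while the pipeline is loaded.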