
[X86] Possible missed optimization with truncating vector cast #81883

Closed
okaneco opened this issue Feb 15, 2024 · 2 comments · Fixed by #83120

Comments


okaneco commented Feb 15, 2024

This function produces the following assembly
https://llvm.godbolt.org/z/q8YEhevbo

```llvm
define void @cast_i16x4_to_u8x4(ptr sret(<4 x i8>) %_0, ptr %x) unnamed_addr #0 {
  %1 = load <4 x i16>, ptr %x
  %2 = trunc <4 x i16> %1 to <4 x i8>
  store <4 x i8> %2, ptr %_0
  ret void
}
```

```asm
cast_i16x4_to_u8x4:                     # @cast_i16x4_to_u8x4
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        movq    xmm0, rcx
        movdqa  xmmword ptr [rsp - 24], xmm0
        movzx   ecx, cl
        movzx   edx, byte ptr [rsp - 22]
        shl     edx, 8
        or      edx, ecx
        movzx   ecx, byte ptr [rsp - 20]
        shl     ecx, 16
        or      ecx, edx
        movzx   edx, byte ptr [rsp - 18]
        shl     edx, 24
        or      edx, ecx
        mov     dword ptr [rdi], edx
        ret
```

but I expected something like this instead.

```asm
.LCPI0_0:
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
cast_i16x4_to_u8x4:                     # @cast_i16x4_to_u8x4
        mov     rax, rdi
        movq    xmm0, qword ptr [rsi]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        packuswb        xmm0, xmm0
        movd    dword ptr [rdi], xmm0
        ret
```

Original Rust code https://rust.godbolt.org/z/WrKdjx9sa
Originally found in Rust portable-simd code for casting an i16x4 to u8x4 rust-lang/portable-simd#369 (comment)

llvmbot (Member) commented Feb 15, 2024

@llvm/issue-subscribers-backend-x86

Author: Collyn O'Kane (okaneco)


@RKSimon RKSimon self-assigned this Feb 16, 2024
RKSimon (Collaborator) commented Feb 16, 2024

Even x86-64-v4 scalarizes this, despite being able to use a simple zextload/truncstore combo

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 27, 2024
…irectly

We were scalarizing these truncations, but in most cases we can widen the source vector to 128 bits and perform the truncation directly as a shuffle (which will usually lower as a PACK or PSHUFB).

For the cases where the widening and shuffle aren't legal, we can leave it to generic legalization to scalarize for us.

Fixes llvm#81883
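The "truncation as a shuffle" lowering the commit describes can be sketched in plain Rust (my illustration; the PSHUFB model and byte indices are my assumptions, not from the commit): widen the 8 source bytes of the `4 x i16` into a 16-byte register, then a byte shuffle that selects the low byte of each 16-bit lane *is* the truncation on little-endian x86.

```rust
// Model of PSHUFB: result[i] = v[mask[i] & 0x0F], or 0 if the mask byte's
// high bit is set.
fn pshufb(v: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    mask.map(|m| if m & 0x80 != 0 { 0 } else { v[(m & 0x0F) as usize] })
}

fn main() {
    let lanes: [i16; 4] = [0x1234, -1, 0x00FF, 0];

    // Widen: place the 8 source bytes in a 128-bit register, upper half zero.
    let mut wide = [0u8; 16];
    for (i, x) in lanes.iter().enumerate() {
        wide[2 * i..2 * i + 2].copy_from_slice(&x.to_le_bytes());
    }

    // Shuffle mask: pick bytes 0, 2, 4, 6 (the low byte of each i16 lane);
    // 0x80 zeroes the remaining lanes.
    let mut mask = [0x80u8; 16];
    mask[..4].copy_from_slice(&[0, 2, 4, 6]);

    // The shuffled low dword equals the element-wise truncation.
    let packed = pshufb(wide, mask);
    assert_eq!(&packed[..4], &lanes.map(|x| x as u8));
    println!("ok");
}
```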
fadlyas07 pushed a commit to greenforce-project/llvm-project that referenced this issue Feb 27, 2024
…irectly (llvm#83120)
