
[X86] Possible missed optimization with truncating vector cast #81883

Closed
okaneco opened this issue Feb 15, 2024 · 2 comments · Fixed by #83120

Comments


okaneco commented Feb 15, 2024

This function produces the following assembly
https://llvm.godbolt.org/z/q8YEhevbo

```llvm
define void @cast_i16x4_to_u8x4(ptr sret(<4 x i8>) %_0, ptr %x) unnamed_addr #0 {
  %1 = load <4 x i16>, ptr %x
  %2 = trunc <4 x i16> %1 to <4 x i8>
  store <4 x i8> %2, ptr %_0
  ret void
}
```

```asm
cast_i16x4_to_u8x4:                     # @cast_i16x4_to_u8x4
        mov     rax, rdi
        mov     rcx, qword ptr [rsi]
        movq    xmm0, rcx
        movdqa  xmmword ptr [rsp - 24], xmm0
        movzx   ecx, cl
        movzx   edx, byte ptr [rsp - 22]
        shl     edx, 8
        or      edx, ecx
        movzx   ecx, byte ptr [rsp - 20]
        shl     ecx, 16
        or      ecx, edx
        movzx   edx, byte ptr [rsp - 18]
        shl     edx, 24
        or      edx, ecx
        mov     dword ptr [rdi], edx
        ret
```

but I expected something like this instead.

```asm
.LCPI0_0:
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
cast_i16x4_to_u8x4:                     # @cast_i16x4_to_u8x4
        mov     rax, rdi
        movq    xmm0, qword ptr [rsi]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        packuswb        xmm0, xmm0
        movd    dword ptr [rdi], xmm0
        ret
```

Original Rust code https://rust.godbolt.org/z/WrKdjx9sa
Originally found in Rust portable-simd code for casting an i16x4 to u8x4 rust-lang/portable-simd#369 (comment)

llvmbot (Member) commented Feb 15, 2024

@llvm/issue-subscribers-backend-x86

Author: Collyn O'Kane (okaneco)


@RKSimon RKSimon self-assigned this Feb 16, 2024
RKSimon (Collaborator) commented Feb 16, 2024

Even x86-64-v4 scalarizes this, despite being able to use a simple zextload/truncstore combo

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Feb 27, 2024
…irectly

We were scalarizing these truncations, but in most cases we can widen the source vector to 128 bits and perform the truncation directly as a shuffle (which will usually lower as a PACK or PSHUFB).

For the cases where the widening and shuffle aren't legal, we can leave it to generic legalization to scalarize for us.

Fixes llvm#81883
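The "truncation as a shuffle" lowering the commit describes can be sketched in plain Rust (my illustration; the PSHUFB model and byte indices are my assumptions, not from the commit): widen the 8 source bytes of the `4 x i16` into a 16-byte register, then a byte shuffle that selects the low byte of each 16-bit lane *is* the truncation on little-endian x86.

```rust
// Model of PSHUFB: result[i] = v[mask[i] & 0x0F], or 0 if the mask byte's
// high bit is set.
fn pshufb(v: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    mask.map(|m| if m & 0x80 != 0 { 0 } else { v[(m & 0x0F) as usize] })
}

fn main() {
    let lanes: [i16; 4] = [0x1234, -1, 0x00FF, 0];

    // Widen: place the 8 source bytes in a 128-bit register, upper half zero.
    let mut wide = [0u8; 16];
    for (i, x) in lanes.iter().enumerate() {
        wide[2 * i..2 * i + 2].copy_from_slice(&x.to_le_bytes());
    }

    // Shuffle mask: pick bytes 0, 2, 4, 6 (the low byte of each i16 lane);
    // 0x80 zeroes the remaining lanes.
    let mut mask = [0x80u8; 16];
    mask[..4].copy_from_slice(&[0, 2, 4, 6]);

    // The shuffled low dword equals the element-wise truncation.
    let packed = pshufb(wide, mask);
    assert_eq!(&packed[..4], &lanes.map(|x| x as u8));
    println!("ok");
}
```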
fadlyas07 pushed a commit to greenforce-project/llvm-project that referenced this issue Feb 27, 2024
…irectly (llvm#83120)
