Help the compiler vectorize adjacent_difference #4958

Merged
merged 27 commits into from
Oct 30, 2024

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Sep 14, 2024

📜 The approach

Two things prevented the original algorithm from being vectorized:

  • Loop-carried dependency: the previous input value is used as one of the operands.
    • It seems expected that the compiler doesn't eliminate this automagically; that would be too much of a transformation.
    • This was addressed by reading the input array twice per iteration instead of carrying the value through the loop.
  • Odd iterator pattern where the compiler cannot understand the iteration.
    • This seemed to me a strange limitation, so it was reported as DevCom-10742868.
    • This was addressed by using an integer index.

🛑 Correctness concern

The standard defines exact steps for this algorithm. The optimization alters the steps.
In particular, the standard wants the subtracted value to be saved from the previous iteration, rather than read again.
The two sections below explain the precautions taken to make the change unobservable, so I hope the change is correct.

✅ Checks for eligibility

The following checks were added:

  • No Aliasing (see below)
  • Iterators can be pointers
  • Source iterator is not volatile (read order is altered)
  • Trivially copyable (we skip copying where the standard asks for it)

There's no need to check for integral types or the like, since the compiler makes the final decision anyway, and it may be able to optimize even something that wouldn't pass a strict check.

⚠️ No Aliasing

Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.

Unlike the #4431 precedent, we can't yield to the compiler here. The compiler can insert an overlap check that skips the vectorized path and falls back to the scalar loop when the check fails, but:

  • We apply a transformation that would change the meaning of the program for overlapping ranges, and the meaning would change whether or not vectorization happens.
  • The checks that the compiler inserts may be too loose. For example, they may allow equal source and destination pointers, because they only verify that vectorization preserves the meaning of the already transformed loop, not of the original algorithm.

So we do our own checks.

Then we tell the compiler with __restrict that we have already checked, so it should not bother. This is done in a separate function, because a __restrict pointer promises no aliasing for its entire scope, so declaring __restrict within the original algorithm would apparently be a lie.

The extra check by the compiler, if not prevented, would slightly increase run time and add dead code.

😾 Compiler warnings

We have a great feature called integral promotion. Smaller types are promoted to int, and there is a warning about converting the result back. A local pragma suppresses these warnings in the benchmark, but not in the test.

@StephanTLavavej used a function object with static_cast to avoid warnings in the test.
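
A sketch of what such a function object might look like. The names `narrowing_minus` and `byte_differences` are hypothetical and this is not the code from the test. The subtraction still happens in `int` after promotion, and the explicit `static_cast` converts the result back without a conversion warning.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical function object: subtract, then cast back to the element type.
struct narrowing_minus {
    template <class T>
    constexpr T operator()(const T& a, const T& b) const noexcept {
        return static_cast<T>(a - b); // a - b has the promoted type for small T
    }
};

// Usage with 8-bit elements, where plain subtraction would be narrowed implicitly.
inline std::vector<std::uint8_t> byte_differences(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size());
    std::adjacent_difference(in.begin(), in.end(), out.begin(), narrowing_minus{});
    return out;
}
```

For unsigned element types the cast wraps modulo 2^N, so a negative difference such as 7 - 10 becomes 253 for `std::uint8_t`.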

⏱️ Benchmark results

| Benchmark | main | this | this + AVX2 |
|---|---|---|---|
| bm<uint8_t>/2255 | 745 ns | 563 ns | 562 ns |
| bm<uint16_t>/2255 | 799 ns | 83.3 ns | 75.1 ns |
| bm<uint32_t>/2255 | 731 ns | 154 ns | 141 ns |
| bm<uint64_t>/2255 | 805 ns | 293 ns | 272 ns |
| bm<float>/2255 | 751 ns | 154 ns | 123 ns |
| bm<double>/2255 | 753 ns | 304 ns | 233 ns |

🥇 Results interpretation

  • Overall, we're good 😸
  • The 8-bit case failed to vectorize for no apparent reason; reported as DevCom-10745948.
  • Still, the 8-bit case is noticeably better. I didn't analyze that, but it looks like a consistent thing, not codegen gremlins. I think it is a side effect of eliminating the loop-carried dependency, so the processor can parallelize and overlap iterations.
  • AVX2 is only slightly faster. I did not analyze this, but I think the memory wall is being hit here 🧱
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 14, 2024 10:33
@AlexGuteniev AlexGuteniev changed the title Help the compiler vectorize adjacent_differentce Sep 14, 2024
@StephanTLavavej StephanTLavavej added the performance Must go faster label Sep 15, 2024
@StephanTLavavej StephanTLavavej self-assigned this Sep 15, 2024
@CaseyCarter
Member

CaseyCarter commented Sep 15, 2024

  • 8-bit case failed to vectorize for no reason (didn't look up if it is known compiler issue, or to be reported)

Interestingly it vectorizes if we use - directly instead of indirecting through std::minus, or if the output is a pointer to int. Something to do with narrowing the result of the promoted operation, maybe?

@StephanTLavavej StephanTLavavej removed their assignment Oct 24, 2024
@StephanTLavavej StephanTLavavej self-assigned this Oct 24, 2024
@StephanTLavavej
Member

Thanks! 😻 I pushed minor nitpicks and a significant fix for C++14/17. Speedups look good on my 5950X:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| bm<uint8_t>/2255 | 968 ns | 967 ns | 1.00 |
| bm<uint16_t>/2255 | 917 ns | 97.2 ns | 9.43 |
| bm<uint32_t>/2255 | 648 ns | 158 ns | 4.10 |
| bm<uint64_t>/2255 | 689 ns | 331 ns | 2.08 |
| bm<float>/2255 | 646 ns | 158 ns | 4.09 |
| bm<double>/2255 | 652 ns | 332 ns | 1.96 |
@StephanTLavavej StephanTLavavej removed their assignment Oct 26, 2024
@StephanTLavavej StephanTLavavej self-assigned this Oct 29, 2024
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Member

I had to push an additional commit to fix the overlap check for heterogeneous types.

@StephanTLavavej StephanTLavavej merged commit 1990083 into microsoft:main Oct 30, 2024
39 checks passed
@StephanTLavavej
Member

Thanks for helping the compiler, said the author of the presentation, Don't Help The Compiler 😹 😻 🎉
