# SSE & AVX Vectorization

Marchete


## First AVX Code: SQRT calculation

Now that we have reviewed the requirements, autovectorization, and the AVX intrinsics, we can write our first manually vectorized program. In this exercise, you will vectorize a square-root calculation over float numbers. We will store the floats directly in the `__m256` datatype, which reduces the overhead of loading data.
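To make the datatype concrete, here is a minimal sketch of how one `__m256` packs 8 floats and how a single `_mm256_sqrt_ps` call processes all of them. The helper name `sqrt8` is hypothetical and not part of the playground; the function-level `target("avx")` attribute is an assumption that lets the snippet compile with GCC or Clang without extra flags.

```cpp
#include <immintrin.h>  // __m256 and the _mm256_* intrinsics

// One __m256 is 32 bytes: 8 packed single-precision floats.
static_assert(sizeof(__m256) == 8 * sizeof(float), "__m256 holds 8 floats");

// Hypothetical helper (not from the playground): packs in[0..7] into one
// register, computes all 8 square roots with one intrinsic call, and
// writes the lanes back to out[0..7] in the same order.
__attribute__((target("avx")))
void sqrt8(const float* in, float* out) {
    // _mm256_setr_ps takes its arguments in memory order: lane 0 = in[0].
    __m256 v = _mm256_setr_ps(in[0], in[1], in[2], in[3],
                              in[4], in[5], in[6], in[7]);
    _mm256_storeu_ps(out, _mm256_sqrt_ps(v));  // unaligned store of 8 floats
}
```

Note that `_mm256_set_ps` (without the `r`) takes the same values in reverse order, highest lane first, which is an easy source of bugs when filling registers by hand.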

Vectorized SQRT

    #pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags
    #pragma GCC option("arch=native","tune=native","no-zeroupper") //Extra options (not a documented GCC pragma, so likely ignored)
    #pragma GCC target("avx") //Enable AVX
    #include <x86intrin.h> //SSE/AVX intrinsics
    #include <bits/stdc++.h> //All main STD libraries

    const int N = 64000000; //Number of tests
    const int V = N/8; //Vectorized size

    //Linear (scalar) data
    float linear[N];

    __attribute__((optimize("no-tree-vectorize"))) //Force disable auto-vectorization
    inline void normal_sqrt()
    {
        for (int i = 0; i < N; ++i)
            linear[i] = sqrtf(linear[i]);
    }

    //Exercise 1: Create a vectorized version of normal_sqrt().
    //Please note the following:
    // "vectorized" array is size V=N/8, because each __m256 variable holds 8 floats.
    // sqrtf(const float& f) vectorized function is: _mm256_sqrt_ps(const __m256& v)
    __m256 __attribute__((aligned(32))) vectorized[V]; //Vectorized array

    inline void avx_sqrt()
    {
        //****** Add AVX code here*******
    }

    using namespace std;
    using namespace std::chrono;

    high_resolution_clock::time_point now = high_resolution_clock::now();
    #define TIME duration_cast<duration<double>>(high_resolution_clock::now() - now).count()

    int main()
    {
        //Data initialization
        for (int i = 0; i < N; ++i) { linear[i] = ((float)i) + 0.1335f; }
        for (int i = 0; i < V; ++i) {
            for (int v = 0; v < 8; ++v)
            { vectorized[i][v] = ((float)(i*8+v)) + 0.1335f; }
        }

        //Normal sqrt benchmarking. 20*64 Million Sqrts
        now = high_resolution_clock::now();
        for (int i = 0; i < 20; ++i)
            normal_sqrt();
        double linear_time = TIME;
        cerr << "Normal sqrtf: " << linear_time << endl;

        //AVX vectorized sqrt benchmarking. 20*8*8 Million Sqrts
        now = high_resolution_clock::now();

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In this exercise you will probably see a performance improvement of 600% or more: once the data is loaded, AVX can run up to 7 times faster than normal `sqrtf`. The theoretical limit is 8x, because each instruction processes 8 floats, but it is rarely achieved. In general, you can expect an average increase between 300% and 600%.
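The "once you have the data loaded" caveat matters when the floats live in a plain `float` array rather than in `__m256` variables. The following is a sketch of that variant, not the playground's solution: it loads 8 floats at a time with explicit unaligned loads, handles the leftover elements with scalar code, and falls back to the scalar loop if the CPU lacks AVX. All function names here are hypothetical, and `__builtin_cpu_supports` is a GCC/Clang builtin, so this assumes one of those compilers on x86.

```cpp
#include <immintrin.h>  // AVX intrinsics (_mm256_*)
#include <cassert>      // assert
#include <cmath>        // std::sqrt
#include <cstddef>      // std::size_t

// Hypothetical helper: scalar reference, one float per iteration.
void sqrt_scalar(float* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) data[i] = std::sqrt(data[i]);
}

// Hypothetical helper: AVX version, 8 floats per iteration.
__attribute__((target("avx")))
void sqrt_avx(float* data, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);            // load 8 floats (no alignment required)
        _mm256_storeu_ps(data + i, _mm256_sqrt_ps(v));   // 8 square roots, stored back in place
    }
    for (; i < n; ++i) data[i] = std::sqrt(data[i]);     // scalar tail when n is not a multiple of 8
}

// Hypothetical dispatcher: uses AVX only when the running CPU supports it.
void sqrt_inplace(float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    if (__builtin_cpu_supports("avx")) sqrt_avx(data, n);
    else sqrt_scalar(data, n);
}
```

Calling `sqrt_inplace(buffer, count)` replaces each element with its square root. Because every iteration pays for a load and a store, the speedup of this variant is typically lower than when the data already sits in `__m256` variables, as in the exercise.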
