Comparison of OpenMP SIMD to Auto-Vectorization

This paper is my seminar thesis at RWTH Aachen.

What is parallel programming?

According to Moore's law, the number of transistors on a chip doubles roughly every two years. Nowadays, however, Moore's law seems to be reaching its limits. Hardware designers have therefore turned to putting many cores into a system, and programmers apply parallel programming to make use of them.

Compilers, however, already try to optimize code and exploit parallelism automatically, without the programmer noticing. So why do we need OpenMP?

OpenMP is a popular standard paradigm in shared memory programming for high-performance computing to increase application performance and programmer productivity.

In large code bases, it is hard for the compiler to find parallelization opportunities on its own. OpenMP therefore lets the programmer declare exactly where the code should be parallelized (explicit vector programming).
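To illustrate what explicit vector programming looks like, here is a minimal sketch in C. The function names saxpy_simd and prefix_sum are made up for this example and do not come from the paper. In the first loop the compiler alone cannot prove that x and y do not overlap, so the pragma is the programmer's promise that the iterations are independent. The second loop has a real loop-carried dependence and is deliberately left without a plain simd pragma.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
   "#pragma omp simd" tells the compiler that the iterations are
   independent and may be executed in SIMD lanes. The caller must
   make sure that x and y do not overlap. */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical counter-example: each iteration reads the result of the
   previous one, so a plain "#pragma omp simd" here would be incorrect. */
void prefix_sum(size_t n, float *a)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```

With GCC the pragma is honored when compiling with -fopenmp or -fopenmp-simd, and with ICC with -qopenmp or -qopenmp-simd. Without such a flag the pragma is ignored and the code still runs correctly as a scalar loop.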

I was lucky enough to run my performance tests on the HPC cluster at my university, which is on the TOP500 list.

My paper gives a brief introduction to SIMD and compares the performance of compiler auto-vectorization with OpenMP SIMD parallelization. I used the Mandelbrot set to analyze the performance gains.

The Mandelbrot set is generated by iteration. For a single point, each step depends on the previous result, but different points are independent of one another, which makes it an excellent example to parallelize.

f_c(z) = z^2 + c

The Mandelbrot set consists of the complex numbers c for which this quadratic polynomial does not diverge to infinity when iterated starting from z = 0.
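The definition can be tried out for a single point with a small sketch like the one below. The function name escape_time is made up for this example. It relies on the standard fact that the iteration is guaranteed to diverge once |z| exceeds 2, so it compares |z|^2 against 4.

```c
/* Hypothetical helper: iterates z -> z^2 + c starting from z = 0 for
   c = cx + i*cy. Returns the number of steps performed before |z|^2
   reaches 4, or max_iter if that never happens within max_iter steps. */
int escape_time(double cx, double cy, int max_iter)
{
    double zx = 0.0, zy = 0.0;
    int i = 0;
    while (i < max_iter && zx * zx + zy * zy < 4.0) {
        double next_zx = zx * zx - zy * zy + cx; /* real part of z^2 + c */
        zy = 2.0 * zx * zy + cy;                 /* imaginary part of z^2 + c */
        zx = next_zx;
        i++;
    }
    return i;
}
```

For c = 0 the value stays at 0 forever, so the function returns max_iter and the point is treated as part of the set. For c = 1 + i the iteration gives 1 + i and then 1 + 3i, which is already outside the radius, so the function returns 2.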

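/* One flat loop over all pixels; j is the pixel index. */
for (j = 0; j < HEIGHT * WIDTH; j++) {
    x = j % WIDTH;
    y = j / WIDTH;
    /* Map the pixel to a point c = rx + i*ry in the complex plane. */
    ry = y_min + y * ph;
    rx = X_MIN + x * pw;
    zx = 0.0;
    zy = 0.0;
    zx2 = 0.0;
    zy2 = 0.0;
    /* Iterate z = z^2 + c until |z|^2 >= 4 or MAX_ITER is reached. */
    for (i = 0; i < MAX_ITER && ((zx2 + zy2) < 4); i++) {
        zy = 2 * zx * zy + ry;
        zx = zx2 - zy2 + rx;
        zx2 = zx * zx;
        zy2 = zy * zy;
    }
    /* Turn the iteration count into an RGBA color. */
    image[j * 4] = 255 - (cos(i * PI / (double)MAX_ITER) + 1) / 2 * 255;
    image[j * 4 + 1] = 255 - (sin(i * PI / (double)MAX_ITER) + 1) / 3 * 255;
    image[j * 4 + 2] = 255 - (i / (double)MAX_ITER) * 255;
    image[j * 4 + 3] = 255;
}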

This is the part of the code that was parallelized, and its performance was analyzed before and after parallelization.

Conclusion

This experiment demonstrates several programming principles:

Code optimization is essential because some loops cannot be vectorized as written, so programmers should be aware of such obstacles.
Know the specifications of the machine, because the hardware sets limits. If we keep increasing the number of threads past that point, performance gets worse and the program effectively ends up serializing instead of parallelizing. According to Amdahl's Law, the achievable speedup is limited by the serial fraction of the program, so each additional processor contributes less and less.
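Amdahl's Law can be written as S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of processors. The sketch below evaluates this formula. The function name amdahl_speedup is made up for this example, and the formula models only the serial fraction, not overheads such as thread management or memory bandwidth.

```c
/* Hypothetical helper: theoretical speedup according to Amdahl's Law.
   p is the parallelizable fraction of the run time (0 <= p <= 1),
   n is the number of processors (n >= 1). */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

As a purely illustrative value that is not taken from the paper, p = 0.95 gives a speedup of about 14.3 on 48 processors, and no number of processors can push it past 1 / (1 - p) = 20.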

Beyond 48 threads, no further performance gain was observed. Compared with a single thread, the best multi-threaded run achieved a speedup of 15.1x with ICC and 19.2x with GCC. This research aimed to identify effective methods for achieving high performance through the compiler and high-quality code.

If you are interested in reading my paper: CLICK HERE
