Homework 10

Details

The goal of this assignment is to play around with Open MP to learn what you can about private versus shared variables, atomics and critical sections, and the various scheduling options.

Note: You may discuss strategies and ideas with each other on this assignment, but please don't just blindly follow any of your classmates. Your goal is to gain a better understanding of Open MP.

The file matmul_par.c is a parallel implementation of the matrix multiplication algorithm. Download it, place it in the directory with the files from the Open MP videos and modify the makefile so it will compile it (You will need to copy two lines and change the details and make 2 other additions to the file).

This file is a start of a parallel implementation but it is buggy and inefficient. Your job is to make it as efficient as possible. In other words, make the speedup ([time for 1 thread]/[time for 24 threads]) as high as possible. Note: It is broken right now so you will need to do some changes before it will even work with multiple threads.

The code looks a bit strange because the arrays are declared as 1-dimensional instead of 2-dimensional arrays, but they are treated as 2-dimensional arrays the way they are indexed. You shouldn't have to worry about this, though, since you probably won't be making any changes related to those details.

You should play around with shared versus private variables, using critical/atomic regions, changing the scheduling, etc.

Create a document in which you list every change you tried with as much detail as necessary to make it clear exactly what you changed (but as brief as possible). For instance, write something like added schedule(auto) to the parallel for loop. Then give the results for that change (the output of the program). document everything you try!. Your document should be quite large by the time you are done.

Attempt to get your speedup to 18-20, and even higher if possible.

When you are done optimizing your code, change the line that reads

#define ORDER 1000

#define ORDER 2000

and run your algorithm again and base your final analysis on this. (This should make your algorithm take about 1 minute on 1 core, and 2-3 seconds on 24 cores.)

Submit matmul_par.c using Webhandin 385-HW10.

Also submit your document with all of the details of your experiments. At the beginning of your document add a summary that includes what the best speedup you could attain was and how you did it. Explain why (to the best of your knowledge) the things you did gave you the improvements. Be as detailed as possible. Then include a table that summarizes all of your attempted improvements that shows the speedup with 24 threads for each change and the time it takes for 24 threads. (In other words, it should have 3 columns: A brief description of the change, the speedup attained with 24 threads, and the time for 24 threads.) Since this is likely to be a long document, print out just the summary page or two, but submit the entire document using Webhandin.