Monday, April 29, 2013

OpenMP provides a directive called threadprivate that allows thread-local storage for certain variables. That means that each thread has its own copy of the variable. For instance:

int x = 0;
#pragma omp threadprivate(x)
#pragma omp parallel
{
    x++;
    // Here x has the value 1 in every thread, regardless of how many threads there are.
    // The next line prints n different addresses (n is the number of threads):
    // there are actually n different variables at n different memory locations.
    cout << hex << &x << endl;
}
// Here x is 1: the master thread's copy is the one visible outside the parallel
// region, and the master thread incremented it. The other threads' copies do not affect it.

C++11 provides a standard way to do this: the keyword thread_local. Things get complicated when the thread-local variable has a non-trivial constructor or destructor. For example:

struct A {
    int x;
    A(int _x) : x(_x) { x++; /* then print x */ }
    void some_method() { /* do something */ }
    ~A() { x--; /* then print x */ } 
};

thread_local A a(3);

int main() {
    #pragma omp parallel
    {
        a.some_method();
    }
}

If the number of threads is 2, then the program should print:

4
4
3
3

Each thread executes the constructor of a (with the parameter 3!) when it starts, and executes the destructor when it terminates. This behavior is specified in the standard. However, that is not what happens with the current implementation of OpenMP: only the constructors are executed. I looked at the code, and it appears to use some sort of thread pool. Even so, one would expect that when the program terminates, all the threads are terminated as well, including those in a thread pool.

When I tried equivalent code with pthreads, the correct behavior happened. I then compared the two disassemblies and found no significant difference, which hints that the problem is in the OpenMP runtime binaries rather than in the compiler, which is weird. I found a mention [1] in the GCC implementation of OpenMP (libgomp), and it appears that it does not support dynamic construction of thread-local storage (although it uses pthreads as its implementation).
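The comparison above can be approximated with the following sketch. It is not the pthreads code I actually ran: it uses std::thread (which typically sits on top of pthreads on Linux), and the counters constructed and destroyed as well as the helper run_with_std_thread are hypothetical names that stand in for the printing in the original struct.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical counters (not in the original example): they replace the
// printing in the constructor and destructor so the result can be checked.
std::atomic<int> constructed{0};
std::atomic<int> destroyed{0};

struct A {
    int x;
    explicit A(int _x) : x(_x) { x++; constructed++; }
    void some_method() { /* do something */ }
    ~A() { x--; destroyed++; }
};

thread_local A a(3);

// Starts num_threads workers that each touch their own copy of `a`,
// joins them, and reports whether every worker both constructed and
// destroyed its copy.
bool run_with_std_thread(int num_threads) {
    assert(num_threads > 0);
    const int made_before = constructed.load();
    const int gone_before = destroyed.load();

    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        // The first use of `a` in a thread triggers its construction there.
        workers.emplace_back([] { a.some_method(); });
    }
    // Thread-local destructors run as each worker exits, before join() returns.
    for (auto& w : workers) w.join();

    const int made = constructed.load() - made_before;
    const int gone = destroyed.load() - gone_before;
    return made == num_threads && gone == num_threads;
}
```

On an implementation that follows the standard, run_with_std_thread(2) should return true, since the threads are really terminated when they are joined. The calling thread never touches a here, so it does not construct a copy of its own.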

The destruction of the thread-local objects depends on a call to __cxa_thread_atexit [2], and for some reason this is called with pthreads but not with OpenMP. I do not know why, but if the behavior of thread_local depends on external libraries, beyond the compiler itself, then this is a serious threat to its adoption.

Conclusion: if you plan to use thread_local with classes having non-trivial destructors, don't (yet).
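One possible way around the problem, sketched here under the assumption that the object is only needed inside a single parallel region, is to give each thread an ordinary automatic object instead of a thread_local one. The counters and the helper run_region_local are hypothetical names used only for this illustration. If the code is compiled without OpenMP support, the pragma is ignored and the block simply runs once.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counters (not in the original example), used to check that
// constructions and destructions are balanced.
std::atomic<int> constructed{0};
std::atomic<int> destroyed{0};

struct A {
    int x;
    explicit A(int _x) : x(_x) { x++; constructed++; }
    void some_method() { /* do something */ }
    ~A() { x--; destroyed++; }
};

// Runs one parallel region in which every thread owns an automatic A.
// Returns how many objects were constructed during the region.
int run_region_local() {
    const int made_before = constructed.load();
    const int gone_before = destroyed.load();

    #pragma omp parallel
    {
        A a(3);            // one object per thread, automatic storage
        a.some_method();
    }                      // each thread runs ~A() when it leaves the block

    const int made = constructed.load() - made_before;
    const int gone = destroyed.load() - gone_before;
    // Sanity check: every object constructed in the region was destroyed.
    assert(made == gone);
    return made;
}
```

The destructor now runs at the end of the block in each thread, so it no longer depends on how the OpenMP runtime shuts down its threads. The price is that the object does not survive from one parallel region to the next.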
