Here is a Friday post with so little information that it will probably be impossible to answer. I have some older Fortran code whose performance I'm trying to improve. VTune shows that 75% of serial execution time is spent calculating the numerical Jacobian of an expensive function. It's easy to parallelize. I first used MPI, which does show a modest improvement when a few processes are added, but it does not scale very well, probably because the large Jacobian matrix must be broadcast to all the processes.
So I also tried OpenMP, thinking that it might do slightly better since it does not need to broadcast the matrix. However, the serial code built with the OMP directives disabled runs 4 times faster than the code built with the directives enabled but using only one thread. If more threads are used, some improvement occurs, but it never gets better than the code without the OMP directives.
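For concreteness, the structure is roughly like the sketch below. This is not the actual code: the module, `func`, `num_jacobian`, the forward-difference step size, and the choice of `private` variables are all hypothetical stand-ins, assuming one column of the Jacobian is computed per loop iteration.

```fortran
! Hypothetical sketch, not the original code: a forward-difference
! Jacobian with its columns distributed across OpenMP threads.
module jac_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for the expensive function (assumption for illustration only).
  subroutine func(x, f)
    real(dp), intent(in)  :: x(:)
    real(dp), intent(out) :: f(:)
    integer :: i
    do i = 1, size(f)
      f(i) = x(i)**2 + sum(x)
    end do
  end subroutine func

  subroutine num_jacobian(x, jac)
    real(dp), intent(in)  :: x(:)
    real(dp), intent(out) :: jac(:,:)
    real(dp) :: f0(size(jac,1)), f1(size(jac,1)), xp(size(x))
    real(dp) :: h
    integer  :: j

    ! Baseline evaluation, shared read-only by all threads.
    call func(x, f0)

    ! Each iteration perturbs one variable and fills one column,
    ! so iterations are independent. xp, f1 and h are per-thread copies.
    !$omp parallel do default(shared) private(j, xp, f1, h) schedule(static)
    do j = 1, size(x)
      h = sqrt(epsilon(1.0_dp)) * max(abs(x(j)), 1.0_dp)
      xp = x
      xp(j) = x(j) + h
      call func(xp, f1)
      jac(:, j) = (f1 - f0) / h
    end do
    !$omp end parallel do
  end subroutine num_jacobian

end module jac_sketch

program demo
  use jac_sketch
  implicit none
  real(dp) :: x(3), jac(3,3)

  x = [1.0_dp, 2.0_dp, 3.0_dp]
  call num_jacobian(x, jac)
  ! Print the Jacobian one row per line.
  print '(3f12.6)', transpose(jac)
end program demo
```

With gfortran, for example, the comparison I describe would correspond to building this once without `-fopenmp` (the directives are then treated as comments) and once with `-fopenmp`, running the latter with `OMP_NUM_THREADS=1`.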
My question: Does OpenMP incur a large overhead even if only one thread is used so there is no forking?