We encountered a problem when migrating a code from Intel MPI 4.1.3.049 to 5.0.3.048. The code in question is a complex simulation that first reads the global input state from disk into several parts in memory and then accesses this memory in a hard-to-predict fashion to create a new decomposition. We use active target RMA for this (on machines that support it, such as BG/Q, we also use passive target), since a rank might need data from the part held by another rank to form its halo. Since the mesh we work on stores edges, cell centers, and vertices in different data structures, we have 3 corresponding MPI_Win_create calls per mesh. For local refinement the number of these meshes can be increased as needed; the problem I'm about to describe happens with 3 nested meshes.
With Intel MPI 5.0.3.048 the program locks up in this code: some ranks are stuck in one MPI_Win_fence for one of the RMA windows, while others are stuck in the next MPI_Win_fence (about 30 lines of code later) for another RMA window. Both windows belong to the same mesh, but that should not matter to MPI.
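To illustrate the structure in which the hang occurs, here is a minimal sketch of the pattern. The window names, buffer sizes, and access pattern are invented for illustration (the real code has three windows per mesh and does much more between the fences); it is not a reproducer of the actual application.

/* Minimal sketch of the communication pattern only -- not the real code.
 * Window and variable names are invented for illustration. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* In the real application there are three windows per mesh
     * (edges, cell centers, vertices); two are enough to show the shape. */
    const int n = 1024;
    double *edges = malloc(n * sizeof(double));
    double *cells = malloc(n * sizeof(double));
    MPI_Win win_edges, win_cells;
    MPI_Win_create(edges, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win_edges);
    MPI_Win_create(cells, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win_cells);

    double halo[1];
    int neighbor = (rank + 1) % nranks;

    /* First active target epoch; all ranks call the fences in the same order. */
    MPI_Win_fence(0, win_edges);
    MPI_Get(halo, 1, MPI_DOUBLE, neighbor, 0, 1, MPI_DOUBLE, win_edges);
    MPI_Win_fence(0, win_edges);   /* in the real code, some ranks are stuck at a fence like this */

    /* About 30 source lines later in the real code: a second epoch on
     * another window of the same mesh. */
    MPI_Win_fence(0, win_cells);
    MPI_Get(halo, 1, MPI_DOUBLE, neighbor, 0, 1, MPI_DOUBLE, win_cells);
    MPI_Win_fence(0, win_cells);   /* the remaining ranks are stuck at a fence like this */

    MPI_Win_free(&win_edges);
    MPI_Win_free(&win_cells);
    free(edges);
    free(cells);
    MPI_Finalize();
    return 0;
}

All ranks execute both epochs in the same order, so as far as I understand the MPI standard the fences should match and this pattern should not be able to deadlock.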
The same code works with Intel MPI 4.1.3.049 and with Open MPI 1.6.5.
Is this a known problem? And how small would a demonstration code have to be to serve as a useful reproducer?