Hi all,
I need to profile an HPC application on multiple nodes with very low overhead. In the application code, I need to monitor MPI synchronization points (barrier, alltoall, etc.). I'm using the invariant TSC (RDTSC/RDTSCP instructions) because I cannot rely on clock_gettime() due to the high overhead of syscalls. I know that TSCs should be synchronized among cores and sockets on the same node, hence I should have no problems with intra-node timing synchronization.
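To illustrate the kind of low-overhead timestamp read described above, here is a minimal sketch. It assumes GCC or Clang on x86-64; the helper name read_tsc is hypothetical and not taken from any profiler.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp (GCC/Clang, x86 only) */

/* Hypothetical helper: read the TSC with RDTSCP.
   RDTSCP waits for earlier instructions to execute before reading the
   counter. It also returns the IA32_TSC_AUX value, which the OS sets
   (on Linux it is expected to encode the CPU number); it is ignored here. */
static inline uint64_t read_tsc(void)
{
    unsigned int aux;            /* receives IA32_TSC_AUX, unused */
    return (uint64_t)__rdtscp(&aux);
}
```

A region would then be timed by calling read_tsc() before and after the MPI call and subtracting the two values, which gives a duration in TSC ticks.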
But I have the following concerns:
1) How can I synchronize TSCs among different nodes with very fine-grained (sub-microsecond) accuracy? I think the developers of "Intel Trace Analyzer and Collector" must have had similar problems.
2) I assume that the TSCs on different nodes always increment at a fixed nominal frequency. Do you think that the clock oscillators driving invariant TSCs can drift slightly? I suppose so, but in that case, for long application runs, profilers on different nodes can produce inconsistent inter-node timing information. Moreover, if TSCs are affected by clock drift, I cannot reliably convert timestamps into seconds.
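To make the two concerns concrete, here is a rough sketch of one possible approach: a ping-pong offset estimate taken twice (for example at the start and end of a run), with a linear drift correction in between. This is only an illustration under stated assumptions, not how any existing tool does it. It assumes the network path is symmetric, so the remote stamp coincides with the midpoint of the local round trip. In practice the three timestamps would come from an MPI send/receive pair. All type and function names below are hypothetical.

```c
#include <stdint.h>

/* One ping-pong sample (hypothetical layout): local TSC before the send,
   remote TSC when the peer replied, local TSC after the receive. */
typedef struct {
    uint64_t local_send;
    uint64_t remote_stamp;
    uint64_t local_recv;
} pingpong_sample;

/* Offset (remote minus local) in ticks, assuming a symmetric path. */
static double estimate_offset(const pingpong_sample *s)
{
    double half_rtt = (double)(s->local_recv - s->local_send) / 2.0;
    /* Subtract as integers first so large counter values keep full precision. */
    int64_t raw = (int64_t)(s->remote_stamp - s->local_send);
    return (double)raw - half_rtt;
}

/* Relative drift: remote ticks gained per local tick between samples a and b. */
static double estimate_drift(const pingpong_sample *a, const pingpong_sample *b)
{
    double rtt_a = (double)(a->local_recv - a->local_send);
    double rtt_b = (double)(b->local_recv - b->local_send);
    /* Local ticks elapsed between the two round-trip midpoints. */
    double local_span = (double)(b->local_send - a->local_send)
                      + (rtt_b - rtt_a) / 2.0;
    return (estimate_offset(b) - estimate_offset(a)) / local_span;
}

/* Local ticks elapsed since sample a's midpoint for a given remote stamp,
   using remote - remote_a = (1 + drift) * (local - local_mid_a). */
static double remote_to_local_elapsed(uint64_t remote,
                                      const pingpong_sample *a, double drift)
{
    int64_t remote_elapsed = (int64_t)(remote - a->remote_stamp);
    return (double)remote_elapsed / (1.0 + drift);
}
```

Any asymmetry in the path shows up directly as an error in the offset, so in practice one would probably repeat the ping-pong many times and keep the sample with the smallest round trip.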
My target system is an HPC machine composed of dual-socket Broadwell nodes interconnected with an Omni-Path network.
Thanks to all in advance,
Daniele