Hi,
I tried to run one of my workload models for training on a CentOS cluster for MPI analysis. The commands I used and the resulting error are shown below. I would appreciate your help in resolving the issue.
Commands used:
mpiexec -ppn 1 -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi
mpirun -ppn 1 -l amplxe-cl -collect hotspots -k sampling-mode=hw -result-dir results -- ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver models/intel_optimized_models/multinode/resnet50_8nodes_2s/solver.prototxt --network tcp --netmask enp175s0 --benchmark mpi
I keep getting the following error.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 26 PID 72362 RUNNING AT node001
= EXIT STATUS: 255
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 27 PID 72363 RUNNING AT node001
= KILLED BY SIGNAL: 9 (Killed)