Channel: Clusters and HPC Technology

mpiexec.hydra 2019u4 crashes on AMD Zen2


Hello,

The mpiexec.hydra binary from Intel 2019 Update 4 crashes on both Zen1 and Zen2 platforms.


user@Zen1[pts/0]stream $ mpirun -np 2   /vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

/vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103:  7399 Floating point exception(core dumped) mpiexec.hydra "$@" 0<&0

 

user@Zen2[pts/1]demo $ mpirun -np 2   /vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/IMB-MPI1

/vend/intel/parallel_studio_xe_2019_update4/compilers_and_libraries_2019.4.243/linux/mpi/intel64/bin/mpirun: line 103: 121108 Floating point exception(core dumped) mpiexec.hydra "$@" 0<&0

An strace shows that mpiexec.hydra crashes while trying to parse the processor configuration. I believe the cpuinfo binary suffers from the same problem.

...

openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, [{d_ino=37, d_off=1, d_reclen=24, d_name=".", d_type=DT_DIR}, {d_ino=9, d_off=2690600, d_reclen=24, d_name="..", d_type=DT_DIR}, {d_ino=170171, d_off=25909499, d_reclen=24, d_name="smt", d_type=DT_DIR}, {d_ino=90582, d_off=25909675, d_reclen=24, d_name="cpu0", d_type=DT_DIR}, {d_ino=90600, d_off=25909851, d_reclen=24, d_name="cpu1", d_type=DT_DIR}, {d_ino=90619, d_off=25910027, d_reclen=24, d_name="cpu2", d_type=DT_DIR}, {d_ino=90638, d_off=25910203, d_reclen=24, d_name="cpu3", d_type=DT_DIR}, {d_ino=90657, d_off=25910379, d_reclen=24, d_name="cpu4", d_type=DT_DIR}, {d_ino=90676, d_off=25910555, d_reclen=24, d_name="cpu5", d_type=DT_DIR}, {d_ino=90695, d_off=25910731, d_reclen=24, d_name="cpu6", d_type=DT_DIR}, {d_ino=90714, d_off=25910907, d_reclen=24, d_name="cpu7", d_type=DT_DIR}, {d_ino=90733, d_off=25911083, d_reclen=24, d_name="cpu8", d_type=DT_DIR}, {d_ino=90752, d_off=141151836, d_reclen=24, d_name="cpu9", d_type=DT_DIR}, {d_ino=222492, d_off=141566558, d_reclen=32, d_name="cpufreq", d_type=DT_DIR}, {d_ino=82070, d_off=285014906, d_reclen=32, d_name="cpuidle", d_type=DT_DIR}, {d_ino=90771, d_off=285015082, d_reclen=32, d_name="cpu10", d_type=DT_DIR}, {d_ino=90790, d_off=285015258, d_reclen=32, d_name="cpu11", d_type=DT_DIR}, {d_ino=90809, d_off=285015434, d_reclen=32, d_name="cpu12", d_type=DT_DIR}, {d_ino=90828, d_off=285015610, d_reclen=32, d_name="cpu13", d_type=DT_DIR}, {d_ino=90847, d_off=285015786, d_reclen=32, d_name="cpu14", d_type=DT_DIR}, {d_ino=90866, d_off=285015962, d_reclen=32, d_name="cpu15", d_type=DT_DIR}, {d_ino=90885, d_off=285016138, d_reclen=32, d_name="cpu16", d_type=DT_DIR}, {d_ino=90904, d_off=285016314, d_reclen=32, d_name="cpu17", d_type=DT_DIR}, {d_ino=90923, d_off=285016490, d_reclen=32, d_name="cpu18", d_type=DT_DIR}, {d_ino=90942, d_off=285016842, d_reclen=32, d_name="cpu19", d_type=DT_DIR}, {d_ino=90961, d_off=285017018, d_reclen=32, d_name="cpu20", d_type=DT_DIR}, {d_ino=90980, d_off=285017194, d_reclen=32, 
d_name="cpu21", d_type=DT_DIR}, {d_ino=90999, d_off=285017370, d_reclen=32, d_name="cpu22", d_type=DT_DIR}, {d_ino=91018, d_off=285017546, d_reclen=32, d_name="cpu23", d_type=DT_DIR}, {d_ino=91037, d_off=285017722, d_reclen=32, d_name="cpu24", d_type=DT_DIR}, {d_ino=91056, d_off=285017898, d_reclen=32, d_name="cpu25", d_type=DT_DIR}, {d_ino=91075, d_off=285018074, d_reclen=32, d_name="cpu26", d_type=DT_DIR}, {d_ino=91094, d_off=285018250, d_reclen=32, d_name="cpu27", d_type=DT_DIR}, {d_ino=91113, d_off=285018426, d_reclen=32, d_name="cpu28", d_type=DT_DIR}, {d_ino=91132, d_off=285018778, d_reclen=32, d_name="cpu29", d_type=DT_DIR}, {d_ino=91151, d_off=285018954, d_reclen=32, d_name="cpu30", d_type=DT_DIR}, {d_ino=91170, d_off=285019130, d_reclen=32, d_name="cpu31", d_type=DT_DIR}, {d_ino=91189, d_off=285019306, d_reclen=32, d_name="cpu32", d_type=DT_DIR}, {d_ino=91208, d_off=285019482, d_reclen=32, d_name="cpu33", d_type=DT_DIR}, {d_ino=91227, d_off=285019658, d_reclen=32, d_name="cpu34", d_type=DT_DIR}, {d_ino=91246, d_off=285019834, d_reclen=32, d_name="cpu35", d_type=DT_DIR}, {d_ino=91265, d_off=285020010, d_reclen=32, d_name="cpu36", d_type=DT_DIR}, {d_ino=91284, d_off=285020186, d_reclen=32, d_name="cpu37", d_type=DT_DIR}, {d_ino=91303, d_off=285020362, d_reclen=32, d_name="cpu38", d_type=DT_DIR}, {d_ino=91322, d_off=285020714, d_reclen=32, d_name="cpu39", d_type=DT_DIR}, {d_ino=91341, d_off=285020890, d_reclen=32, d_name="cpu40", d_type=DT_DIR}, {d_ino=91360, d_off=285021066, d_reclen=32, d_name="cpu41", d_type=DT_DIR}, {d_ino=91379, d_off=285021242, d_reclen=32, d_name="cpu42", d_type=DT_DIR}, {d_ino=91398, d_off=285021418, d_reclen=32, d_name="cpu43", d_type=DT_DIR}, {d_ino=91417, d_off=285021594, d_reclen=32, d_name="cpu44", d_type=DT_DIR}, {d_ino=91436, d_off=285021770, d_reclen=32, d_name="cpu45", d_type=DT_DIR}, {d_ino=91455, d_off=285021946, d_reclen=32, d_name="cpu46", d_type=DT_DIR}, {d_ino=91474, d_off=285022122, d_reclen=32, d_name="cpu47", 
d_type=DT_DIR}, {d_ino=91493, d_off=285022298, d_reclen=32, d_name="cpu48", d_type=DT_DIR}, {d_ino=91512, d_off=285022650, d_reclen=32, d_name="cpu49", d_type=DT_DIR}, {d_ino=91531, d_off=285022826, d_reclen=32, d_name="cpu50", d_type=DT_DIR}, {d_ino=91550, d_off=285023002, d_reclen=32, d_name="cpu51", d_type=DT_DIR}, {d_ino=91569, d_off=285023178, d_reclen=32, d_name="cpu52", d_type=DT_DIR}, {d_ino=91588, d_off=285023354, d_reclen=32, d_name="cpu53", d_type=DT_DIR}, {d_ino=91607, d_off=285023530, d_reclen=32, d_name="cpu54", d_type=DT_DIR}, {d_ino=91626, d_off=285023706, d_reclen=32, d_name="cpu55", d_type=DT_DIR}, {d_ino=91645, d_off=285023882, d_reclen=32, d_name="cpu56", d_type=DT_DIR}, {d_ino=91664, d_off=285024058, d_reclen=32, d_name="cpu57", d_type=DT_DIR}, {d_ino=91683, d_off=285024234, d_reclen=32, d_name="cpu58", d_type=DT_DIR}, {d_ino=91702, d_off=285024586, d_reclen=32, d_name="cpu59", d_type=DT_DIR}, {d_ino=91721, d_off=285024762, d_reclen=32, d_name="cpu60", d_type=DT_DIR}, {d_ino=91740, d_off=285024938, d_reclen=32, d_name="cpu61", d_type=DT_DIR}, {d_ino=91759, d_off=285025114, d_reclen=32, d_name="cpu62", d_type=DT_DIR}, {d_ino=91778, d_off=318580955, d_reclen=32, d_name="cpu63", d_type=DT_DIR}, {d_ino=47, d_off=385790491, d_reclen=32, d_name="power", d_type=DT_DIR}, {d_ino=57, d_off=661204875, d_reclen=40, d_name="vulnerabilities", d_type=DT_DIR}, {d_ino=46, d_off=718872595, d_reclen=32, d_name="modalias", d_type=DT_REG}, {d_ino=42, d_off=900028725, d_reclen=32, d_name="kernel_max", d_type=DT_REG}, {d_ino=40, d_off=1321717208, d_reclen=32, d_name="possible", d_type=DT_REG}, {d_ino=39, d_off=1412398250, d_reclen=32, d_name="online", d_type=DT_REG}, {d_ino=43, d_off=1431608070, d_reclen=32, d_name="offline", d_type=DT_REG}, {d_ino=44, d_off=1472641949, d_reclen=32, d_name="isolated", d_type=DT_REG}, {d_ino=38, d_off=1826905203, d_reclen=32, d_name="uevent", d_type=DT_REG}, {d_ino=45, d_off=1905639739, d_reclen=32, d_name="nohz_full", d_type=DT_REG}, 
{d_ino=197551, d_off=2084586514, d_reclen=32, d_name="microcode", d_type=DT_DIR}, {d_ino=41, d_off=2147483647, d_reclen=32, d_name="present", d_type=DT_REG}], 32768) = 2496
getdents(3, [], 32768)                  = 0
close(3)                                = 0
uname({sysname="Linux", nodename="SERVER", release="3.10.0-1062.1.2.el7.x86_64", version="#1 SMP Mon Sep 30 14:19:46 UTC 2019", machine="x86_64", domainname="houston"}) = 0
sched_getaffinity(0, 128, [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]) = 128
--- SIGFPE {si_signo=SIGFPE, si_code=FPE_INTDIV, si_addr=0x44d325} ---
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)
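
The trace ends with SIGFPE and si_code=FPE_INTDIV, which is an integer divide error (typically a division by zero), immediately after the listing of /sys/devices/system/cpu and the sched_getaffinity call. To see what those two sources report on an affected node, independent of Intel MPI, a small sketch along these lines could be used. The function names are hypothetical, and this does not reproduce the internal logic of mpiexec.hydra, which is not visible in the trace.

```python
import os
import re

CPU_DIR = "/sys/devices/system/cpu"


def cpu_ids_from_names(names):
    """Return sorted CPU numbers from directory entries such as 'cpu0' or 'cpu63'.

    Entries like 'cpufreq', 'cpuidle' or 'online' are ignored.
    """
    ids = []
    for name in names:
        match = re.fullmatch(r"cpu(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def main():
    # Same directory that the strace shows being read with getdents().
    sysfs_ids = cpu_ids_from_names(os.listdir(CPU_DIR))
    # Same information as the sched_getaffinity() call in the trace (Linux only).
    affinity = sorted(os.sched_getaffinity(0))
    print("cpuN entries in sysfs:", len(sysfs_ids))
    print("CPUs in affinity mask:", len(affinity))
    if sysfs_ids != affinity:
        print("mismatch between sysfs and affinity mask")


if __name__ == "__main__":
    main()
```

On the node traced above, both counts should be 64, since the directory listing contains cpu0 through cpu63 and the affinity mask covers 0 through 63.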


$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7502 32-Core Processor
Stepping:              0
CPU MHz:               1500.000
CPU max MHz:           2500.0000
CPU min MHz:           1500.0000
BogoMIPS:              5000.07
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca
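
The lscpu output itself is consistent: 2 sockets x 32 cores x 1 thread gives the 64 CPUs reported, spread over 8 NUMA nodes of 8 CPUs each, and none of these counts is zero from the operating system's point of view. As a sketch, the check can be written as below. The helper names are hypothetical and the sample text is copied from the output above.

```python
def parse_lscpu(text):
    """Split 'Key:   value' lines from lscpu into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def topology_counts(fields):
    """Extract the integer topology counts that lscpu reports."""
    return {
        "cpus": int(fields["CPU(s)"]),
        "threads_per_core": int(fields["Thread(s) per core"]),
        "cores_per_socket": int(fields["Core(s) per socket"]),
        "sockets": int(fields["Socket(s)"]),
        "numa_nodes": int(fields["NUMA node(s)"]),
    }


# Values taken from the lscpu output in this post.
SAMPLE = """\
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
"""

counts = topology_counts(parse_lscpu(SAMPLE))
product = (
    counts["threads_per_core"] * counts["cores_per_socket"] * counts["sockets"]
)
print(counts)
print("product matches CPU count:", product == counts["cpus"])
print("CPUs per NUMA node:", counts["cpus"] // counts["numa_nodes"])
```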

TCE Open Date: 

Monday, December 9, 2019 - 09:10
