Hi all,
I'm running Intel MPI Library 2017 Update 1 (v.2017.1.143) for Windows on two identical Windows Server 2012 R2 Standard 64-bit nodes. Each node has the following specs:
CPU: Intel Xeon CPU E5-2450 v2 @ 2.50GHz
RAM: DDR3, 49086 MB, triple channel (800 MHz)
GPU: NVIDIA Tesla K40c (driver version 24.21.14.1229)
Network: Mellanox ConnectX-3 Pro Ethernet Adapter (2)
- Driver: Mellanox Infiniband 40Gbit ConnectX 3 Pro HBA driver, Version 5.10 (MLNX_VPI_WinOF-5_10_All_win2012R2_x64.exe)
I set I_MPI_FABRICS to either dapl:dapl or shm:tcp. The DAPL version is "DAPL-ND - DAPL NetworkDirect Stand Alone installer v1.4.5 [06-02-2016]".
With shm:tcp, MPI initialization fails on rank 10 with "read from socket failed - No error". Here is the command and the full trace:
mpiexec -l -genv I_MPI_FABRICS shm:tcp -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_WAIT_MODE=1 -genv I_MPI_DEBUG=1000 -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe master --do stitching --dted --exposure --config-path ..\input\cape\cape_fl_veo_50_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE50 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do stabilization --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do bgsubtraction --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do tracking --group-name EO50 : -n 1 -host 10.0.0.2 ../ped/Release/Cape.exe slave -d -o dds file --group-name EO50 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe master --do stitching --exposure --config-path ..\input\cape\cape_fl_veo_100_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE100 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do stabilization --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do bgsubtraction --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/Cape.exe slave --do tracking --group-name EO100 : -n 1 -host 10.0.0.2 ../ped/Release/Cape.exe slave -d -o dds file --group-name EO100 : -n 1 -host 10.0.0.1 ../ped/Release/CameraController.exe --config-path ..\input\cc\cc_veo.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CC : -n 1 -host 10.0.0.1 ../ped/Release/CameraGroupProxy.exe --config-path ..\input\cgp\configuration.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CGP : -n 1 -host 10.0.0.2 ../ped/Release/GroupMetadataSynchronizer.exe --config-path ..\input\gms\configuration.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id GMS --dds-reader-id 121 --dds-reader-allow-interface 10.4.1.* --dds-writer-id 122 --dds-writer-allow-interface 10.4.1.* [11] WARNING: Logging before InitGoogleLogging() is written to STDERR [11] I0430 
19:24:42.494819 13396 ArgParser.cpp:66] CGP MQTT IP set to 10.4.1.121 [11] I0430 19:24:42.494819 13396 ArgParser.cpp:73] CGP MQTT Port set to 1883 [11] I0430 19:24:42.494819 13396 ArgParser.cpp:80] CGP MQTT ID set to CGP [10] WARNING: Logging before InitGoogleLogging() is written to STDERR [10] I0430 19:24:42.501842 3068 ArgParser.cpp:93] CC Config path set to ..\input\cc\cc_veo.xml [10] I0430 19:24:42.504853 3068 ArgParser.cpp:181] CC MQTT IP set to 10.4.1.121 [10] I0430 19:24:42.504853 3068 ArgParser.cpp:188] CC MQTT Port set to 1883 [10] I0430 19:24:42.504853 3068 ArgParser.cpp:195] CC MQTT ID set to CC [10] I0430 19:24:42.504853 3068 Executor.cpp:53] initMpi [2] WARNING: Logging before InitGoogleLogging() is written to STDERR [3] WARNING: Logging before InitGoogleLogging() is written to STDERR [3] W0430 19:24:42.506860 9924 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: tracking [3] I0430 19:24:42.506860 9924 ArgParser.cpp:406] Group name set to EO50 [6] WARNING: Logging before InitGoogleLogging() is written to STDERR [6] W0430 19:24:42.506860 5404 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stabilization [6] I0430 19:24:42.506860 5404 ArgParser.cpp:406] Group name set to EO100 [6] I0430 19:24:42.506860 5404 Executor.cpp:53] initMpi [2] W0430 19:24:42.506860 16752 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: bgsubtraction [2] I0430 19:24:42.506860 16752 ArgParser.cpp:406] Group name set to EO50 [2] I0430 19:24:42.506860 16752 Executor.cpp:53] initMpi [3] I0430 19:24:42.506860 9924 Executor.cpp:53] initMpi [5] WARNING: Logging before InitGoogleLogging() is written to STDERR [8] WARNING: Logging before InitGoogleLogging() is written to STDERR [8] W0430 19:24:42.506860 11428 ArgParser.cpp:81] No end process provided. 
Setting cape to execute only start process: tracking [8] I0430 19:24:42.506860 11428 ArgParser.cpp:406] Group name set to EO100 [8] I0430 19:24:42.506860 11428 Executor.cpp:53] initMpi [1] WARNING: Logging before InitGoogleLogging() is written to STDERR [1] W0430 19:24:42.506860 14572 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stabilization [1] I0430 19:24:42.506860 14572 ArgParser.cpp:406] Group name set to EO50 [5] W0430 19:24:42.506860 16108 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: stitching [1] I0430 19:24:42.506860 14572 Executor.cpp:53] initMpi [5] I0430 19:24:42.506860 16108 ArgParser.cpp:356] Config path is set to ..\input\cape\cape_fl_veo_100_dual.xml [5] I0430 19:24:42.506860 16108 ArgParser.cpp:363] MQTT IP is set to 10.4.1.121 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:370] MQTT Port is set to 1883 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:377] MQTT ID is set to CAPE100 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:384] DDS ID is set to 121 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:391] DDS Allow Interface is set to 10.4.1.* [5] I0430 19:24:42.506860 16108 ArgParser.cpp:398] GCP DB IP set to 10.4.1.122 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:406] Group name set to EO100 [5] I0430 19:24:42.506860 16108 ArgParser.cpp:184] Exposure will be executed along with provided processes [7] WARNING: Logging before InitGoogleLogging() is written to STDERR [7] W0430 19:24:42.506860 1656 ArgParser.cpp:81] No end process provided. Setting cape to execute only start process: bgsubtraction [7] I0430 19:24:42.506860 1656 ArgParser.cpp:406] Group name set to EO100 [5] I0430 19:24:42.506860 16108 Executor.cpp:53] initMpi [7] I0430 19:24:42.506860 1656 Executor.cpp:53] initMpi [0] WARNING: Logging before InitGoogleLogging() is written to STDERR [0] W0430 19:24:42.506860 14212 ArgParser.cpp:81] No end process provided. 
Setting cape to execute only start process: stitching [0] I0430 19:24:42.507864 14212 ArgParser.cpp:356] Config path is set to ..\input\cape\cape_fl_veo_50_dual.xml [0] I0430 19:24:42.507864 14212 ArgParser.cpp:363] MQTT IP is set to 10.4.1.121 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:370] MQTT Port is set to 1883 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:377] MQTT ID is set to CAPE50 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:384] DDS ID is set to 121 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:391] DDS Allow Interface is set to 10.4.1.* [0] I0430 19:24:42.507864 14212 ArgParser.cpp:398] GCP DB IP set to 10.4.1.122 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:406] Group name set to EO50 [0] I0430 19:24:42.507864 14212 ArgParser.cpp:184] Exposure will be executed along with provided processes [0] I0430 19:24:42.507864 14212 ArgParser.cpp:240] CAPE will attempt to calculate elevation matrix from dted file if any processes require it. [0] I0430 19:24:42.507864 14212 Executor.cpp:53] initMpi [12] WARNING: Logging before InitGoogleLogging() is written to STDERR [12] I0430 19:24:42.523279 261596 ArgParser.cpp:56] GMS MQTT IP set to 10.4.1.121 [12] I0430 19:24:42.523279 261596 ArgParser.cpp:63] GMS MQTT Port set to 1883 [12] I0430 19:24:42.523279 261596 ArgParser.cpp:70] GMS MQTT ID set to GMS [12] I0430 19:24:42.523279 261596 ArgParser.cpp:77] GMS DDS Reader Domain ID set to 121 [12] I0430 19:24:42.523279 261596 ArgParser.cpp:84] GMS DDS Writer Domain ID set to 122 [12] I0430 19:24:42.523279 261596 ArgParser.cpp:91] GMS DDS Allow Interface for Reader set to 10.4.1.* [12] I0430 19:24:42.523279 261596 ArgParser.cpp:98] GMS DDS Allow Interface for Writer set to 10.4.1.* [4] WARNING: Logging before InitGoogleLogging() is written to STDERR [4] I0430 19:24:42.537328 260632 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE' [4] I0430 19:24:42.537328 260632 ArgParser.cpp:406] Group name set to EO50 [4] I0430 19:24:42.537328 260632 Executor.cpp:53] 
initMpi [9] WARNING: Logging before InitGoogleLogging() is written to STDERR [9] I0430 19:24:42.538331 259540 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE' [9] I0430 19:24:42.538331 259540 ArgParser.cpp:406] Group name set to EO100 [9] I0430 19:24:42.538331 259540 Executor.cpp:53] initMpi [11] I0430 19:24:43.496381 13396 Manager.cpp:150] Application Mode : INITIALIZING published! [11] I0430 19:24:43.496381 13396 Executor.cpp:53] initMpi [12] I0430 19:24:43.524725 261596 GMSApplication.cpp:134] DDS Initialization ... [12] I0430 19:24:44.087589 261596 GMSApplication.cpp:144] DDS Reader initialized ! [121 - 10.4.1.*][12] [12] I0430 19:24:44.087589 261596 GMSApplication.cpp:145] DDS Writer initialized ! [122 - 10.4.1.*] [12] I0430 19:24:44.087589 261596 GMSApplication.cpp:87] MPI Initialization ... [12] I0430 19:24:44.087589 261596 Executor.cpp:53] initMpi [0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1 Build 20161016[0] [0] [0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation. All rights reserved. 
[0] [0] MPI startup(): Multi-threaded optimized library
[12] [12] MPI startup(): shm and tcp data transfer modes
[4] [4] MPI startup(): shm and tcp data transfer modes
[4]
[9] [9] MPI startup(): shm and tcp data transfer modes
[2] [2] MPI startup(): shm and tcp data transfer modes
[2]
[9]
[3] [3] MPI startup(): shm and tcp data transfer modes
[1] [1] MPI startup(): shm and tcp data transfer modes
[0] [0] MPI startup(): shm and tcp data transfer modes
[11] [11] MPI startup(): shm and tcp data transfer modes
[5] [5] MPI startup(): shm and tcp data transfer modes
[7] [7] MPI startup(): shm and tcp data transfer modes
[6] [6] MPI startup(): shm and tcp data transfer modes
[10] [10] MPI startup(): shm and tcp data transfer modes
[8] [8] MPI startup(): shm and tcp data transfer modes
[10] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
[10] MPIR_Init_thread(805)......................: fail failed
[10] MPID_Init(1783)............................: channel initialization failed
[10] MPIDI_CH3_Init(147)........................: fail failed
[10] MPID_nem_tcp_post_init(351)................: fail failed
[10] MPID_nem_newtcp_module_connpoll(3116)......: fail failed
[10] recv_id_or_tmpvc_info_success_handler(1336): read from socket failed - No error
With dapl:dapl, MPI initialization fails on rank 5 with "Internal MPI error!" in MPIR_Init_thread, right after a DAPL error is reported. Here is the command and the full trace:
mpiexec -l -genv I_MPI_FABRICS dapl:dapl -genv I_MPI_PIN_DOMAIN=omp -genv I_MPI_WAIT_MODE=1 -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe master --do stitching --dted --exposure --config-path ..\input\cape\cape_fl_veo_50_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE50 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO50 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do stabilization --group-name EO50 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do bgsubtraction --group-name EO50 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do tracking --group-name EO50 : -n 1 -host 10.0.0.2 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave -d -o dds file --group-name EO50 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe master --do stitching --exposure --config-path ..\input\cape\cape_fl_veo_100_dual.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CAPE100 --dds-id 121 -i 10.4.1.* --gcp-db-ip 10.4.1.122 --group-name EO100 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do stabilization --group-name EO100 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do bgsubtraction --group-name EO100 : -n 1 -host 10.0.0.1 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave --do tracking --group-name EO100 : -n 1 -host 10.0.0.2 ../Cape-3.5.0-78-SNAPSHOT-windows-amd64-vc14/bin/Cape.exe slave -d -o dds file --group-name EO100 : -n 1 -host 10.0.0.1 ../CameraController-2.3.1-48-SNAPSHOT-windows-amd64-vc14/bin/CameraController.exe --config-path ..\input\cc\cc_veo.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id CC : -n 1 -host 10.0.0.1 ../CameraGroupProxy-0.0.2-41-SNAPSHOT-windows-amd64-vc14/bin/CameraGroupProxy.exe --config-path ..\input\cgp\configuration.xml --mqtt-ip 
10.4.1.121 --mqtt-port 1883 --mqtt-id CGP : -n 1 -host 10.0.0.2 ../GroupMetadataSynchronizer-0.0.2-42-SNAPSHOT-windows-amd64-vc14/bin/GroupMetadataSynchronizer.exe --config-path ..\input\gms\configuration.xml --mqtt-ip 10.4.1.121 --mqtt-port 1883 --mqtt-id GMS --dds-reader-id 121 --dds-reader-allow-interface 10.4.1.* --dds-writer-id 122 --dds-writer-allow-interface 10.4.1.* [11] WARNING: Logging before InitGoogleLogging() is written to STDERR [11] I0503 09:44:51.214726 8132 ArgParser.cpp:66] CGP MQTT IP set to 10.4.1.121 [11] I0503 09:44:51.215728 8132 ArgParser.cpp:73] CGP MQTT Port set to 1883 [11] I0503 09:44:51.215728 8132 ArgParser.cpp:80] CGP MQTT ID set to CGP [12] WARNING: Logging before InitGoogleLogging() is written to STDERR [12] I0503 09:45:04.446153 7932 ArgParser.cpp:56] GMS MQTT IP set to 10.4.1.121 [12] I0503 09:45:04.446153 7932 ArgParser.cpp:63] GMS MQTT Port set to 1883 [12] I0503 09:45:04.447154 7932 ArgParser.cpp:70] GMS MQTT ID set to GMS [12] I0503 09:45:04.447154 7932 ArgParser.cpp:77] GMS DDS Reader Domain ID set to 121 [12] I0503 09:45:04.447154 7932 ArgParser.cpp:84] GMS DDS Writer Domain ID set to 122 [12] I0503 09:45:04.447154 7932 ArgParser.cpp:91] GMS DDS Allow Interface for Reader set to 10.4.1.* [12] I0503 09:45:04.447154 7932 ArgParser.cpp:98] GMS DDS Allow Interface for Writer set to 10.4.1.* [9] WARNING: Logging before InitGoogleLogging() is written to STDERR [4] WARNING: Logging before InitGoogleLogging() is written to STDERR [4] I0503 09:45:04.471177 7412 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE' [4] I0503 09:45:04.471177 7412 ArgParser.cpp:406] Group name set to EO50 [9] I0503 09:45:04.471177 8356 ArgParser.cpp:133] Dissemination Medium Type is set as: 'DDS+FILE' [9] I0503 09:45:04.471177 8356 ArgParser.cpp:406] Group name set to EO100 [12] I0503 09:45:05.447875 7932 GMSApplication.cpp:134] DDS Initialization ... [12] I0503 09:45:06.021322 7932 GMSApplication.cpp:144] DDS Reader initialized ! 
[121 - 10.4.1.*]
[12] I0503 09:45:06.021322 7932 GMSApplication.cpp:145] DDS Writer initialized ! [122 - 10.4.1.*]
[12] I0503 09:45:06.021322 7932 GMSApplication.cpp:87] MPI Initialization ...
[11] I0503 09:44:53.217674 8132 Manager.cpp:150] Application Mode : INITIALIZING published!
[4] dapls_ib_get_dto_status() Unknown NT Error 0xc000021b? ret DAT_INTERNAL_ERR
[5] [5:10.0.0.1] unexpected DAPL event 0x4005
[5] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[5] MPIR_Init_thread(805): fail failed
[5] MPID_Init(1783)......: channel initialization failed
[5] MPIDI_CH3_Init(147)..: fail failed
[5] (unknown)(): Internal MPI error!
Any ideas what could be causing these initialization failures?