I am using the Intel Math Kernel Library (MKL) to implement my algorithm, with the number of threads set to 16, and the program works well on its own. However, when I combine MKL with MPI and launch the program with
mpirun -n 1 ./MMNET_MPI
I expected this to give the same result and performance as running the program directly:
./MMNET_MPI
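For context, I set the thread count before the computation starts, roughly like the sketch below (a minimal reconstruction; in my actual code the limit may instead come from the MKL_NUM_THREADS / OMP_NUM_THREADS environment variables, and these calls only set an upper bound, not a guarantee):

#include <mkl.h>
#include <omp.h>
#include <cstdio>

int main() {
  // Request 16 threads for MKL and for my own OpenMP regions
  // (equivalent to MKL_NUM_THREADS=16 / OMP_NUM_THREADS=16).
  mkl_set_num_threads(16);
  omp_set_num_threads(16);

  std::printf("MKL max threads: %d, OMP max threads: %d\n",
              mkl_get_max_threads(), omp_get_max_threads());
  return 0;
}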
However, the performance degrades a lot: although I request 16 threads, only 2 or 3 threads are actually active. I am not sure what the problem is. The relevant part of my MKL code is as follows.
void LMMCPU::multXXTTrace(double *out, const double *vec) const {
  double *snpBlock = ALIGN_ALLOCATE_DOUBLES(Npad * snpsPerBlock);
  double (*workTable)[4] =
      (double (*)[4]) ALIGN_ALLOCATE_DOUBLES(omp_get_max_threads() * 256 * sizeof(*workTable));

  // store the temp result
  double *temp1 = ALIGN_ALLOCATE_DOUBLES(snpsPerBlock);

  for (uint64 m0 = 0; m0 < M; m0 += snpsPerBlock) {
    uint64 snpsPerBLockCrop = std::min(M, m0 + snpsPerBlock) - m0;

#pragma omp parallel for
    for (uint64 mPlus = 0; mPlus < snpsPerBLockCrop; mPlus++) {
      uint64 m = m0 + mPlus;
      if (projMaskSnps[m])
        buildMaskedSnpCovCompVec(snpBlock + mPlus * Npad, m,
                                 workTable + (omp_get_thread_num() << 8));
      else
        memset(snpBlock + mPlus * Npad, 0, Npad * sizeof(snpBlock[0]));
    }

    for (uint64 iter = 0; iter < estIteration; iter++) {
      // compute A = X^T V
      MKL_INT row = Npad;
      MKL_INT col = snpsPerBLockCrop;
      double alpha = 1.0;
      MKL_INT lda = Npad;
      MKL_INT incx = 1;
      double beta = 0.0;
      MKL_INT incy = 1;
      cblas_dgemv(CblasColMajor, CblasTrans, row, col, alpha, snpBlock, lda,
                  vec + iter * Npad, incx, beta, temp1, incy);

      // compute XA
      double beta1 = 1.0;
      cblas_dgemv(CblasColMajor, CblasNoTrans, row, col, alpha, snpBlock, lda,
                  temp1, incx, beta1, out + iter * Npad, incy);
    }
  }

  ALIGN_FREE(snpBlock);
  ALIGN_FREE(workTable);
  ALIGN_FREE(temp1);
}
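To see how many threads are actually running, I use a small standalone check similar to the sketch below (this is separate from the function above; note that mkl_get_max_threads() only reports the configured limit, while omp_get_num_threads() counts the threads OpenMP really launches inside a parallel region):

#include <mkl.h>
#include <omp.h>
#include <cstdio>

int main() {
  // Report the configured MKL thread limit.
  std::printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());

  // Count the threads OpenMP actually spawns in a parallel region.
#pragma omp parallel
  {
#pragma omp single
    std::printf("OpenMP threads in parallel region = %d\n", omp_get_num_threads());
  }
  return 0;
}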