Overlapping Computation & Communication with MPI Non-blocking Calls
The following recommendations may help overlap computation with MPI non-blocking calls; the pattern they target is sketched below. They were verified with cray-mpich/7.2.0 on CLE 5.2 UP02 and have been shown to improve overlap for MPI_Iallreduce.
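For reference, the pattern being tuned posts the non-blocking collective early, performs independent computation while the operation progresses, and completes the collective only when its result is needed. A minimal sketch follows; the buffer size and the compute_something() work loop are placeholders for illustration, not part of the benchmark.

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for application work that does not depend on the reduction result. */
static void compute_something(double *work, int n)
{
    for (int i = 0; i < n; i++)
        work[i] = work[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int n = 1 << 20;                      /* illustrative message size */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    double *work    = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { sendbuf[i] = 1.0; work[i] = 0.0; }

    MPI_Request req;
    /* Post the non-blocking reduction early ... */
    MPI_Iallreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* ... compute while the collective progresses in the background (with the
       asynchronous progress engine enabled; without it, periodic MPI_Test calls
       may be needed to drive progress) ... */
    compute_something(work, n);

    /* ... and complete the collective only when the result is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}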
Enable core specialization with aprun
aprun -r 1 # Enable core specialization: reserve one core per node for system services and the MPI asynchronous progress threads.
Enable asynchronous progress engine & set MPI thread safety level
export MPICH_NEMESIS_ASYNC_PROGRESS=SC # Enable async progress engine
export MPICH_MAX_THREAD_SAFETY=multiple # Allow MPI_THREAD_MULTIPLE, required when using the asynchronous progress engine (see the initialization sketch below).
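With MPICH_MAX_THREAD_SAFETY=multiple set, it is usually safest for the application to request the matching thread level at initialization and check what was actually granted. A minimal sketch follows; the warning text is illustrative only.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request full thread support; with MPICH_MAX_THREAD_SAFETY=multiple the
       library can grant MPI_THREAD_MULTIPLE, otherwise a lower level is returned. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not granted (provided=%d)\n", provided);

    /* ... application code using non-blocking collectives ... */

    MPI_Finalize();
    return 0;
}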
Enable MPI shared memory optimizations
export MPICH_SHARED_MEM_COLL_OPT=1 # Enable the optimized shared-memory design for collective operations. Currently supported collectives are MPI_Allreduce, MPI_Iallreduce, and MPI_Bcast; 1 enables the optimization for all of them.
export MPICH_SMP_SINGLE_COPY_SIZE=1024 # Minimum message size in bytes for which on-node messages use single-copy transfers. Applies only to the SMP (on-node shared-memory) device.
Link with the DMAPP library using the following link flags
Static linking:
-Wl,--whole-archive,-ldmapp,--no-whole-archive
Dynamic linking:
-ldmapp
Enable optimized DMAPP collectives
export MPICH_USE_DMAPP_COLL=1 # attempt to use the highly optimized GHAL-based DMAPP collective algorithms, if available.
Debugging:
export MPICH_GNI_ASYNC_PROGRESS_STATS=enabled # Generates a detailed log. May result in a large stderr file.
export MPICH_ENV_DISPLAY=1 # Causes rank 0 to display all MPICH environment variables and their current settings at MPI initialization time.
export MPICH_VERSION_DISPLAY=1 # Displays the Cray MPICH version number and build date information.
Results:
The following IMB-NBC results show the overlap percentage (overlap[%]) achieved by MPI_Iallreduce for various message sizes and process counts. The job was launched with 512 ranks; each run benchmarks a subset of the ranks while the remaining ranks wait in MPI_Barrier.
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V4.0.0, MPI-NBC part
#---------------------------------------------------
# Date : Wed Jun 10 13:22:27 2015
# Machine : x86_64
# System : Linux
# Release : 3.0.101-0.31.1_1.0502.8394-cray_gem_c
# Version : #1 SMP Mon Dec 22 19:59:41 UTC 2014
# MPI Version : 3.0
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# ./IMB-NBC Iallreduce Iallreduce_pure
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# Iallreduce
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 2
# ( 510 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 1.90 1.00 0.76 0.00
4 1000 14.90 8.23 8.29 19.64
8 1000 14.34 8.13 8.34 25.63
16 1000 14.28 8.11 8.31 25.77
32 1000 5.18 2.42 2.19 0.00
64 1000 5.30 2.54 2.30 0.00
128 1000 5.35 2.62 2.31 0.00
256 1000 6.23 2.71 3.01 0.00
512 1000 6.58 3.01 3.02 0.00
1024 1000 25.13 9.17 9.25 0.00
2048 1000 26.93 9.73 9.75 0.00
4096 1000 39.37 14.86 15.90 0.00
8192 1000 45.68 23.98 25.35 14.42
16384 1000 56.51 29.26 30.36 10.27
32768 1000 71.82 39.03 41.71 21.41
65536 640 130.56 64.68 68.07 3.22
131072 320 192.12 128.23 137.54 53.55
262144 160 456.01 333.18 355.34 65.43
524288 80 877.91 648.10 694.36 66.90
1048576 40 1792.55 1319.45 1403.14 66.28
2097152 20 3643.30 2618.75 2707.99 62.17
4194304 10 7861.61 6294.99 6681.63 76.55
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 4
# ( 508 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 1.96 1.03 0.76 0.00
4 1000 14.77 8.21 8.35 21.36
8 1000 14.84 8.32 8.33 21.81
16 1000 14.91 8.40 8.39 22.38
32 1000 8.02 3.74 3.58 0.00
64 1000 8.80 4.03 4.32 0.00
128 1000 8.88 4.03 4.30 0.00
256 1000 9.09 4.34 4.30 0.00
512 1000 10.23 5.12 5.00 0.00
1024 1000 49.12 22.65 22.98 0.00
2048 1000 51.45 19.53 20.89 0.00
4096 1000 85.29 44.52 47.02 13.30
8192 1000 97.68 57.76 61.46 35.04
16384 1000 100.50 48.92 51.10 0.00
32768 1000 133.16 75.88 80.46 28.81
65536 640 208.02 125.01 135.63 38.80
131072 320 320.20 203.18 216.79 46.02
262144 160 552.87 387.50 418.35 60.47
524288 80 1083.17 784.31 837.23 64.30
1048576 40 2667.68 1831.70 1957.51 57.29
2097152 20 5108.64 3597.90 3820.80 60.46
4194304 10 10313.20 8468.91 9063.70 79.65
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 8
# ( 504 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 1.98 1.08 0.79 0.00
4 1000 16.31 9.35 8.99 21.74
8 1000 16.34 9.35 8.99 21.34
16 1000 17.11 9.50 9.67 21.35
32 1000 10.83 5.23 4.93 0.00
64 1000 11.79 5.46 5.78 0.00
128 1000 12.00 5.60 5.79 0.00
256 1000 13.05 6.05 5.68 0.00
512 1000 15.45 7.46 7.79 0.00
1024 1000 88.62 54.16 56.41 38.92
2048 1000 91.67 56.92 59.94 42.03
4096 1000 117.08 68.63 72.45 33.13
8192 1000 155.92 100.22 105.75 47.34
16384 1000 169.91 105.55 111.33 42.19
32768 1000 206.10 124.65 132.13 38.36
65536 640 282.44 175.18 188.25 43.02
131072 320 451.45 293.52 318.08 50.35
262144 160 814.90 586.73 637.70 64.22
524288 80 1525.50 1116.74 1217.77 66.43
1048576 40 2812.62 2110.67 2290.47 69.35
2097152 20 5562.60 4368.10 4736.86 74.78
4194304 10 11493.40 9449.10 10221.51 80.00
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 16
# ( 496 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 1.97 0.99 0.78 0.00
4 1000 23.98 13.42 13.79 23.40
8 1000 23.31 13.23 13.12 23.03
16 1000 23.58 13.33 13.14 21.64
32 1000 17.37 8.55 8.40 0.00
64 1000 18.41 9.06 9.36 0.07
128 1000 19.42 9.59 9.79 0.00
256 1000 20.89 10.66 10.55 3.04
512 1000 25.39 13.18 13.29 8.11
1024 1000 104.53 64.62 68.77 41.98
2048 1000 111.75 69.24 73.65 42.28
4096 1000 153.60 88.38 90.89 28.24
8192 1000 204.74 119.83 124.31 31.69
16384 1000 231.00 133.90 140.69 30.98
32768 1000 283.16 163.01 172.74 30.45
65536 640 401.26 234.56 247.10 32.54
131072 320 614.08 399.35 427.43 49.76
262144 160 1542.62 1202.69 1307.54 74.00
524288 80 1799.06 1249.39 1350.63 59.30
1048576 40 3540.18 2597.12 2869.92 67.14
2097152 20 7238.65 5102.35 5448.03 60.79
4194304 10 14614.80 11747.81 12562.39 77.18
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 32
# ( 480 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 2.02 1.11 0.75 0.00
4 1000 26.12 14.94 15.33 27.10
8 1000 26.55 15.09 15.36 25.44
16 1000 26.55 15.13 15.47 26.21
32 1000 56.41 30.47 31.51 17.71
64 1000 59.09 31.82 32.48 16.05
128 1000 60.26 31.61 32.79 12.63
256 1000 62.16 32.38 33.32 10.61
512 1000 66.90 35.50 36.33 13.57
1024 1000 132.19 87.74 92.36 51.88
2048 1000 143.53 95.89 101.80 53.20
4096 1000 233.79 134.12 141.19 29.41
8192 1000 296.50 175.57 179.41 32.59
16384 1000 319.48 187.27 196.92 32.86
32768 1000 368.47 213.33 221.82 30.06
65536 640 492.14 291.76 309.23 35.20
131072 320 684.80 447.48 480.06 50.57
262144 160 1126.11 755.92 811.77 54.40
524288 80 2924.45 2255.89 2435.89 72.55
1048576 40 4575.87 3269.03 3582.63 63.52
2097152 20 8135.84 6039.50 6651.46 68.48
4194304 10 15998.10 12681.91 13849.74 76.06
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 64
# ( 448 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 2.04 1.15 0.76 0.00
4 1000 29.01 17.49 17.86 35.52
8 1000 29.03 17.67 17.89 36.53
16 1000 30.20 17.80 18.35 32.39
32 1000 73.38 38.22 39.07 10.02
64 1000 74.97 39.25 39.98 10.66
128 1000 76.75 39.92 41.26 10.76
256 1000 78.64 40.89 43.21 12.64
512 1000 83.92 43.38 45.37 10.64
1024 1000 154.92 105.69 111.82 55.97
2048 1000 172.74 118.62 124.87 56.66
4096 1000 274.13 156.72 164.95 28.82
8192 1000 347.89 206.35 214.72 34.08
16384 1000 370.66 217.45 225.99 32.21
32768 1000 413.57 237.15 251.72 29.91
65536 640 545.39 319.63 337.08 33.02
131072 320 753.33 488.38 519.44 48.99
262144 160 1737.74 1251.31 1299.16 62.56
524288 80 2977.36 2254.06 2367.65 69.45
1048576 40 4748.08 3493.15 3702.74 66.11
2097152 20 8336.81 6419.29 6859.48 72.05
4194304 10 16571.00 13475.70 14752.53 79.02
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 128
# ( 384 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 2.04 1.09 0.78 0.00
4 1000 30.45 19.34 19.24 41.99
8 1000 31.35 19.46 19.93 40.37
16 1000 31.67 19.67 20.53 41.52
32 1000 87.47 44.39 46.79 7.95
64 1000 88.55 45.19 46.93 7.60
128 1000 90.29 46.80 48.85 10.96
256 1000 93.93 47.70 48.93 5.52
512 1000 99.08 51.59 53.88 11.87
1024 1000 195.37 139.19 146.23 61.58
2048 1000 208.54 142.09 150.35 55.80
4096 1000 309.73 173.60 182.41 25.37
8192 1000 390.39 230.71 242.45 34.14
16384 1000 407.32 236.79 248.74 31.45
32768 1000 450.14 256.92 271.70 28.89
65536 640 585.12 341.30 359.84 32.24
131072 320 825.88 513.12 549.32 43.06
262144 160 1953.49 1313.58 1388.67 53.92
524288 80 3207.60 2371.51 2500.62 66.56
1048576 40 5129.15 3738.55 4064.95 65.79
2097152 20 9047.15 6680.85 6916.02 65.79
4194304 10 18619.20 13818.79 15136.36 68.29
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 256
# ( 256 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 2.07 1.13 0.78 0.00
4 1000 31.49 19.95 19.94 42.12
8 1000 32.18 20.12 20.58 41.37
16 1000 32.56 20.31 21.06 41.81
32 1000 112.31 60.96 61.83 16.96
64 1000 111.23 59.58 61.60 16.16
128 1000 109.10 55.40 57.23 6.17
256 1000 115.20 60.47 62.23 12.04
512 1000 118.41 60.85 62.80 8.36
1024 1000 209.41 142.46 148.86 55.03
2048 1000 237.63 163.81 172.69 57.25
4096 1000 344.82 192.93 198.02 23.30
8192 1000 427.11 250.26 262.61 32.66
16384 1000 446.48 258.61 267.91 29.87
32768 1000 490.09 280.44 295.29 29.00
65536 640 622.08 363.85 383.41 32.65
131072 320 892.98 532.36 558.42 35.42
262144 160 2272.04 1694.26 1788.40 67.69
524288 80 3432.19 2461.62 2507.14 61.29
1048576 40 5334.58 3799.22 4057.78 62.16
2097152 20 9105.75 6808.46 7253.48 68.33
4194304 10 17817.90 14020.30 15345.43 75.25
#-----------------------------------------------------------------------------
# Benchmarking Iallreduce
# #processes = 512
#-----------------------------------------------------------------------------
#bytes #repetitions t_ovrl[usec] t_pure[usec] t_CPU[usec] overlap[%]
0 1000 2.14 1.09 0.77 0.00
4 1000 33.53 21.75 21.94 46.32
8 1000 34.28 21.93 21.92 43.66
16 1000 35.75 22.59 22.60 41.77
32 1000 132.41 70.23 73.37 15.25
64 1000 129.62 67.00 68.07 8.01
128 1000 130.41 67.86 69.37 9.83
256 1000 131.01 66.39 69.32 6.78
512 1000 138.63 73.31 77.14 15.31
1024 1000 323.16 253.45 263.74 73.57
2048 1000 342.64 263.96 279.01 71.80
4096 1000 424.28 242.21 253.95 28.30
8192 1000 504.31 309.63 321.41 39.43
16384 1000 518.99 312.61 327.53 36.99
32768 1000 560.26 333.12 347.17 34.57
65536 640 687.74 412.09 435.21 36.66
131072 320 994.97 611.65 644.15 40.49
262144 160 2419.55 1789.48 1893.83 66.73
524288 80 3542.32 2525.26 2685.30 62.12
1048576 40 5559.43 3904.75 4200.27 60.61
2097152 20 9645.30 7030.95 7644.80 65.80
4194304 10 19036.89 14211.01 15473.82 68.81