Framework for Scalable Intra-node Collective Operations Using Shared Memory

Presenter: Surabhi Jain
Contributors: Surabhi Jain, Rashid Kaleem, Marc Gamell Balmana, Akhil Langer, Dmitry Durnov, Alexander Sannikov, and Maria Garzaran
Supercomputing 2018, Dallas, USA

Legal Notices & Disclaimers

Acknowledgment: This material is based upon work supported by the U.S. Department of Energy and Argonne National Laboratory and its Leadership Computing Facility under Award Number(s) DE-AC02-06CH11357 and Award Number 8F-30005. This work was generated with financial support from the U.S. Government through said Contract and Award Number(s), and as such the U.S. Government retains a paid-up, nonexclusive, irrevocable, world-wide license to reproduce, prepare derivative works, distribute copies to the public, and display publicly, by or on behalf of the Government, this work in whole or in part, or otherwise use the work for Federal purposes.

Disclaimer: This report/presentation was prepared as an account of work sponsored by an agency and/or National Laboratory of the United States Government. Neither the United States Government nor any agency or National Laboratory thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency or National Laboratory thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency or National Laboratory thereof.

Access to this document is with the understanding that Intel is not engaged in rendering advice or other professional services. Information in this document may be changed or updated without notice by Intel. This document contains copyright information, the terms of which must be observed and followed. Reference herein to any specific commercial product, process or service does not constitute or imply endorsement, recommendation, or favoring by Intel or the US Government. Intel makes no representations whatsoever about this document or the information contained herein. IN NO EVENT SHALL INTEL BE LIABLE TO ANY PARTY FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES FOR ANY USE OF THIS DOCUMENT, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, BUSINESS INTERRUPTION, OR OTHERWISE, EVEN IF INTEL IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Legal Notices & Disclaimers (cont.)

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks. Performance results are based on testing as of July 31, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No component or product can be absolutely secure.

Intel, Pentium, Intel Xeon, Intel Xeon Phi™, Intel Core™, Intel VTune™, Intel Cilk™, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Copyright 2018, Intel Corporation. All rights reserved.

Motivation

MPI collectives represent common communication patterns, computations, or synchronization.

Why optimize intra-node collectives?
- They are on the critical path for many collectives (Reduce, Allreduce, Barrier, ...): first the intra-node portion is performed, then the inter-node portion.
- They are important for large multicore nodes and/or small clusters.

Contributions

1. A framework to optimize intra-node collectives:
   - Based on release/gather building blocks
   - Dedicated shared memory layer
   - Topology-aware intra-node trees
2. Implementation of 3 collectives: MPI_Bcast(), MPI_Reduce(), and MPI_Allreduce().
3. Significant speedups with respect to MPICH, MVAPICH, and Open MPI. For example, for MPI_Allreduce, average speedups of 3.9x over Open MPI, 1.2x over MVAPICH, 2.1x over MPICH/ch3, and 2.9x over MPICH/ch4.

Outline
- Background
- Design and Implementation: shared memory layout; release and gather steps; implementing collectives using release and gather; optimizations
- Performance Evaluation
- Conclusion

Background: MPI_Allreduce

Current MPI implementations optimize collectives for multiple ranks per node. MPI_Allreduce proceeds in three phases:
1. Intra-node reduce (MPICH and Open MPI use point-to-point; MVAPICH uses dedicated shared memory)
2. Inter-node allreduce
3. Intra-node bcast

[Figure: MPI_Allreduce across 4 nodes with 3 ranks per node (ranks 0-11); the intra-node reduce, inter-node allreduce, and intra-node bcast phases are shown as successive animation builds.]


Intra-node Broadcast

4 steps:
1. The root copies the data into the shared memory buffer.
2. The root sets a flag to let the other ranks know that the data is ready.
3. The other ranks copy the data out.
4. The other ranks update a flag to indicate to the root that they have copied the data.

[Figure: the root copies from its user buffer into the shared buffer (copy-in); non-root 1 and non-root 2 copy from the shared buffer into their user buffers (copy-out).]

A minimal code sketch of these four steps follows.
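The sketch below illustrates the four broadcast steps with an MPI-3 shared-memory window standing in for the dedicated shared-memory layer; it is not the authors' implementation. The shm_cell_t and shm_bcast names, the fixed cell size, the monotonic sequence counter, and the use of C11 atomics for the ready/done flags (assumed to be lock-free and valid in process-shared memory) are all assumptions made for this illustration.

    #include <mpi.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    #define CELL_BYTES 1024                  /* size of one shared cell (assumed) */

    typedef struct {
        atomic_int data_ready;               /* bumped by the root after copy-in      */
        atomic_int copied_out;               /* bumped by each non-root after copy-out */
        char       buf[CELL_BYTES];
    } shm_cell_t;

    /* Steps 1-4 of the intra-node broadcast described on the slide. */
    static void shm_bcast(void *user_buf, size_t len, int root,
                          MPI_Comm node_comm, shm_cell_t *cell)
    {
        static int seq = 0;                  /* per-process call counter */
        int rank, size;
        MPI_Comm_rank(node_comm, &rank);
        MPI_Comm_size(node_comm, &size);
        seq++;

        if (rank == root) {
            memcpy(cell->buf, user_buf, len);                            /* step 1 */
            atomic_store_explicit(&cell->data_ready, seq,
                                  memory_order_release);                 /* step 2 */
            /* step 4, root side: wait until every non-root has copied out */
            while (atomic_load_explicit(&cell->copied_out,
                                        memory_order_acquire) < (size - 1) * seq)
                ;
        } else {
            /* step 3: wait for the ready flag, then copy the data out */
            while (atomic_load_explicit(&cell->data_ready,
                                        memory_order_acquire) < seq)
                ;
            memcpy(user_buf, cell->buf, len);
            atomic_fetch_add_explicit(&cell->copied_out, 1,
                                      memory_order_release);             /* step 4 */
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Communicator of the ranks that share this node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        int rank;
        MPI_Comm_rank(node_comm, &rank);

        /* One shared cell, allocated on rank 0 and mapped by every rank. */
        shm_cell_t *cell;
        MPI_Win win;
        MPI_Win_allocate_shared(rank == 0 ? sizeof(shm_cell_t) : 0, 1,
                                MPI_INFO_NULL, node_comm, &cell, &win);
        MPI_Aint qsize;
        int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &cell);
        if (rank == 0)
            memset(cell, 0, sizeof(*cell));  /* flags start at zero */
        MPI_Barrier(node_comm);

        char msg[CELL_BYTES] = { 0 };
        if (rank == 0)
            strcpy(msg, "hello from the shared-memory bcast");
        shm_bcast(msg, sizeof(msg), 0, node_comm, cell);
        printf("rank %d got: %s\n", rank, msg);

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }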

Intra-node Reduce

4 steps:
1. Each non-root copies its data into shared memory.
2. Each non-root updates a flag to tell the root that the data is ready.
3. The root copies the data out of each non-root's buffer and reduces it.
4. The root updates a flag to tell the non-roots that it has copied the data out.

[Figure: non-root 1 and non-root 2 copy from their user buffers into the shared buffers (copy-in); the root copies out of the shared buffers into its user buffer (copy-out).]

A corresponding code sketch follows.
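In the same spirit, here is a hedged sketch of the four reduce steps (a sum over doubles), reusing the includes and the MPI_Win_allocate_shared setup from the broadcast sketch above. The per-rank slots, the MAX_RANKS/SLOT_DOUBLES constants, and the shm_reduce_area_t/shm_reduce_sum names are assumptions for illustration, not the paper's layout; count is assumed to be at most SLOT_DOUBLES and the node size at most MAX_RANKS.

    #define MAX_RANKS    64                  /* assumed upper bound on ranks per node */
    #define SLOT_DOUBLES 128                 /* assumed per-rank slot size            */

    typedef struct {
        atomic_int data_in;                  /* bumped by each non-root after copy-in */
        atomic_int done;                     /* bumped by the root after reducing     */
        double     slot[MAX_RANKS][SLOT_DOUBLES];
    } shm_reduce_area_t;

    /* Steps 1-4 of the intra-node reduce described on the slide (sum of doubles). */
    static void shm_reduce_sum(const double *user_buf, double *result, int count,
                               int root, MPI_Comm node_comm, shm_reduce_area_t *area)
    {
        static int seq = 0;                  /* per-process call counter */
        int rank, size;
        MPI_Comm_rank(node_comm, &rank);
        MPI_Comm_size(node_comm, &size);
        seq++;

        if (rank != root) {
            /* step 1: each non-root copies its contribution into its own slot */
            memcpy(area->slot[rank], user_buf, count * sizeof(double));
            /* step 2: tell the root the data is ready */
            atomic_fetch_add_explicit(&area->data_in, 1, memory_order_release);
            /* step 4, non-root side: wait until the root has consumed the slot */
            while (atomic_load_explicit(&area->done, memory_order_acquire) < seq)
                ;
        } else {
            /* step 3: the root waits for all contributions, then copies them
             * out of the shared slots and reduces them                       */
            while (atomic_load_explicit(&area->data_in,
                                        memory_order_acquire) < (size - 1) * seq)
                ;
            memcpy(result, user_buf, count * sizeof(double));
            for (int r = 0; r < size; r++) {
                if (r == root)
                    continue;
                for (int i = 0; i < count; i++)
                    result[i] += area->slot[r][i];
            }
            /* step 4: tell the non-roots their slots can be reused */
            atomic_store_explicit(&area->done, seq, memory_order_release);
        }
    }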

Design and Implementation

Shared Memory Layout
- Bcast buffer: the root copies the data in; the other ranks copy the data out.
- Reduce buffer: each rank copies its data in; the root copies the data out and reduces.
- Flags: to notify the ranks after copying the data in/out of shared memory.

A sketch of this layout as a C struct follows.
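As a rough C view of this layout (sizes, counts, and field names are illustrative assumptions, not the authors' actual structure; MAX_RANKS and <stdatomic.h> as in the sketches above), the per-node shared region might look like:

    #define NUM_CELLS        4                        /* bcast buffer split into cells (assumed) */
    #define BCAST_CELL_BYTES (32 * 1024 / NUM_CELLS)  /* 32KB bcast buffer (assumed)             */
    #define REDUCE_SLOT_BYTES 8192                    /* per-rank reduce slot (assumed)          */

    typedef struct {
        /* Bcast buffer: the root copies data in, the other ranks copy it out. */
        char bcast_cell[NUM_CELLS][BCAST_CELL_BYTES];

        /* Reduce buffer: each rank copies its data in, the root copies it out
         * and reduces.                                                        */
        char reduce_slot[MAX_RANKS][REDUCE_SLOT_BYTES];

        /* Flags: notify ranks after data is copied in/out of shared memory
         * (one release flag and one gather flag per rank).                    */
        atomic_int release_flag[MAX_RANKS];
        atomic_int gather_flag[MAX_RANKS];
    } shm_region_t;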

Release and Gather Steps

Set-up: arrange the ranks in a tree with rank 0 as the root.

Release step (top-down): a rank releases its children.
- Inform the children using release flags.
- Copy the data (if bcast).

Gather step (bottom-up): a rank gathers from all its children.
- Copy the data (if reduce).
- Inform the parent using gather flags.

[Figure: release and gather waves over an example tree of ranks 0-7 rooted at rank 0.]

A code sketch of the two building blocks follows.
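The two building blocks can be sketched on top of the shm_region_t flags from the layout sketch above. A fixed tree (tree_node_t) is assumed to have been computed once per communicator, and a monotonically increasing sequence number stands in for the actual flag protocol, so this is an illustration rather than the paper's exact implementation.

    typedef struct {
        int parent;                          /* -1 for the tree root (rank 0) */
        int nchildren;
        int children[MAX_RANKS];
    } tree_node_t;

    /* Release step (top-down): a rank waits to be released by its parent,
     * moves bcast data if needed, and then releases its own children by
     * bumping their release flags.                                        */
    static void release_step(int rank, const tree_node_t *me,
                             shm_region_t *shm, int seq)
    {
        if (me->parent >= 0) {
            while (atomic_load_explicit(&shm->release_flag[rank],
                                        memory_order_acquire) < seq)
                ;                            /* wait for my parent */
        }
        /* ... for MPI_Bcast, copy data in (root) or out (non-root) here ... */
        for (int c = 0; c < me->nchildren; c++)
            atomic_store_explicit(&shm->release_flag[me->children[c]], seq,
                                  memory_order_release);
    }

    /* Gather step (bottom-up): a rank waits for all of its children, reduces
     * their data if this is a reduce, and then informs its parent by bumping
     * its own gather flag.                                                   */
    static void gather_step(int rank, const tree_node_t *me,
                            shm_region_t *shm, int seq)
    {
        for (int c = 0; c < me->nchildren; c++) {
            while (atomic_load_explicit(&shm->gather_flag[me->children[c]],
                                        memory_order_acquire) < seq)
                ;                            /* wait for child c */
        }
        /* ... for MPI_Reduce, reduce the children's slots here ... */
        if (me->parent >= 0)
            atomic_store_explicit(&shm->gather_flag[rank], seq,
                                  memory_order_release);
    }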

Bcast and Reduce Using Release and Gather Steps

[Figure: the release tree and gather tree over ranks 0-7, as on the previous slide.]

MPI_Bcast
- Release step (data movement): 1. the root copies the data into the shm buffer; 2. it informs its children; 3. the children copy the data out.
- Gather step (acknowledgment): inform the parent that the buffer is ready for the next bcast.

MPI_Reduce
- Gather step (data movement): 1. all ranks copy their data into the shm buffer; 2. they inform their parent; 3. the parent reduces the data.
- Release step (acknowledgment): inform the children that the buffer is ready for the next reduce.

This composition is sketched in code below.
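The table maps directly onto the two hypothetical release_step/gather_step functions from the previous sketch; which pass moves user data and which pass only moves acknowledgements is the main difference between the two collectives:

    /* MPI_Bcast: release moves the data, gather is only an acknowledgement. */
    static void intra_bcast(int rank, const tree_node_t *me,
                            shm_region_t *shm, int seq)
    {
        release_step(rank, me, shm, seq);   /* root copies in, children copy out  */
        gather_step(rank, me, shm, seq);    /* "buffer ready for the next bcast"  */
    }

    /* MPI_Reduce: gather moves the data, release is only an acknowledgement. */
    static void intra_reduce(int rank, const tree_node_t *me,
                             shm_region_t *shm, int seq)
    {
        gather_step(rank, me, shm, seq);    /* ranks copy in, parents reduce       */
        release_step(rank, me, shm, seq);   /* "buffer ready for the next reduce"  */
    }

    /* MPI_Allreduce then follows the three phases from the Background slide:
     * intra_reduce(), an inter-node allreduce among the node leaders, and
     * intra_bcast() of the result.                                           */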

Optimizations
- Intra-node topology-aware trees
- Data pipelining
- Read from the parent flag on the release step
- Data copy optimization in reduce

Intra-node Topology-aware Trees

[Figure: an example intra-node tree over 20 ranks grouped into per-socket subtrees (subtrees 0, 4, 8, 12, 16) rooted at socket leaders S0, S4, S8, S12, S16; the socket-leader-first and socket-leader-last variants are shown side by side.]
- Socket-leader-first: better for the release step.
- Socket-leader-last: better for the gather step.

A hedged sketch of building such a tree follows.
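One plausible way to build such a tree in code is sketched below: the lowest rank on each socket acts as the socket leader, non-leaders hang under their own leader, and the remote leaders hang under rank 0 either before or after rank 0's same-socket children. This only illustrates the idea; the authors' exact socket-leader-first/last construction, the flat per-socket subtrees, and the socket_of[] input (e.g. from hwloc) are assumptions here, with tree_node_t and MAX_RANKS as in the sketches above.

    /* Append `child` under `parent` in the tree. */
    static void add_child(tree_node_t *tree, int parent, int child)
    {
        tree[child].parent = parent;
        tree[parent].children[tree[parent].nchildren++] = child;
    }

    /* Build a socket-aware intra-node tree rooted at rank 0.
     * socket_of[r] is the socket id of rank r (ids assumed < MAX_RANKS). */
    static void build_socket_tree(int nranks, const int *socket_of,
                                  int leaders_first, tree_node_t *tree)
    {
        int leader[MAX_RANKS];               /* leader rank of each socket id */
        for (int s = 0; s < MAX_RANKS; s++)
            leader[s] = -1;
        for (int r = 0; r < nranks; r++) {
            tree[r].parent = -1;
            tree[r].nchildren = 0;
            if (leader[socket_of[r]] < 0)
                leader[socket_of[r]] = r;    /* lowest rank = socket leader */
        }

        /* Socket-leader-first: remote socket leaders become rank 0's first
         * children, so every socket starts its release wave early.         */
        if (leaders_first)
            for (int r = 1; r < nranks; r++)
                if (r == leader[socket_of[r]])
                    add_child(tree, 0, r);

        /* Non-leader ranks hang under their own socket's leader (flat
         * subtrees here; the paper uses k-ary/k-nomial subtrees).          */
        for (int r = 1; r < nranks; r++)
            if (r != leader[socket_of[r]])
                add_child(tree, leader[socket_of[r]], r);

        /* Socket-leader-last: remote leaders are attached after rank 0's
         * own socket members, which favors the gather direction.           */
        if (!leaders_first)
            for (int r = 1; r < nranks; r++)
                if (r == leader[socket_of[r]])
                    add_child(tree, 0, r);
    }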

Other Variants for Trees
- K-ary vs. k-nomial trees
- Topology-unaware trees
- Right-skewed vs. left-skewed trees

[Figure: a left-skewed and a right-skewed tree over ranks 0-7.]

Data Pipelining

Split a large message into multiple chunks across cells:
- Bcast: the root copies the next chunk of data into the next cell while the non-roots copy out from the previous cells.
- Reduce: the non-roots copy into the next cells while the root reduces the data from the previous cells.

Also useful for back-to-back collectives.

[Figure: the bcast buffer split into 3 cells, with the root and the non-roots working on different cells concurrently.]

A pipelined-broadcast sketch follows.
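The pipelining idea can be sketched by turning the single cell from the broadcast sketch into an array of cells that the root cycles through. Chunk size, cell count, and the flat (non-tree) flag scheme are assumptions, and the cells' flags are assumed to start at zero for this call.

    /* Pipelined flat broadcast over `ncells` cells of CELL_BYTES each
     * (shm_cell_t as in the broadcast sketch above).                   */
    static void shm_bcast_pipelined(char *user_buf, size_t len, int root,
                                    MPI_Comm node_comm, shm_cell_t *cells,
                                    int ncells)
    {
        int rank, size;
        MPI_Comm_rank(node_comm, &rank);
        MPI_Comm_size(node_comm, &size);

        size_t nchunks = (len + CELL_BYTES - 1) / CELL_BYTES;
        for (size_t i = 0; i < nchunks; i++) {
            shm_cell_t *cell = &cells[i % ncells];
            size_t off   = i * CELL_BYTES;
            size_t chunk = (len - off < CELL_BYTES) ? (len - off) : CELL_BYTES;
            int    seq   = (int)(i / ncells) + 1;   /* uses of this cell so far */

            if (rank == root) {
                /* Reuse the cell only after everyone copied its previous chunk out,
                 * then immediately move on to the next cell (the overlap).          */
                while (atomic_load_explicit(&cell->copied_out,
                                            memory_order_acquire) < (size - 1) * (seq - 1))
                    ;
                memcpy(cell->buf, user_buf + off, chunk);
                atomic_store_explicit(&cell->data_ready, seq, memory_order_release);
            } else {
                while (atomic_load_explicit(&cell->data_ready,
                                            memory_order_acquire) < seq)
                    ;
                memcpy(user_buf + off, cell->buf, chunk);
                atomic_fetch_add_explicit(&cell->copied_out, 1, memory_order_release);
            }
        }
    }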

Other Optimizations

Read from the parent flag on the release step:
- The parent updates its own flag rather than writing a flag for each child.

Data copy optimization in reduce:
- The root reduces the data directly into its user buffer, instead of reducing in the shm buffer and copying to the user buffer.

The first optimization is sketched below.
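With this optimization, the release step from the earlier sketch changes so that a parent bumps only its own flag and each child polls its parent's slot, as in the hedged variant below (same assumed types as before). The data copy optimization is purely a data-movement change and needs no new flags: the root accumulates the children's shared slots directly into its user buffer.

    /* Release step with the "read from the parent's flag" optimization:
     * one flag write per rank instead of one write per child.           */
    static void release_step_parent_flag(int rank, const tree_node_t *me,
                                         shm_region_t *shm, int seq)
    {
        if (me->parent >= 0) {
            while (atomic_load_explicit(&shm->release_flag[me->parent],
                                        memory_order_acquire) < seq)
                ;                            /* poll the parent's own flag */
        }
        /* ... copy bcast data here if needed ... */
        if (me->nchildren > 0)
            atomic_store_explicit(&shm->release_flag[rank], seq,
                                  memory_order_release);   /* single write */
    }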

Performance Evaluation

Experimental Setup

System configuration:
- Skylake (SKL): Intel Xeon Gold 6138F CPU (2.0 GHz, 2 sockets, 20 cores/socket, 2 threads/core); 32KB L1 data and instruction caches, 1MB L2 cache, 27.5MB L3 cache
- OmniPath-1 fabric interconnect

Software configuration:
- GCC compiler version 8.1.0
- SUSE Linux Enterprise Server 12 SP3 running Linux kernel 4.4.132-94.33-default
- Libfabric (commit id 91669aa), opa-psm2 (commit id 0f9213e)
- MPICH (commit id d815dd4) used as the baseline for our implementation; MPICH/ch3, MPICH/ch4
- Open MPI (version 3.0.0) and MVAPICH (version 2-2.3rc1)

Benchmark:
- Intel MPI Benchmarks (IMB), version 2018 Update 1; the reported T_max is used for comparison
- IMB options: -iter 5000 -msglog 22 -sync 1 -imb_barrier 1 -root_shift 0 (this configuration applies to all results slides; see the performance-related disclaimers above)

MPI_Bcast: Single Node, 40 MPI Ranks (1 rank per core)

[Chart: MPI_Bcast latency vs. message size for our framework, Open MPI, MVAPICH, MPICH/ch3, and MPICH/ch4; lower is better.]

- 32KB buffer split into 4 cells; a flat tree is used to propagate the flags
- Compared against tuned Open MPI, MVAPICH, MPICH/ch3, and MPICH/ch4
- Average speedups: 3.9x over Open MPI, 1.2x over MVAPICH, 2.1x over MPICH/ch3, 2.9x over MPICH/ch4

MPI_Allreduce: Single Node, 40 MPI Ranks (1 rank per core)

[Chart: MPI_Allreduce latency vs. message size; lower is better.]

- 32KB buffers split into 4 cells
- Tree configuration for the reduce phase: socket-leaders-last and right-skewed trees; message size < 512B: topology-unaware k-nomial tree, K=4; 512B <= message size < 8KB: topology-aware k-ary tree, K=3; message size >= 8KB: topology-aware k-ary tree, K=2
- Bcast phase: flat tree

Impact of Topology-aware Trees (MPI_Reduce, 40 MPI ranks, 1 rank per core)

[Chart: MPI_Reduce latency with topology-aware vs. topology-unaware trees; lower is better.]

- Topology-aware trees: socket-leaders-last and right-skewed; message size <= 4KB: k-ary tree, K=3; message size > 4KB: k-ary tree, K=2
- Topology-unaware trees: message size <= 16KB: k-nomial tree, K=8; message size > 16KB: k-nomial tree, K=2

Multiple-node Runs (32 nodes, 40 ranks per node)

[Chart: multi-node results; lower is better.]

We only compare to MPICH/ch3 and MPICH/ch4 in order to keep the inter-node collective implementation the same.

Why Are We Better?

[Table: comparison of Open MPI, MVAPICH, MPICH (ch3, ch4), and our framework on three features: dedicated shared memory, node topology awareness, and network topology awareness.]

Conclusions
- We implement MPI_Bcast, MPI_Reduce, and MPI_Allreduce using the release and gather building blocks.
- We significantly outperform MVAPICH, Open MPI, and MPICH.
- Careful design of the trees used to propagate data and flags provides improvements of up to 1.8x over naive trees.
- Compared to MPICH, speedups of up to 2.18x for MPI_Allreduce and up to 2.5x for MPI_Bcast on a 32-node cluster.

Questions?

[email protected]

Check out the MPICH BoF today! @C145, 5:15pm
