Process Allocation Strategies forRuntime OptimizationJoint work with: Guillaume Mercier, François Tessier,Charm team, Torsten HoeflerEmmanuel JeannotRuntime TeamInria Bordeaux Sud-OuestJuly 10, 2014

MPI (Process-based runtime systems)Performance of MPI programs depends on manyfactors: Implementation of the standard (e.g. collective com.)Parallel algorithm(s)Implementation of the algorithmUnderlying libraries (e.g. BLAS)Data LayoutHardware (processors, cache, network)etc.But Emmanuel Jeannot

Process PlacementThe MPI model makes little (no?) assumption on theway processes are mapped to resourcesIt is often assume that the network topology is flatand hence the process mapping has little impact onthe performance Interconection networkEmmanuel JeannotCPUCPUCPUCPUMemMemMemMem

The Topology is not FlatDue to multicore processors current and future parallelmachines are hierarchicalCommunication speed depends on: Receptor and emitter Cache hierarchy Memory bus Interconnection networketc.Almost nothing in the MPI standard help to handlethese factorsEmmanuel Jeannot

Example on a Parallel MachineThe higher we have to go into the hierarchy the costly thedata exchangeLocal RAMNetworkLocal RAMMem. ControlerNICL3Local RAML1/L2Mem.L1/L2ControlerNode Core 1Core.Emmanuel JeannotLocal RAMThe network canalsobeMem.ControlerInterconectMem. ControlerNIChierarchical!L3Local RAM BusL1/L2L1/L2Mem. ControlerNodeCore2CoreLocal RAML1/L2L3L1/L2ControlerMem.NodeCore3CoreLocal RAML1/L2L1/L2 Mem. Controler4. CoreCoreNode

RationaleNot all processes exchange thesame amount of data6420152105101520Sender rankEmmanuel Jeannot25303505Receiver rankThe speed of communications,and hence performance of theapplication depends on the wayprocesses are mapped toresources.2530835MPICH2 BERTHA RR sp.C.36.mat

Do we Really Care: to Bind or not to Bind?Zeus MHD Blast70No process bindingProcess bindingAfter all, the system scheduler is able to move processeswhen needed.60Execution time (s)50Yes, but only for shared memory system. Migration ispossible but it is not in the MPI standard (see charm )4030Moreover binding provides better execution 003500040000# iterationZeus MHD Blast. 64 Processes/Cores. Mvapich2 1.8. ICCEmmanuel Jeannot4500050000

Process Placement ProblemGiven : Parallel machine topology Process affinity (communication pattern)Map processes to resources (cores) to reducecommunication cost. A nice algorithmic problem: Graph partitionning (Scotch, Metis) Application tuning [Aktulga et al. Euro-Par 12] Topology-to-pattern matching (TreeMatch)Emmanuel Jeannot

!"# %#&'()* ,# -)./01&# 0#2/ 3)')0'4)"5)67889)'%:)789;)ReduceCommunication Cost?!"# %&#!"# %& 7%5893@%/'()('()* ,-.%/%01%2345" 6 '()*' 9.:37;#) .:37;# )(((?A BCD'( BCBut wait, my applicationis compute-bound!8 ! 7)? @A ! )))E F(G H" %#;I:37 J* ,# -)- -"/ BCD88EF": )4 /5"/-'%0 D7G)HIDJ7) "/)DGKIBCD8E)L BCD88EF": )- -"/ ) ;7G H M,7)? AK M,)E F((' H" %#;I:37 JBCD88EWell, but this might notbe still truein the future:strongD7BCDNE)"/)D8NBCD88E)L BCD888Emight not G)H M,always be 788?A88H M,a solution.BCD88EK"#'(scalingF": )O%# /0"%% 0#) ;F": )0"%02// %0 CDPA)"/)DPQ)5/"-)- -"/ ) ;E* ,# -),&R )C%": ,EDQJS88BCD88J888E)"/)BCD9EBCD8E)L BCD88EK"#'()0"%02// %0 77GJ888BCT&((&"%E UBCD8E)#")BCD88E)5"/)('# %0 )1&:&%VWBCD8J888E*#"/'V DG)! G88?D888)! )CXD8Y , ,# -)- -"/ )&,)-&%EBCD8E L BCD88EOB8 7)K @8 K M, C1"3)("%V)#"):/'&%)#1 )-'01&% EBCD88E:' ,BCD):' E? BCD8E9KKOTaken from one of J. Dongarra’s Talk.Emmanuel Jeannot

Obtaining the Topology (Shared Memory)HWLOC (portable hardware locality): Runtime and OpenMPI team Portable abstraction (across OS, versions,architectures, .) Hierarchical topology Modern architecture (NUMA, cores, caches, etc.) ID of the cores C library to play with EtcEmmanuel Jeannot

el Jeannot

Obtaining the Topology (DistributedMemory)Not always easy (research issue)MPI core has some routine to get thatSometime requires to build a file that specifies nodeadjacency.Netloc.Emmanuel Jeannot

Getting the Communication PatternNo automatic way so far Emmanuel Jeannot655044033022010203040Sender rank50600110Receiver rankCan be done: Through application monitoring During execution (Loadbalancing) With a « blank execution » From the application design(stencil, etc.) From the data partitionning60Crypto RSA768 64 procs

State of the ArtProcess placement fairly well studied problem: [Träff 02]: placement through graph embeddingand graph partitioning MPIPP [Chen et al. 2006]: placement throughlocal exchange of processes until no gain isachievable [Clet-Ortega & Mercier 09]: placement throughgraph renumbering (Scotch) LibTopoMap [Hoefler & Snir 11]: placementthrough network model graph partitioning(ParMetis)All these solutions require a quantitative knowledge of thetopology.Emmanuel Jeannot

Problems with quantitative knowledgeNetpipe on a 2 sockets, 8 cores each NUMA node50004500104000935008Latency (µsec)Bandwidth Shared114 16 64 256 1K 4K 16K64K256K1M 4MMPI Message Size (Byte)01 2 4 8 1632641282565121K2K4K 16KMPI Message Size (Byte)It is generaly faster to use cache or to stay within a socketBut the acceleration ratio depends on message size: It is not linear (not affine either) Contention makes things even harderEmmanuel Jeannot

Dealing with Qualitative KnowledgeAbstract the topology with a treeAssume communication always cost more when you needto reach higher levels04152637The structure is sufficient. No need to deal with latency orbandwidth.Emmanuel Jeannot

TreeMatch AlgorithmC: communication 01101000010001111001110000234501Grouped matrix010122024101204202202401012420210120Emmanuel Jeannot670415263701234567Communication matrix Tree Topology Process permutation

Process reorderingTreeMatch: process permutationIf process already bound: rank reordering within the MPI codeBuild a new communicator that optimizes the executionProblem: how to take into account placement constraintsEmmanuel Jeannot

Dealing with already mapped applicatonsProblem: Given a hierarchichal topology An already mapped applicationonto a subset of the nodes Reorder process while ensuringonly this subset is usedPrevent Treematch to use somecores.Emmanuel Jeannot0410125125226450337334

New Version of TreeMatch to Deal WithUnbalanced Tree04101252645370415263Solution: Extend the communication matrix with dummy nodes Process the tree backward by doing k-partitionning Force each partition to have the right number of dummy nodes Process recursivelyEmmanuel Jeannot37

Results: placement computation timeEmmanuel Jeannot

Results400% gainagainstsome graphpartitionners36% gainagainststandardMPI policy64 nodes linked with an Infiniband interconnect (HCA: Mellanox Technologies, MT26428 ConnectX IB QDR).Each node features two Quad-core INTEL XEON NEHALEM X5550 (2.66 GHz) processors.Emmanuel Jeannot

Strategy for Charm - Intra-node placementTopology-aware load balancingTreeMatch load balancer1st step : Remap groups of chareson nodes according to thecommunication on the networkWithin Charm libtopomap (example : part of3d Torus)Taking intoaccount:ndstep : Reorder chares inside 2eachCPUnodeload(distributed)Apply TreeMatchon the NUMA Processaffinitytopology and the chares Toplogycommunication patternBind chares according to theirload (leveling on less loadedchares)Each node carries out its ownplacement in parallelEmmanuel JeannotFrancois TessierTreeMatch in Charm 11 / 19

Benchmarks application designed to simulate regular intensivecommunication between processesExperiments on 8 nodes with 8 cores on each (Intel Xeon 5550) - PlaFRIMClusterResult with k-neighbourParticularly compared to RefineCommLBTakes into account load and communicationMinimizes migrationskNeighbor on 64 cores128 elements 1MB message sizekNeighbor on 64 cores256 elements 1MB message size2000700Execution time (in seconds)Execution time (in seconds)60050040030020015001000500100Francois TessierEmmanuel JeannotTreeMatch in Charm TMLB LB TreeBasedRefineCommLBGreedyLBGreedyCommLB0DummyLB0

ConclusionTo ensure performance protability one must take intoaccount the topology of target machineProcess placement according to application behavior andtopology helps in increasing performanceWe are ready to test our techniques on your applications No need to change the code to test TreeMatchEmmanuel Jeannot


Topology-aware load balancing Strategy for Charm - Intra-node placement TreeMatch load balancer 1st step : Remap groups of chares on nodes according to the communication on the network libtopomap (example : part of 3d Torus) 2nd step : Reorder chares inside each node (distributed