
Using Environment Variables for Collective Operation Control in the Intel MPI Library to Optimize MPI Collectives on the Intel Xeon Phi Coprocessor

Published 2016-01-25 17:28 | Source: CSDN | Author: CSDN


Abstract

This paper discusses a method for optimizing the performance of message passing interface (MPI) collective operations in applications linked against the Intel MPI Library. The Intel MPI Library provides the I_MPI_ADJUST family of environment variables, which allows an explicit choice of the algorithm used for each collective operation. Combining this collective optimization method with the process pinning and interconnect-fabric controls available in the Intel MPI Library is demonstrated by measuring message-passing latencies with the Intel MPI Benchmarks. The Intel MPI Benchmarks analysis presented here was run in native mode on an Intel Xeon Phi coprocessor, but this model of selecting the best collective algorithms is applicable to other Intel microarchitectures as well.

  1. Introduction
    • Who will benefit from this article?
    • What method is used to improve the performance of message passing interface (MPI) collectives?
    • How is this article organized?
  2. Collective operation control in the Intel MPI Library
  3. The Intel Xeon Phi coprocessor architecture
  4. The Intel MPI Benchmarks
  5. MPI_Allreduce performance improvements from adjusting the I_MPI_ADJUST_ALLREDUCE environment variable settings
  6. Conclusion
  7. References

1 Introduction

Who will benefit from this article?

If you are working with a message passing interface (MPI) [1,2,3] application linked against the Intel MPI Library [4] for the Linux* or Windows* operating systems, the application uses collective operations, and statistical analysis from software tools such as Intel VTune Amplifier XE [5] or the Intel Trace Analyzer and Collector [6] shows that those collective operations are a performance hotspot, then this article can help you achieve better execution performance.

What method is used to improve the performance of MPI collectives?

The Intel MPI Library [4] provides a way to control collective algorithm selection explicitly through the I_MPI_ADJUST family of environment variables. The optimization process is demonstrated using executables built from the Intel MPI Benchmarks [7], linked against the Intel MPI Library [4], and run on the Intel Xeon Phi coprocessor architecture. Users can apply this method to their own MPI applications that are linked with the Intel MPI Library and contain collective operations. Moreover, this optimization technique carries over to other Intel microarchitectures, as well as to other MPI implementations that provide collective algorithm controls (for Intel microprocessors and for non-Intel microprocessors that support Intel architecture).
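
For a concrete illustration of this mechanism, consider the minimal timing sketch below (the message size, iteration count, and launch lines are illustrative choices, not taken from this article): it times MPI_Allreduce so that the same binary can be launched with different I_MPI_ADJUST_ALLREDUCE values and the resulting latencies compared.

    /* allreduce_timing.c - time MPI_Allreduce for one message size.
     * Build:   mpiicc allreduce_timing.c -o allreduce_timing
     * Run with the library default algorithm selection:
     *   mpirun -n 16 ./allreduce_timing
     * Run with an explicit algorithm (e.g., 2 = Rabenseifner's, per Table 1):
     *   mpirun -genv I_MPI_ADJUST_ALLREDUCE 2 -n 16 ./allreduce_timing
     */
    #include <mpi.h>
    #include <stdio.h>

    #define COUNT 4096   /* doubles per reduction (illustrative) */
    #define ITERS 1000   /* timing iterations (illustrative) */

    int main(int argc, char **argv)
    {
        double in[COUNT], out[COUNT], t0, t1;
        int i, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < COUNT; i++)
            in[i] = (double)rank;

        MPI_Barrier(MPI_COMM_WORLD);   /* synchronize ranks before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("average MPI_Allreduce time: %g usec\n",
                   1.0e6 * (t1 - t0) / ITERS);
        MPI_Finalize();
        return 0;
    }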

How is this article organized?

Section 2 describes MPI collective operations [1,2,3] and the family of environment variables that can be used to control collective optimization in the Intel MPI Library. Section 3 provides background on the Intel Xeon Phi coprocessor architecture, along with a group of Intel MPI Library environment variables for controlling computation and communication among the cores on an Intel Xeon Phi coprocessor socket. Section 4 briefly introduces the Intel MPI Benchmarks [7]. Section 5 uses MPI collectives within the Intel MPI Benchmarks to showcase performance results on the Intel Xeon Phi coprocessor. Finally, conclusions are provided.

2 Collective operation control in the Intel MPI Library

MPI collective operations across processes take three basic forms [3] (each form is illustrated in the short sketch after this list):

  • Barrier synchronization
  • Global communication functions, such as Broadcast, Gather, Scatter, and Gather/Scatter, which move data from all members to all members of a group (complete exchange, or all-to-all)
  • Global reduction operations, such as Allreduce, applying a function such as sum, product, maximum, minimum, or a user-defined operation
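
To make the three forms concrete, the snippet below (a minimal sketch, not code from this article) issues one collective of each kind:

    /* collective_forms.c - one example of each basic collective form. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 1. Barrier synchronization across all processes. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* 2. Global communication: rank 0 broadcasts a value to everyone. */
        if (rank == 0)
            value = 42;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* 3. Global reduction: every rank contributes, all receive the sum. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d: value=%d sum=%d\n", rank, value, sum);
        MPI_Finalize();
        return 0;
    }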

In all these cases, the message-passing library can take advantage of knowledge about the structure of the computing system to increase the parallelism of these collectives [1] and thereby improve application performance [3].

Following the MPI standard [1], Figure 1 illustrates the MPI collective operations that transform data across subsets of MPI processes (namely the Broadcast, Scatter, Gather, Allgather, and Alltoall collective functions). [1] Rows represent MPI processes, and columns show the data associated with each MPI process. A_i, B_i, C_i, D_i, E_i, and F_i are data items that may be associated with MPI process i.

Figure 1. Illustration of how the MPI collective functions Broadcast, Scatter, Gather, Allgather, and Alltoall move data among the MPI processes. [1]

The typical methods of implementing the collective operations shown in Figure 1 can be grouped into two categories: [6]

  • Short vectors, which are handled by one set of optimization techniques
  • Long vectors, which are handled by different methods

For MPI applications that use collectives, it is therefore a priority that the MPI library performs well for all vector lengths [8]. For example [9], a collective optimization algorithm can distinguish between short messages and long messages, and switch between algorithms [10] at a short/long message threshold (a sketch of such message-size-based switching follows the list below). Hence, in general, optimizations for MPI collective operations can take the following into account: [11,12]

  • A variety of possible algorithms
  • Knowledge of the network topology
  • The programming method used
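
The Intel MPI Library also allows the I_MPI_ADJUST value to carry message-size conditions, so that different algorithms serve different message ranges (see the reference manual [4] for the exact syntax). The sketch below assumes that syntax and uses arbitrary algorithm numbers and byte ranges; in practice the variable is set in the launch environment, and the setenv() call before MPI_Init is shown only for illustration.

    /* adjust_by_size.c - select MPI_Allreduce algorithms by message size.
     * Assumed value syntax "<alg>:<low>-<high>;..." (ranges in bytes) per
     * the Intel MPI Library reference [4].  Normally set at launch time:
     *   mpirun -genv I_MPI_ADJUST_ALLREDUCE "1:0-8192;8:8193-524288" \
     *          -n 16 ./adjust_by_size
     */
    #define _POSIX_C_SOURCE 200112L   /* for setenv() */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        double x, s;

        /* Recursive doubling (1) for short messages, ring (8) for long;
         * assumes the library reads the variable during MPI_Init. */
        setenv("I_MPI_ADJUST_ALLREDUCE", "1:0-8192;8:8193-524288", 1);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        x = (double)rank;
        MPI_Allreduce(&x, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %g\n", s);

        MPI_Finalize();
        return 0;
    }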

Therefore, by constructing a parameterized model for MPI collectives, one can implement an automated process that selects the most efficient implementation of a collective algorithm at run time [13]. This paper does not use such an automated selection process; it focuses instead on the collective-related environment variables in the Intel MPI Library. With the Intel MPI Library, users can compile their MPI application once and then use the collective environment variable controls to tune the application for a variety of cluster topologies.

As mentioned in the introduction, the Intel MPI Library supports a variety of communication algorithms for each collective operation. In addition to highly optimized default settings, the library provides a way to control algorithm selection explicitly by using the I_MPI_ADJUST environment variable family.

The I_MPI_ADJUST environment variables can be used with both Intel and non-Intel microprocessors, but they may perform additional optimizations for Intel microprocessors beyond those for non-Intel microprocessors that support Intel architecture. [4]

Table 1 lists the I_MPI_ADJUST environment variable names and the set of algorithms available for each collective operation in the Intel MPI Library. [4] Depending on the cluster topology, the interconnect architecture, and shared-memory communication, one collective algorithm may provide better performance than the other algorithms associated with that collective operation. For the MPI Broadcast, Scatter, Gather, Allgather, and Alltoall operations shown in Figure 1, the "Algorithm selection" entries in Table 1 list the possible algorithms that can be used to perform the respective collective operation.

Table 1. The I_MPI_ADJUST environment variable family for collective operation controls in the Intel MPI Library. [4] Each entry shows the environment variable name, the collective operation it controls, and the available algorithm selections (the variable is set to the corresponding integer value).

I_MPI_ADJUST_ALLGATHER (MPI_Allgather)
  1. Recursive doubling algorithm
  2. Bruck's algorithm
  3. Ring algorithm
  4. Topology aware Gatherv + Bcast algorithm
  5. Knomial algorithm

I_MPI_ADJUST_ALLGATHERV (MPI_Allgatherv)
  1. Recursive doubling algorithm
  2. Bruck's algorithm
  3. Ring algorithm
  4. Topology aware Gatherv + Bcast algorithm

I_MPI_ADJUST_ALLREDUCE (MPI_Allreduce)
  1. Recursive doubling algorithm
  2. Rabenseifner's algorithm
  3. Reduce + Bcast algorithm
  4. Topology aware Reduce + Bcast algorithm
  5. Binomial gather + scatter algorithm
  6. Topology aware binomial gather + scatter algorithm
  7. Shumilin's ring algorithm
  8. Ring algorithm
  9. Knomial algorithm

I_MPI_ADJUST_ALLTOALL (MPI_Alltoall)
  1. Bruck's algorithm
  2. Isend/Irecv + waitall algorithm
  3. Pairwise exchange algorithm
  4. Plum's algorithm

I_MPI_ADJUST_ALLTOALLV (MPI_Alltoallv)
  1. Isend/Irecv + waitall algorithm
  2. Plum's algorithm

I_MPI_ADJUST_ALLTOALLW (MPI_Alltoallw)
  1. Isend/Irecv + waitall algorithm

I_MPI_ADJUST_BARRIER (MPI_Barrier)
  1. Dissemination algorithm
  2. Recursive doubling algorithm
  3. Topology aware dissemination algorithm
  4. Topology aware recursive doubling algorithm
  5. Binomial gather + scatter algorithm
  6. Topology aware binomial gather + scatter algorithm

I_MPI_ADJUST_BCAST (MPI_Bcast)
  1. Binomial algorithm
  2. Recursive doubling algorithm
  3. Ring algorithm
  4. Topology aware binomial algorithm
  5. Topology aware recursive doubling algorithm
  6. Topology aware ring algorithm
  7. Shumilin's algorithm
  8. Knomial algorithm

I_MPI_ADJUST_EXSCAN (MPI_Exscan)
  1. Partial results gathering algorithm
  2. Partial results gathering regarding layout of processes algorithm

I_MPI_ADJUST_GATHER (MPI_Gather)
  1. Binomial algorithm
  2. Topology aware binomial algorithm
  3. Shumilin's algorithm

I_MPI_ADJUST_GATHERV (MPI_Gatherv)
  1. Linear algorithm
  2. Topology aware linear algorithm
  3. Knomial algorithm

I_MPI_ADJUST_REDUCE_SCATTER (MPI_Reduce_scatter)
  1. Recursive halving algorithm
  2. Pairwise exchange algorithm
  3. Recursive doubling algorithm
  4. Reduce + Scatterv algorithm
  5. Topology aware Reduce + Scatterv algorithm

I_MPI_ADJUST_REDUCE (MPI_Reduce)
  1. Shumilin's algorithm
  2. Binomial algorithm
  3. Topology aware Shumilin's algorithm
  4. Topology aware binomial algorithm
  5. Rabenseifner's algorithm
  6. Topology aware Rabenseifner's algorithm
  7. Knomial algorithm

I_MPI_ADJUST_SCAN (MPI_Scan)
  1. Partial results gathering algorithm
  2. Topology aware partial results gathering algorithm

I_MPI_ADJUST_SCATTER (MPI_Scatter)
  1. Binomial algorithm
  2. Topology aware binomial algorithm
  3. Shumilin's algorithm

I_MPI_ADJUST_SCATTERV (MPI_Scatterv)
  1. Linear algorithm
  2. Topology aware linear algorithm
Regarding Table 1: if an application uses the MPI Scatter collective, the I_MPI_ADJUST_SCATTER environment variable can be set to the integer value 1, 2, or 3 to select the binomial algorithm, the topology aware binomial algorithm, or Shumilin's algorithm, respectively. Shumilin's algorithm is suited to small clusters, where it can use bandwidth efficiently. Readers can find descriptions of the implementations of the various algorithms in the literature (see, for example, Thakur et al.). [12]
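
As a concrete example of this (a minimal sketch; the buffer contents and rank count are arbitrary), the program below scatters one integer to each rank, and the same binary can be launched with each of the three I_MPI_ADJUST_SCATTER values:

    /* scatter_demo.c - MPI_Scatter with the algorithm chosen at launch.
     * Compare the three Table 1 choices without recompiling, e.g.:
     *   mpirun -genv I_MPI_ADJUST_SCATTER 1 -n 8 ./scatter_demo   (binomial)
     *   mpirun -genv I_MPI_ADJUST_SCATTER 2 -n 8 ./scatter_demo   (topology aware binomial)
     *   mpirun -genv I_MPI_ADJUST_SCATTER 3 -n 8 ./scatter_demo   (Shumilin's)
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, recv = -1, i;
        int *sendbuf = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {   /* root prepares one element for each rank */
            sendbuf = (int *)malloc(size * sizeof(int));
            for (i = 0; i < size; i++)
                sendbuf[i] = 100 + i;
        }
        MPI_Scatter(sendbuf, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
        printf("rank %d received %d\n", rank, recv);

        free(sendbuf);     /* free(NULL) is a no-op on non-root ranks */
        MPI_Finalize();
        return 0;
    }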

As mentioned in the introduction, the Intel microarchitecture on which collective performance was studied is the Intel Xeon Phi coprocessor, which uses the Intel Many Integrated Core architecture (Intel MIC architecture). The next section briefly describes the Intel Xeon Phi coprocessor architecture and explores selected Intel MPI Library environment variables in order to understand the following (a small verification sketch follows the list):

  • The concept of pinning MPI processes to the cores of an Intel Xeon Phi coprocessor socket
  • Controlling communication on the Intel Xeon Phi coprocessor socket
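
As a quick way to see where processes actually land (a minimal sketch, not a method from this article; I_MPI_PIN_PROCESSOR_LIST is the Intel MPI Library's pinning control [4], and sched_getcpu() is Linux-specific), each rank can report the hardware CPU it is currently running on:

    /* where_am_i.c - report each rank's current CPU to verify pinning.
     * Example launches (the pin layout value is illustrative):
     *   mpirun -n 8 ./where_am_i
     *   mpirun -genv I_MPI_PIN_PROCESSOR_LIST 0-7 -n 8 ./where_am_i
     */
    #define _GNU_SOURCE
    #include <sched.h>   /* sched_getcpu() - Linux-specific */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, hostlen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &hostlen);

        /* Each rank prints the logical CPU it is executing on right now. */
        printf("rank %d on %s is running on CPU %d\n",
               rank, host, sched_getcpu());

        MPI_Finalize();
        return 0;
    }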

3 The Intel Xeon Phi coprocessor architecture

The Intel Xeon Phi coprocessor consists of 61 processing cores, each with 4 hardware threads (244 hardware threads in total). The cores, caches, memory controllers, PCIe* (Peripheral Component Interconnect Express*) client logic, and a very high bandwidth bidirectional ring interconnect are connected together (Figure 2). [14] Each core has a private L2 cache that is kept fully coherent by a globally distributed tag directory (labeled "TD" in Figure 2). The L2 cache is 8-way set associative and 512 KB in size; it is unified, serving both data and instructions. The L1 cache comprises an 8-way set associative 32 KB L1 instruction cache and a 32 KB L1 data cache. The memory controllers and the PCIe client logic provide a direct interface to the GDDR5 memory (graphics double data rate, version 5, synchronous random access memory) on the coprocessor and to the PCIe bus, respectively. All of these components are connected together by the ring interconnect.
