MPI C Programming for NVidia Jetson TX
Duration: 5 Days
Intended Audience
This course is for experienced C/C++ programmers with some familiarity with CUDA who need not only to get up to speed with MPI programming, but also to explore its practical use on networks of multiple NVidia Jetson TX2 devices.
Course Overview
Parallel programming by definition involves co-operation between processes to solve a common task. It is up to the programmer to define the tasks that will be executed by the processors, and how these tasks synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All such parallel operations are performed via calls to a message-passing library that is entirely responsible for managing the physical communication network linking the processors together. The Message Passing Interface (MPI) is the de facto standard for message passing. This course covers the key aspects of MPI programming, such as point-to-point communication, non-blocking operations, derived datatypes, virtual topologies, and collective communication. It also covers general parallel program design issues. The course is taught using a class network of NVidia Jetson TX2 processors and PC computers running Linux. It also covers applications that combine MPI and CUDA; a brief sketch of such a combination appears after the course contents list below.
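The following is a minimal sketch of this model: two processes exchange a single integer via blocking point-to-point calls. The compiler wrapper and launcher commands (mpicc, mpirun) are assumptions and depend on the MPI installation used in class.

```c
/* Minimal sketch of the message-passing model: two processes exchange
   a message with MPI_Send / MPI_Recv.
   Typical build/run (assumed): mpicc hello.c -o hello && mpirun -np 2 ./hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0 && size > 1) {
        int payload = 42;
        /* Blocking standard-mode send to rank 1, tag 0 */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Status status;
        /* Blocking receive from rank 0, tag 0 */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d from rank 0\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```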
Course Contents
- Distributed memory and shared memory computing models
- Message-Passing Concepts
- Features of message passing programs
- Point-to-Point Communications and Messages
- Communication Modes and Completion Criteria
- Blocking and Nonblocking Communication
- Collective Communications
- Broadcast Operations
- Scatter and Gather Operations
- Reduction Operations
- MPI Routines and Return Values
- MPI Handles
- MPI Datatypes
- Communicators
- Tags
- Modes
- Sending and Receiving
- Blocking and Completion
- Deadlock and Deadlock Avoidance
- Nonblocking Sends and Receives
- Posting, Completion, and Request Handles
- Posting Sends and Receives without Blocking
- Completion - Waiting and Testing
- Send Modes
- Standard Mode Send
- Synchronous Mode Send
- Ready Mode Send
- Buffered Mode Send
- Buffer filling and MPI_Pack
- MPI_Type_struct and Mapping of C Structs to MPI Derived Types
- MPI_Type_contiguous
- MPI_Type_vector
- MPI_Type_hvector
- MPI_Type_indexed
- MPI_Type_hindexed
- Controlling the Extent of a Derived Type
- MPI_Barrier - Barrier Synchronisation
- MPI_Bcast - Broadcast
- MPI_Reduce - Reduction
- MPI_Gather - Gathering
- MPI_Allgather
- MPI_Scatter - Scattering
- MPI_Allreduce
- MPI_Gatherv
- MPI_Scatterv
- MPI_Scan
- MPI_Reduce_scatter
- MPI_COMM_WORLD
- MPI_Comm_group
- MPI_Group_incl
- MPI_Group_excl
- MPI_Group_rank
- MPI_Group_free
- MPI_Comm_create
- MPI_Comm_split
- MPI_Cart_create
- MPI_Cart_coords
- MPI_Cart_rank
- MPI_Cart_shift
- MPI_Cart_sub
- MPI_Cartdim_get
- MPI_Cart_get
- Matrix Transposition
- Iterative Solvers
- Characteristics of Serial I/O
- Characteristics of Parallel I/O
- Introduction to MPI-2 Parallel I/O
- MPI-2 File Structure
- Initializing MPI-2 File I/O
- File Views
- Data Access - Reading Data
- Data Access - Writing Data
- Closing MPI-2 File I/O
- PBLAS - Parallel Basic Linear Algebra Subprograms
- ScaLAPACK - Scalable Linear Algebra PACkage
- Domain decomposition
- Functional decomposition
- Load balancing
- Minimising Communication
- Designing for Performance
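As a closing illustration of the MPI-plus-CUDA applications mentioned in the course overview, the sketch below shows one way the two can be combined on a Jetson network: each rank runs a small CUDA kernel on its local GPU, then the partial results are combined with MPI_Reduce. The kernel, data sizes, and build commands are illustrative assumptions, not course material.

```c
/* Hypothetical MPI + CUDA sketch: each rank squares a local array on its
   GPU, then MPI_Reduce sums the partial results on rank 0.
   Typical build (assumed): nvcc -c kernel.cu; mpicc main.o kernel.o ... */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

int main(int argc, char *argv[])
{
    const int n = 1024;                 /* elements per rank (assumed) */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float h_in[1024], h_out[1024];
    for (int i = 0; i < n; i++)
        h_in[i] = (float)(rank * n + i);

    /* Run the kernel on this rank's local Jetson GPU */
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    /* Combine partial sums across the network with MPI */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < n; i++)
        local_sum += h_out[i];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Global sum of squares: %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```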