First Technology Transfer

Standard and Advanced Technical Training, Consultancy and Mentoring

NVIDIA Jetson TX CUDA Programming in C/C++

Duration: 5 Days

Intended Audience

The course is for embedded systems developers who need to learn how to develop applications for the GPU core on NVIDIA Jetson TX processors using NVIDIA's CUDA API. No prior parallel programming knowledge is required; however, participants must be familiar with programming in C/C++.

Attendees should be familiar with the following C/C++ concepts:

  • Pointers and pointers to pointers (*, **)
  • Taking the address of a variable (&)
  • Functions, for loops, if/else statements
  • Printing to standard output (printf, cout)
  • Memory allocation and deallocation
  • Arrays and indexing
  • Structures
  • General debugging

Course Overview

GPUs have evolved from fixed-pipeline graphics hardware into powerful, programmable co-processing units that are also capable of general-purpose computing. The training introduces the CUDA language and optimization techniques, and provides hands-on practice in programming the GPU on NVIDIA Jetson TX processors as well as on computers with NVIDIA CUDA-capable graphics cards.

Course Contents

  • Overview of GPU Computing
    • Brief history of GPGPU
    • CUDA overview
    • High-level introduction to CUDA syntax
    • GPGPU cores in a heterogeneous multicore processor system
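
By way of a first taste of the module above, a minimal sketch of CUDA's kernel-launch syntax (the kernel name and launch configuration are illustrative, not taken from the course labs):

    #include <cstdio>

    // Each GPU thread prints its own coordinates in the launch grid.
    __global__ void hello_kernel(void)
    {
        printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
    }

    int main(void)
    {
        hello_kernel<<<2, 4>>>();     // launch 2 blocks of 4 threads each
        cudaDeviceSynchronize();      // wait for the kernel and flush its printf output
        return 0;
    }

Compiled with nvcc, this runs unchanged on any CUDA-capable device, including the Jetson TX integrated GPU.
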
  • Data-Parallel Architectures and the GPU Programming Model
    • Data-parallelism
    • GPU programming model
      • GPU kernels
      • Host vs. device responsibilities
      • Memory management
      • CUDA syntax
      • Thread hierarchy
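
A minimal vector-addition sketch of the programming-model topics above: the host/device split, explicit memory management, and the grid/block thread hierarchy (sizes and names are illustrative):

    #include <cstdio>
    #include <cstdlib>

    // Device code: one thread per array element.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the tail of the array
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Host side: allocate and initialise input data.
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Device side: the host allocates device memory and owns all transfers.
        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Thread hierarchy: a 1-D grid of 1-D blocks covering all n elements.
        int block = 256;
        int grid = (n + block - 1) / block;
        vec_add<<<grid, block>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);                   // expected: 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }
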
  • The GPU Memory Model and Thread Cooperation
    • Task parallelism
    • Thread cooperation in GPU computing
    • GPU memory model
      • Shared memory
      • Constant memory
      • Global memory
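
A sketch of thread cooperation through shared memory, the theme of the module above: each block sums a 256-element slice of the input, with __syncthreads() barriers between steps. This is the classic block-level reduction; unified memory is used only to keep the host code short:

    #include <cstdio>

    // Each block sums a 256-element slice in shared memory, then thread 0
    // writes the block's partial sum to global memory.
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        __shared__ float s[256];             // on-chip memory shared by the block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                     // all loads complete before anyone reads s[]

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();                 // cooperation needs a barrier at every step
        }
        if (tid == 0)
            partial[blockIdx.x] = s[0];
    }

    int main(void)
    {
        const int n = 1 << 20, block = 256, grid = n / block;
        float *in, *partial;
        cudaMallocManaged((void **)&in, n * sizeof(float));
        cudaMallocManaged((void **)&partial, grid * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;

        block_sum<<<grid, block>>>(in, partial, n);
        cudaDeviceSynchronize();

        float total = 0.0f;
        for (int i = 0; i < grid; ++i) total += partial[i];
        printf("sum = %.0f (expected %d)\n", total, n);

        cudaFree(in); cudaFree(partial);
        return 0;
    }
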
  • Asynchronous Operations and Dynamic Parallelism
    • Asynchronous vs. synchronous memory transfers
    • Streams and events
    • Page locked memory
    • Streams and Unified Memory
    • Dynamic Parallelism
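
A sketch of the asynchronous techniques above: two streams, page-locked host memory, and cudaMemcpyAsync. Whether copies actually overlap compute depends on the device; the API usage is the point here, and all sizes are illustrative:

    #include <cstdio>

    __global__ void scale(float *x, int n, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= f;
    }

    int main(void)
    {
        const int n = 1 << 20, chunk = n / 2;
        size_t bytes = n * sizeof(float), cbytes = chunk * sizeof(float);

        float *h, *d;
        cudaMallocHost((void **)&h, bytes);   // page-locked host memory enables true async copies
        cudaMalloc((void **)&d, bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // Split the work in two so copies queued in one stream can overlap
        // the kernel running in the other.
        for (int k = 0; k < 2; ++k) {
            int off = k * chunk;
            cudaMemcpyAsync(d + off, h + off, cbytes, cudaMemcpyHostToDevice, s[k]);
            scale<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk, 2.0f);
            cudaMemcpyAsync(h + off, d + off, cbytes, cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();              // wait for both streams to drain

        printf("h[0] = %.1f, h[n-1] = %.1f\n", h[0], h[n - 1]);   // expected: 2.0, 2.0

        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        cudaFreeHost(h); cudaFree(d);
        return 0;
    }
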
  • More Advanced CUDA Features
    • Unified Memory
    • NVCC
    • Atomic functions
    • Dynamic memory allocation within kernels
    • Multi-GPU Programming
    • Peer-to-peer memory access
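
A sketch combining two of the features above, unified memory and atomic functions, in a small histogram (bin count and input data are illustrative):

    #include <cstdio>

    // Each thread increments one of 16 bins; atomicAdd serialises
    // conflicting updates so no increments are lost.
    __global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i] % 16], 1u);
    }

    int main(void)
    {
        const int n = 1 << 20;
        unsigned char *data;
        unsigned int *bins;

        // Unified memory: one pointer visible to both host and device.
        cudaMallocManaged((void **)&data, n);
        cudaMallocManaged((void **)&bins, 16 * sizeof(unsigned int));
        for (int i = 0; i < n; ++i) data[i] = (unsigned char)i;
        for (int i = 0; i < 16; ++i) bins[i] = 0;

        histogram<<<(n + 255) / 256, 256>>>(data, n, bins);
        cudaDeviceSynchronize();      // required before the host reads managed data again

        for (int i = 0; i < 16; ++i)
            printf("bin %2d: %u\n", i, bins[i]);

        cudaFree(data); cudaFree(bins);
        return 0;
    }
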
  • Libraries
    • CUFFT
    • CUBLAS
    • Thrust
    • CURAND
    • NVIDIA Performance Primitives (NPP)
    • OpenACC
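
As an example of the library-level style covered in the module above, a short Thrust sketch that sorts and reduces a device vector (sizes are illustrative):

    #include <cstdio>
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    int main(void)
    {
        // Fill a host vector, then copy it to the device in one transfer.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = (int)((h.size() - i) % 1000);
        thrust::device_vector<int> v = h;

        thrust::sort(v.begin(), v.end());                 // parallel sort on the GPU
        int total = thrust::reduce(v.begin(), v.end());   // parallel sum on the GPU

        printf("min = %d, max = %d, sum = %d\n",
               (int)v.front(), (int)v.back(), total);
        return 0;
    }

The STL-style interface hides all kernel launches and memory transfers, which is often the quickest route to a working GPU implementation.
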
  • Debugging tools and techniques
    • cuda-gdb
    • NVIDIA Nsight
    • cuda-memcheck
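
Alongside the tools above, a common debugging technique is to wrap every runtime call in an error-checking macro; a sketch follows (the macro name CUDA_CHECK is our own convention, not part of the CUDA toolkit):

    #include <cstdio>
    #include <cstdlib>

    // Wrapping every runtime call makes silent failures visible immediately,
    // which narrows down what cuda-gdb or cuda-memcheck then needs to inspect.
    #define CUDA_CHECK(call)                                                  \
        do {                                                                  \
            cudaError_t err_ = (call);                                        \
            if (err_ != cudaSuccess) {                                        \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                        cudaGetErrorString(err_), __FILE__, __LINE__);        \
                exit(EXIT_FAILURE);                                           \
            }                                                                 \
        } while (0)

    __global__ void touch(float *p) { p[threadIdx.x] = 1.0f; }

    int main(void)
    {
        float *d;
        CUDA_CHECK(cudaMalloc((void **)&d, 256 * sizeof(float)));
        touch<<<1, 256>>>(d);
        CUDA_CHECK(cudaGetLastError());         // catches bad launch configurations
        CUDA_CHECK(cudaDeviceSynchronize());    // surfaces errors raised inside the kernel
        CUDA_CHECK(cudaFree(d));
        return 0;
    }
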
  • High-Level Optimization Strategies
    • Timers
    • NVIDIA Visual Profiler
      • Guided Performance Analysis
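
A sketch of device-side timing with CUDA events, the usual lightweight complement to the Visual Profiler (the saxpy kernel is illustrative):

    #include <cstdio>

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 22;
        float *x, *y;
        cudaMalloc((void **)&x, n * sizeof(float));
        cudaMalloc((void **)&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        // Events are timestamped in the GPU's own command queue, so they
        // measure device work accurately, unlike host-side wall-clock timers.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);     // block until the stop event has occurred

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("saxpy on %d elements: %.3f ms\n", n, ms);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(x); cudaFree(y);
        return 0;
    }
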
  • Resource Management, Latency, and Occupancy
    • GPU SM Execution
    • GPU Latencies and How They Impact Performance
    • Occupancy and Occupancy Related Optimizations
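
A sketch of querying the occupancy calculator built into the CUDA runtime, one way to approach the occupancy optimizations above (the kernel is a placeholder; the two occupancy calls are standard runtime API functions):

    #include <cstdio>

    __global__ void kernel(float *p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;
    }

    int main(void)
    {
        // Ask the runtime which block size maximises occupancy for this kernel,
        // given its register and shared-memory footprint.
        int minGrid = 0, block = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGrid, &block, kernel, 0, 0);

        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, block, 0);

        printf("suggested block size: %d (min grid %d, %d resident blocks/SM)\n",
               block, minGrid, blocksPerSM);
        return 0;
    }
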
  • Arithmetic Optimizations
    • Instruction cost
    • Intrinsic functions
    • Branching efficiency
    • Instruction-Level Parallelism
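
A sketch contrasting a full-precision math routine with its hardware intrinsic, one of the arithmetic trade-offs covered in the module above (input values are illustrative; the accuracy gap depends on the input range):

    #include <cstdio>
    #include <math.h>

    __global__ void sin_accurate(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = sinf(in[i]);       // full-precision library routine
    }

    __global__ void sin_fast(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = __sinf(in[i]);     // SFU intrinsic: fewer instructions, reduced accuracy
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *in, *a, *b;
        cudaMallocManaged((void **)&in, n * sizeof(float));
        cudaMallocManaged((void **)&a, n * sizeof(float));
        cudaMallocManaged((void **)&b, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = (float)i / n;

        sin_accurate<<<(n + 255) / 256, 256>>>(in, a, n);
        sin_fast<<<(n + 255) / 256, 256>>>(in, b, n);
        cudaDeviceSynchronize();

        printf("sinf(%f) = %f, __sinf = %f\n", in[n - 1], a[n - 1], b[n - 1]);

        cudaFree(in); cudaFree(a); cudaFree(b);
        return 0;
    }
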
  • Memory Performance Optimizations
    • Review of logical memory spaces
    • Physical implementation of memory and optimal access patterns
      • Global Memory Access Patterns
      • Shared Memory Bank Conflicts
      • Constant Memory and Read-Only Cache
    • Memory usage strategies
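
Finally, a sketch of the classic tiled matrix transpose, which exercises both of the access-pattern topics above: global-memory coalescing and shared-memory bank-conflict padding (the naive kernel is included only for contrast; the tile size is illustrative):

    #include <cstdio>

    #define TILE 32

    // Naive transpose: reads are coalesced but writes stride through memory.
    __global__ void transpose_naive(const float *in, float *out, int n)
    {
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            out[x * n + y] = in[y * n + x];
    }

    // Tiled transpose: stage the tile in shared memory so both the global read
    // and the global write are coalesced; the +1 padding column shifts each row
    // into a different bank and removes shared-memory bank conflicts.
    __global__ void transpose_tiled(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

    int main(void)
    {
        const int n = 1024;
        float *in, *out;
        cudaMallocManaged((void **)&in, n * n * sizeof(float));
        cudaMallocManaged((void **)&out, n * n * sizeof(float));
        for (int i = 0; i < n * n; ++i) in[i] = (float)i;

        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        transpose_tiled<<<grid, block>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[1] = %.0f (expected %.0f)\n", out[1], in[n]);  // element (0,1) <- (1,0)

        cudaFree(in); cudaFree(out);
        return 0;
    }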