Complex calculations, like training deep learning models or running large-scale simulations, can take an extremely long time. Efficient parallel programming can save hours—or even days—of computing time. Parallel and High Performance Computing shows you how to deliver faster run-times, greater scalability, and increased energy efficiency to your programs by mastering parallel techniques for multicore processor and GPU hardware.
About the technology
Write fast, powerful, energy efficient programs that scale to tackle huge volumes of data. Using parallel programming, your code spreads data processing tasks across multiple CPUs for radically better performance. With a little help, you can create software that maximizes both speed and efficiency.
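To make that concrete, here is a minimal sketch (our illustration, not code from the book) of the loop-level parallelism described above, written in C with OpenMP, one of the tools the book covers. It assumes a compiler with OpenMP support, such as GCC or Clang with the -fopenmp flag:

```c
#include <stdio.h>
#include <omp.h>   // OpenMP runtime API; compile with -fopenmp

#define N 1000000

int main(void) {
    // Static arrays keep the example self-contained.
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    // The pragma splits the loop iterations across all available CPU cores.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
    }

    printf("a[%d] = %g, computed with up to %d threads\n",
           N - 1, a[N - 1], omp_get_max_threads());
    return 0;
}
```

The single pragma is the entire parallelization: the compiler and the OpenMP runtime divide the loop's iterations among the available cores.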
About the book
Parallel and High Performance Computing offers techniques guaranteed to boost your code’s effectiveness. You’ll learn to evaluate hardware architectures and work with industry standard tools such as OpenMP and MPI. You’ll master the data structures and algorithms best suited for high performance computing and learn techniques that save energy on handheld devices. You’ll even run a massive tsunami simulation across a bank of GPUs.
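For a taste of the second of those industry-standard tools, here is a minimal MPI sketch in C (our illustration, patterned on the "minimum working example of an MPI program" topic in chapter 8; the file name hello_mpi.c is hypothetical). It assumes an MPI installation such as OpenMPI, which provides the mpicc compiler wrapper and the mpirun launcher:

```c
#include <stdio.h>
#include <mpi.h>   // MPI library header; build with the mpicc wrapper

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);                  // start the MPI runtime

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // total number of processes

    printf("Hello from rank %d of %d\n", rank, nprocs);

    MPI_Finalize();                          // shut down the MPI runtime
    return 0;
}
```

Built with `mpicc hello_mpi.c -o hello_mpi` and launched with `mpirun -n 4 ./hello_mpi`, it starts four cooperating processes, each of which reports its rank.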
What's inside
- Planning a new parallel project
- Understanding differences in CPU and GPU architecture
- Addressing underperforming kernels and loops
- Managing applications with batch scheduling
About the reader
For experienced programmers proficient with a high-performance computing language like C, C++, or Fortran.
About the author
Robert Robey works at Los Alamos National Laboratory and has been active in the field of parallel computing for over 30 years. Yuliana Zamora is currently a PhD student and Siebel Scholar at the University of Chicago, and has lectured on programming modern hardware at numerous national conferences.
Table of contents

Copyright
Dedication
contents
front matter
  foreword (Yulie Zamora, University of Chicago, Illinois)
  How we came to write this book
  acknowledgments
  about this book
    Who should read this book

Part 1 Introduction to parallel computing

1 Why parallel computing?
  1.1 Why should you learn about parallel computing?
    1.1.1 What are the potential benefits of parallel computing?
    1.1.2 Parallel computing cautions
  1.2 The fundamental laws of parallel computing
    1.2.1 The limit to parallel computing: Amdahl’s Law
    1.2.2 Breaking through the parallel limit: Gustafson-Barsis’s Law
  1.3 How does parallel computing work?
    1.3.1 Walking through a sample application
    1.3.2 A hardware model for today’s heterogeneous parallel systems
    1.3.3 The application/software model for today’s heterogeneous parallel systems
  1.4 Categorizing parallel approaches
  1.5 Parallel strategies
  1.6 Parallel speedup versus comparative speedups: Two different measures
  1.7 What will you learn in this book?
    1.7.1 Additional reading
    1.7.2 Exercises
  Summary

2 Planning for parallelization
  2.1 Approaching a new project: The preparation
    2.1.1 Version control: Creating a safety vault for your parallel code
    2.1.2 Test suites: The first step to creating a robust, reliable application
    2.1.3 Finding and fixing memory issues
    2.1.4 Improving code portability
  2.2 Profiling: Probing the gap between system capabilities and application performance
  2.3 Planning: A foundation for success
    2.3.1 Exploring with benchmarks and mini-apps
    2.3.2 Design of the core data structures and code modularity
    2.3.3 Algorithms: Redesign for parallel
  2.4 Implementation: Where it all happens
  2.5 Commit: Wrapping it up with quality
  2.6 Further explorations
    2.6.1 Additional reading
    2.6.2 Exercises
  Summary

3 Performance limits and profiling
  3.1 Know your application’s potential performance limits
  3.2 Determine your hardware capabilities: Benchmarking
    3.2.1 Tools for gathering system characteristics
    3.2.2 Calculating theoretical maximum flops
    3.2.3 The memory hierarchy and theoretical memory bandwidth
    3.2.4 Empirical measurement of bandwidth and flops
    3.2.5 Calculating the machine balance between flops and bandwidth
  3.3 Characterizing your application: Profiling
    3.3.1 Profiling tools
    3.3.2 Empirical measurement of processor clock frequency and energy consumption
    3.3.3 Tracking memory during run time
  3.4 Further explorations
    3.4.1 Additional reading
    3.4.2 Exercises
  Summary

4 Data design and performance models
  4.1 Performance data structures: Data-oriented design
    4.1.1 Multidimensional arrays
    4.1.2 Array of Structures (AoS) versus Structures of Arrays (SoA)
    4.1.3 Array of Structures of Arrays (AoSoA)
  4.2 Three Cs of cache misses: Compulsory, capacity, conflict
  4.3 Simple performance models: A case study
    4.3.1 Full matrix data representations
    4.3.2 Compressed sparse storage representations
  4.4 Advanced performance models
  4.5 Network messages
  4.6 Further explorations
    4.6.1 Additional reading
    4.6.2 Exercises
  Summary

5 Parallel algorithms and patterns
  5.1 Algorithm analysis for parallel computing applications
  5.2 Performance models versus algorithmic complexity
  5.3 Parallel algorithms: What are they?
  5.4 What is a hash function?
  5.5 Spatial hashing: A highly-parallel algorithm
    5.5.1 Using perfect hashing for spatial mesh operations
    5.5.2 Using compact hashing for spatial mesh operations
  5.6 Prefix sum (scan) pattern and its importance in parallel computing
    5.6.1 Step-efficient parallel scan operation
    5.6.2 Work-efficient parallel scan operation
    5.6.3 Parallel scan operations for large arrays
  5.7 Parallel global sum: Addressing the problem of associativity
  5.8 Future of parallel algorithm research
  5.9 Further explorations
    5.9.1 Additional reading
    5.9.2 Exercises
  Summary

Part 2 CPU: The parallel workhorse

6 Vectorization: FLOPs for free
  6.1 Vectorization and single instruction, multiple data (SIMD) overview
  6.2 Hardware trends for vectorization
  6.3 Vectorization methods
    6.3.1 Optimized libraries provide performance for little effort
    6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)
    6.3.3 Teaching the compiler through hints: Pragmas and directives
    6.3.4 Crappy loops, we got them: Use vector intrinsics
    6.3.5 Not for the faint of heart: Using assembler code for vectorization
  6.4 Programming style for better vectorization
  6.5 Compiler flags relevant for vectorization for various compilers
  6.6 OpenMP SIMD directives for better portability
  6.7 Further explorations
    6.7.1 Additional reading
    6.7.2 Exercises
  Summary

7 OpenMP that performs
  7.1 OpenMP introduction
    7.1.1 OpenMP concepts
    7.1.2 A simple OpenMP program
  7.2 Typical OpenMP use cases: Loop-level, high-level, and MPI plus OpenMP
    7.2.1 Loop-level OpenMP for quick parallelization
    7.2.2 High-level OpenMP for better parallel performance
    7.2.3 MPI plus OpenMP for extreme scalability
  7.3 Examples of standard loop-level OpenMP
    7.3.1 Loop level OpenMP: Vector addition example
    7.3.2 Stream triad example
    7.3.3 Loop level OpenMP: Stencil example
    7.3.4 Performance of loop-level examples
    7.3.5 Reduction example of a global sum using OpenMP threading
    7.3.6 Potential loop-level OpenMP issues
  7.4 Variable scope importance for correctness in OpenMP
  7.5 Function-level OpenMP: Making a whole function thread parallel
  7.6 Improving parallel scalability with high-level OpenMP
    7.6.1 How to implement high-level OpenMP
    7.6.2 Example of implementing high-level OpenMP
  7.7 Hybrid threading and vectorization with OpenMP
  7.8 Advanced examples using OpenMP
    7.8.1 Stencil example with a separate pass for the x and y directions
    7.8.2 Kahan summation implementation with OpenMP threading
    7.8.3 Threaded implementation of the prefix scan algorithm
  7.9 Threading tools essential for robust implementations
    7.9.1 Using Allinea/ARM MAP to get a quick high-level profile of your application
    7.9.2 Finding your thread race conditions with Intel® Inspector
  7.10 Example of a task-based support algorithm
  7.11 Further explorations
    7.11.1 Additional reading
    7.11.2 Exercises
  Summary

8 MPI: The parallel backbone
  8.1 The basics for an MPI program
    8.1.1 Basic MPI function calls for every MPI program
    8.1.2 Compiler wrappers for simpler MPI programs
    8.1.3 Using parallel startup commands
    8.1.4 Minimum working example of an MPI program
  8.2 The send and receive commands for process-to-process communication
  8.3 Collective communication: A powerful component of MPI
    8.3.1 Using a barrier to synchronize timers
    8.3.2 Using the broadcast to handle small file input
    8.3.3 Using a reduction to get a single value from across all processes
    8.3.4 Using gather to put order in debug printouts
    8.3.5 Using scatter and gather to send data out to processes for work
  8.4 Data parallel examples
    8.4.1 Stream triad to measure bandwidth on the node
    8.4.2 Ghost cell exchanges in a two-dimensional (2D) mesh
    8.4.3 Ghost cell exchanges in a three-dimensional (3D) stencil calculation
  8.5 Advanced MPI functionality to simplify code and enable optimizations
    8.5.1 Using custom MPI data types for performance and code simplification
    8.5.2 Cartesian topology support in MPI
    8.5.3 Performance tests of ghost cell exchange variants
  8.6 Hybrid MPI plus OpenMP for extreme scalability
    8.6.1 The benefits of hybrid MPI plus OpenMP
    8.6.2 MPI plus OpenMP example
  8.7 Further explorations
    8.7.1 Additional reading
    8.7.2 Exercises
  Summary

Part 3 GPUs: Built to accelerate

9 GPU architectures and concepts
  9.1 The CPU-GPU system as an accelerated computational platform
    9.1.1 Integrated GPUs: An underused option on commodity-based systems
    9.1.2 Dedicated GPUs: The workhorse option
  9.2 The GPU and the thread engine
    9.2.1 The compute unit is the streaming multiprocessor (or subslice)
    9.2.2 Processing elements are the individual processors
    9.2.3 Multiple data operations by each processing element
    9.2.4 Calculating the peak theoretical flops for some leading GPUs
  9.3 Characteristics of GPU memory spaces
    9.3.1 Calculating theoretical peak memory bandwidth
    9.3.2 Measuring the GPU stream benchmark
    9.3.3 Roofline performance model for GPUs
    9.3.4 Using the mixbench performance tool to choose the best GPU for a workload
  9.4 The PCI bus: CPU to GPU data transfer overhead
    9.4.1 Theoretical bandwidth of the PCI bus
    9.4.2 A benchmark application for PCI bandwidth
  9.5 Multi-GPU platforms and MPI
    9.5.1 Optimizing the data movement between GPUs across the network
    9.5.2 A higher performance alternative to the PCI bus
  9.6 Potential benefits of GPU-accelerated platforms
    9.6.1 Reducing time-to-solution
    9.6.2 Reducing energy use with GPUs
    9.6.3 Reduction in cloud computing costs with GPUs
  9.7 When to use GPUs
  9.8 Further explorations
    9.8.1 Additional reading
    9.8.2 Exercises
  Summary

10 GPU programming model
  10.1 GPU programming abstractions: A common framework
    10.1.1 Massive parallelism
    10.1.2 Inability to coordinate among tasks
    10.1.3 Terminology for GPU parallelism
    10.1.4 Data decomposition into independent units of work: An NDRange or grid
    10.1.5 Work groups provide a right-sized chunk of work
    10.1.6 Subgroups, warps, or wavefronts execute in lockstep
    10.1.7 Work item: The basic unit of operation
    10.1.8 SIMD or vector hardware
  10.2 The code structure for the GPU programming model
    10.2.1 “Me” programming: The concept of a parallel kernel
    10.2.2 Thread indices: Mapping the local tile to the global world
    10.2.3 Index sets
    10.2.4 How to address memory resources in your GPU programming model
  10.3 Optimizing GPU resource usage
    10.3.1 How many registers does my kernel use?
    10.3.2 Occupancy: Making more work available for work group scheduling
  10.4 Reduction pattern requires synchronization across work groups
  10.5 Asynchronous computing through queues (streams)
  10.6 Developing a plan to parallelize an application for GPUs
    10.6.1 Case 1: 3D atmospheric simulation
    10.6.2 Case 2: Unstructured mesh application
  10.7 Further explorations
    10.7.1 Additional reading
    10.7.2 Exercises
  Summary

11 Directive-based GPU programming
  11.1 Process to apply directives and pragmas for a GPU implementation
  11.2 OpenACC: The easiest way to run on your GPU
    11.2.1 Compiling OpenACC code
    11.2.2 Parallel compute regions in OpenACC for accelerating computations
    11.2.3 Using directives to reduce data movement between the CPU and the GPU
    11.2.4 Optimizing the GPU kernels
    11.2.5 Summary of performance results for the stream triad
    11.2.6 Advanced OpenACC techniques
  11.3 OpenMP: The heavyweight champ enters the world of accelerators
    11.3.1 Compiling OpenMP code
    11.3.2 Generating parallel work on the GPU with OpenMP
    11.3.3 Creating data regions to control data movement to the GPU with OpenMP
    11.3.4 Optimizing OpenMP for GPUs
    11.3.5 Advanced OpenMP for GPUs
  11.4 Further explorations
    11.4.1 Additional reading
    11.4.2 Exercises
  Summary

12 GPU languages: Getting down to basics
    Figure 12.1 The interoperability map for the GPU languages
  12.1 Features of a native GPU programming language
  12.2 CUDA and HIP GPU languages: The low-level performance option
    12.2.1 Writing and building your first CUDA application
    12.2.2 A reduction kernel in CUDA: Life gets complicated
    Figure 12.2 Pair-wise reduction tree for a warp that sums up values in log n steps
    12.2.3 Hipifying the CUDA code
  12.3 OpenCL for a portable open source GPU language
    12.3.1 Writing and building your first OpenCL application
    12.3.2 Reductions in OpenCL
    Figure 12.3 Comparison of OpenCL and CUDA reduction kernels: sum_within_block
    Figure 12.4 Comparison for the first of two kernel passes for the OpenCL and CUDA reduction kernels
    Figure 12.5 Comparison of the second pass for the reduction sum
  12.4 SYCL: An experimental C++ implementation goes mainstream
  12.5 Higher-level languages for performance portability
    12.5.1 Kokkos: A performance portability ecosystem
    12.5.2 RAJA for a more adaptable performance portability layer
  12.6 Further explorations
    12.6.1 Additional reading
    12.6.2 Exercises
  Summary

13 GPU profiling and tools
  13.1 An overview of profiling tools
  13.2 How to select a good workflow
  13.3 Example problem: Shallow water simulation
  13.4 A sample of a profiling workflow
    13.4.1 Run the shallow water application
    13.4.2 Profile the CPU code to develop a plan of action
    13.4.3 Add OpenACC compute directives to begin the implementation step
    13.4.4 Add data movement directives
    13.4.5 Guided analysis can give you some suggested improvements
    13.4.6 The NVIDIA Nsight suite of tools can be a powerful development aid
    13.4.7 CodeXL for the AMD GPU ecosystem
  13.5 Don’t get lost in the swamp: Focus on the important metrics
    13.5.1 Occupancy: Is there enough work?
    13.5.2 Issue efficiency: Are your warps on break too often?
    13.5.3 Achieved bandwidth: It always comes down to bandwidth
  13.6 Containers and virtual machines provide alternate workflows
    13.6.1 Docker containers as a workaround
    13.6.2 Virtual machines using VirtualBox
  13.7 Cloud options: A flexible and portable capability
  13.8 Further explorations
    13.8.1 Additional reading
    13.8.2 Exercises
  Summary

Part 4 High performance computing ecosystems

14 Affinity: Truce with the kernel
  14.1 Why is affinity important?
  14.2 Discovering your architecture
  14.3 Thread affinity with OpenMP
  14.4 Process affinity with MPI
    14.4.1 Default process placement with OpenMPI
    14.4.2 Taking control: Basic techniques for specifying process placement in OpenMPI
    14.4.3 Affinity is more than just process binding: The full picture
  14.5 Affinity for MPI plus OpenMP
  14.6 Controlling affinity from the command line
    14.6.1 Using hwloc-bind to assign affinity
    14.6.2 Using likwid-pin: An affinity tool in the likwid tool suite
  14.7 The future: Setting and changing affinity at run time
    14.7.1 Setting affinities in your executable
    14.7.2 Changing your process affinities during run time
  14.8 Further explorations
    14.8.1 Additional reading
    14.8.2 Exercises
  Summary

15 Batch schedulers: Bringing order to chaos
  15.1 The chaos of an unmanaged system
  15.2 How not to be a nuisance when working on a busy cluster
    15.2.1 Layout of a batch system for busy clusters
    15.2.2 How to be courteous on busy clusters and HPC sites: Common HPC pet peeves
  15.3 Submitting your first batch script
  15.4 Automatic restarts for long-running jobs
  15.5 Specifying dependencies in batch scripts
  15.6 Further explorations
    15.6.1 Additional reading
    15.6.2 Exercises
  Summary

16 File operations for a parallel world
  16.1 The components of a high-performance filesystem
  16.2 Standard file operations: A parallel-to-serial interface
  16.3 MPI file operations (MPI-IO) for a more parallel world
  16.4 HDF5 is self-describing for better data management
  16.5 Other parallel file software packages
  16.6 Parallel filesystem: The hardware interface
    16.6.1 Everything you wanted to know about your parallel file setup but didn’t know how to ask
    16.6.2 General hints that apply to all filesystems
    16.6.3 Hints specific to particular filesystems
  16.7 Further explorations
    16.7.1 Additional reading
    16.7.2 Exercises
  Summary

17 Tools and resources for better code
  17.1 Version control systems: It all begins here
    17.1.1 Distributed version control fits the more mobile world
    17.1.2 Centralized version control for simplicity and code security
  17.2 Timer routines for tracking code performance
  17.3 Profilers: You can’t improve what you don’t measure
    17.3.1 Simple text-based profilers for everyday use
    17.3.2 High-level profilers for quickly identifying bottlenecks
    17.3.3 Medium-level profilers to guide your application development
    17.3.4 Detailed profilers give the gory details of hardware performance
  17.4 Benchmarks and mini-apps: A window into system performance
    17.4.1 Benchmarks measure system performance characteristics
    17.4.2 Mini-apps give the application perspective
  17.5 Detecting (and fixing) memory errors for a robust application
    17.5.1 Valgrind Memcheck: The open source standby
    17.5.2 Dr. Memory for your memory ailments
    17.5.3 Commercial memory tools for demanding applications
    17.5.4 Compiler-based memory tools for convenience
    17.5.5 Fence-post checkers detect out-of-bounds memory accesses
    17.5.6 GPU memory tools for robust GPU applications
  17.6 Thread checkers for detecting race conditions
    17.6.1 Intel® Inspector: A race condition detection tool with a GUI
    17.6.2 Archer: A text-based tool for detecting race conditions
  17.7 Bug-busters: Debuggers to exterminate those bugs
    17.7.1 TotalView debugger is widely available at HPC sites
    17.7.2 DDT is another debugger widely available at HPC sites
    17.7.3 Linux debuggers: Free alternatives for your local development needs
    17.7.4 GPU debuggers can help crush those GPU bugs
  17.8 Profiling those file operations
  17.9 Package managers: Your personal system administrator
    17.9.1 Package managers for macOS
    17.9.2 Package managers for Windows
    17.9.3 The Spack package manager: A package manager for high performance computing
  17.10 Modules: Loading specialized toolchains
    17.10.1 TCL modules: The original modules system for loading software toolchains
    17.10.2 Lmod: A Lua-based alternative Modules implementation
  17.11 Reflections and exercises
  Summary

appendix A References
appendix B Solutions to exercises
appendix C Glossary
index