
An Introduction to Parallel Programming, 2nd Edition
- Length: 496 pages
- Edition: 2
- Language: English
- Publisher: Morgan Kaufmann
- Publication Date: 2020-03-15
- ISBN-10: 0128046058
- ISBN-13: 9780128046050
- Sales Rank: #787001
An Introduction to Parallel Programming, Second Edition presents a tried-and-true tutorial approach that shows students how to develop effective parallel programs with MPI, Pthreads, and OpenMP.

As the first undergraduate text to directly address compiling and running parallel programs on multi-core and cluster architectures, this second edition carries forward its clear explanations for designing, debugging, and evaluating the performance of distributed- and shared-memory programs, while adding coverage of accelerators via new content on GPU programming and heterogeneous programming. New and improved user-friendly exercises teach students how to compile, run, and modify example programs.
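To give a sense of the style of program the text starts from, here is a minimal MPI "hello, world" in C. This is an illustrative sketch only, not code taken from the book; it assumes an MPI installation that provides the standard mpicc compiler wrapper and mpiexec launcher.

```c
/* Illustrative sketch only -- not taken from the book.
 * Compile:  mpicc -o mpi_hello mpi_hello.c
 * Run:      mpiexec -n 4 ./mpi_hello
 */
#include <stdio.h>
#include <mpi.h>

int main(void) {
    int comm_sz;   /* number of processes in MPI_COMM_WORLD */
    int my_rank;   /* rank of this process */

    MPI_Init(NULL, NULL);                      /* start up MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    printf("Hello from process %d of %d\n", my_rank, comm_sz);

    MPI_Finalize();                            /* shut down MPI */
    return 0;
}
```

Each of the processes started by mpiexec runs the same executable and prints its own rank, which is the SPMD pattern the MPI chapter builds on.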
Table of Contents

Front Matter
Copyright
Dedication
Contents
Preface

1 Why parallel computing
  1.1 Why we need ever-increasing performance
  1.2 Why we're building parallel systems
  1.3 Why we need to write parallel programs
  1.4 How do we write parallel programs?
  1.5 What we'll be doing
  1.6 Concurrent, parallel, distributed
  1.7 The rest of the book
  1.8 A word of warning
  1.9 Typographical conventions
  1.10 Summary
  1.11 Exercises

2 Parallel hardware and parallel software
  2.1 Some background
    2.1.1 The von Neumann architecture
    2.1.2 Processes, multitasking, and threads
  2.2 Modifications to the von Neumann model
    2.2.1 The basics of caching
    2.2.2 Cache mappings
    2.2.3 Caches and programs: an example
    2.2.4 Virtual memory
    2.2.5 Instruction-level parallelism (Pipelining; Multiple issue)
    2.2.6 Hardware multithreading
  2.3 Parallel hardware
    2.3.1 Classifications of parallel computers
    2.3.2 SIMD systems (Vector processors; Graphics processing units)
    2.3.3 MIMD systems (Shared-memory systems; Distributed-memory systems)
    2.3.4 Interconnection networks (Shared-memory interconnects; Distributed-memory interconnects; Latency and bandwidth)
    2.3.5 Cache coherence (Snooping cache coherence; Directory-based cache coherence; False sharing)
    2.3.6 Shared-memory vs. distributed-memory
  2.4 Parallel software
    2.4.1 Caveats
    2.4.2 Coordinating the processes/threads
    2.4.3 Shared-memory (Dynamic and static threads; Nondeterminism; Thread safety)
    2.4.4 Distributed-memory (Message-passing; One-sided communication; Partitioned global address space languages)
    2.4.5 GPU programming
    2.4.6 Programming hybrid systems
  2.5 Input and output
    2.5.1 MIMD systems
    2.5.2 GPUs
  2.6 Performance
    2.6.1 Speedup and efficiency in MIMD systems
    2.6.2 Amdahl's law
    2.6.3 Scalability in MIMD systems
    2.6.4 Taking timings of MIMD programs
    2.6.5 GPU performance
  2.7 Parallel program design
    2.7.1 An example (A serial program; Parallelizing the serial program)
  2.8 Writing and running parallel programs
  2.9 Assumptions
  2.10 Summary
    2.10.1 Serial systems
    2.10.2 Parallel hardware
    2.10.3 Parallel software
    2.10.4 Input and output
    2.10.5 Performance
    2.10.6 Parallel program design
    2.10.7 Assumptions
  2.11 Exercises

3 Distributed memory programming with MPI
  3.1 Getting started
    3.1.1 Compilation and execution
    3.1.2 MPI programs
    3.1.3 MPI_Init and MPI_Finalize
    3.1.4 Communicators, MPI_Comm_size, and MPI_Comm_rank
    3.1.5 SPMD programs
    3.1.6 Communication
    3.1.7 MPI_Send
    3.1.8 MPI_Recv
    3.1.9 Message matching
    3.1.10 The status_p argument
    3.1.11 Semantics of MPI_Send and MPI_Recv
    3.1.12 Some potential pitfalls
  3.2 The trapezoidal rule in MPI
    3.2.1 The trapezoidal rule
    3.2.2 Parallelizing the trapezoidal rule
  3.3 Dealing with I/O
    3.3.1 Output
    3.3.2 Input
  3.4 Collective communication
    3.4.1 Tree-structured communication
    3.4.2 MPI_Reduce
    3.4.3 Collective vs. point-to-point communications
    3.4.4 MPI_Allreduce
    3.4.5 Broadcast
    3.4.6 Data distributions
    3.4.7 Scatter
    3.4.8 Gather
    3.4.9 Allgather
  3.5 MPI-derived datatypes
  3.6 Performance evaluation of MPI programs
    3.6.1 Taking timings
    3.6.2 Results
    3.6.3 Speedup and efficiency
    3.6.4 Scalability
  3.7 A parallel sorting algorithm
    3.7.1 Some simple serial sorting algorithms
    3.7.2 Parallel odd-even transposition sort
    3.7.3 Safety in MPI programs
    3.7.4 Final details of parallel odd-even sort
  3.8 Summary
  3.9 Exercises
  3.10 Programming assignments

4 Shared-memory programming with Pthreads
  4.1 Processes, threads, and Pthreads
  4.2 Hello, world
    4.2.1 Execution
    4.2.2 Preliminaries
    4.2.3 Starting the threads
    4.2.4 Running the threads
    4.2.5 Stopping the threads
    4.2.6 Error checking
    4.2.7 Other approaches to thread startup
  4.3 Matrix-vector multiplication
  4.4 Critical sections
  4.5 Busy-waiting
  4.6 Mutexes
  4.7 Producer–consumer synchronization and semaphores
  4.8 Barriers and condition variables
    4.8.1 Busy-waiting and a mutex
    4.8.2 Semaphores
    4.8.3 Condition variables
    4.8.4 Pthreads barriers
  4.9 Read-write locks
    4.9.1 Sorted linked list functions
    4.9.2 A multithreaded linked list
    4.9.3 Pthreads read-write locks
    4.9.4 Performance of the various implementations
    4.9.5 Implementing read-write locks
  4.10 Caches, cache-coherence, and false sharing
  4.11 Thread-safety
    4.11.1 Incorrect programs can produce correct output
  4.12 Summary
  4.13 Exercises
  4.14 Programming assignments

5 Shared-memory programming with OpenMP
  5.1 Getting started
    5.1.1 Compiling and running OpenMP programs
    5.1.2 The program
    5.1.3 Error checking
  5.2 The trapezoidal rule
    5.2.1 A first OpenMP version
  5.3 Scope of variables
  5.4 The reduction clause
  5.5 The parallel for directive
    5.5.1 Caveats
    5.5.2 Data dependences
    5.5.3 Finding loop-carried dependences
    5.5.4 Estimating π
    5.5.5 More on scope
  5.6 More about loops in OpenMP: sorting
    5.6.1 Bubble sort
    5.6.2 Odd-even transposition sort
  5.7 Scheduling loops
    5.7.1 The schedule clause
    5.7.2 The static schedule type
    5.7.3 The dynamic and guided schedule types
    5.7.4 The runtime schedule type
    5.7.5 Which schedule?
  5.8 Producers and consumers
    5.8.1 Queues
    5.8.2 Message-passing
    5.8.3 Sending messages
    5.8.4 Receiving messages
    5.8.5 Termination detection
    5.8.6 Startup
    5.8.7 The atomic directive
    5.8.8 Critical sections and locks
    5.8.9 Using locks in the message-passing program
    5.8.10 Critical directives, atomic directives, or locks?
    5.8.11 Some caveats
  5.9 Caches, cache coherence, and false sharing
  5.10 Tasking
  5.11 Thread-safety
    5.11.1 Incorrect programs can produce correct output
  5.12 Summary
  5.13 Exercises
  5.14 Programming assignments

6 GPU programming with CUDA
  6.1 GPUs and GPGPU
  6.2 GPU architectures
  6.3 Heterogeneous computing
  6.4 CUDA hello
    6.4.1 The source code
    6.4.2 Compiling and running the program
  6.5 A closer look
  6.6 Threads, blocks, and grids
  6.7 Nvidia compute capabilities and device architectures
  6.8 Vector addition
    6.8.1 The kernel
    6.8.2 Get_args
    6.8.3 Allocate_vectors and managed memory
    6.8.4 Other functions called from main
    6.8.5 Explicit memory transfers
  6.9 Returning results from CUDA kernels
  6.10 CUDA trapezoidal rule I
    6.10.1 The trapezoidal rule
    6.10.2 A CUDA implementation
    6.10.3 Initialization, return value, and final update
    6.10.4 Using the correct threads
    6.10.5 Updating the return value and atomicAdd
    6.10.6 Performance of the CUDA trapezoidal rule
  6.11 CUDA trapezoidal rule II: improving performance
    6.11.1 Tree-structured communication
    6.11.2 Local variables, registers, shared and global memory
    6.11.3 Warps and warp shuffles
    6.11.4 Implementing tree-structured global sum with a warp shuffle
    6.11.5 Shared memory and an alternative to the warp shuffle
  6.12 Implementation of trapezoidal rule with warpSize thread blocks
    6.12.1 Host code
    6.12.2 Kernel with warp shuffle
    6.12.3 Kernel with shared memory
    6.12.4 Performance
  6.13 CUDA trapezoidal rule III: blocks with more than one warp
    6.13.1 __syncthreads
    6.13.2 More shared memory
    6.13.3 Shared memory warp sums
    6.13.4 Shared memory banks
    6.13.5 Finishing up
    6.13.6 Performance
  6.14 Bitonic sort
    6.14.1 Serial bitonic sort
    6.14.2 Butterflies and binary representations
    6.14.3 Parallel bitonic sort I
    6.14.4 Parallel bitonic sort II
    6.14.5 Performance of CUDA bitonic sort
  6.15 Summary
  6.16 Exercises
  6.17 Programming assignments

7 Parallel program development
  7.1 Two n-body solvers
    7.1.1 The problem
    7.1.2 Two serial programs
    7.1.3 Parallelizing the n-body solvers
    7.1.4 A word about I/O
    7.1.5 Parallelizing the basic solver using OpenMP
    7.1.6 Parallelizing the reduced solver using OpenMP
    7.1.7 Evaluating the OpenMP codes
    7.1.8 Parallelizing the solvers using Pthreads
    7.1.9 Parallelizing the basic solver using MPI
    7.1.10 Parallelizing the reduced solver using MPI
    7.1.11 Performance of the MPI solvers
    7.1.12 Parallelizing the basic solver using CUDA
    7.1.13 A note on cooperative groups in CUDA
    7.1.14 Performance of the basic CUDA n-body solver
    7.1.15 Improving the performance of the CUDA n-body solver
    7.1.16 Using shared memory in the n-body solver
  7.2 Sample sort
    7.2.1 Sample sort and bucket sort
    7.2.2 Choosing the sample
    7.2.3 A simple implementation of the Map function
    7.2.4 An alternative implementation of Map
    7.2.5 Parallelizing sample sort
    7.2.6 Implementing sample sort with OpenMP (First implementation; Second implementation)
    7.2.7 Implementing sample sort with Pthreads
    7.2.8 Implementing sample sort with MPI (First implementation; Second implementation)
    7.2.9 Implementing sample sort with CUDA (First implementation; Prefix sums; Performance; A second CUDA implementation of sample sort; Performance of the second CUDA sample sort)
  7.3 A word of caution
  7.4 Which API?
  7.5 Summary
    7.5.1 MPI
  7.6 Exercises
  7.7 Programming assignments

8 Where to go from here

Bibliography
Index
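The trapezoidal rule recurs as a running example in the MPI, OpenMP, and CUDA chapters. As a rough illustration of that style of program (a sketch of my own under those assumptions, not code from the book), an OpenMP version can parallelize the summation loop with a parallel for directive and a reduction clause:

```c
/* Illustrative sketch only -- not from the book.
 * Trapezoidal rule estimate of the integral of f over [a, b]
 * using n trapezoids, parallelized with OpenMP.
 * Compile:  gcc -fopenmp -o omp_trap omp_trap.c
 */
#include <stdio.h>

double f(double x) { return x * x; }   /* example integrand */

int main(void) {
    const double a = 0.0, b = 1.0;     /* integration limits */
    const int n = 1 << 20;             /* number of trapezoids */
    const double h = (b - a) / n;      /* width of each trapezoid */
    double approx = (f(a) + f(b)) / 2.0;

    /* Each thread accumulates a private partial sum; the
       reduction clause combines the partial sums into approx. */
#   pragma omp parallel for reduction(+:approx)
    for (int i = 1; i < n; i++)
        approx += f(a + i * h);

    approx *= h;
    printf("With n = %d trapezoids, estimate = %.14f\n", n, approx);
    return 0;
}
```

The MPI and CUDA chapters then revisit the same computation with explicit message passing and with GPU kernels, respectively.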
To download the book:
1. Disable the AdBlock plugin; otherwise, you may not get any links.
2. Solve the CAPTCHA.
3. Click the download link.
4. Follow the link to the download server to download the file.