Massive Graph Analytics
- Length: 544 pages
- Edition: 1
- Language: English
- Publisher: Chapman and Hall/CRC
- Publication Date: 2022-03-01
- ISBN-10: 0367464128
- ISBN-13: 9780367464127
- Sales Rank: #7686644
Expertise in massive-scale graph analytics is key to solving real-world grand challenges, from health and sustainability to insider-threat detection, cyber defense, and more. Massive Graph Analytics provides a comprehensive introduction to the field, featuring contributions from thought leaders across academia, industry, and government.
The book will benefit students, researchers, and practitioners in academia, national laboratories, and industry who wish to learn about state-of-the-art algorithms, models, frameworks, and software for massive-scale graph analytics.
Table of Contents

Front matter: Cover; Half Title; Series Page; Title Page; Copyright Page; Table of Contents; Editor; Contributors; Introduction

SECTION I: Algorithms: Search and Paths

Chapter 1 A Work-Efficient Parallel Breadth-First Search Algorithm (or How To Cope With the Nondeterminism of Reducers)
  1.1 Introduction; 1.1.1 Outline; 1.2 Background on Dynamic Multithreading; 1.2.1 Multithreaded Pseudocode; 1.2.2 Reducers; 1.3 The PBFS Algorithm; 1.4 The Bag Data Structure; 1.4.1 Pennants; 1.4.2 Bags; 1.4.3 Optimization; 1.5 Experimental Results; 1.5.1 Implementation and Testing; 1.5.2 Results; 1.6 Background on the Dag Model; 1.6.1 The Dag Model; 1.6.2 Determinacy; 1.6.3 Work and Span; 1.6.4 Scheduling; 1.7 Modeling Reducers; 1.8 Analysis of Programs with Nonconstant-Time Reducers; 1.8.1 Scheduling with Reducers; 1.8.2 Analyzing a Performance Dag; 1.8.3 Relating the Performance and User Dags; 1.9 Analyzing PBFS; 1.10 Conclusion; Acknowledgments; References

Chapter 2 Multi-Objective Shortest Paths
  2.1 Introduction; 2.2 Preliminaries; 2.3 More Related Work; 2.4 The Basic Algorithm; 2.4.1 Example; 2.4.2 Algorithm Details; 2.5 A Bicriteria Instantiation – Theory; 2.5.1 Node Local Operations; 2.5.2 The Pareto Queue Q; 2.5.3 Additional Improvements; 2.6 A Practical Bicriteria Implementation; 2.6.1 A Pareto Queue Based on Weight-Balanced B-trees; 2.6.2 Further Practical Improvements; 2.7 Experimental Evaluation; 2.7.1 Pareto Queue Evaluation; 2.7.2 Parallel Bi-Objective Search; 2.7.3 Evaluation of Parallel Bi-Objective Search; 2.8 Single Target Search; 2.8.1 Number of Steps of Goal Directed Search; 2.9 More Objectives; 2.10 Conclusion; Acknowledgments; References

SECTION II: Algorithms: Structure

Chapter 3 Multicore Algorithms for Graph Connectivity Problems
  3.1 Introduction; 3.1.1 Contributions; 3.2 Background; 3.2.1 Strongly Connected Components; 3.2.1.1 Serial Algorithms; 3.2.1.2 Forward-Backward; 3.2.1.3 Coloring; 3.2.1.4 Other Parallel SCC Approaches; 3.2.2 Connected and Weakly Connected Components; 3.2.3 Biconnected Components; 3.3 Multistep Method for SCC Decomposition; 3.3.1 Observations; 3.3.2 Description of Method; 3.3.3 Work Complexity; 3.4 Applying the Multistep Method; 3.4.1 Trim Step; 3.4.2 Breadth-First Search; 3.4.3 Coloring; 3.4.4 Serial Step; 3.4.5 Connected Components and Weakly Connected Components; 3.4.6 Biconnected Components; 3.5 Experimental Setup; 3.6 Experimental Results; 3.6.1 Strongly Connected Component Decomposition; 3.6.2 Connected Components Decomposition; 3.6.3 Biconnected Component Decomposition; 3.7 Conclusions; References

Chapter 4 Distributed Memory Parallel Algorithms for Massive Graphs
  4.1 Introduction; 4.2 Generating Random Graphs; 4.2.1 Erdős–Rényi Model; 4.2.1.1 Efficient Sequential Algorithm; 4.2.1.2 Distributed Memory Parallel Algorithm; 4.2.2 Preferential Attachment Model; 4.2.3 Switching Edges of a Graph and Generating Random Graphs with Given Degree Sequences; 4.3 Counting Triangles; 4.3.1 An Algorithm Using MPI; 4.3.2 Algorithms Using MapReduce; 4.4 Community Detection; 4.5 Giraph: A Distributed Graph Processing Framework; 4.5.1 Parallel Algorithm for the Single Source Shortest Path Problem Using Giraph; 4.6 Breadth-First Search; References

Chapter 5 Efficient Multi-core Algorithms for Computing Spanning Forests and Connected Components
  5.1 Introduction; 5.2 Previous Work; 5.3 Union-Find Algorithms; 5.4 Parallelizing Rem's Algorithm; 5.4.1 Using Locks; 5.4.2 A New Lock-Free Approach; 5.4.3 A Local Algorithm; 5.5 Experiments; 5.6 Concluding Remarks; Acknowledgments; References

Chapter 6 Massive-Scale Distributed Triangle Computation and Applications
  6.1 Introduction; 6.1.1 Preliminary Definitions and Terminology; 6.1.2 Motivation for Enumerating Triangles; 6.1.3 System and Algorithm Design Criteria; 6.2 Algorithmic Primer for Exact Computation; 6.2.1 IEEE HPEC Graph Challenge; 6.2.2 Classical Graph Algorithms; 6.2.3 Linear-Algebraic Techniques; 6.2.3.1 Sparse Linear Algebra; 6.2.3.2 Dense Linear Algebra; 6.3 Distributed Asynchronous Triangle Enumeration; 6.3.1 An Asynchronous Distributed Graph Framework; 6.3.1.1 Communication; 6.3.1.2 Effects of Aggregation and Routing; 6.3.2 Triangle Counting at Massive Scale; 6.3.2.1 1.5M CPUs on IBM BG/Q Sequoia; 6.3.2.2 Scaling on a Linux Cluster; 6.3.3 Asynchronous Distributed κ-Truss Decomposition; 6.3.4 Avoiding Communication; 6.4 Approximate Triangle Counting; 6.4.1 Streaming Graph Algorithms; 6.4.2 A Historical Survey of Approximate Triangle Counting; 6.4.2.1 Sparsification Algorithms; 6.4.2.2 Global Approximation Algorithms; 6.4.2.3 Vertex-Local Approximation Algorithms; 6.4.3 Graph Sketching; 6.4.3.1 Triangle Count Estimation via Sketches; 6.4.4 Discussion of Sampling Versus Sketching; References

SECTION III: Algorithms and Applications

Chapter 7 Computing Top-k Closeness Centrality in Fully Dynamic Graphs
  7.1 Introduction; 7.2 Preliminaries; 7.2.1 Notation and Problem Definition; 7.2.2 Related Work; 7.2.3 Static Top-k Closeness; 7.3 Dynamic Top-k Closeness Centrality; 7.3.1 Updating the Number of Reachable Vertices; 7.3.2 Finding Affected Vertices; 7.3.3 Update After an Edge Insertion; 7.3.4 Update After an Edge Removal; 7.3.5 Running Times and Memory Requirements; 7.4 Experiments; 7.4.1 Experimental Setup; 7.4.2 Speedups on Recomputation; 7.5 Conclusions; Acknowledgments; 7.6 Overview of Networks; 7.7 Additional Experimental Results; 7.7.1 Speedups for Complex Networks; 7.7.2 Speedups for Road Networks; 7.7.3 Running Times for Complex Networks; 7.7.4 Running Times for Road Networks; References

Chapter 8 Ordering Heuristics for Parallel Graph Coloring
  8.1 Introduction; 8.1.1 Ordering Heuristics; 8.1.2 Parallel Greedy Coloring; 8.1.3 Outline; 8.2 The Jones-Plassmann Algorithm; 8.2.1 The Dag Model of Dynamic Multithreading; 8.2.2 Analysis of JP; 8.3 JP with Random Ordering; 8.4 The LF and SL Heuristics; 8.4.1 The LF Ordering Heuristic; 8.4.2 The SL Ordering Heuristic; 8.5 Log Ordering Heuristics; 8.5.1 The LLF Ordering Heuristic; 8.5.2 The SLL Ordering Heuristic; 8.6 Empirical Evaluation; 8.6.1 Experimental Setup; 8.6.2 Coloring Quality of R, LLF, and SLL; 8.6.3 Scalability of JP-R, JP-LLF, and JP-SLL; 8.7 Implementation Techniques; 8.7.1 Join Trees for Reducing Memory Contention; 8.7.2 Bit Vectors for Assigning Colors; 8.7.3 Software Prefetching; 8.8 The SD Heuristic; 8.9 Related Work; 8.10 Conclusion; Acknowledgments; 8.11 Appendix: Performance of Serial Ordering Heuristics; References

Chapter 9 Partitioning Trillion-Edge Graphs
  9.1 Introduction; 9.2 Background; 9.2.1 Graph Partitioning; 9.2.2 Related Work; 9.3 XTRAPULP; 9.3.1 XTRAPULP Overview; 9.3.1.1 Graph Representation; 9.3.1.2 XTRAPULP Algorithm; 9.3.2 XTRAPULP Initialization; 9.3.2.1 ExchangeUpdates; 9.3.3 XTRAPULP Vertex-Balancing Phase; 9.3.4 XTRAPULP Refinement Phase; 9.3.5 XTRAPULP Edge-Balancing Phase; 9.4 Experimental Setup; 9.5 Results; 9.5.1 Performance and Scalability; 9.5.1.1 Scaling on Blue Waters; 9.5.1.2 Trillion-Edge Runs; 9.5.1.3 Scaling on Compton; 9.5.2 Partitioning Quality; 9.6 Conclusions; References

Chapter 10 New Phenomena in Large-Scale Internet Traffic
  10.1 Introduction; 10.2 Methodology; 10.2.1 MAWI and CAIDA Internet Traffic Collection; 10.2.2 Network Quantities from Matrices; 10.2.3 Memory and Computation Requirements; 10.3 Internet Traffic Modeling; 10.3.1 Logarithmic Pooling; 10.3.2 Modified Zipf–Mandelbrot Model; 10.3.3 Nonlinear Model Fitting; 10.4 Results; 10.5 Discussion; 10.6 Conclusions; Acknowledgments; References

Chapter 11 Parallel Algorithms for Butterfly Computations
  11.1 Introduction; 11.2 Preliminaries; 11.3 PARBUTTERFLY Framework; 11.3.1 Counting Framework; 11.3.1.1 Ranking; 11.3.1.2 Wedge Aggregation; 11.3.1.3 Butterfly Aggregation; 11.3.1.4 Other Options; 11.3.2 Peeling Framework; 11.4 PARBUTTERFLY Algorithms; 11.4.1 Preprocessing; 11.4.2 Counting Algorithms; 11.4.2.1 Wedge Retrieval; 11.4.2.2 Per Vertex; 11.4.2.3 Per Edge; 11.4.3 Peeling Algorithms; 11.4.3.1 Per Vertex; 11.4.3.2 Per Edge; 11.4.3.3 Per Vertex/Edge Storing All Wedges; 11.4.4 Approximate Counting; 11.4.5 Approximate Degree Ordering; 11.4.6 Complement Degeneracy Ordering; 11.5 Parallel Fibonacci Heap; 11.5.1 Batch Insertion; 11.5.2 Delete-min; 11.5.3 Batch Decrease-Key; 11.5.4 Application to Bucketing; 11.5.4.1 Retrieving the Minimum Bucket; 11.5.4.2 Updating the Bucketing Structure; 11.6 Experiments; 11.6.1 Environment; 11.6.2 Results; 11.6.2.1 Butterfly Counting; 11.6.2.2 Ranking; 11.6.2.3 Approximate Counting; 11.6.2.4 Butterfly Peeling; 11.6.3 Cache Optimization for Butterfly Counting; 11.6.4 Butterfly Counting; 11.6.4.1 Ranking; 11.6.4.2 Approximate Counting; 11.7 Related Work; 11.8 Conclusion; Acknowledgments; References

SECTION IV: Models

Chapter 12 Recent Advances in Scalable Network Generation
  12.1 Introduction; 12.2 Graph Properties and Uses of Generators; 12.2.1 Graph Properties; 12.2.2 Use Cases; 12.3 General Algorithmic Models and Techniques; 12.3.1 Models of Computation; 12.3.2 Random Permutations; 12.3.3 Basic Sampling Techniques; 12.3.3.1 Bernoulli Sampling; 12.3.3.2 Sampling k Elements from [N]; 12.3.3.3 Rejection Sampling; 12.3.3.4 Weighted Sampling; 12.3.4 Sampling from Huge Sets; 12.4 Basic Models; 12.4.1 Erdős–Rényi's G(n,m) and Gilbert's G(n,p) Models; 12.4.2 Preferential Attachment; 12.4.2.1 Barabási-Albert Model; 12.4.2.2 Node Copy Model; 12.5 Random Spatial Graphs; 12.5.1 Random Geometric Graphs; 12.5.2 Random Hyperbolic Graphs; 12.5.2.1 Threshold Model; 12.5.2.2 Binomial Model; 12.5.3 Geometric Inhomogeneous Random Graphs; 12.5.4 Random Planar Graphs; 12.5.4.1 Random Delaunay Triangulations; 12.5.4.2 Planar Graphs for Infrastructures; 12.6 Random Graphs with Prescribed Degree Sequences; 12.6.1 Chung-Lu; 12.6.2 Configuration Model; 12.6.3 Edge Switching; 12.6.4 Curveball and Global Curveball; 12.7 Block Models; 12.7.1 Stochastic Block Model; 12.7.2 R-MAT / Kronecker Graphs; 12.7.3 LFR; 12.7.4 CKB; 12.8 Graph Replication; 12.8.1 BTER; 12.8.2 Darwini; 12.8.3 Multilevel Generators; 12.8.4 dK-Graphs; 12.9 Additional Graph Types; 12.9.1 Directed Graphs; 12.9.2 Weighted Graphs; 12.9.3 Connected Graphs; 12.9.4 Regular Graphs; 12.9.5 Threshold Graphs; 12.10 Software Packages; 12.11 Future Challenges; Acknowledgments; References

Chapter 13 Computational Models for Cascades in Massive Graphs: How to Spread a Rumor in Parallel
  13.1 Introduction; 13.2 Models for Rumor Diffusion and Epidemics; 13.3 Cascade Dynamics in a Network; 13.3.1 Independent Cascade Model; 13.3.2 Threshold Models; 13.3.3 Complex Contagion Model; 13.4 A Unified Model for Rumor Cascades; 13.4.1 Analytical Model of Influence; 13.4.2 Unified Model of Influence; 13.4.3 Infection Probability Formula Under the Unified Model; 13.4.3.1 Complexity Analysis; 13.4.4 Reduction to Other Models; 13.4.4.1 Independent Cascade Model; 13.4.4.2 Threshold Models; 13.4.4.3 Complex Contagion Model; 13.4.5 Empirical Evaluation of the Unified Model; 13.5 Parallel Temporal Rumor Cascade Analysis in Massive Graphs; 13.6 Conclusions; Acknowledgment; References

Chapter 14 Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling
  14.1 Introduction; 14.1.1 A Serial Reference Implementation; 14.1.2 Parallelizing Dynamic Data-Graph Computations; 14.1.3 Contributions; 14.1.4 Outline; 14.2 Background; 14.2.1 The Dag Model of Multithreading; 14.2.2 Work-Span Analysis; 14.2.3 Work-Stealing Runtime Systems; 14.2.4 Worker-Local Storage; 14.3 The PRISM Algorithm; 14.3.1 Design Considerations for the Implementation of Multibags; 14.4 The Multibag Data Structure; 14.4.1 Implementation of MB-INSERT and MB-COLLECT; 14.4.2 Analysis of Multibags; 14.5 Analysis of PRISM; 14.5.1 Theoretical Scalability of PRISM; 14.6 Empirical Evaluation; 14.6.1 Experimental Setup; 14.6.2 Overheads for Locking and for Chromatic Scheduling; 14.6.3 Dynamic Scheduling Overhead; 14.6.4 Scalability of PRISM; 14.7 The PRISM-R Algorithm; 14.7.1 Data-Graph Computations with Global Reductions; 14.7.2 PRISM-R; 14.8 The Multivector Data Structure; 14.8.1 The Log-Tree Reducer; 14.8.2 Analysis of Multivector Operations; 14.9 Analysis and Evaluation of PRISM-R; 14.9.1 Work-Span Analysis of PRISM-R; 14.9.2 Empirical Evaluation of PRISM-R; 14.10 Conclusion; 14.11 Acknowledgments; References

SECTION V: Frameworks and Software

Chapter 15 Graph Data Science Using Neo4j
  15.1 Introduction; 15.1.1 What is Graph Data Science?; 15.1.2 Types of Questions for GDS; 15.1.3 The Rise of GDS; 15.2 Neo4j for GDS; 15.2.1 Neo4j Database and Labeled Property Graph; 15.2.2 Cypher Declarative Query Language; 15.2.3 Neo4j GDS Library; 15.2.4 Neo4j Browser and Neo4j Bloom; 15.3 The GDS Journey; 15.3.1 Knowledge Graphs; 15.3.2 Graph Analytics; 15.3.3 Graph Feature Engineering; 15.3.4 Graph Embedding; 15.3.5 Graph Networks; 15.4 Real-World Uses of GDS; 15.4.1 Prominent Commercial Uses; 15.4.1.1 Drug Discovery and Patient Journeys; 15.4.1.2 Recommendations and Targeted Marketing; 15.4.1.3 Fraud Detection; 15.5 Fraud Example; 15.5.1 Query Dataset; 15.5.2 Outlier Removal; 15.5.3 Finding Suspicious Clusters; 15.5.4 Visually Exploring a Suspicious Cluster; 15.5.5 Predicting Fraudsters Using Graph Features; 15.6 Conclusion; References

Chapter 16 The Parallel Boost Graph Library 2.0: Active Messages as a Spanning Model for Parallel Graph Computation
  16.1 Introduction; 16.2 Related Work; 16.2.1 Active Messages; 16.2.2 Parallel Graph Libraries; 16.2.3 Active Messages for Graph Algorithms; 16.3 Programming Model; 16.3.1 Background: Active Pebbles; 16.4 Design of a Graph Application Stack; 16.4.1 Distributed Data Model; 16.4.2 Runtime Communication Transformations; 16.4.3 Separating Expression from Execution; 16.4.3.1 Message Configuration; 16.5 Experimental Evaluation; 16.5.1 Implementation Details; 16.5.2 Results; 16.5.2.1 Breadth-First Search; 16.5.2.2 Δ-Stepping Shortest Paths; 16.5.2.3 Shiloach-Vishkin Connected Components; 16.5.2.4 PageRank; 16.6 Conclusion; References

Chapter 17 RAPIDS cuGraph
  17.1 Introduction; 17.2 Data Science and Workflow Integration; 17.2.1 Ecosystem; 17.2.2 Workflow Integration; 17.3 Technology Stack and Software; 17.3.1 Transformations and Renumbering; 17.3.2 Stack; 17.4 Performance Results; 17.4.1 Single GPU Performance; 17.4.2 Multiple GPU Performance; 17.5 Conclusion; Acknowledgment; References

Chapter 18 A Cloud-Based Approach to Big Graphs
  18.1 Introduction; 18.2 Background; 18.2.1 Graph Definition; 18.2.2 Memory Latency Challenges; 18.2.3 Scalability Challenges; 18.2.4 Fault-Tolerance; 18.3 Cloud-Based Approach; 18.4 Related Work; 18.5 A Few Words on MapReduce; 18.6 Graph Representation; 18.6.1 Edge List; 18.6.2 Accumulo Edge Table; 18.7 Breadth-First Search; 18.7.1 Cloud-Based BFS; 18.7.2 Accumulo Adjacency; 18.7.3 Key-Space Partitioning; 18.7.4 Reduce Task Count; 18.8 Results and Discussion; 18.8.1 Graph500 Benchmark; 18.8.2 Experiments; 18.8.3 Results; 18.9 Conclusion; References

Chapter 19 Introduction to GraphBLAS
  19.1 Adjacency Matrix; 19.2 Incidence Matrix; 19.3 Matrix Values; 19.4 Scalar Operations; 19.5 Matrix Properties; 19.6 0-Element: No Graph Edge; 19.7 Matrix Graph Operations; 19.7.1 Building a Matrix: Edge List to Graph; 19.7.2 Extracting Tuples: Graph to Vertex List; 19.7.3 Transpose: Swap Out-Vertices and In-Vertices; 19.7.4 Matrix Multiplication: Breadth-First Search and Adjacency Matrix Construction; 19.7.5 Extract: Selecting Sub-Graphs; 19.7.6 Assign: Modifying Sub-Graphs; 19.7.7 Element-Wise Addition and Element-Wise Multiplication: Combining Graphs, Intersecting Graphs, and Scaling Graphs; 19.8 Performance; 19.9 Conclusions; References

Chapter 20 Graphulo: Linear Algebra Graph Kernels
  20.1 Introduction; 20.1.1 The Graphulo Initiative; 20.1.2 Outline; 20.2 Associative Arrays and Graph Schemas; 20.2.1 Associative Arrays; 20.2.2 Graph Schemas; 20.2.2.1 Adjacency Matrix; 20.2.2.2 Incidence Matrix; 20.2.2.3 D4M Schema; 20.3 Algorithms; 20.3.1 Centrality; 20.3.2 Subgraph Detection and Vertex Nomination; 20.3.3 Similarity; 20.3.4 Community Detection; 20.3.5 Graph Algorithms and Neural Networks; 20.4 Graphulo Implementation; 20.4.1 Graphulo; 20.4.2 D4M; 20.5 Discussion; 20.6 Conclusions; Acknowledgments; References

Chapter 21 Interactive Graph Analytics at Scale in Arkouda
  21.1 Introduction; 21.2 Arkouda Framework for Data Science; 21.3 Succinct Double-Index Data Structure; 21.3.1 Edge Index and Vertex Index; 21.3.2 Time and Space Complexity Analysis; 21.3.3 Comparison with CSR; 21.4 Multilocale Breadth-First Search and Triangle Counting Algorithms; 21.4.1 Parallel BFS Algorithm; 21.4.1.1 High-Level Multilocale BFS Algorithm; 21.4.1.2 Low-Level Multilocale BFS Algorithm; 21.4.2 Parallel Triangle Counting Algorithm; 21.5 Graph Analytics Workflow; 21.5.1 Edge-Oriented Sparse Graph Partitioning and Building; 21.5.2 Sliding-Window Stream-Based Sketch Building; 21.5.3 Real-World Graph Distributions-Based Regression Model; 21.6 Integration with Arkouda; 21.6.1 Graph Classes Definition in Python and Chapel; 21.6.2 BFS and Triangle Counting Benchmark; 21.7 Experiments; 21.7.1 Experimental Setup for BFS Analysis; 21.7.2 Experimental Results of BFS Algorithm; 21.7.2.1 Graph Building; 21.7.2.2 BFS Performance; 21.7.3 Experimental Setup for Triangle Counting; 21.7.4 Experimental Results of Triangle Counting; 21.7.4.1 Accuracy; 21.7.4.2 Performance; 21.8 Related Work; 21.8.1 BFS Algorithm; 21.8.2 Triangle Counting Algorithm; 21.8.3 Graph Stream Sketch; 21.8.4 Complete Graph Stream Processing Method; 21.9 Conclusion; Acknowledgments; References

Index