Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing.
As a software engineer, you’ll encounter countless programming challenges that initially seem confusing, difficult, or even impossible. Don’t despair! Many of these “new” problems already have well-established solutions. Advanced Algorithms and Data Structures teaches you powerful approaches to a wide range of tricky coding challenges that you can adapt and apply to your own applications. Providing a balanced blend of classic, advanced, and new algorithms, this practical guide upgrades your programming toolbox with new perspectives and hands-on techniques.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Can you improve the speed and efficiency of your applications without investing in new hardware? Well, yes, you can: Innovations in algorithms and data structures have led to huge advances in application performance. Pick up this book to discover a collection of advanced algorithms that will make you a more effective developer.
About the book
Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing. You’ll discover cutting-edge approaches to a variety of tricky scenarios. You’ll even learn to design your own data structures for projects that require a custom solution.
- Build on basic data structures you already know
- Profile your algorithms to speed up application
- Store and query strings efficiently
- Distribute clustering algorithms with MapReduce
- Solve logistics problems using graphs and optimization algorithms
About the reader
For intermediate programmers.
About the author
Marcello La Rocca is a research scientist and a full-stack engineer. His focus is on optimization algorithms, genetic algorithms, machine learning, and quantum computing.
Table of Contents
1 Introducing data structures
PART 1 IMPROVING OVER BASIC DATA STRUCTURES
2 Improving priority queues: d-way heaps
3 Treaps: Using randomization to balance binary search trees
4 Bloom filters: Reducing the memory for tracking content
5 Disjoint sets: Sub-linear time processing
6 Trie, radix trie: Efficient string search
7 Use case: LRU cache
PART 2 MULTIDEMENSIONAL QUERIES
8 Nearest neighbors search
9 K-d trees: Multidimensional data indexing
10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval
11 Applications of nearest neighbor search
13 Parallel clustering: MapReduce and canopy clustering
PART 3 PLANAR GRAPHS AND MINIMUM CROSSING NUMBER
14 An introduction to graphs: Finding paths of minimum distance
15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections
16 Gradient descent: Optimization problems (not just) on graphs
17 Simulated annealing: Optimization beyond local minima
18 Genetic algorithms: Biologically inspired, fast-converging optimization
inside front cover Advanced Algorithms and Data Structures Copyright dedication contents front matter foreword preface Welcome to Advanced Algorithms and Data Structures acknowledgments about this book Who should read this book? How this book is organized: a roadmap About the code liveBook discussion forum about the author about the cover illustration 1 Introducing data structures 1.1 Data structures 1.1.1 Defining a data structure 1.1.2 Describing a data structure 1.1.3 Algorithms and data structures: Is there a difference? 1.2 Setting goals: Your expectations after reading this book 1.3 Packing your knapsack: Data structures meet the real world 1.3.1 Abstracting the problem away 1.3.2 Looking for solutions 1.3.3 Algorithms to the rescue 1.3.4 Thinking (literally) outside of the box 1.3.5 Happy ending Summary Part 1. Improving over basic data structures 2 Improving priority queues: d-way heaps 2.1 Structure of this chapter 2.2 The problem: Handling priority 2.2.1 Priority in practice: Bug tracking 2.3 Solutions at hand: Keeping a sorted list 2.3.1 From sorted lists to priority queues 2.4 Describing the data structure API: Priority queues 2.4.1 Priority queue at work 2.4.2 Priority matters: Generalize FIFO 2.5 Concrete data structures 2.5.1 Comparing performance 2.5.2 What’s the right concrete data structure? 2.5.3 Heap 2.5.4 Priority, min-heap, and max-heap 2.5.5 Advanced variant: d-ary heap 2.6 How to implement a heap 2.6.1 BubbleUp 2.6.2 PushDown 2.6.3 Insert 2.6.4 Top 2.6.5 Update 2.6.6 Dealing with duplicates 2.6.7 Heapify 2.6.8 Beyond API methods: Contains 2.6.9 Performance recap 2.6.10 From pseudo-code to implementation 2.7 Use case: Find the k largest elements 2.7.1 The right data structure . . . 2.7.2 . . . and the right use 2.7.3 Coding it up 2.8 More use cases 2.8.1 Minimum distance in graphs: Dijkstra 2.8.2 More graphs: Prim's algorithm 2.8.3 Data compression: Huffman codes 2.9 Analysis of branching factor25 2.9.1 Do we need d-ary heaps? 2.9.2 Running time 2.9.3 Finding the optimal branching factor 2.9.4 Branching factor vs memory 2.10 Performance analysis: Finding the best branching factor 2.10.1 Please welcome profiling 2.10.2 Interpreting results 2.10.3 The mystery with heapify 2.10.4 Choosing the best branching factor Summary 3 Treaps: Using randomization to balance binary search trees 3.1 Problem: Multi-indexing 3.1.1 The gist of the solution 3.2 Solution: Description and API 3.3 Treap 3.3.1 Rotations 3.3.2 A few design questions 3.3.3 Implementing search 3.3.4 Insert 3.3.5 Delete 3.3.6 Top, peek, and update 3.3.7 Min, max 3.3.8 Performance recap 3.4 Applications: Randomized treaps 3.4.1 Balanced trees 3.4.2 Introducing randomization 3.4.3 Applications of Randomized Treaps 3.5 Performance analysis and profiling 3.5.1 Theory: Expected height 3.5.2 Profiling height 3.5.3 Profiling running time 3.5.4 Profiling memory usage 3.5.5 Conclusions Summary 4 Bloom filters: Reducing the memory for tracking content 4.1 The dictionary problem: Keeping track of things 4.2 Alternatives to implementing a dictionary 4.3 Describing the data structure API: Associative array 4.4 Concrete data structures 4.4.1 Unsorted array: Fast insertion, slow search 4.4.2 Sorted arrays and binary search: Slow insertion, fast(-ish) search 4.4.3 Hash table: Constant-time on average, unless you need ordering 4.4.4 Binary search tree: Every operation is logarithmic 4.4.5 Bloom filter: As fast as hash tables, but saves memory (with a catch) 4.5 Under the hood: How do Bloom filters work? 4.6 Implementation 4.6.1 Using a Bloom filter 4.6.2 Reading and writing bits 4.6.3 Find where a key is stored 4.6.4 Generating hash functions 4.6.5 Constructor 4.6.6 Checking a key 4.6.7 Storing a key 4.6.8 Estimating accuracy 4.7 Applications 4.7.1 Cache 4.7.2 Routers 4.7.3 Crawler 4.7.4 IO fetcher 4.7.5 Spell checker 4.7.6 Distributed databases and file systems 4.8 Why Bloom filters work21 4.8.1 Why there are no false negatives . . . 4.8.2 . . . But there are false positives 4.8.3 Bloom filters as randomized algorithms 4.9 Performance analysis 4.9.1 Running time 4.9.2 Constructor 4.9.3 Storing an element 4.9.4 Looking up an element 4.10 Estimating Bloom filter precision25 4.10.1 Explanation of the false-positive ratio formula 4.11 Improved variants 4.11.1 Bloomier filter 4.11.2 Combining Bloom filters 4.11.3 Layered Bloom filter 4.11.4 Compressed Bloom filter 4.11.5 Scalable Bloom filter Summary 5 Disjoint sets: Sub-linear time processing 5.1 The distinct subsets problem 5.2 Reasoning on solutions 5.3 Describing the data structure API: Disjoint set 5.4 Naïve solution 5.4.1 Implementing naïve solution 5.5 Using a tree-like structure 5.5.1 From list to trees 5.5.2 Implementing the tree version 5.6 Heuristics to improve the running time 5.6.1 Path compression 5.6.2 Implementing balancing and path compression 5.7 Applications 5.7.1 Graphs: Connected components 5.7.2 Graphs:15 Kruskal’s algorithm for minimum spanning tree 5.7.3 Clustering 5.7.4 Unification Summary 6 Trie, radix trie: Efficient string search 6.1 Spell-check 6.1.1 A prncess, a Damon, and an elf walkz into a bar 6.1.2 Compression is the key 6.1.3 Description and API 6.2 Trie 6.2.1 Why is it better again? 6.2.2 Search 6.2.3 Insert 6.2.4 Remove 6.2.5 Longest prefix 6.2.6 Keys matching a prefix 6.2.7 When should we use tries? 6.3 Radix tries 6.3.1 Nodes and edges 6.3.2 Search 6.3.3 Insert 6.3.4 Remove 6.3.5 Longest common prefix 6.3.6 Keys starting with a prefix 6.4 Applications 6.4.1 Spell-checker 6.4.2 String similarity 6.4.3 String sorting 6.4.4 T9 6.4.5 Autocomplete Summary 7 Use case: LRU cache 7.1 Don’t compute things twice 7.2 First attempt: Remembering values 7.2.1 Description and API 7.2.2 Fresh data, please 7.2.3 Handling asynchronous calls 7.2.4 Marking cache values as “Loading” 7.3 Memory is not enough (literally) 7.4 Getting rid of stale data: LRU cache 7.4.1 Sometimes you have to double down on problems 7.4.2 Temporal ordering 7.4.3 Performance 7.5 When fresher data is more valuable: LFU 7.5.1 So how do we choose? 7.5.2 What makes LFU different 7.5.3 Performance 7.5.4 Problems with LFU 7.6 How to use cache is just as important 7.7 Introducing synchronization 7.7.1 Solving concurrency (in Java) 7.7.2 Introducing locks 7.7.3 Acquiring a lock 7.7.4 Reentrant locks 7.7.5 Read locks 7.7.6 Other approaches to concurrency 7.8 Cache applications Summary Part 2. Multidimensional queries 8 Nearest neighbors search 8.1 The nearest neighbors search problem 8.2 Solutions 8.2.1 First attempts 8.2.2 Sometimes caching is not the answer 8.2.3 Simplifying things to get a hint 8.2.4 Carefully choose a data structure 8.3 Description and API 8.4 Moving to k-dimensional spaces 8.4.1 Unidimensional binary search 8.4.2 Moving to higher dimensions 8.4.3 Modeling 2-D partitions with a data structure Summary 9 K-d trees: Multidimensional data indexing 9.1 Right where we left off 9.2 Moving to k-D spaces: Cycle through dimensions 9.2.1 Constructing the BST 9.2.2 Invariants 9.2.2 The importance of being balanced 9.3 Methods 9.3.1 Search 9.3.2 Insert 9.3.3 Balanced tree 9.3.4 Remove 9.3.5 Nearest neighbor 9.3.6 Region search 9.3.7 A recap of all methods 9.4 Limits and possible improvements Summary 10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval 10.1 Right where we left off 10.1.1 A new (more complex) example 10.1.2 Overcoming k-d trees’ flaws 10.2 R-tree 10.2.1 A step back: Introducing B-trees 10.2.2 From B-Tree to R-tree 10.2.3 Inserting points in an R-tree 10.2.4 Search 10.3 Similarity search tree 10.3.1 SS-tree search 10.3.2 Insert 10.3.3 Insertion: Variance, means, and projections 10.3.4 Insertion: Split nodes 10.3.5 Delete 10.4 Similarity Search 10.4.1 Nearest neighbor search 10.4.2 Region search 10.4.3 Approximated similarity search 10.5 SS+-tree18 10.5.1 Are SS-trees better? 10.5.2 Mitigating hyper-sphere limitations 10.5.3 Improved split heuristic 10.5.4 Reducing overlap Summary 11 Applications of nearest neighbor search 11.1 An application: Find nearest hub 11.1.1 Sketching a solution 11.1.2 Trouble in paradise 11.2 Centralized application 11.2.1 Filtering points 11.2.2 Complex decisions 11.3 Moving to a distributed application 11.3.1 Issues handling HTTP communication 11.3.2 Keeping the inventory in sync 11.3.3 Lessons learned 11.4 Other applications 11.4.1 Color reduction 11.4.2 Particle interaction 11.4.3 Multidimensional DB queries optimization 11.4.4 Clustering Summary 12 Clustering 12.1 Intro to clustering 12.1.1 Types of learning 12.1.2 Types of clustering 12.2 K-means 12.2.1 Issues with k-means 12.2.2 The curse of dimensionality strikes again 12.2.3 K-means performance analysis 12.2.4 Boosting k-means with k-d trees 12.2.5 Final remarks on k-means 12.3 DBSCAN 12.3.1 Directly vs density-reachable 12.3.2 From definitions to an algorithm 12.3.3 And finally, an implementation 12.3.4 Pros and cons of DBSCAN 12.4 OPTICS 12.4.1 Definitions 12.4.2 OPTICS algorithm 12.4.3 From reachability distance to clustering 12.4.4 Hierarchical clustering 12.4.5 Performance analysis and final considerations 12.5 Evaluating clustering results: Evaluation metrics 12.5.1 Interpreting the results Summary 13 Parallel clustering: MapReduce and canopy clustering 13.1 Parallelization 13.1.1 Parallel vs distributed 13.1.2 Parallelizing k-means 13.1.3 Canopy clustering 13.1.4 Applying canopy clustering 13.2 MapReduce 13.2.1 Imagine you are Donald Duck . . . 13.2.2 First map, then reduce 13.2.3 There is more under the hood 13.3 MapReduce k-means 13.3.1 Parallelizing canopy clustering 13.3.2 Centroid initialization with canopy clustering 13.3.3 MapReduce canopy clustering 13.4 MapReduce DBSCAN Summary Part 3. Planar graphs and minimum crossing number 14 An introduction to graphs: Finding paths of minimum distance 14.1 Definitions 14.1.1 Implementing graphs 14.1.2 Graphs as algebraic types 14.1.3 Pseudo-code 14.2 Graph properties 14.2.1 Undirected 14.2.2 Connected 14.2.3 Acyclic 14.3 Graph traversal: BFS and DFS 14.3.1 Optimizing delivery routes 14.3.2 Breadth first search 14.3.3 Reconstructing the path to target 14.3.4 Depth first search 14.3.5 It’s queue vs stack again 14.3.6 Best route to deliver a parcel 14.4 Shortest path in weighted graphs: Dijkstra 14.4.1 Differences with BFS 14.4.2 Implementation 14.4.3 Analysis 14.4.4 Shortest route for deliveries 14.5 Beyond Dijkstra’s algorithm: A* 14.5.1 How good is A* search? 14.5.2 Heuristics as a way to balance real-time data Summary 15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections 15.1 Graph embeddings 15.1.1 Some basic definitions 15.1.2 Complete and bipartite graphs 15.2 Planar graphs 15.2.1 Using Kuratowski’s theorem in practice 15.2.2 Planarity testing 15.2.3 A naïve algorithm for planarity testing 15.2.4 Improving performance 15.2.5 Efficient algorithms 15.3 Non-planar graphs 15.3.1 Finding the crossing number 15.3.2 Rectilinear crossing number 15.4 Edge intersections 15.4.1 Straight-line segments 15.4.2 Polylines 15.4.3 Bézier curves 15.4.4 Intersections between quadratic Bézier curves 15.4.5 Vertex-vertex and edge-vertex intersections Summary 16 Gradient descent: Optimization problems (not just) on graphs 16.1 Heuristics for the crossing number 16.1.1 Did you just say heuristics? 16.1.2 Extending to curve-line edges 16.2 How optimization works 16.2.1 Cost functions 16.2.2 Step functions and local minima 16.2.3 Optimizing random sampling 16.3 Gradient descent 16.3.1 The math of gradient descent 16.3.2 Geometrical interpretation 16.3.3 When is gradient descent appliable? 16.3.4 Problems with gradient descent 16.4 Applications of gradient descent 16.4.1 An example: Linear regression 16.5 Gradient descent for graph embedding 16.5.1 A different criterion 16.5.2 Implementation Summary 17 Simulated annealing: Optimization beyond local minima 17.1 Simulated annealing 17.1.1 Sometimes you need to climb up to get to the bottom 17.1.2 Implementation 17.1.3 Why simulated annealing works 17.1.4 Short-range vs long-range transitions 17.1.5 Variants 17.1.6 Simulated annealing vs gradient descent: Which one should I use? 17.2 Simulated annealing + traveling salesman 17.2.1 Exact vs approximated solutions 17.2.2 Visualizing cost 17.2.3 Pruning the domain 17.2.4 State transitions 17.2.5 Adjacent vs random swaps 17.2.6 Applications of TSP 17.3 Simulated annealing and graph embedding 17.3.1 Minimum edge crossing 17.3.2 Force-directed drawing Summary 18 Genetic algorithms: Biologically inspired, fast-converging optimization 18.1 Genetic algorithms 18.1.1 Inspired by nature 18.1.2 Chromosomes 18.1.3 Population 18.1.4 Fitness 18.1.5 Natural selection 18.1.6 Selecting individuals for mating 18.1.7 Crossover 18.1.8 Mutations 18.1.9 The genetic algorithm template 18.1.10 When does the genetic algorithm work best? 18.2 TSP 18.2.1 Fitness, chromosomes, and initialization 18.2.2 Mutations 18.2.3 Crossover 18.2.4 Results and parameters tuning 18.2.5 Beyond TSP: Optimizing the routes of the whole fleet 18.3 Minimum vertex cover 18.3.1 Applications of vertex cover 18.3.2 Implementing a genetic algorithm 18.4 Other applications of the genetic algorithm 18.4.1 Maximum flow 18.4.2 Protein folding 18.4.3 Beyond genetic algorithms 18.4.4 Algorithms, beyond this book Summary appendix A. A quick guide to pseudo-code A.1 Variables and basics A.2 Arrays A.3 Conditional instructions A.3.1 Else-if A.3.2 Switch A.4 Loops A.4.1 For loop A.4.2 While loop A.4.3 Break and continue A.5 Blocks and indent A.6 Functions A.6.1 Overloading and default arguments A.6.2 Tuples A.6.3 Tuples and destructuring objects appendix B. Big-O notation B.1 Algorithms and performance B.2 The RAM model B.3 Order of magnitude B.4 Notation B.5 Examples appendix C. Core data structures C.1 Core data structures C.2 Array C.3 Linked List C.4 Tree C.4.1 Binary search trees C.5 Hash table C.5.1 Storing key-value pairs C.5.2 Hashing C.5.3 Conflicts resolution in hashing C.5.4 Performance C.6 Comparative analysis of core data structures appendix D. Containers as priority queues D.1 Bag D.2 Stack D.3 Queue D.4 A comparative analysis of containers appendix E. Recursion E.1 Simple recursion E.1.1 Pitfalls E.1.2 Good recursion E.2 Tail recursion E.3 Mutual recursion appendix F. Classification problems and randomnized algorithm metrics F.1 Decision problems F.2 Las Vegas algorithms F.3 Monte Carlo algorithms F.4 Classification metrics F.4.1 Accuracy F.4.2 Precision and recall F.4.3 Other metrics and recap index inside back cover
How to download source code?
1. Go to:
2. Search the book title:
Advanced Algorithms and Data Structures, sometime you may not get the results, please search the main title
3. Click the book title in the search results
resources section, click
1. Disable the AdBlock plugin. Otherwise, you may not get any links.
2. Solve the CAPTCHA.
3. Click download link.
4. Lead to download server to download.