Advanced Algorithms and Data Structures

Length: 768 pages
Edition: 1
Language: English
Publisher: Manning Publications
Publication Date: 2021-06-22
ISBN-10: 1617295485
ISBN-13: 9781617295485
Sales Rank: #730368 (See Top 100 Books)

Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing.

Summary

As a software engineer, you’ll encounter countless programming challenges that initially seem confusing, difficult, or even impossible. Don’t despair! Many of these “new” problems already have well-established solutions. Advanced Algorithms and Data Structures teaches you powerful approaches to a wide range of tricky coding challenges that you can adapt and apply to your own applications. Providing a balanced blend of classic, advanced, and new algorithms, this practical guide upgrades your programming toolbox with new perspectives and hands-on techniques.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology

Can you improve the speed and efficiency of your applications without investing in new hardware? Well, yes, you can: Innovations in algorithms and data structures have led to huge advances in application performance. Pick up this book to discover a collection of advanced algorithms that will make you a more effective developer.

About the book

Advanced Algorithms and Data Structures introduces a collection of algorithms for complex programming challenges in data analysis, machine learning, and graph computing. You’ll discover cutting-edge approaches to a variety of tricky scenarios. You’ll even learn to design your own data structures for projects that require a custom solution.

What’s inside

Build on basic data structures you already know
Profile your algorithms to speed up application
Store and query strings efficiently
Distribute clustering algorithms with MapReduce
Solve logistics problems using graphs and optimization algorithms

About the reader

For intermediate programmers.

About the author

Marcello La Rocca is a research scientist and a full-stack engineer. His focus is on optimization algorithms, genetic algorithms, machine learning, and quantum computing.

1 Introducing data structures

PART 1 IMPROVING OVER BASIC DATA STRUCTURES
2 Improving priority queues: d-way heaps
3 Treaps: Using randomization to balance binary search trees
4 Bloom filters: Reducing the memory for tracking content
5 Disjoint sets: Sub-linear time processing
6 Trie, radix trie: Efficient string search
7 Use case: LRU cache

PART 2 MULTIDEMENSIONAL QUERIES
8 Nearest neighbors search
9 K-d trees: Multidimensional data indexing
10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval
11 Applications of nearest neighbor search
12 Clustering
13 Parallel clustering: MapReduce and canopy clustering

PART 3 PLANAR GRAPHS AND MINIMUM CROSSING NUMBER
14 An introduction to graphs: Finding paths of minimum distance
15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections
16 Gradient descent: Optimization problems (not just) on graphs
17 Simulated annealing: Optimization beyond local minima
18 Genetic algorithms: Biologically inspired, fast-converging optimization

inside front cover
Advanced Algorithms and Data Structures
Copyright
dedication
contents
front matter
    foreword
    preface
        Welcome to Advanced Algorithms and Data Structures
    acknowledgments
    about this book
        Who should read this book?
        How this book is organized: a roadmap
        About the code
        liveBook discussion forum
    about the author
    about the cover illustration
1 Introducing data structures
    1.1 Data structures
        1.1.1 Defining a data structure
        1.1.2 Describing a data structure
        1.1.3 Algorithms and data structures: Is there a difference?
    1.2 Setting goals: Your expectations after reading this book
    1.3 Packing your knapsack: Data structures meet the real world
        1.3.1 Abstracting the problem away
        1.3.2 Looking for solutions
        1.3.3 Algorithms to the rescue
        1.3.4 Thinking (literally) outside of the box
        1.3.5 Happy ending
    Summary
Part 1. Improving over basic data structures
2 Improving priority queues: d-way heaps
    2.1 Structure of this chapter
    2.2 The problem: Handling priority
        2.2.1 Priority in practice: Bug tracking
    2.3 Solutions at hand: Keeping a sorted list
        2.3.1 From sorted lists to priority queues
    2.4 Describing the data structure API: Priority queues
        2.4.1 Priority queue at work
        2.4.2 Priority matters: Generalize FIFO
    2.5 Concrete data structures
        2.5.1 Comparing performance
        2.5.2 What’s the right concrete data structure?
        2.5.3 Heap
        2.5.4 Priority, min-heap, and max-heap
        2.5.5 Advanced variant: d-ary heap
    2.6 How to implement a heap
        2.6.1 BubbleUp
        2.6.2 PushDown
        2.6.3 Insert
        2.6.4 Top
        2.6.5 Update
        2.6.6 Dealing with duplicates
        2.6.7 Heapify
        2.6.8 Beyond API methods: Contains
        2.6.9 Performance recap
        2.6.10 From pseudo-code to implementation
    2.7 Use case: Find the k largest elements
        2.7.1 The right data structure . . .
        2.7.2 . . . and the right use
        2.7.3 Coding it up
    2.8 More use cases
        2.8.1 Minimum distance in graphs: Dijkstra
        2.8.2 More graphs: Prim's algorithm
        2.8.3 Data compression: Huffman codes
    2.9 Analysis of branching factor25
        2.9.1 Do we need d-ary heaps?
        2.9.2 Running time
        2.9.3 Finding the optimal branching factor
        2.9.4 Branching factor vs memory
    2.10 Performance analysis: Finding the best branching factor
        2.10.1 Please welcome profiling
        2.10.2 Interpreting results
        2.10.3 The mystery with heapify
        2.10.4 Choosing the best branching factor
    Summary
3 Treaps: Using randomization to balance binary search trees
    3.1 Problem: Multi-indexing
        3.1.1 The gist of the solution
    3.2 Solution: Description and API
    3.3 Treap
        3.3.1 Rotations
        3.3.2 A few design questions
        3.3.3 Implementing search
        3.3.4 Insert
        3.3.5 Delete
        3.3.6 Top, peek, and update
        3.3.7 Min, max
        3.3.8 Performance recap
    3.4 Applications: Randomized treaps
        3.4.1 Balanced trees
        3.4.2 Introducing randomization
        3.4.3 Applications of Randomized Treaps
    3.5 Performance analysis and profiling
        3.5.1 Theory: Expected height
        3.5.2 Profiling height
        3.5.3 Profiling running time
        3.5.4 Profiling memory usage
        3.5.5 Conclusions
    Summary
4 Bloom filters: Reducing the memory for tracking content
    4.1 The dictionary problem: Keeping track of things
    4.2 Alternatives to implementing a dictionary
    4.3 Describing the data structure API: Associative array
    4.4 Concrete data structures
        4.4.1 Unsorted array: Fast insertion, slow search
        4.4.2 Sorted arrays and binary search: Slow insertion, fast(-ish) search
        4.4.3 Hash table: Constant-time on average, unless you need ordering
        4.4.4 Binary search tree: Every operation is logarithmic
        4.4.5 Bloom filter: As fast as hash tables, but saves memory (with a catch)
    4.5 Under the hood: How do Bloom filters work?
    4.6 Implementation
        4.6.1 Using a Bloom filter
        4.6.2 Reading and writing bits
        4.6.3 Find where a key is stored
        4.6.4 Generating hash functions
        4.6.5 Constructor
        4.6.6 Checking a key
        4.6.7 Storing a key
        4.6.8 Estimating accuracy
    4.7 Applications
        4.7.1 Cache
        4.7.2 Routers
        4.7.3 Crawler
        4.7.4 IO fetcher
        4.7.5 Spell checker
        4.7.6 Distributed databases and file systems
    4.8 Why Bloom filters work21
        4.8.1 Why there are no false negatives . . .
        4.8.2 . . . But there are false positives
        4.8.3 Bloom filters as randomized algorithms
    4.9 Performance analysis
        4.9.1 Running time
        4.9.2 Constructor
        4.9.3 Storing an element
        4.9.4 Looking up an element
    4.10 Estimating Bloom filter precision25
        4.10.1 Explanation of the false-positive ratio formula
    4.11 Improved variants
        4.11.1 Bloomier filter
        4.11.2 Combining Bloom filters
        4.11.3 Layered Bloom filter
        4.11.4 Compressed Bloom filter
        4.11.5 Scalable Bloom filter
    Summary
5 Disjoint sets: Sub-linear time processing
    5.1 The distinct subsets problem
    5.2 Reasoning on solutions
    5.3 Describing the data structure API: Disjoint set
    5.4 Naïve solution
        5.4.1 Implementing naïve solution
    5.5 Using a tree-like structure
        5.5.1 From list to trees
        5.5.2 Implementing the tree version
    5.6 Heuristics to improve the running time
        5.6.1 Path compression
        5.6.2 Implementing balancing and path compression
    5.7 Applications
        5.7.1 Graphs: Connected components
        5.7.2 Graphs:15 Kruskal’s algorithm for minimum spanning tree
        5.7.3 Clustering
        5.7.4 Unification
    Summary
6 Trie, radix trie: Efficient string search
    6.1 Spell-check
        6.1.1 A prncess, a Damon, and an elf walkz into a bar
        6.1.2 Compression is the key
        6.1.3 Description and API
    6.2 Trie
        6.2.1 Why is it better again?
        6.2.2 Search
        6.2.3 Insert
        6.2.4 Remove
        6.2.5 Longest prefix
        6.2.6 Keys matching a prefix
        6.2.7 When should we use tries?
    6.3 Radix tries
        6.3.1 Nodes and edges
        6.3.2 Search
        6.3.3 Insert
        6.3.4 Remove
        6.3.5 Longest common prefix
        6.3.6 Keys starting with a prefix
    6.4 Applications
        6.4.1 Spell-checker
        6.4.2 String similarity
        6.4.3 String sorting
        6.4.4 T9
        6.4.5 Autocomplete
    Summary
7 Use case: LRU cache
    7.1 Don’t compute things twice
    7.2 First attempt: Remembering values
        7.2.1 Description and API
        7.2.2 Fresh data, please
        7.2.3 Handling asynchronous calls
        7.2.4 Marking cache values as “Loading”
    7.3 Memory is not enough (literally)
    7.4 Getting rid of stale data: LRU cache
        7.4.1 Sometimes you have to double down on problems
        7.4.2 Temporal ordering
        7.4.3 Performance
    7.5 When fresher data is more valuable: LFU
        7.5.1 So how do we choose?
        7.5.2 What makes LFU different
        7.5.3 Performance
        7.5.4 Problems with LFU
    7.6 How to use cache is just as important
    7.7 Introducing synchronization
        7.7.1 Solving concurrency (in Java)
        7.7.2 Introducing locks
        7.7.3 Acquiring a lock
        7.7.4 Reentrant locks
        7.7.5 Read locks
        7.7.6 Other approaches to concurrency
    7.8 Cache applications
    Summary
Part 2. Multidimensional queries
8 Nearest neighbors search
    8.1 The nearest neighbors search problem
    8.2 Solutions
        8.2.1 First attempts
        8.2.2 Sometimes caching is not the answer
        8.2.3 Simplifying things to get a hint
        8.2.4 Carefully choose a data structure
    8.3 Description and API
    8.4 Moving to k-dimensional spaces
        8.4.1 Unidimensional binary search
        8.4.2 Moving to higher dimensions
        8.4.3 Modeling 2-D partitions with a data structure
    Summary
9 K-d trees: Multidimensional data indexing
    9.1 Right where we left off
    9.2 Moving to k-D spaces: Cycle through dimensions
        9.2.1 Constructing the BST
        9.2.2 Invariants
        9.2.2 The importance of being balanced
    9.3 Methods
        9.3.1 Search
        9.3.2 Insert
        9.3.3 Balanced tree
        9.3.4 Remove
        9.3.5 Nearest neighbor
        9.3.6 Region search
        9.3.7 A recap of all methods
    9.4 Limits and possible improvements
    Summary
10 Similarity Search Trees: Approximate nearest neighbors search for image retrieval
    10.1 Right where we left off
        10.1.1 A new (more complex) example
        10.1.2 Overcoming k-d trees’ flaws
    10.2 R-tree
        10.2.1 A step back: Introducing B-trees
        10.2.2 From B-Tree to R-tree
        10.2.3 Inserting points in an R-tree
        10.2.4 Search
    10.3 Similarity search tree
        10.3.1 SS-tree search
        10.3.2 Insert
        10.3.3 Insertion: Variance, means, and projections
        10.3.4 Insertion: Split nodes
        10.3.5 Delete
    10.4 Similarity Search
        10.4.1 Nearest neighbor search
        10.4.2 Region search
        10.4.3 Approximated similarity search
    10.5 SS+-tree18
        10.5.1 Are SS-trees better?
        10.5.2 Mitigating hyper-sphere limitations
        10.5.3 Improved split heuristic
        10.5.4 Reducing overlap
    Summary
11 Applications of nearest neighbor search
    11.1 An application: Find nearest hub
        11.1.1 Sketching a solution
        11.1.2 Trouble in paradise
    11.2 Centralized application
        11.2.1 Filtering points
        11.2.2 Complex decisions
    11.3 Moving to a distributed application
        11.3.1 Issues handling HTTP communication
        11.3.2 Keeping the inventory in sync
        11.3.3 Lessons learned
    11.4 Other applications
        11.4.1 Color reduction
        11.4.2 Particle interaction
        11.4.3 Multidimensional DB queries optimization
        11.4.4 Clustering
    Summary
12 Clustering
    12.1 Intro to clustering
        12.1.1 Types of learning
        12.1.2 Types of clustering
    12.2 K-means
        12.2.1 Issues with k-means
        12.2.2 The curse of dimensionality strikes again
        12.2.3 K-means performance analysis
        12.2.4 Boosting k-means with k-d trees
        12.2.5 Final remarks on k-means
    12.3 DBSCAN
        12.3.1 Directly vs density-reachable
        12.3.2 From definitions to an algorithm
        12.3.3 And finally, an implementation
        12.3.4 Pros and cons of DBSCAN
    12.4 OPTICS
        12.4.1 Definitions
        12.4.2 OPTICS algorithm
        12.4.3 From reachability distance to clustering
        12.4.4 Hierarchical clustering
        12.4.5 Performance analysis and final considerations
    12.5 Evaluating clustering results: Evaluation metrics
        12.5.1 Interpreting the results
    Summary
13 Parallel clustering: MapReduce and canopy clustering
    13.1 Parallelization
        13.1.1 Parallel vs distributed
        13.1.2 Parallelizing k-means
        13.1.3 Canopy clustering
        13.1.4 Applying canopy clustering
    13.2 MapReduce
        13.2.1 Imagine you are Donald Duck . . .
        13.2.2 First map, then reduce
        13.2.3 There is more under the hood
    13.3 MapReduce k-means
        13.3.1 Parallelizing canopy clustering
        13.3.2 Centroid initialization with canopy clustering
        13.3.3 MapReduce canopy clustering
    13.4 MapReduce DBSCAN
    Summary
Part 3. Planar graphs and minimum crossing number
14 An introduction to graphs: Finding paths of minimum distance
    14.1 Definitions
        14.1.1 Implementing graphs
        14.1.2 Graphs as algebraic types
        14.1.3 Pseudo-code
    14.2 Graph properties
        14.2.1 Undirected
        14.2.2 Connected
        14.2.3 Acyclic
    14.3 Graph traversal: BFS and DFS
        14.3.1 Optimizing delivery routes
        14.3.2 Breadth first search
        14.3.3 Reconstructing the path to target
        14.3.4 Depth first search
        14.3.5 It’s queue vs stack again
        14.3.6 Best route to deliver a parcel
    14.4 Shortest path in weighted graphs: Dijkstra
        14.4.1 Differences with BFS
        14.4.2 Implementation
        14.4.3 Analysis
        14.4.4 Shortest route for deliveries
    14.5 Beyond Dijkstra’s algorithm: A*
        14.5.1 How good is A* search?
        14.5.2 Heuristics as a way to balance real-time data
    Summary
15 Graph embeddings and planarity: Drawing graphs with minimal edge intersections
    15.1 Graph embeddings
        15.1.1 Some basic definitions
        15.1.2 Complete and bipartite graphs
    15.2 Planar graphs
        15.2.1 Using Kuratowski’s theorem in practice
        15.2.2 Planarity testing
        15.2.3 A naïve algorithm for planarity testing
        15.2.4 Improving performance
        15.2.5 Efficient algorithms
    15.3 Non-planar graphs
        15.3.1 Finding the crossing number
        15.3.2 Rectilinear crossing number
    15.4 Edge intersections
        15.4.1 Straight-line segments
        15.4.2 Polylines
        15.4.3 Bézier curves
        15.4.4 Intersections between quadratic Bézier curves
        15.4.5 Vertex-vertex and edge-vertex intersections
    Summary
16 Gradient descent: Optimization problems (not just) on graphs
    16.1 Heuristics for the crossing number
        16.1.1 Did you just say heuristics?
        16.1.2 Extending to curve-line edges
    16.2 How optimization works
        16.2.1 Cost functions
        16.2.2 Step functions and local minima
        16.2.3 Optimizing random sampling
    16.3 Gradient descent
        16.3.1 The math of gradient descent
        16.3.2 Geometrical interpretation
        16.3.3 When is gradient descent appliable?
        16.3.4 Problems with gradient descent
    16.4 Applications of gradient descent
        16.4.1 An example: Linear regression
    16.5 Gradient descent for graph embedding
        16.5.1 A different criterion
        16.5.2 Implementation
    Summary
17 Simulated annealing: Optimization beyond local minima
    17.1 Simulated annealing
        17.1.1 Sometimes you need to climb up to get to the bottom
        17.1.2 Implementation
        17.1.3 Why simulated annealing works
        17.1.4 Short-range vs long-range transitions
        17.1.5 Variants
        17.1.6 Simulated annealing vs gradient descent: Which one should I use?
    17.2 Simulated annealing + traveling salesman
        17.2.1 Exact vs approximated solutions
        17.2.2 Visualizing cost
        17.2.3 Pruning the domain
        17.2.4 State transitions
        17.2.5 Adjacent vs random swaps
        17.2.6 Applications of TSP
    17.3 Simulated annealing and graph embedding
        17.3.1 Minimum edge crossing
        17.3.2 Force-directed drawing
    Summary
18 Genetic algorithms: Biologically inspired, fast-converging optimization
    18.1 Genetic algorithms
        18.1.1 Inspired by nature
        18.1.2 Chromosomes
        18.1.3 Population
        18.1.4 Fitness
        18.1.5 Natural selection
        18.1.6 Selecting individuals for mating
        18.1.7 Crossover
        18.1.8 Mutations
        18.1.9 The genetic algorithm template
        18.1.10 When does the genetic algorithm work best?
    18.2 TSP
        18.2.1 Fitness, chromosomes, and initialization
        18.2.2 Mutations
        18.2.3 Crossover
        18.2.4 Results and parameters tuning
        18.2.5 Beyond TSP: Optimizing the routes of the whole fleet
    18.3 Minimum vertex cover
        18.3.1 Applications of vertex cover
        18.3.2 Implementing a genetic algorithm
    18.4 Other applications of the genetic algorithm
        18.4.1 Maximum flow
        18.4.2 Protein folding
        18.4.3 Beyond genetic algorithms
        18.4.4 Algorithms, beyond this book
    Summary
appendix A. A quick guide to pseudo-code
    A.1 Variables and basics
    A.2 Arrays
    A.3 Conditional instructions
        A.3.1 Else-if
        A.3.2 Switch
    A.4 Loops
        A.4.1 For loop
        A.4.2 While loop
        A.4.3 Break and continue
    A.5 Blocks and indent
    A.6 Functions
        A.6.1 Overloading and default arguments
        A.6.2 Tuples
        A.6.3 Tuples and destructuring objects
appendix B. Big-O notation
    B.1 Algorithms and performance
    B.2 The RAM model
    B.3 Order of magnitude
    B.4 Notation
    B.5 Examples
appendix C. Core data structures
    C.1 Core data structures
    C.2 Array
    C.3 Linked List
    C.4 Tree
        C.4.1 Binary search trees
    C.5 Hash table
        C.5.1 Storing key-value pairs
        C.5.2 Hashing
        C.5.3 Conflicts resolution in hashing
        C.5.4 Performance
    C.6 Comparative analysis of core data structures
appendix D. Containers as priority queues
    D.1 Bag
    D.2 Stack
    D.3 Queue
    D.4 A comparative analysis of containers
appendix E. Recursion
    E.1 Simple recursion
        E.1.1 Pitfalls
        E.1.2 Good recursion
    E.2 Tail recursion
    E.3 Mutual recursion
appendix F. Classification problems and randomnized algorithm metrics
    F.1 Decision problems
    F.2 Las Vegas algorithms
    F.3 Monte Carlo algorithms
    F.4 Classification metrics
        F.4.1 Accuracy
        F.4.2 Precision and recall
        F.4.3 Other metrics and recap
index
inside back cover