Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale, 2nd Edition
Every enterprise application creates data, whether it consists of log messages, metrics, user activity, or outgoing messages. Moving all this data is just as important as the data itself. With this updated edition, application architects, developers, and production engineers new to the Kafka streaming platform will learn how to handle data in motion. Additional chapters cover Kafka’s AdminClient API, transactions, new security features, and tooling changes.
Engineers from Confluent and LinkedIn responsible for developing Kafka explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream processing applications with this platform. Through detailed examples, you’ll learn Kafka’s design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the controller, and the storage layer.
You’ll examine:
- Best practices for deploying and configuring Kafka
- Kafka producers and consumers for writing and reading messages
- Patterns and use-case requirements to ensure reliable data delivery
- Best practices for building data pipelines and applications with Kafka
- How to perform monitoring, tuning, and maintenance tasks with Kafka in production
- The most critical metrics among Kafka’s operational measurements
- Kafka’s delivery capabilities for stream processing systems
Foreword to the Second Edition Foreword to the First Edition Preface Who Should Read This Book Conventions Used in This Book Using Code Examples O’Reilly Online Learning How to Contact Us Acknowledgments
1. Meet Kafka Publish/Subscribe Messaging How It Starts Individual Queue Systems Enter Kafka Messages and Batches Schemas Topics and Partitions Producers and Consumers Brokers and Clusters Multiple Clusters Why Kafka? Multiple Producers Multiple Consumers Disk-Based Retention Scalable High Performance Platform Features The Data Ecosystem Use Cases Activity tracking Messaging Metrics and logging Commit log Stream processing Kafka’s Origin LinkedIn’s Problem The Birth of Kafka Open Source Commercial Engagement The Name Getting Started with Kafka
2. Installing Kafka Environment Setup Choosing an Operating System Installing Java Installing ZooKeeper Standalone server ZooKeeper ensemble Installing a Kafka Broker Configuring the Broker General Broker Parameters broker.id listeners zookeeper.connect log.dirs num.recovery.threads.per.data.dir auto.create.topics.enable auto.leader.rebalance.enable delete.topic.enable Topic Defaults num.partitions default.replication.factor log.retention.ms log.retention.bytes log.segment.bytes log.roll.ms min.insync.replicas message.max.bytes Selecting Hardware Disk Throughput Disk Capacity Memory Networking CPU Kafka in the Cloud Microsoft Azure Amazon Web Services Configuring Kafka Clusters How Many Brokers? Broker Configuration OS Tuning Virtual memory Disk Networking Production Concerns Garbage Collector Options Datacenter Layout Colocating Applications on ZooKeeper Summary
3. Kafka Producers: Writing Messages to Kafka Producer Overview Constructing a Kafka Producer Sending a Message to Kafka Sending a Message Synchronously Sending a Message Asynchronously Configuring Producers client.id acks Message Delivery Time max.block.ms delivery.timeout.ms request.timeout.ms retries and retry.backoff.ms linger.ms buffer.memory compression.type batch.size max.in.flight.requests.per.connection max.request.size receive.buffer.bytes and send.buffer.bytes enable.idempotence Serializers Custom Serializers Serializing Using Apache Avro Using Avro Records with Kafka Partitions Implementing a custom partitioning strategy Headers Interceptors Quotas and Throttling Summary
4. Kafka Consumers: Reading Data from Kafka Kafka Consumer Concepts Consumers and Consumer Groups Consumer Groups and Partition Rebalance Static Group Membership Creating a Kafka Consumer Subscribing to Topics The Poll Loop Thread Safety Configuring Consumers fetch.min.bytes fetch.max.wait.ms fetch.max.bytes max.poll.records max.partition.fetch.bytes session.timeout.ms and heartbeat.interval.ms max.poll.interval.ms default.api.timeout.ms request.timeout.ms auto.offset.reset enable.auto.commit partition.assignment.strategy client.id client.rack group.instance.id receive.buffer.bytes and send.buffer.bytes offsets.retention.minutes Commits and Offsets Automatic Commit Commit Current Offset Asynchronous Commit Combining Synchronous and Asynchronous Commits Committing a Specified Offset Rebalance Listeners Consuming Records with Specific Offsets But How Do We Exit? Deserializers Custom Deserializers Using Avro Deserialization with Kafka Consumer Standalone Consumer: Why and How to Use a Consumer Without a Group Summary
5. Managing Apache Kafka Programmatically AdminClient Overview Asynchronous and Eventually Consistent API Options Flat Hierarchy Additional Notes AdminClient Lifecycle: Creating, Configuring, and Closing client.dns.lookup Use of a DNS alias DNS name with multiple IP addresses request.timeout.ms Essential Topic Management Configuration Management Consumer Group Management Exploring Consumer Groups Modifying Consumer Groups Cluster Metadata Advanced Admin Operations Adding Partitions to a Topic Deleting Records from a Topic Leader Election Reassigning Replicas Testing Summary
6. Kafka Internals Cluster Membership The Controller KRaft: Kafka’s New Raft-Based Controller Replication Request Processing Produce Requests Fetch Requests Other Requests Physical Storage Tiered Storage Partition Allocation File Management File Format Indexes Compaction How Compaction Works Deleted Events When Are Topics Compacted? Summary
7. Reliable Data Delivery Reliability Guarantees Replication Broker Configuration Replication Factor Unclean Leader Election Minimum In-Sync Replicas Keeping Replicas In Sync Persisting to Disk Using Producers in a Reliable System Send Acknowledgments Configuring Producer Retries Additional Error Handling Using Consumers in a Reliable System Important Consumer Configuration Properties for Reliable Processing Explicitly Committing Offsets in Consumers Always commit offsets after messages were processed Commit frequency is a trade-off between performance and number of duplicates in the event of a crash Commit the right offsets at the right time Rebalances Consumers may need to retry Consumers may need to maintain state Validating System Reliability Validating Configuration Validating Applications Monitoring Reliability in Production Summary
8. Exactly-Once Semantics Idempotent Producer How Does the Idempotent Producer Work? Producer restart Broker failure Limitations of the Idempotent Producer How Do I Use the Kafka Idempotent Producer? Transactions Transactions Use Cases What Problems Do Transactions Solve? Reprocessing caused by application crashes Reprocessing caused by zombie applications How Do Transactions Guarantee Exactly-Once? What Problems Aren’t Solved by Transactions? Side effects while stream processing Reading from a Kafka topic and writing to a database Reading data from a database, writing to Kafka, and from there writing to another database Copying data from one Kafka cluster to another Publish/subscribe pattern How Do I Use Transactions? Transactional IDs and Fencing How Transactions Work Performance of Transactions Summary
9. Building Data Pipelines Considerations When Building Data Pipelines Timeliness Reliability High and Varying Throughput Data Formats Transformations Security Failure Handling Coupling and Agility When to Use Kafka Connect Versus Producer and Consumer Kafka Connect Running Kafka Connect Connector Example: File Source and File Sink Connector Example: MySQL to Elasticsearch Single Message Transformations A Deeper Look at Kafka Connect Connectors and tasks Workers Converters and Connect’s data model Offset management Alternatives to Kafka Connect Ingest Frameworks for Other Datastores GUI-Based ETL Tools Stream Processing Frameworks Summary
10. Cross-Cluster Data Mirroring Use Cases of Cross-Cluster Mirroring Multicluster Architectures Some Realities of Cross-Datacenter Communication Hub-and-Spoke Architecture Active-Active Architecture Active-Standby Architecture Disaster recovery planning Data loss and inconsistencies in unplanned failover Start offset for applications after failover After the failover A few words on cluster discovery Stretch Clusters Apache Kafka’s MirrorMaker Configuring MirrorMaker Multicluster Replication Topology Securing MirrorMaker Deploying MirrorMaker in Production Tuning MirrorMaker Other Cross-Cluster Mirroring Solutions Uber uReplicator LinkedIn Brooklin Confluent Cross-Datacenter Mirroring Solutions Summary
11. Securing Kafka Locking Down Kafka Security Protocols Authentication SSL Configuring TLS Security considerations SASL SASL/GSSAPI Configuring SASL/GSSAPI Security considerations SASL/PLAIN Configuring SASL/PLAIN Security considerations SASL/SCRAM Configuring SASL/SCRAM Security considerations SASL/OAUTHBEARER Configuring SASL/OAUTHBEARER Security considerations Delegation tokens Configuring delegation tokens Security considerations Reauthentication Security Updates Without Downtime Encryption End-to-End Encryption Authorization AclAuthorizer Customizing Authorization Security Considerations Auditing Securing ZooKeeper SASL SSL Authorization Securing the Platform Password Protection Summary
12. Administering Kafka Topic Operations Creating a New Topic Listing All Topics in a Cluster Describing Topic Details Adding Partitions Reducing Partitions Deleting a Topic Consumer Groups List and Describe Groups Delete Group Offset Management Export offsets Import offsets Dynamic Configuration Changes Overriding Topic Configuration Defaults Overriding Client and User Configuration Defaults Overriding Broker Configuration Defaults Describing Configuration Overrides Removing Configuration Overrides Producing and Consuming Console Producer Using producer configuration options Line-reader options Console Consumer Using consumer configuration options Message formatter options Consuming the offsets topics Partition Management Preferred Replica Election Changing a Partition’s Replicas Changing the replication factor Canceling replica reassignments Dumping Log Segments Replica Verification Other Tools Unsafe Operations Moving the Cluster Controller Removing Topics to Be Deleted Deleting Topics Manually Summary
13. Monitoring Kafka Metric Basics Where Are the Metrics? Nonapplication metrics What Metrics Do I Need? Alerting or debugging? Automation or humans? Application Health Checks Service-Level Objectives Service-Level Definitions What Metrics Make Good SLIs? Using SLOs in Alerting Kafka Broker Metrics Diagnosing Cluster Problems The Art of Under-Replicated Partitions Cluster-level problems Host-level problems Broker Metrics Active controller count Controller queue size Request handler idle ratio All topics bytes in All topics bytes out All topics messages in Partition count Leader count Offline partitions Request metrics Topic and Partition Metrics Per-topic metrics Per-partition metrics JVM Monitoring Garbage collection Java OS monitoring OS Monitoring Logging Client Monitoring Producer Metrics Overall producer metrics Per-broker and per-topic metrics Consumer Metrics Fetch manager metrics Per-broker and per-topic metrics Consumer coordinator metrics Quotas Lag Monitoring End-to-End Monitoring Summary
14. Stream Processing What Is Stream Processing? Stream Processing Concepts Topology Time State Stream-Table Duality Time Windows Processing Guarantees Stream Processing Design Patterns Single-Event Processing Processing with Local State Multiphase Processing/Repartitioning Processing with External Lookup: Stream-Table Join Table-Table Join Streaming Join Out-of-Sequence Events Reprocessing Interactive Queries Kafka Streams by Example Word Count Stock Market Statistics ClickStream Enrichment Kafka Streams: Architecture Overview Building a Topology Optimizing a Topology Testing a Topology Scaling a Topology Surviving Failures Stream Processing Use Cases How to Choose a Stream Processing Framework Summary
A. Installing Kafka on Other Operating Systems Installing on Windows Using Windows Subsystem for Linux Using Native Java Installing on macOS Using Homebrew Installing Manually
B. Additional Kafka Tools Comprehensive Platforms Cluster Deployment and Management Monitoring and Data Exploration Client Libraries Stream Processing
Index
How to download the source code
1. Go to:
2. Search for the book title:
Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale, 2nd Edition. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.