Published on October 13, 2023 by abcxyz
Apache Kafka is a distributed streaming platform designed for building real-time data pipelines and streaming applications. It is horizontally scalable, fault-tolerant, and fast, which makes it a popular choice for use cases such as collecting user activity data, logs, application metrics, and stock ticker data. Kafka was originally developed at LinkedIn and later donated to the Apache Software Foundation.

Here are some key features and concepts associated with Kafka:

1. **Publish/Subscribe Model**: Kafka operates on a publish/subscribe model in which producers publish messages to topics and consumers subscribe to those topics to retrieve them.
2. **Topics**: Messages in Kafka are categorized into topics. Producers write data to topics, and consumers read data from topics.
3. **Partitions**: Each topic can be split into partitions to allow for parallelism and scalability. Each partition is an ordered, immutable sequence of messages that is continually appended to.
4. **Brokers**: Kafka runs as a cluster of one or more servers called brokers. Brokers are responsible for storing data and serving client requests.
5. **ZooKeeper**: In older versions, Kafka used Apache ZooKeeper for distributed coordination, managing broker metadata, and leader election, among other things. Starting with Kafka 2.8.0, this dependency is being removed through the introduction of KRaft mode (Kafka Raft metadata mode).
6. **Replicas**: To ensure data durability, each partition can be replicated across multiple brokers. One replica is the leader, while the others are followers. By default, all writes and reads for a partition are served by the leader replica.
7. **Producers**: Producers are applications that send (produce) records to Kafka topics.
8. **Consumers**: Consumers are applications that read (consume) records from Kafka topics.
9. **Consumer Groups**: Each consumer belongs to a consumer group, which allows for parallel processing of records. If multiple consumers belong to the same group, the partitions of a topic are balanced among them, so each record is processed by only one member of the group.
10. **Stream Processing**: Kafka also provides stream processing capabilities (via Kafka Streams), allowing for real-time data transformation and analytics.
11. **Log Compaction**: This feature lets Kafka retain the latest update for each record key, so consumers always have access to the latest state of the data without replaying the entire log of changes.

Kafka is used by numerous large-scale systems for use cases such as real-time analytics, monitoring, data lakes, and aggregations. With its distributed nature, it is designed to handle high throughput while providing durability and reliability.
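For the KRaft mode mentioned above, a single node can act as both broker and controller. A minimal `server.properties` fragment might look like the following; the property names come from Kafka's packaged KRaft configuration, and the values are illustrative only:

```properties
# This node serves as both a broker and a Raft controller
process.roles=broker,controller
node.id=1

# Quorum of controller nodes (id@host:port); a single voter here
controller.quorum.voters=1@localhost:9093

# Client traffic on 9092, controller traffic on 9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER

# Where partition data and metadata logs are stored
log.dirs=/tmp/kraft-combined-logs
```

In KRaft mode the cluster metadata itself is stored in an internal Kafka log replicated via Raft, which is what removes the need for a separate ZooKeeper ensemble.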
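The per-key ordering that partitions provide comes from the producer mapping each record key to a fixed partition. Here is a simplified sketch of that idea in Python; it is not Kafka's actual partitioner (the Java client hashes the serialized key with murmur2), but the principle is the same:

```python
import hashlib


def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition deterministically.

    Simplified sketch using md5; real Kafka clients use murmur2,
    but the point is identical: records with the same key always
    land in the same partition, preserving per-key ordering.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# The same key always maps to the same partition.
p1 = assign_partition(b"user-42", 6)
p2 = assign_partition(b"user-42", 6)
assert p1 == p2
```

Records without a key are instead spread across partitions (older clients round-robin, newer ones use sticky batching), trading ordering for balance.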
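The load balancing within a consumer group can be illustrated with a simplified round-robin partition assignment. Real Kafka clients support several assignor strategies (range, round-robin, sticky); this sketch only shows the core idea that each partition goes to exactly one group member:

```python
def assign_partitions(partitions: list[int],
                      consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin sketch of consumer-group partition assignment.

    Each partition is consumed by exactly one member of the group,
    so processing is parallelized without duplicate delivery
    within the group.
    """
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

This also shows why running more consumers than partitions leaves some members idle: with six partitions, a seventh consumer in the group would receive nothing.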
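Log compaction boils down to "keep the last value seen for each key." In Kafka this happens in the broker's background cleaner threads operating on log segments; the sketch below only simulates the retained result, including tombstones (records with a null value), which Kafka uses to delete a key from a compacted topic:

```python
def compact(log: list[tuple[str, str | None]]) -> dict[str, str]:
    """Sketch of log compaction: keep only the latest value per key.

    A record whose value is None acts as a tombstone and removes
    the key, mirroring how null-valued records delete keys from
    a compacted Kafka topic.
    """
    state: dict[str, str] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)  # tombstone: delete the key
        else:
            state[key] = value
    return state


log = [("user-1", "alice"), ("user-2", "bob"),
       ("user-1", "alicia"), ("user-2", None)]
print(compact(log))
# {'user-1': 'alicia'}
```

A new consumer reading this compacted topic from the beginning rebuilds the latest state directly, which is why compacted topics are a common backing store for changelogs and key-value snapshots.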