Apache Kafka was developed at LinkedIn and donated to the Apache Software Foundation. It is a distributed streaming platform that can handle high volumes of data.
Pull or Push?
I initially misunderstood Kafka as a push-based messaging system. However, Kafka chose the traditional pull approach.
In Kafka, data is pushed to the broker by producers and pulled from the broker by consumers.
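The pull model can be sketched with a toy example (not the real Kafka protocol, just the idea): the broker only stores an append-only log, and each consumer fetches from an offset it tracks itself, so it controls its own pace.

```python
# Toy sketch of the pull model: producers push to the broker's log,
# consumers pull from an offset they track themselves.

class ToyBroker:
    def __init__(self):
        self.log = []                         # append-only message log

    def append(self, message):                # producer side: push
        self.log.append(message)

    def fetch(self, offset, max_records=10):  # consumer side: pull
        return self.log[offset:offset + max_records]

broker = ToyBroker()
for i in range(5):
    broker.append(f"event-{i}")               # producers push 5 messages

offset = 0                                    # consumer tracks its own position
batch = broker.fetch(offset, max_records=3)   # consumer pulls at its own pace
offset += len(batch)
```

Because the consumer asks for data rather than having it pushed, it never gets overwhelmed: it simply fetches the next batch when it is ready.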
Kafka is a fast and durable messaging system. Its benefits include:
- Scalable - Kafka's partition model allows data to be distributed across multiple servers, making it highly scalable.
- Durable - Kafka's data is written to disk, making it highly durable against server failures.
- Multiple producers - Kafka can handle multiple producers publishing to the same topic.
- Multiple consumers - Kafka is designed so that multiple consumers can read messages without interfering with each other.
- High performance - together, these features make Kafka a high-performance distributed messaging system.
What are the use cases?
- Website activity tracking - this is the original use case for Kafka. It helps to track user activities in real time.
- Messaging - Kafka can replace traditional messaging systems.
- Metrics - Kafka is used for centralized data monitoring.
- Log aggregation - Ideal for collecting application logs for later analysis.
- Stream processing - Operates on data in real time, as quickly as messages are produced.
- Kafka runs as a cluster.
- Can span multiple data centers.
- Store data in topics.
- Topics have partitions.
- Each record consists of a key, a value and a timestamp.
- The unit of data within Kafka is called a message. (similar to a database row)
- A message is simply an array of bytes.
- A message can have an optional key.
- Messages are written into Kafka in batches (collection of messages) for efficiency.
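Since Kafka treats a message as an opaque byte array, serialization is the client's job. A minimal sketch (the helper name and JSON encoding are illustrative choices, not part of Kafka):

```python
# A message is just bytes to Kafka: the client serializes the
# optional key and the value; Kafka does not interpret either.
import json

def to_message(key, value):
    # key is optional (None is allowed); value is serialized to bytes here as JSON
    key_bytes = key.encode("utf-8") if key is not None else None
    value_bytes = json.dumps(value).encode("utf-8")
    return (key_bytes, value_bytes)

# Messages are sent in batches for efficiency
batch = [to_message("user-1", {"action": "click"}),
         to_message(None, {"action": "ping"})]
```

Batching amortizes the network round trip: one request can carry many messages, at the cost of slightly higher latency for individual messages.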
Brokers and clusters
- A single Kafka server is called a broker.
- The broker receives messages from producers and also serves consumers.
- Each broker is identified with its ID (integer).
- After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster.
- A good number to get started is 3 brokers.
- A Kafka cluster is composed of multiple brokers.
- Within a cluster of brokers, one broker will also function as the cluster controller.
- A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition.
- Messages are categorized into topics. (similar to database table)
- Topic is identified by its name.
- Topics are additionally broken down into a number of partitions.
- Most often, a stream is considered to be a single topic of data.
- Data is kept only for a limited time (default is one week).
- Data is immutable, once written it can't be changed.
- Topics should have a replication factor (usually 2 or 3). This way, if a broker goes down, another broker can serve the data.
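Partitions and replication factor are set when creating a topic. A sketch with the standard CLI tool, assuming a broker is reachable at localhost:9092 (the topic name is just an example):

```shell
# Create a topic with 3 partitions, each replicated to 3 brokers
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic page-views \
  --partitions 3 \
  --replication-factor 3
```

Note that the replication factor cannot exceed the number of brokers in the cluster.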
Partition and offset
- Topics are split in partitions.
- Each partition is ordered. However, ordering is not guaranteed across partitions.
- Data is assigned randomly to a partition unless a key is provided.
- Each message within a partition gets an incremental ID, called the offset.
- Offsets only have meaning within a specific partition.
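These two rules can be shown with a toy sketch. Real Kafka hashes keys with murmur2; crc32 stands in here just to show the behavior: the same key always lands on the same partition, and offsets grow independently per partition.

```python
# Toy partitioner: keyed messages map deterministically to a partition,
# and each partition assigns its own sequential offsets.
import zlib

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}   # partition id -> log

def send(key, value):
    # crc32 stands in for Kafka's murmur2 hash (illustrative only)
    p = zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS
    partitions[p].append(value)
    return p, len(partitions[p]) - 1      # (partition, per-partition offset)

p1, o1 = send("user-1", "login")
p2, o2 = send("user-1", "click")          # same key -> same partition
```

This is why choosing a key matters: all messages for "user-1" stay in one partition, so they are read back in order.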
Leader for a partition
- At any time only one broker can be a leader for a given partition.
- Only that leader can receive and serve data for a partition.
- The other brokers will synchronize the data.
- Producers create new messages.
- Producers automatically know which broker and partition to write to.
- In case of broker failures, producers will automatically recover.
- Producers can choose to receive acknowledgments (acks) of data writes.
- Consumers read messages.
- Consumer subscribes to one or more topics and reads the messages.
- Consumers know which broker to read from.
- In case of broker failures, consumers know how to recover.
- Data is read in order within each partition.
- Consumers read data in consumer groups.
- Each consumer within a group reads from exclusive partitions.
- If you have more consumers than partitions, some consumers will be inactive.
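A toy round-robin assignment illustrates both rules (real Kafka has pluggable assignment strategies; this just shows the outcome): each partition goes to exactly one consumer in the group, and extra consumers stay idle.

```python
# Toy sketch: assign partitions to consumers in a group, round-robin.
# With more consumers than partitions, some consumers get nothing.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers -> one consumer ends up idle
result = assign(partitions=[0, 1, 2], consumers=["c1", "c2", "c3", "c4"])
```

So the partition count of a topic is an upper bound on useful parallelism within one consumer group.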
- Kafka stores the offsets at which a consumer group has been reading. (similar to bookmarking)
- When a consumer in a group has processed data received from Kafka, it should commit the offsets.
- If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
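The bookmarking behavior can be sketched as follows (a toy model, not the real consumer API): the group commits how far it has read, so a replacement consumer resumes from the committed offset rather than from the beginning.

```python
# Toy sketch of committed consumer offsets: the group "bookmarks"
# its position in the log, so a restarted consumer picks up where
# the previous one stopped.

log = ["m0", "m1", "m2", "m3", "m4"]
committed = {"group-A": 0}                 # group -> next offset to read

def poll_and_commit(group, max_records):
    start = committed[group]
    records = log[start:start + max_records]
    committed[group] = start + len(records)   # commit after processing
    return records

first = poll_and_commit("group-A", 3)      # reads m0..m2, commits offset 3
# the consumer "dies" here; a replacement resumes from the committed offset
second = poll_and_commit("group-A", 3)     # resumes at m3
```

Committing only after processing gives at-least-once delivery: if the consumer dies mid-batch, the uncommitted messages are re-read by its replacement.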
- Apache Kafka uses Zookeeper to store metadata about the Kafka cluster, as well as consumer client details.
- A Zookeeper cluster is called an ensemble.