
Kafka: Introduction to core concepts


Apache Kafka was developed at LinkedIn and later donated to the Apache Software Foundation.
It is a distributed streaming platform that can handle high volumes of data.

Pull or Push?

I initially misunderstood Kafka as a push-based messaging system. However, Kafka follows the traditional pull approach:
data is pushed to the broker by producers and pulled from the broker by consumers.


Why Kafka?

Kafka is a reliable messaging system that is fast and durable. Its benefits:

  • Scalable - Kafka's partition model allows data to be distributed across multiple servers, making it highly scalable.
  • Durable - Kafka writes its data to disk, making it durable against server failures.
  • Multiple producers - Kafka can handle multiple producers publishing to the same topic.
  • Multiple consumers - Kafka is designed so that multiple consumers can read messages without interfering with each other.
  • High performance - together, these features make Kafka a high-performance distributed messaging system.

What are use cases?

  • Website activity tracking - this is the original use case for Kafka: tracking user activity in real time.
  • Messaging - Kafka can replace traditional messaging systems.
  • Metrics - Kafka is used for centralized data monitoring.
  • Log aggregation - ideal for collecting application logs for analysis.
  • Stream processing - operates on data in real time, as quickly as messages are produced.

Core concepts

  • Kafka runs as a cluster.
  • Can span multiple data centers.
  • Stores data in topics.
  • Topics are divided into partitions.
  • Each record consists of a key, a value and a timestamp.


Messages

  • The unit of data within Kafka is called a message. (similar to a database row)
  • A message is simply an array of bytes.
  • A message can have an optional key.
  • Messages are written into Kafka in batches (a collection of messages) for efficiency.
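
To make this concrete, here is a minimal sketch using Kafka's Java producer client. The broker address and the page-views topic are assumptions for illustration; the serializers are what turn the key and value into the byte arrays Kafka actually stores, and linger.ms / batch.size are the main batching knobs.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class MessageExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        // Serializers turn the key and value into the byte arrays Kafka stores.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Batching knobs (illustrative values): wait up to 10 ms to fill a batch of up to 16 KB.
        props.put("linger.ms", "10");
        props.put("batch.size", "16384");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A message with an optional key ("user-42") and a value; close() flushes pending batches.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
        }
    }
}
```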

Brokers and clusters

  • A single Kafka server is called a broker.
  • The broker receives messages from producers and also serves consumers.
  • Each broker is identified with its ID (integer).
  • After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster. (see the sketch below)
  • A good number to get started is 3 brokers.
  • Multiple brokers working together form a Kafka cluster.
  • Within a cluster of brokers, one broker will also function as the cluster controller.
  • A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition.
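
A short sketch of the bootstrap mechanism, using Kafka's Java AdminClient; the broker hostnames are placeholders. One reachable broker is enough for the client to discover the whole cluster, including every broker's integer ID.

```java
import org.apache.kafka.clients.admin.AdminClient;
import java.util.Properties;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any one reachable broker is enough: the client bootstraps from it,
        // fetches cluster metadata, and then talks to all brokers directly.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder hosts
        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get().forEach(node ->
                System.out.println("broker id=" + node.id() + " host=" + node.host()));
        }
    }
}
```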

Topics

  • Messages are categorized into topics. (similar to a database table)
  • A topic is identified by its name.
  • Topics are additionally broken down into a number of partitions.
  • Most often, a stream is considered to be a single topic of data.
  • Data is kept only for a limited time (default is one week).
  • Data is immutable, once written it can't be changed.
  • Topics should have a replication factor (usually 2 or 3). That way, if a broker goes down, another broker can still serve the data.
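
As a sketch, a topic with a chosen partition count, replication factor, and retention can be created with the Java AdminClient. The topic name and broker address are assumptions; 604800000 ms is the one-week default retention mentioned above.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition lives on two brokers.
            NewTopic topic = new NewTopic("page-views", 3, (short) 2)
                    // Retention is a per-topic setting; this is the one-week default.
                    .configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```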

Partition and offset

  • Topics are split in partitions.
  • Each partition is ordered; however, no ordering is guaranteed across partitions.
  • Data is assigned randomly to a partition unless a key is provided.
  • Each message within a partition gets an incremental ID called an offset.
  • Offsets only have meaning within a specific partition.
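
The sketch below sends a keyed message and prints where it landed: the broker reports the partition and offset back through the send callback. The topic and broker address are again illustrative.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class OffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition; records without a key are spread across partitions.
            ProducerRecord<String, String> record = new ProducerRecord<>("page-views", "user-42", "/cart");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    // The broker reports where the record landed: a partition and the offset within it.
                    System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                }
            });
            producer.flush(); // wait for the callback before the program exits
        }
    }
}
```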


Leader for a partition

  • At any time only one broker can be a leader for a given partition.
  • Only that leader can receive and serve data for the partition.
  • The other brokers will synchronize the data.
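
To see partition leadership for yourself, the Java AdminClient can describe a topic and print each partition's leader and replicas. This sketch assumes the hypothetical page-views topic from earlier exists.

```java
import org.apache.kafka.clients.admin.AdminClient;
import java.util.List;
import java.util.Properties;

public class PartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // allTopicNames() is the modern accessor; older clients expose all() instead.
            admin.describeTopics(List.of("page-views")).allTopicNames().get()
                 .forEach((name, description) -> description.partitions().forEach(p ->
                     // Exactly one leader per partition; the other replicas sync from it.
                     System.out.printf("%s-%d leader=%s replicas=%s%n",
                         name, p.partition(), p.leader(), p.replicas())));
        }
    }
}
```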

Producers

  • Producers create new messages.
  • Producers automatically know which broker and partition to write to.
  • In case of broker failures, producers will automatically recover.
  • Producers can choose to receive an acknowledgment (ack) of data writes.
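
A minimal sketch of the acknowledgment choice: the acks producer setting controls how many brokers must confirm a write, and retries lets the client ride out transient broker failures. The values shown are illustrative, not recommendations.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class AckDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks levels: "0" = fire and forget, "1" = leader only,
        // "all" = leader plus all in-sync replicas must confirm.
        props.put("acks", "all");
        // The client retries transient broker failures on its own.
        props.put("retries", "3");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "acknowledged write"));
        }
    }
}
```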

Consumers

  • Consumers read messages.
  • A consumer subscribes to one or more topics and reads the messages.
  • Consumers know which broker to read from.
  • In case of broker failures, consumers know how to recover.
  • Data is read in order within each partition.
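
Here is a minimal consumer sketch with the Java client; the group ID and topic name are assumptions. Note that poll() is the "pull" in Kafka's pull model, and each record carries the partition and offset it was read from.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "page-view-readers");       // consumer group (see next section)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                // poll() is the pull: the consumer asks the broker for data.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```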

Consumer groups

  • Consumers read data in consumer groups.
  • Each consumer within a group reads from exclusive partitions.
  • If you have more consumers than partitions, some consumers will be inactive.
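
To watch partition assignment happen, this sketch attaches a ConsumerRebalanceListener when subscribing. Run several copies with the same group.id (all names here are placeholders) and each instance prints a disjoint set of partitions.

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "page-view-readers");       // same group.id in every instance
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Each instance owns a disjoint set; extra instances get nothing.
                    System.out.println("Assigned: " + partitions);
                }
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("Revoked: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500)); // polling keeps group membership alive
            }
        }
    }
}
```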

Consumer offsets

  • Kafka stores the offsets at which a consumer group has been reading. (similar to bookmarking)
  • When a consumer in a group has processed data received from Kafka, it should commit the offsets.
  • If a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
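
A hedged sketch of manual offset commits: auto-commit is disabled so the offset is committed only after the records have been processed, which is what lets a restarted consumer resume where it left off.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "page-view-readers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Commit offsets only after processing succeeds, not on a timer.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println("processing " + r.value()));
                // The "bookmark": if this consumer dies, the group resumes from here.
                consumer.commitSync();
            }
        }
    }
}
```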

ZooKeeper

  • Apache Kafka uses ZooKeeper to store metadata about the Kafka cluster, as well as consumer client details.
  • A Zookeeper cluster is called an ensemble.
