Apache Kafka 101 With Java (Spring Boot) For Beginners

Messaging System, RabbitMQ vs Apache Kafka, Topics, Producer-Broker-Consumer, Partitions, Transactions, Error Handler

Rating: 4.17 / 5.00

What You Will Learn!

  • Consumers vs Producers
  • Apache Kafka Messaging System with Spring Boot
  • Error Handler in Apache Kafka
  • Transactions in Apache Kafka

Description

What is Kafka?

Kafka is a publish/subscribe messaging system. It allows producers to write records into Kafka that can be read by one or more consumers. Once records are sent to Kafka, they cannot be deleted or modified.

Producers

The first component of Kafka is the producer. Producers create messages and publish them to topics. When publishing, a producer can also choose the partition a record goes to: it can spread records across partitions in a round-robin fashion, or implement a priority scheme by sending certain records to specific partitions based on their priority.
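As a minimal Spring Boot sketch of both styles, the service below sends records through a KafkaTemplate; the topic name "orders-topic" and the idea of routing by partition number are illustrative assumptions, not code from the course:

    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class OrderProducer {

        private final KafkaTemplate<String, String> kafkaTemplate;

        public OrderProducer(KafkaTemplate<String, String> kafkaTemplate) {
            this.kafkaTemplate = kafkaTemplate;
        }

        // Let Kafka choose the partition (by key hash, or round robin when the key is null).
        public void send(String key, String payload) {
            kafkaTemplate.send("orders-topic", key, payload);
        }

        // Target a specific partition explicitly, e.g. to give certain records priority handling.
        public void sendToPartition(int partition, String key, String payload) {
            kafkaTemplate.send("orders-topic", partition, key, payload);
        }
    }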

Kafka Cluster

Kafka uses a distributed clustering system. If you have any background in distributed systems, you already know that a cluster is a group of computers acting together for a common purpose. Since Kafka is a distributed system, the cluster has the same meaning for Kafka: it is simply a group of machines, each running one instance of the Kafka broker.


Kafka Broker

Kafka clusters are made up of brokers. A broker is simply a Kafka server, and the name fits because all Kafka does is act as a message broker between producers and consumers. The producer and consumer do not interact directly: they use the Kafka server as an agent, or broker, to exchange messages.
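In Spring Boot this usually means the client only needs the broker addresses (often set via spring.kafka.bootstrap-servers). A minimal Java configuration sketch, where the broker address localhost:9092 is an assumption:

    import java.util.Map;

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.DefaultKafkaProducerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.core.ProducerFactory;

    @Configuration
    public class KafkaProducerConfig {

        @Bean
        public ProducerFactory<String, String> producerFactory() {
            // The client only knows the broker addresses; the brokers relay records to consumers.
            Map<String, Object> props = Map.of(
                    ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092",
                    ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class,
                    ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            return new DefaultKafkaProducerFactory<>(props);
        }

        @Bean
        public KafkaTemplate<String, String> kafkaTemplate() {
            return new KafkaTemplate<>(producerFactory());
        }
    }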

Topics

Additionally, each broker hosts one or more topics, and records are published into topics. Kafka runs as a cluster on one or more servers, and the cluster stores and retrieves records in feeds called topics. Each record in a topic is stored with a key, a value, and a timestamp.

A topic can have one or multiple consumers (subscribers).
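To illustrate the record structure, here is a minimal listener sketch that prints each record's key, value, and timestamp; the topic and group names are assumptions, not code from the course:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class OrderRecordLogger {

        // Every record carries a key, a value, and a timestamp (plus its partition and offset).
        @KafkaListener(topics = "orders-topic", groupId = "record-logger")
        public void listen(ConsumerRecord<String, String> record) {
            System.out.printf("key=%s value=%s timestamp=%d partition=%d offset=%d%n",
                    record.key(), record.value(), record.timestamp(),
                    record.partition(), record.offset());
        }
    }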


Partitions

The Kafka cluster uses a partitioned log for each topic. A record is published to a single partition of a topic. Producers can choose the partition a record is sent to; otherwise, the partition is chosen by Kafka.

Each partition maintains the order in which data was inserted, and once a record is published to the topic, it stays there for the (configurable) retention period. Records are always appended at the end of a partition, and each record is assigned an offset that uniquely identifies it within the partition.

The offset is controlled by the consuming application. Using offsets, consumers can rewind to older positions and reprocess records if needed.
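Because the consumer controls the offset, a Spring Kafka listener can rewind and replay records. A minimal sketch using ConsumerSeekAware, with assumed topic and group names, that rewinds every assigned partition to the beginning:

    import java.util.Map;

    import org.apache.kafka.common.TopicPartition;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.listener.ConsumerSeekAware;
    import org.springframework.stereotype.Component;

    @Component
    public class ReplayingListener implements ConsumerSeekAware {

        @KafkaListener(topics = "orders-topic", groupId = "replay-group")
        public void listen(String message) {
            System.out.println("Received: " + message);
        }

        // When partitions are assigned, seek each one back to the beginning so older
        // records are reprocessed; callback.seek(...) could target a specific offset instead.
        @Override
        public void onPartitionsAssigned(Map<TopicPartition, Long> assignments,
                                         ConsumerSeekCallback callback) {
            assignments.keySet().forEach(tp ->
                    callback.seekToBeginning(tp.topic(), tp.partition()));
        }
    }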

Consumers

Just as we can produce messages in Kafka, we can also consume them; this is the job of consumers. Consumers read records from topics. They are organized around the concept of a consumer group, in which a number of consumers are assigned to the same group. A record published to a topic is delivered to only one consumer instance within a given consumer group. Internally, Kafka assigns each consumer instance its own partition(s) of the log, so that within a consumer group the records can be processed in parallel by the different consumers.

Consumer-Group?

Consumers can be organized into logical consumer groups. A consumer group is simply a group of consumers that share the work: within the same group, consumers split the partitions between them, while different consumer groups each receive their own copy of the records.
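A sketch of how this looks with Spring Kafka's @KafkaListener; the topic and group names are assumptions. Each group below receives every record on the topic, while consumers inside one group split the work:

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class OrderListeners {

        // Every record on the topic is delivered to exactly one consumer in this group.
        @KafkaListener(topics = "orders-topic", groupId = "billing-group")
        public void billing(String message) {
            System.out.println("billing-group got: " + message);
        }

        // A different consumer group gets its own copy of every record.
        @KafkaListener(topics = "orders-topic", groupId = "shipping-group")
        public void shipping(String message) {
            System.out.println("shipping-group got: " + message);
        }
    }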

ZooKeeper

And finally, let’s talk about ZooKeeper. ZooKeeper is used by Kafka to store metadata; Kafka is a distributed system and is built to use ZooKeeper.

It is basically used to maintain coordination between the different nodes in a cluster. One of the most important things for Kafka is that it uses ZooKeeper to periodically commit offsets, so that in case of a node failure it can resume from the previously committed offset (newer Kafka versions store consumer offsets in an internal Kafka topic instead).

ZooKeeper also plays a vital role for many other purposes, such as leader election, configuration management, synchronization, and detecting when a node joins or leaves the cluster.

Future Kafka releases plan to remove the ZooKeeper dependency, but as of now it is an integral part of Kafka.


Topic partitions are assigned so that the load is balanced among all consumers in the group. Within a consumer group, all consumers work in a load-balanced mode; in other words, each message will be seen by one consumer in the group. If a consumer goes away, its partition is assigned to another consumer in the group. This is referred to as a rebalance. If there are more consumers in a group than partitions, some consumers will be idle. If there are fewer consumers in a group than partitions, some consumers will consume messages from more than one partition.

Regarding consumer groups: to avoid two consumer instances within the same consumer group reading the same record twice, each partition is tied to only one consumer process per consumer group.
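For instance, assuming "orders-topic" has three partitions, the sketch below starts four consumer threads in one group: three of them each own one partition and the fourth stays idle until a rebalance reassigns partitions. The topic, group name, and partition count are assumptions:

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class BalancedListener {

        // With 3 partitions and concurrency = 4, three threads each own one partition and
        // the fourth is idle; if a thread dies, a rebalance moves its partition to another member.
        @KafkaListener(topics = "orders-topic", groupId = "reporting-group", concurrency = "4")
        public void listen(String message) {
            System.out.println("Received: " + message);
        }
    }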


Now, let’s talk about the advantages and disadvantages of Kafka, and why and when we should use it.

Kafka is designed for holding and distributing large volumes of messages.

  • The first and most important feature of Kafka is scalability. Kafka scales nicely, up to a hundred thousand (100,000) messages per second even on a single server; competing messaging systems generally lag far behind in this regard.

  • Kafka is resilient to node/machine failure within the cluster.

  • Kafka is a highly durable messaging system. It offers replication, which persists messages to disk and copies them across the cluster, making the data durable.

  • Kafka is a durable message store and clients can get a “replay” of the event stream on demand, as opposed to more traditional message brokers where once a message has been delivered, it is removed from the queue.

Kafka also has a retention period: it stores your records for the time, or up to the size, you specify, and retention can be configured per topic.
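A sketch of per-topic retention with Spring Kafka's TopicBuilder (Spring Boot's auto-configured KafkaAdmin creates the topic at startup); the topic name, partition count, and the 7-day / 1 GiB limits are assumptions:

    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.config.TopicBuilder;

    @Configuration
    public class TopicConfiguration {

        // Keep records on "orders-topic" for 7 days or up to ~1 GiB per partition,
        // whichever limit is reached first.
        @Bean
        public NewTopic ordersTopic() {
            return TopicBuilder.name("orders-topic")
                    .partitions(3)
                    .replicas(1)
                    .config(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000))
                    .config(TopicConfig.RETENTION_BYTES_CONFIG, String.valueOf(1_073_741_824L))
                    .build();
        }
    }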


Cons:

Now, let’s talk about the drawbacks of Kafka.

  • First of all, critics point out that Kafka is not easy to use, because its configuration can be complex.

  • Secondly, Apache Kafka does not ship with a complete set of monitoring and management tools, so new startups or enterprises may hesitate to adopt it.

  • Thirdly, it depends on Apache ZooKeeper, although the Kafka team's roadmap aims to remove this dependency.

  • Kafka can work like a queue, but it does not support priority queues the way RabbitMQ does.

  • In general, there are no issues with individual message size. However, the brokers and consumers start compressing messages as their size increases, and when those messages are decompressed, node memory is gradually consumed. Compression also happens as data flows through the pipeline, which affects throughput and performance.

  • Kafka has a very simple routing approach: publish messages to a topic, then subscribe to the topic and receive the messages. Other messaging systems, such as RabbitMQ, have better options if you need to route messages to your consumers in complex ways, for example fanout routing or conditional routing.

Who Should Attend!

  • Beginner Java developers curious about Apache Kafka


Tags

  • Apache Kafka
  • Spring Boot

Subscribers

116

Lectures

37

