Q1. What is Kafka?
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Q2. What are the salient features of Kafka?
High Throughput: Supports millions of messages per second on modest hardware.
Scalability: A highly scalable distributed system that can be expanded with no downtime.
Replication: Messages are replicated across the cluster, which supports multiple subscribers and keeps data available to consumers in case of failures.
Durability: Messages are persisted to disk, so data survives broker restarts.
Stream Processing: Integrates with real-time streaming frameworks such as Apache Spark and Storm.
Zero Data Loss: With proper configuration, Kafka can guarantee zero data loss.
Q3. List the various components in Kafka.
The four major components of Kafka are:
Topic – a stream of messages belonging to the same type
Producer – publishes messages to a topic
Brokers – a set of servers where the published messages are stored
Consumer – subscribes to various topics and pulls data from the brokers
Q4. Explain the role of the offset.
Messages contained in the partitions are assigned a unique ID number that is called the offset. The role of the offset is to uniquely identify every message within the partition.
Q5. What is a Consumer Group?
A Consumer Group is a concept native to Kafka. Every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics; each partition is consumed by exactly one member of the group.
Q6. What is the role of the ZooKeeper?
Kafka uses ZooKeeper to manage and coordinate the brokers in the cluster: it stores broker and topic metadata and is used for controller election. In older versions of Kafka, ZooKeeper also stored the offsets of messages consumed for a specific topic and partition by a specific Consumer Group; modern versions store these offsets in the internal __consumer_offsets topic instead.
Q7. Is it possible to use Kafka without ZooKeeper?
In versions of Kafka that rely on ZooKeeper, it is not possible to bypass ZooKeeper and connect directly to the Kafka server; if ZooKeeper is down for some reason, no client request can be serviced. Since Kafka 2.8, however, Kafka can also run in KRaft mode, which replaces ZooKeeper with a built-in Raft-based controller quorum, so newer clusters can operate without ZooKeeper entirely.
Q8. Explain the concept of Leader and Follower in Kafka.
In Kafka, each partition has one server that acts as a Leader and one or more servers that operate as Followers. The Leader is in charge of all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers assumes leadership. This also balances load across the servers.
Q9. Why is Topic Replication important in Kafka? What do you mean by ISR in Kafka?
Topic replication is critical for constructing Kafka deployments that are both durable and highly available. When one broker fails, topic replicas on other brokers remain available, ensuring that data is not lost and the Kafka deployment is not disrupted. The replication factor specifies the number of copies of a topic that are kept across the Kafka cluster; it is defined at the topic level and takes place at the partition level. A replication factor of two, for example, keeps two copies of every partition of the topic. ISR stands for In-Sync Replicas: the set of replicas (the leader plus the followers that are fully caught up with it). A message is considered committed only once all replicas in the ISR have received it.
Q10. Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can be consumed in the event of any machine error, program error, or frequent software upgrades.
Q11. If a Replica stays out of the ISR for a long time, what does it signify?
It means that the Follower is unable to fetch data as fast as data accumulated by the Leader.
Q12. What is the process for starting a Kafka server?
Since Kafka uses ZooKeeper, it is essential to initialize the ZooKeeper server, and then fire up the Kafka server.
To start the ZooKeeper server: > bin/zookeeper-server-start.sh config/zookeeper.properties
Next, to start the Kafka server: > bin/kafka-server-start.sh config/server.properties
Q13. How do you define a Partitioning Key?
Within the Producer, the role of a Partitioning Key is to indicate the destination partition of the message. By default, a hashing-based Partitioner determines the partition ID from the key. Alternatively, users can plug in custom Partitioners, as sketched below.
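For illustration, a custom Partitioner implements the org.apache.kafka.clients.producer.Partitioner interface and is registered through the partitioner.class producer property. The routing rule below (a reserved partition for "priority" keys) is a hypothetical example, not a Kafka default:

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

// Hypothetical example: route "priority" keys to partition 0, everything else by hash.
public class PriorityPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if ("priority".equals(key)) {
            return 0;  // reserved partition for priority traffic
        }
        // Fall back to a simple hash of the key bytes for all other messages
        return Math.floorMod(java.util.Arrays.hashCode(keyBytes), numPartitions);
    }

    @Override
    public void configure(Map<String, ?> configs) { }  // no custom configs needed

    @Override
    public void close() { }
}

The producer then registers it with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, PriorityPartitioner.class).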
Q14. In the Producer, when does QueueFullException occur?
QueueFullException typically occurs when the Producer attempts to send messages at a pace that the Broker cannot handle. Since the Producer doesn’t block, users will need to add enough brokers to collaboratively handle the increased load.
Q15. Explain the role of the Kafka Producer API.
The role of Kafka’s Producer API is to wrap the two producers – kafka.producer.SyncProducer and the kafka.producer.async.AsyncProducer. The goal is to expose all the producer functionality through a single API to the client.
Q16. Explain the four core API architectures that Kafka uses.
Following are the four core APIs that Kafka uses:
Producer API: The Producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.
Consumer API: An application can subscribe to one or more Kafka topics using the Kafka Consumer API. It also enables the application to process streams of records generated in relation to such topics.
Streams API: The Kafka Streams API allows an application to process data in Kafka using a stream processing architecture. An application can use this API to take input streams from one or more topics, transform them using stream operations, and produce output streams to one or more topics; in this way, the Streams API converts input streams into output streams (a minimal example follows this list).
Connect API: The Kafka Connector API connects Kafka topics to applications. This opens up possibilities for constructing and managing the operations of producers and consumers, as well as establishing reusable links between these solutions. A connector, for example, may capture all database updates and ensure that they are made available in a Kafka topic.
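For instance, a minimal Kafka Streams topology might look like the following sketch (topic names, application id, and the uppercase transform are illustrative assumptions):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");  // also used as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");  // input stream
        source.mapValues(v -> v.toUpperCase())                           // stream operation
              .to("output-topic");                                      // output stream

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close)); // clean shutdown
    }
}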
Q17. What is the maximum size of a message that Kafka can receive?
By default, the maximum size of a Kafka message is 1 MB (megabyte). This limit can be changed through broker settings (message.max.bytes) and per-topic settings (max.message.bytes). Kafka, however, is optimized for small messages of around 1 KB.
Q18. What are some of the disadvantages of Kafka?
Following are the disadvantages of Kafka :
Kafka's performance degrades if messages need to be tweaked (modified) after they are written; it works well when messages are immutable and do not need to be updated.
Wildcard topic selection is not supported by Kafka. It is necessary to match the exact topic name.
When dealing with huge messages, the work of compressing and decompressing them at the brokers and consumers reduces Kafka's performance. This has an impact on Kafka's throughput.
Certain message paradigms, including point-to-point queues and request/reply, are not supported by Kafka.
Kafka does not have a complete set of monitoring tools.
Q19. What is the purpose of partitions in Kafka?
Partitions allow a single topic to be partitioned across numerous servers from the perspective of the Kafka broker. This allows you to store more data in a single topic than a single server can. If you have three brokers and need to store 10TB of data in a topic, one option is to construct a topic with only one partition and store all 10TB on one broker. Another alternative is to build a three-partitioned topic and distribute 10 TB of data among all brokers. A partition is a unit of parallelism from the consumer's perspective.
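As a hedged sketch using the Java AdminClient (topic name, partition count, and replication factor are illustrative assumptions), this is how a multi-partition topic can be created so that its data is spread across brokers:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic across brokers; replication factor 2 for redundancy
            NewTopic topic = new NewTopic("big-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();  // block until done
        }
    }
}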
Q20. What do you mean by multi-tenancy in Kafka?
Multi-tenancy is a software operation mode in which several instances of one or more programs operate independently of one another in a shared environment. The instances are logically isolated but physically integrated: the level of logical isolation in a system that supports multi-tenancy must be comprehensive, while the level of physical integration can vary. Kafka is multi-tenant because it allows many topics for data consumption and production to be configured on the same cluster.
Q21. What are the ways to tune Kafka for optimal performance?
There are mainly three ways to tune Kafka for optimal performance:
Tuning Kafka Producers
Kafka Brokers Tuning
Tuning Kafka Consumers
Q22. Explain tuning of Kafka Producers.
Data that the producer has to send to brokers is stored in a batch; when the batch is ready, the producer sends it to the broker. To tune the producer for latency and throughput, two parameters must be taken care of: batch size and linger time.
The batch size has to be selected very carefully. If the producer is sending messages all the time, a larger batch size is preferable to maximize throughput. However, if the batch size is chosen to be very large, it may never get full, or take a long time to fill up, and in turn hurt latency. The batch size therefore has to be determined with the volume of messages sent from the producer in mind.
The linger time adds a delay while the producer waits for more records to fill up the batch, so that larger batches are sent. A longer linger time allows more messages to be sent in one batch, but this can compromise latency. A shorter linger time results in messages getting sent faster - reduced latency, but reduced throughput as well.
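As an illustrative sketch (the values are assumptions to be tuned per workload, not recommendations), batch size and linger time correspond to the batch.size and linger.ms producer settings:

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);  // 64 KB per batch (default is 16 KB)
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);          // wait up to 10 ms for the batch to fill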
Q23. Explain tuning of Kafka Brokers.
Each partition in a topic is associated with a leader, which will further have 0 or more followers. It is important that the leaders are balanced properly and ensure that some nodes are not overworked compared to others.
Q24. Explain tuning of Kafka Consumers.
It is recommended that the number of partitions for a topic is equal to the number of consumers so that the consumers can keep up with the producers. In the same consumer group, the partitions are split up among the consumers.
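A minimal consumer sketch follows (group and topic names are illustrative assumptions); running one copy of this per partition, all with the same group.id, splits the partitions evenly across the group:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");   // all members share this id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singleton("demo-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                              record.partition(), record.offset(), record.value());
        }
    }
}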
Q25. What are the use cases of Kafka monitoring?
Following are the use cases of Apache Kafka monitoring:
Apache Kafka monitoring can keep track of system resource consumption such as memory, CPU, and disk utilization over time.
Apache Kafka monitoring is used to monitor threads and JVM usage. Since Kafka relies on the Java garbage collector to free up memory, monitoring that the garbage collector runs frequently enough helps keep the Kafka cluster healthy.
It can be used to determine which applications are causing excessive demand; identifying performance bottlenecks in this way helps solve performance issues rapidly.
It keeps a constant check on broker, controller, and replication statistics so that partition and replica status can be adjusted when required.
Q26. When does the broker leave the ISR?
ISR is the set of message replicas that are completely synced up with the leader. This means the ISR contains all committed messages and includes all the replicas until a real failure occurs. A replica is dropped from the ISR if it deviates from (falls too far behind) the leader.
Q27. Define consumer lag in Apache Kafka.
Consumer lag refers to the lag between the Kafka producers and consumers. Consumer groups will have a lag if the data production rate far exceeds the rate at which the data is getting consumed. Consumer lag is the difference between the latest offset and the consumer offset.
Q28. What is meant by cluster id in Kafka?
Kafka clusters are assigned unique and immutable identifiers. The identifier for a particular cluster is known as the cluster id. A cluster id can have a maximum number of 22 characters and has to follow the regular expression [a-zA-Z0-9_\-]+. It is generated when a broker with version 0.10.1 or later is successfully started for the first time. The broker attempts to get the cluster id from the znode during startup. If the znode does not exist, the broker generates a new cluster id and creates a znode with this cluster id.
Q29. Mention some of the system tools available in Apache Kafka.
The three main system tools available in Apache Kafka are:
Kafka Migration Tool − This tool is used to migrate the data in a Kafka cluster from one version to another.
Kafka MirrorMaker − This tool copies data from one Kafka cluster to another.
Consumer Offset Checker − This tool displays the Consumer Group, Topic, Partitions, Off-set, logSize, and Owner for specific Topics and Consumer Group sets.
Q30. Mention some benefits of Apache Kafka.
High throughput: Kafka can handle thousands of messages per second. Due to low latency, Kafka supports incoming messages at a high volume and velocity.
Low latency: Apache Kafka offers latencies as low as ten milliseconds. This is because it decouples producers from consumers at the broker, allowing consumers to pull messages at any time.
Fault-tolerant: The use of replicas allows Apache Kafka to be fault-tolerant in cases of a failure within a cluster.
Durability: With the replication feature, messages are persisted to disk and copied across the cluster rather than living on a single machine, which makes the data durable.
Scalability: The ability of Kafka to handle messages of high volume and high velocity makes it very scalable.
Ability to handle real-time data: Kafka can handle real-time data pipelines.
Q31. What is the optimal number of partitions for a topic?
The optimal number of partitions a topic should be divided into must be equal to the number of consumers.
Q32. What is the Kafka MirrorMaker?
The Kafka MirrorMaker is a stand-alone tool that allows data to be copied from one Apache Kafka cluster to another. The Kafka MirrorMaker will read data from topics in the original cluster and write the topics to a destination cluster with the same topic name. The source and destination clusters are independent entities and can have different numbers of partitions and varying offset values.
Q33. What is the role of the Kafka Migration Tool?
The Kafka Migration tool is used to efficiently move from one environment to another. It can be used to move existing Kafka data from an older version of Kafka to a newer version.
Q34. What are the name restrictions for Kafka topics?
According to Apache Kafka, there are some legal rules to be followed to name topics, which are as follows:
The maximum length is 249 characters (symbols and letters); it was reduced from 255 to 249 in Kafka 0.10.
. (dot), _ (underscore), and - (hyphen) can be used in names. However, '.' and '_' collide in Kafka's internal metric names, so topics whose names differ only in these characters can clash; it is advisable to use one of the two, but not both.
Q35. Where is the meta information about topics stored in the Kafka cluster?
Currently, in Apache Kafka, meta-information about topics is stored in ZooKeeper, which runs separately from the Kafka brokers. Information regarding the location of the partitions and the configuration details of a topic are kept there. (In KRaft mode, this metadata is stored in an internal Kafka metadata log instead.)
Q36. What is meant by Kafka Connect?
Kafka Connect is a tool provided by Apache Kafka to allow scalable and reliable streaming data to move between Kafka and other systems. It makes it easier to define connectors that are responsible for moving large collections of data in and out of Kafka. Kafka Connect is able to process entire databases as input. It can also collect metrics from application servers into Kafka topics so that this data can be available for Kafka stream processing.
Q37. Explain message compression in Apache Kafka.
In Apache Kafka, producer applications often write data to the brokers in a text format such as JSON. Data in JSON format is stored as strings, which can result in a lot of repeated content being stored in the Kafka topic, increasing the disk space occupied. Hence, to reduce this disk space, messages are compressed before being sent to Kafka. Message compression is done on the producer side, so there is no need to make any changes to the configuration of the consumer or the broker.
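A minimal sketch of enabling compression on the producer (the codec choice is an illustrative assumption; Kafka supports gzip, snappy, lz4, and zstd):

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");
// compression.type accepts none, gzip, snappy, lz4, or zstd; whole batches are compressed together
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");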
Q38. What is the need for message compression in Apache Kafka?
Message compression in Kafka does not require any changes in the configuration of the broker or the consumer. It is beneficial for the following reasons:
Because compressed messages are smaller, the latency of sending them to Kafka is reduced.
Reduced bandwidth usage allows producers to send a greater net volume of messages to the broker.
When the data is stored in Kafka via cloud platforms, it can reduce the cost in cases where the cloud services are paid.
Message compression leads to reduced disk load, which will lead to faster read and write requests.
Q39. What are some disadvantages of message compression in Apache Kafka?
Producers end up using some CPU cycles for compression.
Consumers use some CPU cycles for decompression.
Taken together, compression and decompression therefore increase the CPU demand of the whole pipeline.
Q40. Can a consumer read more than one partition from a topic?
Yes, if the number of partitions is greater than the number of consumers in a consumer group, then a consumer will have to read more than one partition from a topic.
Q41. What do you know about log compaction in Kafka?
Log compaction is a method by which Kafka ensures that at least the last known value for each message key within the log of data is retained for a single topic partition. This makes it possible to restore the state after an application crashes or in the event of a system failure. It allows cache reloading once an application restarts during any operational maintenance. Log compaction ensures that any consumer processing the log from the start can view the final state of all records in the original order they were written.
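As an illustrative sketch (the topic name is an assumption), log compaction is enabled per topic by setting cleanup.policy=compact, for example through the Java AdminClient:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 2)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,     // "cleanup.policy"
                                TopicConfig.CLEANUP_POLICY_COMPACT));  // keep latest value per key
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}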
Q42. Name the various types of Kafka producer API.
There are three types of Kafka producer API available (each send style is sketched below):
Fire and Forget
Synchronous producer
Asynchronous producer
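As a hedged sketch (the topic name and values are illustrative), the three styles differ only in how the result of producer.send() is handled:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class SendStyles {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("demo-topic", "key", "value");

            // 1. Fire and forget: send and do not check the result.
            producer.send(record);

            // 2. Synchronous: block on the returned Future until the broker acknowledges.
            RecordMetadata meta = producer.send(record).get();
            System.out.println("written to partition " + meta.partition()
                               + " at offset " + meta.offset());

            // 3. Asynchronous: register a callback invoked on success or failure.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) exception.printStackTrace();  // delivery failed
                else System.out.println("offset " + metadata.offset());
            });
        }
    }
}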
Q43. What is Kafka’s producer acknowledgment? What are the various types of acknowledgment settings that Kafka provides?
A broker sends an ack or acknowledgment to the producer to verify the reception of the message. Ack level is a configuration parameter in the Producer that specifies how many acknowledgments the producer must receive from the leader before a request is considered successful. The following types of acknowledgment are available:
acks=0
In this setting, the producer does not wait for the broker's acknowledgment. There is no way to know if the broker has received the record.
acks=1
In this situation, the leader writes the record to its local log and responds without waiting for all of its followers to acknowledge it. The message can be lost here only if the leader fails immediately after accepting the record but before the followers have copied it, in which case the record is lost.
acks=all
In this setting, the leader waits for the full set of in-sync replicas to acknowledge the record. As long as at least one replica remains alive, the record will not be lost; this is the strongest available guarantee. However, because the leader must wait for all in-sync followers to acknowledge before replying, throughput is lower.
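A minimal sketch of selecting the acknowledgment level on the Java producer (the values shown are illustrative):

import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.ACKS_CONFIG, "all");   // accepts "0", "1", or "all"
// With acks=all, pairing retries with the broker/topic setting min.insync.replicas
// controls how many replicas must confirm a write before it succeeds.
props.put(ProducerConfig.RETRIES_CONFIG, 3);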
Q44. How do you get Kafka to perform in a FIFO manner?
Kafka organizes messages into topics, which are divided into partitions. A partition is an ordered, immutable sequence of messages that is continually appended to. A message in the partition is uniquely identified by a sequential number called the offset. FIFO behavior is possible only within a partition. The following steps help achieve FIFO behavior:
To begin, set the auto-commit property to false:
enable.auto.commit=false
Do not call consumer.commitSync() after the messages have been processed; the offsets are instead tracked externally, as described in the last step.
Then "subscribe" to the topic and ensure that the consumer's position is updated correctly.
Use a ConsumerRebalanceListener and, inside the listener, call consumer.seek(topicPartition, offset) so that the consumer resumes from the last processed position.
Once a message has been processed, store its offset together with the processed message, so that the seek in the previous step always restarts from the right place. (A code sketch of this pattern follows.)
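Below is a hedged sketch of this pattern in Java. The in-memory offsetStore map stands in for a real external store (such as a database written atomically with the processed results), and the topic and group names are illustrative assumptions:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fifo-group");        // illustrative group name
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // step 1: disable auto-commit
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer");

Map<TopicPartition, Long> offsetStore = new HashMap<>();        // stand-in for an external store
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

consumer.subscribe(Collections.singleton("orders"), new ConsumerRebalanceListener() {
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {                  // resume from stored offsets
            Long next = offsetStore.get(tp);
            if (next != null) consumer.seek(tp, next);
        }
    }
});

while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
        System.out.println(record.value());                     // process the record here
        // Store the next offset to read together with the processed result
        offsetStore.put(new TopicPartition(record.topic(), record.partition()),
                        record.offset() + 1);
    }
}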
Q45. How is load balancing maintained in Kafka?
Load balancing in Kafka is handled by the producers. The message load is spread out between the various partitions while maintaining the order of messages within each partition. By default, when no key is provided, the producer distributes messages across partitions in a round-robin fashion; messages that carry the same key always land on the same partition. If a different approach is needed, users can also specify exact partitions for a message.