Apache Kafka, an open-source event streaming platform, is spreading like wildfire among developers and companies. As a result, there is considerable demand for Kafka professionals in the IT field. Companies such as Netflix, Airbnb, Goldman Sachs, and Uber already use Apache Kafka.
This article will give you insights into what Apache Kafka is, its characteristics and how it works.
What is Apache Kafka?
Apache Kafka is an open-source event streaming platform, originally developed at LinkedIn and open-sourced in 2011. It pulls real-time data from sources such as databases or cloud services as a stream of events, stores those events durably, and lets you retrieve them whenever needed. You can also process the event streams as per your needs. Kafka is written in Scala and Java. A stream of events is simply a continuous, unbounded flow of data.
The three major capabilities of Kafka are:
- It lets you publish (write) and subscribe to (read) streams of events.
- It stores these streams durably, for as long as you want.
- It lets you process the streams as they occur, or retrospectively, and route the events to the applications that need them.
Apache Kafka can be deployed in the cloud or on-premises, on virtual machines, and even on bare-metal hardware.
You can use Kafka to build real-time streaming data pipelines and streaming applications. An application consumes the stream of events, while a pipeline moves the stream from one system to another system or into your application.
Characteristics of Apache Kafka
Let’s look at the characteristics of Apache Kafka.
- Highly Scalable: The streams of events stored in Kafka are distributed across multiple servers, making Apache Kafka highly scalable as well as distributed.
- High Availability: Apache Kafka is very fast and is designed to run with no downtime and no data loss. Thus, it has high availability.
- High Volume: Kafka can store and process huge amounts of data with ease.
- Fault-Tolerant: If any Kafka server fails, the other servers in the cluster take over its work, ensuring no data loss and uninterrupted operation. Thus, Kafka is fault-tolerant.
- Reliable: Because Apache Kafka is fault-tolerant, it is also very reliable.
- High Throughput: Even with huge amounts of data being transferred and stored, Apache Kafka delivers stable performance, and therefore high throughput.
- Free to use: Apache Kafka is an open-source platform and thus is free of cost.
How does Apache Kafka work?
Kafka uses a binary protocol over TCP to communicate between the servers and the clients. Messages are grouped (batched) together before being sent over the network, which reduces round-trip time.
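As a minimal sketch of how this batching is tuned in practice, the Java producer exposes the `batch.size` and `linger.ms` settings (these are real producer configs; the broker address and values below are illustrative):

```java
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // illustrative broker address

// Collect up to 32 KB of messages per partition into a single request...
props.put("batch.size", 32768);
// ...and wait up to 10 ms for a batch to fill before sending it.
props.put("linger.ms", 10);
```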
Kafka Servers
Apache Kafka runs as a cluster of servers that form the storage layer for the streams of events. These servers are known as Brokers.
Kafka Connect is a tool used to continuously import and export streams of events between Kafka and your other systems or applications. Some of the servers in a Kafka cluster run Kafka Connect.
Kafka Clients
You can develop applications or microservices that can read, write, or process the stream of events using Kafka clients. The Kafka clients are available for various programming languages such as Java, Scala, Python, Go, etc.
The clients that write events into Kafka are known as Producers, and the clients that read and process the events are known as Consumers.
Messaging
Producers write and send messages to a broker, and the broker delivers the messages to consumers on a FIFO (First In, First Out) basis. This whole process is known as messaging.
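A minimal producer sketch in Java could look like the following. The topic name `orders`, the key, and the broker address are hypothetical stand-ins, but `KafkaProducer` and `ProducerRecord` are the real Kafka Java client API:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key "order-42" is hypothetical; messages with the same key
            // always land in the same partition of the "orders" topic.
            producer.send(new ProducerRecord<>("orders", "order-42", "order created"));
        } // closing the producer flushes any buffered messages
    }
}
```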
Topic
The messages inside a broker are stored in a Topic on a First In, First Out (FIFO) basis. Each topic is divided into Partitions, from which the reads and writes of event streams take place. You can think of a topic as a folder in your system and partitions as sub-folders inside that folder.
The number of partitions is chosen when a topic is created, while the Replication Factor controls how many copies of each partition are kept across the brokers. For example, if a topic has four partitions and the cluster has four brokers, each broker can be responsible for one of the partitions.
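As a sketch, a topic with an explicit partition count and replication factor can be created with Kafka's `AdminClient`; the topic name, counts, and broker address below are illustrative:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions, each stored in 3 copies across the brokers.
            NewTopic topic = new NewTopic("orders", 4, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```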
When sending a message, you can attach a key to it. The key ensures that all messages with that key reach the same partition, as in the producer sketch above. If no key is specified, messages are spread across the partitions in a round-robin fashion.
Each message stored on the broker's disk is assigned a unique, sequential identifier within its partition, called an Offset, which helps distinguish the messages.
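A minimal consumer sketch in Java prints the partition and offset of each message it reads. Again, the topic, group id, and broker address are illustrative, while `KafkaConsumer` and its methods are the real client API:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("group.id", "demo-group");              // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset uniquely identifies a message within its partition.
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(),
                            record.key(), record.value());
                }
            }
        }
    }
}
```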
Data Replication
We know that Kafka is fault-tolerant; let's now understand how this fault tolerance works.
Each broker in an Apache Kafka cluster holds partitions of a topic. For every partition, one copy acts as the Partition Leader, and it is this leader that receives the messages from the client applications. The data of the leader partition is replicated to the replica partitions on the other brokers; these replicas are the followers of the partition leader.
For example, suppose broker 1 is the leader for partition 0 of topic 1, and broker 2 contains the replica of that partition. Then broker 2 is the follower of broker 1 for partition 0. Similarly, if partition 2 of topic 1 has its leader on broker 3, and broker 1 contains the replica of partition 2, then broker 1 is the follower of broker 3 for that partition.
If the leader broker dies or stops responding, ZooKeeper picks one of the follower broker nodes to become the new leader. For example, if broker 3, the leader for partition 2 of topic 1, stops working, its follower, broker 1, is elected as the new leader for that partition.
ZooKeeper is a service that stores metadata about the cluster, such as which broker is the leader for each partition, so Kafka depends heavily on it. Clients request the data they want from the brokers, and the brokers consult ZooKeeper to find out the details.
This data replication ensures that even if a broker goes down, the communication between the client and the server is not affected.
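You can inspect this leader/replica layout yourself with the `AdminClient`. A sketch follows; the topic name and broker address are illustrative, and note that in recent Kafka versions `all()` on the describe result may be superseded by `allTopicNames()`:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class DescribeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singleton("orders"))
                    .all().get().get("orders");
            // Print the leader broker and follower replicas for each partition.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition=%d leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```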
Architecture of Apache Kafka
Apache Kafka has five core APIs:
- Admin API: As the name suggests, the Admin API manages Kafka objects such as brokers, topics, etc.
- Producer API: The Producer API is used to write/publish streams of events to Kafka topics.
- Consumer API: The Consumer API helps in reading or subscribing to the topics and processing the stream of events.
- Streams API: The Kafka Streams API is used to implement stream-processing applications and microservices; see the sketch after this list.
- Connect API: Connectors read or write streams of events between Kafka and external systems or applications. The Kafka Connect API helps in building and running those connectors, which handle the import and export of the data.
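As an illustration of the Streams API, the minimal sketch below reads events from one topic, transforms each value, and writes the result to another topic. The application id and the topic names `input-events`/`output-events` are hypothetical, while the classes come from the real `kafka-streams` library:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from "input-events", upper-case each value, write to "output-events".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-events");
        source.mapValues(value -> value.toUpperCase()).to("output-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```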
Apache Kafka is a fast-growing event streaming platform that offers no data loss, high throughput, and much more.