KAFKA: A Fast Messaging System

Why KAFKA is so fast - 

 1. Low-Latency I/O: There are two places where data can be stored and cached: Random Access Memory (RAM) and disk.

  • The conventional way to achieve low-latency message delivery is to keep the data in RAM. It's preferred over disk because disks have high seek times, which makes them slower.
  • The downside of this approach is that RAM becomes expensive when the data flowing through your system reaches tens or hundreds of gigabytes per second.


2. Kafka Avoids the Seek Time: Kafka smartly avoids the seek time by using a concept called Sequential I/O.
    • It uses a data structure called the ‘log’, which is an append-only sequence of records, ordered by time. The log is basically a queue: the producer appends records at its end, and subscribers process the messages at their own pace by maintaining pointers into it.
    • The first record published gets an offset of 0, the second gets an offset of 1, and so on.
    • Consumers read data from the position specified by an offset and periodically save their position back to a log.
    • This also makes Kafka fault-tolerant: if a consumer instance fails, another consumer can use the stored offset to resume reading from where it left off. And because the data is laid out sequentially, reads require no disk seeks.
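A minimal sketch of this append-only log with per-consumer offsets can make the idea concrete. This is a toy in-memory model, not Kafka's actual implementation; all class and method names here are hypothetical:

```python
class Log:
    """A toy append-only log: records are ordered and addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # O(1): new records only ever go at the end (sequential I/O).
        self._records.append(record)
        return len(self._records) - 1      # the new record's offset

    def read(self, offset):
        # O(1): a read is a direct lookup by position, no searching.
        return self._records[offset]


class Consumer:
    """Each consumer tracks its own position; the log itself is shared."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset               # saved position; a replacement
                                           # consumer can resume from here

    def poll(self):
        if self.offset >= len(self.log._records):
            return None                    # caught up with the producer
        record = self.log.read(self.offset)
        self.offset += 1
        return record
```

Two consumers holding independent offsets can read the same log at different speeds, which is how Kafka decouples producers from subscribers.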

3. Zero Copy Principle: The most common way to send data over a network requires multiple copies and context switches between kernel mode and user mode, which consumes memory bandwidth and CPU cycles. The zero-copy principle reduces this by asking the kernel (for example, via the sendfile system call) to move the data directly from the page cache to the response socket rather than routing it through the application. Kafka’s speed is tremendously improved by its use of zero copy.
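The difference can be illustrated in Python with `os.sendfile`. This is a rough sketch, not Kafka's code: it copies between two files rather than to a network socket (file-to-file `sendfile` requires Linux), and the function names are made up for the demo:

```python
import os

def copy_traditional(src_path, dst_path, bufsize=64 * 1024):
    """Read data into a user-space buffer, then write it back out."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while chunk := src.read(bufsize):   # data crosses into user space...
            dst.write(chunk)                # ...and back into the kernel

def copy_zero_copy(src_path, dst_path):
    """Ask the kernel to move the bytes directly between descriptors."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:                # no user-space buffer involved
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, remaining)
            if sent == 0:
                break
            offset += sent
            remaining -= sent
```

Both produce identical output; the zero-copy path simply skips the round trip through the application's address space.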

4. Optimal Data Structure: Tree vs. Queue: The tree seems to be the data structure of choice when it comes to data storage; most modern databases use some form of tree, e.g., MongoDB uses a B-tree.

  • Kafka, on the other hand, is not a database but a messaging system, and hence it experiences far more read/write operations than a typical database.
  • Using a tree for this may lead to random I/O, eventually resulting in disk seeks, which is catastrophic in terms of performance.

Thus, Kafka uses a queue: all data is appended at the end, and reads are simple pointer advances. Both operations are O(1), which makes the queue an efficient data structure for Kafka's workload.

5. Horizontal Scaling: Kafka can split a single topic into multiple partitions that are spread across thousands of machines. This enables it to maintain high throughput and provide low latency.
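Partitioning works because each keyed message deterministically maps to one partition. Kafka's default partitioner uses Murmur2 hashing; the sketch below substitutes CRC32 as a stand-in to keep the example self-contained, so it illustrates the idea rather than Kafka's exact placement:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Messages with the same key always land in the same partition,
    # preserving per-key ordering while spreading load across machines.
    return zlib.crc32(key) % num_partitions
```

Messages without a key are instead spread across partitions (round-robin or sticky batching, depending on the Kafka version), trading per-key ordering for even load.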

6. Compression & Batching of Data: Kafka batches messages into chunks, which reduces network calls and converts many random writes into sequential ones. It is also more efficient to compress a batch of data than to compress individual messages.

Hence, Kafka compresses a batch of messages and sends it to the broker, where it is written in compressed form; it is decompressed only when consumed by the subscriber. Kafka supports the gzip, Snappy, LZ4, and Zstandard compression codecs.
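The batching benefit is easy to demonstrate with plain zlib compression as a stand-in for Kafka's codecs (the message contents here are invented for the demo):

```python
import zlib

# 100 similar messages, as a stream of events often looks in practice.
messages = [f"user-{i} clicked checkout".encode() for i in range(100)]

# Compressing each message on its own pays per-message codec overhead
# and gives the compressor no cross-message redundancy to exploit.
individual = sum(len(zlib.compress(m)) for m in messages)

# Compressing the whole batch lets repeated patterns across messages
# be encoded once, so the total is much smaller.
batch = len(zlib.compress(b"".join(messages)))
```

With messages this similar, the batch compresses to a fraction of the size of the individually compressed total.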


KAFKA Commands 

#Navigate to KAFKA HOME directory

cd $KAFKA_HOME

##Kafka commands

#Create Topic

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

#List Topics

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

#Describe Topic

bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test

#Produce Message

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test

#Consume Message (from current subscription)

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test

#Consume Message (from beginning)

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning

#Consume Message (with consumer group)

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group test-group

#Consume Message (from specific partition)

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --partition 0 --offset earliest

#Consume Message (from specific offset)

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --partition 0 --offset 3

#Alter Topic

bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --topic test --partitions 3

#Delete Topic

bin/kafka-topics.sh --delete --bootstrap-server localhost:9092 --topic test

#Consumer Groups

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test-group

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

#Kafka Manager

~/kafka-manager/bin/kafka-manager &

#Kafka Connect

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

#Kafka Stream

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic streams-plaintext-input

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic streams-wordcount-output --config cleanup.policy=compact

bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe

bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic streams-plaintext-input

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic streams-wordcount-output --from-beginning --formatter kafka.tools.DefaultMessageFormatter --property print.key=true --property print.value=true --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

