Apache Kafka – Message Compression
Kafka producers write data to topics, and topics are made of partitions. Producers automatically know which broker and partition to write to based on your message, and if a Kafka broker in your cluster fails, the producers automatically recover from it. This resilience is a big part of why Kafka is so widely used today. Picture a producer on the left-hand side sending data into each of the partitions of a topic.
Another important producer setting is message compression. Before looking at it, let's first understand the anatomy of a Kafka message.
Kafka Message Anatomy
Kafka messages are created by the producer, and the first fundamental part of a message is the key. The key can be null, and its type is binary. Binary means raw bytes (0s and 1s), but in practice keys are strings or numbers that the producer's serializer converts into bytes before sending.
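As a rough illustration of how a string or a number becomes binary, here is a minimal Python sketch. The function names `serialize_string` and `serialize_int` are hypothetical and not part of any Kafka client. They only mimic the common convention of UTF-8 for strings and 4-byte big-endian for 32-bit integers, which is also what the `StringSerializer` and `IntegerSerializer` in Kafka's Java client produce by default.

```python
import struct


def serialize_string(text):
    """Encode a string as UTF-8 bytes; None stays None (a null key or value)."""
    if text is None:
        return None
    return text.encode("utf-8")


def serialize_int(number):
    """Encode a signed 32-bit integer as 4 big-endian bytes."""
    return struct.pack(">i", number)


key_bytes = serialize_string("user-42")
id_bytes = serialize_int(42)

print(key_bytes)  # b'user-42'
print(id_bytes)   # b'\x00\x00\x00*'  (0x2a is the ASCII code for '*')
```

Either result is just a sequence of bytes, which is all Kafka sees for a key or a value.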
So we have the key, which is a binary field that can be null, and then we have the value, which is the content of your message and can also be null. The key and the value are the two most important parts of your message, but other things go into it as well. For example, your message can be compressed, so the compression type is indicated as part of the message. A value of none means no compression, and Kafka offers four compression types: gzip, snappy, lz4, and zstd.
Apache Kafka Message Compression
Producers usually send data in text-based form. Most of the time, for example, they send JSON, and JSON is text. It is text-heavy and large in size, so it's important to apply compression at the producer.
The compression type setting (compression.type) can take several values: none, which is the default and means no compression, or gzip, snappy, lz4, and zstd. Compression is more effective when we send a bigger batch of messages, so the more data you send to Kafka, the more compression helps. Here's how it works.
A producer batch is the producer grouping messages together on its own: Message 1, Message 2, Message 3, and so on up to, say, Message 100. The producer sends a lot of messages and wants to send them together where possible. When compression is enabled, and only then, the producer compresses the batch before sending it to Kafka, which makes it much smaller. The smaller batch is quicker to send to Kafka and quicker to replicate across brokers, so latency goes down. And because the brokers have less data to replicate, you use less network bandwidth. Those are the advantages of compressing a batch.
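The effect of batching on compression can be sketched with Python's standard library. This is only an illustration under stated assumptions: the 100 JSON events are made up, the batch is modeled as newline-joined messages rather than Kafka's actual record batch format, and gzip stands in for whichever codec is configured.

```python
import gzip
import json

# Hypothetical batch of 100 small JSON events, standing in for Message 1..100.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "url": f"/products/{i}"}).encode("utf-8")
    for i in range(100)
]

# Total size with no compression at all.
raw_size = sum(len(m) for m in messages)

# Each message compressed on its own: little redundancy to exploit per message.
one_by_one_size = sum(len(gzip.compress(m)) for m in messages)

# All messages compressed together, as a producer does with a batch.
batch = b"\n".join(messages)
batched_size = len(gzip.compress(batch))

print("raw bytes:              ", raw_size)
print("compressed one by one:  ", one_by_one_size)
print("compressed as one batch:", batched_size)
```

The batch compresses far better because the repeated field names and values are shared across messages, which is why larger batches benefit more from compression.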
Advantages of Kafka Message Compression
- The producer request is much smaller when it sends data to Kafka.
- Data is faster to transfer over the network, which leads to lower latency and better throughput.
- We get better disk utilization, because the brokers store our messages in compressed format, so the same disk holds more messages.
Disadvantages of Kafka Message Compression
- When you do compression, producers must commit some CPU cycles to complete that compression.
- Similarly, the consumers must commit some CPU cycles to decompress the data.
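To get a feel for that CPU cost, the following sketch times one compression (the producer side) and one decompression (the consumer side) of a text-heavy payload. The payload is invented for the example and gzip from Python's standard library is used as a stand-in codec, so the numbers only indicate the shape of the trade-off, not what a real Kafka client would measure.

```python
import gzip
import time

# Hypothetical text-heavy payload: 5000 repeated JSON log lines.
payload = b'{"level": "INFO", "service": "checkout", "message": "order placed"}\n' * 5000

# Producer side: CPU time spent compressing before the send.
start = time.perf_counter()
compressed = gzip.compress(payload)
compress_seconds = time.perf_counter() - start

# Consumer side: CPU time spent decompressing after the fetch.
start = time.perf_counter()
restored = gzip.decompress(compressed)
decompress_seconds = time.perf_counter() - start

print(f"original:   {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"compress:   {compress_seconds * 1000:.2f} ms")
print(f"decompress: {decompress_seconds * 1000:.2f} ms")
```

Timings vary by machine and payload, so it is worth running a measurement like this against your own data.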
Which Compression Type Should You Choose?
Kafka offers four main compression types: gzip, snappy, lz4, and zstd. snappy or lz4 is the usual recommendation, because both offer a good trade-off between speed and compression ratio. gzip, on the other hand, typically gives a higher compression ratio than either of them, but it's not very fast. So choose and test; it's super simple. You just change one setting and everything works. No single algorithm works for everyone, so try them against the kind of data that you have and see which one works best for you. Finally, it is highly recommended that you always use compression in production, especially if you have high throughput.
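The "one setting" is the producer property `compression.type`. Below is a minimal sketch of a producer configuration built as a plain Python dictionary. The helper name `build_producer_config` and the broker address `localhost:9092` are assumptions made for the example; only the property names `bootstrap.servers` and `compression.type` and the listed codec values come from Kafka itself.

```python
# Values accepted by the compression.type producer property.
SUPPORTED_COMPRESSION_TYPES = {"none", "gzip", "snappy", "lz4", "zstd"}


def build_producer_config(compression_type="snappy"):
    """Return a producer config dict (hypothetical helper, not a Kafka API)."""
    if compression_type not in SUPPORTED_COMPRESSION_TYPES:
        raise ValueError(f"unsupported compression type: {compression_type!r}")
    return {
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "compression.type": compression_type,   # the one setting to change
    }


config = build_producer_config("lz4")
print(config["compression.type"])  # lz4
```

A dictionary like this could be handed to a client that accepts Kafka-style property names, such as the `Producer` class in the confluent-kafka Python package. Switching codecs for a test run is then a one-word change.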
Advantages of snappy over other message compressions:
- snappy is very useful if your messages are text-based, for example, JSON documents or logs
- snappy has a good balance of compression ratio and CPU usage.