What is Apache Kafka and How Does it Work?
When your company starts out, things are very simple: you have a source system and a target system, and you need to move data between them. For example, your source system is a database and your target system is an analytics system, and you want to move data from A to B.
All you need to do is create an integration. But as your company grows, you will have many source systems and many target systems, and integrating them all together becomes very complicated. You will have many more integrations to write, and each integration comes with its own set of challenges.
For example, if you have 6 source systems and 10 target systems and you want to integrate them all together, you will need to write 60 integrations, and each integration will have difficulties around:
- Protocol: how the data will be transported (for example TCP, HTTP, REST, FTP, etc.)
- Data Format: how the data is parsed (for example Binary, CSV, JSON, etc.)
- Data Schema and Evolution: how the data is shaped and how it's going to change in the future
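The combinatorics behind that 60 can be sketched in a few lines. This is a back-of-the-envelope illustration, not Kafka code:

```python
# Point-to-point: every source system talks to every target system directly,
# so the number of integrations is the product of the two counts.
sources, targets = 6, 10
point_to_point = sources * targets  # one integration per (source, target) pair
print(point_to_point)               # 60 integrations to write and maintain

# With a broker like Kafka in the middle, each system integrates with the
# broker exactly once, so the count grows additively instead.
with_broker = sources + targets
print(with_broker)                  # 16 integrations
```

The gap widens quickly: at 20 sources and 20 targets, point-to-point means 400 integrations versus 40 through a broker.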
In addition, every time you integrate a source system with a target system, there will be processes querying and extracting data from the source system, so the source system will see an increased load from all these connections, which may become a problem. This is not a new problem; it is a very old one in the IT industry, and Apache Kafka is here to solve it for you.
What is Apache Kafka?
Apache Kafka allows you to decouple your data streams and your systems. The idea is that the source systems have the responsibility to send their data into Apache Kafka, and any target system that wants access to this data stream reads it from Apache Kafka. By having this decoupling, we put the responsibility of receiving and sending the data entirely on Apache Kafka.
This is not a new way of doing things; it is called pub-sub. But Apache Kafka is revolutionary because it scales really well and can handle very large volumes of messages per second. So what could the source systems and target systems be? For example, your source systems could be website events, pricing data, financial transactions, or user interactions, and your target systems might be a database, an analytics system, an email system, or an audit system.
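The decoupling idea can be made concrete with a toy in-memory sketch of the pub-sub pattern. To be clear, this is not Kafka's API; the `Broker` class and its method names are invented purely to illustrate how sources and targets stop talking to each other directly:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: sources publish, targets read.

    Illustrates the pub-sub decoupling only -- not Kafka's actual API.
    """
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of messages

    def publish(self, topic, message):
        # Source systems only ever talk to the broker.
        self.topics[topic].append(message)

    def read(self, topic):
        # Target systems only ever read from the broker; the source
        # system never sees this load.
        return list(self.topics[topic])

broker = Broker()
broker.publish("website_events", {"user": "alice", "page": "/home"})
broker.publish("website_events", {"user": "bob", "page": "/pricing"})

# Two independent targets consume the same stream without ever
# connecting to the source system that produced it.
analytics = broker.read("website_events")
audit = broker.read("website_events")
print(len(analytics), len(audit))  # 2 2
```

Adding a third target (say, an email system) requires no change to any source: it just calls `read` on the topics it cares about, which is exactly the N + M property from the integration-count argument above.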
Why Apache Kafka?
- Kafka is a project that originated within LinkedIn, where it was very successful. It was then open-sourced, and the project found its home under the Apache Software Foundation (ASF), which is why Kafka is called Apache Kafka.
- Although it is an open-source project, several private corporations maintain it, including Confluent, IBM, and Cloudera, among many others. The main organization supporting the Kafka project is Confluent, a private company with a whole business model around Apache Kafka, offering its own enterprise software on top of the project.
- Apache Kafka is very good and very popular because it is distributed, has a resilient architecture, and is fault-tolerant.
- It also scales very well because it is horizontally scalable: to add capacity, you just add more servers. In Apache Kafka, a server is called a broker, so Apache Kafka can scale to hundreds of brokers and to tens of millions of messages per second; Twitter, for example, handles hundreds of millions of messages per second.
- It has very high performance, with a latency of less than 10 milliseconds, which makes it a real-time system.
- It is used by thousands of firms, including 60% of the Fortune 100, and some of the big names using Apache Kafka that you may know include LinkedIn, Airbnb, Netflix, Uber, and Walmart.
Use Cases of Apache Kafka
- It could be used as a messaging system.
- Activity Tracking.
- It could be used to gather metrics from many different locations.
- It can be used to gather application logs at scale. Metrics and logs were actually among the first use cases of Apache Kafka at LinkedIn.
- It can be used for stream processing, as we'll see with the Kafka Streams API, for example.
- It can be used to decouple system dependencies in microservice architectures.
- It also has a lot of integrations with big data technologies such as Spark, Flink, Storm, and Hadoop to perform big data processing.
Real-World Usage of Apache Kafka
- Netflix uses Kafka to apply recommendations in real time while you're watching TV shows.
- Uber uses Kafka to gather taxi, user, and trip data in real time, which it uses to compute and forecast demand; it can then compute the infamous surge pricing in real time to know how much to charge you for a ride when demand is high.
- LinkedIn uses Kafka to prevent spam and collect user interactions to make better connection recommendations in real time.
So finally, we can say that Kafka is only a transportation mechanism: people still need to write their applications around it, but Kafka is the big giant pipe in the middle that allows your data to flow from your source systems to your target systems in real time.