Let’s start with a Wikipedia quote:
“KISS is an acronym for ‘Keep it simple, stupid’, a design principle noted by the U.S. Navy in 1960. The KISS principle states that most systems work best if they are kept simple rather than made complicated; therefore, simplicity should be a key goal in design, and unnecessary complexity should be avoided.”
We are using the same acronym with the same meaning: Kafka keeps it simple, and it is the Key to Streaming Success 😉
K.I.S.S., Kafka Is Streaming Success
If you have not heard about Kafka, it is an Apache project for a streaming/messaging platform.
A streaming platform has three key capabilities:
- To publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- To store streams of records in a fault-tolerant durable way.
- To process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications.
- Building real-time streaming applications that transform or react to the streams of data.
You can find more info here: Kafka Website.
Surely your mind is now buzzing with ideas for new projects or for reengineering existing ones. That is quite normal, and the list will only grow once the capabilities of Kafka start showing up.
One simple piece of advice for anyone starting with Kafka: forget all previous constraints, remember your needs. Obviously, this is an exaggeration, but developers, admins, operators and architects will find in Kafka a new way of doing things:
- Need to reprocess records? No problem, you can keep them in the brokers for as long as you need. The default retention is one week (the broker setting log.retention.hours defaults to 168), but you can extend it to months, paying only in disk storage, not in performance. You own the offset; all the records are there for you.
- Need to increase the capacity of the system? No worries, add more brokers and more disks, then rebalance the load. This is the philosopher's stone of system administrators: high availability and load balancing.
- Does your application need to scale out consumers? Add them to the same group and they will be coordinated automatically.
- Do your records have to be received in order? Shhh, not everything in Kafka is good news, but there are ways to solve this 😉
- Faster records and streaming? Check!
Indeed, these four out of five features are the ones that shape the most successful Kafka use cases. They range from activity tracking, metrics and logging to commit logs, and all of them lead to stream processing.
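To make the bullets above a bit more tangible, here is a minimal in-memory sketch in Python. It is a toy model, not the Kafka client API: the names PartitionLog, partition_for and assign_partitions are hypothetical, the CRC32 hash is an assumption (Kafka's default partitioner uses a different hash), and the round-robin split only approximates what a real group rebalance does. On ordering, what Kafka does guarantee is order within a single partition, so sending related records with the same key is one common way to keep them in order.

```python
import zlib
from collections import defaultdict


class PartitionLog:
    """Toy append-only log: a record's offset is simply its index."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset, max_records=100):
        # Reading does not remove anything, so any offset can be read again.
        return self.records[offset:offset + max_records]


def partition_for(key, num_partitions):
    # Hypothetical partitioner: CRC32 of the key, modulo the partition count.
    # Equal keys always land in the same partition, which keeps their order.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, members):
    """Simplified round-robin split of partitions across one consumer group."""
    ordered_members = sorted(members)
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered_members[i % len(ordered_members)]].append(partition)
    return dict(assignment)


# Reprocessing: the consumer owns its offset and can rewind it at will.
log = PartitionLog()
for event in ["login", "click", "logout"]:
    log.append(event)
first_pass = log.read(0)
replayed = log.read(0)  # the same records again, nothing was consumed away

# Scaling out: a second consumer in the same group takes over half the partitions.
one_consumer = assign_partitions([0, 1, 2, 3], ["c1"])
two_consumers = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
```

In this sketch, rewinding is nothing more than reading from an older offset, and scaling out is nothing more than recomputing the assignment with one more member.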
This is thanks to near real-time record delivery in Kafka: once a producer and a consumer are connected, its architecture makes the information flow almost instantaneously.
The Kafka team published an interesting study showing that writing to and reading from disk can be as performant as memory if things are done the right way, and it is also much cheaper. Two hints on this: Kafka does not delete or reorganize records once they are consumed (it writes them with sequential I/O), and it avoids copying messages across many software layers (using the zero-copy approach via the sendfile system call).
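Both hints can be illustrated with a small Python sketch. It assumes a newline-delimited segment file, which is a made-up format and not Kafka's on-disk layout, and the helper names append_records and send_from are hypothetical. Python's socket.sendfile uses the operating system's sendfile call where available and falls back to ordinary sends elsewhere, so the zero-copy path is only taken on platforms that support it.

```python
import os
import socket
import tempfile


def append_records(path, records):
    """Append records at the end of the segment; return each record's start byte."""
    positions = []
    position = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as segment:  # append mode: writes are purely sequential
        for record in records:
            positions.append(position)
            position += segment.write(record + b"\n")
    return positions


def send_from(path, sock, position):
    """Send the segment from `position` to end of file without reading it into Python."""
    with open(path, "rb") as segment:
        return sock.sendfile(segment, offset=position)


with tempfile.TemporaryDirectory() as tmp:
    segment_path = os.path.join(tmp, "segment.log")
    positions = append_records(segment_path, [b"r0", b"r1", b"r2"])

    # A connected socket pair stands in for the broker/consumer connection.
    broker_side, consumer_side = socket.socketpair()
    with broker_side, consumer_side:
        sent = send_from(segment_path, broker_side, positions[1])
        received = b""
        while len(received) < sent:
            received += consumer_side.recv(4096)
```

The consumer asks for data starting at the second record and receives everything from that byte position onwards, while the file itself is never rewritten or compacted by the read.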
Kafka also has the Streams API, its own framework for stream processing. It has many nice features, but one that sets it apart from traditional messaging systems is that, instead of relying on pull requests that deliver batches of messages to achieve better performance, the Streams API processes a continuous stream of records one at a time, with latency in the range of milliseconds.
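The record-at-a-time idea can be sketched with a plain Python generator. This is not the Streams API, which is a Java library; running_word_count is a hypothetical name used only to show the shape of the processing, where every incoming record immediately produces updated results instead of waiting for a batch to fill up.

```python
from collections import Counter


def running_word_count(lines):
    """Emit an updated (word, count) pair as soon as each word arrives."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


# Any iterable works as the source, including one that never ends.
updates = list(running_word_count(["Kafka streams", "kafka"]))
```

Because the function is a generator, downstream code sees each update as it is produced and never has to hold the whole input in memory.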
But Kafka Streams is not strictly required for streaming; you can also use Spark Streaming or Flink, which are able to work with streams of data or with batches. Hortonworks made an interesting comparison of streaming frameworks. There’s also nice documentation and videos on Streams API here.
Are there any real streaming alternatives to Kafka?
There is one project, NATS Streaming, an extension of NATS that provides classical streaming. It is very recent and loaded with new technologies and messaging concepts; nevertheless, version 0.10.0 still feels a bit immature. The rest of the available products are simply better or worse implementations of classical messaging systems.
And let’s not forget that Kafka is open source, and surely you’ll want some support. There are a few companies out there able to provide it (Big Data players like Cloudera and Hortonworks, which are merging right now, Confluent, TIBCO…)
Author: Juan Tavira
Santander Global Tech