Streaming Big Data with Flink
Earlier this year our team was tasked with enhancing a legacy application to support real-time streaming updates and considerably more demanding scaling requirements. This legacy application was responsible for normalizing and enriching all the transactional events across multiple trading systems here at Liquidnet. There was also a batch process that ran periodically to link these events so that downstream client applications (reporting apps, UIs, etc.) could easily reconstitute the complex hierarchy of orders, execs, replaces, fills, parent and child cancels, and so on.
As you may have already figured out, the need to stream real-time updates and the reliance on a batch process to enrich the data are at odds with each other. An earlier approach to the problem would have been to perform the enrichment in real time per event, using a centralized datastore for persistence and look-up (i.e. state management). That would have meant introducing anywhere from 10 to 100 milliseconds of processing latency per event, severely limiting our ability to scale.
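A back-of-the-envelope calculation shows why that latency matters. The sketch below assumes a single worker that handles events one at a time and waits on the datastore round trip for each one; the function name is illustrative only.

```python
def max_events_per_second(latency_ms: float) -> float:
    """Upper bound on throughput for one worker that processes events
    sequentially and spends latency_ms on each of them."""
    return 1000.0 / latency_ms


# The two ends of the 10-100 ms range quoted above.
for latency_ms in (10, 100):
    rate = max_events_per_second(latency_ms)
    print(f"{latency_ms} ms per event -> at most {rate:.0f} events/s per worker")
```

Under these assumptions a worker tops out somewhere between 10 and 100 events per second, and adding workers only helps until the shared datastore itself becomes the bottleneck.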
Taking a step back and applying first principles, a problem like this can be solved by writing an application that reads this stream and performs stateful computation in-memory. However, there are multiple issues with this simple approach:
You cannot run a second instance of this application and expect to double your throughput: unless the incoming data is properly sharded, the instances will see an inconsistent view of the incoming stream and produce incorrect or non-deterministic results.
Even if you address the sharding problem, you still need to address fault tolerance, since the state is in-memory and you need a robust mechanism to recover from node failures.
If the application is performing real-time analytics, you would have to deal with event-time processing semantics and late/out-of-order events.
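The sharding issue in the first point can be illustrated with a small sketch. It assumes each event carries a key (the order IDs and event types here are made up) and routes events with a stable hash, so that every event for a given key reaches the same instance in its original order. Flink exposes the same idea through keyed streams (`keyBy`); the code below is plain Python, not Flink's implementation.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    # crc32 is stable across processes and restarts, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def route(events, num_shards):
    """Group (key, event_type) pairs by shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, event_type in events:
        shards[shard_for(key, num_shards)].append((key, event_type))
    return shards


# Hypothetical events: (order id, event type), in arrival order.
events = [
    ("order-1", "new"),
    ("order-2", "new"),
    ("order-1", "fill"),
    ("order-2", "cancel"),
    ("order-1", "fill"),
]

for shard, shard_events in sorted(route(events, num_shards=4).items()):
    print(shard, shard_events)
```

Because the shard is a pure function of the key, each instance can keep the state for its keys locally without coordinating with the others. Fault tolerance and event-time handling, the other two points, still need separate answers.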
A case for Flink
In our search for the right tool for this task, we researched several options, including (but not limited to) Apache Spark, Kafka Streams, Akka Streams, and Apache Flink. It quickly became apparent that Flink was well ahead of the alternatives in serving our current and future needs.
The following are some of the key features that tilted the scales in favor of Flink:
Flink was built from the ground up as a stream processing engine. Batch processing is treated as stream processing on bounded streams.
Native support for event-time processing. This is crucial for any real-time analytics: without event-time processing, results from streaming analytics are bound to be incorrect when events arrive late or out of order.
Exactly-once state consistency (can be extended to end-to-end exactly-once processing).
Distributed asynchronous checkpoints – crucial for fault tolerance, deployments, A/B testing, etc.
Rich Windowing API.
Extensive support for the most sought-after streaming features (as illustrated by this Beam Capability Matrix).
Support for large in-memory state (there are examples of people using Flink to manage several TBs of state in production).
Needless to say, with Flink the problems described in the previous section are already handled for you, allowing you to scale as you need and still be completely fault-tolerant. As a developer you only worry about your computational/business logic, and the framework takes care of converting it into a parallel execution plan, handling the rest of the complexities for you.
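To make the event-time point concrete, here is a simplified sketch of tumbling event-time windows driven by a watermark. It is plain Python rather than Flink code, and the window size, the allowed out-of-orderness, and the function name are assumptions chosen for illustration; Flink's actual watermark and trigger semantics differ in detail.

```python
from collections import defaultdict

WINDOW_MS = 1000               # tumbling window size (assumed)
MAX_OUT_OF_ORDERNESS_MS = 500  # how far behind the newest event we wait (assumed)


def tumbling_counts(timestamps, window_ms=WINDOW_MS,
                    max_delay_ms=MAX_OUT_OF_ORDERNESS_MS):
    """Count events per event-time window.

    timestamps: event-time values in milliseconds, in ARRIVAL order.
    Returns (results, late): results maps window start -> count, and late
    lists events whose window had already been closed when they arrived.
    """
    open_windows = defaultdict(int)
    results = {}
    late = []
    watermark = float("-inf")
    for ts in timestamps:
        start = ts - ts % window_ms
        if start + window_ms <= watermark:
            late.append(ts)  # window already emitted; event is too late
        else:
            open_windows[start] += 1
        # The watermark trails the largest timestamp seen so far.
        watermark = max(watermark, ts - max_delay_ms)
        # Emit every window that ends at or before the watermark.
        for s in sorted(w for w in open_windows if w + window_ms <= watermark):
            results[s] = open_windows.pop(s)
    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        results[s] = open_windows.pop(s)
    return results, late


# 700 arrives out of order but within the allowed delay; 300 arrives too late.
results, late = tumbling_counts([100, 900, 1200, 700, 1600, 2100, 300])
print(results)  # {0: 3, 1000: 2, 2000: 1}
print(late)     # [300]
```

Grouping by arrival time instead would have put 700 and 300 into later windows, which is the kind of silent error event-time processing avoids. What to do with the late event (drop it, route it to a side channel, or re-emit a corrected result) is a policy decision this sketch leaves open.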
Messaging System vs Distributed Log
Once we settled on Flink as our stateful stream processing engine, it didn’t take long for us to realize that using a distributed log like Apache Kafka made a lot of sense.
Unlike a traditional messaging system, Apache Kafka is a distributed log: it offers the functionality of a messaging system and also doubles as durable storage, an event store that can be retained long term and treated as the “single source of truth”. That storage can be scaled linearly without the inherent dangers of shared, mutable state.
Kafka consumer offsets are stored as part of Flink checkpoint/savepoint state; this enables us to replay events from the past – useful for A/B testing, backfilling the data after a bug fix is applied, etc.
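The value of storing offsets together with state can be sketched in a few lines. The example below is a toy model, not Flink or Kafka code: the log is a Python list, the `Job` class and its methods are hypothetical, and the events are made up. It shows that a snapshot holding both the read position and the state lets a restarted job replay from the log and reach the same result as an uninterrupted run.

```python
import copy

# Stand-in for a Kafka partition: an append-only list of (key, quantity).
log = [("order-1", 100), ("order-2", 50), ("order-1", 25),
       ("order-2", 10), ("order-1", 5)]


class Job:
    """Toy stateful job: running quantity total per key."""

    def __init__(self):
        self.offset = 0   # next log position to read
        self.totals = {}  # keyed state

    def process(self, log, upto):
        # Consume log entries from the current offset up to (not including) upto.
        while self.offset < upto:
            key, qty = log[self.offset]
            self.totals[key] = self.totals.get(key, 0) + qty
            self.offset += 1

    def checkpoint(self):
        # Offset and state are captured together, so they stay consistent.
        return {"offset": self.offset, "totals": copy.deepcopy(self.totals)}

    @classmethod
    def restore(cls, snapshot):
        job = cls()
        job.offset = snapshot["offset"]
        job.totals = copy.deepcopy(snapshot["totals"])
        return job


job = Job()
job.process(log, upto=3)
snapshot = job.checkpoint()
job.process(log, upto=4)  # work done after the checkpoint...
del job                   # ...is lost when the job "crashes"

recovered = Job.restore(snapshot)
recovered.process(log, upto=len(log))  # replay from the stored offset
print(recovered.totals)  # {'order-1': 130, 'order-2': 60}
```

The event at offset 3 is processed twice overall, yet it is counted once in the final state because the state rolled back with the offset. The same mechanism supports backfilling: restore an older snapshot, deploy the fixed logic, and replay.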
The Stream Team
Here is the software stack that powers our streaming platform:
Kafka – Message bus
Flink – Stateful stream processing engine
Avro – Wire format
Pure FlashBlade – Kafka storage and Flink checkpoints/savepoints
Prometheus / InfluxDB / Grafana – Metrics and monitoring
Ignite – Distributed in-memory cache
Akka / Akka Streams – Client APIs built on top of Kafka streams
We already have a couple of applications in production using this stack and there is at least one more in the incubation phase. This is just the beginning for us.
Future posts will cover the rest of the components in the line-up.
So to sum up…
Streaming is becoming increasingly important in Big Data for the following reasons:
Timeliness of data is important for businesses to improve their agility in making key decisions. Switching to streaming is a great way to achieve lower latency.
Massive, unbounded datasets are becoming quite common in most modern businesses. It’s easier to handle the complexity of such datasets using systems designed specifically to handle unbounded streams of data.
Processing data as it arrives spreads workloads out evenly over time. This results in a more consistent and predictable consumption of resources.
Having said that, handling streaming data at scale poses a lot of technical challenges unique to stream processing. The tandem of Flink & Kafka has positioned us well to deal with these challenges and build elegant solutions without letting any of these complexities creep into our implementation/domain.