
The Evolution of Data Streaming and Its Significance

Srinivasulu Grandhi, VP of Engineering and Site Leader, Confluent

Speaking exclusively to CIOTechOutlook, Grandhi explores how real-time data is making an impact on today's world and how digital natives are best placed to capitalize on data streaming. Srinivasulu is VP of Engineering and Site Leader at Confluent, based in Bangalore. He has over three decades of software experience spanning development, architecture, and product and technology strategy.

Why is stream processing often considered a complex challenge? 

Stream processing started with the rise of digital natives. If you think about it, successful e-commerce sites handle massive numbers of users at any given time with people both browsing and shopping.  
So, to create a recommendation engine, for example, you need to be able to process the clickstream data from millions of users, each of whom may be performing several activities at once. This poses two fundamental challenges. First, given the volume of data, how do you store that data and make it available in real time for processing?
The second challenge is how to process that stream of data in real time.
These are challenges every e-commerce site faces, especially the really popular ones where customer experience is so important. To succeed, they need a highly scalable, low-latency streaming platform and stream processing system. This is where Kafka comes in.
Kafka is a hugely scalable and reliable streaming platform that allows companies in different industries to process large data streams in real time. 
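To make the clickstream example concrete, here is a minimal sketch of publishing click events to a Kafka topic with the confluent-kafka Python client. The broker address, topic name, and event fields are illustrative assumptions rather than anything specific to the systems Grandhi describes.

```python
# A minimal sketch of publishing clickstream events to Kafka.
# Broker address, topic name, and event fields are assumptions.
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address


def publish_click(user_id: str, page: str, action: str) -> None:
    """Send one clickstream event, keyed by user so each user's clicks stay ordered."""
    event = {
        "user_id": user_id,
        "page": page,
        "action": action,  # e.g. "view", "add_to_cart", "purchase"
        "ts": int(time.time() * 1000),
    }
    producer.produce(
        "clickstream",  # assumed topic name
        key=user_id,
        value=json.dumps(event).encode("utf-8"),
    )


publish_click("user-42", "/products/123", "view")
producer.flush()  # block until queued events have been delivered
```

Keying events by user ID keeps each user's activity in order within a partition, which is what a downstream recommendation engine would typically rely on.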
For example, imagine you’re about to buy something via your banking app. For it to provide real-time fraud detection, it has to stream data from multiple systems, applications, and databases, making the necessary checks so that you can go ahead and complete the purchase.
As consumers, we don’t give this a second thought. But behind the scenes, the processing is carried out in real time. Not only does it have to be done quickly — so as not to hold up the transaction — but it also has to be done accurately to prevent fraud. 
To make things even more complex, these checks have to be carried out at scale, with thousands of transactions completed each second. Only real-time data processing can do this. Batch processing, where data is processed periodically, is simply not up to the task.
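As a toy illustration of the real-time fraud check described above, the sketch below consumes a stream of transactions and flags suspicious ones as they arrive. The topic, consumer group, and single-threshold rule are assumptions; a production system would combine many signals and models.

```python
# A toy real-time fraud check over a stream of transactions.
# Topic, group id, fields, and the one-rule check are illustrative assumptions.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "fraud-check",              # assumed consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["transactions"])        # assumed topic name

SUSPICIOUS_AMOUNT = 10_000  # illustrative threshold

try:
    while True:
        msg = consumer.poll(1.0)            # wait up to 1s for the next event
        if msg is None or msg.error():
            continue
        txn = json.loads(msg.value())
        if txn.get("amount", 0) > SUSPICIOUS_AMOUNT:
            print(f"flag for review: {txn.get('txn_id')}")  # would alert downstream in practice
        else:
            print(f"approved: {txn.get('txn_id')}")
finally:
    consumer.close()
```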

Given the shift towards real-time data processing with the rise of data streams, how do you see this impacting traditional batch processing methods in the near future, particularly in industries like finance or healthcare where data accuracy and timeliness are crucial?

Batch processing is simply not up to the task of providing up-to-date, accurate information. That applies to finance and healthcare — and just about every other industry I can think of. 
The ‘always on, always available’ nature of today’s world means people expect answers now — not in a week’s time or even “end of day”. 
In the past, data from a call center or CRM system would have been analyzed and, in turn, might have generated a report. But this would have taken time, and the information would probably have just sat there until someone requested it. Even then, people would have had to wait to receive it.
But expectations have changed. In today’s world, customers expect immediate answers to their questions. They expect their queries about health insurance, for example, to be answered there and then. If not, you have a dissatisfied customer. And if the problem isn’t addressed in near real time, they will simply take their business elsewhere.
Beyond data accuracy and timeliness, there is also the need for data governance. Take the HR back office, for example. While you would want to empower HR teams with data, you wouldn’t want to reveal sensitive data such as PII or salary information to every HR employee. Real-time data processing can help determine which datasets are protected, encrypted, and made available only to the people who have access.
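One simple way to picture that kind of governance is field-level masking applied before an event reaches consumers who lack clearance. The sketch below is an assumption-laden illustration; real deployments rely on schema-level policies, encryption, and access controls enforced by the streaming platform itself.

```python
# Field-level masking of an HR event for non-privileged roles.
# Field names and the role check are assumptions for illustration only.
SENSITIVE_FIELDS = {"salary", "national_id", "date_of_birth"}  # assumed sensitive fields


def redact_for_role(event: dict, role: str) -> dict:
    """Return a copy of the event with sensitive fields masked for non-privileged roles."""
    if role == "hr_admin":  # assumed privileged role
        return dict(event)
    return {k: ("***" if k in SENSITIVE_FIELDS else v) for k, v in event.items()}


hr_event = {"employee_id": "E-1001", "department": "Finance", "salary": 95000}
print(redact_for_role(hr_event, role="hr_generalist"))
# {'employee_id': 'E-1001', 'department': 'Finance', 'salary': '***'}
```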

In a world increasingly dominated by unbounded data streams, how would you design a real-time analytics system capable of handling the limitless length, continuous flow, high velocity, and great variability of these data streams effectively?

When a company introduces a new application, it starts from scratch. While it may have ambitious plans, it doesn’t know exactly how much data it will generate or how the use of that data will evolve over time.
So it needs a platform capable of growing from something small to something extremely big over a relatively short timeframe. The system must be fundamentally distributed at large scale, and it needs to offer effectively unlimited storage that scales elastically.
Let’s take an online food ordering and delivery service as an example. When a customer places an order, it gets shared with a number of delivery executives before being confirmed, allocated, and fulfilled. All of this ordering, confirmation, and fulfillment has to be done in real time.
Not only does this have to be done at scale, it also has to cope with fluctuating demand. For instance, demand for food delivery peaks during dinner time, and handling that volume requires a large system. But when peak time ends, the load drops sharply and the bigger infrastructure is no longer needed.
To meet these peaks and troughs, it’s essential to have a super-elastic streaming platform that can accommodate the variable demand. With digital native businesses, it’s far easier to do this from scratch.
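At the application level, that elasticity often comes from partitioning: a topic is split into partitions, consumer instances in the same group share them, and you can add instances at peak and remove them afterwards. The sketch below, with an assumed topic name and partition count, shows the idea using the confluent-kafka admin client.

```python
# Creating a partitioned topic so consumer instances can be scaled up and down.
# Topic name, partition count, and broker address are assumptions.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker address

# 12 partitions lets up to 12 consumer instances in one group share the load.
futures = admin.create_topics([
    NewTopic("food-orders", num_partitions=12, replication_factor=1)  # use 3+ replicas in production
])
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"created topic {topic}")

# At dinner-time peak, start more consumer processes with the same group.id and
# Kafka rebalances partitions across them; when demand drops, stop instances and
# the remaining ones pick up the freed partitions automatically.
```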

Does this also highlight the importance of taking a platform approach? 

Absolutely. Taking a platform approach means events can be written once, allowing distributed functions within an organization to react in real time. For instance, people interacting with their bank may require a number of different services. They may want to withdraw cash from an ATM, enquire about a credit card transaction, or apply for a loan.
Since all three functions may be relevant to one another, it would be odd to handle these processes separately in silos. As I pointed out earlier, Kafka, a distributed event store and stream-processing platform, can reliably store and process these events at a large scale.
And Flink — an open-source, unified stream-processing and batch-processing framework — has emerged as the de facto standard for batch and stream processing of events. 
Together, these allow data to be processed in real time. And it’s why companies are opting for a single platform that can store, process, and elastically scale these events in real time to meet the needs of today’s on-demand world.
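As a rough sketch of that combination, the PyFlink example below declares a Kafka topic as a streaming table and maintains a continuously updating aggregate over it. The topic, fields, and connector options are assumptions, and the Flink Kafka connector must be available on the classpath.

```python
# Events written once to a Kafka topic, with Flink computing a continuous aggregate.
# Topic name, fields, and connector options are assumptions for illustration.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: bank events published once to Kafka by the producing applications.
t_env.execute_sql("""
    CREATE TABLE bank_events (
        account_id STRING,
        event_type STRING,  -- e.g. 'atm_withdrawal', 'card_txn', 'loan_enquiry'
        amount     DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'bank-events',                        -- assumed topic
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-bank-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# A continuously updating aggregate: running totals per account and event type.
result = t_env.execute_sql("""
    SELECT account_id, event_type, COUNT(*) AS events, SUM(amount) AS total
    FROM bank_events
    GROUP BY account_id, event_type
""")
result.print()  # streams updated results to stdout as new events arrive
```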
