Streaming: a skill gap?

It's not just for big data

Sunday February 2nd, 2020

I've noticed a bit of a skill gap: I think a lot of developers are not able to code up "streaming" solutions to problems.

However, streaming can often be useful, even needed, in what are now run-of-the-mill web applications; and wonderfully, we often don't need anything fancier than the tools already being used: we just need to know how to use them.

What is streaming?

Any situation where you process data while you are still receiving it. The processing can be analyzing the data, or just forwarding it onwards.

What are the benefits of streaming?

There are two main [potential] benefits.

Speed

If you start processing the data sooner, before it's all received, then you [might] finish sooner.

Support higher concurrency / size limits

Say you would like users to be able to upload 500MB files: in these days of video and hi-res images, this isn't a far-fetched requirement, even for a standard web application. If you don't forward the uploaded data onwards while it's still being uploaded, each upload has to be held in memory in full, so just a few users uploading concurrently could use all the memory on a server.

[You can upload directly from a browser to the underlying data store. For example, to S3 using presigned URLs. However, this has its own set of drawbacks, omitted here for brevity.]
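To make the memory argument concrete, here is a minimal sketch of forwarding data onwards while it is still arriving. The name forward and the CHUNK_SIZE value are my own choices for illustration, and the in-memory BytesIO objects are stand-ins for a real upload and a real data store. The point is only that the function never holds more than one chunk at a time, however large the upload is.

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed value for illustration; tune for your workload


def forward(source, target, chunk_size=CHUNK_SIZE):
    """Copy from source to target one chunk at a time.

    This function holds at most chunk_size bytes at once, regardless of
    the total size. Returns the total number of bytes forwarded.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        target.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for an incoming upload and an outgoing data store
upload = io.BytesIO(b"x" * 200_000)
store = io.BytesIO()
forwarded = forward(upload, store)
```

Python's standard library has shutil.copyfileobj, which does much the same job; the loop is written out here only to show that nothing fancy is involved.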

What are typical problems with streaming?

Streaming is not a perfect/one-size-fits-all solution: it does have its downsides.

Testing

You're testing an upload with a 5KB file, and it works. Are you sure it's streaming and will work with a 5GB file? There are two options that I'm aware of.

Actually test a 5GB file [making sure you have less than 5GB of memory available]. While this is quite a good "real" test, it can be slow.
Hook into both sides of the streaming process, and ensure that the target receives data before the source has sent all of it. You can do this with smaller data, and so such a test can be quick. However, this can be more brittle with respect to refactorings, i.e. the test can fail while the production behaviour continues to work.
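The second option might look something like the sketch below. Everything in it is hypothetical: source and target are instrumented stand-ins for the two ends, and streaming_pipeline and buffering_pipeline are toy versions of "the code under test". The check is simply that the target logs its first chunk before the source logs that it has finished.

```python
def source(chunks, log):
    """Source side: record each chunk as it is sent, and when all are sent."""
    for i, chunk in enumerate(chunks):
        log.append(("sent", i))
        yield chunk
    log.append(("source_done", None))


def target(chunks, log):
    """Target side: record each chunk as it is received."""
    received = []
    for i, chunk in enumerate(chunks):
        log.append(("received", i))
        received.append(chunk)
    return received


def streaming_pipeline(chunks):
    """Toy code under test: passes each chunk on as soon as it arrives."""
    for chunk in chunks:
        yield chunk.upper()


def buffering_pipeline(chunks):
    """Toy code under test: reads everything before passing anything on."""
    yield b"".join(chunks).upper()


def is_streaming(pipeline):
    """True if the target received data before the source had sent it all."""
    log = []
    target(pipeline(source([b"a", b"b", b"c"], log)), log)
    return log.index(("received", 0)) < log.index(("source_done", None))
```

Here is_streaming(streaming_pipeline) is True and is_streaming(buffering_pipeline) is False, using only three bytes of data. The brittleness is visible too: the test depends on being able to wrap both ends, which a refactoring can easily take away.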

Errors

Handling errors, i.e. communicating and responding to them, can be more difficult.

Conveniently, HTTP has some of this built-in. When an HTTP body is streamed with a content-length header specifying the number of bytes, a receiver that hasn't received that many bytes by the time the connection closes knows an error has occurred. When transfer-encoding: chunked is used, a receiver that doesn't receive a 0-length chunk at the end knows there has been an error.
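As a rough illustration of the chunked case, here is a deliberately simplified decoder. It is a sketch, not a full HTTP implementation: it assumes no chunk extensions and no trailers, and the names iter_chunked, TruncatedBody and read_body are made up for this example. What it shows is that a body that stops before the 0-length chunk is distinguishable from one that ends properly.

```python
import io


class TruncatedBody(Exception):
    """Raised when the body ends before the terminating 0-length chunk."""


def iter_chunked(stream):
    """Yield the payload of each chunk of a chunked-encoded body.

    Simplified: assumes no chunk extensions and ignores trailers.
    """
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise TruncatedBody("stream ended before a chunk size line")
        size = int(size_line.strip(), 16)  # chunk sizes are hexadecimal
        if size == 0:
            return  # the terminating chunk: the body is complete
        data = stream.read(size)
        if len(data) < size or stream.read(2) != b"\r\n":
            raise TruncatedBody("stream ended part-way through a chunk")
        yield data


def read_body(raw):
    """Return (chunks received so far, whether the body was complete)."""
    received = []
    try:
        for data in iter_chunked(io.BytesIO(raw)):
            received.append(data)
    except TruncatedBody:
        return received, False
    return received, True


complete = read_body(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
cut_short = read_body(b"5\r\nhello\r\n6\r\n wor")
```

Note that in the truncated case the receiver has already been handed the first chunk by the time it finds out something went wrong, which is exactly the awkwardness of streaming errors.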

It's not perfect though: there is no way to send an HTTP status code once the body has begun to stream. But for many situations, this is enough.

What to do when an error has occurred may be more tricky. With a non-streaming multi-stage pipeline, if one part fails, you can usually retry because you still have the source bytes to retry with. However, if streaming, the bytes have gone. To retry, you have to build in a mechanism to re-retrieve them from the source.
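One possible shape for such a mechanism, sketched under some assumptions: instead of being handed a stream, the retrying code is handed a function that can open a fresh stream from the start. The names stream_with_retry, open_source and flaky_send are all hypothetical, and the sketch assumes the receiving side can safely discard a partial attempt.

```python
def stream_with_retry(open_source, send, attempts=3):
    """Call send with a freshly opened stream, retrying on OSError.

    open_source is called again for every attempt, since a partly
    consumed stream cannot be rewound. Returns the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            send(open_source())
            return attempt
        except OSError:
            if attempt == attempts:
                raise


def open_source():
    """Hypothetical source: re-retrieves the data from the beginning."""
    return iter([b"a", b"b", b"c"])


calls = {"count": 0}
delivered = []


def flaky_send(chunks):
    """Hypothetical target that drops the connection on its first call."""
    calls["count"] += 1
    partial = []
    for i, chunk in enumerate(chunks):
        if calls["count"] == 1 and i == 1:
            raise OSError("connection dropped mid-stream")
        partial.append(chunk)
    delivered.extend(partial)  # only keep the data from a complete attempt


attempts_used = stream_with_retry(open_source, flaky_send)
```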

Complexity

Especially when considering error handling, retrying, or say, efficiently dealing with bandwidth differences/variation in different parts of the stream, there could be more complexity compared to a non-streaming solution.

This being said, a) you may not need to implement such things [e.g. OS-provided TCP buffers may adequately compensate for bandwidth variation], and b) I suspect the complexity is sometimes overstated, and conflated with unfamiliarity [although it would be naive to think this isn't a problem, as mentioned below].

Performance

Ironically, there might be a performance penalty compared to non-streaming solutions due to what could be radically different operations / orders of operations. This could be especially true if using streaming for smaller amounts of data.

Homogeneity

Each part of the pipeline needs to support streaming. It's not the default in a lot of cases, which is unfortunate: you can use code that supports streaming to process data in a non-streaming way [by just using a single "chunk"], but it's impossible to do the opposite.
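A small sketch of that asymmetry, using hashing as the example. The function name sha256_of is mine; hashlib's update method is real and is designed to be fed data incrementally. Because the function accepts an iterable of chunks, a caller that already has all the bytes can pass a one-element list.

```python
import hashlib


def sha256_of(chunks):
    """Hash an iterable of byte chunks without joining them together."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


data = b"hello world"

# Streaming use: the chunks could come from a socket or a file
streamed = sha256_of([b"hello", b" ", b"world"])

# Non-streaming use: the same function, given a single "chunk"
whole = sha256_of([data])
```

A function that instead took one bytes argument could not be used the other way around without first collecting everything in memory.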

Unfamiliarity

Streaming has an unfortunate problem: it's the skill gap itself.

Since fewer developers are familiar with it, issues are less likely to be spotted in code reviews, streaming behaviour may be accidentally broken [if there aren't appropriate tests on it], there are fewer people to ask for help, and unfortunately any help that is given has a higher chance of being misleading.

This is admittedly a bit of a chicken/egg situation!

What can I do?

I keep hearing my mother say, practice, Harry, practice!
Harry Kim, Star Trek Voyager

Wonderfully, I think you can get a lot of valuable experience from just a few small practice web-based projects.

A GET endpoint that responds with a generated HTTP response of several GBs, just of some fake data.
A GET endpoint that responds with a file from the filesystem of several GBs. Try with both transfer-encoding: chunked and with a specified content-length.
Proxying a file to or from S3 through a server. Try with a plain HTTP client, not just one that is AWS-aware such as Boto3.
Downloading a Postgres table of several GBs. Try with just a single query. Try responding with CSV or JSON.
Accept a large CSV upload and calculate some basic stats on the columns while it's being uploaded, e.g. min, max, mean, standard deviation.
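As a possible starting point for the last of these, here is a sketch of the stats part on its own, with no web framework involved. It uses Welford's online algorithm, which is my choice rather than the only option, to keep a running mean and standard deviation in constant memory. The names iter_lines, RunningStats and column_stats are made up, and the sketch assumes a header row, numeric columns, and no quoted fields containing newlines.

```python
import csv
import math


def iter_lines(chunks, encoding="utf-8"):
    """Reassemble text lines from byte chunks that may split mid-line."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.rstrip(b"\r").decode(encoding)
    if pending:  # a final line with no trailing newline
        yield pending.rstrip(b"\r").decode(encoding)


class RunningStats:
    """Welford's online algorithm: constant memory however many values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared differences from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def stddev(self):
        """Sample standard deviation (0.0 with fewer than two values)."""
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0


def column_stats(chunks):
    """Compute per-column stats from an iterable of CSV byte chunks."""
    rows = csv.reader(iter_lines(chunks))
    header = next(rows)
    stats = {name: RunningStats() for name in header}
    for row in rows:
        for name, value in zip(header, row):
            stats[name].add(float(value))
    return stats


# Chunk boundaries deliberately fall in the middle of rows and values
stats = column_stats([b"a,b\n1,1", b"0\n2,20\n3,", b"30\n"])
```

In a real endpoint the list of chunks would be replaced by whatever your framework gives you for reading the request body incrementally.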

Once you have done these, you will be in a much better place to weigh up the trade-offs and judge whether a streaming solution is right for any given real-world project. At the very least, you'll be in a better place to review colleagues' streaming-based code.