Suppose you have a dataset big enough that it needs to be spread across a storage cluster. Now you'd like to run some kind of operation that's either embarrassingly parallel or fits the map/reduce model. Maybe you have a petabyte's worth of video files and you want to generate thumbnails, or extract metadata, or find all frames of video with text in them.
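A minimal sketch of the map/reduce shape this describes, with a trivial stand-in task (counting files per container format; the function and file names are hypothetical, and real per-file work like thumbnailing would replace the map step):

```python
from collections import defaultdict

def map_phase(filename):
    # Stand-in for per-file work (thumbnailing, metadata extraction).
    # Here we just emit (extension, 1) pairs.
    ext = filename.rsplit(".", 1)[-1]
    return [(ext, 1)]

def reduce_phase(pairs):
    # Sum counts per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

files = ["a.mp4", "b.mkv", "c.mp4"]
mapped = [kv for f in files for kv in map_phase(f)]
print(reduce_phase(mapped))  # → {'mp4': 2, 'mkv': 1}
```

Because each map call touches only its own file, the map phase can run wherever each file already lives; only the small intermediate pairs need to cross the network for the reduce.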

If compute is separated from storage, then all of that video data has to be streamed over the network from a storage node to a compute node before computation can even begin; the data is "shipped" to compute.
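A back-of-envelope calculation shows why shipping the data hurts; the link speed here is an illustrative assumption, not a figure from the comment:

```python
# Time to "ship" 1 PB from storage to compute over a single 10 Gb/s link.
data_bytes = 1e15             # 1 petabyte
link_bytes_per_s = 10e9 / 8   # 10 Gb/s expressed in bytes per second
seconds = data_bytes / link_bytes_per_s
print(round(seconds / 86400, 1), "days")  # → 9.3 days
```

Even with many links in parallel, that bandwidth is pure overhead before any useful computation starts.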

Presumably the function you want to execute is vastly smaller than the data. It would require much less time and bandwidth to run the function on the same node as the data it's accessing; no network overhead. Assuming you have an adequate balance between compute and storage, you get much lower latency access to the data.

Some downsides include:

- running arbitrary code on your storage node means trusting your users or having very good sandboxing

- you now have to balance compute and storage on any given node
