Building an open data pipeline in 2024 (twingdata.com)
103 points by dangoldin 11 days ago | 32 comments





> And if you’re dealing with truly massive datasets you can take advantage of GPUs for your data jobs.

I don't think scale is the key deciding factor for whether GPUs are applicable for a given dataset.

I don't think this is a particularly insightful article. Read the first paragraph of the "Cost" section.


> I don't think this is a particularly insightful article.

Data engineering can be lonely. I like seeing the approach that others are taking, and this article gives me a good idea of the implementation stack.


After reading this I suggest having a look at Apache Beam if you are not using it already. I have the feeling that you could achieve the same thing with far fewer pieces in the stack.

Also, you could later decide to run it on another "runner" without rewriting your pipeline.

Additionally, you can genuinely reuse your Apache Beam logic for both streaming and batch jobs. Other tools can perhaps do that too, but from some experiments I ran a while ago it's not as straightforward.

And finally, if one or more of your processing steps need access to GPUs you can request that (granted that your runner supports that: https://beam.apache.org/documentation/runtime/resource-hints... ).
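
For anyone who hasn't seen Beam, here's a minimal batch sketch (paths and column positions are hypothetical). The same transforms work against streaming sources, the runner is just a pipeline option, and per-step GPU needs can be declared via the resource hints linked above where the runner supports them.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # swap "DirectRunner" for DataflowRunner / FlinkRunner / SparkRunner without touching the logic
    opts = PipelineOptions(runner="DirectRunner")

    with beam.Pipeline(options=opts) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/events-*.csv")   # hypothetical input
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeyByUser" >> beam.Map(lambda cols: (cols[0], 1))
            | "CountPerUser" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda user, n: f"{user},{n}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/user-counts")    # hypothetical output
        )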


Same here. I just wish OP had pointed to an example repo with a minimum working example.

Yes, I also believe both the dataset and the transformation algorithms have to lend themselves well to parallelization for GPUs to be useful. GPUs don't do magic; they are just really good at parallel computing.

That's right, and that means most transforms in big data. The fact that the dataset can be distributed at all typically implies that the task is parallel.

> The fact that the dataset can be distributed at all typically implies that the task is parallel.

It depends. Big data tasks are often IO bound rather than CPU bound, so you won't see any speedup from throwing GPUs at the problem, but you will see a speedup from adding worker nodes that bring their own network bandwidth and memory. Nor are all CPU-bound problems appropriate for GPUs. I suspect the article author is thinking of ML pipelines, where at large scale GPUs are definitely necessary but at smaller scale you can get away with ordinary CPUs.


If they aren't network/disk bound, then GPUs have significantly higher memory bandwidth (per $). If you're so network bound that your nodes must be <8 cores, then sure. Otherwise, adding a cheap GPU to your nodes would likely be cheaper.

> If you're so network bound that your nodes must be <8 cores, then sure. Otherwise, adding a cheap GPU to your nodes would likely be cheaper.

Suppose you're so network bound that your nodes stop being compute bound at 64 cores, well above 8. Then any GPU is useless, because you're still network bound and the CPU and GPU both have more memory bandwidth than the network. A cheap GPU is even worse: the CPU may have >100 PCIe lanes to use for network I/O, but a PCIe GPU would only have 16, and it would take 16 PCIe lanes the system could otherwise have used for more network I/O.


Let's assume a cloud datacenter context, using GCP as an example for real, verifiable numbers.

An L4 GPU node (with 4 CPU cores) costs approximately the same as a 16-core VM. So for the price of 64 cores, you could have ~4 L4 nodes (or perhaps one VM with 4 L4s). Each GPU has 300 GB/s of memory bandwidth, which is approximately 60 cores' worth (at ~5 GB/s per core). In fact, the 64 "cores" are typically only 32 physical cores, so you would expect lower than 300 GB/s from a 64-core VM.

Let's say network I/O is between 10 Gbit/s and 800 Gbit/s. PCIe 4.0 x16 is 64 GB/s, i.e. 512 Gbit/s, and PCIe 5.0 x16 is double that, which would easily saturate the network link. Also, L4 GPUs support RDMA, bypassing the CPU when ingesting data. I also highly doubt that a 64-(v)core VM can access ">100 PCIe lanes" unless you have some kind of sole-tenancy setup.

Then what if, instead of 4 L4 GPUs (which would cost about the same as the 64-core node), you used 2 L4 GPUs and found that setup to process just as fast as your 64-core node? You'd have just halved your cost.

Of course, this depends on whether or not the GPU nodes actually process your data with the same SLA, but generally GPU network ingest > CPU network ingest (per $ of processing).
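
To make the back-of-envelope explicit (a rough sketch using the figures above; the prices and bandwidths are illustrative, not quotes):

    # rough figures from the comment above; treat as illustrative only
    cores = 64
    bw_per_core = 5                 # ~GB/s of memory bandwidth per core
    cpu_bw = cores * bw_per_core    # ~320 GB/s upper bound for the 64-core VM

    l4_bw = 300                     # GB/s per L4
    l4_per_same_spend = 4           # ~4 L4 nodes for the price of the 64-core VM
    gpu_bw = l4_per_same_spend * l4_bw   # ~1200 GB/s for the same spend

    print(gpu_bw / cpu_bw)          # ~3.75x memory bandwidth per dollar
    # and if 2 L4s keep pace with the 64-core VM on your workload, you've halved the cost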

Here's some benchmark results from Voltron Data (I'm not affiliated with them): https://voltrondata.com/benchmarks/theseus

Here's more data from PayPal: https://medium.com/paypal-tech/leveraging-spark-3-and-nvidia...


> Let's assume a cloud datacenter context

That would be a bad fit for this kind of workload because they're not optimized for I/O.

> Each GPU has 300 GB/s of memory bandwidth, which is approximately 60 cores worth (at ~5GB/s per core).

CPU cores don't have a specific amount of bandwidth, sockets do. For example, Epyc Genoa with 12 channels of DDR5-4800 has ~460GB/s per socket. This is the same regardless of whether the physical CPU in the socket has 32 cores or 128. But also, the memory bandwidth is still irrelevant, because you're I/O bound. If the network is 800Gbits/s, i.e. 100GB/s, it doesn't matter if 64 CPU cores have 200GB/s or 900GB/s or if the GPU has more or less because none of these are the bottleneck when they're all faster than the network.

> I also highly doubt that a 64-(v)core VM can access ">100 PCIe lanes" unless you have some kind of sole tenancy setup.

Epyc systems have 128 PCIe lanes, dual socket Epyc systems can have slightly more, common Xeon systems slightly less. A handful of these are used for system devices, but there would be ~100 remaining. That is assuming that you would have the entire machine, but that's what you'd want here. Otherwise you'd need more partial machines to get more I/O.

For network-bound workloads you would then fill the machine's I/O with network interfaces. For storage-bound workloads you would do the same thing with SSDs. 100 PCIe 5.0 lanes thereby get you up to 6400Gbits/sec, i.e. 800GB/s. Even a quarter of that would not allow a PCIe 5.0x16 GPU to saturate the network/storage interface, you would need four GPUs to get to half and more than four GPUs would have you out of PCIe lanes.
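
Spelling out the lane math (a sketch; per-lane rates here count both directions combined, matching the 64 GB/s x16 figure upthread):

    lanes = 100                      # PCIe 5.0 lanes left for network/storage
    gbit_per_lane = 64               # ~64 Gbit/s per PCIe 5.0 lane, both directions combined

    total_gbit = lanes * gbit_per_lane   # 6400 Gbit/s
    total_gbyte = total_gbit / 8         # 800 GB/s of aggregate I/O
    gpu_x16_gbit = 16 * gbit_per_lane    # ~1024 Gbit/s through a single PCIe 5.0 x16 GPU

    print(total_gbit, total_gbyte, gpu_x16_gbit)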

That kind of configuration would cost more per instance than getting some Google Cloud VMs, but those VMs max out at 200Gbits/s, and then how many more of them do you need?

Probably the interesting use case for cloud VMs when you're I/O bound is the one you excluded initially: being so thoroughly I/O bound that you hardly need any compute at all. Then you can get cheap VMs with low core counts and a medium amount of I/O and just use a ton of them, spread across separate physical machines, to get high aggregate I/O bandwidth.

The situation you keep trying to find is the one where you're not actually I/O bound. If you're compute or memory bound on the types of computation GPUs are suited to then obviously GPUs will be cheaper.


Sure, everything I said likely goes out the window if we're talking on-prem. No idea how I/O works in that setting, and in fact I am completely unfamiliar with workloads which saturate 800 GB/s network, especially since that approaches the limit of main memory bandwidth. Really appreciate your response! And you're definitely right in noting that the situation I'm talking about is when we're not thoroughly I/O bound.

Author here. There's nuance here, but as a rule of thumb data size is a decent enough proxy. The audience here isn't everyone; the goal was to give less experienced data engineers and folks a sense of modern data tools and a possible approach.

But what did you mean by "Read the first paragraph of the `Cost` section"?


>Author here and there's nuance here but as a rule of thumb data size is a decent enough proxy.

It isn't though.

What matters is the memory footprint of the algorithm during execution.

If you're doing transformations that take constant time and memory per item regardless of data size, sure, go for a GPU. If the memory needed grows with the data, you can't fit more than 24GB on a desktop card, and prices go to the moon quickly after that.

Junior devs doing the equivalent of an outer product on data is the number one reason I've seen data pipelines explode in production.
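
A toy illustration of the difference (hypothetical sizes):

    # constant work per item streams happily; an outer-product-style step is O(n^2) in space
    n = 2_000_000                    # rows
    item_bytes = 8

    streaming = n * item_bytes       # ~16 MB: fits on anything
    pairwise = n * n * item_bytes    # ~32 TB: no 24GB card (or server) survives this

    print(f"{streaming/1e6:.0f} MB vs {pairwise/1e12:.0f} TB")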


Yes, but most data-heavy tasks are parallelizable. SQL itself is naturally parallelizable. There's a reason RAPIDS, Voltron Data, Kinetica, SQream, etc. exist.

Full transparency: I don't have a huge amount of experience working at this massive scale, and to your point, you need to understand the problem and constraints before you propose a solution.


There are more asterisks attached to each assertion you're making than you can shake a stick at.

There is always a 'simple' transformation the business requires that turns out to need n^2 space and kills the server it's running on, because people believe everything you said above.

Or in other words: most of the time you don't need a seat belt in a car either.


You have to revisit the assertion that SQL is naturally parallelizable. As a guide, have a look at the semantics around Spark shuffles.

In fact, the opposite is true. While small datasets can be handled on GPU (although there are no good GPU databases comparable in performance to ClickHouse), large datasets don't fit, and unless there is a large amount of computation per byte of data, moving data around will eat the performance.

Sort of

Databricks-style architectures spend most of their time slowly moving data from A to B and doing little work, and it's worse when that describes the distributed compute part too. I read a paper a while back finding that was often half the time.

GPU architectures end up being explicitly about scaling the bandwidth. One of my favorite projects to do with teams tackling something like using GPU RAPIDS to scale event/log processing is to get GPU Direct Storage going. Imagine an SSD array reading back at 100GB/s, feeding 1-2 GPUs at the same rate (PCIe cards), then TB/s cross-GPU bandwidth for loaded data and a mind-blowing number of FLOPS. Modern GPU nodes get you 0.5TB+ of GPU RAM. So a single GPU node, when you do the IO bandwidth right for that kind of fat streaming, is insane.

So yeah, taking a typical Spark cloud cluster and putting GPUs in vs the above is the difference between drinking from a children's twirly straw vs a firehose.
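
For anyone curious what that looks like in practice, here's a minimal RAPIDS cuDF sketch (paths and column names are hypothetical; assumes RAPIDS is installed and, for the full firehose, GPU Direct Storage configured on the host):

    import cudf

    # read a columnar log file straight into GPU memory
    events = cudf.read_parquet("/mnt/nvme/events/2024-05-01.parquet")   # hypothetical path

    # typical event/log rollup, entirely on the GPU
    errors_per_user = (
        events[events["status"] == "error"]                 # hypothetical columns
        .groupby("user_id")
        .agg({"latency_ms": "mean", "event_id": "count"})
    )
    print(errors_per_user.sort_values("event_id", ascending=False).head(20))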


An A100 only has 40GB of GPU RAM, so inter-node memory movement can be a bandwidth issue.

I don't follow. Maybe the point is a lot of people are not balancing their systems, so by sticking with wimpy-era IO architectures, they're not feeding their GPUs?

I think about balancing nodes differently when designing older Spark CPU clusters vs modern GPU systems. (New spark clusters changed again to look more GPU/vertical, another story.)

In the Databricks wonder years, horizontal scale made sense. Lots of cheap, wimpy nodes with minimal compute per node were cost-effective for a lot of problems. It was faster because the comparison point was older Hadoop jobs that didn't run in-memory. But every byte moves far, and each node does very little... slow, energy costs, etc. That made sense when vertically scaled components were more expensive for the same power budget, which used to be true before multicore & GPU chips got a lot cheaper, and the same for memory & IO (and software caught up too).

As soon as you jump to GPU nodes, you're back to vertical-scaling thinking. Instead of chaining a lot of single-GPU A100 boxes and waiting on internode IO, you go multi-GPU (intra-node) and bring data closer/wider. One PCIe card in a consumer device might be, say, 8-30GB/s, and much faster if you go server grade. Similar multiples for IO, like 4-15 SSDs at 2GB/s each, or whatever network you can get (GDS, ...), or getting more CPU RAM (a TB is a lot cheaper now!) to feed the local GPUs.

It takes a lot to saturate a single GPU node that looks like that. Foundation model teams like OpenAI's & Facebook's core ones doing massive training runs will use hundreds/thousands of GPUs and need those nodes. But people doing fine-tuning, serving inference, and 400GB/s ETL... won't. Replace your roomful of Spark racks with a GPU rack or two. E.g., we have a customer who had a big graph database spread over many CPU nodes, but nowadays we can fit their biggest graph in one GPU's memory. They have more, smaller graphs, so we can add a second GPU on the same server and keep all of their graphs in CPU RAM. So a 2-GPU node with a bunch of CPU RAM can replace a rack from the CPU-era vendor: not even a rack, just a single node. Nvidia's success stories on cutting down Pixar render farms worked similarly at way more impressive scales.

And for folks who haven't been following... Nvidia RAM increases have been impressive. An H100 doubles the A100's RAM from 40GB to 80GB, and the H200s OpenAI started using have 141GB. For a lot of workloads we see bursty use vs always-on, so we actually often price out based on $ per GB of GPU RAM: <3 T4 GPUs, despite their age!
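
If it helps make the vertical-scaling point concrete, here's a small dask-cuda sketch of the "all the GPUs in one box" pattern (path and column names are hypothetical):

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask_cudf

    cluster = LocalCUDACluster()      # one worker per GPU in this node
    client = Client(cluster)

    df = dask_cudf.read_parquet("/mnt/nvme/logs/")             # hypothetical path
    daily_bytes = df.groupby("day")["bytes"].sum().compute()   # hypothetical columns
    print(daily_bytes)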


I take issue with this part of the article:

> In general, managed tools will give you stronger governance and access controls compared to open source solutions. For businesses dealing with sensitive data that requires a robust security model, commercial solutions may be worth investing in, as they can provide an added layer of reassurance and a stronger audit trail.

There are definitely open source solutions capable of managing vast amounts of data securely. The storage group at CERN develops EOS (a distributed filesystem based on the XRootD framework), and CERNBox, which puts a nice web interface on top. See https://github.com/xrootd/xrootd and https://github.com/cern-eos/eos for more information. See also https://techweekstorage.web.cern.ch, a recent event we had along with CS3 at CERN.


Not only that, open source and proprietary software both generally handle the common case well, because otherwise nobody would use it.

It's when you start doing something outside the norm that you notice a difference. Neither of them will be perfect when you're the first person trying to do something with the software, but for proprietary software that's game over, because you can't fix it yourself.


Your options are to use something off the shelf and end up with a brittle and janky setup, or to use open source and end up with a brittle and janky setup that is more customized to your workflows... It's a tradeoff though, and all the hosting and security work of open source can be a huge time sink.

You don't actually have to do any of that work if you don't want to. Half the open source software companies have that as their business model -- you can take the code and do it yourself or you can buy a support contract and they do it for you. But then you can make your own modifications even if you're paying someone to handle the rest of it.

> Cloudflare R2 (better than AWS S3). The article links to [1].

Is R2 really better than S3?

[1] https://dansdatathoughts.substack.com/p/from-s3-to-r2-an-eco...


From an egress + storage cost standpoint, absolutely, which ends up being a big factor for these large-scale data systems.

There’s a prior discussion on HN about that post: https://news.ycombinator.com/item?id=38118577

And full disclosure: I'm the author of both posts; I've just shifted my writing to focus more on the company one.
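
To put rough numbers on the egress point (ballpark list prices, illustrative only; check the current pricing pages):

    stored_gb = 50_000    # 50 TB at rest for a month
    egress_gb = 10_000    # 10 TB read out to another cloud/region

    s3 = 0.023 * stored_gb + 0.09 * egress_gb    # ~$0.023/GB-mo storage + ~$0.09/GB egress
    r2 = 0.015 * stored_gb + 0.00 * egress_gb    # ~$0.015/GB-mo storage, no egress fee

    print(f"S3 ~${s3:,.0f}/mo vs R2 ~${r2:,.0f}/mo")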


By what metric? They're worth consideration and getting better: https://blog.cloudflare.com/r2-events-gcs-migration-infreque...

Can someone explain this "semantic layer" business (cube.dev)? Is it just a signal registry that helps you keep track of and query your ETL pipeline outputs?

Author here. The basic idea is that you want some way of defining metrics, something like "revenue = sum(sales) - sum(discount)" or "retention = whatever", which get generated via SQL at query time rather than baked into a table. Then you can have higher confidence that multiple access paths all use the same definitions for the metrics.
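
Not cube.dev's actual API, just a toy sketch of the idea: define the expression once, expand it into SQL at query time, and every dashboard/notebook gets the same math (all names here are made up):

    METRICS = {
        "revenue": "SUM(sales) - SUM(discount)",
        "orders":  "COUNT(DISTINCT order_id)",
    }

    def metric_query(metric: str, table: str, group_by: str) -> str:
        # expand the shared definition into SQL at query time
        return (
            f"SELECT {group_by}, {METRICS[metric]} AS {metric} "
            f"FROM {table} GROUP BY {group_by}"
        )

    print(metric_query("revenue", "fct_orders", "order_month"))
    # SELECT order_month, SUM(sales) - SUM(discount) AS revenue FROM fct_orders GROUP BY order_month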

Why do I need SQLMesh if I use dbt/Snowflake?

You don't need it. dbt and SQLMesh are competing tools. I just like SQLMesh's model over dbt's, but dbt is much more dominant.


