Why Fugaku, Japan's fastest supercomputer, went virtual on AWS (amazon.com)
89 points by panrobo 13 days ago | 42 comments





There is big interest in what Fugaku and Dr. Matsuoka are doing here, and it seems this article misses it entirely.

HPC development is not your standard dev workflow where your software can be easily developed and tested locally on your laptop.

Most software requires a proper MPI environment with a parallel file system and (often) a beefy GPU.

Most development on a supercomputer is done on a debug partition: a small subset of the supercomputer reserved for interactive usage. That allows you to test the scalability of the program, hunt Heisenbugs related to concurrency, access large datasets, etc.

But debug partitions are problematic: make them too small and your scientists & devs lose productivity; make them too large and you are wasting supercomputer resources on something other than production jobs.

The Cloud solves this issue. You can spawn your cluster, debug and test your jobs, then destroy your cluster. You do not need very large scale or extreme performance; you need flexibility, isolation and interactivity. The Cloud gives you that thanks to virtualization.


> Most software requires a proper MPI environment with a parallel file system and (often) a beefy GPU

I’m but a tourist in this domain, but can you dig into this a bit more and compare/contrast w “traditional” development? I presume the MPI you’re talking about is OpenMPI or MPICH, which need to be dealt with directly - but what are the considerations/requirements for a parallel FS? Hardware is hardware, and I guess what you’re saying re: GPUs is that you can’t fake The Real Thing (fair enough), but what other interesting “quirks” do you run into in this HPC env vs non-HPC?


The interconnect and network topology is also a big component of the hardware where you can't "fake The Real Thing" in practice. You can often get fairly confident in program correctness for toy problem runs by scaling 1-~40 ranks on your local machine, but you can't tell much about the performance until you start running on a real distributed system where you can see how much your communication pattern stresses the cluster.
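
To make that concrete, here is a minimal sketch (my own, not from the article) of the kind of halo-exchange pattern meant above, assuming a standard MPI install such as Open MPI or MPICH. It runs fine with a handful of ranks on a laptop, which is enough to check correctness, but it tells you nothing about how the same communication pattern behaves on a real interconnect:

    /* Toy 1D halo exchange -- illustrative sketch only.
       Build/run e.g.: mpicc halo.c -o halo && mpirun -n 4 ./halo */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024  /* cells owned by each rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double u[N + 2];                 /* local cells plus two ghost cells */
        for (int i = 1; i <= N; i++) u[i] = rank;

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Exchange ghost cells with both neighbours every "time step". */
        for (int step = 0; step < 100; step++) {
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                         &u[0], 1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  1,
                         &u[N + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... stencil update on u[1..N] would go here ... */
        }

        if (rank == 0) printf("done on %d ranks\n", size);
        MPI_Finalize();
        return 0;
    }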

Or if you run into bugs / crashes that need 1000s of processes or a full-scale problem instance to reproduce, god help you and your SLURM queue times.


Depends on the software being used, but it's probably Open MPI (or a variant). However, OpenMP is also used, especially in a hybrid mode where OpenMP handles shared-memory parallelism within a node and MPI handles inter-node parallelism.
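
A rough illustration of that hybrid model (my own sketch, assuming an MPI compiler wrapper with OpenMP support; not anyone's production code): typically one MPI rank per node, OpenMP threads inside it.

    /* Hybrid MPI + OpenMP sketch. Build e.g.: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided;
        /* Ask for thread support since OpenMP threads live alongside MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-memory parallelism inside the node ... */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);

        /* ... and message passing between nodes. */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f (%d threads per rank)\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }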

The parallel FS stuff is mainly to handle a large number of nodes streaming data in and out. E.g. a few thousand nodes all reading from data sets or saving checkpoint data.

One big difference you'll see in HPC is large-scale, fine-grained parallelism. E.g. a lot of nodes simulating some process where you need to resync all the nodes and exchange data between them at each time step. Also checkpointing: since simulations may take weeks to run, most apps support saving application state to disk periodically so that if something crashes, you'll only lose a few hours or a day of computation. The checkpointing also causes a bunch of FS I/O, since you need to save the application state from all the nodes to storage periodically, so you'll see really high I/O spikes when that happens.
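
A hedged sketch of that checkpointing pattern, assuming MPI-IO is available (it ships with Open MPI/MPICH); real applications usually layer HDF5 or netCDF on top, but the shape of the I/O spike is the same: every rank writes its slice of the state to the parallel FS at the same moment.

    /* Periodic checkpointing with collective MPI-IO -- illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 4096   /* doubles of state owned by each rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double state[N_LOCAL];
        for (int i = 0; i < N_LOCAL; i++) state[i] = rank + i;

        for (int step = 1; step <= 1000; step++) {
            /* ... advance the simulation here ... */

            if (step % 100 == 0) {            /* checkpoint every 100 steps */
                char name[64];
                snprintf(name, sizeof name, "ckpt_%06d.bin", step);

                MPI_File fh;
                MPI_File_open(MPI_COMM_WORLD, name,
                              MPI_MODE_CREATE | MPI_MODE_WRONLY,
                              MPI_INFO_NULL, &fh);
                /* Each rank writes at its own offset; the write is collective,
                   so every rank hits the file system at the same time. */
                MPI_Offset off = (MPI_Offset)rank * N_LOCAL * sizeof(double);
                MPI_File_write_at_all(fh, off, state, N_LOCAL, MPI_DOUBLE,
                                      MPI_STATUS_IGNORE);
                MPI_File_close(&fh);
            }
        }

        if (rank == 0) printf("checkpoints written\n");
        MPI_Finalize();
        return 0;
    }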


Lots of legacy HPC code assumes POSIX file I/O, which means a parallel file system, which means getting the interconnect topology right, which is not easy.

Arguably the single most important feature of supercomputers is guaranteed low-latency interconnect between any two nodes, which ensures high performance even for very talkative workloads. This is why supercomputers document their network topology so thoroughly and allow people to reserve nodes in a single rack, for example.
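
The standard way to see that latency is a ping-pong microbenchmark between two ranks; a minimal sketch (mine, not from the article, assuming a stock MPI install) is below. Run it with the two ranks placed in the same rack and then in distant racks and compare the numbers.

    /* Two-rank ping-pong latency sketch. Run: mpirun -n 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        const int iters = 10000;
        char byte = 0;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency ~ %.2f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }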

The title is a little bit misleading; Fugaku still exists physically. This project is about being able to replicate the software environment used on Fugaku on AWS.

Is this running on Graviton, or Fujitsu's custom ARM?

Graviton3

The goal seems to be: "Virtual Fugaku provides a way to run software developed for the supercomputer, unchanged, on the AWS Cloud". Is AWS running A64FXs? Is there more info on all the details of this somewhere? I think it is a compelling goal to have portability across compute and am curious how they are going about it and what tradeoffs they've made. Software portability is one level of hard and performance portability is another. I wish we could get an OCI spec for running a job that could fill this purpose.

This article has some more details:

https://www.hpcwire.com/2023/01/26/riken-deploys-virtual-fug...

It looks like they want to make the transition between Fugaku and AWS and vice versa easier.


If anyone wants to dabble with (HPC) compute clusters, ElastiCluster is a handy tool to spin up nodes using various cloud APIs:

* https://elasticluster.github.io/elasticluster/


I can't even imagine the hourly bill. This seems like a great way for Amazon to gobble up institutional research budgets.

> I can't even imagine the hourly bill

Yes, and no. As a replacement for a reasonably well-used cluster, it's going to be super expensive, something like 3-5x the price.

For providing burst capacity, it's significantly cheaper/faster, and can be billed directly.

But would I recommend running your stuff directly on AWS for compute-heavy batch processing? No. If you are thinking about hiring machines in a datacenter, then yes.


I currently work in a research group that does all of its bioinformatics analysis in AWS. The bill is expensive, sure (6 figures regularly, maybe 7 over its lifetime), but provisioning our own cluster (let alone a supercomputer) would've been way more costly and time-consuming. You'd also need people to maintain it, etc. And at the end of the day, it could well be the case that you still end up using AWS for something (hosting, storage, w/e).

I think it's a W, so far.


For a startup I ran, we too built an HPC cluster on AWS. It was very much a big win, as sourcing and installing that amount of kit would have been far too much upfront.

I really should have mentioned your (our) use case: a smallish department without dedicated access to a cluster or people who can run a real-steel cluster.

AWS's one-click Lustre cluster service (FSx for Lustre) combined with Batch is a really quick and powerful basic HPC cluster.

Before we got bought out, we had planned to create a "base load" of real-steel machines to process the continuous jobs, and spin up the rest of the cluster as and when demand required.


Because we essentially did spend exactly that to build a cluster: what amount of resources are you using (core-hours/instances/storage)?

There are tradeoffs in both cases.

Some issues with building a cluster are that you're locked into the tech you bought, need space and expertise to manage it, and bear the cost to run/maintain it.

You can potentially recoup some of that investment by selling it later, usually a quickly depreciating asset though.

But, no AWS or AWS premium.


There's also a third way: going with intermediate-size cloud providers. You avoid the CapEx and the actual hardware to deal with, without the AWS premium. I don't understand why so many people act as if the only alternatives were AWS or self-hosted.

Never tried to suggest that, but tbh, the value-added services on both AWS and GCP are hard to emulate, and they "just work" already.

Sure, I could spend a few weeks (months?) compiling and trying some random GNU-certified "alternatives" for Cloud Services but ... just nah ...


The main value is that you can pay money to have decent network and storage with the hyperscalers, which none of the intermediate-size cloud providers offer, to my knowledge.

> I could spend a few weeks (months?) compiling

why would you compile infra-stuff?! Usually this is nicely packaged...

> random GNU-certified "alternatives" for Cloud Services but

For application software you have to take care of this with the cloud as well, or are you just using the one precompiled environment they offer? Which cloud services do you need anyway? In our area (MatSci), things are very, very POSIX-centered and that is simple to set up.


>Usually this is nicely packaged...

That has almost never been my experience with "alternatives" but if you can provide a few links I would like to learn about them.


Well, I don't know what you use. However:

- ZFS is in Ubuntu now; you can host a nice NFS server easily. MinIO is a single Go binary, and databases are packaged too.

- Slurm is packaged too! I have to admit this gets hairy if you want to have something like https://github.com/NVIDIA/pyxis, but still, this is far from arcane "alternative" software; it's the standard. If you buy Nvidia, it comes with your DGX server just like it's rented out by Amazon...

The main remaining pain point I would see is actually netbooting/autoprovisioning the machines (at least this was annoying for us).


What kind of “services” do they provide that add value in the kind of HPC setting you were talking about?

> trying some random GNU-certified "alternatives" for Cloud Services but ... just nah ...

A significant fraction, if not the majority, of those services are in fact hosted versions of open-source projects[1], so there's no need to be snobbish.

[1]: and that's why things like Terraform and Redis have gone source-available recently, to fight against cloud vendors' parasitic behavior


It sounds like you're spending 7 figures of your research money without having done the most basic investigation of alternatives.

I spend low 6 figures of research money on hardware + staff each year, and this avoids us spending 7 figures on cloud costs + staff.


I guess having built the physical system I am aware of the tradeoffs... Though due to the funding structure and 0-cost colocation (for our unit), there was not a lot to be discussed and thus I'd be interested in actual numbers for comparison!

> For providing burst capacity, its significantly cheaper/faster, and can be billed directly.

Cloud is cheaper for such a workload, yes. But you still wouldn't want to pay the AWS premium for that.

But I guess nobody ever got fired for choosing AWS.


Depends.

If you want high-speed storage that's reliable and zero cost to set up (i.e. doesn't require staff), then AWS and spot instances are very much a cheap option.

If you have in-house expertise and can knock up a reliable high-speed storage cluster (no, porting to S3 is not the right option; that takes years and means you can't edit files inline), then running on one of the other cloud providers is an option.

But they often can't cope with the scale, or don't provide the right support.


> and zero cost to set up (i.e. doesn't require staff)

Ah, the fallacy of the cloud not requiring staff to magically work. Funnily enough, every time I heard this IRL, it was coming from dudes who were in fact paid to manage AWS instances and services for their company as their day job, but who somehow never included their own salary or those of their colleagues.

> But they often can't cope with the scale

Most people vastly overestimate the scale they need or underestimate the scale of even second-tier cloud providers; there aren't many workloads that would cause storage trouble anywhere. For the record, a single rack can host 5-10 PB; how many people need multiple exabytes of storage again?

> or don't provide the right support.

I've never been a whole-data-centers-scale customer but my experience with AWS support didn't leave me in awe.


> Ah, the fallacy of the cloud not requiring staff to magically work.

It's not a fallacy, it's a tradeoff. If you want to have 100 nodes doing batch work in a datacentre, you need to buy, install, commission and monitor hardware, storage and networking.

Now, that is possible, but hard to do on your own. So realistically you'd get a managed service to install that stuff in a data centre. As your experience will show you, the level of service you get is "patchy". You really need someone who has worked on real steel to specify that kind of work. Those people are incredibly rare, especially if you want someone who can talk about the rest of the stack as well. (That is one massive downside of the cloud: there are hardly any new versatile sysadmins being created.)

As a former VFX sysadmin who looked after a number of large HPC clusters, it's fairly simple to get something working, but something running reliably and fast is another matter.

> Most people vastly overestimate the scale they need

Yes, but we are talking about an HPC cluster that is >100k nodes: something fast enough to cause hilarious QoS issues on most storage systems, especially single-rack arrays with undersized metadata controllers. Even on Isilon with its "scale out" clustering, one job can easily nail the entire cluster by doing too many metadata operations. (Protip: single-namespace clusters are rarely worth the effort outside of a strict set of problems. It's much better to have a bunch of smaller fileservers grouped together in an automount. Think of it like DB sharding, but with folders.)

Second tier cloud providers are rarely able to provide 10k GPUs on the spot, certainly not at a reasonable price. Most don't have the concept of spot pricing, so you're deep into contract negotiations.

> AWS support didn't leave me in awe.

It's not great unless you pay, but you will eventually get to someone who can answer your question. Azure, on the other hand, less so.


> It's not a fallacy, it's a tradeoff. If you want to have 100 nodes doing batch work in a datacentre, you need to buy, install, commission and monitor hardware, storage and networking.

The fallacy is that it costs zero staff. In reality it doesn't: it reduces the amount of staff you need, but only to an extent, because you still need plenty of people to run and monitor your cloud infra.

> Yes, but we are talking about an HPC cluster that is >100k nodes.

Those are also much higher requirements than anything AWS is used to running, so you either have the know-how in house or don't have it at all.

> Second tier cloud providers are rarely able to provide 10k GPUs on the spot,

If you need 10k GPUs on the spot, then you have a serious planning issue… I know that many companies have incompetent management, but that's not the kind of thing you're supposed to brag about in an internet discussion. And using AWS to cope with management and planning deficiencies will only get you so far; at some point you'll need to address them.


I guess it really depends on how many hours per year Fugaku is used at its max capacity. Also, in this case, it could grow progressively.

Someone needs to share utilization data on these supercomputer clusters. Most have a long queue of jobs waiting to be run, and you have to specify how long you think your job is gonna take to properly schedule it.

Fugaku has a public dashboard (Grafana):

https://status.fugaku.r-ccs.riken.jp/


From the article:

> When Fugaku was running on premises, it was “at 95 percent capacity and constantly full,” said Dr. Matsuoka.


At this scale, I would expect they have better terms than the retail pricing.

The article doesn't mention this, but I imagine having an AWS version of the supercomputer would be extremely helpful for software development and performance optimization, especially if the same code could be tested with fewer nodes.

As someone not versed in the field, can anyone explain the types of workloads that are ideal for HPC machines, as opposed to, for example, a huge number of networked GPUs?

E.g. it reportedly cost over $100 million to train GPT-4. My understanding is this was done on a huge number of high-performance GPUs, e.g. A100s, etc. So I guess my question then is: what would a dedicated "supercomputer" be used for today that couldn't be accomplished on a more traditional network?

To emphasize, none of these questions are meant to be rhetorical or leading - I honestly don't know and am curious.


That's basically what supercomputers have been for the last twenty years: large, compute-optimized data centers. The main differences in hardware are usually InfiniBand (and its RDMA capabilities) paired with a really powerful parallel file system cluster. Occasionally there'll be exotic compute accelerators or, lately, just variants of the GPUs specific to the cluster. On the software side it's usually Slurm managing code that leverages MPI, OpenMP and CUDA. Having a large homogeneous cluster with a ~10-year lifespan means that you have a lot of tuning up and down the stack, from specific MPI implementations for your InfiniBand hardware to optimization tricks in the science code tuned for the specific GPU and CPU models. All of this has gotten even less specialized since GPUs started to take off, and those trends have accelerated with ML having the same needs. ML is also prompting clouds to provide the same kinds of hardware with InfiniBand and RDMA.

Algorithms like differential equation solvers for extremely fast wave speed physical systems. Molecular dynamics, electronic structure calculations. Any algorithm that requires FFT over 1 billion cells.

Today "supercomputer" translates to "really high AWS spend".

They haven’t said a word about costs. Does it cost as much/less/more?

One challenge would be deciding on the unit to compare the costs. They could pick cost per Peta/ExaFLOPs.



