Parquet-WASM: Rust-based WebAssembly bindings to read and write Parquet data (github.com/kylebarron)
179 points by kylebarron 11 days ago | 14 comments





I'd like to point out that fastparquet has been built for wasm (pyodide/pyscript) for some time and works fine, producing pandas dataframes. Unfortunately, the thread/socket/async nature of fsspec means you have to get the files yourself into the "local filesystem" (meaning: the wasm sandbox). (I am the fastparquet author)

Seeing as the popular alternative here would be DuckDB-WASM, which (last time I checked) is on the order of 50MB, this is comparatively super lightweight.

i think duckdb-wasm is closer to 6MB over wire, but ~36MB once decompressed. (see net panel when loading https://shell.duckdb.org/)

the decompressed size should be okay since it's not the same as parsing and JITing 36MB of JS.


in my [albeit outdated] experience ArrowJS is quite a bit slower than using native JS types. i feel like crossing the WASM<>JS boundary is very expensive, especially for anything other than numbers/typed arrays.

what are people's experiences with this?


Arrow JS is just ArrayBuffers underneath. You do want to amortize some operations to avoid unnecessary conversions. E.g. Arrow JS stores strings as UTF-8, but native JS strings are UTF-16, I believe.
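To make the amortization point concrete, here's a rough sketch (mine, not from the thread) using the apache-arrow JS package; the ipcBytes buffer and the "city" column are made-up placeholders:

    // Sketch using apache-arrow; ipcBytes and the "city" column are made up.
    import { tableFromIPC } from "apache-arrow";

    const table = tableFromIPC(ipcBytes);
    const cities = table.getChild("city");

    // Per-row access re-decodes UTF-8 -> UTF-16 on every get():
    //   for (let i = 0; i < table.numRows; i++) { use(cities.get(i)); }

    // Amortize by materializing native JS strings once, then work on those:
    const jsStrings = cities.toArray();
    for (const s of jsStrings) {
      // s is a plain (UTF-16) JS string from here on
    }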

Arrow is especially powerful across the WASM <--> JS boundary! In fact, I wrote a library to interpret Arrow from Wasm memory into JS without any copies [0]. (Motivating blog post [1])

[0]: https://github.com/kylebarron/arrow-js-ffi

[1]: https://observablehq.com/@kylebarron/zero-copy-apache-arrow-...
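For flavor, here's roughly what that looks like (a sketch based on my reading of arrow-js-ffi, not a verified snippet; wasmModule, fieldPtr, and arrayPtr stand in for whatever your wasm library exposes via the Arrow C Data Interface):

    // Sketch: fieldPtr/arrayPtr are assumed Arrow C Data Interface pointers
    // handed out by the wasm module; wasmModule.memory is its linear memory.
    import { parseField, parseVector } from "arrow-js-ffi";

    const memory = wasmModule.memory;

    // Parse the field (name + data type), then the data itself, reusing the
    // buffers already sitting in wasm memory instead of copying them into JS.
    const field = parseField(memory.buffer, fieldPtr);
    const vector = parseVector(memory.buffer, arrayPtr, field.type);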


Yeah, we built it to essentially stream columnar record batches from server GPUs to browser GPUs with minimal touching of any of the array buffers. It was very happy-path for that kind of fast bulk columnar processing, and we donated it to the community to grow to use cases beyond that. So it sounds like the client code may have been doing more than that.

For high-performance code, I'd have expected overhead in %s, not Xs, and I'm not surprised to hear about slowdowns for anything straying beyond that -- cool to see folks have expanded further! More recently, we've been having good experiences here with Perspective <-arrow-> Loaders, enough so that we haven't had to dig deeper. Our current code targets < 24 FPS, since genAI data analytics is more about bigger volumes than velocity, so I can't say for sure. It's hard to imagine going much faster, though, given it's bulk typed arrays without copying, especially in real code.


One of the ArrowJS committers here. We have fixed a few significant performance bottlenecks over the last few versions, so try again. Also, I'm always curious to see specific use cases that are slow so we can make ArrowJS even better. Some limitations are fundamental and you may be better off converting to the corresponding JS types (which should be fast).

it's been about 4 years, but in Grafana at the time we were using something like ArrowJS + Arrow Flight + protobuf.js to render datasets into dashboards on Canvas, especially for streaming at ~20hz.

when i benchmarked the fastest lib to simply run the protobuf decode (https://github.com/mapbox/pbf), it was 5x slower than native JSON parsing in browsers for dataframe-like structures (e.g. a few dozen 2k-long arrays of floats). this was before even hitting any ArrowJS iterators, etc.

Grafana's Go backend uses Arrow dataframes internally, so using the same on the frontend seemed like a logical initial choice back then, but the performance simply didn't pan out.


I'll let Kyle chime in, but I tested it a few months ago with millions of polygons on an M2 laptop with 16GB of RAM and it worked very well.

There is a library by the same author called lonboard that provides the JS bits inside JupyterLab. https://github.com/developmentseed/lonboard

<speculation>I think it is based on the Kepler.gl / Deck.gl data loaders that go straight to GPU from network.</speculation>


@dang we have a mass spam incursion in this comment thread.

It's site-wide.

Can this read and write Parquet files to S3-compatible storage?

It can read from HTTP URLs, but you'd need to manage signing the URLs yourself. On the writing side, it currently writes to an ArrayBuffer, which you could then upload to a server or save on the user's machine.
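Roughly, with pre-signed S3 URLs it could look like this (a sketch only; I'm assuming the newer parquet-wasm API where readParquet returns a wasm Table convertible to/from Arrow IPC, and the URL variables are made up):

    // Sketch: presignedGetUrl/presignedPutUrl are assumed pre-signed S3 URLs.
    import { readParquet, writeParquet, Table } from "parquet-wasm";
    import { tableFromIPC, tableToIPC } from "apache-arrow";

    // Read: fetch the object, then parse the Parquet bytes in the browser.
    const resp = await fetch(presignedGetUrl);
    const parquetBytes = new Uint8Array(await resp.arrayBuffer());
    const arrowTable = tableFromIPC(readParquet(parquetBytes).intoIPCStream());

    // Write: serialize back to Parquet, then upload the buffer yourself.
    const ipc = tableToIPC(arrowTable, "stream");
    const parquetOut = writeParquet(Table.fromIPCStream(ipc));
    await fetch(presignedPutUrl, { method: "PUT", body: parquetOut });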



